Question Answering Using LangChain

Large Language Models (LLMs) such as OpenAI’s GPT are taking the world by storm. In particular, the release of ChatGPT has catapulted both AI and (transformer-based) LLMs, which have been discussed for years (remember Attention Is All You Need?), into public discourse.

Of course, there most likely isn’t a single industry not considering the impact of these systems right now. The education sector certainly is!

I’m interested not just in the social, cultural, and economic impact, which will be extremely important, but also very much in the technological developments and tooling around LLMs. Within that field, a particularly interesting Python library is LangChain. While LangChain doesn’t bring anything “new” to the table, at least from a research perspective, it makes working with LLMs and related technologies so, so, so much easier. Among many other things, LangChain provides easy access to a number of models and APIs and, more importantly, makes combining LLMs with other sources of information extremely easy.

Recently, I built a small example to demonstrate how LangChain can be used to leverage general-purpose language models for more specific use cases involving organizational data. As the example, which is a very common one, seemed informative, I decided to provide a short writeup for others. Along the way, I will provide additional context for each step, which will hopefully allow you to understand the underlying mechanics better.

Answering Questions About Custom Documents Using LangChain and OpenAI

Large Language Models such as GPT-3 or ChatGPT are very useful on their own. For example, they are great at generating ideas! However, in many cases, they only become truly powerful when used in conjunction with other sources of information. Given contextual knowledge, their output becomes meaningful in your specific context.

Hence, in the following, we’re going to use LangChain and OpenAI’s API and models, text-davinci-003 in particular, to build a system that can answer questions about custom documents provided by us. The idea is simple: You have a repository of documents, essentially knowledge, and you want to ask an AI system questions about it. Crucially, you don’t want generic answers but answers based on these documents.

Please be aware that in this example, we will use OpenAI’s API for simplicity’s sake. Using LangChain, you can also use other, fully local, models. This might be highly relevant for your use case, especially if you want to ensure that no data, e.g., confidential documents, leaves your system.
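To give you an idea, here is a minimal sketch of swapping in a local model. This assumes your LangChain version ships the HuggingFacePipeline wrapper and that the transformers package is installed; the model name is just an example.

# Hedged sketch: a local Hugging Face model instead of the OpenAI API.
# Assumes transformers is installed; 'google/flan-t5-base' is an arbitrary example model.
from langchain.llms import HuggingFacePipeline

local_llm = HuggingFacePipeline.from_model_id(
    model_id='google/flan-t5-base',
    task='text2text-generation',
)

Such a local LLM could then be used wherever the OpenAI LLM appears below.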

Our Toy Example – A Single Article

For this example, we are going to use a single document as our knowledge base. In particular, we’re going to use a machine-translated (DeepL) version of an article on LLMs in education I recently co-authored. Of course, this is arbitrary, but the paper fits the theme of this article and certainly hasn’t been used to train the underlying model we will use.

Our goal will be to answer a simple, at least for humans, question about the article. In the paper, we have outlined two guidelines for the use of text-generating AI systems in education. Hence, we expect our system to correctly answer the question: What are the two guidelines?

Baseline – A Generic LLM Without Contextual Knowledge

Let’s start by looking at a baseline, text-davinci-003 by OpenAI, without any further contextual information or knowledge. Of course, even though this model is OpenAI’s “most capable GPT-3 model,” there is no way that it will be able to answer our question. It has no context, and it has no knowledge about the paper! One way of thinking about this is to imagine a new colleague who is really smart but has yet to learn about the organization, projects, etc.

That said, please note that we are not finetuning the model. Instead, we are providing the model with additional information, parts of our document, to answer the question.

The following code will contain all the necessary imports for the whole example. Technically, for this next step, we only need OpenAI.

You will also need to install the following Python packages: openai, chromadb, and – of course – langchain.

OPENAI_API_KEY = 'YOUR OPENAI API KEY'

from langchain import OpenAI, VectorDBQA

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.prompts import PromptTemplate

# Using text-davinci-003 and a temperature of 0
llm = OpenAI(model_name='text-davinci-003', temperature=0, openai_api_key=OPENAI_API_KEY)

answer = llm('What are the two guidelines?')
print(answer)
1. Respect the privacy of others.
2. Be mindful of the language you use.

As expected, the answer is interesting but unrelated to our paper! How could it be – the model has no knowledge of what we’re talking about!

As a side note: If you’ve never worked with GPT, you might be wondering what the temperature refers to. Put simply, with a lower temperature, GPT will choose words – at its core, GPT is a next-word prediction machine – with a higher probability. Put differently, the higher the temperature, the more “creative” GPT will become.
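If you want to see the effect yourself, you can compare a temperature of 0 with a higher value; a quick sketch, with an arbitrary prompt:

# Same model, higher temperature: outputs will vary between runs and read more "creative".
creative_llm = OpenAI(model_name='text-davinci-003', temperature=0.9, openai_api_key=OPENAI_API_KEY)
print(creative_llm('Suggest a title for an article about LLMs in education.'))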

Preparing Our Custom Data

As a first step, we will be preparing our custom document. Please assume that our article resides in the same folder as the Python script and is called article.txt. It’s simply a plain text file containing the article in English.

First, we are going to load and then split our data into chunks.

doc_loader = TextLoader('article.txt')
documents = doc_loader.load()

LangChain has a variety of so-called document loaders which help with bringing in external information. Here, we are using a very simple TextLoader, which reads a single file. That said, there are, e.g., loaders for Notion and PDFs available for you to use.
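For illustration, loading a PDF instead would only change the loader. A hedged sketch, assuming the pypdf package is installed and using a hypothetical article.pdf:

# Hedged sketch: the PDF loader returns one Document per page, including page metadata.
from langchain.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader('article.pdf')
pdf_documents = pdf_loader.load()

For our article, however, the plain TextLoader above is all we need.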

After loading, we will have a list of documents. Well, in this case, we have one document. At least for now!

[Document(page_content='Text-generating AI systems ....', lookup_str='', metadata={'source': 'article.txt'}, lookup_index=0)]

Now, we will be using a simple CharacterTextSplitter to split our documents into a list of smaller texts. By default, this is done by splitting at separators (\n\n). We are doing this to produce chunks of text that we can use later. Ultimately, as you will see, we want to find chunks that we can provide to the LLM as context. As we’re limited in how much context we can provide, we will work with smaller texts, i.e., chunks, and not (long) documents.

text_splitter = CharacterTextSplitter(chunk_overlap=0, chunk_size=1000)
texts = text_splitter.split_documents(documents)

Please note that from now on, document will refer to these texts. While we had one initial document, the article, we are now working with multiple texts/documents after chunking.
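A quick sanity check makes the effect of the splitting visible:

# Inspect the chunks: how many are there, and what do they contain?
print(len(texts))                   # number of chunks produced by the splitter
print(texts[0].page_content[:200])  # beginning of the first chunk
print(texts[0].metadata)            # metadata is carried over, e.g., {'source': 'article.txt'}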

It’s Time for Embeddings (Vectors)

Now that we have the article loaded and chunked up, it’s time to get embeddings for our new texts/documents. An embedding is a numerical representation, in this case a vector, of a text. We will be using OpenAI’s embeddings API to get them.

Furthermore, we will be using LangChain’s Chroma, a wrapper around ChromaDB. ChromaDB is an open-source embedding database that makes working with embeddings and LLMs a lot easier.

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
docsearch = Chroma.from_documents(texts, embeddings)

Using just these two lines of code – and the mighty OpenAI API in the background – we have embeddings for our documents in a ChromaDB.

These embeddings will allow us to, for example, perform semantic similarity searches. We will use them to identify documents, or parts of documents, that match our question. Put simply, if we have numerical representations of texts and if we assume that these representations encode meaning (see e.g., Distributional Semantics), we can compare texts by comparing their vector representations.
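To make this concrete, we can already query the vector store directly; a small sketch of the similarity search the chain below will perform for us:

# Retrieve the two chunks whose embeddings are closest to the embedding of the question.
matching_docs = docsearch.similarity_search('What are the two guidelines?', k=2)
for doc in matching_docs:
    print(doc.page_content[:100])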

Using a Chain for Question Answering Against a Vector Database

Having our embeddings, we can leverage LangChain’s powerful chains to perform our question answering. For this example, we will use VectorDBQA, a “chain for question-answering against a vector database” (see Documentation). Of course, in this case, the vector database is ChromaDB.

llm = OpenAI(model_name='text-davinci-003', temperature=0, openai_api_key=OPENAI_API_KEY)
qa_chain = VectorDBQA.from_chain_type(llm=llm, chain_type='stuff', vectorstore=docsearch)

qa_chain.run('What are the two guidelines?')

Alternatively, we could also run the chain more explicitly using:

qa_chain({'query': 'What are the two guidelines?'}, return_only_outputs=True)

The two guidelines are (1) AI-based writing and related (didactic) issues should be proactively placed, negotiated, and explored, and (2) the use of AI systems must take place with maximum transparency.

As you can see, the system works perfectly and we got exactly what we wanted! 😁

However, what happened behind the scenes?

First of all, we are using a so-called stuff chain (see CombineDocuments Chains). Stuffing is one answer to how we can provide information to the LLM. Using stuffing, we simply “stuff” all the information into the LLM’s prompt. Of course, this only works with shorter documents, as most LLMs have an upper limit for context length.
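If our documents were too long for stuffing, we could switch to another combine-documents chain. A hedged sketch using the chain type names of this LangChain version:

# Hedged sketch: map_reduce runs the LLM over each selected document separately and then
# combines the intermediate answers, trading extra API calls for smaller per-call contexts.
qa_chain_map_reduce = VectorDBQA.from_chain_type(llm=llm, chain_type='map_reduce', vectorstore=docsearch)
qa_chain_map_reduce.run('What are the two guidelines?')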

Furthermore, similarity search (VectorDBQA supports both similarity_search and max_marginal_relevance_search), using the embeddings, is performed to find matching documents to feed to the LLM as context. Of course, with only one document this is not particularly useful at first glance. However, as we “chunked” our text, we are, technically, working with multiple documents. Selecting the best documents beforehand, based on semantic similarity, allows us to feed the model (via the prompt) meaningful knowledge, allowing us to stay within the allowed context size. We simply cannot provide the model with all of the information available!
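If you are curious about the second search type, you can also call it directly on the vector store. A hedged sketch, assuming the Chroma wrapper exposes max_marginal_relevance_search:

# Hedged sketch: maximal marginal relevance (MMR) trades off similarity to the question
# against diversity among the selected chunks.
diverse_docs = docsearch.max_marginal_relevance_search('What are the two guidelines?', k=4)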

By default, VectorDBQA will pick four documents. Of course, we could change this behavior by passing the k parameter.

qa = VectorDBQA.from_chain_type(llm=llm, chain_type='stuff', vectorstore=docsearch, k=1)

Now, only one document will be chosen! However, in this particular example, a single document was not enough context to answer the question.

Finally, we can also have a look at the underlying default prompt for our VectorDBQA, which can be found in LangChain’s codebase:

prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""

{context} will be replaced with the selected documents, {question} with our question. As you can see, there is prompt engineering baked into LangChain! Of course, these default prompts can be changed at will.
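Since we already imported PromptTemplate above, we can also pass our own prompt to the chain. A hedged sketch; the chain_type_kwargs route is how the stuff chain’s prompt can be overridden in this LangChain version, and the wording of the prompt is, of course, arbitrary:

# Hedged sketch: overriding the default prompt of the stuff chain.
custom_prompt = PromptTemplate(
    template=(
        'Use the following pieces of context to answer the question at the end. '
        'If the context is not sufficient, say so instead of guessing.\n\n'
        '{context}\n\n'
        'Question: {question}\n'
        'Helpful Answer:'
    ),
    input_variables=['context', 'question'],
)

qa_chain_custom = VectorDBQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    vectorstore=docsearch,
    chain_type_kwargs={'prompt': custom_prompt},
)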

In summary, we are picking fitting documents (chunks) based on similarity to our question. Then, we are “stuffing” this knowledge into the LLM and asking it to answer our question.

If you are interested in exploring what is going on behind the scenes, also have a look at my article on tracking and inspecting prompts in LangChain.

Be aware: As I hinted at above, there are privacy implications here! To get the embeddings, the documents are sent to OpenAI. Furthermore, when using OpenAI’s GPT via the API, parts of your texts, the selected documents, will be sent as part of the prompt.

Conclusions

LLMs such as (Chat)GPT are extremely powerful and can almost work wonders if they have the right prompts and the right contextual information.

As the example above shows, LangChain is extremely helpful in interfacing with LLMs such as OpenAI’s GPT. Here, we have used LangChain to construct a prompt including matching context information in order to answer a question about our documents. Instead of finetuning the model, we have selected meaningful information and asked the LLM to work with it.

That said, LangChain can do a lot more, and it is absolutely worth exploring! Most importantly, however, the above example demonstrates how general-purpose LLMs, models trained on “everything,” can be used productively for special cases if used correctly, e.g., by looking at prompt engineering.