Tutorial - Building a Retrieval Augmented Generation (RAG) System

Attachments: article_code.txt | documents.zip








This article walks through the entire process of building a Retrieval Augmented Generation (RAG) system. It is also a hands-on tutorial that you can follow along with to build a very simple RAG system of your own.

RAG systems can differ in countless ways, but what I cover here are the major components found in the vast majority of them.




What Is Retrieval Augmented Generation?

Retrieval Augmented Generation (RAG) refers to retrieving information from data sources external to the Large Language Model (LLM). This information is then incorporated into the LLM’s response.

This allows the LLM to go beyond its training data and also gives it access to more up-to-date information, such as current news or the latest regulations. Searching the internet is a form of RAG, but the most common industry implementation is retrieving documents from an internal company database.
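The whole idea fits in a few lines. Here is a minimal sketch of the retrieve-then-generate loop, where `retrieve` and `llm` are hypothetical callables standing in for the real components we build later in this tutorial:

```python
def rag_answer(question, retrieve, llm):
    """Minimal RAG loop: fetch relevant text, splice it into the
    prompt, and let the LLM generate the final answer."""
    context = "\n\n".join(retrieve(question))
    prompt = f"Using this information:\n\n{context}\n\nAnswer this question: {question}"
    return llm(prompt)

# Toy stand-ins: a 'retriever' that always returns one snippet and an
# 'LLM' that just echoes its prompt back.
answer = rag_answer(
    "Who won in 2025?",
    retrieve=lambda q: ["The 2025 winner was Example Corp."],
    llm=lambda p: p,
)
```

Everything that follows is about building real versions of these two stand-ins: a vector-search retriever and a call to an actual LLM.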




Selecting a Pool of Documents

What information to retrieve depends on your use case. For example, if we have a company chatbot answering customer questions about the company’s products, we would start with a pool of documents containing relevant details on those products.

For this simple tutorial, we will use just a small handful of documents as a demonstration. These documents can be found in the documents.zip file, which is listed in the attachment section at the top of this article.




Chunking the Documents

Each document in the pool has to be divided into smaller chunks. These chunks are what we will retrieve and add to our LLM prompt. There are several reasons why we do not just retrieve the entire document. Here are a few:

- LLM context windows are limited, and long prompts cost more.
- Similarity search works better on small, focused chunks; a whole document mixes many topics into a single vector.
- Padding the prompt with irrelevant text can distract the model from the passage that actually answers the question.

LangChain offers several convenient packages for splitting text using different methods. We will use the recursive character text splitter, which attempts to keep paragraphs intact.

Below is an example of its usage in the code. The full code for this entire tutorial can be found in the article_code.txt file in the attachment section at the top of this article.


# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

with open('wiki_game_awards_2025.txt', encoding='utf-8') as f:
    text1 = text_splitter.split_text(f.read())
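To build intuition for what chunking with overlap produces, here is a naive pure-Python sketch of the idea (not LangChain's implementation — the recursive splitter is smarter and tries to cut on paragraph and sentence boundaries before falling back to fixed sizes):

```python
def chunk_text(text, chunk_size, overlap):
    # Slide a window of chunk_size characters, stepping forward by
    # chunk_size - overlap so consecutive chunks share some text.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in a neighboring chunk, which helps retrieval.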

Note that the chiirl_events.txt file contains horizontal lines that give us a convenient way to split it into chunks.


with open('chiirl_events.txt', encoding='utf-8') as f:
    doc = f.read()
    text3 = doc.split('____________________')
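One small refinement worth considering: splitting on a delimiter leaves stray newlines around each chunk, and possibly empty fragments at the ends. A sketch of cleaning that up, using a made-up two-event document in place of the real file:

```python
# Hypothetical stand-in for the contents of chiirl_events.txt.
doc = "Event A details\n____________________\nEvent B details\n"

# Strip surrounding whitespace and drop any empty fragments.
chunks = [c.strip() for c in doc.split('____________________') if c.strip()]
# → ['Event A details', 'Event B details']
```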



Embedding and Vector Databases

The next step is to convert these text chunks into vectors and store them in a vector database. We will use the all-mpnet-base-v2 model for this. There are many choices, but this one is a popular, well-regarded general-purpose sentence embedding model.

all-mpnet-base-v2 turns chunks of text into a 768-dimensional vector. It is a sentence transformer model, also known as sentence-BERT (SBERT). There are other ways to turn text into vectors. For example, doc2vec is a vectorization method based on the well-known word2vec algorithm.


# pip install langchain-huggingface
# pip install sentence-transformers
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

After vectorizing the document chunks, we need to store these vectors and their corresponding text in a vector database.

There is a long list of vector databases. The open-source Chroma appears to do well in benchmarks and is relatively popular. Another is pgvector, which adds vector similarity search for PostgreSQL and is also open-source. Pgvector is very popular due to the popularity of PostgreSQL, but it doesn’t appear to do as well in benchmarks, probably because of the PostgreSQL overhead.

I could keep comparing vector databases for a long time. For this article, we are just going to use Faiss, Meta's open-source similarity-search library, together with LangChain's in-memory docstore.


# pip install langchain-community
# pip install faiss-cpu
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

Again, the code shown here is just an example. The full code for this entire tutorial can be found in the article_code.txt file in the attachment section at the top of this article.

You might have noticed that everything we used exists as separate, standalone Python packages. LangChain isn’t actually necessary to run them. There are discussions on whether there are any benefits to using them from within LangChain. For this tutorial, we are just going to do everything in LangChain.




Retrieval and Augmentation

Again, there are many different strategies for implementing RAG. We will just use a straightforward method:

- Embed the user's prompt with the same model used for the document chunks.
- Retrieve the k chunks whose vectors are closest to the prompt's vector.
- Prepend the retrieved chunks to the prompt and send the augmented prompt to the LLM.

It is easy to test inserting a document into our Faiss database, then retrieving it by its ID.


from langchain_core.documents import Document

document = Document(
    page_content='testing document'
)

vector_store.add_documents(
    documents=[document],
    ids=['test']
)

print(vector_store.get_by_ids(['test']))

But of course, the actual retrieval is done by vector similarity search and not just by ID. Let’s try this on our test document, using a test prompt.


results = vector_store.similarity_search_with_score(
    'Will it be hot tomorrow?', k=1
)

print('score = ', results[0][1])

The default for Faiss is to return \( k \) documents closest to our prompt based on Euclidean distance. For the example above, we set \( k = 1 \). The score is the square of the Euclidean distance between the vector representation of the prompt and that of the retrieved document.
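Under the hood, IndexFlatL2 does exactly this brute-force comparison: compute the squared Euclidean distance from the query to every stored vector and return the k closest. Here is a toy sketch with hypothetical 3-dimensional vectors standing in for the real 768-dimensional embeddings:

```python
def squared_l2(a, b):
    # Squared Euclidean distance -- the 'score' Faiss reports.
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two made-up document vectors and a query vector.
index = {
    'doc_a': [1.0, 0.0, 0.0],
    'doc_b': [0.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.0]

# Scan every stored vector and keep the closest one (k = 1).
best = min(index, key=lambda name: squared_l2(index[name], query))
# best → 'doc_a' (score 0.01 + 0.01 = 0.02, vs 1.62 for doc_b)
```

Because it compares against every vector, IndexFlatL2 is exact but scales linearly with the number of chunks; Faiss also offers approximate indexes for larger collections.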

Interestingly, LangChain does not give us a way to retrieve the stored vector representation of a document. But we can still check that the score we got is indeed the squared Euclidean distance (up to floating-point precision, since Faiss stores vectors as 32-bit floats).


import math

t1 = embeddings.embed_query('testing document')
t2 = embeddings.embed_query('Will it be hot tomorrow?')

print('euclidean dist = ', math.dist(t1, t2) ** 2)

vector_store.delete(ids=['test'])





RAG In Action

Now that we have all the pieces in place, we can add the three documents to our vector database.


doc_list = []

for chunk in text1:
    doc_list.append(Document(page_content=chunk))

for chunk in text2:
    doc_list.append(Document(page_content=chunk))

for chunk in text3:
    doc_list.append(Document(page_content=chunk))

keys = vector_store.add_documents(documents=doc_list)

Finally, let’s see a few hands-on examples of how RAG can help the LLM answer questions that it wouldn’t be able to otherwise. The code below shows how to prompt gemini-2.5-flash with and without RAG.


from google import genai

client = genai.Client(api_key='your api key here')

response = client.models.generate_content(
    model='gemini-2.5-flash', contents='who performed at the 2025 game awards'
)

print('* * * no RAG * * *')
print('\n')
print(response.text)
print('\n')

# - - -

prompt = 'who performed at the 2025 game awards'
rag_string = ''
x = vector_store.similarity_search_with_score(prompt, k=5)

for doc in x:
    y = doc[0].page_content
    rag_string += y
    rag_string += '\n\n'

new_prompt = f'''

using this information:

{rag_string}

answer this question: {prompt}

'''

rag_response = client.models.generate_content(
    model='gemini-2.5-flash', contents=new_prompt
)

print('* * * with RAG * * *')
print('\n')
print(rag_response.text)
print('\n')

Here is the output of the code above.



Congrats, you have built a working RAG system! Here are two more screenshots comparing prompts with and without RAG.