Retrieval Augmented Generation (or RAG, as everyone calls it) is one of those design patterns that actually delivers on its promise. I've seen it power everything from customer support chatbots to internal knowledge assistants. The best part? It tackles that persistent problem we all face with LLMs. It grounds their responses in real information, which dramatically cuts down on hallucinations.

RAG is the backbone of data-aware chatbots and support tools everywhere. If you want to see how this works in practice, I put together a tutorial on building a multi-agent chatbot with ChromaDB RAG that shows the whole process end to end.

Now, RAG has two main components: retrieval and generation. This guide focuses on RAG 101, specifically the retrieval part. We're going to zero in on finding the right context for your model. You'll build an index and search through it: first manually, to really understand what's happening, and then with LangChain to speed things up. Once you've got retrieval down, plugging the results into your LLM is straightforward.


The retrieval process is really the heart of what we're doing here. It's all about finding and selecting the most relevant documents for your use case. There are several approaches you can take, but vector store retrieval has become the go-to method for most applications.

Let me walk you through the two essential retrieval steps. First, you'll create the index. Then you'll search it. I want you to see the mechanics clearly, so we'll do it manually first. After that, I'll show you how LangChain makes the whole process much smoother.

Semantic Similarity Search Done Manually

Semantic Similarity Search is probably the most widely used retrieval approach out there. The idea is simple: find documents that best match your input based on meaning, not just keywords. This approach really shines when you need to match semantic content across text. Say you have a user query and you want to find the most relevant sentences or paragraphs in your documentation. It's particularly powerful for retrieving internal knowledge to enhance customer support or adding grounded context to LLM responses.

We're going to implement Semantic Similarity Search manually using Sentence Transformers for embeddings and ChromaDB as our vector store. If you're curious about how transformer models generate embeddings and why they work so well for semantic search, I've written a detailed guide that explains the underlying mechanics.
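
To make "matching on meaning, not keywords" concrete, here's a quick sketch that compares sentence embeddings with cosine similarity. The example sentences are made up for illustration, and the model is the same all-MiniLM-L6-v2 we'll use throughout:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two sentences that share meaning but few keywords, plus an unrelated one
sentences = [
    "The committee decided to lower interest rates.",
    "Policymakers agreed to cut the cost of borrowing.",
    "My cat sleeps on the windowsill all afternoon.",
]
embeddings = model.encode(sentences)

# Cosine similarity: higher scores mean closer meaning
print(util.cos_sim(embeddings[0], embeddings[1]))  # paraphrase pair -> noticeably higher
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated pair -> lower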

Setup

Before diving in, let's get the required Python libraries installed. Here's what we'll use:

  • chromadb: A fast and efficient vector database that's become my go-to choice

  • sentence-transformers: For generating sentence embeddings

  • PyPDF: A modern library for reading and processing PDF files

  • langchain: The framework for building LLM-powered applications

  • langchain-community: Extensions and integrations for LangChain

  • langchain-huggingface: Tools to use Hugging Face models within LangChain

  • langchain-chroma: The Chroma vector store integration for LangChain

By the way, if you're interested in integrating LLMs into your Python workflow for tasks beyond retrieval, I've found some practical techniques for LLM pair programming in Jupyter AI that might help.

%pip install chromadb
%pip install sentence-transformers
%pip install pypdf
%pip install langchain
%pip install langchain-community
%pip install langchain-huggingface
%pip install langchain-chroma

Step 1: Create the Index

Alright, with everything installed, let's prepare an index for your documents. The process breaks down into four main parts:

  1. Load the documents: Extract text from your source file

  2. Split the text: Break it into semantically meaningful chunks

  3. Generate embeddings: Use a pre-trained model to represent each chunk as a vector

  4. Store data in a vector database: Save chunks, embeddings, and metadata for retrieval

I'm going to use the Minutes of the Federal Open Market Committee as our example document. You can find it here: https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20241107.pdf

Here's the step-by-step implementation:

# Step 1: Import required libraries
from sentence_transformers import SentenceTransformer
from chromadb import PersistentClient
from pypdf import PdfReader

# Step 2: Define functions for processing

def load_pdf(filepath):
    """
    Load text from a PDF document.
    """
    reader = PdfReader(filepath)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    print(f"Text with {len(text.split())} words extracted from pdf {filepath}.")
    return text

def split_text(text, chunk_size=500, overlap=100):
    """
    Split the text into chunks of a specified size with overlap.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    print(f"Text split in {len(chunks)} chunks.")
    return chunks

# Step 3: Load and process the document
pdf_path = "fomcminutes20241107.pdf"  # Replace with your PDF path
document_text = load_pdf(pdf_path)
document_chunks = split_text(document_text)

# Step 4: Generate embeddings
# Load a pre-trained SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')  # A lightweight and fast embedding model
embeddings = model.encode(document_chunks)
print(f"Embeddings generated for {len(embeddings)} chunks.")
print(f"Each embedding is of length {len(embeddings[0])}.")

# Step 5: Set up ChromaDB and create the index
persist_directory = "./chroma_db"
chroma_client = PersistentClient(path=persist_directory)  # Persists to disk (Chroma >= 0.4.x)

collection_name = "fomc_minutes_20241107"
collection = chroma_client.create_collection(collection_name)
print(f"ChromaDB collection {collection_name} created.")

# Prepare bulk data for adding to the collection
ids = [f"chunk_{i}" for i in range(len(document_chunks))]
metadatas = [{"chunk_id": i, "source": "sample_product_document.pdf"} for i in range(len(document_chunks))]

# Add chunks, embeddings, and metadata in bulk
collection.add(
    documents=document_chunks,
    metadatas=metadatas,
    ids=ids,
    embeddings=embeddings.tolist(),  # Convert numpy array to list
)

# Step 6: Confirm the index
print(f"All {collection.count()} documents added to the collection {collection_name}.")
Text with 7385 words extracted from pdf fomcminutes20241107.pdf.
Text split into 127 chunks.
Embeddings generated for 127 chunks.
Each embedding is of length 384.
ChromaDB collection fomc_minutes_20241107 created.
All 127 documents added to the collection fomc_minutes_20241107.

Step 2: Search the Index

Now that we've populated our vector database, let's perform some searches. The process is straightforward:

  • Load the collection: Initialize the database client and load your collection

  • Generate a query embedding: Convert your search query into a vector

  • Retrieve results: Use the vector database to find the most similar chunks

# Step 1: Initialize ChromaDB client and load the collection
# We won't run this since the chroma client was already created in this session

# chroma_client = PersistentClient(path="./chroma_db")
# collection_name = "fomc_minutes_20241107"
# collection = chroma_client.get_collection(collection_name)
# print(f"Collection '{collection_name}' loaded successfully.")

# Step 2: Define the search query
search_query = "What was discussed about monetary policy?"  # Replace with your query
print(f"Search Query: {search_query}")

# Step 3: Generate embeddings for the search query
query_embedding = model.encode([search_query])  # Generate embedding for the query
print(f"Embedding generated for search query.")
print(f"Embedding is of length {len(query_embedding[0])}.")

# Step 4: Perform the search
# Set the number of top results to retrieve
top_k = 5
results = collection.query(
    query_embeddings=query_embedding.tolist(),  # Convert numpy array to list
    n_results=top_k,
)

# Access the first (and only) batch of results
documents = results['documents'][0]
metadatas = results['metadatas'][0]
distances = results['distances'][0]

# Print each result
print(f"\nTop {top_k} Results:")
for i, (doc, metadata, distance) in enumerate(zip(documents, metadatas, distances), start=1):
    print(f"\nResult {i}:")
    print(f"Document Chunk: {doc}")
    print(f"Metadata: {metadata}")
    print(f"Distance: {distance}")
Search Query: What was discussed about monetary policy?
Embedding generated for search query.
Embedding is of length 384.

Top 5 Results:

Result 1:
Document Chunk: to
continue the process of reducing the Federal Reserve’s securities holdings.
In discussing the outlook for monetary policy, participants anticipated that if the data came in about
as expected, with inflation continuing to move down sustainably to 2 percent and the economy
remaining near maximum employment, it would likely be appropriate to move gradually toward a more
neutral stance of policy over time. Participants noted that monetary policy decisions were not on a
preset course and w
Metadata: {'chunk_id': 84, 'source': 'fomcminutes20241107.pdf'}
Distance: 0.6293219923973083

Result 2:
Document Chunk: icy over time. Participants noted that monetary policy decisions were not on a
preset course and were conditional on the evolution of the economy and the implications for the
economic outlook and the balance of risks; they stressed that it would be important for the Committee
to make this clear as it adjusted its policy stance. While emphasizing that monetary policy would be
data dependent, many participants noted the volatility of recent economic data and highlighted the
importance of fo
Metadata: {'chunk_id': 85, 'source': 'fomcminutes20241107.pdf'}
Distance: 0.7037546634674072

Result 3:
Document Chunk: percent objective.
Members agreed that, in assessing the appropriate stance of monetary policy, they would continue to
monitor the implications of incoming information for the economic outlook. They would be prepared to Minutes of the Federal Open Market Committee 13

adjust the stance of monetary policy as appropriate if risks emerged that could impede the attainment
of the Committee’s goals. Members also agreed that their assessments would take into account a
wide range of informati
Metadata: {'chunk_id': 95, 'source': 'fomcminutes20241107.pdf'}
Distance: 0.736047089099884

Result 4:
Document Chunk: tered. Many participants observed
that uncertainties concerning the level of the neutral rate of interest complicated the assessment of
the degree of restrictiveness of monetary policy and, in their view, made it appropriate to reduce policy
restraint gradually.
Committee Policy Actions
In their discussions of monetary policy for this meeting, members agreed that economic activity had
continued to expand at a solid pace. Labor market conditions had generally eased since earlier in the
y
Metadata: {'chunk_id': 90, 'source': 'fomcminutes20241107.pdf'}
Distance: 0.736309289932251

Result 5:
Document Chunk: t and inflation goals
remained roughly in balance. Some participants judged that downside risks to economic activity or
the labor market had diminished. Participants noted that monetary policy would need to balance the
risks of easing policy too quickly, thereby possibly hindering further progress on inflation, with the risks
of easing policy too slowly, thereby unduly weakening economic activity and employment. In
discussing the positioning of monetary policy in response to potential ch
Metadata: {'chunk_id': 88, 'source': 'fomcminutes20241107.pdf'}
Distance: 0.8098248243331909

Streamlining with LangChain: RAG 101 Retrieval Made Simpler

The manual approach gives you complete control, which is great for understanding what's happening under the hood. But honestly, it can be verbose and requires quite a bit of setup. This is where LangChain really shines. It abstracts many of these steps and helps you build retrieval-augmented applications much faster. Let me show you how to replicate the same process using LangChain's high-level utilities.

Step 1: Create the Index

With LangChain, creating an index becomes much more intuitive. The library provides tools for document loading, text splitting, and embedding generation all in one place. Here's how to create an index for the same Federal Open Market Committee document:

# Step 1: Import required libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Step 2: Load the document
pdf_path = "fomcminutes20241107.pdf"  # Replace with your PDF path
loader = PyPDFLoader(pdf_path)  # LangChain's PDF loader
documents = loader.load()  # Load text from the PDF
print(f"Loaded {len(documents)} page(s) from the PDF.")

# Step 3: Split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
print(f"Split all text into {len(chunks)} chunks.")

# Step 4: Generate embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Step 5: Store in a Vector Database
persist_directory = "fomc_vector_db"
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=persist_directory,  # Persistence is automatic in Chroma >= 0.4.x
)

print(f"Vector database created and stored at '{persist_directory}'.")
Loaded 17 page(s) from the PDF.
Split all text into 131 chunks.
Vector database created and stored at 'fomc_vector_db'.

Step 2: Search the Index

LangChain also simplifies the querying process by managing the search internally. Here's how to perform a semantic search on your vector database:

# Step 1: Import required libraries
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings 

# Step 2: Load the vector database
persist_directory = "fomc_vector_db"
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_model,
)
print(f"Vector database loaded from '{persist_directory}'.")

# Step 3: Define the search query
query = "What were the key points discussed about monetary policy?"
print(f"Search Query: {query}")

# Step 4: Perform the search
top_k = 5  # Number of top results to retrieve
results = vector_db.similarity_search(query, k=top_k)

# Step 5: Display the results
print(f"\nTop {top_k} Results:")
for i, result in enumerate(results, start=1):
    print(f"\nResult {i}:")
    print(f"Document Chunk: {result.page_content}")
    print(f"Metadata: {result.metadata}")
Search Query: What were the key points discussed about monetary policy?

Top 5 Results:

Result 1:
Document Chunk: 25 basis points to 4½ to 4¾ percent. Participants observed that such a further recalibration of the
monetary policy stance would help maintain the strength in the economy and the labor market while
continuing to enable further progress on inflation. Participants judged that it was appropriate to
continue the process of reducing the Federal Reserve’s securities holdings.
In discussing the outlook for monetary policy, participants anticipated that if the data came in about
Metadata: {'page': 10, 'source': 'fomcminutes20241107.pdf'}

Result 2:
Document Chunk: as expected, with inflation continuing to move down sustainably to 2 percent and the economy
remaining near maximum employment, it would likely be appropriate to move gradually toward a more
neutral stance of policy over time. Participants noted that monetary policy decisions were not on a
preset course and were conditional on the evolution of the economy and the implications for the
economic outlook and the balance of risks; they stressed that it would be important for the Committee
Metadata: {'page': 10, 'source': 'fomcminutes20241107.pdf'}

Result 3:
Document Chunk: would be prepared to adjust the stance of monetary policy as appropriate if risks emerge that
could impede the attainment of the Committee’s goals. The Committee’s assessments will
take into account a wide range of information, including readings on labor market conditions,
inflation pressures and inflation expectations, and financial and international developments.”
Voting for this action: Jerome H. Powell, John C. Williams, Thomas I. Barkin, Michael S. Barr,
Metadata: {'page': 13, 'source': 'fomcminutes20241107.pdf'}

Result 4:
Document Chunk: commencement of policy easing in September and therefore was no longer needed. Almost all
members agreed that the risks to achieving the Committee’s employment and inflation goals were
roughly in balance. Members viewed the economic outlook as uncertain and agreed that they were
attentive to the risks to both sides of the Committee’s dual mandate.
In support of its goals, the Committee decided to lower the target range for the federal funds rate by
Metadata: {'page': 11, 'source': 'fomcminutes20241107.pdf'}

Result 5:
Document Chunk: American countries, notably Brazil, inflation increased, partly because of renewed food price
pressures.
Many foreign central banks eased policy during the intermeeting period, including the Bank of Canada
and the European Central Bank among the AFEs and the central banks of Colombia, Mexico, Korea,
the Philippines, and Thailand among the emerging market economies.
Staff Review of the Financial Situation
Metadata: {'page': 3, 'source': 'fomcminutes20241107.pdf'}
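
One handy follow-up: a LangChain vector store can also be wrapped as a retriever, which is the interface most chains and agents expect. Here's a minimal sketch, assuming the vector_db object we created above:

# Wrap the vector store as a retriever; it returns Document objects directly
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

# invoke() runs the same similarity search under the hood
docs = retriever.invoke("What were the key points discussed about monetary policy?")
for doc in docs:
    print(doc.metadata, doc.page_content[:100])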

Conclusion

Retrieval really is the foundation of RAG. This is RAG 101 at its core. You've learned how to create an index by processing and embedding documents, and how to search that index to find relevant information based on queries. We worked through both methods using the Minutes of the Federal Open Market Committee. The manual path gave you transparency and control, while the LangChain path showed how abstraction can save time and simplify your workflows.
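
And as promised at the start, once retrieval is in place, handing the results to your LLM is the easy part. Here's a minimal sketch of that hand-off, reusing the vector_db from the LangChain section; the prompt template is just an illustration, and you'd send the final string to whichever chat model you use:

# Assemble retrieved chunks into a grounded prompt (illustrative template)
question = "What were the key points discussed about monetary policy?"
retrieved = vector_db.similarity_search(question, k=5)

context = "\n\n".join(doc.page_content for doc in retrieved)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
)

# Pass `prompt` to your LLM of choice; the retrieved context grounds its answer.
print(prompt)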

You're now ready to apply retrieval in your own applications. Use it to enhance chatbot responses. Build powerful search experiences. Remember, retrieval is the heart of RAG. It's what enables meaningful, grounded interactions in LLM-powered applications. And if you want to further improve the quality and reliability of your LLM responses, check out my guide on prompt engineering strategies for reliable LLM outputs.