RAG: Retrieval Augmented Generation
Why RAG Exists
LLMs have two fundamental limitations: their knowledge is frozen at training time (the knowledge cutoff), and they tend to hallucinate when asked about things they don't know. RAG mitigates both by retrieving relevant documents at query time and passing them to the model as context — grounding the response in real, up-to-date data.
The RAG Pipeline
INDEXING (run once / on update)
─────────────────────────────────────────────────────────
Documents → Loader → Splitter → Embedder → Vector Store
                                                │
RETRIEVAL + GENERATION (run per query)          │
────────────────────────────────────────────────┘
User Query → Embedder → Vector Store (similarity search)
                             │
                       Top-K chunks
                             │
           Prompt Template (question + context)
                             │
                            LLM
                             │
                       Final Answer
Document Loaders & Text Splitters
LangChain ships with over 100 document loaders. Choosing the right one ensures metadata (page numbers, source URLs, section titles) is preserved alongside text content.
from langchain_community.document_loaders import (
PyPDFLoader,
WebBaseLoader,
CSVLoader,
DirectoryLoader,
)
# PDF — one Document per page, metadata includes page number
pdf_docs = PyPDFLoader("annual_report.pdf").load()
print(f"Loaded {len(pdf_docs)} pages")
print(pdf_docs[0].metadata) # {'source': 'annual_report.pdf', 'page': 0}
# Web pages
web_docs = WebBaseLoader("https://docs.langchain.com/").load()
# CSV — one Document per row, columns become metadata
csv_docs = CSVLoader("products.csv", metadata_columns=["sku", "category"]).load()
# Entire directory of files
dir_docs = DirectoryLoader("./docs", glob="**/*.md").load()
print(f"Loaded {len(dir_docs)} markdown files")
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# RecursiveCharacterTextSplitter — the default, works for most text
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # characters per chunk
chunk_overlap=200, # overlap helps avoid cutting mid-sentence
separators=["\n\n", "\n", ". ", " ", ""], # tries these in order
)
chunks = splitter.split_documents(pdf_docs)
print(f"{len(pdf_docs)} pages → {len(chunks)} chunks")
# SemanticChunker — groups sentences by semantic similarity
# Better for narrative text; slower (requires embedding each sentence)
semantic_splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile", # split where similarity drops
breakpoint_threshold_amount=95,
)
semantic_chunks = semantic_splitter.split_documents(pdf_docs)
Too small (under 200 chars): chunks lack context, retrieval scores are noisy.
Too large (over 2000 chars): fewer chunks but each may contain irrelevant content that dilutes the relevant part. A good starting point is 800–1000 chars with 150–200 overlap. Always evaluate with your actual queries.
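To make the trade-off concrete before committing, a quick sweep over a few candidate sizes shows how the chunk count and average length shift. This is a rough sketch reusing pdf_docs and RecursiveCharacterTextSplitter from above; the specific sizes are arbitrary.
# Compare how different chunk sizes carve up the same pages
for size in (400, 800, 1600):
    s = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    cs = s.split_documents(pdf_docs)
    avg = sum(len(c.page_content) for c in cs) / len(cs)
    print(f"chunk_size={size}: {len(cs)} chunks, avg {avg:.0f} chars")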
Embeddings & Vector Stores
An embedding model converts text to a dense float vector. Vectors for semantically similar texts are close together in high-dimensional space, enabling similarity search.
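To see this concretely, you can embed a few sentences and compare them directly. This is a small illustration only: the example sentences are made up, and numpy is used here just to compute cosine similarity.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

vecs = embedder.embed_documents([
    "How do I get my money back?",                     # query-like sentence
    "Refunds are issued within 30 days of purchase.",  # same topic
    "The office is closed on public holidays.",        # unrelated
])

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # noticeably higher: related meaning
print(cosine(vecs[0], vecs[2]))  # lower: unrelated meaning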
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vector store from documents (builds index)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embedder,
persist_directory="./chroma_db", # saves to disk
collection_name="enterprise_docs",
)
# Load existing store
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embedder,
collection_name="enterprise_docs",
)
# Similarity search
results = vectorstore.similarity_search("What is our refund policy?", k=4)
for doc in results:
    print(doc.page_content[:200])
    print(doc.metadata)
# Search with relevance scores
results_with_scores = vectorstore.similarity_search_with_relevance_scores(
"refund policy", k=4
)
for doc, score in results_with_scores:
    print(f"score={score:.3f}: {doc.page_content[:100]}")
Retrieval Strategies
Basic similarity search is often not enough. These strategies improve retrieval quality for production RAG.
Hybrid Search (BM25 + Semantic)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# BM25 — keyword-based, great for exact product codes or proper nouns
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
# Vector — semantic, great for conceptual questions
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Combine with Reciprocal Rank Fusion (RRF)
ensemble = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6], # weight semantic higher for general queries
)
docs = ensemble.invoke("SKU-4821 warranty terms")
Multi-Query Retrieval
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
multi_retriever = MultiQueryRetriever.from_llm(
retriever=vector_retriever,
llm=llm,
)
# Internally generates 3 paraphrases of the query, retrieves for each,
# then deduplicates results — improves recall on ambiguous questions
docs = multi_retriever.invoke("cancellation terms for enterprise plan")
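To sanity-check the paraphrases it generates, you can enable INFO logging for the retriever's module. The logger name below follows LangChain's MultiQueryRetriever documentation; confirm it against your installed version.
import logging

# Surface the LLM-generated query variants in the log output
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

docs = multi_retriever.invoke("cancellation terms for enterprise plan")
# The generated query variants now appear in the INFO log lines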
Advanced RAG Patterns
Re-ranking with Cohere
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Stage 1: retrieve 20 candidates
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
# Stage 2: re-rank with a cross-encoder, keep top 4
reranker = CohereRerank(model="rerank-english-v3.0", top_n=4)
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=base_retriever,
)
# Returns only the 4 most relevant of the 20 candidates
docs = compression_retriever.invoke("What is the SLA for P1 incidents?")
Full End-to-End RAG Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
model = ChatOpenAI(model="gpt-4o", temperature=0)
RAG_PROMPT = ChatPromptTemplate.from_messages([
("system", """You are an enterprise support assistant.
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Always cite the source document and page number.
Context:
{context}"""),
("human", "{question}"),
])
def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source','?')}, "
        f"Page: {d.metadata.get('page','?')}]\n{d.page_content}"
        for d in docs
    )
rag_chain = (
    RunnableParallel(
        context=compression_retriever | format_docs,
        question=RunnablePassthrough(),
    )
    | RAG_PROMPT
    | model
    | StrOutputParser()
)
answer = rag_chain.invoke("What is the maximum file upload size?")
print(answer)
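If you also need the retrieved documents back — for example, to render citations in a UI — one option is to restructure the chain so the raw docs are returned alongside the generated answer. This sketch follows the common LCEL pattern for returning sources; names like rag_chain_with_sources are illustrative.
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Inner chain: format the already-retrieved docs, then generate the answer
answer_chain = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | RAG_PROMPT
    | model
    | StrOutputParser()
)

# Outer chain: retrieve once, keep the raw docs, and attach the answer
rag_chain_with_sources = RunnableParallel(
    context=compression_retriever,
    question=RunnablePassthrough(),
).assign(answer=answer_chain)

result = rag_chain_with_sources.invoke("What is the maximum file upload size?")
print(result["answer"])
for d in result["context"]:
    print(d.metadata.get("source"), d.metadata.get("page"))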
RAG Evaluation with RAGAS
Building a RAG pipeline is easy. Knowing whether it's good requires systematic evaluation. RAGAS provides four core metrics:
| Metric | Measures | Good Score |
|---|---|---|
| Faithfulness | Is every claim in the answer supported by the retrieved context? | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | > 0.85 |
| Context Precision | Of the retrieved chunks, how many were actually useful? | > 0.80 |
| Context Recall | Were all relevant chunks retrieved? | > 0.75 |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
# Build an evaluation dataset: questions + ground truth answers
eval_data = {
    "question": [
        "What is the maximum upload size?",
        "How do I reset my password?",
    ],
    "ground_truth": [
        "The maximum upload size is 100MB per file.",
        "Click 'Forgot Password' on the login page.",
    ],
    "answer": [],    # filled by running your RAG chain
    "contexts": [],  # filled by capturing retrieved docs
}
# Run RAG chain for each question and collect contexts
for q in eval_data["question"]:
    # Retrieve explicitly (same retriever the chain uses) so the contexts
    # can be recorded alongside the answer
    docs = compression_retriever.invoke(q)
    context_texts = [d.page_content for d in docs]
    answer = rag_chain.invoke(q)
    eval_data["answer"].append(answer)
    eval_data["contexts"].append(context_texts)
dataset = Dataset.from_dict(eval_data)
results = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}
Create at least 20–50 representative question/answer pairs before tuning chunk sizes, retrieval strategies, or prompts. Without a fixed eval set, you're optimising blindly — a change that improves one query may silently break five others.
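A simple way to keep that evaluation set fixed across experiments is to store it in a versioned file and reload it for every run. A minimal sketch, assuming the eval_data structure from above; the file name is arbitrary.
import json

# Persist the question / ground_truth pairs so every tuning run is scored
# against exactly the same examples
with open("rag_eval_set.json", "w") as f:
    json.dump(
        {k: eval_data[k] for k in ("question", "ground_truth")},
        f,
        indent=2,
    )

# Later experiments reload the same set before re-running the RAG chain
with open("rag_eval_set.json") as f:
    fixed_eval_set = json.load(f)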