RAG: Retrieval Augmented Generation
Why RAG Exists
LLMs have two fundamental limitations: their knowledge is frozen at training time (the knowledge cutoff), and they tend to hallucinate when asked about things they don't know. RAG mitigates both by retrieving relevant documents at query time and passing them to the model as context — grounding the response in real, up-to-date data.
The RAG Pipeline
INDEXING (run once / on update)
─────────────────────────────────────────────────────────
Documents → Loader → Splitter → Embedder → Vector Store
                                                │
RETRIEVAL + GENERATION (run per query)          │
────────────────────────────────────────────────┘
User Query → Embedder → Vector Store (similarity search)
                             │
                       Top-K chunks
                             │
           Prompt Template (question + context)
                             │
                            LLM
                             │
                       Final Answer
Document Loaders & Text Splitters
LangChain ships with over 100 document loaders. Choosing the right one ensures metadata (page numbers, source URLs, section titles) is preserved alongside text content.
from langchain_community.document_loaders import (
PyPDFLoader,
WebBaseLoader,
CSVLoader,
DirectoryLoader,
)
# PDF — one Document per page, metadata includes page number
pdf_docs = PyPDFLoader("annual_report.pdf").load()
print(f"Loaded {len(pdf_docs)} pages")
print(pdf_docs[0].metadata) # {'source': 'annual_report.pdf', 'page': 0}
# Web pages
web_docs = WebBaseLoader("https://docs.langchain.com/").load()
# CSV — one Document per row, columns become metadata
csv_docs = CSVLoader("products.csv", metadata_columns=["sku", "category"]).load()
# Entire directory of files
dir_docs = DirectoryLoader("./docs", glob="**/*.md").load()
print(f"Loaded {len(dir_docs)} markdown files")
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# RecursiveCharacterTextSplitter — the default, works for most text
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # characters per chunk
chunk_overlap=200, # overlap helps avoid cutting mid-sentence
separators=["\n\n", "\n", ". ", " ", ""], # tries these in order
)
chunks = splitter.split_documents(pdf_docs)
print(f"{len(pdf_docs)} pages → {len(chunks)} chunks")
# SemanticChunker — groups sentences by semantic similarity
# Better for narrative text; slower (requires embedding each sentence)
semantic_splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile", # split where similarity drops
breakpoint_threshold_amount=95,
)
semantic_chunks = semantic_splitter.split_documents(pdf_docs)
Too small (under 200 chars): chunks lack context, retrieval scores are noisy.
Too large (over 2000 chars): fewer chunks but each may contain irrelevant content that dilutes the relevant part. A good starting point is 800–1000 chars with 150–200 overlap. Always evaluate with your actual queries.
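To make the trade-off concrete before committing, a quick sweep over a few candidate sizes shows how the chunk count and average length shift. This is a rough sketch reusing pdf_docs and RecursiveCharacterTextSplitter from above; the specific sizes are arbitrary.
# Compare how different chunk sizes carve up the same pages
for size in (400, 800, 1600):
    s = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    cs = s.split_documents(pdf_docs)
    avg = sum(len(c.page_content) for c in cs) / len(cs)
    print(f"chunk_size={size}: {len(cs)} chunks, avg {avg:.0f} chars")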
Embeddings & Vector Stores
An embedding model converts text to a dense float vector. Vectors for semantically similar texts are close together in high-dimensional space, enabling similarity search.
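To see this concretely, you can embed a few sentences and compare them directly. This is a small illustration only: the example sentences are made up, and numpy is used here just to compute cosine similarity.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

vecs = embedder.embed_documents([
    "How do I get my money back?",                     # query-like sentence
    "Refunds are issued within 30 days of purchase.",  # same topic
    "The office is closed on public holidays.",        # unrelated
])

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # noticeably higher: related meaning
print(cosine(vecs[0], vecs[2]))  # lower: unrelated meaning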
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vector store from documents (builds index)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embedder,
persist_directory="./chroma_db", # saves to disk
collection_name="enterprise_docs",
)
# Load existing store
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embedder,
collection_name="enterprise_docs",
)
# Similarity search
results = vectorstore.similarity_search("What is our refund policy?", k=4)
for doc in results:
    print(doc.page_content[:200])
    print(doc.metadata)
# Search with relevance scores
results_with_scores = vectorstore.similarity_search_with_relevance_scores(
"refund policy", k=4
)
for doc, score in results_with_scores:
    print(f"score={score:.3f}: {doc.page_content[:100]}")
Retrieval Strategies
Basic similarity search is often not enough. These strategies improve retrieval quality for production RAG.
Hybrid Search (BM25 + Semantic)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# BM25 — keyword-based, great for exact product codes or proper nouns
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
# Vector — semantic, great for conceptual questions
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Combine with Reciprocal Rank Fusion (RRF)
ensemble = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6], # weight semantic higher for general queries
)
docs = ensemble.invoke("SKU-4821 warranty terms")
Multi-Query Retrieval
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
multi_retriever = MultiQueryRetriever.from_llm(
retriever=vector_retriever,
llm=llm,
)
# Internally generates 3 paraphrases of the query, retrieves for each,
# then deduplicates results — improves recall on ambiguous questions
docs = multi_retriever.invoke("cancellation terms for enterprise plan")
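To sanity-check the paraphrases it generates, you can enable INFO logging for the retriever's module. The logger name below follows LangChain's MultiQueryRetriever documentation; confirm it against your installed version.
import logging

# Surface the LLM-generated query variants in the log output
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

docs = multi_retriever.invoke("cancellation terms for enterprise plan")
# The generated query variants now appear in the INFO log lines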
Advanced RAG Patterns
Re-ranking with Cohere
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Stage 1: retrieve 20 candidates
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
# Stage 2: re-rank with a cross-encoder, keep top 4
reranker = CohereRerank(model="rerank-english-v3.0", top_n=4)
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=base_retriever,
)
# Returns only the 4 most relevant of the 20 candidates
docs = compression_retriever.invoke("What is the SLA for P1 incidents?")
Full End-to-End RAG Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
model = ChatOpenAI(model="gpt-4o", temperature=0)
RAG_PROMPT = ChatPromptTemplate.from_messages([
("system", """You are an enterprise support assistant.
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Always cite the source document and page number.
Context:
{context}"""),
("human", "{question}"),
])
def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source','?')}, "
        f"Page: {d.metadata.get('page','?')}]\n{d.page_content}"
        for d in docs
    )
rag_chain = (
    RunnableParallel(
        context=compression_retriever | format_docs,
        question=RunnablePassthrough(),
    )
    | RAG_PROMPT
    | model
    | StrOutputParser()
)
answer = rag_chain.invoke("What is the maximum file upload size?")
print(answer)
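If you also need the retrieved documents back — for example, to render citations in a UI — one option is to restructure the chain so the raw docs are returned alongside the generated answer. This sketch follows the common LCEL pattern for returning sources; names like rag_chain_with_sources are illustrative.
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Inner chain: format the already-retrieved docs, then generate the answer
answer_chain = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | RAG_PROMPT
    | model
    | StrOutputParser()
)

# Outer chain: retrieve once, keep the raw docs, and attach the answer
rag_chain_with_sources = RunnableParallel(
    context=compression_retriever,
    question=RunnablePassthrough(),
).assign(answer=answer_chain)

result = rag_chain_with_sources.invoke("What is the maximum file upload size?")
print(result["answer"])
for d in result["context"]:
    print(d.metadata.get("source"), d.metadata.get("page"))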
RAG Evaluation with RAGAS
Building a RAG pipeline is easy. Knowing whether it's good requires systematic evaluation. RAGAS provides four core metrics:
| Metric | Measures | Good Score |
|---|---|---|
| Faithfulness | Is every claim in the answer supported by the retrieved context? | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | > 0.85 |
| Context Precision | Of the retrieved chunks, how many were actually useful? | > 0.80 |
| Context Recall | Were all relevant chunks retrieved? | > 0.75 |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
# Build an evaluation dataset: questions + ground truth answers
eval_data = {
    "question": [
        "What is the maximum upload size?",
        "How do I reset my password?",
    ],
    "ground_truth": [
        "The maximum upload size is 100MB per file.",
        "Click 'Forgot Password' on the login page.",
    ],
    "answer": [],    # filled by running your RAG chain
    "contexts": [],  # filled by capturing retrieved docs
}
# Run RAG chain for each question and collect contexts
for q in eval_data["question"]:
    # Retrieve explicitly (same retriever the chain uses) so the contexts
    # can be recorded alongside the answer
    docs = compression_retriever.invoke(q)
    context_texts = [d.page_content for d in docs]
    answer = rag_chain.invoke(q)
    eval_data["answer"].append(answer)
    eval_data["contexts"].append(context_texts)
dataset = Dataset.from_dict(eval_data)
results = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}
Create at least 20–50 representative question/answer pairs before tuning chunk sizes, retrieval strategies, or prompts. Without a fixed eval set, you're optimising blindly — a change that improves one query may silently break five others.
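A simple way to keep that evaluation set fixed across experiments is to store it in a versioned file and reload it for every run. A minimal sketch, assuming the eval_data structure from above; the file name is arbitrary.
import json

# Persist the question / ground_truth pairs so every tuning run is scored
# against exactly the same examples
with open("rag_eval_set.json", "w") as f:
    json.dump(
        {k: eval_data[k] for k in ("question", "ground_truth")},
        f,
        indent=2,
    )

# Later experiments reload the same set before re-running the RAG chain
with open("rag_eval_set.json") as f:
    fixed_eval_set = json.load(f)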