What is RAG?
Module 1 · ~8 min read
Core idea: RAG lets you give an LLM access to your own documents without retraining it. The model stays the same — you simply augment each query with retrieved knowledge.
Definition
Retrieval-Augmented Generation (RAG) is an architecture that combines a knowledge retrieval system with a Large Language Model. Instead of relying solely on what the LLM learned during training, RAG first searches a curated knowledge base for relevant passages, then injects those passages into the LLM prompt as context.
The result: answers grounded in real, up-to-date, domain-specific content — with citations back to the source documents.
The Problem RAG Solves
LLMs have three fundamental limitations when used alone:
- Training cutoff: The model knows nothing that happened after its training data was collected. Ask Claude about last week's earnings report and it will either confabulate or refuse.
- Hallucination: LLMs generate plausible-sounding text even when they do not know the answer. Without grounding in real documents, confident-sounding wrong answers are common.
- Private / proprietary data: Your internal policies, contracts, product specs, and customer data were never in the training corpus. The LLM simply does not know them.
RAG addresses all three by putting the relevant text directly in the prompt at query time.
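At its core, that augmentation is just string assembly. A minimal sketch (the class and method names here are illustrative, not the actual Power RAG API):

```java
import java.util.List;

public class PromptAugmenter {
    // Prepend retrieved passages to the user's question so the LLM can
    // ground its answer in them instead of relying on training data.
    static String augment(String question, List<String> passages) {
        StringBuilder prompt = new StringBuilder("Answer using only the context below.\n\n");
        for (int i = 0; i < passages.size(); i++) {
            prompt.append("[SOURCE ").append(i + 1).append("]\n")
                  .append(passages.get(i)).append("\n\n");
        }
        prompt.append("Question: ").append(question);
        return prompt.toString();
    }

    public static void main(String[] args) {
        System.out.println(augment("What is our refund window?",
                List.of("Refunds are accepted within 30 days of purchase.")));
    }
}
```

The LLM now answers from the `[SOURCE N]` blocks, which is also what makes per-passage citations possible.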
RAG vs Fine-Tuning
| Dimension | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge update speed | Instant — add documents to the store | Hours/days — retrain the model |
| Cost | Storage + inference | GPU training + inference |
| Interpretability | High — sources are visible | Low — knowledge baked into weights |
| Hallucination control | Strong — answers cite passages | Moderate — style improves, facts may not |
| Best for | Domain knowledge, live data, Q&A over docs | Style, tone, structured output format |
The Three-Step RAG Loop
1. Index: Parse your documents into chunks, embed each chunk as a dense vector, and store the vectors in a vector database. This happens offline, before any user query arrives.
2. Retrieve: When a user asks a question, embed the question and search the vector store for the most semantically similar chunks. Power RAG also runs a parallel keyword search and merges the two result sets with Reciprocal Rank Fusion.
3. Generate: Inject the retrieved chunks as context into the LLM prompt. The LLM reads the context, answers the question, and cites the sources, so the answer is grounded in your actual documents.
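The Reciprocal Rank Fusion merge in the Retrieve step can be sketched in a few lines. Each result list contributes `1 / (k + rank)` per document; chunks appearing in both lists accumulate the higher combined score. The constant `k = 60` is the value from the original RRF paper; Power RAG's actual constant may differ:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfMerge {
    // Fuse a dense (vector) ranking and a keyword (FTS) ranking.
    // rank is 0-based here, so the contribution is 1 / (k + rank + 1).
    static Map<String, Double> fuse(List<String> denseHits, List<String> keywordHits, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : List.of(denseHits, keywordHits)) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Double> fused = fuse(
                List.of("chunkA", "chunkB"),   // dense results, best first
                List.of("chunkB", "chunkC"),   // keyword results, best first
                60);
        // chunkB appears in both rankings, so it ends up on top
        System.out.println(fused);
    }
}
```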
The Power RAG 9-Stage Pipeline
Power RAG extends the basic three-step loop with safety, caching, and observability layers. The full pipeline has nine stages:
User Question
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 0: Input Guardrail │
│ Gemini 2.5 Flash checks for unsafe content │
│ BLOCK ──► 403 response + flag logged │
└──────────────────────────┬──────────────────────────────┘
│ PASS
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Semantic Cache Lookup │
│ Redis vector search (threshold 0.92) │
│ HIT ──► return cached answer immediately │
└──────────────────────────┬──────────────────────────────┘
│ MISS
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Hybrid Retrieval │
│ Dense search (Qdrant) + FTS (PostgreSQL) → RRF merge │
└──────────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 3: Confidence Scoring │
│ RRF score average → 0.0–1.0 │
│ <0.1 → use general knowledge, skip citations │
└──────────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 4: Context Assembly │
│ Format chunks as [SOURCE N] blocks (24k char cap) │
└──────────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 5: LLM Call │
│ Resolve model → build prompt → stream/call → answer │
└──────────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 6: Output Guardrail │
│ Regex PII detection → redact if email/SSN/CC found │
└──────────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 7: Cache Store │
│ Store answer in Redis with 24h TTL │
└──────────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 8: Audit Log │
│ Persist interaction to PostgreSQL interactions table │
└──────────────────────────┬──────────────────────────────┘
│
▼
Answer + Sources
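As a flavor of Stage 4, context assembly is essentially a loop that stops at the character cap. A minimal sketch, assuming chunks are already ranked best-first (names are illustrative, not the RagService.java API):

```java
import java.util.List;

public class ContextAssembler {
    static final int MAX_CHARS = 24_000; // the 24k char cap from Stage 4

    // Turn ranked chunks into numbered [SOURCE N] blocks, stopping before
    // the block that would push the context past the cap.
    static String assemble(List<String> chunks) {
        StringBuilder ctx = new StringBuilder();
        int n = 1;
        for (String chunk : chunks) {
            String block = "[SOURCE " + n + "]\n" + chunk + "\n\n";
            if (ctx.length() + block.length() > MAX_CHARS) break;
            ctx.append(block);
            n++;
        }
        return ctx.toString();
    }

    public static void main(String[] args) {
        System.out.println(assemble(List.of("First passage.", "Second passage.")));
    }
}
```

Because chunks arrive best-first from the RRF merge, truncating at the cap drops the weakest evidence, not the strongest.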
Each stage is explored in depth in subsequent topics. The full implementation lives in RagService.java — see Topic 17: The Full RAG Pipeline for a stage-by-stage walkthrough.
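As a preview, Stage 6's regex-based PII redaction boils down to a couple of pattern replacements. The email and SSN patterns below are simplified illustrations; the real patterns in RagService.java may be stricter:

```java
import java.util.regex.Pattern;

public class OutputGuardrail {
    // Deliberately simple patterns for demonstration purposes.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern SSN =
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // Replace any detected email or SSN before the answer leaves the pipeline.
    static String redact(String answer) {
        String out = EMAIL.matcher(answer).replaceAll("[REDACTED]");
        return SSN.matcher(out).replaceAll("[REDACTED]");
    }

    public static void main(String[] args) {
        System.out.println(redact("Contact jane@example.com, SSN 123-45-6789."));
    }
}
```

Running the guardrail after generation (rather than before) catches PII that the LLM copies out of retrieved documents, not just PII in the user's question.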