What is RAG?

Module 1 · ~8 min read
Core idea: RAG lets you give an LLM access to your own documents without retraining it. The model stays the same — you simply augment each query with retrieved knowledge.

Definition

Retrieval-Augmented Generation (RAG) is an architecture that combines a knowledge retrieval system with a Large Language Model. Instead of relying solely on what the LLM learned during training, RAG first searches a curated knowledge base for relevant passages, then injects those passages into the LLM prompt as context.

The result: answers grounded in real, up-to-date, domain-specific content — with citations back to the source documents.

The Problem RAG Solves

LLMs have three fundamental limitations when used alone:

- Stale knowledge: the model only knows what it saw during training, so anything published after its cutoff is invisible to it.
- No access to private data: your internal documents, wikis, and tickets were never part of the training corpus.
- Hallucination: without grounding, the model can produce fluent but fabricated answers, with no way to verify them.

RAG addresses all three by putting the relevant text directly in the prompt at query time.
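To make "putting the text in the prompt" concrete, here is a minimal sketch. The class and method names are illustrative, not Power RAG's actual code:

```java
import java.util.List;

public class AugmentedPrompt {
    // Illustrative sketch: prepend retrieved passages to the user's question
    // so the model answers from supplied context instead of its weights alone.
    static String build(String question, List<String> passages) {
        StringBuilder prompt = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (String passage : passages) {
            prompt.append("- ").append(passage).append("\n");
        }
        prompt.append("\nQuestion: ").append(question);
        return prompt.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("What is our refund window?",
                List.of("Refunds are accepted within 30 days of purchase.")));
    }
}
```

The model never changes; only the prompt does, which is why new documents become answerable the moment they are indexed.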

RAG vs Fine-Tuning

Dimension              | RAG                                         | Fine-Tuning
Knowledge update speed | Instant — add documents to the store        | Hours/days — retrain the model
Cost                   | Storage + inference                         | GPU training + inference
Interpretability       | High — sources are visible                  | Low — knowledge baked into weights
Hallucination control  | Strong — answers cite passages              | Moderate — style improves, facts may not
Best for               | Domain knowledge, live data, Q&A over docs  | Style, tone, structured output format

The Three-Step RAG Loop

1. Index: Parse your documents into chunks, embed each chunk as a dense vector, and store the vectors in a vector database. This happens offline, before any user query arrives.
2. Retrieve: When a user asks a question, embed the question and search the vector store for the most semantically similar chunks. Power RAG also runs a parallel keyword search and merges the two result sets with Reciprocal Rank Fusion.
3. Generate: Inject the retrieved chunks as context into the LLM prompt. The LLM reads the context, answers the question, and cites the sources. The answer is grounded in your actual documents.
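The merge in step 2 can be sketched as follows. This is a generic Reciprocal Rank Fusion implementation with the conventional constant k = 60, not Power RAG's exact code:

```java
import java.util.*;

public class RrfMerge {
    // Reciprocal Rank Fusion: each result list contributes 1/(k + rank) per
    // document, so items ranked well by BOTH searches float to the top.
    static List<String> merge(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                // rank is 0-based here; use rank + 1 as the 1-based position
                scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        List<String> dense = List.of("chunkA", "chunkB", "chunkC");
        List<String> keyword = List.of("chunkB", "chunkD", "chunkA");
        // chunkB and chunkA appear in both lists, so they outrank chunkC/chunkD
        System.out.println(merge(List.of(dense, keyword), 60));
    }
}
```

Because RRF works on ranks rather than raw scores, it needs no calibration between the dense and keyword scorers, which use incomparable scales.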

The Power RAG 9-Stage Pipeline

Power RAG extends the basic three-step loop with safety, caching, and observability layers. The full pipeline has nine stages:

                     User Question
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 0: Input Guardrail                                │
│ Gemini 2.5 Flash checks for unsafe content              │
│ BLOCK ──► 403 response + flag logged                    │
└──────────────────────────┬──────────────────────────────┘
                           │ PASS
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Semantic Cache Lookup                          │
│ Redis vector search (threshold 0.92)                    │
│ HIT ──► return cached answer immediately                │
└──────────────────────────┬──────────────────────────────┘
                           │ MISS
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Hybrid Retrieval                               │
│ Dense search (Qdrant) + FTS (PostgreSQL) → RRF merge    │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 3: Confidence Scoring                             │
│ RRF score average → 0.0–1.0                             │
│ <0.1 → use general knowledge, skip citations            │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 4: Context Assembly                               │
│ Format chunks as [SOURCE N] blocks (24k char cap)       │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 5: LLM Call                                       │
│ Resolve model → build prompt → stream/call → answer     │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 6: Output Guardrail                               │
│ Regex PII detection → redact if email/SSN/CC found      │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 7: Cache Store                                    │
│ Store answer in Redis with 24h TTL                      │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 8: Audit Log                                      │
│ Persist interaction to PostgreSQL interactions table    │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
                    Answer + Sources
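Stage 4's [SOURCE N] formatting might look roughly like this. It is a sketch assuming a simple greedy fill under the 24k character cap named above, not the actual RagService.java code:

```java
import java.util.List;

public class ContextAssembly {
    // Sketch of Stage 4: format retrieved chunks as [SOURCE N] blocks,
    // stopping before the assembled context exceeds the character cap.
    static String assemble(List<String> chunks, int maxChars) {
        StringBuilder context = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            String block = "[SOURCE " + (i + 1) + "]\n" + chunks.get(i) + "\n\n";
            if (context.length() + block.length() > maxChars) break; // cap reached
            context.append(block);
        }
        return context.toString();
    }

    public static void main(String[] args) {
        // 24_000 mirrors the cap named in the pipeline diagram
        System.out.println(assemble(List.of("First passage.", "Second passage."), 24_000));
    }
}
```

Numbering the blocks is what lets the LLM cite "[SOURCE 2]" in its answer and have the citation resolve back to a concrete document chunk.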

Each stage is explored in depth in subsequent topics. The full implementation lives in RagService.java — see Topic 17: The Full RAG Pipeline for a stage-by-stage walkthrough.
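As a closing taste of Stage 6, a regex-based output guardrail could be sketched as below. The patterns are deliberately simplified stand-ins, not the production rules:

```java
import java.util.regex.Pattern;

public class OutputGuardrail {
    // Sketch of Stage 6: redact common PII shapes before the answer leaves
    // the pipeline. Each pattern is an illustrative simplification.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern SSN   = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern CC    = Pattern.compile("\\b(?:\\d{4}[ -]?){3}\\d{4}\\b");

    static String redact(String answer) {
        String out = EMAIL.matcher(answer).replaceAll("[REDACTED_EMAIL]");
        out = SSN.matcher(out).replaceAll("[REDACTED_SSN]");
        return CC.matcher(out).replaceAll("[REDACTED_CC]");
    }

    public static void main(String[] args) {
        System.out.println(redact("Contact jane@example.com, SSN 123-45-6789."));
    }
}
```

Running the check on the model's output rather than its input matters: even with clean retrieval, the LLM can echo PII that was present in the source documents themselves.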