What is RAG?

Module 1 · ~8 min read
Core idea: RAG lets you give an LLM access to your own documents without retraining it. The model stays the same — you simply augment each query with retrieved knowledge.

Definition

Retrieval-Augmented Generation (RAG) is an architecture that combines a knowledge retrieval system with a Large Language Model. Instead of relying solely on what the LLM learned during training, RAG first searches a curated knowledge base for relevant passages, then injects those passages into the LLM prompt as context.

The result: answers grounded in real, up-to-date, domain-specific content — with citations back to the source documents.

The Problem RAG Solves

LLMs have three fundamental limitations when used alone:

- Stale knowledge: the model only knows what it saw during training, so anything published after its cutoff is invisible to it.
- No access to private data: your internal documents, wikis, and tickets were never part of the training corpus.
- Hallucination: without grounding, the model can produce fluent but fabricated answers, with no way to verify them.

RAG addresses all three by putting the relevant text directly in the prompt at query time.
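To make "putting the text in the prompt" concrete, here is a minimal sketch. The class and method names are illustrative, not Power RAG's actual code:

```java
import java.util.List;

public class AugmentedPrompt {
    // Illustrative sketch: prepend retrieved passages to the user's question
    // so the model answers from supplied context instead of its weights alone.
    static String build(String question, List<String> passages) {
        StringBuilder prompt = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (String passage : passages) {
            prompt.append("- ").append(passage).append("\n");
        }
        prompt.append("\nQuestion: ").append(question);
        return prompt.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("What is our refund window?",
                List.of("Refunds are accepted within 30 days of purchase.")));
    }
}
```

The model never changes; only the prompt does, which is why new documents become answerable the moment they are indexed.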

RAG vs Fine-Tuning

Dimension              | RAG                                         | Fine-Tuning
Knowledge update speed | Instant — add documents to the store        | Hours/days — retrain the model
Cost                   | Storage + inference                         | GPU training + inference
Interpretability       | High — sources are visible                  | Low — knowledge baked into weights
Hallucination control  | Strong — answers cite passages              | Moderate — style improves, facts may not
Best for               | Domain knowledge, live data, Q&A over docs  | Style, tone, structured output format

The Three-Step RAG Loop

1. Index: Parse your documents into chunks, embed each chunk as a dense vector, and store the vectors in a vector database. This happens offline, before any user query arrives.
2. Retrieve: When a user asks a question, embed the question and search the vector store for the most semantically similar chunks. Power RAG also runs a parallel keyword search and merges the two result sets with Reciprocal Rank Fusion.
3. Generate: Inject the retrieved chunks as context into the LLM prompt. The LLM reads the context, answers the question, and cites the sources. The answer is grounded in your actual documents.
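The merge in step 2 can be sketched as follows. This is a generic Reciprocal Rank Fusion implementation with the conventional constant k = 60, not Power RAG's exact code:

```java
import java.util.*;

public class RrfMerge {
    // Reciprocal Rank Fusion: each result list contributes 1/(k + rank) per
    // document, so items ranked well by BOTH searches float to the top.
    static List<String> merge(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                // rank is 0-based here; use rank + 1 as the 1-based position
                scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        List<String> dense = List.of("chunkA", "chunkB", "chunkC");
        List<String> keyword = List.of("chunkB", "chunkD", "chunkA");
        // chunkB and chunkA appear in both lists, so they outrank chunkC/chunkD
        System.out.println(merge(List.of(dense, keyword), 60));
    }
}
```

Because RRF works on ranks rather than raw scores, it needs no calibration between the dense and keyword scorers, which use incomparable scales.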

The Power RAG 9-Stage Pipeline

Power RAG extends the basic three-step loop with safety, caching, and observability layers. The full pipeline has nine stages:

                     User Question
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 0: Input Guardrail                                │
│ Gemini 2.5 Flash checks for unsafe content              │
│ BLOCK ──► 403 response + flag logged                    │
└──────────────────────────┬──────────────────────────────┘
                           │ PASS
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Semantic Cache Lookup                          │
│ Redis vector search (threshold 0.92)                    │
│ HIT ──► return cached answer immediately                │
└──────────────────────────┬──────────────────────────────┘
                           │ MISS
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Hybrid Retrieval                               │
│ Dense search (Qdrant) + FTS (PostgreSQL) → RRF merge    │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 3: Confidence Scoring                             │
│ RRF score average → 0.0–1.0                             │
│ <0.1 → use general knowledge, skip citations            │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 4: Context Assembly                               │
│ Format chunks as [SOURCE N] blocks (24k char cap)       │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 5: LLM Call                                       │
│ Resolve model → build prompt → stream/call → answer     │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 6: Output Guardrail                               │
│ Regex PII detection → redact if email/SSN/CC found      │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 7: Cache Store                                    │
│ Store answer in Redis with 24h TTL                      │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 8: Audit Log                                      │
│ Persist interaction to PostgreSQL interactions table    │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
                    Answer + Sources
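Stage 4's [SOURCE N] formatting might look roughly like this. It is a sketch assuming a simple greedy fill under the 24k character cap named above, not the actual RagService.java code:

```java
import java.util.List;

public class ContextAssembly {
    // Sketch of Stage 4: format retrieved chunks as [SOURCE N] blocks,
    // stopping before the assembled context exceeds the character cap.
    static String assemble(List<String> chunks, int maxChars) {
        StringBuilder context = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            String block = "[SOURCE " + (i + 1) + "]\n" + chunks.get(i) + "\n\n";
            if (context.length() + block.length() > maxChars) break; // cap reached
            context.append(block);
        }
        return context.toString();
    }

    public static void main(String[] args) {
        // 24_000 mirrors the cap named in the pipeline diagram
        System.out.println(assemble(List.of("First passage.", "Second passage."), 24_000));
    }
}
```

Numbering the blocks is what lets the LLM cite "[SOURCE 2]" in its answer and have the citation resolve back to a concrete document chunk.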

Each stage is explored in depth in subsequent topics. The full implementation lives in RagService.java — see Topic 17: The Full RAG Pipeline for a stage-by-stage walkthrough.
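As a closing taste of Stage 6, a regex-based output guardrail could be sketched as below. The patterns are deliberately simplified stand-ins, not the production rules:

```java
import java.util.regex.Pattern;

public class OutputGuardrail {
    // Sketch of Stage 6: redact common PII shapes before the answer leaves
    // the pipeline. Each pattern is an illustrative simplification.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern SSN   = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern CC    = Pattern.compile("\\b(?:\\d{4}[ -]?){3}\\d{4}\\b");

    static String redact(String answer) {
        String out = EMAIL.matcher(answer).replaceAll("[REDACTED_EMAIL]");
        out = SSN.matcher(out).replaceAll("[REDACTED_SSN]");
        return CC.matcher(out).replaceAll("[REDACTED_CC]");
    }

    public static void main(String[] args) {
        System.out.println(redact("Contact jane@example.com, SSN 123-45-6789."));
    }
}
```

Running the check on the model's output rather than its input matters: even with clean retrieval, the LLM can echo PII that was present in the source documents themselves.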