The Full RAG Pipeline

Module 5 · ~15 min read

This topic is a stage-by-stage walkthrough of RagService.query() — the entry point for every chat request. Every stage is covered in execution order, from 0 through 8 plus sub-stage 1.5.

RagService.java — full pipeline

Stage 0: Input Guardrail

Input safety check via Gemini 2.5 Flash
GuardrailService.checkInput(question) sends the user's message to Google Gemini (gemini-2.5-flash by default). The model classifies the text as safe or unsafe; categories are parsed from the model reply.

BLOCK path: If the input is flagged, the service returns a rejection response and logs a record to the guardrail_flags table. The request never reaches retrieval or the LLM.

PASS path: The classification returns "safe" and execution continues.
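
The branch can be sketched as below. GuardrailResult, the method shapes, and the placeholder keyword rule are all illustrative; the real checkInput is an LLM call to Gemini that returns a safety classification, not a string match.

```java
public class GuardrailSketch {
    // Illustrative result type; the real service parses categories
    // out of the Gemini reply.
    public record GuardrailResult(boolean safe, String category) {}

    // Placeholder stand-in for GuardrailService.checkInput: the real
    // implementation sends the text to gemini-2.5-flash for classification.
    static GuardrailResult checkInput(String question) {
        boolean unsafe = question.toLowerCase().contains("attack");
        return new GuardrailResult(!unsafe, unsafe ? "violence" : "safe");
    }

    static String handle(String question) {
        GuardrailResult result = checkInput(question);
        if (!result.safe()) {
            // BLOCK path: log to guardrail_flags, return a rejection,
            // never touch retrieval or the LLM.
            return "Sorry, I can't help with that. (" + result.category() + ")";
        }
        // PASS path: continue to stage 1.
        return "PASS";
    }
}
```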

Stage 1: Semantic Cache Lookup

Redis vector cache lookup
semanticCache.lookup(question, lang) embeds the question and searches the Redis vector index for a cached answer with cosine similarity ≥ 0.92.

HIT path: Return the cached answer, sources, and confidence immediately. A cache hit skips stages 1.5 through 8. Typical savings: 2–10 seconds of LLM latency.

MISS path: Continue to stage 1.5.
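
The hit/miss decision reduces to a cosine-similarity threshold on the question embeddings. A minimal sketch (Redis computes this inside the vector index; the method names here are not the real SemanticCache API):

```java
public class CosineLookup {
    // Similarity threshold from the cache description above.
    static final double THRESHOLD = 0.92;

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // A cached entry counts as a HIT only at similarity >= 0.92.
    static boolean isHit(double[] queryEmbedding, double[] cachedEmbedding) {
        return cosine(queryEmbedding, cachedEmbedding) >= THRESHOLD;
    }
}
```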

Stage 1.5: Image Generation Detection

Detect image generation intent
imageGenerationService.isImageGenerationRequest(question) checks for generation verb + image noun pairs ("generate a picture of...").

Detected: Calls Imagen 3 (or Gemini Flash fallback), returns the image as a base64 string in generatedImageBase64. Pipeline exits after image generation — no RAG retrieval needed.

Not detected: Continue to stage 2.
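
A verb + noun pair check of this kind can be approximated with a single regex. The word lists below are assumptions for illustration; the real detector's vocabulary may differ.

```java
import java.util.regex.Pattern;

public class ImageIntent {
    // Hypothetical pattern: a generation verb followed somewhere
    // later in the sentence by an image noun.
    static final Pattern INTENT = Pattern.compile(
        "\\b(generate|create|draw|make)\\b.*\\b(image|picture|photo|illustration)\\b",
        Pattern.CASE_INSENSITIVE);

    static boolean isImageGenerationRequest(String question) {
        return INTENT.matcher(question).find();
    }
}
```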

Stage 2: Hybrid Retrieval

Dense + keyword search → RRF merge
HybridRetriever.retrieve(question) runs two searches in parallel:
• Dense: vectorStore.similaritySearch(topK*2) via Qdrant
• Keyword: documentChunkRepository.fullTextSearch(topK*2) via PostgreSQL FTS

Results are merged with Reciprocal Rank Fusion (k=60) and capped at topK (default: 10).
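
Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over the result lists it appears in. A self-contained sketch of the merge step (the real HybridRetriever works on chunk objects, not strings):

```java
import java.util.*;

public class RrfMerge {
    static final int K = 60;

    // score(d) = sum over lists of 1 / (K + rank_d), ranks starting at 1.
    static Map<String, Double> fuse(List<String> dense, List<String> keyword) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : List.of(dense, keyword)) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (K + i + 1), Double::sum);
            }
        }
        return scores;
    }

    // Sort by fused score, descending, and cap at topK.
    static List<String> topK(List<String> dense, List<String> keyword, int topK) {
        return fuse(dense, keyword).entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(Map.Entry::getKey)
            .toList();
    }
}
```

With k = 60, a document that appears in both lists almost always outranks one that appears in only one, which is exactly the behaviour hybrid retrieval wants.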

Stage 3: Confidence Scoring

Assess retrieval relevance
ConfidenceScorer.score(chunks) averages the RRF scores of the top chunks and normalises to 0–1.

If confidence < 0.1 or chunks are empty: hasRelevantDocs = false. The prompt will tell the LLM to answer from general knowledge without citing sources.
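
The exact normalisation isn't shown in this topic, so the sketch below invents a plausible one: it treats a chunk ranked first in both result lists (RRF score 2/(60+1)) as the ceiling and clamps the average to [0, 1]. MAX_RRF and the formula are assumptions, not the real ConfidenceScorer.

```java
import java.util.List;

public class ConfidenceSketch {
    // Assumed ceiling: a chunk ranked #1 in both the dense and
    // keyword lists scores 1/61 + 1/61 = 2/61 under RRF with k=60.
    static final double MAX_RRF = 2.0 / 61.0;

    // Average the chunks' RRF scores and normalise to 0..1.
    static double score(List<Double> rrfScores) {
        if (rrfScores.isEmpty()) return 0.0;
        double avg = rrfScores.stream().mapToDouble(Double::doubleValue)
                              .average().orElse(0.0);
        return Math.min(1.0, avg / MAX_RRF);
    }

    // Mirrors hasRelevantDocs: empty retrieval or confidence below 0.1
    // means the LLM answers from general knowledge, without citations.
    static boolean hasRelevantDocs(List<Double> rrfScores) {
        return !rrfScores.isEmpty() && score(rrfScores) >= 0.1;
    }
}
```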

Stage 4: Context Assembly

Format chunks as [SOURCE N] blocks
ContextAssembler.assemble(chunks) formats each retrieved chunk as a [SOURCE N] filename § section header line followed by the chunk text and a blank line, with the assembled context capped at 24,000 characters.

ContextAssembler.extractSources(chunks) builds the List<SourceRef> for the response.
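
A minimal sketch of the assembly step, assuming a simple Chunk shape (the real chunk type and the exact truncation behaviour at the 24,000-character cap may differ):

```java
import java.util.List;

public class ContextSketch {
    static final int MAX_CHARS = 24_000;

    // Assumed chunk shape for illustration.
    record Chunk(String filename, String section, String text) {}

    static String assemble(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        int n = 1;
        for (Chunk c : chunks) {
            String block = "[SOURCE " + n + "] " + c.filename() + " § " + c.section()
                + "\n" + c.text() + "\n\n";
            // Stop once the next block would exceed the context cap.
            if (sb.length() + block.length() > MAX_CHARS) break;
            sb.append(block);
            n++;
        }
        return sb.toString();
    }
}
```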

Stage 5: LLM Call

Build prompt → call LLM → receive answer
MultilingualPromptBuilder.buildUserMessage() assembles the final user message: image instruction (if any) + context block + question + language instruction.

resolveClient(provider, modelId) selects the correct ChatClient bean. Model-specific options are injected if needed.

baseSpec.user(userMessage).call().content() executes the LLM call and returns the raw answer text.
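
A rough sketch of the message-assembly order described above. The section labels ("Context:", "Question:") are assumptions; the real MultilingualPromptBuilder may format each part differently.

```java
public class PromptSketch {
    // Concatenate the four parts in pipeline order:
    // image instruction (optional) + context block + question + language instruction.
    static String buildUserMessage(String imageInstruction, String context,
                                   String question, String langInstruction) {
        StringBuilder sb = new StringBuilder();
        if (imageInstruction != null) sb.append(imageInstruction).append("\n\n");
        if (!context.isEmpty()) sb.append("Context:\n").append(context).append("\n\n");
        sb.append("Question: ").append(question).append("\n");
        sb.append(langInstruction);
        return sb.toString();
    }
}
```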

Stage 6: Output Guardrail

PII detection and redaction
GuardrailService.checkOutput(answer) applies regex patterns for email addresses, SSNs, and credit card numbers.

If PII is detected: the PII is redacted from the answer and a flag is logged. The redacted answer is returned to the user.
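
The redaction pass can be sketched with one regex per category. These patterns are approximations of the three categories named above, not the service's actual expressions:

```java
import java.util.regex.Pattern;

public class PiiRedactor {
    // Approximate patterns for the three PII categories.
    static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    static final Pattern CARD = Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");

    // Replace each match with a labelled placeholder; the redacted
    // answer (not the original) is what reaches the user.
    static String redact(String answer) {
        String out = EMAIL.matcher(answer).replaceAll("[REDACTED EMAIL]");
        out = SSN.matcher(out).replaceAll("[REDACTED SSN]");
        out = CARD.matcher(out).replaceAll("[REDACTED CARD]");
        return out;
    }
}
```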

Stage 7: Cache Store

Store answer in Redis semantic cache
semanticCache.store(question, lang, answer, confidence, sources, modelId) embeds the question and stores the answer with a 24-hour TTL in the Redis vector index.

Future queries with semantically similar questions (≥ 0.92 similarity) will hit this cache entry.
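
Redis enforces the 24-hour expiry natively via key TTLs; the in-memory sketch below only illustrates the store-then-expire semantics, and every name in it is hypothetical.

```java
import java.util.*;

public class TtlCacheSketch {
    static final long TTL_MILLIS = 24 * 60 * 60 * 1000L;

    record Entry(String answer, long expiresAt) {}

    final Map<String, Entry> entries = new HashMap<>();

    // Store an answer with a 24-hour expiry from "now".
    void store(String key, String answer, long now) {
        entries.put(key, new Entry(answer, now + TTL_MILLIS));
    }

    // Return the answer only while it is still within its TTL.
    Optional<String> lookup(String key, long now) {
        Entry e = entries.get(key);
        if (e == null || now >= e.expiresAt()) return Optional.empty();
        return Optional.of(e.answer());
    }
}
```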

Stage 8: Audit Log

Persist interaction to PostgreSQL
interactionRepository.save(interaction) writes a record to the interactions table containing: user ID, question, answer, model used, confidence, source count, language, duration, and timestamp.

This data powers usage analytics, quality monitoring, and compliance auditing.

Timing Summary

After stage 8, the total elapsed time is logged at INFO level:

Log output — successful RAG query
INFO  RAG query completed in 3241ms — provider=ANTHROPIC model=claude-sonnet-4-6
      confidence=0.72 sources=5 lang=en