Semantic Caching

Module 5 · ~10 min read

A traditional cache keys on exact string equality, so "What is RAG?" and "Can you explain RAG?" are two different keys: the second question misses even when the first answer is already cached. A semantic cache keys on meaning instead. If the embedding vectors of two questions have cosine similarity ≥ 0.92, the questions are treated as equivalent and the cached answer is returned. This raises the hit rate for paraphrased questions, which an exact-match cache can never serve.
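To make the comparison concrete, here is a minimal sketch of the cosine similarity check in plain Java. The class and method names are illustrative and not part of Power RAG. Only the 0.92 threshold comes from the text above.

```java
// Illustrative helper: cosine similarity between two embedding vectors,
// plus the ">= 0.92 means equivalent" rule described in the text.
final class CosineSimilarity {
    static final double THRESHOLD = 0.92; // value taken from the text

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a zero vector has no direction; treat it as "not similar"
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static boolean isEquivalent(float[] a, float[] b) {
        return cosine(a, b) >= THRESHOLD;
    }
}
```

Because cosine similarity depends only on the angle between the vectors, scaling either vector leaves the result unchanged.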

The SemanticCache Interface

SemanticCache.java

public interface SemanticCache {
    Optional<CacheHit> lookup(String query, String language);
    void store(String query, String language, String answer,
               double confidence, List<SourceRef> sources, String modelId);
}
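A caller would typically wrap the expensive pipeline in a check-then-store pattern around this interface. The sketch below is one possible shape, not the project's actual service. The fields of CacheHit and SourceRef, the CachedAnswerService class, and the placeholder confidence and model id are all assumptions made for illustration, since the source does not show them.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical shapes: the source does not show the fields of these types.
record SourceRef(String documentId) {}
record CacheHit(String answer, double confidence, List<SourceRef> sources, String modelId) {}

// Interface as given in the text.
interface SemanticCache {
    Optional<CacheHit> lookup(String query, String language);
    void store(String query, String language, String answer,
               double confidence, List<SourceRef> sources, String modelId);
}

// Hypothetical caller: consult the cache first, run the pipeline only on a miss.
final class CachedAnswerService {
    private final SemanticCache cache;
    // Stand-in for the full pipeline (retrieval, LLM call, guardrails).
    private final Function<String, String> pipeline;

    CachedAnswerService(SemanticCache cache, Function<String, String> pipeline) {
        this.cache = cache;
        this.pipeline = pipeline;
    }

    String answer(String query, String language) {
        Optional<CacheHit> hit = cache.lookup(query, language);
        if (hit.isPresent()) {
            return hit.get().answer(); // HIT: the pipeline is skipped
        }
        String fresh = pipeline.apply(query); // MISS: run the full pipeline
        // Placeholder confidence, sources and model id, for illustration only.
        cache.store(query, language, fresh, 1.0, List.of(), "example-model");
        return fresh;
    }
}
```

The second call with an equivalent query in the same language returns the stored answer without invoking the pipeline again.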

How Redis Vector Search Works

Power RAG uses Redis Stack 7.x with the RedisVectorStore from Spring AI. A lookup proceeds in four steps:

Embed the query

The shared EmbeddingModel (gemini-embedding-001, 768 dimensions) converts the query string into a float vector.

Search the Redis index

Search the powerrag:cache:{lang} Redis index for the nearest neighbour vector using cosine similarity.

Threshold check

If the nearest neighbour has cosine similarity ≥ 0.92 → return the cached answer as a CacheHit. Otherwise → miss, return Optional.empty().

Return or continue

On a HIT, the full pipeline (retrieval, LLM call, guardrails) is bypassed entirely. Typical latency is under 50 ms, compared with 2–10 s for a full pipeline call.
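The four steps above can be sketched without Redis or Spring AI. In this simplified, in-memory version, a Function stands in for the EmbeddingModel and a linear scan over a list stands in for the Redis index. All names are hypothetical, and a real vector index would not scan every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// In-memory stand-in for the embed -> search -> threshold -> return flow.
final class InMemorySemanticLookup {
    record Entry(float[] vector, String answer) {}

    private final Function<String, float[]> embedder; // stand-in for the embedding model
    private final double threshold;                   // e.g. 0.92
    private final List<Entry> entries = new ArrayList<>();

    InMemorySemanticLookup(Function<String, float[]> embedder, double threshold) {
        this.embedder = embedder;
        this.threshold = threshold;
    }

    void store(String query, String answer) {
        entries.add(new Entry(embedder.apply(query), answer));
    }

    Optional<String> lookup(String query) {
        float[] q = embedder.apply(query);            // 1. embed the query
        Entry best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Entry e : entries) {                     // 2. find the nearest neighbour
            double score = cosine(q, e.vector());
            if (score > bestScore) {
                bestScore = score;
                best = e;
            }
        }
        if (best != null && bestScore >= threshold) { // 3. threshold check
            return Optional.of(best.answer());        // 4. HIT: return the cached answer
        }
        return Optional.empty();                      // MISS: caller continues with the pipeline
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Only the single nearest neighbour is compared against the threshold, so a lookup returns at most one cached answer.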

Threshold Choice: 0.92

The 0.92 threshold is deliberately high. Consider these examples:

"Who is the CEO?" and "Who is the Chief Executive Officer?" → cosine similarity ~0.97 → HIT (same meaning)
"Who is the CEO?" and "What year was the company founded?" → cosine similarity ~0.61 → MISS (different topic)
"What is our leave policy?" and "How many days of annual leave do I get?" → ~0.94 → HIT

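A threshold below 0.90 would risk serving a cached answer to a subtly different question, which could mislead users. In the extreme case, a threshold of 0.60 would turn the second pair above (~0.61) into a HIT and answer a question about the founding year with the name of the CEO.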

Language Scoping

The Redis index is scoped by language: powerrag:cache:en, powerrag:cache:fr, etc. An English query will never hit a French cache entry, even if the two questions are semantically equivalent, because the cached answer would be in the wrong language.
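The naming scheme can be expressed as a small helper. This is a sketch: the class name is hypothetical, and trimming and lower-casing the language code is an assumption rather than documented Power RAG behaviour.

```java
import java.util.Locale;

// Hypothetical helper that builds the per-language index name powerrag:cache:{lang}.
final class CacheIndexNames {
    private static final String PREFIX = "powerrag:cache:";

    static String forLanguage(String language) {
        if (language == null || language.isBlank()) {
            throw new IllegalArgumentException("language code is required");
        }
        // Assumption: normalise so that "EN" and "en" share one index.
        return PREFIX + language.trim().toLowerCase(Locale.ROOT);
    }
}
```

Because the language is part of the index name, queries in different languages search disjoint sets of vectors and never need a language filter at query time.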

TTL: 24 Hours

Cached answers expire after 24 hours, so stale answers (from outdated documents) cannot persist indefinitely. If you update a document and re-ingest it, cached answers based on the old version remain in the cache until they expire, at most 24 hours after they were stored.
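Redis can enforce expiry itself through key TTLs, and the source does not show how Power RAG configures it. The sketch below only illustrates the 24-hour rule as plain arithmetic. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative check for the 24-hour rule; not the project's actual expiry mechanism.
final class CacheTtl {
    static final Duration TTL = Duration.ofHours(24); // value taken from the text

    // An entry is expired once 24 hours or more have passed since it was stored.
    static boolean isExpired(Instant storedAt, Instant now) {
        return !now.isBefore(storedAt.plus(TTL));
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the method, keeps the check deterministic in tests.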

NoOpSemanticCacheService for Tests

NoOpSemanticCacheService.java

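@Profile("test")
@Component
public class NoOpSemanticCacheService implements SemanticCache {
    // Always a miss, so tests exercise the code path that follows the cache.
    @Override public Optional<CacheHit> lookup(String q, String l) { return Optional.empty(); }
    // Same parameter list as SemanticCache.store; the arguments are ignored.
    @Override public void store(String query, String language, String answer,
                                double confidence, List<SourceRef> sources, String modelId) { /* no-op */ }
}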

The @Profile("test") annotation registers this bean only when the test Spring profile is active. Unit and integration tests therefore do not need a running Redis instance: the no-op implementation always reports a cache miss and silently discards stores.
