Intent Routing
Without intent routing, every chat message triggers a full knowledge-base retrieval, whether or not the user is asking about their uploaded documents. A user asking "What time is it in Tokyo?" does not need Qdrant results; they need a time-zone lookup. Retrieving irrelevant chunks adds latency and can pollute the context window with noise that confuses the model.
Intent routing adds a fast decision step before retrieval: a small LLM call (or a pattern-matching fallback) classifies each question into a `QueryIntent` that tells the pipeline exactly which resources to activate.
The QueryIntent Record
The outcome of classification is a simple Java record with two boolean fields:
```java
public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {

    /** When intent routing is disabled: always retrieve;
     *  MCP follows the global powerrag.mcp.rag-enabled flag. */
    public static QueryIntent legacy(boolean mcpRagEnabled) {
        return new QueryIntent(true, mcpRagEnabled);
    }
}
```
| Field | When `true` | When `false` |
|---|---|---|
| `retrieveDocuments` | Run Qdrant + PostgreSQL FTS hybrid retrieval | Skip retrieval entirely; use only LLM general knowledge (or MCP tools) |
| `allowMcpTools` | Attach MCP tool callbacks to this ChatClient call | No tools attached; model answers from context or KB only |
`allowMcpTools` in the intent is an additional gate on top of the global `powerrag.mcp.rag-enabled` flag. Tools are attached only when both the global flag is true AND the per-query intent allows them. This lets you enable MCP globally while the router selectively disables tools for simple KB-only queries.
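The interaction of the two gates can be sketched with a small standalone example (the record mirrors the one above; the helper method and flag values are illustrative assumptions, not project code):

```java
// Demonstrates how the per-query intent gates MCP tools on top of the
// global flag, and how legacy mode mirrors the pre-routing behaviour.
public class IntentGateDemo {

    public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {
        public static QueryIntent legacy(boolean mcpRagEnabled) {
            return new QueryIntent(true, mcpRagEnabled);
        }
    }

    /** Tools attach only when BOTH the global flag and the intent allow them. */
    static boolean attachMcpTools(boolean mcpRagEnabled, QueryIntent intent) {
        return mcpRagEnabled && intent.allowMcpTools();
    }

    public static void main(String[] args) {
        QueryIntent kbOnly = new QueryIntent(true, false);  // router: plain KB question
        QueryIntent live   = new QueryIntent(false, true);  // router: live-data question

        // Global MCP enabled: the router still suppresses tools for KB-only queries.
        System.out.println(attachMcpTools(true, kbOnly));   // false
        System.out.println(attachMcpTools(true, live));     // true
        // Global MCP disabled: tools never attach, whatever the intent says.
        System.out.println(attachMcpTools(false, live));    // false
        // Legacy mode: always retrieve, MCP follows the global flag.
        System.out.println(QueryIntent.legacy(true).retrieveDocuments()); // true
    }
}
```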
The Two Classification Paths
QueryIntentClassifier tries the LLM-based router first. If it fails (timeout, parse
error, API error), it falls back to pattern-matching heuristics. The fallback is deliberately
conservative: when in doubt, it retrieves documents and enables MCP tools.
```java
public QueryIntent classify(String question, String language,
                            ChatClient client, String modelProvider, String modelId) {
    if (question == null || question.isBlank()) {
        return fallback(question);
    }
    try {
        ChatClient.ChatClientRequestSpec spec = client.prompt()
                .system(ROUTER_SYSTEM)
                .user("language=" + (language != null ? language : "en") + "\n\n" + question);
        spec = applyRouterOptions(spec, modelProvider, modelId); // temp=0.0, max_tokens=350
        String raw = spec.call().content();
        return parseModelJson(raw, question);
    } catch (Exception e) {
        log.warn("Intent classification failed, using heuristic fallback: {}", e.getMessage());
        return fallback(question);
    }
}
```
The Router System Prompt
The classifier uses the same ChatClient as the main answer call, but overrides
temperature to 0.0 and caps output at 350 tokens. The model is only allowed to return a JSON object
— no markdown fences, no prose. This makes the response fast (<300 ms typically) and easy to parse.
```
You route questions for a RAG assistant with optional MCP tools (web fetch, time, weather,
Jira, GitHub code/repo, GCP Cloud Logging, mailbox IMAP/SMTP).

Reply with ONLY one JSON object (no markdown fences, no prose):
{"retrieveDocuments":true|false,"allowMcpTools":true|false}

retrieveDocuments — true if the user likely needs their uploaded knowledge base (policies,
internal docs, PDFs they ingested, "my documents", company-specific material).
false for pure general knowledge, small talk, standalone math/logic,
generic coding help with no doc context, or questions that clearly do not depend on private uploads.

allowMcpTools — true if any live tool may help: URL or fetch/read a webpage; current time or
timezone; weather or forecast; Jira/support tickets; GitHub (search code, read files in a repo);
Google Cloud Logging / Stackdriver; email/mailbox.
false when a static KB or general knowledge answer suffices.
Default false; do not enable for purely hypothetical or historical trivia with no live data need.

If unsure about retrieveDocuments, prefer true.
```
The final instruction — "if unsure about retrieveDocuments, prefer true" — is intentional. A false negative on retrieval (skipping the KB when it was needed) produces a worse answer than a false positive (retrieving when it wasn't needed). The cost of an unnecessary retrieval is a few milliseconds; the cost of missing KB context is a hallucinated answer.
Heuristic Fallback Patterns
When the LLM router fails or is disabled, a set of compiled regular expressions provides a fast fallback. The patterns cover the most common signals that live data is needed:
```java
// Any http/https URL in the question
private static final Pattern URL_PATTERN = Pattern.compile(
        "https?://[^\\s<>\"{}|\\\\^`\\[\\]]+", Pattern.CASE_INSENSITIVE);

// Jira issue keys (e.g. KAN-5), "jira", "atlassian", "support ticket"
private static final Pattern JIRA_LIVE_PATTERN = Pattern.compile(
        "(?i)\\b([a-z][a-z0-9]+-\\d+|\\bjira\\b|\\batlassian\\b|\\bsupport ticket\\b|\\bkan-\\d+)");

// Weather terms
private static final Pattern WEATHER_LIVE_PATTERN = Pattern.compile(
        "(?i)\\b(weather|forecast|temperature|rain|humidity|wind chill|feels like)\\b");

// Time and timezone queries
private static final Pattern TIME_LIVE_PATTERN = Pattern.compile(
        "(?i)\\b(what time|current time|time now|timezone|time in |utc\\b|gmt\\b|zulu time)\\b");

// GitHub: repo search, code search, PR/commit queries
private static final Pattern GITHUB_MCP_PATTERN = Pattern.compile(
        "(?i)\\b(github\\b|repo:\\s*\\S+/\\S+|code\\s+search|pull\\s+request|\\bcommit(s)?\\b)\\b");

// GCP / Stackdriver / Cloud Logging
private static final Pattern GCP_LOGGING_MCP_PATTERN = Pattern.compile(
        "(?i)\\b(gcp\\s+logs?|google\\s+cloud\\s+logging|stackdriver|gke\\s+logs)\\b");

// Email / inbox / mailbox
private static final Pattern EMAIL_MCP_PATTERN = Pattern.compile(
        "(?i)\\b(email|inbox|unread\\s+mail|mailbox|imap|gmail|summarize.*email|draft.*reply)\\b");
```
The fallback returns `retrieveDocuments=true` in all cases (a safe default) and `allowMcpTools=true` only when one of the patterns matches.
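A plausible shape for that fallback method, as a runnable sketch (not the project's actual code; the pattern list here is abbreviated to a few representatives of the regexes above):

```java
import java.util.List;
import java.util.regex.Pattern;

// Hedged sketch of the conservative heuristic fallback: always retrieve,
// enable MCP tools only when a live-data pattern fires.
public class HeuristicFallback {

    public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {}

    private static final List<Pattern> LIVE_PATTERNS = List.of(
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE),     // URLs
            Pattern.compile("(?i)\\b(weather|forecast|temperature)\\b"),    // weather
            Pattern.compile("(?i)\\b(what time|current time|timezone)\\b"), // time
            Pattern.compile("(?i)\\b(jira|atlassian|support ticket)\\b"));  // Jira

    static QueryIntent fallback(String question) {
        if (question == null || question.isBlank()) {
            return new QueryIntent(true, false); // nothing to match against
        }
        boolean live = LIVE_PATTERNS.stream()
                .anyMatch(p -> p.matcher(question).find());
        // Safe default: always retrieve; tools only on a positive live-data signal.
        return new QueryIntent(true, live);
    }
}
```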
Integration in RagService
Intent routing runs immediately after the semantic cache check (to avoid unnecessary classification for cached responses) and before hybrid retrieval. Images in the request bypass routing since the main LLM must handle vision — tools would be a distraction.
```java
// ── 2. Intent routing (before retrieval & main LLM) ─────────────────
boolean imagePresent = imageBase64 != null && !imageBase64.isBlank();
QueryIntent intent;
if (!intentRoutingEnabled || imagePresent) {
    // Legacy mode: always retrieve; MCP follows global flag
    intent = QueryIntent.legacy(mcpRagEnabled);
} else {
    intent = intentClassifier.classify(question, lang, client, modelProvider, reqModelId);
}
boolean attachMcpTools = mcpRagEnabled && intent.allowMcpTools();
log.info("Query intent: retrieveDocuments={} attachMcpTools={} (routingActive={})",
        intent.retrieveDocuments(), attachMcpTools, intentRoutingEnabled && !imagePresent);

// ── 3. Retrieval (optional) ─────────────────────────────────────────
List<RetrievedChunk> chunks;
if (!intent.retrieveDocuments()) {
    chunks = List.of();
    log.info("Skipping knowledge-base retrieval (intent.retrieveDocuments=false)");
} else {
    chunks = retriever.retrieve(question);
}
```
Configuration
```yaml
powerrag:
  rag:
    # One small LLM call to decide KB retrieval vs general knowledge,
    # and whether MCP tools (e.g. fetch) are needed.
    intent-routing-enabled: true
  mcp:
    # Attach MCP tools to the main RAG ChatClient when
    # spring.ai.mcp.client.enabled=true (see application-dev.yml).
    rag-enabled: false
```
Set `intent-routing-enabled: false` to fall back to the pre-MCP behaviour: always retrieve from the knowledge base. This is useful when you want to disable the extra LLM call for cost reasons, or when debugging retrieval quality without the routing layer in the way.
Example Routing Decisions
| User question | retrieveDocuments | allowMcpTools | Reasoning |
|---|---|---|---|
| "Summarise the EU AI Act risk categories" | true | false | Clearly refers to an uploaded document |
| "What time is it in Singapore?" | false | true | Live data needed; no document relevance |
| "Fetch https://example.com/data.json and summarise it" | false | true | URL detected — fetch_url tool required |
| "What are the open KAN tickets?" | false | true | Jira/support-ticket query; live Jira data needed |
| "Write a Python function to sort a list" | false | false | General coding help; no docs or live data needed |
| "According to my documents, what is our GDPR policy on retention?" | true | false | "my documents" phrase triggers KB retrieval |
| "Search the repo for the RagService class" | false | true | Repository code search; GitHub tool needed |
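Note that some of these rows rely on the LLM's semantic judgment rather than the fallback regexes: "open KAN tickets" contains no issue key like KAN-5, and "search the repo" matches neither `github\b` nor `repo:`. The patterns themselves can be sanity-checked standalone (example questions chosen so each pattern actually fires):

```java
import java.util.regex.Pattern;

// Checks a few of the compiled fallback patterns against explicit-signal questions.
public class PatternSanityCheck {

    static final Pattern URL = Pattern.compile(
            "https?://[^\\s<>\"{}|\\\\^`\\[\\]]+", Pattern.CASE_INSENSITIVE);
    static final Pattern TIME = Pattern.compile(
            "(?i)\\b(what time|current time|time now|timezone|time in |utc\\b|gmt\\b|zulu time)\\b");
    static final Pattern JIRA = Pattern.compile(
            "(?i)\\b([a-z][a-z0-9]+-\\d+|\\bjira\\b|\\batlassian\\b|\\bsupport ticket\\b|\\bkan-\\d+)");
    static final Pattern GITHUB = Pattern.compile(
            "(?i)\\b(github\\b|repo:\\s*\\S+/\\S+|code\\s+search|pull\\s+request|\\bcommit(s)?\\b)\\b");

    static boolean hits(Pattern p, String q) { return p.matcher(q).find(); }

    public static void main(String[] args) {
        System.out.println(hits(URL,    "Fetch https://example.com/data.json and summarise it")); // true
        System.out.println(hits(TIME,   "What time is it in Singapore?"));                        // true
        System.out.println(hits(JIRA,   "What's the status of KAN-5?"));                          // true
        System.out.println(hits(GITHUB, "Open the pull request list on GitHub"));                 // true
        System.out.println(hits(GITHUB, "Search the repo for the RagService class"));             // false
    }
}
```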
Provider-Aware Router Options
The classifier uses the same provider/model as the main chat request, but forces deterministic output by setting temperature to 0.0. Provider-specific option builders prevent one provider's options from being applied to another:
```java
private ChatClient.ChatClientRequestSpec applyRouterOptions(
        ChatClient.ChatClientRequestSpec spec, String provider, String modelId) {
    String p = provider != null ? provider : "ANTHROPIC";
    if ("OLLAMA".equalsIgnoreCase(p) && modelId != null) {
        return spec.options(OllamaChatOptions.builder()
                .model(modelId).temperature(0.0).build());
    }
    if ("GEMINI".equalsIgnoreCase(p) && modelId != null) {
        return spec.options(GoogleGenAiChatOptions.builder()
                .model(modelId).temperature(0.0).build());
    }
    // Default: Anthropic
    AnthropicChatOptions.Builder b = AnthropicChatOptions.builder()
            .temperature(0.0).maxTokens(350);
    if (modelId != null) b.model(modelId);
    return spec.options(b.build());
}
```
The Anthropic path caps output at `maxTokens=350`, which is enough for a short JSON object and keeps the router call fast and cheap. If the LLM returns extra tokens (e.g. the JSON wrapped in markdown), `extractJsonObject()` strips the surrounding text using brace-matching before JSON parsing.
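The source does not show `extractJsonObject()`; a brace-matching extraction could look like this (a sketch, assuming the router's JSON is flat, with no braces inside string values; the tiny `parse` helper is likewise an illustration, not the project's parser):

```java
// Sketch: brace-matched JSON extraction plus a minimal flat parse for a
// reply like {"retrieveDocuments":true,"allowMcpTools":false}, possibly
// wrapped in markdown fences or prose.
public class RouterJson {

    public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {}

    /** Returns the first balanced {...} span in raw, or null if none. */
    static String extractJsonObject(String raw) {
        int start = raw.indexOf('{');
        if (start < 0) return null;
        int depth = 0;
        for (int i = start; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return raw.substring(start, i + 1);
        }
        return null; // unbalanced braces
    }

    /** Very small flat parse: looks for "field":true in the extracted object. */
    static QueryIntent parse(String json) {
        boolean docs = json.matches("(?s).*\"retrieveDocuments\"\\s*:\\s*true.*");
        boolean mcp  = json.matches("(?s).*\"allowMcpTools\"\\s*:\\s*true.*");
        return new QueryIntent(docs, mcp);
    }
}
```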
With `intent-routing-enabled: true`, every non-cached chat request makes two LLM calls: one for routing (~100–300 ms) and one for the answer. This is usually worth the cost because it avoids unnecessary Qdrant scans, but monitor your API usage if you have many concurrent users on a paid tier with expensive models.