Intent Routing
Without intent routing, every chat message triggers a full knowledge-base retrieval, whether or not the user is asking about their uploaded documents. A user asking "What time is it in Tokyo?" does not need Qdrant results; they need a time-zone lookup. Retrieving irrelevant chunks adds latency and can pollute the context window with noise that confuses the model.
Intent routing adds a fast decision step before retrieval: a small LLM call (or a pattern-matching fallback) classifies each question into a `QueryIntent` that tells the pipeline exactly which resources to activate.
The QueryIntent Record
The outcome of classification is a simple Java record with two boolean fields:
```java
public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {

    /** When intent routing is disabled: always retrieve;
     *  MCP follows the global powerrag.mcp.rag-enabled flag. */
    public static QueryIntent legacy(boolean mcpRagEnabled) {
        return new QueryIntent(true, mcpRagEnabled);
    }
}
```
| Field | When `true` | When `false` |
|---|---|---|
| `retrieveDocuments` | Run Qdrant + PostgreSQL FTS hybrid retrieval | Skip retrieval entirely; use only LLM general knowledge (or MCP tools) |
| `allowMcpTools` | Attach MCP tool callbacks to this ChatClient call | No tools attached; model answers from context or KB only |
`allowMcpTools` in the intent is an additional gate on top of the global `powerrag.mcp.rag-enabled` flag. Tools are attached only when both the global flag is true AND the per-query intent allows them. This lets you enable MCP globally while the router selectively disables tools for simple KB-only queries.
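The interaction of the two gates can be sketched with a small standalone example (the record mirrors the one above; the helper method and flag values are illustrative assumptions, not project code):

```java
// Demonstrates how the per-query intent gates MCP tools on top of the
// global flag, and how legacy mode mirrors the pre-routing behaviour.
public class IntentGateDemo {

    public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {
        public static QueryIntent legacy(boolean mcpRagEnabled) {
            return new QueryIntent(true, mcpRagEnabled);
        }
    }

    /** Tools attach only when BOTH the global flag and the intent allow them. */
    static boolean attachMcpTools(boolean mcpRagEnabled, QueryIntent intent) {
        return mcpRagEnabled && intent.allowMcpTools();
    }

    public static void main(String[] args) {
        QueryIntent kbOnly = new QueryIntent(true, false);  // router: plain KB question
        QueryIntent live   = new QueryIntent(false, true);  // router: live-data question

        // Global MCP enabled: the router still suppresses tools for KB-only queries.
        System.out.println(attachMcpTools(true, kbOnly));   // false
        System.out.println(attachMcpTools(true, live));     // true
        // Global MCP disabled: tools never attach, whatever the intent says.
        System.out.println(attachMcpTools(false, live));    // false
        // Legacy mode: always retrieve, MCP follows the global flag.
        System.out.println(QueryIntent.legacy(true).retrieveDocuments()); // true
    }
}
```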
The Two Classification Paths
QueryIntentClassifier tries the LLM-based router first. If it fails (timeout, parse
error, API error), it falls back to pattern-matching heuristics. The fallback is deliberately
conservative: when in doubt, it retrieves documents and enables MCP tools.
```java
public QueryIntent classify(String question, String language,
                            ChatClient client, String modelProvider, String modelId) {
    if (question == null || question.isBlank()) {
        return fallback(question);
    }
    try {
        ChatClient.ChatClientRequestSpec spec = client.prompt()
                .system(ROUTER_SYSTEM)
                .user("language=" + (language != null ? language : "en") + "\n\n" + question);
        spec = applyRouterOptions(spec, modelProvider, modelId); // temp=0.0, max_tokens=350
        String raw = spec.call().content();
        return parseModelJson(raw, question);
    } catch (Exception e) {
        log.warn("Intent classification failed, using heuristic fallback: {}", e.getMessage());
        return fallback(question);
    }
}
```
The Router System Prompt
The classifier uses the same ChatClient as the main answer call, but overrides
temperature to 0.0 and caps output at 350 tokens. The model is only allowed to return a JSON object
— no markdown fences, no prose. This makes the response fast (<300 ms typically) and easy to parse.
```
You route questions for a RAG assistant with optional MCP tools (web fetch, time, weather,
Jira, GitHub code/repo, GCP Cloud Logging, mailbox IMAP/SMTP).

Reply with ONLY one JSON object (no markdown fences, no prose):
{"retrieveDocuments":true|false,"allowMcpTools":true|false}

retrieveDocuments — true if the user likely needs their uploaded knowledge base (policies,
internal docs, PDFs they ingested, "my documents", company-specific material).
false for pure general knowledge, small talk, standalone math/logic,
generic coding help with no doc context, or questions that clearly do not depend on private uploads.

allowMcpTools — true if any live tool may help: URL or fetch/read a webpage; current time or
timezone; weather or forecast; Jira/support tickets; GitHub (search code, read files in a repo);
Google Cloud Logging / Stackdriver; email/mailbox.
false when a static KB or general knowledge answer suffices.
Default false; do not enable for purely hypothetical or historical trivia with no live data need.

If unsure about retrieveDocuments, prefer true.
```
The final instruction — "if unsure about retrieveDocuments, prefer true" — is intentional. A false negative on retrieval (skipping the KB when it was needed) produces a worse answer than a false positive (retrieving when it wasn't needed). The cost of an unnecessary retrieval is a few milliseconds; the cost of missing KB context is a hallucinated answer.
Heuristic Fallback Patterns
When the LLM router fails or is disabled, a set of compiled regular expressions provides a fast fallback. The patterns cover the most common signals that live data is needed:
```java
// Any http/https URL in the question
private static final Pattern URL_PATTERN = Pattern.compile(
        "https?://[^\\s<>\"{}|\\\\^`\\[\\]]+", Pattern.CASE_INSENSITIVE);

// Jira issue keys (e.g. KAN-5), "jira", "atlassian", "support ticket"
private static final Pattern JIRA_LIVE_PATTERN = Pattern.compile(
        "(?i)\\b([a-z][a-z0-9]+-\\d+|\\bjira\\b|\\batlassian\\b|\\bsupport ticket\\b|\\bkan-\\d+)");

// Weather terms
private static final Pattern WEATHER_LIVE_PATTERN = Pattern.compile(
        "(?i)\\b(weather|forecast|temperature|rain|humidity|wind chill|feels like)\\b");

// Time and timezone queries
private static final Pattern TIME_LIVE_PATTERN = Pattern.compile(
        "(?i)\\b(what time|current time|time now|timezone|time in |utc\\b|gmt\\b|zulu time)\\b");

// GitHub: repo search, code search, PR/commit queries
private static final Pattern GITHUB_MCP_PATTERN = Pattern.compile(
        "(?i)\\b(github\\b|repo:\\s*\\S+/\\S+|code\\s+search|pull\\s+request|\\bcommit(s)?\\b)\\b");

// GCP / Stackdriver / Cloud Logging
private static final Pattern GCP_LOGGING_MCP_PATTERN = Pattern.compile(
        "(?i)\\b(gcp\\s+logs?|google\\s+cloud\\s+logging|stackdriver|gke\\s+logs)\\b");

// Email / inbox / mailbox
private static final Pattern EMAIL_MCP_PATTERN = Pattern.compile(
        "(?i)\\b(email|inbox|unread\\s+mail|mailbox|imap|gmail|summarize.*email|draft.*reply)\\b");
```
The fallback returns `retrieveDocuments=true` in all cases (a safe default) and `allowMcpTools=true` only when one of the patterns matches.
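A plausible shape for that fallback method, as a runnable sketch (not the project's actual code; the pattern list here is abbreviated to a few representatives of the regexes above):

```java
import java.util.List;
import java.util.regex.Pattern;

// Hedged sketch of the conservative heuristic fallback: always retrieve,
// enable MCP tools only when a live-data pattern fires.
public class HeuristicFallback {

    public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {}

    private static final List<Pattern> LIVE_PATTERNS = List.of(
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE),     // URLs
            Pattern.compile("(?i)\\b(weather|forecast|temperature)\\b"),    // weather
            Pattern.compile("(?i)\\b(what time|current time|timezone)\\b"), // time
            Pattern.compile("(?i)\\b(jira|atlassian|support ticket)\\b"));  // Jira

    static QueryIntent fallback(String question) {
        if (question == null || question.isBlank()) {
            return new QueryIntent(true, false); // nothing to match against
        }
        boolean live = LIVE_PATTERNS.stream()
                .anyMatch(p -> p.matcher(question).find());
        // Safe default: always retrieve; tools only on a positive live-data signal.
        return new QueryIntent(true, live);
    }
}
```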
Integration in RagService
Intent routing runs immediately after the semantic cache check (to avoid unnecessary classification for cached responses) and before hybrid retrieval. Images in the request bypass routing since the main LLM must handle vision — tools would be a distraction.
```java
// ── 2. Intent routing (before retrieval & main LLM) ─────────────────
boolean imagePresent = imageBase64 != null && !imageBase64.isBlank();
QueryIntent intent;
if (!intentRoutingEnabled || imagePresent) {
    // Legacy mode: always retrieve; MCP follows global flag
    intent = QueryIntent.legacy(mcpRagEnabled);
} else {
    intent = intentClassifier.classify(question, lang, client, modelProvider, reqModelId);
}
boolean attachMcpTools = mcpRagEnabled && intent.allowMcpTools();
log.info("Query intent: retrieveDocuments={} attachMcpTools={} (routingActive={})",
        intent.retrieveDocuments(), attachMcpTools, intentRoutingEnabled && !imagePresent);

// ── 3. Retrieval (optional) ─────────────────────────────────────────
List<RetrievedChunk> chunks;
if (!intent.retrieveDocuments()) {
    chunks = List.of();
    log.info("Skipping knowledge-base retrieval (intent.retrieveDocuments=false)");
} else {
    chunks = retriever.retrieve(question);
}
```
Configuration
```yaml
powerrag:
  rag:
    # One small LLM call to decide KB retrieval vs general knowledge,
    # and whether MCP tools (e.g. fetch) are needed.
    intent-routing-enabled: true
  mcp:
    # Attach MCP tools to the main RAG ChatClient when
    # spring.ai.mcp.client.enabled=true (see application-dev.yml).
    rag-enabled: false
```
Set `intent-routing-enabled: false` to fall back to the pre-MCP behaviour: always retrieve from the knowledge base. This is useful when you want to disable the extra LLM call for cost reasons, or when debugging retrieval quality without the routing layer in the way.
Example Routing Decisions
| User question | retrieveDocuments | allowMcpTools | Reasoning |
|---|---|---|---|
| "Summarise the EU AI Act risk categories" | true | false | Clearly refers to an uploaded document |
| "What time is it in Singapore?" | false | true | Live data needed; no document relevance |
| "Fetch https://example.com/data.json and summarise it" | false | true | URL detected — fetch_url tool required |
| "What are the open KAN tickets?" | false | true | Jira/support-ticket query; live Jira data needed |
| "Write a Python function to sort a list" | false | false | General coding help; no docs or live data needed |
| "According to my documents, what is our GDPR policy on retention?" | true | false | "my documents" phrase triggers KB retrieval |
| "Search the repo for the RagService class" | false | true | Repository code search; GitHub tool needed |
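Note that some of these rows rely on the LLM's semantic judgment rather than the fallback regexes: "open KAN tickets" contains no issue key like KAN-5, and "search the repo" matches neither `github\b` nor `repo:`. The patterns themselves can be sanity-checked standalone (example questions chosen so each pattern actually fires):

```java
import java.util.regex.Pattern;

// Checks a few of the compiled fallback patterns against explicit-signal questions.
public class PatternSanityCheck {

    static final Pattern URL = Pattern.compile(
            "https?://[^\\s<>\"{}|\\\\^`\\[\\]]+", Pattern.CASE_INSENSITIVE);
    static final Pattern TIME = Pattern.compile(
            "(?i)\\b(what time|current time|time now|timezone|time in |utc\\b|gmt\\b|zulu time)\\b");
    static final Pattern JIRA = Pattern.compile(
            "(?i)\\b([a-z][a-z0-9]+-\\d+|\\bjira\\b|\\batlassian\\b|\\bsupport ticket\\b|\\bkan-\\d+)");
    static final Pattern GITHUB = Pattern.compile(
            "(?i)\\b(github\\b|repo:\\s*\\S+/\\S+|code\\s+search|pull\\s+request|\\bcommit(s)?\\b)\\b");

    static boolean hits(Pattern p, String q) { return p.matcher(q).find(); }

    public static void main(String[] args) {
        System.out.println(hits(URL,    "Fetch https://example.com/data.json and summarise it")); // true
        System.out.println(hits(TIME,   "What time is it in Singapore?"));                        // true
        System.out.println(hits(JIRA,   "What's the status of KAN-5?"));                          // true
        System.out.println(hits(GITHUB, "Open the pull request list on GitHub"));                 // true
        System.out.println(hits(GITHUB, "Search the repo for the RagService class"));             // false
    }
}
```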
Provider-Aware Router Options
The classifier uses the same provider/model as the main chat request, but forces deterministic output by setting temperature to 0.0. Provider-specific option builders prevent one provider's options from being applied to another:
```java
private ChatClient.ChatClientRequestSpec applyRouterOptions(
        ChatClient.ChatClientRequestSpec spec, String provider, String modelId) {
    String p = provider != null ? provider : "ANTHROPIC";
    if ("OLLAMA".equalsIgnoreCase(p) && modelId != null) {
        return spec.options(OllamaChatOptions.builder()
                .model(modelId).temperature(0.0).build());
    }
    if ("GEMINI".equalsIgnoreCase(p) && modelId != null) {
        return spec.options(GoogleGenAiChatOptions.builder()
                .model(modelId).temperature(0.0).build());
    }
    // Default: Anthropic
    AnthropicChatOptions.Builder b = AnthropicChatOptions.builder()
            .temperature(0.0).maxTokens(350);
    if (modelId != null) b.model(modelId);
    return spec.options(b.build());
}
```
The Anthropic path caps output at `maxTokens=350`, which is enough for a short JSON object and keeps the router call fast and cheap. If the LLM returns extra tokens (e.g. the JSON wrapped in markdown), `extractJsonObject()` strips the surrounding text using brace-matching before JSON parsing.
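The source does not show `extractJsonObject()`; a brace-matching extraction could look like this (a sketch, assuming the router's JSON is flat, with no braces inside string values; the tiny `parse` helper is likewise an illustration, not the project's parser):

```java
// Sketch: brace-matched JSON extraction plus a minimal flat parse for a
// reply like {"retrieveDocuments":true,"allowMcpTools":false}, possibly
// wrapped in markdown fences or prose.
public class RouterJson {

    public record QueryIntent(boolean retrieveDocuments, boolean allowMcpTools) {}

    /** Returns the first balanced {...} span in raw, or null if none. */
    static String extractJsonObject(String raw) {
        int start = raw.indexOf('{');
        if (start < 0) return null;
        int depth = 0;
        for (int i = start; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return raw.substring(start, i + 1);
        }
        return null; // unbalanced braces
    }

    /** Very small flat parse: looks for "field":true in the extracted object. */
    static QueryIntent parse(String json) {
        boolean docs = json.matches("(?s).*\"retrieveDocuments\"\\s*:\\s*true.*");
        boolean mcp  = json.matches("(?s).*\"allowMcpTools\"\\s*:\\s*true.*");
        return new QueryIntent(docs, mcp);
    }
}
```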
With `intent-routing-enabled: true`, every non-cached chat request makes two LLM calls: one for routing (~100–300 ms) and one for the answer. This is usually worth the cost because it avoids unnecessary Qdrant scans, but monitor your API usage if you have many concurrent users on a paid tier with expensive models.