Dynamic Model Routing

Module 2 · ~8 min read
The Power RAG frontend lets users choose their LLM provider and model at runtime. The backend must resolve the correct ChatClient bean and — for providers like Ollama and Gemini that share a single underlying bean — inject the correct model identifier per request via ChatOptions.

The Challenge

Spring beans are created at startup. You cannot create a new ChatClient bean for every possible Ollama model. The solution is two-level routing:

  1. Resolve the client bean — select the correct pre-built ChatClient for the provider
  2. Override the model per-request — inject the specific model ID via ChatOptions on the request spec
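The two levels can be sketched as a plain-Java decision function, independent of Spring. This is a simplified model for illustration only: the Route record and DEFAULT_KEY constant are stand-ins, not part of the actual service.

```java
import java.util.Map;

// Simplified model of the two-level routing decision: level 1 picks a
// client key, level 2 decides whether a per-request model override is needed.
public class RoutingSketch {
    // Which pre-built client to use, plus an optional per-request model override.
    record Route(String clientKey, String modelOverride) {}

    static final String DEFAULT_KEY = "ANTHROPIC:claude-sonnet-4-6";

    static Route route(Map<String, String> clientsByKey, String provider, String modelId) {
        String p = provider.toUpperCase();
        // Shared-bean providers: one client, model injected per request via options
        if (p.equals("OLLAMA") || p.equals("GEMINI")) {
            return new Route(p, modelId);
        }
        // Dedicated-bean providers: one client per model, no override needed
        String key = p + ":" + modelId;
        if (clientsByKey.containsKey(key)) {
            return new Route(key, null);
        }
        return new Route(DEFAULT_KEY, null); // graceful fallback
    }

    public static void main(String[] args) {
        Map<String, String> clients = Map.of(DEFAULT_KEY, "anthropicClient");
        System.out.println(route(clients, "ollama", "qwen2.5-coder:32b"));
        System.out.println(route(clients, "ANTHROPIC", "claude-sonnet-4-6"));
        System.out.println(route(clients, "MISTRAL", "unknown"));
    }
}
```

Note that the shared-bean branch carries the model forward as an override, while the keyed branch needs none: the dedicated bean is already bound to its model.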

resolveClient()

RagService.java — resolveClient()
private ChatClient resolveClient(String provider, String modelId) {
    if ("OLLAMA".equalsIgnoreCase(provider)) return ollamaBaseClient;
    if ("GEMINI".equalsIgnoreCase(provider)) return geminiBaseClient;
    ChatClient c = clientsByKey.get(provider.toUpperCase() + ":" + modelId);
    if (c != null) return c;
    return clientsByKey.get("ANTHROPIC:claude-sonnet-4-6"); // default
}

The logic:

  1. Ollama and Gemini always resolve to their shared base client; the concrete model is applied later via ChatOptions
  2. Any other provider is looked up in clientsByKey by the composite key "PROVIDER:modelId"
  3. An unknown combination falls back to the default Anthropic Claude Sonnet client

Dynamic Model Override with ChatOptions

Once the right client is selected, the specific model ID is injected via provider-specific ChatOptions. This overrides the model that was configured in application.yml for that particular request only — the next request starts fresh.

RagService.java — per-request model options
if ("OLLAMA".equalsIgnoreCase(provider) && modelId != null) {
    baseSpec = baseSpec.options(
        OllamaChatOptions.builder().model(modelId).build());
} else if ("GEMINI".equalsIgnoreCase(provider) && modelId != null) {
    baseSpec = baseSpec.options(
        GoogleGenAiChatOptions.builder().model(modelId).build());
}
Ollama and Gemini each share a single underlying model bean; the specific model is injected per request via options(), keeping the bean count manageable. There is no need to pre-register a separate bean for every possible Ollama model (there could be dozens installed locally).

The Full Routing Flow

Request: provider="OLLAMA", modelId="qwen2.5-coder:32b"
    │
    ▼
resolveClient("OLLAMA", "qwen2.5-coder:32b")
    │  returns ollamaBaseClient
    ▼
baseSpec = ollamaBaseClient.prompt()
    │  override model for this request
    ▼
baseSpec = baseSpec.options(
    OllamaChatOptions.builder().model("qwen2.5-coder:32b").build())
    │
    ▼
baseSpec.user(userMessage).call().content()
    │
    ▼
Ollama API → qwen2.5-coder:32b → answer

Default Fallback

If the frontend sends an unknown provider/model combination, resolveClient() returns the Anthropic Claude Sonnet bean. This ensures the pipeline never breaks — it degrades gracefully to the primary LLM.

When adding a new Anthropic model, register it in clientsByKey with the key "ANTHROPIC:new-model-id". For Ollama and Gemini, no code change is needed — the model ID is passed at request time.
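The registration pattern and the default fallback from resolveClient() can be sketched together with plain strings standing in for the ChatClient beans (the bean names and "MISTRAL:unknown-model" key below are placeholders):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the client registry: one put() per dedicated model,
// with unknown keys degrading gracefully to the default client.
public class ClientRegistrySketch {
    static final String DEFAULT_KEY = "ANTHROPIC:claude-sonnet-4-6";

    public static void main(String[] args) {
        Map<String, String> clientsByKey = new HashMap<>();
        clientsByKey.put(DEFAULT_KEY, "defaultAnthropicClient");
        // New Anthropic model: a single registration, keyed "PROVIDER:modelId"
        clientsByKey.put("ANTHROPIC:new-model-id", "newAnthropicClient");
        // Ollama/Gemini need no entries here; their model ID arrives per request

        // Known key resolves directly; unknown combinations fall back
        String hit  = clientsByKey.getOrDefault("ANTHROPIC:new-model-id",
                clientsByKey.get(DEFAULT_KEY));
        String miss = clientsByKey.getOrDefault("MISTRAL:unknown-model",
                clientsByKey.get(DEFAULT_KEY));
        System.out.println(hit);   // newAnthropicClient
        System.out.println(miss);  // defaultAnthropicClient
    }
}
```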