Local Open-Source LLMs
Power RAG's shipped defaults are cloud-first: embeddings use gemini-embedding-001 (768-dim); input safety uses Gemini 2.5 Flash
(gemini-2.5-flash). Ollama is for optional local chat (e.g. Qwen, DeepSeek)
when the user picks an Ollama model. This topic maps open-source models to pipeline roles
for when you move embeddings or guardrails off the cloud, and gives hardware guidance for local LLMs.
Why Run LLMs Locally?
Cloud APIs are the fastest way to get started, but local models become attractive once any of these apply:
| Reason | Details |
|---|---|
| Data privacy | Documents never leave your network — essential for legal, medical, or government data |
| Cost at scale | After ~1 M tokens/day, GPU hardware pays for itself vs. cloud API bills |
| Offline / air-gapped | Factory floors, secure facilities, or fieldwork with no internet access |
| Latency control | Avoid round-trip network latency; local GPU can be faster than remote API for small models |
| No rate limits | Burst as hard as your GPU allows |
You can keep the shipped cloud defaults (configured in application.yml) and use Ollama only for chat, or reconfigure to run more of the
stack locally — at the cost of maintaining compatible embedding dimensions and extra VRAM.
The Five LLM Roles in Power RAG
Before picking a model, identify which role it needs to fill. Each role has a different demand profile:
1. Embedding — Default: Google gemini-embedding-001 at 768 dimensions (requires GOOGLE_API_KEY). Required by: document ingestion, semantic cache, Qdrant hybrid retrieval. Demand: high throughput and a stable dimensionality aligned with spring.ai.vectorstore.qdrant.dimensions.
2. Input guardrail — Default: Gemini 2.5 Flash via GuardrailService (powerrag.guardrails.input-model-id). Demand: fast classification and a consistent safe / unsafe response format.
3. General RAG chat — Required by: RagService.callLlm() for document Q&A. Demand: strong instruction following, a context window that comfortably holds the 24,000-character retrieved context, and the ability to cite [SOURCE N] inline. Document parsing generates long contexts — the model must not lose track of citations halfway through.
4. Text-to-SQL — Required by: TextToSqlService. Demand: very strong SQL generation, PostgreSQL dialect, understanding of schema descriptions and enum hints. This is where weak models fail most visibly — incorrect SQL causes execution errors that the user sees immediately.
5. Image description — Required by: ImageParser during document ingestion. Demand: vision capability and descriptive output. Runs only when image files are ingested, not on every query — latency is less critical than for roles 1–4.
Role 1 — Embedding Models
To switch to a local Ollama EmbeddingModel, you must remove the Ollama embedding autoconfig
exclusion (or equivalent), set Qdrant dimensions to match the new model, and re-ingest all documents.
| Model | Dimensions | Size | Speed (CPU) | Notes |
|---|---|---|---|---|
| gemini-embedding-001 ⭐ default (cloud) | 768 (configured) | API | network-bound | Used in Power RAG via Spring AI Google GenAI; same model for Qdrant + Redis cache. |
| nomic-embed-text (local option) | 768 | 274 MB | ~10 ms/chunk | Common Ollama choice if you self-host embeddings; must re-enable Ollama embedding autoconfig and match dimensions. |
| mxbai-embed-large | 1024 | 670 MB | ~25 ms/chunk | Higher quality MTEB scores; requires changing Qdrant dimensions to 1024. |
| all-minilm:l6-v2 | 384 | 23 MB | ~3 ms/chunk | Extremely fast and tiny. Lower recall — only for high-volume, latency-sensitive setups. |
| snowflake-arctic-embed2 | 1024 | 560 MB | ~20 ms/chunk | State-of-the-art retrieval quality; good upgrade path from smaller local embedders. |
Switching, for example, to mxbai-embed-large requires re-enabling Ollama embedding autoconfiguration, updating spring.ai.vectorstore.qdrant.dimensions, and full re-ingestion:
```yaml
# You would remove OllamaEmbeddingAutoConfiguration from spring.autoconfigure.exclude,
# then align dimensions, e.g.:
spring:
  ai:
    ollama:
      embedding:
        options:
          model: mxbai-embed-large
    vectorstore:
      qdrant:
        dimensions: 1024
```
Role 2 — Safety Guardrail Models
Shipped default: Gemini 2.5 Flash (gemini-2.5-flash) — fast,
low temperature, and suitable for a short rubric response. The table below lists local
Ollama classifiers if you replace the geminiGuard client with an Ollama-based flow.
| Model | Size | VRAM (GPU) | Latency | Notes |
|---|---|---|---|---|
| Gemini 2.5 Flash ⭐ default (cloud) | API | — | network-bound | Power RAG default via GoogleGenAiChatOptions in GuardrailService. |
| llama-guard3:8b (local option) | 4.9 GB | 6 GB | ~300 ms GPU / ~2 s CPU | Meta safety classifier; viable if you refactor guardrails back to Ollama. |
| llama-guard3:1b | 0.8 GB | 2 GB | ~80 ms GPU / ~500 ms CPU | Smaller, faster, slightly lower accuracy. Good for high-traffic deployments. |
| shieldgemma:2b | 1.6 GB | 3 GB | ~120 ms GPU / ~800 ms CPU | Google's safety classifier. Comparable accuracy to llama-guard3:1b. |
The shipped guard call pins the model and temperature in GuardrailService (a local refactor would swap this for an Ollama client running e.g. llama-guard3:1b):
```java
// inputModelId defaults to gemini-2.5-flash; override with POWERRAG_GUARDRAIL_MODEL
.options(GoogleGenAiChatOptions.builder()
        .model(inputModelId)
        .temperature(0.0)
        .build())
```
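If you deploy via Docker Compose, that override can be supplied as an environment variable. This is a sketch; the powerrag service name is an assumption about your compose file:

```yaml
services:
  powerrag:                    # hypothetical name for the application container
    environment:
      POWERRAG_GUARDRAIL_MODEL: "gemini-2.5-flash"  # guard model, changed without a rebuild
```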
Role 3 — General RAG Chat Models
The general chat model handles document Q&A — the core use case of the application. Power RAG sends it up to 24,000 characters of context (retrieved document chunks) plus the user's question. The model must follow citation instructions ([SOURCE N] inline) and maintain coherence across the full context window.
- Context window ≥ 32K tokens (24 000 chars ≈ 6 000–8 000 tokens)
- Strong instruction following (must cite sources, respond in requested language)
- Multilingual output (en / zh-CN / zh-TW supported by Power RAG)
- Reasoning quality sufficient for document synthesis
| Model | Size | VRAM | Context | Speed (GPU tok/s) | Quality |
|---|---|---|---|---|---|
| qwen2.5:7b | 4.7 GB | 6 GB | 128K | ~60 tok/s | ★★★★☆ — Excellent for size. Strong multilingual (English + Chinese). |
| qwen2.5:14b ⭐ recommended | 9.0 GB | 12 GB | 128K | ~35 tok/s | ★★★★★ — Near-cloud quality. Best open-source option for RAG chat. |
| qwen2.5:32b | 20 GB | 24 GB | 128K | ~18 tok/s | ★★★★★ — Cloud-grade quality. Requires high-end GPU. |
| llama3.1:8b | 4.9 GB | 6 GB | 128K | ~55 tok/s | ★★★☆☆ — Good English quality. Weaker multilingual than Qwen. |
| llama3.3:70b | 42 GB | 48 GB | 128K | ~8 tok/s | ★★★★★ — Exceptional. Requires multi-GPU or Apple M3 Ultra. |
| phi4:14b | 8.9 GB | 12 GB | 16K | ~38 tok/s | ★★★★☆ — Microsoft research model. Outstanding reasoning per parameter. |
| mistral:7b | 4.1 GB | 5 GB | 32K | ~65 tok/s | ★★★☆☆ — Fast. Good instruction following. Limited multilingual. |
| gemma3:12b | 8.1 GB | 10 GB | 128K | ~40 tok/s | ★★★★☆ — Google's open-source model. Strong reasoning and multilingual. |
Recommended: qwen2.5:14b on a 16 GB VRAM GPU
(e.g. RTX 4080, RTX 3090). It has strong native Chinese support (important for Power RAG's
zh-CN/zh-TW modes), a 128K context window that comfortably holds the 24,000-char RAG context,
and near-cloud-quality output. At ~35 tok/s, a 300-token response completes in under
10 seconds — an acceptable user experience.
Role 4 — Text-to-SQL Models
Text-to-SQL is the most demanding task for a local model. TextToSqlService
sends a full schema description (table names, column types, enum values) plus a natural
language question, and expects a correct, syntactically valid PostgreSQL SELECT statement
in return. Errors are immediately visible to the user as execution failures.
| Model | Size | VRAM | SQL Accuracy | Notes |
|---|---|---|---|---|
| qwen2.5-coder:32b ⭐ recommended | 19 GB | 24 GB | ★★★★★ | Best open-source SQL generation. Handles complex JOINs, CTEs, window functions. Already registered in Power RAG as OLLAMA:qwen2.5-coder:32b. |
| deepseek-coder-v2:16b ⭐ used | 9.1 GB | 12 GB | ★★★★☆ | Excellent SQL quality at 16B. The balanced choice between VRAM and accuracy. Already registered in Power RAG. |
| qwen2.5-coder:14b | 9.0 GB | 12 GB | ★★★★☆ | Strong SQL generation. Comparable to deepseek-coder-v2:16b. Good if you want a single Qwen family across all roles. |
| qwen2.5-coder:7b | 4.7 GB | 6 GB | ★★★☆☆ | Acceptable for simple single-table SELECTs. Struggles with multi-table JOINs and schema-specific enum constraints. |
| codellama:34b | 19 GB | 24 GB | ★★★★☆ | Strong but older (Meta 2023). Qwen2.5-coder and DeepSeek outperform it on modern benchmarks. |
| sqlcoder:7b | 3.8 GB | 5 GB | ★★★☆☆ | Purpose-built for SQL by Defog AI. Good for MySQL/Postgres basic queries; limited on complex schemas or PostgreSQL-specific syntax. |
TextToSqlService uses Gemini 2.5 Pro by default — a cloud model
with state-of-the-art SQL generation. If you need a fully local Text-to-SQL pipeline,
replace the @Qualifier("geminiPro") in TextToSqlService
with a dedicated local coder client.
```java
// Current: uses Gemini 2.5 Pro (cloud)
public TextToSqlService(SchemaIntrospector schemaIntrospector,
                        SqlValidator sqlValidator,
                        JdbcTemplate jdbcTemplate,
                        @Qualifier("geminiPro") ChatClient chatClient) {
    // To use local deepseek-coder-v2:16b instead:
    // 1. Register a dedicated bean in SpringAiConfig.java:
    //    @Bean @Qualifier("ollamaDeepSeekSql")
    //    public ChatClient ollamaDeepSeekSqlClient(OllamaChatModel model) {
    //        return ChatClient.builder(model).build();
    //    }
    // 2. Change qualifier here:
    //    @Qualifier("ollamaDeepSeekSql") ChatClient chatClient
    // 3. Override model at call time:
    //    .options(OllamaChatOptions.builder()
    //        .model("deepseek-coder-v2:16b").build())
}
```
Role 5 — Image Description (Vision Models)
ImageParser uses a multimodal LLM to generate text descriptions of image files
during document ingestion. The description becomes a text chunk that is embedded and indexed
— making image content searchable via RAG. This role runs only during ingestion, not on
every query.
| Model | Size | VRAM | Vision Quality | Notes |
|---|---|---|---|---|
| qwen2.5vl:7b ⭐ recommended | 5.5 GB | 8 GB | ★★★★★ | State-of-the-art open-source vision-language model (2025). Excellent at diagrams, charts, screenshots, and technical illustrations. |
| llava:13b | 8.0 GB | 10 GB | ★★★☆☆ | Classic multimodal model. Good general image description but weak on technical charts and data tables. |
| minicpm-v:8b | 5.5 GB | 7 GB | ★★★★☆ | Compact and fast. Surprisingly good on UI screenshots and document images. |
| llava-phi3:3.8b | 2.9 GB | 4 GB | ★★★☆☆ | Very small and fast. Good for simple photos. Weak on complex technical diagrams or dense text. |
| gemma3:12b | 8.1 GB | 10 GB | ★★★★☆ | Google's multimodal model. Strong at reading text within images (OCR-like) and understanding charts. |
Recommended: qwen2.5vl:7b, which produces significantly richer
descriptions than older llava models — resulting in better retrieval when users ask about
visual content.
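If you wire ImageParser to a local vision model, Ollama's generate endpoint accepts base64-encoded images alongside the prompt. A sketch of the request body (the prompt wording is illustrative, not taken from Power RAG):

```json
{
  "model": "qwen2.5vl:7b",
  "prompt": "Describe this image in detail for search indexing.",
  "images": ["<base64-encoded image bytes>"],
  "stream": false
}
```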
Hardware Requirements
GPU VRAM Tiers
| VRAM | Example GPUs | What fits | User Experience |
|---|---|---|---|
| 4–6 GB | RTX 3060, RTX 4060 | qwen2.5:7b (q4), small local guard classifiers if used | ⚠️ Marginal — chat ~40s/response on CPU. OK for demos with cloud embeddings/guardrails. |
| 8–12 GB | RTX 3080, RTX 4070, RTX 4080 | qwen2.5:14b, deepseek-coder-v2:16b (q4), larger local guard models if used | ✅ Good — chat responses in 8–15s. This is the minimum for a comfortable user experience. |
| 16–24 GB | RTX 4090, RTX 3090, A5000 | All above + qwen2.5-coder:32b, qwen2.5:32b, all vision models | ✅✅ Excellent — responses in 5–10s. Can run all 5 roles locally simultaneously. |
| 48–80 GB | A100, H100, 2× RTX 4090 | llama3.3:70b, qwen2.5:72b | ✅✅✅ Cloud-grade — full 70B models locally. Near-zero latency for smaller models. |
Apple Silicon (Unified Memory)
Apple M-series chips use unified memory shared between CPU and GPU. This is excellent for LLMs because the full memory pool is available for model weights:
| Chip / RAM | Recommended Models | Notes |
|---|---|---|
| M1 / M2 (8 GB) | qwen2.5:7b, small local guard classifiers if used | ⚠️ Tight. Chat model will be slow (~15 tok/s). Embed + guardrail are fine. |
| M2 / M3 (16 GB) | qwen2.5:14b, deepseek-coder-v2:16b (q4), larger local guard models if used | ✅ Good. This is a popular developer configuration. ~25–30 tok/s for 14B models. |
| M2 / M3 Pro (32 GB) | qwen2.5:32b, qwen2.5-coder:32b, all vision models | ✅✅ Excellent. Comfortable for all roles. ~18–22 tok/s for 32B models. |
| M2 / M3 Max / Ultra (64–192 GB) | llama3.3:70b, qwen2.5:72b | ✅✅✅ Exceptional. Full 70B models run at 12–18 tok/s — production-ready. |
RAM (System Memory) — CPU-Only
If no dedicated GPU is available, Ollama falls back to CPU inference. Minimum requirements:
| Model Size | Min RAM | CPU Speed | Practical Use |
|---|---|---|---|
| small local embedder (~300 MB) | 4 GB | ~15 ms/embed | ✅ CPU OK for tiny embedding models only |
| 1–3B models | 8 GB | ~8 tok/s | ✅ Guardrails only — acceptable latency |
| 7–8B models | 16 GB | ~3–5 tok/s | ⚠️ Very slow for chat (~60–100s/response) |
| 13–14B models | 32 GB | ~1–2 tok/s | ❌ Not viable for interactive use |
Recommended Configurations by Use Case
Configuration A — Minimum Viable (Developer Laptop, No Dedicated GPU)
```shell
# Default stack: set GOOGLE_API_KEY (embeddings + input guard + Gemini chat as needed)
# Ollama optional — only if you want local chat models, e.g.:
#   ollama pull qwen2.5-coder:7b
# Hybrid: cloud embeddings/guardrails + local Ollama chat — no nomic/llama-guard pulls required
```
Configuration B — Balanced Local (16 GB VRAM / M3 Pro 32 GB)
```shell
# If fully local beyond chat, add your own embedding + guard Ollama models and reconfigure.
ollama pull qwen2.5:14b            # General RAG chat (primary)
ollama pull deepseek-coder-v2:16b  # Text-to-SQL
ollama pull qwen2.5vl:7b           # Image description (ingestion)
```
Full local isolation requires replacing Google embedding and guardrail beans, not just emptying API keys. For chat-only local default while keeping Google embeddings/guardrails, point Ollama at your preferred model:
```yaml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: qwen2.5:14b
    google:
      genai:
        api-key: ${GOOGLE_API_KEY}  # still required for gemini-embedding-001 + guardrails
```
Configuration C — Maximum Quality (24 GB VRAM / M3 Max 64 GB)
```shell
ollama pull qwen2.5:32b        # General RAG chat
ollama pull qwen2.5-coder:32b  # Text-to-SQL (excellent accuracy)
ollama pull qwen2.5vl:7b       # Image description
```
Performance Tuning for User Experience
A RAG response chains several steps: input guard (Gemini by default) → cache lookup (embedding API) → retrieval (embedding again) → main LLM. When using Ollama for chat, keep models warm — latency compounds.
Ollama unloads models after 5 minutes of inactivity by default. The first request after unload takes 5–30s to reload. For production, set:
```yaml
services:
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "24h"  # keep all recently-used models in VRAM
```
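Ollama also honours a per-request keep_alive field on its generate API, useful when only some models should stay resident. A sketch of the request body:

```json
{ "model": "qwen2.5:14b", "prompt": "warm-up", "keep_alive": "24h" }
```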
Ollama automatically uses Q4_K_M quantisation for most models. This halves VRAM usage with <5% quality loss. A 14B model at Q4 fits in 9 GB VRAM instead of 28 GB. Always prefer quantised over full-precision for interactive use.
The guardrail check (Stage 0) and the semantic cache lookup (Stage 1) are independent. If you refactor
RagService to call them via CompletableFuture,
you save the guardrail latency on cache-hit paths.
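A minimal sketch of that refactor, with guardAllows and cachedAnswer as hypothetical stand-ins for the GuardrailService call and the semantic cache lookup:

```java
import java.util.concurrent.CompletableFuture;

public class ParallelPreflight {
    // Hypothetical stand-ins for the real guardrail call and semantic cache lookup.
    static boolean guardAllows(String q) { return !q.contains("unsafe"); }
    static String cachedAnswer(String q) { return q.startsWith("known:") ? "cached answer" : null; }

    /** Launch guard check and cache lookup concurrently; join in the order the pipeline needs them. */
    public static String preflight(String question) {
        CompletableFuture<Boolean> guard  = CompletableFuture.supplyAsync(() -> guardAllows(question));
        CompletableFuture<String>  cached = CompletableFuture.supplyAsync(() -> cachedAnswer(question));

        String hit = cached.join();
        if (hit != null) {
            return hit;          // cache hit: return without waiting on the guard future
        }
        if (!guard.join()) {
            return "blocked";    // cache miss: the guard verdict was computing in parallel
        }
        return "call-llm";       // cache miss + safe input: proceed to retrieval and the main LLM
    }

    public static void main(String[] args) {
        System.out.println(preflight("what is RAG?"));
        System.out.println(preflight("known: greeting"));
    }
}
```

On a cache hit the method returns without joining the guard future, which is where the latency saving comes from; this assumes cached entries were already guard-checked when first stored.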
The cache threshold of 0.92 is conservative. On repeated-query workloads (e.g. a support chatbot), lowering it to 0.88 can dramatically increase cache hit rates at acceptable semantic accuracy. Lower it in
application.yml:
```yaml
powerrag:
  models:
    cache:
      similarity-threshold: 0.88  # was 0.92 — more aggressive caching
```
Quick Reference — Role to Model Mapping
| Role | Current (Power RAG) | Local Minimum | Local Recommended | Local Maximum Quality |
|---|---|---|---|---|
| Embedding | gemini-embedding-001 (cloud, 768-dim) | nomic-embed-text | nomic-embed-text / mxbai-embed-large | snowflake-arctic-embed2 |
| Guardrail | Gemini 2.5 Flash (gemini-2.5-flash) | llama-guard3:1b | llama-guard3:8b | llama-guard3:8b |
| RAG Chat | claude-sonnet-4-6 (cloud) | qwen2.5:7b | qwen2.5:14b | qwen2.5:32b |
| Text-to-SQL | gemini-2.5-pro (cloud) | qwen2.5-coder:7b | deepseek-coder-v2:16b | qwen2.5-coder:32b |
| Image Description | claude-sonnet-4-6 (cloud) | llava-phi3:3.8b | qwen2.5vl:7b | gemma3:12b |