Local Open-Source LLMs

Module 9 · ~18 min read
Default stack: Knowledge-base and cache embeddings use Google gemini-embedding-001 (768-dim); input safety uses Gemini 2.5 Flash (gemini-2.5-flash). Ollama is for optional local chat (e.g. Qwen, DeepSeek) when the user picks an Ollama model. This topic still maps open-source models to pipeline roles if you move embeddings or guardrails off the cloud, and gives hardware guidance for local LLMs.

Why Run LLMs Locally?

Cloud APIs are the fastest way to get started, but local models become attractive once any of these apply:

| Reason | Details |
| --- | --- |
| Data privacy | Documents never leave your network — essential for legal, medical, or government data |
| Cost at scale | After ~1 M tokens/day, GPU hardware pays for itself vs. cloud API bills |
| Offline / air-gapped | Factory floors, secure facilities, or fieldwork with no internet access |
| Latency control | Avoid round-trip network latency; local GPU can be faster than remote API for small models |
| No rate limits | Burst as hard as your GPU allows |
Hybrid approach: You can use cloud APIs for embeddings + guardrails (as in the reference application.yml) and Ollama only for chat, or reconfigure to run more of the stack locally — at the cost of maintaining compatible embedding dimensions and extra VRAM.

The Five LLM Roles in Power RAG

Before picking a model, identify which role it needs to fill. Each role has a different demand profile:

Role 1 — Embedding (every chunk at ingest, every query for dense search + semantic cache)
Default: Google gemini-embedding-001 at 768 dimensions (requires GOOGLE_API_KEY).
Required by: document ingestion, semantic cache, Qdrant hybrid retrieval.
Demand: high throughput, stable dimensionality aligned with spring.ai.vectorstore.qdrant.dimensions.
Role 2 — Safety Guardrail (every user query, before retrieval / main LLM)
Default: Gemini 2.5 Flash via GuardrailService (powerrag.guardrails.input-model-id).
Demand: fast classification and a consistent safe / unsafe response format.
Role 3 — General RAG Chat (one call per non-cached query)
Required by: RagService.callLlm() for document Q&A.
Demand: strong instruction following, a context window large enough for the ~24 000-character RAG context (roughly 6 000–8 000 tokens), and the ability to cite [SOURCE N] inline. Document parsing generates long contexts — the model must not lose track of citations halfway through.
Role 4 — Text-to-SQL (one call per SQL query)
Required by: TextToSqlService.
Demand: very strong SQL generation, PostgreSQL dialect, understanding of schema descriptions and enum hints. This is where weak models fail most visibly — incorrect SQL causes execution errors that the user sees immediately.
Role 5 — Image Description (one call per uploaded image file)
Required by: ImageParser during document ingestion.
Demand: vision capability, descriptive output. Runs only when image files are ingested, not on every query — latency is less critical than for roles 1–4.

Role 1 — Embedding Models

In the reference project, embeddings are not local: they go through Google GenAI. If you switch to a local EmbeddingModel, you must remove the Ollama embedding autoconfig exclusion (or equivalent), set Qdrant dimensions to match the new model, and re-ingest all documents.
| Model | Dimensions | Size | Speed (CPU) | Notes |
| --- | --- | --- | --- | --- |
| gemini-embedding-001 ⭐ default (cloud) | 768 (configured) | API | network-bound | Used in Power RAG via Spring AI Google GenAI; same model for Qdrant + Redis cache. |
| nomic-embed-text (local option) | 768 | 274 MB | ~10 ms/chunk | Common Ollama choice if you self-host embeddings; must re-enable Ollama embedding autoconfig and match dimensions. |
| mxbai-embed-large | 1024 | 670 MB | ~25 ms/chunk | Higher quality MTEB scores; requires changing Qdrant dimensions to 1024. |
| all-minilm:l6-v2 | 384 | 23 MB | ~3 ms/chunk | Extremely fast and tiny. Lower recall — only for high-volume, latency-sensitive setups. |
| snowflake-arctic-embed2 | 1024 | 560 MB | ~20 ms/chunk | State-of-the-art retrieval quality; good upgrade path from smaller local embedders. |
Important: Changing embedding model or dimension count requires deleting the Qdrant collection, updating spring.ai.vectorstore.qdrant.dimensions, and full re-ingestion.
Example — local Ollama embedding (illustrative; not the repo default)
# You would remove OllamaEmbeddingAutoConfiguration from spring.autoconfigure.exclude,
# then align dimensions, e.g.:
spring:
  ai:
    ollama:
      embedding:
        options:
          model: mxbai-embed-large
    vectorstore:
      qdrant:
        dimensions: 1024

Role 2 — Safety Guardrail Models

Shipped default: Gemini 2.5 Flash (gemini-2.5-flash) — fast, run at low temperature, and well suited to returning the short safe / unsafe verdict the guardrail expects. The table below lists local Ollama classifiers for the case where you replace the geminiGuard client with an Ollama-based flow.

| Model | Size | VRAM (GPU) | Latency | Notes |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Flash ⭐ default (cloud) | API | n/a | network-bound | Power RAG default via GoogleGenAiChatOptions in GuardrailService. |
| llama-guard3:8b (local option) | 4.9 GB | 6 GB | ~300 ms GPU / ~2 s CPU | Meta safety classifier; viable if you refactor guardrails back to Ollama. |
| llama-guard3:1b | 0.8 GB | 2 GB | ~80 ms GPU / ~500 ms CPU | Smaller, faster, slightly lower accuracy. Good for high-traffic deployments. |
| shieldgemma:2b | 1.6 GB | 3 GB | ~120 ms GPU / ~800 ms CPU | Google's safety classifier. Comparable accuracy to llama-guard3:1b. |
Input guardrails run on every chat query. With the default Gemini path, latency is mostly one round-trip to the Google API. If you use a local Ollama classifier instead, CPU-only hardware can add seconds per request unless you size the model down (e.g. llama-guard3:1b).
GuardrailService.java — model id at call time (excerpt)
// inputModelId defaults to gemini-2.5-flash; override with POWERRAG_GUARDRAIL_MODEL
.options(GoogleGenAiChatOptions.builder()
        .model(inputModelId)
        .temperature(0.0)
        .build())
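If you do swap in a local classifier, the model id is already externalised as a property. A hedged sketch of the override (illustrative only — changing the property alone still targets the Google API; you would also need to refactor GuardrailService to call an Ollama-backed client, as noted above):

```yaml
# Illustrative, not the shipped default:
powerrag:
  guardrails:
    input-model-id: llama-guard3:1b   # or export POWERRAG_GUARDRAIL_MODEL=llama-guard3:1b
```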

Role 3 — General RAG Chat Models

The general chat model handles document Q&A — the core use case of the application. Power RAG sends it up to 24,000 characters of context (retrieved document chunks) plus the user's question. The model must follow citation instructions ([SOURCE N] inline) and maintain coherence across the full context window.

Key requirements for this role:
  • Context window ≥ 32K tokens (24 000 chars ≈ 6 000–8 000 tokens)
  • Strong instruction following (must cite sources, respond in requested language)
  • Multilingual output (en / zh-CN / zh-TW supported by Power RAG)
  • Reasoning quality sufficient for document synthesis
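The arithmetic behind the first bullet can be sanity-checked directly. The 3–4 characters-per-token ratio is a rule of thumb for mixed English prose, not a Power RAG constant:

```java
public class ContextBudget {
    // Rule of thumb: mixed English prose averages roughly 3-4 characters per token.
    static int estimateTokens(int chars, double charsPerToken) {
        return (int) Math.ceil(chars / charsPerToken);
    }

    public static void main(String[] args) {
        // Power RAG's 24,000-character context expressed in tokens:
        System.out.println(estimateTokens(24_000, 4.0)); // optimistic: 6000 tokens
        System.out.println(estimateTokens(24_000, 3.0)); // pessimistic: 8000 tokens
    }
}
```

Either way the context fits comfortably inside a 32K-token window, with room left for the system prompt and the model's answer.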
| Model | Size | VRAM | Context | Speed (GPU) | Quality |
| --- | --- | --- | --- | --- | --- |
| qwen2.5:7b | 4.7 GB | 6 GB | 128K | ~60 tok/s | ★★★★☆ — Excellent for size. Strong multilingual (English + Chinese). |
| qwen2.5:14b ⭐ recommended | 9.0 GB | 12 GB | 128K | ~35 tok/s | ★★★★★ — Near-cloud quality. Best open-source option for RAG chat. |
| qwen2.5:32b | 20 GB | 24 GB | 128K | ~18 tok/s | ★★★★★ — Cloud-grade quality. Requires high-end GPU. |
| llama3.1:8b | 4.9 GB | 6 GB | 128K | ~55 tok/s | ★★★☆☆ — Good English quality. Weaker multilingual than Qwen. |
| llama3.3:70b | 42 GB | 48 GB | 128K | ~8 tok/s | ★★★★★ — Exceptional. Requires multi-GPU or Apple M3 Ultra. |
| phi4:14b | 8.9 GB | 12 GB | 16K | ~38 tok/s | ★★★★☆ — Microsoft research model. Outstanding reasoning per parameter. |
| mistral:7b | 4.1 GB | 5 GB | 32K | ~65 tok/s | ★★★☆☆ — Fast. Good instruction following. Limited multilingual. |
| gemma3:12b | 8.1 GB | 10 GB | 128K | ~40 tok/s | ★★★★☆ — Google's open-source model. Strong reasoning and multilingual. |
Best choice for most setups: qwen2.5:14b on a 16 GB VRAM GPU (e.g. RTX 4080, RTX 3090). It has strong native Chinese support (important for Power RAG's zh-CN/zh-TW modes), a 128K context window that comfortably holds the 24 000-char RAG context, and near-cloud output quality. At ~35 tok/s, a 300-token response arrives in under 10 seconds — an acceptable user experience.
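That latency claim is simple arithmetic. A quick sketch — the ~1 s time-to-first-token figure here is an assumption for prompt processing, not a measured Power RAG number:

```java
public class ResponseLatency {
    // total seconds = prompt-processing delay + tokens / generation speed
    static double seconds(int tokens, double tokPerSec, double ttftSec) {
        return ttftSec + tokens / tokPerSec;
    }

    public static void main(String[] args) {
        // 300-token answer at ~35 tok/s with ~1 s to first token:
        System.out.println(seconds(300, 35.0, 1.0)); // just under 10 seconds
    }
}
```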

Role 4 — Text-to-SQL Models

Text-to-SQL is the most demanding task for a local model. TextToSqlService sends a full schema description (table names, column types, enum values) plus a natural language question, and expects a correct, syntactically valid PostgreSQL SELECT statement in return. Errors are immediately visible to the user as execution failures.

General 7B models are often insufficient for Text-to-SQL. They generate plausible-looking SQL that fails on JOINs with aliases, PostgreSQL-specific syntax (ILIKE, ::cast, CTEs), or schema-specific enum values. The performance cliff between a 7B and a 14B+ coder model is significant for this task specifically.
| Model | Size | VRAM | SQL Accuracy | Notes |
| --- | --- | --- | --- | --- |
| qwen2.5-coder:32b ⭐ recommended | 19 GB | 24 GB | ★★★★★ | Best open-source SQL generation. Handles complex JOINs, CTEs, window functions. Already registered in Power RAG as OLLAMA:qwen2.5-coder:32b. |
| deepseek-coder-v2:16b ⭐ used | 9.1 GB | 12 GB | ★★★★☆ | Excellent SQL quality at 16B. The balanced choice between VRAM and accuracy. Already registered in Power RAG. |
| qwen2.5-coder:14b | 9.0 GB | 12 GB | ★★★★☆ | Strong SQL generation. Comparable to deepseek-coder-v2:16b. Good if you want a single Qwen family across all roles. |
| qwen2.5-coder:7b | 4.7 GB | 6 GB | ★★★☆☆ | Acceptable for simple single-table SELECTs. Struggles with multi-table JOINs and schema-specific enum constraints. |
| codellama:34b | 19 GB | 24 GB | ★★★★☆ | Strong but older (Meta, 2023). Qwen2.5-coder and DeepSeek outperform it on modern benchmarks. |
| sqlcoder:7b | 3.8 GB | 5 GB | ★★★☆☆ | Purpose-built for SQL by Defog AI. Good for MySQL/Postgres basic queries; limited on complex schemas or PostgreSQL-specific syntax. |
Power RAG's TextToSqlService uses Gemini 2.5 Pro by default — a cloud model with state-of-the-art SQL generation. If you need a fully local Text-to-SQL pipeline, replace the @Qualifier("geminiPro") in TextToSqlService with a dedicated local coder client.
TextToSqlService.java — swapping to a local model
// Current: uses Gemini 2.5 Pro (cloud)
public TextToSqlService(SchemaIntrospector schemaIntrospector,
                        SqlValidator sqlValidator,
                        JdbcTemplate jdbcTemplate,
                        @Qualifier("geminiPro") ChatClient chatClient) {

// To use local deepseek-coder-v2:16b instead:
//   1. Register a dedicated bean in SpringAiConfig.java:
//      @Bean @Qualifier("ollamaDeepSeekSql")
//      public ChatClient ollamaDeepSeekSqlClient(OllamaChatModel model) {
//          return ChatClient.builder(model).build();
//      }
//   2. Change qualifier here:
//        @Qualifier("ollamaDeepSeekSql") ChatClient chatClient
//   3. Override model at call time:
//        .options(OllamaChatOptions.builder()
//            .model("deepseek-coder-v2:16b").build())
}

Role 5 — Image Description (Vision Models)

ImageParser uses a multimodal LLM to generate text descriptions of image files during document ingestion. The description becomes a text chunk that is embedded and indexed — making image content searchable via RAG. This role runs only during ingestion, not on every query.

| Model | Size | VRAM | Vision Quality | Notes |
| --- | --- | --- | --- | --- |
| qwen2.5vl:7b ⭐ recommended | 5.5 GB | 8 GB | ★★★★★ | State-of-the-art open-source vision-language model (2025). Excellent at diagrams, charts, screenshots, and technical illustrations. |
| llava:13b | 8.0 GB | 10 GB | ★★★☆☆ | Classic multimodal model. Good general image description but weak on technical charts and data tables. |
| minicpm-v:8b | 5.5 GB | 7 GB | ★★★★☆ | Compact and fast. Surprisingly good on UI screenshots and document images. |
| llava-phi3:3.8b | 2.9 GB | 4 GB | ★★★☆☆ | Very small and fast. Good for simple photos. Weak on complex technical diagrams or dense text. |
| gemma3:12b | 8.1 GB | 10 GB | ★★★★☆ | Google's multimodal model. Strong at reading text within images (OCR-like) and understanding charts. |
For RAG applications that ingest technical documentation with diagrams, architecture charts, or data tables, qwen2.5vl:7b produces significantly richer descriptions than older llava models — resulting in better retrieval when users ask about visual content.

Hardware Requirements

The key insight: VRAM determines which models can run at acceptable speed. CPU inference is viable for small local classifiers but is 3–10× slower than GPU for 7B+ chat models, making CPU-only setups painful for interactive Ollama chat.
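A rough way to reason about the tiers below: weight memory ≈ parameter count × bits-per-weight / 8, plus KV-cache and runtime overhead. The ~4.5 effective bits for Q4_K_M and the 1.2× overhead factor used here are approximations for back-of-envelope sizing, not Ollama internals:

```java
public class VramEstimate {
    // paramsBillions * bytes-per-weight * overhead factor, in GB
    static double estimateGb(double paramsBillions, double bitsPerWeight, double overhead) {
        return paramsBillions * (bitsPerWeight / 8.0) * overhead;
    }

    public static void main(String[] args) {
        // 14B model at Q4_K_M (~4.5 effective bits) with ~20% overhead: ~9.5 GB
        System.out.println(estimateGb(14, 4.5, 1.2));
        // the same model at fp16: ~33.6 GB, which is why quantisation matters
        System.out.println(estimateGb(14, 16, 1.2));
    }
}
```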

GPU VRAM Tiers

| VRAM | Example GPUs | What fits | User Experience |
| --- | --- | --- | --- |
| 4–6 GB | RTX 3060, RTX 4060 | qwen2.5:7b (q4), small local guard classifiers if used | ⚠️ Marginal — chat ~40 s/response on CPU. OK for demos with cloud embeddings/guardrails. |
| 8–12 GB | RTX 3080, RTX 4070, RTX 4080 | qwen2.5:14b, deepseek-coder-v2:16b (q4), larger local guard models if used | ✅ Good — chat responses in 8–15 s. This is the minimum for a comfortable user experience. |
| 16–24 GB | RTX 4090, RTX 3090, A5000 | All above + qwen2.5-coder:32b, qwen2.5:32b, all vision models | ✅✅ Excellent — responses in 5–10 s. Can run all 5 roles locally simultaneously. |
| 48–80 GB | A100, H100, 2× RTX 4090 | llama3.3:70b, qwen2.5:72b | ✅✅✅ Cloud-grade — full 70B models locally. Near-zero latency for smaller models. |

Apple Silicon (Unified Memory)

Apple M-series chips use unified memory shared between CPU and GPU. This is excellent for LLMs because the full memory pool is available for model weights:

| Chip / RAM | Recommended Models | Notes |
| --- | --- | --- |
| M1 / M2 (8 GB) | qwen2.5:7b, small local guard classifiers if used | ⚠️ Tight. Chat model will be slow (~15 tok/s). Embed + guardrail are fine. |
| M2 / M3 (16 GB) | qwen2.5:14b, deepseek-coder-v2:16b (q4), larger local guard models if used | ✅ Good. This is a popular developer configuration. ~25–30 tok/s for 14B models. |
| M2 / M3 Pro (32 GB) | qwen2.5:32b, qwen2.5-coder:32b, all vision models | ✅✅ Excellent. Comfortable for all roles. ~18–22 tok/s for 32B models. |
| M2 / M3 Max / Ultra (64–192 GB) | llama3.3:70b, qwen2.5:72b | ✅✅✅ Exceptional. Full 70B models run at 12–18 tok/s — production-ready. |

RAM (System Memory) — CPU-Only

If no dedicated GPU is available, Ollama falls back to CPU inference. Minimum requirements:

| Model Size | Min RAM | CPU Speed | Practical Use |
| --- | --- | --- | --- |
| small local embedder (~300 MB) | 4 GB | ~15 ms/embed | ✅ CPU OK for tiny embedding models only |
| 1–3B models | 8 GB | ~8 tok/s | ✅ Guardrails only — acceptable latency |
| 7–8B models | 16 GB | ~3–5 tok/s | ⚠️ Very slow for chat (~60–100 s/response) |
| 13–14B models | 32 GB | ~1–2 tok/s | ❌ Not viable for interactive use |

Recommended Configurations by Use Case

Configuration A — Minimum Viable (Developer Laptop, No Dedicated GPU)

Ollama pull commands — minimum viable local stack
# Default stack: set GOOGLE_API_KEY (embeddings + input guard + Gemini chat as needed)
# Ollama optional — only if you want local chat models, e.g.:
# ollama pull qwen2.5-coder:7b

# Hybrid: cloud embeddings/guardrails + local Ollama chat — no nomic/llama-guard pulls required

Configuration B — Balanced Local (16 GB VRAM / M3 Pro 32 GB)

Ollama pull commands — balanced all-local stack
# If fully local beyond chat, add your own embedding + guard Ollama models and reconfigure.
ollama pull qwen2.5:14b             # General RAG chat (primary)
ollama pull deepseek-coder-v2:16b   # Text-to-SQL
ollama pull qwen2.5vl:7b            # Image description (ingestion)

Full local isolation requires replacing the Google embedding and guardrail beans, not just emptying API keys. To default to local chat while keeping Google embeddings and guardrails, point Ollama at your preferred model:

application.yml — emphasise local chat (embeddings still Google unless you change them)
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: qwen2.5:14b
    google:
      genai:
        api-key: ${GOOGLE_API_KEY}  # still required for gemini-embedding-001 + guardrails

Configuration C — Maximum Quality (24 GB VRAM / M3 Max 64 GB)

Ollama pull commands — maximum quality local stack
ollama pull qwen2.5:32b              # General RAG chat
ollama pull qwen2.5-coder:32b        # Text-to-SQL (excellent accuracy)
ollama pull qwen2.5vl:7b             # Image description

Performance Tuning for User Experience

A RAG response chains several steps: input guard (Gemini by default) → cache lookup (embedding API) → retrieval (embedding again) → main LLM. When using Ollama for chat, keep models warm — latency compounds.

Tip 1 — Keep models loaded in memory
Ollama unloads models after 5 minutes of inactivity by default. The first request after unload takes 5–30s to reload. For production, set:
docker-compose.yml — keep models warm
services:
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "24h"   # keep all recently-used models in VRAM
Tip 2 — Use quantised models (Q4_K_M)
Ollama automatically uses Q4_K_M quantisation for most models. This halves VRAM usage with <5% quality loss. A 14B model at Q4 fits in 9 GB VRAM instead of 28 GB. Always prefer quantised over full-precision for interactive use.
Tip 3 — Parallelise embedding and guardrail
The guardrail check (Stage 0) and the semantic cache lookup (Stage 1) are independent. If you refactor RagService to call them via CompletableFuture, you save the guardrail latency on cache-hit paths.
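A minimal sketch of that refactor, with stub methods standing in for GuardrailService and the semantic-cache lookup (the names and return types here are illustrative, not the real RagService API):

```java
import java.util.concurrent.CompletableFuture;

public class ParallelPreflight {
    record Preflight(boolean safe, String cachedAnswer) {}

    static Preflight run(String query) {
        // Stage 0 and Stage 1 have no data dependency, so start both at once.
        CompletableFuture<Boolean> guard = CompletableFuture.supplyAsync(() -> checkInput(query));
        CompletableFuture<String> cache = CompletableFuture.supplyAsync(() -> cacheLookup(query));
        // join() waits for both: total latency is max(guard, cache), not their sum.
        return new Preflight(guard.join(), cache.join());
    }

    // Stubs standing in for the real guardrail call and the Redis semantic cache.
    static boolean checkInput(String q) { return !q.contains("attack"); }
    static String cacheLookup(String q) { return null; } // null = cache miss

    public static void main(String[] args) {
        Preflight p = run("what is hybrid retrieval?");
        System.out.println(p.safe() + " / cached=" + (p.cachedAnswer() != null));
    }
}
```

On a cache hit, the answer and the safety verdict arrive together, so the guardrail round-trip no longer adds to the critical path.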
Tip 4 — Tune the semantic cache threshold
The cache threshold of 0.92 is conservative. On repeated-query workloads (e.g. a support chatbot), lowering it to 0.88 can dramatically increase cache hit rates at acceptable semantic accuracy. Lower it in application.yml:
application.yml — cache threshold
powerrag:
  models:
    cache:
      similarity-threshold: 0.88   # was 0.92 — more aggressive caching

Quick Reference — Role to Model Mapping

| Role | Current (Power RAG) | Local Minimum | Local Recommended | Local Maximum Quality |
| --- | --- | --- | --- | --- |
| Embedding | gemini-embedding-001 (cloud, 768-dim) | nomic-embed-text | nomic-embed-text / mxbai-embed-large | snowflake-arctic-embed2 |
| Guardrail | Gemini 2.5 Flash (gemini-2.5-flash) | llama-guard3:1b | llama-guard3:8b | llama-guard3:8b |
| RAG Chat | claude-sonnet-4-6 (cloud) | qwen2.5:7b | qwen2.5:14b | qwen2.5:32b |
| Text-to-SQL | gemini-2.5-pro (cloud) | qwen2.5-coder:7b | deepseek-coder-v2:16b | qwen2.5-coder:32b |
| Image Description | claude-sonnet-4-6 (cloud) | llava-phi3:3.8b | qwen2.5vl:7b | gemma3:12b |