Local Open-Source LLMs

Module 9 · ~18 min read
Default stack: Knowledge-base and cache embeddings use Google gemini-embedding-001 (768-dim); input safety uses Gemini 2.5 Flash (gemini-2.5-flash). Ollama is for optional local chat (e.g. Qwen, DeepSeek) when the user picks an Ollama model. This topic still maps open-source models to pipeline roles if you move embeddings or guardrails off the cloud, and gives hardware guidance for local LLMs.

Why Run LLMs Locally?

Cloud APIs are the fastest way to get started, but local models become attractive once any of these apply:

| Reason | Details |
| --- | --- |
| Data privacy | Documents never leave your network — essential for legal, medical, or government data |
| Cost at scale | After ~1 M tokens/day, GPU hardware pays for itself vs. cloud API bills |
| Offline / air-gapped | Factory floors, secure facilities, or fieldwork with no internet access |
| Latency control | Avoid round-trip network latency; local GPU can be faster than remote API for small models |
| No rate limits | Burst as hard as your GPU allows |
Hybrid approach: You can use cloud APIs for embeddings + guardrails (as in the reference application.yml) and Ollama only for chat, or reconfigure to run more of the stack locally — at the cost of maintaining compatible embedding dimensions and extra VRAM.

The Five LLM Roles in Power RAG

Before picking a model, identify which role it needs to fill. Each role has a different demand profile:

Role 1 — Embedding (every chunk at ingest, every query for dense search + semantic cache)
Default: Google gemini-embedding-001 at 768 dimensions (requires GOOGLE_API_KEY).
Required by: document ingestion, semantic cache, Qdrant hybrid retrieval.
Demand: high throughput, stable dimensionality aligned with spring.ai.vectorstore.qdrant.dimensions.
Role 2 — Safety Guardrail (every user query, before retrieval / main LLM)
Default: Gemini 2.5 Flash via GuardrailService (powerrag.guardrails.input-model-id).
Demand: fast classification and a consistent safe / unsafe response format.
Role 3 — General RAG Chat (one call per non-cached query)
Required by: RagService.callLlm() for document Q&A.
Demand: strong instruction following, a context window large enough for the ~24 000-character RAG context (roughly 6 000–8 000 tokens), and the ability to cite [SOURCE N] inline. Document parsing generates long contexts — the model must not lose track of citations halfway through.
Role 4 — Text-to-SQL (one call per SQL query)
Required by: TextToSqlService.
Demand: very strong SQL generation, PostgreSQL dialect, understanding of schema descriptions and enum hints. This is where weak models fail most visibly — incorrect SQL causes execution errors that the user sees immediately.
Role 5 — Image Description (one call per uploaded image file)
Required by: ImageParser during document ingestion.
Demand: vision capability, descriptive output. Runs only when image files are ingested, not on every query — latency is less critical than for roles 1–4.

Role 1 — Embedding Models

In the reference project, embeddings are not local: they go through Google GenAI. If you switch to a local EmbeddingModel, you must remove the Ollama embedding autoconfig exclusion (or equivalent), set Qdrant dimensions to match the new model, and re-ingest all documents.
| Model | Dimensions | Size | Speed (CPU) | Notes |
| --- | --- | --- | --- | --- |
| gemini-embedding-001 ⭐ default (cloud) | 768 (configured) | API | network-bound | Used in Power RAG via Spring AI Google GenAI; same model for Qdrant + Redis cache. |
| nomic-embed-text (local option) | 768 | 274 MB | ~10 ms/chunk | Common Ollama choice if you self-host embeddings; must re-enable Ollama embedding autoconfig and match dimensions. |
| mxbai-embed-large | 1024 | 670 MB | ~25 ms/chunk | Higher quality MTEB scores; requires changing Qdrant dimensions to 1024. |
| all-minilm:l6-v2 | 384 | 23 MB | ~3 ms/chunk | Extremely fast and tiny. Lower recall — only for high-volume, latency-sensitive setups. |
| snowflake-arctic-embed2 | 1024 | 560 MB | ~20 ms/chunk | State-of-the-art retrieval quality; good upgrade path from smaller local embedders. |
Important: Changing embedding model or dimension count requires deleting the Qdrant collection, updating spring.ai.vectorstore.qdrant.dimensions, and full re-ingestion.
Example — local Ollama embedding (illustrative; not the repo default)
# You would remove OllamaEmbeddingAutoConfiguration from spring.autoconfigure.exclude,
# then align dimensions, e.g.:
spring:
  ai:
    ollama:
      embedding:
        options:
          model: mxbai-embed-large
    vectorstore:
      qdrant:
        dimensions: 1024

Role 2 — Safety Guardrail Models

Shipped default: Gemini 2.5 Flash (gemini-2.5-flash) — fast, run at low temperature, and well suited to returning the short safe / unsafe verdict the guardrail expects. The table below lists local Ollama classifiers for the case where you replace the geminiGuard client with an Ollama-based flow.

| Model | Size | VRAM (GPU) | Latency | Notes |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Flash ⭐ default (cloud) | API | n/a | network-bound | Power RAG default via GoogleGenAiChatOptions in GuardrailService. |
| llama-guard3:8b (local option) | 4.9 GB | 6 GB | ~300 ms GPU / ~2 s CPU | Meta safety classifier; viable if you refactor guardrails back to Ollama. |
| llama-guard3:1b | 0.8 GB | 2 GB | ~80 ms GPU / ~500 ms CPU | Smaller, faster, slightly lower accuracy. Good for high-traffic deployments. |
| shieldgemma:2b | 1.6 GB | 3 GB | ~120 ms GPU / ~800 ms CPU | Google's safety classifier. Comparable accuracy to llama-guard3:1b. |
Input guardrails run on every chat query. With the default Gemini path, latency is mostly one round-trip to the Google API. If you use a local Ollama classifier instead, CPU-only hardware can add seconds per request unless you size the model down (e.g. llama-guard3:1b).
GuardrailService.java — model id at call time (excerpt)
// inputModelId defaults to gemini-2.5-flash; override with POWERRAG_GUARDRAIL_MODEL
.options(GoogleGenAiChatOptions.builder()
        .model(inputModelId)
        .temperature(0.0)
        .build())
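If you do swap in a local classifier, the model id is already externalised as a property. A hedged sketch of the override (illustrative only — changing the property alone still targets the Google API; you would also need to refactor GuardrailService to call an Ollama-backed client, as noted above):

```yaml
# Illustrative, not the shipped default:
powerrag:
  guardrails:
    input-model-id: llama-guard3:1b   # or export POWERRAG_GUARDRAIL_MODEL=llama-guard3:1b
```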

Role 3 — General RAG Chat Models

The general chat model handles document Q&A — the core use case of the application. Power RAG sends it up to 24,000 characters of context (retrieved document chunks) plus the user's question. The model must follow citation instructions ([SOURCE N] inline) and maintain coherence across the full context window.

Key requirements for this role:
  • Context window ≥ 32K tokens (24 000 chars ≈ 6 000–8 000 tokens)
  • Strong instruction following (must cite sources, respond in requested language)
  • Multilingual output (en / zh-CN / zh-TW supported by Power RAG)
  • Reasoning quality sufficient for document synthesis
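The arithmetic behind the first bullet can be sanity-checked directly. The 3–4 characters-per-token ratio is a rule of thumb for mixed English prose, not a Power RAG constant:

```java
public class ContextBudget {
    // Rule of thumb: mixed English prose averages roughly 3-4 characters per token.
    static int estimateTokens(int chars, double charsPerToken) {
        return (int) Math.ceil(chars / charsPerToken);
    }

    public static void main(String[] args) {
        // Power RAG's 24,000-character context expressed in tokens:
        System.out.println(estimateTokens(24_000, 4.0)); // optimistic: 6000 tokens
        System.out.println(estimateTokens(24_000, 3.0)); // pessimistic: 8000 tokens
    }
}
```

Either way the context fits comfortably inside a 32K-token window, with room left for the system prompt and the model's answer.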
| Model | Size | VRAM | Context | Speed (GPU) | Quality |
| --- | --- | --- | --- | --- | --- |
| qwen2.5:7b | 4.7 GB | 6 GB | 128K | ~60 tok/s | ★★★★☆ — Excellent for size. Strong multilingual (English + Chinese). |
| qwen2.5:14b ⭐ recommended | 9.0 GB | 12 GB | 128K | ~35 tok/s | ★★★★★ — Near-cloud quality. Best open-source option for RAG chat. |
| qwen2.5:32b | 20 GB | 24 GB | 128K | ~18 tok/s | ★★★★★ — Cloud-grade quality. Requires high-end GPU. |
| llama3.1:8b | 4.9 GB | 6 GB | 128K | ~55 tok/s | ★★★☆☆ — Good English quality. Weaker multilingual than Qwen. |
| llama3.3:70b | 42 GB | 48 GB | 128K | ~8 tok/s | ★★★★★ — Exceptional. Requires multi-GPU or Apple M3 Ultra. |
| phi4:14b | 8.9 GB | 12 GB | 16K | ~38 tok/s | ★★★★☆ — Microsoft research model. Outstanding reasoning per parameter. |
| mistral:7b | 4.1 GB | 5 GB | 32K | ~65 tok/s | ★★★☆☆ — Fast. Good instruction following. Limited multilingual. |
| gemma3:12b | 8.1 GB | 10 GB | 128K | ~40 tok/s | ★★★★☆ — Google's open-source model. Strong reasoning and multilingual. |
Best choice for most setups: qwen2.5:14b on a 16 GB VRAM GPU (e.g. RTX 4080, RTX 3090). It has strong native Chinese support (important for Power RAG's zh-CN/zh-TW modes), a 128K context window that comfortably holds the 24 000-char RAG context, and near-cloud output quality. At ~35 tok/s, a 300-token response arrives in under 10 seconds — an acceptable user experience.
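That latency claim is simple arithmetic. A quick sketch — the ~1 s time-to-first-token figure here is an assumption for prompt processing, not a measured Power RAG number:

```java
public class ResponseLatency {
    // total seconds = prompt-processing delay + tokens / generation speed
    static double seconds(int tokens, double tokPerSec, double ttftSec) {
        return ttftSec + tokens / tokPerSec;
    }

    public static void main(String[] args) {
        // 300-token answer at ~35 tok/s with ~1 s to first token:
        System.out.println(seconds(300, 35.0, 1.0)); // just under 10 seconds
    }
}
```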

Role 4 — Text-to-SQL Models

Text-to-SQL is the most demanding task for a local model. TextToSqlService sends a full schema description (table names, column types, enum values) plus a natural language question, and expects a correct, syntactically valid PostgreSQL SELECT statement in return. Errors are immediately visible to the user as execution failures.

General 7B models are often insufficient for Text-to-SQL. They generate plausible-looking SQL that fails on JOINs with aliases, PostgreSQL-specific syntax (ILIKE, ::cast, CTEs), or schema-specific enum values. The performance cliff between a 7B and a 14B+ coder model is significant for this task specifically.
| Model | Size | VRAM | SQL Accuracy | Notes |
| --- | --- | --- | --- | --- |
| qwen2.5-coder:32b ⭐ recommended | 19 GB | 24 GB | ★★★★★ | Best open-source SQL generation. Handles complex JOINs, CTEs, window functions. Already registered in Power RAG as OLLAMA:qwen2.5-coder:32b. |
| deepseek-coder-v2:16b ⭐ used | 9.1 GB | 12 GB | ★★★★☆ | Excellent SQL quality at 16B. The balanced choice between VRAM and accuracy. Already registered in Power RAG. |
| qwen2.5-coder:14b | 9.0 GB | 12 GB | ★★★★☆ | Strong SQL generation. Comparable to deepseek-coder-v2:16b. Good if you want a single Qwen family across all roles. |
| qwen2.5-coder:7b | 4.7 GB | 6 GB | ★★★☆☆ | Acceptable for simple single-table SELECTs. Struggles with multi-table JOINs and schema-specific enum constraints. |
| codellama:34b | 19 GB | 24 GB | ★★★★☆ | Strong but older (Meta, 2023). Qwen2.5-coder and DeepSeek outperform it on modern benchmarks. |
| sqlcoder:7b | 3.8 GB | 5 GB | ★★★☆☆ | Purpose-built for SQL by Defog AI. Good for MySQL/Postgres basic queries; limited on complex schemas or PostgreSQL-specific syntax. |
Power RAG's TextToSqlService uses Gemini 2.5 Pro by default — a cloud model with state-of-the-art SQL generation. If you need a fully local Text-to-SQL pipeline, replace the @Qualifier("geminiPro") in TextToSqlService with a dedicated local coder client.
TextToSqlService.java — swapping to a local model
// Current: uses Gemini 2.5 Pro (cloud)
public TextToSqlService(SchemaIntrospector schemaIntrospector,
                        SqlValidator sqlValidator,
                        JdbcTemplate jdbcTemplate,
                        @Qualifier("geminiPro") ChatClient chatClient) {

// To use local deepseek-coder-v2:16b instead:
//   1. Register a dedicated bean in SpringAiConfig.java:
//      @Bean @Qualifier("ollamaDeepSeekSql")
//      public ChatClient ollamaDeepSeekSqlClient(OllamaChatModel model) {
//          return ChatClient.builder(model).build();
//      }
//   2. Change qualifier here:
//        @Qualifier("ollamaDeepSeekSql") ChatClient chatClient
//   3. Override model at call time:
//        .options(OllamaChatOptions.builder()
//            .model("deepseek-coder-v2:16b").build())
}

Role 5 — Image Description (Vision Models)

ImageParser uses a multimodal LLM to generate text descriptions of image files during document ingestion. The description becomes a text chunk that is embedded and indexed — making image content searchable via RAG. This role runs only during ingestion, not on every query.

| Model | Size | VRAM | Vision Quality | Notes |
| --- | --- | --- | --- | --- |
| qwen2.5vl:7b ⭐ recommended | 5.5 GB | 8 GB | ★★★★★ | State-of-the-art open-source vision-language model (2025). Excellent at diagrams, charts, screenshots, and technical illustrations. |
| llava:13b | 8.0 GB | 10 GB | ★★★☆☆ | Classic multimodal model. Good general image description but weak on technical charts and data tables. |
| minicpm-v:8b | 5.5 GB | 7 GB | ★★★★☆ | Compact and fast. Surprisingly good on UI screenshots and document images. |
| llava-phi3:3.8b | 2.9 GB | 4 GB | ★★★☆☆ | Very small and fast. Good for simple photos. Weak on complex technical diagrams or dense text. |
| gemma3:12b | 8.1 GB | 10 GB | ★★★★☆ | Google's multimodal model. Strong at reading text within images (OCR-like) and understanding charts. |
For RAG applications that ingest technical documentation with diagrams, architecture charts, or data tables, qwen2.5vl:7b produces significantly richer descriptions than older llava models — resulting in better retrieval when users ask about visual content.

Hardware Requirements

The key insight: VRAM determines which models can run at acceptable speed. CPU inference is viable for small local classifiers but is 3–10× slower than GPU for 7B+ chat models, making CPU-only setups painful for interactive Ollama chat.
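A rough way to reason about the tiers below: weight memory ≈ parameter count × bits-per-weight / 8, plus KV-cache and runtime overhead. The ~4.5 effective bits for Q4_K_M and the 1.2× overhead factor used here are approximations for back-of-envelope sizing, not Ollama internals:

```java
public class VramEstimate {
    // paramsBillions * bytes-per-weight * overhead factor, in GB
    static double estimateGb(double paramsBillions, double bitsPerWeight, double overhead) {
        return paramsBillions * (bitsPerWeight / 8.0) * overhead;
    }

    public static void main(String[] args) {
        // 14B model at Q4_K_M (~4.5 effective bits) with ~20% overhead: ~9.5 GB
        System.out.println(estimateGb(14, 4.5, 1.2));
        // the same model at fp16: ~33.6 GB, which is why quantisation matters
        System.out.println(estimateGb(14, 16, 1.2));
    }
}
```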

GPU VRAM Tiers

| VRAM | Example GPUs | What fits | User Experience |
| --- | --- | --- | --- |
| 4–6 GB | RTX 3060, RTX 4060 | qwen2.5:7b (q4), small local guard classifiers if used | ⚠️ Marginal — chat ~40 s/response on CPU. OK for demos with cloud embeddings/guardrails. |
| 8–12 GB | RTX 3080, RTX 4070, RTX 4080 | qwen2.5:14b, deepseek-coder-v2:16b (q4), larger local guard models if used | ✅ Good — chat responses in 8–15 s. This is the minimum for a comfortable user experience. |
| 16–24 GB | RTX 4090, RTX 3090, A5000 | All above + qwen2.5-coder:32b, qwen2.5:32b, all vision models | ✅✅ Excellent — responses in 5–10 s. Can run all 5 roles locally simultaneously. |
| 48–80 GB | A100, H100, 2× RTX 4090 | llama3.3:70b, qwen2.5:72b | ✅✅✅ Cloud-grade — full 70B models locally. Near-zero latency for smaller models. |

Apple Silicon (Unified Memory)

Apple M-series chips use unified memory shared between CPU and GPU. This is excellent for LLMs because the full memory pool is available for model weights:

| Chip / RAM | Recommended Models | Notes |
| --- | --- | --- |
| M1 / M2 (8 GB) | qwen2.5:7b, small local guard classifiers if used | ⚠️ Tight. Chat model will be slow (~15 tok/s). Embed + guardrail are fine. |
| M2 / M3 (16 GB) | qwen2.5:14b, deepseek-coder-v2:16b (q4), larger local guard models if used | ✅ Good. This is a popular developer configuration. ~25–30 tok/s for 14B models. |
| M2 / M3 Pro (32 GB) | qwen2.5:32b, qwen2.5-coder:32b, all vision models | ✅✅ Excellent. Comfortable for all roles. ~18–22 tok/s for 32B models. |
| M2 / M3 Max / Ultra (64–192 GB) | llama3.3:70b, qwen2.5:72b | ✅✅✅ Exceptional. Full 70B models run at 12–18 tok/s — production-ready. |

RAM (System Memory) — CPU-Only

If no dedicated GPU is available, Ollama falls back to CPU inference. Minimum requirements:

| Model Size | Min RAM | CPU Speed | Practical Use |
| --- | --- | --- | --- |
| small local embedder (~300 MB) | 4 GB | ~15 ms/embed | ✅ CPU OK for tiny embedding models only |
| 1–3B models | 8 GB | ~8 tok/s | ✅ Guardrails only — acceptable latency |
| 7–8B models | 16 GB | ~3–5 tok/s | ⚠️ Very slow for chat (~60–100 s/response) |
| 13–14B models | 32 GB | ~1–2 tok/s | ❌ Not viable for interactive use |

Recommended Configurations by Use Case

Configuration A — Minimum Viable (Developer Laptop, No Dedicated GPU)

Ollama pull commands — minimum viable local stack
# Default stack: set GOOGLE_API_KEY (embeddings + input guard + Gemini chat as needed)
# Ollama optional — only if you want local chat models, e.g.:
# ollama pull qwen2.5-coder:7b

# Hybrid: cloud embeddings/guardrails + local Ollama chat — no nomic/llama-guard pulls required

Configuration B — Balanced Local (16 GB VRAM / M3 Pro 32 GB)

Ollama pull commands — balanced all-local stack
# If fully local beyond chat, add your own embedding + guard Ollama models and reconfigure.
ollama pull qwen2.5:14b             # General RAG chat (primary)
ollama pull deepseek-coder-v2:16b   # Text-to-SQL
ollama pull qwen2.5vl:7b            # Image description (ingestion)

Full local isolation requires replacing the Google embedding and guardrail beans, not just emptying API keys. To default to local chat while keeping Google embeddings and guardrails, point Ollama at your preferred model:

application.yml — emphasise local chat (embeddings still Google unless you change them)
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: qwen2.5:14b
    google:
      genai:
        api-key: ${GOOGLE_API_KEY}  # still required for gemini-embedding-001 + guardrails

Configuration C — Maximum Quality (24 GB VRAM / M3 Max 64 GB)

Ollama pull commands — maximum quality local stack
ollama pull qwen2.5:32b              # General RAG chat
ollama pull qwen2.5-coder:32b        # Text-to-SQL (excellent accuracy)
ollama pull qwen2.5vl:7b             # Image description

Performance Tuning for User Experience

A RAG response chains several steps: input guard (Gemini by default) → cache lookup (embedding API) → retrieval (embedding again) → main LLM. When using Ollama for chat, keep models warm — latency compounds.

Tip 1 — Keep models loaded in memory
Ollama unloads models after 5 minutes of inactivity by default. The first request after unload takes 5–30s to reload. For production, set:
docker-compose.yml — keep models warm
services:
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "24h"   # keep all recently-used models in VRAM
Tip 2 — Use quantised models (Q4_K_M)
Ollama automatically uses Q4_K_M quantisation for most models. This halves VRAM usage with <5% quality loss. A 14B model at Q4 fits in 9 GB VRAM instead of 28 GB. Always prefer quantised over full-precision for interactive use.
Tip 3 — Parallelise embedding and guardrail
The guardrail check (Stage 0) and the semantic cache lookup (Stage 1) are independent. If you refactor RagService to call them via CompletableFuture, you save the guardrail latency on cache-hit paths.
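A minimal sketch of that refactor, with stub methods standing in for GuardrailService and the semantic-cache lookup (the names and return types here are illustrative, not the real RagService API):

```java
import java.util.concurrent.CompletableFuture;

public class ParallelPreflight {
    record Preflight(boolean safe, String cachedAnswer) {}

    static Preflight run(String query) {
        // Stage 0 and Stage 1 have no data dependency, so start both at once.
        CompletableFuture<Boolean> guard = CompletableFuture.supplyAsync(() -> checkInput(query));
        CompletableFuture<String> cache = CompletableFuture.supplyAsync(() -> cacheLookup(query));
        // join() waits for both: total latency is max(guard, cache), not their sum.
        return new Preflight(guard.join(), cache.join());
    }

    // Stubs standing in for the real guardrail call and the Redis semantic cache.
    static boolean checkInput(String q) { return !q.contains("attack"); }
    static String cacheLookup(String q) { return null; } // null = cache miss

    public static void main(String[] args) {
        Preflight p = run("what is hybrid retrieval?");
        System.out.println(p.safe() + " / cached=" + (p.cachedAnswer() != null));
    }
}
```

On a cache hit, the answer and the safety verdict arrive together, so the guardrail round-trip no longer adds to the critical path.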
Tip 4 — Tune the semantic cache threshold
The cache threshold of 0.92 is conservative. On repeated-query workloads (e.g. a support chatbot), lowering it to 0.88 can dramatically increase cache hit rates at acceptable semantic accuracy. Lower it in application.yml:
application.yml — cache threshold
powerrag:
  models:
    cache:
      similarity-threshold: 0.88   # was 0.92 — more aggressive caching

Quick Reference — Role to Model Mapping

| Role | Current (Power RAG) | Local Minimum | Local Recommended | Local Maximum Quality |
| --- | --- | --- | --- | --- |
| Embedding | gemini-embedding-001 (cloud, 768-dim) | nomic-embed-text | nomic-embed-text / mxbai-embed-large | snowflake-arctic-embed2 |
| Guardrail | Gemini 2.5 Flash (gemini-2.5-flash) | llama-guard3:1b | llama-guard3:8b | llama-guard3:8b |
| RAG Chat | claude-sonnet-4-6 (cloud) | qwen2.5:7b | qwen2.5:14b | qwen2.5:32b |
| Text-to-SQL | gemini-2.5-pro (cloud) | qwen2.5-coder:7b | deepseek-coder-v2:16b | qwen2.5-coder:32b |
| Image Description | claude-sonnet-4-6 (cloud) | llava-phi3:3.8b | qwen2.5vl:7b | gemma3:12b |