Keyword Search with PostgreSQL FTS
Why Keyword Search Alongside Vector Search?
Consider a query like "Find invoice INV-2024-001". The embedding of "INV-2024-001" will be generic — any document about invoices will score similarly. The exact string "INV-2024-001" must be matched literally. Keyword search handles this perfectly.
Similarly for:
- People's names, company names, product SKUs
- Legal clause references ("Section 4.2.1")
- Technical identifiers ("CVE-2024-1234", "PR #42")
PostgreSQL tsvector and GIN Index
The document_chunks table stores raw chunk text in a chunk_text column. The migration adds a generated tsvector column that PostgreSQL automatically maintains, and a GIN index for fast FTS queries.
ALTER TABLE document_chunks
ADD COLUMN chunk_text_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', chunk_text)) STORED;
CREATE INDEX idx_document_chunks_fts
ON document_chunks USING GIN(chunk_text_tsv);
GENERATED ALWAYS AS (...) STORED means PostgreSQL recomputes the tsvector automatically whenever chunk_text changes — no application code needed to maintain it.
Spring Data FTS Query
@Query(value = """
SELECT * FROM document_chunks
WHERE chunk_text_tsv @@ plainto_tsquery('english', :query)
ORDER BY ts_rank(chunk_text_tsv, plainto_tsquery('english', :query)) DESC
LIMIT :limit
""", nativeQuery = true)
List<DocumentChunk> fullTextSearch(@Param("query") String query,
@Param("limit") int limit);
Query Components
plainto_tsquery('english', :query)— parses the query string into a tsquery, handling stemming and stop words automatically (e.g., "running" matches "runs")@@— the FTS match operator: returns true if the tsvector contains the tsqueryts_rank(...)— ranks results by term frequency within the chunk
Connecting FTS Results to Qdrant
The FTS query returns DocumentChunk entities from PostgreSQL. Each entity carries a qdrantId field — the UUID that was used when the chunk was inserted into Qdrant. The RRF merge uses this ID to deduplicate when both retrieval sources return the same chunk.
plainto_tsquery is simpler and safer than to_tsquery — it handles arbitrary user input without requiring the user to know FTS query syntax. Use to_tsquery only if you need Boolean operators like AND/OR/NOT in the query.