Document Parsing

Module 3 · ~9 min read
Before a document can be searched, it must be parsed into structured text sections. Power RAG defines a DocumentParser interface and provides a concrete implementation for each supported file format. Spring auto-discovers and injects all implementations as a list.

The DocumentParser Interface

DocumentParser.java
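import java.io.InputStream;
import java.util.List;
public interface DocumentParser {
    // File extension this parser handles, e.g. "pdf"
    String supportedExtension();
    // Extracts the document's content as a list of sections
    List<ParsedSection> parse(InputStream input, String fileName);
}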

Each implementation declares which file extension it handles via supportedExtension() and performs the actual parsing via parse(). The output is a list of ParsedSection records — each section contains the extracted text plus metadata (page number, row number, section title, etc.).
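The text above describes ParsedSection only in outline, so the following is a minimal sketch of what such a record might look like. The two components (text and a metadata map) and the "page" key are assumptions for illustration, not the project's actual definition.

```java
import java.util.List;
import java.util.Map;

// Hypothetical shape for ParsedSection: the component names are assumptions,
// not taken from the Power RAG source.
record ParsedSection(String text, Map<String, Object> metadata) {
    ParsedSection {
        // Defensive copy so callers cannot mutate a section's metadata later.
        metadata = Map.copyOf(metadata);
    }
}

class ParsedSectionExample {
    // Builds the kind of list a per-page parser might return for a two-page file.
    static List<ParsedSection> twoPages() {
        return List.of(
            new ParsedSection("Text of page one", Map.of("page", 1)),
            new ParsedSection("Text of page two", Map.of("page", 2)));
    }

    public static void main(String[] args) {
        for (ParsedSection section : twoPages()) {
            System.out.println(section.metadata().get("page") + ": " + section.text());
        }
    }
}
```

Keeping metadata in a map lets each parser attach whatever keys suit its section unit (page, row, slide) without changing the record.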

Supported Parsers

Extension        | Parser Class     | Library                  | Section Unit
pdf              | PdfParser        | Apache PDFBox            | Per page: each page becomes one ParsedSection
docx             | WordParser       | Apache POI               | Per paragraph: preserves heading structure
xlsx             | ExcelParser      | Apache POI               | Per row: each data row is one section with column metadata
pptx             | PowerPointParser | Apache POI               | Per slide: slide title and body text combined
java             | JavaSourceParser | JavaParser               | Per method / class: extracts Javadoc and method signatures
png / jpg / webp | ImageParser      | Claude / Gemini (vision) | Full image: sends to LLM for description, stores description as text

Spring Auto-Injection of All Parsers

Each parser is annotated with @Component. Spring collects all beans implementing DocumentParser into a List<DocumentParser> automatically — no explicit registration needed. The service builds a map keyed by extension for O(1) lookup.

DocumentIngestionService.java — parser injection
@Service
public class DocumentIngestionService {
    private final Map<String, DocumentParser> parsersByExt;

    public DocumentIngestionService(List<DocumentParser> parsers, ...) {
        // Spring injects all @Component beans implementing DocumentParser.
        // toMap throws IllegalStateException if two parsers report the same extension.
        this.parsersByExt = parsers.stream()
            .collect(Collectors.toMap(
                DocumentParser::supportedExtension, p -> p));
    }
}
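How the service picks a parser for an incoming file is not shown here, so the following is a sketch of one plausible lookup. ParserLookup and forFile are hypothetical names, the ParsedSection components are assumed, and lower-casing the extension is a design choice of this sketch rather than confirmed project behavior.

```java
import java.io.InputStream;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

// Stand-ins so the sketch compiles on its own; ParsedSection's components are assumed.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// Hypothetical helper, not a class from the project.
class ParserLookup {
    private final Map<String, DocumentParser> parsersByExt;

    ParserLookup(List<DocumentParser> parsers) {
        // toMap throws IllegalStateException when two parsers report the same extension.
        this.parsersByExt = parsers.stream()
            .collect(Collectors.toMap(
                p -> p.supportedExtension().toLowerCase(Locale.ROOT),
                Function.identity()));
    }

    // Returns the parser for the file's extension, or empty when none is registered.
    Optional<DocumentParser> forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension to dispatch on
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(parsersByExt.get(ext));
    }
}
```

Returning an Optional leaves the caller to decide what an unsupported format means, for example rejecting the upload or skipping the file.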

Adding a New Parser

To support a new format (e.g., csv), you only need to:

  1. Create a class implementing DocumentParser
  2. Annotate it with @Component
  3. Implement supportedExtension() returning "csv"
  4. Implement parse()

Spring discovers and registers it automatically. No configuration change needed.
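The four steps above can be sketched as follows. This is an illustrative CsvParser, not code from the project: the ParsedSection components, the metadata keys ("file", "row", "columns"), and the one-section-per-row choice are all assumptions. The split is deliberately naive and does not handle quoted fields or embedded newlines. The @Component annotation is left as a comment so the sketch compiles without Spring on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-ins so the sketch compiles on its own; ParsedSection's components are assumed.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// @Component  (step 2: uncomment inside a Spring project)
class CsvParser implements DocumentParser {
    @Override
    public String supportedExtension() {
        return "csv"; // step 3
    }

    @Override
    public List<ParsedSection> parse(InputStream input, String fileName) { // step 4
        List<ParsedSection> sections = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String header = reader.readLine(); // first line is treated as column names
            if (header == null) {
                return sections; // empty file
            }
            String line;
            int row = 1; // 1-based index of the line after the header
            while ((line = reader.readLine()) != null) {
                if (!line.isBlank()) {
                    // Naive: keeps the raw line and does not parse quoted commas.
                    sections.add(new ParsedSection(line,
                        Map.of("file", fileName, "row", row, "columns", header)));
                }
                row++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sections;
    }
}
```

A production parser would more likely delegate to a dedicated CSV library to handle quoting and escaping correctly.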

The image parser is special: it calls an LLM vision API to generate a text description, which is then stored as if it were parsed text. This allows semantic search to find images based on their content — not just their filename.
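The flow just described (image bytes in, description out, description stored as section text) can be sketched as below. This is not the project's ImageParser: DescribingImageParser is a hypothetical name, the vision call is represented by a plain Function<byte[], String> rather than any real client API, and the ParsedSection components and metadata keys are assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Stand-ins so the sketch compiles on its own; ParsedSection's components are assumed.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// Hypothetical parser: "describe" stands in for a call to a vision-capable LLM.
class DescribingImageParser implements DocumentParser {
    private final String extension;
    private final Function<byte[], String> describe;

    DescribingImageParser(String extension, Function<byte[], String> describe) {
        this.extension = extension;
        this.describe = describe;
    }

    @Override
    public String supportedExtension() {
        return extension;
    }

    @Override
    public List<ParsedSection> parse(InputStream input, String fileName) {
        try {
            byte[] image = input.readAllBytes();
            // The model's description becomes the section text, so it is
            // indexed and searched like text extracted from any other format.
            String description = describe.apply(image);
            return List.of(new ParsedSection(description,
                Map.of("file", fileName, "source", "vision-description")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing the describer in through the constructor keeps the parser testable with a stub, without a network call.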