Document Parsing

Module 3 · ~9 min read

Before a document can be searched, it must be parsed into structured text sections. Power RAG defines a DocumentParser interface and provides a concrete implementation for each supported file format. Spring auto-discovers and injects all implementations as a list.

The DocumentParser Interface

DocumentParser.java

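import java.io.InputStream;
import java.util.List;

public interface DocumentParser {
    // File extension this parser handles, without the dot (e.g. "pdf")
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}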

Each implementation declares which file extension it handles via supportedExtension() and performs the actual parsing via parse(). The output is a list of ParsedSection records — each section contains the extracted text plus metadata (page number, row number, section title, etc.).
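To make the idea of "text plus metadata" concrete, here is a minimal sketch of what a section record could look like. The field names, the metadata keys ("pageNumber", "rowNumber", "sheetName"), and the two factory methods are assumptions for illustration only; the actual ParsedSection record in Power RAG may be shaped differently.

```java
import java.util.Map;

// Hypothetical shape: the real ParsedSection may use different fields.
record ParsedSection(String text, Map<String, Object> metadata) {

    ParsedSection {
        // Defensive copy so a section's metadata cannot change after creation.
        metadata = Map.copyOf(metadata);
    }

    // Assumed helper for page-based formats such as PDF.
    static ParsedSection ofPage(String text, int pageNumber) {
        return new ParsedSection(text, Map.of("pageNumber", pageNumber));
    }

    // Assumed helper for row-based formats such as spreadsheets.
    static ParsedSection ofRow(String text, int rowNumber, String sheetName) {
        return new ParsedSection(
            text, Map.of("rowNumber", rowNumber, "sheetName", sheetName));
    }
}
```

Keeping metadata in a map rather than fixed fields is one way to let each format attach only the keys that make sense for it; a design with typed, nullable fields would work just as well.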

Supported Parsers

Extension	Parser Class	Library	Section Unit
`pdf`	`PdfParser`	Apache PDFBox	Per page — each page becomes one ParsedSection
`docx`	`WordParser`	Apache POI	Per paragraph — preserves heading structure
`xlsx`	`ExcelParser`	Apache POI	Per row — each data row is one section with column metadata
`pptx`	`PowerPointParser`	Apache POI	Per slide — slide title and body text combined
`java`	`JavaSourceParser`	JavaParser	Per method / class — extracts Javadoc and method signatures
`png` / `jpg` / `webp`	`ImageParser`	Claude / Gemini (vision)	Full image — sends to LLM for description, stores description as text

Spring Auto-Injection of All Parsers

Each parser is annotated with @Component. Spring collects all beans implementing DocumentParser into a List<DocumentParser> automatically — no explicit registration needed. The service builds a map keyed by extension for O(1) lookup.

DocumentIngestionService.java (parser injection)

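@Service
public class DocumentIngestionService {
    private final Map<String, DocumentParser> parsersByExt;

    public DocumentIngestionService(List<DocumentParser> parsers, ...) {
        // Spring injects all @Component beans implementing DocumentParser.
        // Collectors.toMap throws IllegalStateException if two parsers
        // declare the same extension.
        this.parsersByExt = parsers.stream()
            .collect(Collectors.toMap(
                DocumentParser::supportedExtension, p -> p));
    }
}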
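The map only pays off once a file name is turned into a lookup key. The sketch below shows one way that dispatch could work without Spring. ParserRegistry and forFile are hypothetical names, the ParsedSection record is a stand-in, and lower-casing the extension is an assumption about how Power RAG normalizes file names.

```java
import java.io.InputStream;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

// Stand-in for the project's record; the real shape may differ.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// Hypothetical helper that mirrors the map built in the service constructor.
final class ParserRegistry {
    private final Map<String, DocumentParser> parsersByExt;

    ParserRegistry(List<DocumentParser> parsers) {
        // toMap throws IllegalStateException when two parsers share a key.
        this.parsersByExt = parsers.stream()
            .collect(Collectors.toMap(
                p -> p.supportedExtension().toLowerCase(Locale.ROOT),
                Function.identity()));
    }

    // Returns the parser for the file's extension, or empty if none matches.
    Optional<DocumentParser> forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension, or a trailing dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(parsersByExt.get(ext));
    }
}
```

Returning an Optional leaves the caller to decide what an unsupported format means, for example rejecting the upload or skipping the file.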

Adding a New Parser

To support a new format (e.g., csv), you only need to:

1. Create a class implementing DocumentParser
2. Annotate it with @Component
3. Implement supportedExtension() returning "csv"
4. Implement parse()

Spring discovers and registers it automatically. No configuration change needed.
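The steps above can be sketched as follows. This is a minimal illustration, not the project's code: the ParsedSection record is a stand-in, the metadata keys are invented, and the naive comma split does not handle quoted fields, so a real parser would likely use a CSV library. The @Component annotation appears only as a comment so the example compiles without Spring.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the project's record; the real shape may differ.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// @Component  (required in the real project so Spring can discover the class)
final class CsvParser implements DocumentParser {

    @Override
    public String supportedExtension() {
        return "csv";
    }

    @Override
    public List<ParsedSection> parse(InputStream input, String fileName) {
        List<ParsedSection> sections = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return sections; // empty file
            }
            String[] headers = headerLine.split(",", -1);
            int rowNumber = 1; // the header is row 1
            String line;
            while ((line = reader.readLine()) != null) {
                rowNumber++;
                if (line.isBlank()) {
                    continue; // skip blank lines but keep counting rows
                }
                String[] cells = line.split(",", -1);
                StringBuilder text = new StringBuilder();
                // Pair each cell with its header: "name: Ada; role: engineer"
                for (int i = 0; i < cells.length && i < headers.length; i++) {
                    if (i > 0) {
                        text.append("; ");
                    }
                    text.append(headers[i].trim())
                        .append(": ")
                        .append(cells[i].trim());
                }
                Map<String, Object> metadata = new LinkedHashMap<>();
                metadata.put("fileName", fileName);
                metadata.put("rowNumber", rowNumber);
                sections.add(new ParsedSection(text.toString(), metadata));
            }
        } catch (IOException e) {
            // The interface declares no checked exceptions, so wrap it.
            throw new UncheckedIOException(e);
        }
        return sections;
    }
}
```

Emitting one section per data row follows the same convention the table lists for ExcelParser; whether that is the right unit for CSV depends on the data.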

The image parser is special: it calls an LLM vision API to generate a text description, which is then stored as if it were parsed text. This allows semantic search to find images based on their content — not just their filename.
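A rough sketch of that flow is shown below. VisionClient is a hypothetical abstraction over the Claude or Gemini call, not a real SDK type, and the ParsedSection record and metadata keys are stand-ins. The point is only that the description returned by the model takes the place of extracted text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;

// Stand-in for the project's record; the real shape may differ.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// Hypothetical wrapper around an LLM vision API; not a real library type.
interface VisionClient {
    String describe(byte[] imageBytes, String fileName);
}

final class ImageParserSketch implements DocumentParser {
    private final VisionClient vision;

    ImageParserSketch(VisionClient vision) {
        this.vision = vision;
    }

    @Override
    public String supportedExtension() {
        return "png";
    }

    @Override
    public List<ParsedSection> parse(InputStream input, String fileName) {
        byte[] bytes;
        try {
            bytes = input.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The model's description becomes the searchable text of the section.
        String description = vision.describe(bytes, fileName);
        return List.of(new ParsedSection(
            description,
            Map.of("fileName", fileName, "sourceType", "image")));
    }
}
```

Hiding the model call behind a small interface also makes the parser testable with a stub, without network access or API keys.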
