Document Parsing
Before a document can be searched, it must be parsed into structured text sections. Power RAG defines a DocumentParser interface and provides a concrete implementation for each supported file format. Spring auto-discovers and injects all implementations as a list.
The DocumentParser Interface
DocumentParser.java
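import java.io.InputStream;
import java.util.List;

public interface DocumentParser {
    // File extension this parser handles, e.g. "pdf"
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}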
Each implementation declares which file extension it handles via supportedExtension() and performs the actual parsing via parse(). The output is a list of ParsedSection records — each section contains the extracted text plus metadata (page number, row number, section title, etc.).
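The source does not show ParsedSection itself. As a minimal sketch, assuming it is a Java record holding the extracted text and a free-form metadata map (the field names `text` and `metadata` and the `ofPage` factory are hypothetical), it might look something like this:

```java
import java.util.Map;

// Hypothetical shape of ParsedSection; the real record may use different fields.
record ParsedSection(String text, Map<String, Object> metadata) {

    ParsedSection {
        // Defensive copy so a section cannot be mutated after parsing.
        metadata = Map.copyOf(metadata);
    }

    // Convenience factory for a PDF-style section tagged with its page number.
    static ParsedSection ofPage(String text, int pageNumber) {
        return new ParsedSection(text, Map.of("page", pageNumber));
    }
}
```

A map keeps the record format-agnostic: a PDF section can carry a page number while a spreadsheet section carries a row number, without separate record types per format.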
Supported Parsers
| Extension | Parser Class | Library | Section Unit |
|---|---|---|---|
| pdf | PdfParser | Apache PDFBox | Per page: each page becomes one ParsedSection |
| docx | WordParser | Apache POI | Per paragraph: preserves heading structure |
| xlsx | ExcelParser | Apache POI | Per row: each data row is one section with column metadata |
| pptx | PowerPointParser | Apache POI | Per slide: slide title and body text combined |
| java | JavaSourceParser | JavaParser | Per method / class: extracts Javadoc and method signatures |
| png / jpg / webp | ImageParser | Claude / Gemini (vision) | Full image: sends to LLM for description, stores description as text |
Spring Auto-Injection of All Parsers
Each parser is annotated with @Component. Spring collects all beans implementing DocumentParser into a List<DocumentParser> automatically — no explicit registration needed. The service builds a map keyed by extension for O(1) lookup.
DocumentIngestionService.java — parser injection
@Service
public class DocumentIngestionService {
    // Parsers indexed by the extension each one declares
    private final Map<String, DocumentParser> parsersByExt;

    public DocumentIngestionService(List<DocumentParser> parsers, ...) {
        // Spring injects all @Component beans implementing DocumentParser
        this.parsersByExt = parsers.stream()
                .collect(Collectors.toMap(
                        DocumentParser::supportedExtension, p -> p));
    }
}
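How the service uses the map at ingestion time is not shown in the source. The sketch below illustrates one plausible lookup, assuming the extension is taken from the text after the last dot in the file name and compared case-insensitively. `ExtensionAware` is a stand-in for DocumentParser reduced to the one method the lookup needs, and `ParserRegistry`, `extensionOf`, and `forFile` are hypothetical names.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Stand-in for DocumentParser, reduced to what the lookup needs.
interface ExtensionAware {
    String supportedExtension();
}

// Hypothetical helper; the real service may resolve parsers differently.
final class ParserRegistry {
    private final Map<String, ExtensionAware> byExt;

    ParserRegistry(List<ExtensionAware> parsers) {
        // toMap throws IllegalStateException if two parsers claim the same extension.
        this.byExt = parsers.stream()
                .collect(Collectors.toMap(
                        p -> p.supportedExtension().toLowerCase(Locale.ROOT),
                        Function.identity()));
    }

    // Returns the text after the last dot, lower-cased, or "" if there is no dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Finds the parser for a file, failing loudly on unsupported formats.
    ExtensionAware forFile(String fileName) {
        ExtensionAware parser = byExt.get(extensionOf(fileName));
        if (parser == null) {
            throw new IllegalArgumentException("No parser for file: " + fileName);
        }
        return parser;
    }
}
```

One detail worth knowing about `Collectors.toMap`: without a merge function it rejects duplicate keys, so two parsers declaring the same extension would fail at startup rather than silently shadow each other.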
Adding a New Parser
To support a new format (e.g., csv), you only need to:
- Create a class implementing DocumentParser
- Annotate it with @Component
- Implement supportedExtension() returning "csv"
- Implement parse()
Spring discovers and registers it automatically. No configuration change needed.
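The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `ParsedSection` and `DocumentParser` are redeclared as stand-ins so the block compiles on its own, the metadata keys are hypothetical, and the parsing is deliberately naive (it splits on commas and does not handle quoted fields, which a real CSV library would).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-ins so the sketch compiles on its own; the project's real types may differ.
record ParsedSection(String text, Map<String, Object> metadata) {}

interface DocumentParser {
    String supportedExtension();
    List<ParsedSection> parse(InputStream input, String fileName);
}

// In the real project this class would also carry Spring's @Component annotation.
class CsvParser implements DocumentParser {

    @Override
    public String supportedExtension() {
        return "csv";
    }

    @Override
    public List<ParsedSection> parse(InputStream input, String fileName) {
        List<ParsedSection> sections = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            String header = reader.readLine();
            if (header == null) {
                return sections; // empty file: nothing to index
            }
            String[] columns = header.split(",", -1);
            String line;
            int row = 1; // 1-based index of data rows, header excluded
            while ((line = reader.readLine()) != null) {
                String[] cells = line.split(",", -1);
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < cells.length; i++) {
                    // Fall back to a positional name if the row is wider than the header.
                    String name = i < columns.length ? columns[i].trim() : "col" + (i + 1);
                    if (i > 0) {
                        text.append("; ");
                    }
                    text.append(name).append(": ").append(cells[i].trim());
                }
                sections.add(new ParsedSection(text.toString(),
                        Map.of("file", fileName, "row", row)));
                row++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sections;
    }
}
```

Pairing each cell with its column name mirrors the per-row approach the Excel parser is described as taking, and gives the embedding model some context for otherwise bare values.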
The image parser is special: it calls an LLM vision API to generate a text description, which is then stored as if it were parsed text. This allows semantic search to find images based on their content — not just their filename.
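The source does not show how the image parser talks to the model. As a rough sketch of the idea, assume the vision call sits behind a small interface: `VisionClient` and its `describe` method are hypothetical and do not correspond to any real Claude or Gemini SDK type, and `ImageDescriber` and the metadata keys are illustrative names only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;

// Stand-in for the project's record; the real fields may differ.
record ParsedSection(String text, Map<String, Object> metadata) {}

// Hypothetical abstraction over the vision model; not a real SDK type.
interface VisionClient {
    String describe(byte[] imageBytes, String fileName);
}

// Turns a whole image into a single text section via the vision model.
class ImageDescriber {
    private final VisionClient vision;

    ImageDescriber(VisionClient vision) {
        this.vision = vision;
    }

    List<ParsedSection> parse(InputStream input, String fileName) {
        try {
            byte[] bytes = input.readAllBytes();
            // The model's description becomes the searchable text for this image.
            String description = vision.describe(bytes, fileName);
            return List.of(new ParsedSection(description,
                    Map.of("file", fileName, "source", "vision")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the model call behind an interface also makes the parser testable with a stub, for example `new ImageDescriber((bytes, name) -> "a diagram")`, without any network access.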