Multimodal Image Input
Spring AI's Media API lets you attach images to a UserMessage. The framework handles the provider-specific encoding automatically, so the same code works for both Claude (Anthropic vision blocks) and Gemini (Part.inlineData).
Spring AI Media API
RagService.java — building a multimodal message
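    // Wrap the raw image bytes together with their MIME type.
    Media media = new Media(
            MimeType.valueOf("image/jpeg"),
            new ByteArrayResource(imageBytes));

    // Attach the image to the user message alongside the prompt text.
    UserMessage userMessage = UserMessage.builder()
            .text(promptText)
            .media(media)
            .build();

    // Send the message and read the model's text reply.
    String answer = chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();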
Provider-Specific Conversion
Spring AI converts the Media object into the correct format for each provider automatically:
| Provider | Wire format |
|---|---|
| Anthropic (Claude) | Vision content block: {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "..."}} |
| Google GenAI (Gemini) | mediaToParts() → Part.inlineData with base64 and MIME type |
Your application code does not change between providers — only the bean injected changes.
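To make the Anthropic row of the table concrete, the sketch below builds that JSON shape by hand using only the JDK. It illustrates the wire format and is not Spring AI's internal conversion code. The class name VisionBlockSketch and the method anthropicImageBlock are hypothetical, and the helper assumes the MIME type contains no characters that need JSON escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustration only: hand-builds the Anthropic image block shown in the table. */
final class VisionBlockSketch {

    // Hypothetical helper, not part of Spring AI or the Anthropic SDK.
    static String anthropicImageBlock(String mimeType, byte[] imageBytes) {
        // Standard Base64 alphabet, no line breaks; safe to embed in a JSON string.
        String data = Base64.getEncoder().encodeToString(imageBytes);
        return "{\"type\": \"image\", \"source\": {\"type\": \"base64\", "
                + "\"media_type\": \"" + mimeType + "\", "
                + "\"data\": \"" + data + "\"}}";
    }

    public static void main(String[] args) {
        // Two placeholder bytes stand in for real image data.
        byte[] placeholder = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(anthropicImageBlock("image/jpeg", placeholder));
    }
}
```

In the real application this step is never written by hand; it only shows what the framework is expected to produce from a Media object when the Anthropic model is in use.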
Base64 Image Decoding from the Frontend
The React frontend sends images as Base64-encoded data URIs (e.g., data:image/jpeg;base64,/9j/4AAQ...). The backend decodes this before constructing the Media object:
RagService.java — buildUserMessageWithImage()
    private static UserMessage buildUserMessageWithImage(String text, String imageBase64) {
        // Default MIME type, used when the input is raw Base64 with no data URI prefix.
        String mimeTypeStr = "image/jpeg";
        if (imageBase64.startsWith("data:")) {
            // A data URI looks like data:<mime>;base64,<payload>; take the part after "data:".
            int semi = imageBase64.indexOf(';');
            if (semi > 5) {
                mimeTypeStr = imageBase64.substring(5, semi);
            }
        }
        // The Base64 payload is everything after the first comma, if there is one.
        String encoded = imageBase64.contains(",")
                ? imageBase64.split(",", 2)[1] : imageBase64;
        byte[] imageBytes = Base64.getDecoder().decode(encoded);
        Media media = new Media(MimeType.valueOf(mimeTypeStr),
                new ByteArrayResource(imageBytes));
        return UserMessage.builder().text(text).media(media).build();
    }
Key Points
- The MIME type is parsed from the data URI prefix (data:image/png;, data:image/webp;, etc.)
- The base64 payload is extracted after the comma separator in the data URI
- Raw base64 (no data URI prefix) is also supported and is treated as image/jpeg
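The three points above can be checked without Spring on the classpath. The following is a minimal sketch that mirrors the parsing rules of buildUserMessageWithImage using only the JDK; the class DataUriSketch and its two methods are hypothetical names introduced for illustration, not part of RagService.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustration only: the data URI rules from buildUserMessageWithImage, JDK-only. */
final class DataUriSketch {

    // Returns the MIME type declared in a data URI, or image/jpeg for raw Base64.
    static String mimeTypeOf(String input) {
        if (input.startsWith("data:")) {
            int semi = input.indexOf(';');
            if (semi > 5) {
                return input.substring(5, semi); // text between "data:" and ';'
            }
        }
        return "image/jpeg";
    }

    // Decodes the Base64 payload: everything after the first comma, or the whole input.
    static byte[] payloadOf(String input) {
        int comma = input.indexOf(',');
        String encoded = comma >= 0 ? input.substring(comma + 1) : input;
        // Throws IllegalArgumentException if the payload is not valid Base64.
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // "aGk=" is the Base64 encoding of the two ASCII bytes "hi".
        String dataUri = "data:image/png;base64,aGk=";
        System.out.println(mimeTypeOf(dataUri));
        System.out.println(new String(payloadOf(dataUri), StandardCharsets.US_ASCII));
        System.out.println(mimeTypeOf("aGk="));
    }
}
```

Note that Base64.getDecoder() rejects whitespace and line breaks, so a payload that was wrapped or padded by the client would need cleaning before it reaches this point.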
The imagePresent Flag
When an image is attached, MultilingualPromptBuilder prepends an image instruction to the user message:
Additional instruction when imagePresent = true
"An image has been attached. Carefully examine it and use it to answer
the question. If the image contains text, tables, or diagrams, analyse
them thoroughly.\n\n"
This explicit instruction improves vision model performance — without it, models sometimes focus on text context and underutilise the attached image.
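As a rough sketch of how such a flag might be applied, the snippet below prepends the instruction only when an image is present. The class ImagePromptSketch and the method buildUserText are hypothetical; the actual MultilingualPromptBuilder API is not shown in this document, and the instruction text assumes the line breaks in the quoted string above are plain wrapping.

```java
/** Illustration only: conditional image instruction, not the real MultilingualPromptBuilder. */
final class ImagePromptSketch {

    // Instruction text as quoted in this section, ending with a blank line.
    static final String IMAGE_INSTRUCTION =
            "An image has been attached. Carefully examine it and use it to answer "
            + "the question. If the image contains text, tables, or diagrams, analyse "
            + "them thoroughly.\n\n";

    // Hypothetical helper: prefixes the question only when an image accompanies it.
    static String buildUserText(String question, boolean imagePresent) {
        return imagePresent ? IMAGE_INSTRUCTION + question : question;
    }

    public static void main(String[] args) {
        System.out.println(buildUserText("What does the chart show?", true));
    }
}
```

Keeping the instruction in one constant means text-only requests are sent unchanged, and the wording can be tuned in a single place.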
Gemini models generally perform better than Claude on dense diagram interpretation (charts, technical schematics). Claude excels at analysing documents with mixed text and images. Consider routing image-heavy queries to the appropriate provider.
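One possible shape for such routing is sketched below. Everything in it is an assumption made for illustration: the enums Provider and ImageKind, the method choose, and the idea that the caller already knows what kind of image it is sending. How the image kind is determined is left open.

```java
/** Illustration only: a hypothetical routing rule based on the guidance above. */
final class ProviderRoutingSketch {

    enum Provider { ANTHROPIC, GEMINI }

    // Hypothetical classification supplied by the caller (for example, from a UI hint).
    enum ImageKind { DENSE_DIAGRAM, MIXED_DOCUMENT, OTHER }

    // Picks a provider for the image kind, falling back to the configured default.
    static Provider choose(ImageKind kind, Provider fallback) {
        switch (kind) {
            case DENSE_DIAGRAM:
                return Provider.GEMINI;    // charts, technical schematics
            case MIXED_DOCUMENT:
                return Provider.ANTHROPIC; // documents mixing text and images
            default:
                return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(ImageKind.DENSE_DIAGRAM, Provider.ANTHROPIC));
    }
}
```

Because the application code is the same for both providers, the result of choose would only decide which chat client bean handles the request.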