Multimodal Image Input

Module 6 · ~9 min read
Spring AI's Media API lets you attach images to a UserMessage. The framework handles the provider-specific encoding automatically — the same code works for both Claude (Anthropic vision blocks) and Gemini (Part.inlineData).

Spring AI Media API

RagService.java — building a multimodal message
Media media = new Media(
    MimeType.valueOf("image/jpeg"),
    new ByteArrayResource(imageBytes));

UserMessage userMessage = UserMessage.builder()
    .text(promptText)
    .media(media)
    .build();

String answer = chatClient.prompt()
    .messages(userMessage)
    .call()
    .content();

Provider-Specific Conversion

Spring AI converts the Media object into the correct format for each provider automatically:

Provider              | Wire format
Anthropic (Claude)    | Vision content block: {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "..."}}
Google GenAI (Gemini) | mediaToParts() converts the media to Part.inlineData with base64-encoded bytes and the MIME type

Your application code does not change between providers; only the injected bean does.
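Provider selection can therefore live entirely in configuration. A minimal sketch using Spring profiles follows; the bean names, profile names, and the Gemini model class are illustrative assumptions, not part of the course code:

```java
// Configuration sketch (hypothetical): the active profile decides which
// ChatModel backs the single ChatClient the application code injects.
@Configuration
public class ChatClientConfig {

    @Bean
    @Profile("anthropic")
    ChatClient anthropicChatClient(AnthropicChatModel model) {
        return ChatClient.create(model);
    }

    @Bean
    @Profile("gemini")
    ChatClient geminiChatClient(VertexAiGeminiChatModel model) {
        return ChatClient.create(model);
    }
}
```

Because both beans expose the same ChatClient type, the multimodal code shown above runs unchanged against either provider.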

Base64 Image Decoding from the Frontend

The React frontend sends images as Base64-encoded data URIs (e.g., data:image/jpeg;base64,/9j/4AAQ...). The backend decodes this before constructing the Media object:

RagService.java — buildUserMessageWithImage()
private static UserMessage buildUserMessageWithImage(String text, String imageBase64) {
    // Extract the MIME type from a data URI ("data:image/png;base64,...");
    // fall back to image/jpeg when no prefix is present.
    String mimeTypeStr = "image/jpeg";
    if (imageBase64.startsWith("data:")) {
        int semi = imageBase64.indexOf(';');
        if (semi > 5) mimeTypeStr = imageBase64.substring(5, semi);
    }
    // Strip the "data:...;base64," header, keeping only the encoded payload.
    String encoded = imageBase64.contains(",")
        ? imageBase64.split(",", 2)[1] : imageBase64;
    byte[] imageBytes = Base64.getDecoder().decode(encoded);
    Media media = new Media(MimeType.valueOf(mimeTypeStr),
        new ByteArrayResource(imageBytes));
    return UserMessage.builder().text(text).media(media).build();
}
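The data-URI parsing can be exercised in isolation. The sketch below mirrors the same rules without the Spring types; the class and method names are illustrative, not from the course code:

```java
import java.util.Base64;

public class DataUriDemo {
    // Mirrors buildUserMessageWithImage(): MIME type sits between "data:" and ';'.
    static String mimeTypeOf(String dataUri) {
        if (dataUri.startsWith("data:")) {
            int semi = dataUri.indexOf(';');
            if (semi > 5) return dataUri.substring(5, semi);
        }
        return "image/jpeg"; // backend default
    }

    // Strip the header and decode the Base64 payload.
    static byte[] payloadOf(String dataUri) {
        String encoded = dataUri.contains(",") ? dataUri.split(",", 2)[1] : dataUri;
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        String uri = "data:image/png;base64,"
            + Base64.getEncoder().encodeToString(new byte[]{1, 2, 3});
        System.out.println(mimeTypeOf(uri));        // image/png
        System.out.println(payloadOf(uri).length);  // 3
    }
}
```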

Key Points

The imagePresent Flag

When an image is attached, MultilingualPromptBuilder prepends an image instruction to the user message:

Additional instruction when imagePresent = true
"An image has been attached. Carefully examine it and use it to answer
the question. If the image contains text, tables, or diagrams, analyse
them thoroughly.\n\n"

This explicit instruction improves vision model performance — without it, models sometimes focus on text context and underutilise the attached image.
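The prepend step can be sketched as follows; the class and method shapes here are assumptions for illustration, not the actual MultilingualPromptBuilder API:

```java
public class PromptTextBuilder {
    // Instruction text from the course material, prepended verbatim.
    private static final String IMAGE_INSTRUCTION =
        "An image has been attached. Carefully examine it and use it to answer "
        + "the question. If the image contains text, tables, or diagrams, analyse "
        + "them thoroughly.\n\n";

    // Prepend the vision instruction only when an image accompanies the query.
    static String buildUserText(String question, boolean imagePresent) {
        return imagePresent ? IMAGE_INSTRUCTION + question : question;
    }

    public static void main(String[] args) {
        System.out.println(buildUserText("What does the chart show?", true));
    }
}
```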

Gemini models generally perform better than Claude on dense diagram interpretation (charts, technical schematics). Claude excels at analysing documents with mixed text and images. Consider routing image-heavy queries to the appropriate provider.
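One way to act on this is a simple keyword-based router in front of the two clients. This is purely a sketch: the keyword list and the routing rule are assumptions, not part of the course code:

```java
import java.util.List;

public class ProviderRouter {
    enum Provider { CLAUDE, GEMINI }

    // Illustrative hints that a query concerns dense diagrams or charts.
    private static final List<String> DIAGRAM_HINTS =
        List.of("chart", "diagram", "schematic", "graph", "plot");

    // Send diagram-heavy image queries to Gemini; default to Claude otherwise.
    static Provider route(String question, boolean imagePresent) {
        if (imagePresent) {
            String q = question.toLowerCase();
            if (DIAGRAM_HINTS.stream().anyMatch(q::contains)) {
                return Provider.GEMINI;
            }
        }
        return Provider.CLAUDE;
    }

    public static void main(String[] args) {
        System.out.println(route("Explain this schematic", true));       // GEMINI
        System.out.println(route("Summarise the attached report", true)); // CLAUDE
    }
}
```

In production you would likely route on richer signals (document type, user intent) rather than keywords, but the shape stays the same: pick the ChatClient bean per request instead of per deployment.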