Image Generation

Module 6 · ~10 min read
Power RAG integrates image generation as a first-class feature. When the user asks to "generate a picture of...", the system detects the intent, calls Google's Imagen 3 model, and returns the generated image as a base64 PNG — no RAG retrieval needed for generation requests.

Intent Detection

Image generation requests are detected by looking for the co-occurrence of a generation verb and an image noun in the query:

ImageGenerationService.java — isImageGenerationRequest()
private static final Set<String> GEN_VERBS = Set.of(
    "generate", "create", "draw", "paint", "make", "produce", "design");
private static final Set<String> IMAGE_NOUNS = Set.of(
    "image", "picture", "photo", "illustration", "artwork", "diagram");

public boolean isImageGenerationRequest(String question) {
    String lower = question.toLowerCase();
    boolean hasVerb = GEN_VERBS.stream().anyMatch(lower::contains);
    boolean hasNoun = IMAGE_NOUNS.stream().anyMatch(lower::contains);
    return hasVerb && hasNoun;
}

Both conditions must be true. "Create a summary" has a verb but no image noun → not detected. "Show me an image" has a noun but no generation verb → not detected. "Draw a diagram of the architecture" → detected.
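One caveat of the substring check is that `contains` also matches inside longer words, so "Create a summary of photosynthesis" is detected because "photo" occurs in "photosynthesis". The following is a minimal sketch of a whole-word variant, not Power RAG's actual code: the class name `TokenIntentDetector` and the helpers `wholeWordMatch` and `isImageNoun` are hypothetical, and the simple plural handling is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

public class TokenIntentDetector {
    private static final Set<String> GEN_VERBS = Set.of(
        "generate", "create", "draw", "paint", "make", "produce", "design");
    private static final Set<String> IMAGE_NOUNS = Set.of(
        "image", "picture", "photo", "illustration", "artwork", "diagram");

    /** Substring check, equivalent to the listing above (kept for comparison). */
    static boolean substringMatch(String question) {
        String lower = question.toLowerCase(Locale.ROOT);
        return GEN_VERBS.stream().anyMatch(lower::contains)
            && IMAGE_NOUNS.stream().anyMatch(lower::contains);
    }

    /** Whole-word check: split on non-letters and compare complete tokens. */
    static boolean wholeWordMatch(String question) {
        if (question == null || question.isBlank()) {
            return false;
        }
        String[] tokens = question.toLowerCase(Locale.ROOT).split("[^a-z]+");
        boolean hasVerb = Arrays.stream(tokens).anyMatch(GEN_VERBS::contains);
        boolean hasNoun = Arrays.stream(tokens).anyMatch(TokenIntentDetector::isImageNoun);
        return hasVerb && hasNoun;
    }

    /** Accepts a listed noun or its simple plural ("image" or "images"). */
    private static boolean isImageNoun(String token) {
        if (IMAGE_NOUNS.contains(token)) {
            return true;
        }
        return token.endsWith("s")
            && IMAGE_NOUNS.contains(token.substring(0, token.length() - 1));
    }

    public static void main(String[] args) {
        String q = "Create a summary of photosynthesis";
        System.out.println(substringMatch(q));  // true  ("photo" inside "photosynthesis")
        System.out.println(wholeWordMatch(q));  // false (no image noun as a whole word)
    }
}
```

The trade-off is recall: whole-word matching misses inflected verbs such as "drawing" or "generated", which the substring check catches. Which behavior is preferable depends on how costly a false positive is compared with a missed request.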

Imagen 3 via Google GenAI SDK

Power RAG uses the Google GenAI Java SDK directly (not Spring AI) for Imagen 3, as Spring AI 1.1.2 does not yet wrap Imagen natively.

ImageGenerationService.java — Imagen 3 call
com.google.genai.Client client = com.google.genai.Client.builder()
    .apiKey(apiKey).build();

GenerateImagesResponse resp = client.models.generateImages(
    "imagen-3.0-generate-002", prompt,
    GenerateImagesConfig.builder().numberOfImages(1).build());

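// generatedImages() returns an Optional<List<GeneratedImage>> in the GenAI Java SDK,
// so unwrap it before taking the first (and only) image.
byte[] imageBytes = resp.generatedImages().get().get(0)
    .image().get().imageBytes().get().toByteArray();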
return "data:image/png;base64," + Base64.getEncoder().encodeToString(imageBytes);
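Because the service always labels its result as `image/png`, a consumer may want to decode the data URI and confirm that the payload really is a PNG before rendering or storing it. The sketch below uses only the JDK. The class `DataUriImages` and its methods are hypothetical helpers for illustration and are not part of Power RAG or the GenAI SDK.

```java
import java.util.Arrays;
import java.util.Base64;

public class DataUriImages {
    private static final String PNG_PREFIX = "data:image/png;base64,";

    // Every PNG file starts with these eight bytes.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'
    };

    /** Mirrors the return statement above: raw bytes to a PNG data URI. */
    static String toDataUri(byte[] pngBytes) {
        return PNG_PREFIX + Base64.getEncoder().encodeToString(pngBytes);
    }

    /** Reverses toDataUri; rejects anything without the expected prefix. */
    static byte[] fromDataUri(String dataUri) {
        if (dataUri == null || !dataUri.startsWith(PNG_PREFIX)) {
            throw new IllegalArgumentException("Not a base64 PNG data URI");
        }
        return Base64.getDecoder().decode(dataUri.substring(PNG_PREFIX.length()));
    }

    /** Cheap sanity check on the decoded bytes; does not validate the full file. */
    static boolean looksLikePng(byte[] bytes) {
        return bytes.length >= PNG_SIGNATURE.length
            && Arrays.equals(Arrays.copyOf(bytes, PNG_SIGNATURE.length), PNG_SIGNATURE);
    }
}
```

A caller could run `looksLikePng(fromDataUri(uri))` on the service's return value and reject or relabel the image when the check fails.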

Gemini Flash Fallback

If Imagen 3 is unavailable or the API key lacks Imagen permissions, the service falls back to Gemini's multimodal output capability:

ImageGenerationService.java — Gemini Flash image output fallback
GenerateContentConfig config = GenerateContentConfig.builder()
    .responseModalities(List.of("IMAGE", "TEXT")).build();
GenerateContentResponse resp = client.models.generateContent(
    "gemini-2.0-flash-preview-image-generation", prompt, config);
// Extract from resp.parts() → Part.inlineData → blob bytes
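The primary-then-fallback control flow can be sketched independently of the SDK calls. The example below is an illustration under assumptions, not the service's actual structure: `ImageBackend` and `generateWithFallback` are hypothetical names, and each backend stands in for one of the two calls shown above (Imagen 3 first, Gemini Flash second).

```java
import java.util.List;
import java.util.Optional;

public class ImageFallbackChain {

    /** Hypothetical abstraction: returns a data URI for the prompt, or throws. */
    interface ImageBackend {
        String generate(String prompt) throws Exception;
    }

    /**
     * Tries each backend in order and returns the first non-empty result.
     * A thrown exception or an empty result moves on to the next backend.
     */
    static Optional<String> generateWithFallback(String prompt, List<ImageBackend> backends) {
        for (ImageBackend backend : backends) {
            try {
                String dataUri = backend.generate(prompt);
                if (dataUri != null && !dataUri.isEmpty()) {
                    return Optional.of(dataUri);
                }
            } catch (Exception e) {
                // Assumption: failures (outage, missing permission) are logged and skipped.
                System.err.println("Image backend failed: " + e.getMessage());
            }
        }
        return Optional.empty();
    }
}
```

Listing the Imagen 3 backend first and the Gemini backend second keeps the order explicit, and returning `Optional.empty()` lets the caller decide what to show when both fail.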

Pipeline Integration

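Image generation detection happens at Stage 1.5, after the cache miss but before hybrid retrieval. This ordering is intentional: a cached response is still served first, and a generation request skips retrieval entirely because it needs no document context.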
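A minimal sketch of that stage ordering follows. It assumes a simple in-memory cache keyed by the question, and it does not show cache writes, which the text does not describe. `PipelineOrderSketch` and its constructor parameters are hypothetical stand-ins for the real pipeline components.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

public class PipelineOrderSketch {
    private final Map<String, String> cache;
    private final Predicate<String> isImageRequest;
    private final Function<String, String> imageGenerator;
    private final Function<String, String> ragAnswerer;

    PipelineOrderSketch(Map<String, String> cache,
                        Predicate<String> isImageRequest,
                        Function<String, String> imageGenerator,
                        Function<String, String> ragAnswerer) {
        this.cache = cache;
        this.isImageRequest = isImageRequest;
        this.imageGenerator = imageGenerator;
        this.ragAnswerer = ragAnswerer;
    }

    String handle(String question) {
        // Stage 1: a cache hit short-circuits everything else.
        String cached = cache.get(question);
        if (cached != null) {
            return cached;
        }
        // Stage 1.5: image requests bypass retrieval entirely.
        if (isImageRequest.test(question)) {
            return imageGenerator.apply(question);
        }
        // Later stages: hybrid retrieval and answer generation.
        return ragAnswerer.apply(question);
    }
}
```

In this arrangement the retrieval function is never invoked for a generation request, which is the saving the ordering is meant to provide.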

Imagen 3 produces significantly higher-quality images than Gemini Flash's image output. Use Imagen 3 as the primary path and Gemini as a reliability fallback, not the other way around.