Tool Observability & Audit

Module 10 · ~18 min read

When an LLM calls external tools during a conversation, you need to know exactly which tools fired, how long each took, and whether they succeeded. Without this, a slow or failing tool call is an invisible black box — you cannot tell from the final answer alone why a response took five seconds or contained inaccurate information.

Power RAG instruments every MCP tool invocation transparently, without modifying the MCP servers or the Spring AI tool pipeline. The data flows from the wrapper class through a thread-local buffer, into the API response, and finally into the PostgreSQL audit log — giving you full traceability at every layer.

The Data Model: McpToolInvocationSummary

One record is created per tool call, per chat turn:

backend/src/main/java/com/powerrag/mcp/McpToolInvocationSummary.java
@JsonInclude(JsonInclude.Include.NON_NULL)
public record McpToolInvocationSummary(
        String serverId,       // inferred from tool name prefix, e.g. "powerrag-tools"
        String toolName,       // exact tool name, e.g. "jira_search_issues"
        boolean success,       // true if the call returned without exception
        long durationMs,       // client-side wall clock time for the call
        String errorMessage,   // null on success; on failure, first 200 chars plus an ellipsis
        String argsSummary     // tool input summary, max 200 chars; null if blank
) {}

@JsonInclude(NON_NULL) keeps the JSON compact by omitting null fields. errorMessage is null for every successful call, and argsSummary is null whenever the tool input is blank, so a successful call often serialises to just four fields.

Layer 1 — ObservingToolCallback

This is the decorator that wraps every ToolCallback from the MCP provider. It intercepts each call, times it, and posts a summary to the recorder. The model never knows it is being observed.

backend/src/main/java/com/powerrag/mcp/ObservingToolCallback.java — invoke()
private String invoke(Supplier<String> supplier, String toolInput) {
    String toolName = delegate.getToolDefinition().name();
    String serverId = inferServerId(toolName);   // "powerrag-tools__jira_search" → "powerrag-tools"
    long t0 = System.currentTimeMillis();
    try {
        String out = normalizeMcpToolOutput(supplier.get());
        long ms = System.currentTimeMillis() - t0;
        recorder.record(new McpToolInvocationSummary(
                serverId, toolName, true, ms, null, summarizeArgs(toolInput)));
        return out;
    } catch (RuntimeException e) {
        long ms = System.currentTimeMillis() - t0;
        String msg = e.getMessage();
        if (msg != null && msg.length() > 200) msg = msg.substring(0, 200) + "…";
        recorder.record(new McpToolInvocationSummary(
                serverId, toolName, false, ms, msg, summarizeArgs(toolInput)));
        throw e;   // re-throw so Spring AI can handle tool failures
    }
}

Notice that the exception is re-thrown after recording. The observer records the failure but does not swallow it — Spring AI still gets to decide whether to retry or surface the error to the model.
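
The listing calls summarizeArgs(toolInput) without showing its body. The sketch below is one plausible shape for such a helper, assuming it only collapses whitespace, returns null for blank input (as the argsSummary field comment suggests), and caps the result at 200 characters. The class name ArgsSummarySketch and the exact truncation rule are assumptions for illustration, not the project's actual code.

```java
/** Hypothetical stand-in for the summarizeArgs helper; not the project's actual code. */
final class ArgsSummarySketch {

    static final int MAX_CHARS = 200; // limit taken from the record's field comments

    /** Returns a compact, length-capped view of the raw tool input, or null if blank. */
    static String summarizeArgs(String toolInput) {
        if (toolInput == null || toolInput.isBlank()) {
            return null; // null lets @JsonInclude(NON_NULL) drop the field
        }
        // Collapse runs of whitespace so pretty-printed JSON fits on one line.
        String compact = toolInput.strip().replaceAll("\\s+", " ");
        if (compact.length() <= MAX_CHARS) {
            return compact;
        }
        // Assumption: reserve one character for the ellipsis so the total stays at the cap.
        return compact.substring(0, MAX_CHARS - 1) + "\u2026";
    }

    public static void main(String[] args) {
        System.out.println(summarizeArgs("{\n  \"jql\": \"project = KAN\"\n}"));
        System.out.println(summarizeArgs("   "));
        System.out.println(summarizeArgs("x".repeat(500)).length());
    }
}
```

Keeping the summary short matters because it is stored with every audited interaction and shipped to the browser in every response.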

Server ID is inferred from the tool name prefix, using a double-underscore separator that Spring AI introduces when namespacing tools from multiple MCP servers:

ObservingToolCallback.java — inferServerId()
static String inferServerId(String toolName) {
    if (toolName == null) return "mcp";
    int sep = toolName.indexOf("__");
    if (sep > 0) return toolName.substring(0, sep);  // "powerrag-tools__fetch_url" → "powerrag-tools"
    return "mcp";                                     // no namespace prefix: generic fallback
}
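
A few inputs make the edge cases of this rule concrete. The example copies inferServerId() from the listing above so that it compiles on its own; the class name InferServerIdDemo and the sample tool names other than "powerrag-tools__fetch_url" are made up for illustration.

```java
final class InferServerIdDemo {

    // Copied from the listing above so the example is self-contained.
    static String inferServerId(String toolName) {
        if (toolName == null) return "mcp";
        int sep = toolName.indexOf("__");
        if (sep > 0) return toolName.substring(0, sep);
        return "mcp";
    }

    public static void main(String[] args) {
        System.out.println(inferServerId("powerrag-tools__fetch_url")); // powerrag-tools
        System.out.println(inferServerId("fetch_url"));                 // mcp (single underscore only)
        System.out.println(inferServerId("__fetch_url"));               // mcp (separator at index 0, empty prefix)
        System.out.println(inferServerId("a__b__c"));                   // a (first separator wins)
        System.out.println(inferServerId(null));                        // mcp
    }
}
```

Because the check is sep > 0 rather than sep >= 0, a name that starts with the separator falls back to "mcp" instead of producing an empty server ID.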

Layer 2 — McpInvocationRecorder

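The recorder collects invocations in a ThreadLocal list. Because Spring MVC handles each HTTP request on a single thread, every tool call made during one chat turn lands in the same list, and no synchronisation is required. This holds only while the tool callbacks run on the request thread: a call dispatched to another thread would write to that thread's own buffer.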

backend/src/main/java/com/powerrag/mcp/McpInvocationRecorder.java
@Component
public class McpInvocationRecorder {

    private final ThreadLocal<List<McpToolInvocationSummary>> current =
            ThreadLocal.withInitial(ArrayList::new);

    public void clear() { current.get().clear(); }

    public void record(McpToolInvocationSummary summary) { current.get().add(summary); }

    /** Returns an immutable snapshot and clears the buffer for this thread. */
    public List<McpToolInvocationSummary> snapshotAndClear() {
        List<McpToolInvocationSummary> list = new ArrayList<>(current.get());
        current.get().clear();
        return list.isEmpty() ? List.of() : Collections.unmodifiableList(list);
    }
}

Key concept: snapshotAndClear() returns the data and resets the buffer in a single call. RagService always calls clear() before the LLM call and snapshotAndClear() after it. If an error path skips the snapshot, the leading clear() of the next request on the same thread still discards any stale entries, so no state leaks between requests.
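
The clear, call, snapshot sequence can be sketched in isolation. The class below is a simplified stand-in that stores plain strings instead of McpToolInvocationSummary records; runTurn() is a hypothetical name for what RagService does around the LLM call, not a method from the project. The second half of main() shows the thread-local boundary: an entry recorded on a different thread never reaches the request thread's buffer.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for McpInvocationRecorder that stores plain strings. */
final class RecorderLifecycleSketch {

    private static final ThreadLocal<List<String>> BUFFER =
            ThreadLocal.withInitial(ArrayList::new);

    static void clear() { BUFFER.get().clear(); }

    static void record(String entry) { BUFFER.get().add(entry); }

    static List<String> snapshotAndClear() {
        List<String> copy = List.copyOf(BUFFER.get()); // immutable copy of this thread's entries
        BUFFER.get().clear();
        return copy;
    }

    /** Hypothetical shape of one chat turn: clear, run the model call, then snapshot. */
    static List<String> runTurn(Runnable llmCall) {
        clear(); // discard anything an earlier failed turn left on this thread
        llmCall.run();
        return snapshotAndClear();
    }

    public static void main(String[] args) throws InterruptedException {
        record("stale entry from an earlier turn");
        List<String> turn = runTurn(() -> record("jira_search_issues ok"));
        System.out.println(turn); // [jira_search_issues ok]

        // A record() made on another thread goes to that thread's own buffer.
        Thread worker = new Thread(() -> record("recorded on worker thread"));
        worker.start();
        worker.join();
        System.out.println(snapshotAndClear()); // []
    }
}
```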

Layer 3 — Database Audit Log

The Flyway V8 migration adds a nullable JSONB column to the interactions table:

backend/src/main/resources/db/migration/V8__interaction_mcp_invocations.sql
-- MCP tool invocation summaries for a chat turn (nullable when no tools used)
ALTER TABLE interactions ADD COLUMN mcp_invocations jsonb NULL;

The column is NULL for interactions where no tools fired, keeping the table compact for the common case. The JPA entity maps the column using Hibernate's JSONB type:

backend/src/main/java/com/powerrag/domain/Interaction.java — mcpInvocations field
@JdbcTypeCode(SqlTypes.JSON)
@Column(columnDefinition = "jsonb")
private List<Map<String, Object>> mcpInvocations;  // null when no tools were used
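
The entity field holds plain maps rather than the record type, so something has to convert between the two before the interaction is saved. The module does not show that step. The sketch below is one way it could work, assuming null fields are left out of each map and an empty list becomes a NULL column; toColumnValue() and the class name are hypothetical, and the record is a local copy without the Jackson annotation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InvocationMappingSketch {

    // Local copy of the record from the data model section, without the Jackson annotation.
    record McpToolInvocationSummary(String serverId, String toolName, boolean success,
                                    long durationMs, String errorMessage, String argsSummary) {}

    /** Hypothetical helper: builds the value for the mcpInvocations column. */
    static List<Map<String, Object>> toColumnValue(List<McpToolInvocationSummary> invocations) {
        if (invocations == null || invocations.isEmpty()) {
            return null; // column stays NULL when no tools fired
        }
        return invocations.stream().map(InvocationMappingSketch::toMap).toList();
    }

    private static Map<String, Object> toMap(McpToolInvocationSummary s) {
        Map<String, Object> m = new LinkedHashMap<>(); // keeps key order stable
        m.put("serverId", s.serverId());
        m.put("toolName", s.toolName());
        m.put("success", s.success());
        m.put("durationMs", s.durationMs());
        if (s.errorMessage() != null) m.put("errorMessage", s.errorMessage());
        if (s.argsSummary() != null) m.put("argsSummary", s.argsSummary());
        return m;
    }

    public static void main(String[] args) {
        var ok = new McpToolInvocationSummary("powerrag-tools", "jira_get_issue", true, 312, null, null);
        System.out.println(toColumnValue(List.of(ok)));
        // [{serverId=powerrag-tools, toolName=jira_get_issue, success=true, durationMs=312}]
        System.out.println(toColumnValue(List.of())); // null
    }
}
```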

For a chat turn with two tool calls, the stored JSON looks like this:

Example — interactions.mcp_invocations column value
[
  {
    "serverId":   "powerrag-tools",
    "toolName":   "jira_search_issues",
    "success":    true,
    "durationMs": 843,
    "argsSummary": "{\"jql\": \"project = KAN ORDER BY created DESC\", \"max_results\": 5}"
  },
  {
    "serverId":    "powerrag-tools",
    "toolName":    "jira_get_issue",
    "success":     true,
    "durationMs":  312
  }
]

Layer 4 — API Response

RagResponse carries the invocation list alongside the answer and sources, so the frontend can display tool activity without a separate API call:

backend/src/main/java/com/powerrag/rag/model/RagResponse.java — mcpInvocations field
public record RagResponse(
        String answer,
        double confidence,
        List<SourceRef> sources,
        String modelId,
        long durationMs,
        UUID interactionId,
        boolean cacheHit,
        String error,
        String generatedImageBase64,
        List<McpToolInvocationSummary> mcpInvocations  // empty list when no tools fired
) {
    public boolean mcpToolsUsed() {
        return !mcpInvocations.isEmpty();
    }
}

The frontend TypeScript interface mirrors this:

frontend/src/api/chatApi.ts — McpToolInvocationSummary type
export interface McpToolInvocationSummary {
  serverId:      string
  toolName:      string
  success:       boolean
  durationMs:    number
  errorMessage?: string
  argsSummary?:  string
}

export interface ChatQueryResponse {
  answer:       string
  confidence:   number
  sources:      SourceRef[]
  modelId:      string
  durationMs:   number
  interactionId: string
  cacheHit:     boolean
  error?:       string
  generatedImageBase64?: string
  mcpInvocations?: McpToolInvocationSummary[]  // undefined when not present in response
}

Layer 5 — Frontend Display

The chat window renders an expandable badge for each message that used tools. The badge uses amber styling to distinguish it from the cyan cache-hit chip and the green confidence indicator:

frontend/src/components/ChatWindow.tsx — MCP badge (condensed)
{(msg.response.mcpInvocations?.length ?? 0) > 0 && (
  <details data-testid="mcp-tools-badge">
    <summary className="text-xs text-amber-400 border border-amber-700/60 px-2 py-0.5 rounded-full">
      MCP · {msg.response.mcpInvocations!.length} {plural}
    </summary>
    <ul className="mt-2 text-xs text-slate-500 space-y-1">
      {msg.response.mcpInvocations!.map((inv, i) => (
        <li key={`${inv.toolName}-${i}`}>
          <span className="text-slate-400">{inv.toolName}</span>
          <span className="text-slate-600"> · {inv.durationMs}ms</span>
          {!inv.success && inv.errorMessage && (
            <span className="text-amber-600/90"> — {inv.errorMessage}</span>
          )}
        </li>
      ))}
    </ul>
  </details>
)}

The HTML <details> and <summary> elements provide the expand/collapse behaviour natively, with no JavaScript required.

The McpToolsPanel Sidebar

A separate panel component shows which tools are available in the current session. It calls GET /api/chat/mcp-tools on load and displays the result:

frontend/src/components/McpToolsPanel.tsx — capabilities panel
// Shows status: "Not configured" / "Not attached" / list of tool names
// Jira hint: renders a link to the Jira board if jira_* tools are present
function McpToolsPanel({ data }: { data: McpToolsCapabilitiesResponse | undefined }) {
  const hasJira = mcpHasJiraTools(data)
  return (
    <div data-testid="mcp-tools-panel"
         className="border border-amber-700/40 rounded bg-amber-950/20 p-3">
      {/* Tool list or status message */}
      {hasJira && (
        <a href={JIRA_BOARD_URL} className="text-xs text-amber-500">
          Open Jira board ↗
        </a>
      )}
    </div>
  )
}

Confidence Score Adjustment

The confidence scorer takes MCP tool invocations into account. Successful tool calls provide live, authoritative data that supplements or replaces KB retrieval, so they raise the confidence score reported for the response. Failed tool calls, where the model had to answer without the data it requested, lower it:

backend/src/main/java/com/powerrag/rag/scoring/ConfidenceScorer.java — MCP factor
// In responseConfidence():
double confidence = scorer.responseConfidence(
        retrievalConfidence, hasRelevantDocs, mcpInvocations);

// mcpInvocations is passed in from the recorded invocation list.
// If tools ran and succeeded, the base confidence is boosted.
// If tools ran but all failed, the confidence is reduced.

Test Coverage

The observability layer has dedicated unit tests that verify output normalisation, argument summarisation, and failure recording without needing a live MCP server:

backend/src/test/java/com/powerrag/mcp/ObservingToolCallbackTest.java
@Test
void normalizesTextContentWrapper() {
    String wrapped = "TextContent[annotations=null, text={\"key\":\"value\"}, meta=null]";
    String result = ObservingToolCallback.normalizeMcpToolOutput(wrapped);
    assertEquals("{\"key\":\"value\"}", result);
}

@Test
void passesPlainJsonThrough() {
    String json = "{\"ok\":true,\"text\":\"hello\"}";
    assertEquals(json, ObservingToolCallback.normalizeMcpToolOutput(json));
}

@Test
void recordsFailureAndRethrows() {
    ToolCallback failing = mock(ToolCallback.class);
    when(failing.call(any())).thenThrow(new RuntimeException("timeout"));
    ObservingToolCallback obs = new ObservingToolCallback(failing, recorder);

    assertThrows(RuntimeException.class, () -> obs.call("{}"));

    List<McpToolInvocationSummary> recorded = recorder.snapshotAndClear();
    assertEquals(1, recorded.size());
    assertFalse(recorded.get(0).success());
    assertEquals("timeout", recorded.get(0).errorMessage());
}
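
The first two tests pin down the behaviour expected from output normalisation: unwrap a TextContent[...] string to its text field, and pass anything else through unchanged. A minimal sketch that satisfies both tests could look like the following, assuming the wrapper always has the annotations, text, meta field order shown in the test. The regular expression and the class name are illustrative guesses, not the project's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NormalizeOutputSketch {

    // Matches the TextContent[...] string shape used in the test above (assumed field order).
    private static final Pattern TEXT_CONTENT = Pattern.compile(
            "^TextContent\\[annotations=.*?, text=(.*), meta=.*?\\]$", Pattern.DOTALL);

    /** Returns the inner text of a TextContent wrapper, or the input unchanged. */
    static String normalizeMcpToolOutput(String raw) {
        if (raw == null) return null;
        Matcher m = TEXT_CONTENT.matcher(raw.strip());
        return m.matches() ? m.group(1) : raw;
    }

    public static void main(String[] args) {
        String wrapped = "TextContent[annotations=null, text={\"key\":\"value\"}, meta=null]";
        System.out.println(normalizeMcpToolOutput(wrapped));         // {"key":"value"}
        System.out.println(normalizeMcpToolOutput("{\"ok\":true}")); // {"ok":true}
    }
}
```

The greedy capture group means a payload that itself contains ", meta=" is kept whole up to the last occurrence, which is the safer direction to err in for JSON bodies.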

Frontend test IDs provide stable anchors for E2E assertions:

Test ID           Component      What it selects
mcp-tools-badge   ChatWindow     The expandable MCP invocation summary on each message
mcp-tools-panel   McpToolsPanel  The sidebar panel listing available tools
mcp-tools-toggle  McpToolsPanel  The show/hide toggle button inside the panel

Tip: Query the audit log with PostgreSQL's JSONB operators to analyse tool usage patterns across interactions. This query finds all interactions where at least one tool call failed:

SELECT mcp_invocations
FROM interactions
WHERE mcp_invocations IS NOT NULL
  AND mcp_invocations @> '[{"success":false}]';
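
Once audit rows have been loaded, the same kind of analysis can continue in application code. The sketch below aggregates call counts, failure counts, and average duration per tool. The Invocation and Stats records and the perTool() method are hypothetical names introduced for this example; they mirror only the toolName, success, and durationMs fields of the stored JSON.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToolUsageStatsSketch {

    // Minimal view of one stored invocation (hypothetical type for this example).
    record Invocation(String toolName, boolean success, long durationMs) {}

    record Stats(long calls, long failures, double avgDurationMs) {}

    /** Groups invocations by tool name and summarises each group. */
    static Map<String, Stats> perTool(List<Invocation> invocations) {
        Map<String, long[]> acc = new TreeMap<>(); // value: [calls, failures, totalMs]
        for (Invocation inv : invocations) {
            long[] a = acc.computeIfAbsent(inv.toolName(), k -> new long[3]);
            a[0]++;
            if (!inv.success()) a[1]++;
            a[2] += inv.durationMs();
        }
        Map<String, Stats> out = new TreeMap<>(); // sorted by tool name
        acc.forEach((tool, a) -> out.put(tool, new Stats(a[0], a[1], (double) a[2] / a[0])));
        return out;
    }

    public static void main(String[] args) {
        List<Invocation> rows = List.of(
                new Invocation("jira_search_issues", true, 800),
                new Invocation("jira_search_issues", false, 1200),
                new Invocation("jira_get_issue", true, 300));
        perTool(rows).forEach((tool, s) -> System.out.println(tool + " -> " + s));
    }
}
```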