removing one llm round-trip from a voice caddy

companion piece to the golf case study. this is the long version of the "voice-to-action latency was the whole product" hard problem.

the setup

the voice caddy in the golf app does two things: caddy advice ("what club for 165 yards into a two-club wind?") and bookkeeping ("blake got 5, jake got 4, i got 6"). both run through a single mastra agent with twenty-two tools. the agent uses claude haiku for routing and structured outputs, and claude sonnet for the longer-form caddy reasoning.

the failure mode was simple. every voice message triggered a read_round_context tool call before the agent could decide what to do. that's:

1. user finishes speaking
2. transcription returns text
3. agent decides "i need round context"
4. tool runs (db round-trip)
5. agent receives context, decides what to actually do
6. agent runs another tool, or just answers

steps 3–5 added one full llm round-trip plus one db round-trip, in series, before the agent generated a single output token. on a putting green over LTE, the user had already finished typing the score in by hand by the time the agent reached step 6.
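the arithmetic behind steps 3–5, as a tiny model. this is a sketch, not a measurement: the stage names are made up, and it treats every llm call as the same fixed cost. the overlap flag is an assumption too. the context fetch only needs the round id, not the transcript, so in principle it can start while transcription is still running.

```typescript
// simplified latency model for the voice pipeline. all names are hypothetical,
// and every llm call is treated as the same fixed cost.
interface StageLatencies {
  transcribeMs: number; // speech-to-text returns the transcript
  llmMs: number; // one model call, up to the point it has decided what to do
  dbMs: number; // one database round-trip to load round context
}

// original flow: transcription, then the model asks for read_round_context,
// then the db call, then a second model call decides what to actually do.
// everything is strictly serial, so the costs add up.
function msWithContextToolCall(l: StageLatencies): number {
  return l.transcribeMs + l.llmMs + l.dbMs + l.llmMs;
}

// pre-loaded flow: the api fetches the context and the model is called once.
// if the fetch starts alongside transcription (assumption), only the slower
// of the two sits on the critical path; otherwise they add up.
function msWithPreloadedContext(
  l: StageLatencies,
  overlapWithTranscription: boolean,
): number {
  const beforeModel = overlapWithTranscription
    ? Math.max(l.transcribeMs, l.dbMs)
    : l.transcribeMs + l.dbMs;
  return beforeModel + l.llmMs;
}
```

with the fetch running after transcription, the two functions differ by exactly `llmMs`, which is the one round-trip in the title. overlapping the fetch with transcription additionally saves the shorter of the two stages.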

the fix

context pre-loading. the api builds a RoundContext blob in parallel (current hole, players, scores, configured games, weather, golf bag) and injects it as a system message before the agent runs. the agent still has the read_round_context tool for repair queries, like someone asking "wait, what did blake have on five?" about a hole that was scored hours ago and isn't in the active context window, but it doesn't need the tool on the happy path.
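roughly what "builds a RoundContext blob in parallel" can look like. this is a minimal sketch that assumes a hypothetical `RoundSources` interface in place of the real db and weather service. `buildRoundContextSketch` and every field name here are illustrative, not the app's actual code. the point is that all six lookups start before any of them is awaited, so the build costs about as much as the slowest lookup rather than the sum of all six.

```typescript
// hypothetical shape of the pre-loaded context; field names are illustrative.
interface PlayerScore {
  player: string;
  hole: number;
  strokes: number;
}
interface RoundContext {
  currentHole: number;
  players: string[];
  scores: PlayerScore[];
  games: string[];
  weather: { windMph: number; windDirection: string } | null;
  bag: string[];
}

// hypothetical data-access interface standing in for the real db / weather service.
interface RoundSources {
  currentHole(roundId: string): Promise<number>;
  players(roundId: string): Promise<string[]>;
  scores(roundId: string): Promise<PlayerScore[]>;
  games(roundId: string): Promise<string[]>;
  weather(roundId: string): Promise<RoundContext["weather"]>;
  bag(roundId: string): Promise<string[]>;
}

// every lookup is kicked off inside the array literal before anything is awaited,
// so they run concurrently and Promise.all waits for the slowest one.
async function buildRoundContextSketch(
  roundId: string,
  src: RoundSources,
): Promise<RoundContext> {
  const [currentHole, players, scores, games, weather, bag] = await Promise.all([
    src.currentHole(roundId),
    src.players(roundId),
    src.scores(roundId),
    src.games(roundId),
    src.weather(roundId),
    src.bag(roundId),
  ]);
  return { currentHole, players, scores, games, weather, bag };
}
```

one design choice worth making on purpose: `Promise.all` rejects as soon as any lookup fails, so a flaky weather call would take the whole context down with it. if weather is optional, fetching it through `Promise.allSettled` and falling back to `null` keeps the happy path intact.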

// packages/golf/domains/src/packages/ai/...
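// built before the agent runs: current hole, players, scores, games, weather, bag
const context = await buildRoundContext({ roundId, db, weatherService });
const messages = [
  { role: "system", content: SYSTEM_PROMPT },
  // pre-loaded context as a second system message, so the happy path
  // doesn't need a read_round_context call
  { role: "system", content: serializeRoundContext(context) },
  ...userMessages,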
];
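the snippet above leaves `serializeRoundContext` opaque. here is one hedged guess at a serializer, plus the kind of lookup a repair tool would do. it assumes the pre-loaded blob only carries the last few holes (the `windowHoles` knob is made up), which is one reading of "isn't in the active context window"; older holes go through the repair path instead. all types and function names are hypothetical, and the repair lookup filters an in-memory list where the real tool would hit the db.

```typescript
// minimal, hypothetical context shape; the real RoundContext has more fields.
interface PlayerScore {
  player: string;
  hole: number;
  strokes: number;
}
interface RoundContext {
  currentHole: number;
  players: string[];
  scores: PlayerScore[];
}

// serializer sketch: plain text, recent holes only, to keep the system message short.
// `windowHoles` is an illustrative parameter, not something from the original code.
function serializeRoundContextSketch(ctx: RoundContext, windowHoles = 3): string {
  const firstHole = Math.max(1, ctx.currentHole - windowHoles + 1);
  const recent = ctx.scores.filter(
    (s) => s.hole >= firstHole && s.hole <= ctx.currentHole,
  );
  const lines = [
    `current hole: ${ctx.currentHole}`,
    `players: ${ctx.players.join(", ")}`,
    `scores (holes ${firstHole}-${ctx.currentHole}):`,
    ...recent.map((s) => `  hole ${s.hole}: ${s.player} ${s.strokes}`),
  ];
  return lines.join("\n");
}

// repair-lookup sketch: answer "what did blake have on two?" for a hole
// that fell outside the pre-loaded window.
function lookupHoleScores(ctx: RoundContext, hole: number): PlayerScore[] {
  return ctx.scores.filter((s) => s.hole === hole);
}
```

the window size is the trade-off: a wider window means fewer read_round_context calls, but more tokens on every single request.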

what i'd do differently

i'd build the context-builder before the agent. the order i went in was:

1. ship the agent with all twenty-two tools
2. write tracing
3. notice every conversation started with read_round_context
4. build the context-builder

if i'd built the context-builder first, read_round_context would have existed only as a repair affordance from day one, and i would have shipped without the latency hit. but i didn't know context pre-loading was the right move until i had production-ish traces in front of me.

what's still open

per-tool latency tracing in langfuse. the happy path went from a minimum of two llm round-trips to a minimum of one, but i don't have a clean p95 chart broken down by tool yet. that's the next post.
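the aggregation half of that chart is small. a sketch, assuming trace spans have already been exported as `{ tool, durationMs }` records; that shape is hypothetical, and this does not touch the langfuse sdk. it uses the nearest-rank definition of a percentile.

```typescript
// hypothetical exported span shape; a real trace export will look different.
interface ToolSpan {
  tool: string;
  durationMs: number;
}

// nearest-rank percentile: the smallest sample with at least p% of samples
// at or below it. integer math for the rank avoids float rounding surprises.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// group span durations by tool name, then take p95 per tool.
function p95ByTool(spans: ToolSpan[]): Map<string, number> {
  const byTool = new Map<string, number[]>();
  for (const s of spans) {
    const list = byTool.get(s.tool) ?? [];
    list.push(s.durationMs);
    byTool.set(s.tool, list);
  }
  const out = new Map<string, number>();
  byTool.forEach((durations, tool) => out.set(tool, percentile(durations, 95)));
  return out;
}
```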