removing one llm round-trip from a voice caddy
companion piece to the golf case study. this is the long version of the "voice-to-action latency was the whole product" hard problem.
the setup
the voice caddy in the golf app does two things: caddy advice ("what club for 165 yards into a two-club wind?") and bookkeeping ("blake got 5, jake got 4, i got 6"). both run through a single mastra agent with twenty-two tools. the agent uses claude haiku for routing and structured outputs, and claude sonnet for the longer-form caddy reasoning.
the failure mode was simple. every voice message triggered a read_round_context tool call before the agent could decide what to do. that's:
1. user finishes speaking
2. transcription returns text
3. agent decides "i need round context"
4. tool runs (db round-trip)
5. agent receives context, decides what to actually do
6. agent runs another tool, or just answers
steps 3–5 added one full llm round-trip plus one db round-trip in series before any output token was generated. on a putting green over LTE, the user had finished typing the score manually by the time the agent got to step 6.
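the arithmetic behind that can be sketched as a tiny latency model. this is an illustration only: the HopLatencies type, the serialLatencyMs helper, and the 800 ms / 50 ms figures are hypothetical placeholders, not measurements from the app.

```typescript
// illustrative latency model. all names and numbers here are hypothetical
// placeholders, not measurements from the golf app.
interface HopLatencies {
  llmRoundTripMs: number; // one request/response with the model
  dbRoundTripMs: number; // one database read
}

// hops that run in series simply add up, because nothing overlaps.
function serialLatencyMs(
  llmRoundTrips: number,
  dbRoundTrips: number,
  hop: HopLatencies,
): number {
  return llmRoundTrips * hop.llmRoundTripMs + dbRoundTrips * hop.dbRoundTripMs;
}

const placeholder: HopLatencies = { llmRoundTripMs: 800, dbRoundTripMs: 50 };

// failure mode: llm call (decide to fetch) -> db read -> llm call (decide what to do).
const beforeFirstToken = serialLatencyMs(2, 1, placeholder);

// the same path with the first llm call taken off the critical path.
const withOneFewerLlmCall = serialLatencyMs(1, 1, placeholder);

// the difference is exactly one llmRoundTripMs, whatever the real values are.
console.log(beforeFirstToken - withOneFewerLlmCall);
```

whatever the real per-hop numbers are, the extra cost in this model is always one full llm round-trip, which is usually the slowest hop in the chain.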
the fix
context pre-loading. the api builds a RoundContext blob in parallel (current hole, players, scores, configured games, weather, golf bag) and injects it as a system message before the agent runs. the agent still has a read_round_context tool for repair queries (someone says "wait, what did blake have on five?" for a hole that was scored hours ago and isn't in the active context window) but it doesn't need it on the happy path.
// packages/golf/domains/src/packages/ai/...
const context = await buildRoundContext({ roundId, db, weatherService });
const messages = [
  { role: "system", content: SYSTEM_PROMPT },
  { role: "system", content: serializeRoundContext(context) },
  ...userMessages,
];
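for readers who want a feel for what the two helpers might do, here is a minimal sketch. everything beyond the names buildRoundContext, serializeRoundContext, and RoundContext is an assumption: the RoundDb and WeatherService interfaces, their method names, the field shapes, and the in-memory fakes are all hypothetical, and the real builder may fetch or serialize differently. the one idea it shows is starting every read at once with Promise.all so the wait is roughly the slowest read rather than the sum.

```typescript
// sketch only: RoundDb, WeatherService, and the RoundContext fields are assumed shapes.
interface RoundContext {
  currentHole: number;
  players: string[];
  scores: Record<string, number[]>;
  games: string[];
  weather: string;
  bag: string[];
}

interface RoundDb {
  getCurrentHole(roundId: string): Promise<number>;
  getPlayers(roundId: string): Promise<string[]>;
  getScores(roundId: string): Promise<Record<string, number[]>>;
  getGames(roundId: string): Promise<string[]>;
  getBag(roundId: string): Promise<string[]>;
}

interface WeatherService {
  current(roundId: string): Promise<string>;
}

async function buildRoundContext(args: {
  roundId: string;
  db: RoundDb;
  weatherService: WeatherService;
}): Promise<RoundContext> {
  const { roundId, db, weatherService } = args;
  // kick off every read together; Promise.all resolves once all have finished.
  const [currentHole, players, scores, games, bag, weather] = await Promise.all([
    db.getCurrentHole(roundId),
    db.getPlayers(roundId),
    db.getScores(roundId),
    db.getGames(roundId),
    db.getBag(roundId),
    weatherService.current(roundId),
  ]);
  return { currentHole, players, scores, games, weather, bag };
}

// a labelled JSON dump is one simple way to put the blob into a system message.
function serializeRoundContext(context: RoundContext): string {
  return `round context:\n${JSON.stringify(context, null, 2)}`;
}

// hypothetical in-memory fakes, just to exercise the sketch.
const fakeDb: RoundDb = {
  getCurrentHole: async () => 7,
  getPlayers: async () => ["blake", "jake", "me"],
  getScores: async () => ({ blake: [5], jake: [4], me: [6] }),
  getGames: async () => ["skins"],
  getBag: async () => ["driver", "7 iron", "putter"],
};
const fakeWeather: WeatherService = { current: async () => "two-club wind" };

buildRoundContext({ roundId: "r1", db: fakeDb, weatherService: fakeWeather })
  .then((context) => console.log(serializeRoundContext(context)));
```

note that Promise.all rejects as soon as any one read fails, so a flaky weather lookup would sink the whole context in this sketch. Promise.allSettled with per-field fallbacks is one way around that.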
what i'd do differently
i'd build the context-builder before the agent. the order i went in was:
- ship the agent with all twenty-two tools
- write tracing
- notice every conversation started with read_round_context
- build the context-builder
if i'd built the context-builder first, the tool would have existed only as a repair affordance from day one, and i would have shipped without the latency hit. but i didn't know context pre-loading was the right move until i had production-ish traces in front of me.
what's still open
per-tool latency tracing in langfuse. the round-trip count went from "two round-trips minimum" to "one round-trip minimum" but i don't have a clean p95 chart broken down by tool yet. that's the next post.
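as a rough idea of what a per-tool p95 involves, here is a minimal in-memory sketch. it does not use the langfuse api at all: the timed wrapper, the durationsByTool map, and the percentile helper are hypothetical names, and the nearest-rank percentile is just one common definition.

```typescript
// sketch: in-memory per-tool timing. not the langfuse api; all names are hypothetical.
const durationsByTool = new Map<string, number[]>();

// wrap a tool call and record how long it took, even if it throws.
async function timed<T>(toolName: string, run: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await run();
  } finally {
    const samples = durationsByTool.get(toolName) ?? [];
    samples.push(Date.now() - start);
    durationsByTool.set(toolName, samples);
  }
}

// nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// one p95 value per tool that has at least one recorded call.
function p95ByTool(): Record<string, number> {
  const result: Record<string, number> = {};
  for (const [toolName, samples] of durationsByTool) {
    result[toolName] = percentile(samples, 95);
  }
  return result;
}
```

a call site would look like `await timed("read_round_context", () => runTool())`, where runTool stands in for whatever actually executes the tool.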