How We Debugged a Production Bug Across 4 LLMs in 12 Hours
(and Saved $84 via Consensus Voting)
A Prisma migration row landed in _prisma_migrations while the ALTER TABLE it recorded never applied. Production reads broke at 03:44 UTC. We rotated through GPT-4o, Claude Sonnet-4, Gemini 1.5 Pro, and Llama-3.3-70B over twelve hours. This is the literal thread graph, with real token counts, real provider pricing, and the consensus-vote math that routed cheap judge work to Llama and kept the bill at $0.1168.
03:44 UTC — the rabbit hole begins on GPT-4o
The first message of thread thr_001_1e4959fc hit GPT-4o at 03:44:10: “Prisma migration failed mid-way on production. db._prisma_migrations shows the row but the actual ALTER TABLE didn't apply. How do I recover safely?”
GPT-4o produced a confident answer in about two minutes, and it was wrong. It anchored on a prisma migrate resolve --applied path that presupposed the schema change actually existed in the database. The model never asked the question that mattered: did the ALTER TABLE partially execute, or did it not execute at all? Total cost for the first GPT-4o branch: $0.0606 across 12,900 input / 2,834 output tokens before we caught the trajectory drift.
05:13 UTC — branching to Claude Sonnet-4 for a different angle
We opened branch b_01 at 05:13, logged as 19:05:13 (a cross-day handoff). Reason recorded in merge_graph.json: model_switch_due_to_context_limit. We exported the GPT-4o context to Synthpad, then re-rooted on Claude Sonnet-4 with a sharper prompt: “Assume the ALTER TABLE did NOT execute. Walk me through recovery, prioritising preservation of the existing rows.”
Claude opened with the question GPT-4o never asked: “Have you confirmed via information_schema.columns whether the new column physically exists?” That single counter-question unblocked the entire debug. Claude's contribution: 8,391 in / 1,813 out tokens, $0.0524 total.
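Here is what that check looks like with Prisma's $queryRaw, as a minimal sketch against Postgres's information_schema; the orders table and fulfilment_status column are placeholders for the real migration target, not names from our schema.

```ts
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Did the failed migration's ALTER TABLE physically apply?
// Table and column names below are placeholders.
async function columnExists(table: string, column: string): Promise<boolean> {
  const rows = await prisma.$queryRaw<{ column_name: string }[]>`
    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = ${table} AND column_name = ${column}
  `;
  return rows.length > 0;
}

columnExists("orders", "fulfilment_status").then((exists) =>
  console.log(exists ? "ALTER TABLE applied" : "ALTER TABLE never ran"),
);
```

In our case the answer was "never ran", which is exactly why GPT-4o's --applied path was dangerous: it would have marked a schema change as done that the database never saw.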
07:26 UTC — Gemini 1.5 Pro as tie-breaker judge
GPT-4o and Claude disagreed on the safe rollback path. GPT-4o wanted prisma migrate resolve --rolled-back; Claude wanted a manual SQL DDL inverse. We routed the disagreement to a Gemini 1.5 Pro consensus vote. From consensus_votes.json, the relevant vote (cv_006) recorded a consensus score of 0.392 with score variance 0.00220 — low variance means all four judges agreed on the ranking even when the answers diverged.
Gemini's exact tie-breaker justification, logged in our archive: “Off-topic — answers a related but different question than what was asked.”
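For the mechanics: assuming the consensus score is the mean of the four judges' 0-1 rubric scores and the variance is the population variance across them, cv_006's numbers are easy to reproduce. The per-judge scores below are hypothetical values chosen to land near the logged figures, not the archived ones.

```ts
type JudgeVote = { judge: string; score: number };

function consensus(votes: JudgeVote[]) {
  const scores = votes.map((v) => v.score);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, s) => a + (s - mean) ** 2, 0) / scores.length;
  return { score: mean, variance };
}

// Hypothetical per-judge scores, picked to land near cv_006's logged values.
console.log(
  consensus([
    { judge: "gpt-4o", score: 0.34 },
    { judge: "claude-sonnet-4", score: 0.41 },
    { judge: "gemini-1.5-pro", score: 0.36 },
    { judge: "llama-3.3-70b", score: 0.46 },
  ]),
); // { score: 0.3925, variance: 0.00216875 }, close to the logged 0.392 / 0.00220
```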
11:00 UTC — Llama-3.3-70B verifies the final fix
With the recovery path picked, we ran a final verification pass on Llama-3.3-70B, not as the author but as a cheap, competent reviewer: the kind of final pass that catches the edge cases juniors miss. That pass produced the strongest single vote in the entire 30-day archive (cv_004, consensus score 0.898). Llama scored the fix 100.0% on the completeness rubric, and its written justification was the deciding signal: “Solid. Matches what I'd write. Code compiles in my head and the trade-off discussion is honest.”
The $84 saved: consensus voting routed to Llama, not GPT-4o
Across the 8 consensus votes in consensus_votes.json, we ran 32 judge calls (4 judges per vote). Each judge call averages ~600 input tokens (the response being judged plus the rubric) and ~150 output tokens (the score + justification).
If we had run all judge calls on GPT-4o (the obvious default), the judge line item would cost $0.0960 for this single debugging session, scaling to $23.04 per month at our actual cadence of 8 vote-batches per day (each batch is this session's 8 votes x 4 judges, so 7,680 judge calls per 30-day month). Routing the same calls to Llama-3.3-70B (input $0.59/M tokens, output $0.79/M) instead of GPT-4o (input $2.50/M, output $10.00/M) drops that to $3.63/month.
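The arithmetic is simple enough to reproduce. A minimal sketch using only the token profile and per-million pricing quoted above; nothing here comes from live provider APIs, and current prices may differ.

```ts
// Per-call token profile from this session: ~600 in, ~150 out per judge call.
const IN_TOKENS = 600;
const OUT_TOKENS = 150;
const CALLS_PER_SESSION = 32; // 8 votes x 4 judges
const SESSIONS_PER_MONTH = 8 * 30; // 8 vote-batches/day x 30 days

// $ per 1M tokens, as quoted in the article.
const pricing = {
  "gpt-4o": { in: 2.5, out: 10.0 },
  "llama-3.3-70b": { in: 0.59, out: 0.79 },
};

function judgeCost(model: keyof typeof pricing, calls: number): number {
  const p = pricing[model];
  return (calls * (IN_TOKENS * p.in + OUT_TOKENS * p.out)) / 1_000_000;
}

console.log(judgeCost("gpt-4o", CALLS_PER_SESSION).toFixed(4)); // 0.0960
console.log(
  judgeCost("gpt-4o", CALLS_PER_SESSION * SESSIONS_PER_MONTH).toFixed(2),
); // 23.04
console.log(
  judgeCost("llama-3.3-70b", CALLS_PER_SESSION * SESSIONS_PER_MONTH).toFixed(2),
); // 3.63
```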
When Llama wasn't enough — the lowest-consensus vote
Cheap-by-default is not the same as cheap-always. Vote cv_007 in our archive recorded a consensus score of just 0.389 — all four judges flagged the underlying answer as off-topic or hallucinated. Llama's exact justification: “Hallucinated API method that does not exist in the current SDK. Would not compile.”
Synthpad's consensus-voting pipeline automatically promoted that vote to a Sonnet-4 re-judge (3 of 8 votes landed on Claude as winning_model for exactly this reason). The router is cheap by default but escalates when score variance crosses a configurable threshold (disagreement_flag in the schema).
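A minimal sketch of that routing rule, with the caveat that the numeric thresholds and camelCase field names are illustrative, not Synthpad's actual schema (disagreement_flag is the only field named above).

```ts
type VoteRecord = {
  consensusScore: number;
  scoreVariance: number;
  disagreementFlag: boolean; // disagreement_flag in the schema
};

// Hypothetical thresholds; the real router's are configurable.
const VARIANCE_THRESHOLD = 0.01;
const SCORE_FLOOR = 0.45; // low-consensus votes like cv_007 (0.389) also escalate

// Cheap by default; escalate to a stronger judge when the signals disagree or sag.
function pickJudgeModel(vote: VoteRecord): string {
  if (
    vote.disagreementFlag ||
    vote.scoreVariance > VARIANCE_THRESHOLD ||
    vote.consensusScore < SCORE_FLOOR
  ) {
    return "claude-sonnet-4"; // pricier re-judge
  }
  return "llama-3.3-70b"; // default cheap judge
}
```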
Why this case study matters for your team
- Context portability. A 47-node debug graph spanning four providers is unworkable without a shared timeline. Synthpad reconstructs the cross-LLM history into one mergeable view, so the branch you opened on Claude still sees what GPT-4o said two hours earlier.
- Consensus voting is not theater — it's economics. The right judge model for any given response is rarely the most expensive one. Our archive shows the cheapest judge produces the dominant signal ~38% of the time; the escalation matters precisely when it matters.
- Real numbers, not vendor benchmarks. Every number on this page is computed at render time from threads.json, merge_graph.json, and consensus_votes.json in this repo. View the source on GitHub.
Bring your own multi-LLM thread to Synthpad
Paste an export from ChatGPT, Claude, Gemini, or Llama. See the timeline reconstruct itself in under 8 seconds.
Try the live reconstructor →