Switching to Fable 5: The Tuesday That Cost $22,000

By Jordan Ritter

June 12, 2026

We swapped models on a Tuesday morning. By Wednesday morning, we owed Anthropic roughly $22,000.

No runaway script. No leaked credentials. No weekend cron job grinding through a billion tokens. Five engineers opened five normal coding sessions, did normal Tuesday work, and the bill quietly walked off into the sunset.

This is how that happened, why it stopped at twenty-two thousand instead of two hundred and twenty thousand, and the system we built called skills that turned out to be the difference.

Quick context: CopilotKit builds the tooling that teams use to embed AI agents in their products. An "agent" is a large language model — the kind of system behind ChatGPT or Claude — that drives a real workflow: writing code, running tests, opening pull requests. We use them heavily ourselves.

What internal skills are

Every engineering team has tribal knowledge. Always run the formatter before pushing. Group commits by purpose. Never mock an LLM call with a hand-rolled stub. Most teams keep this in a Notion doc nobody reads or a Slack pin that scrolls off in a week.

We took a different approach. We packaged the rules as skills: small, named instruction packs that agents automatically load whenever they fire. Each skill is roughly a one-page Markdown file with a header describing when to use it and a body outlining the procedure. Think git hooks, but for behavior instead of code.

A few real ones:

cr-loop is the code review skill. Before any pull request lands, it dispatches seven independent reviewers at the same diff in parallel. The change does not ship until every finding is fixed and a confirmation round comes back clean.
blitz is the parallel work skill. Ten independent tasks fan out into ten isolated git worktrees, run concurrently, and coordinate the merge back. A day of sequential work becomes an hour of parallel work.
pre-push-quality runs the formatter, linter, type checker, tests, and build before any push. Anything red, you don't push.
aimock records real responses from real models, so our tests don't silently drift the day a model updates underneath us.

There are dozens. They're small, composable, and named after the moment they fire. The win isn't any single skill — it's that every agent session begins from the same baseline of the way we do things here, without anyone having to remember.

A footnote on the economics

Agent runs are not free, and the math is non-obvious. Every "turn" — one back-and-forth between the orchestrating agent and the model — carries the entire conversation so far as input. A long session pays for the same context, over and over. Multiply by sub-agents running in parallel, multiply by five concurrent engineers, and a bad afternoon moves real money.

A normal session for us is a few hundred turns and a handful of sub-agents. Keep that number in mind.

The Tuesday

On 2026-06-10, we flipped the default model on our coding agents to a new candidate. Anthropic had released a new Mythos-class model called fable-5, faster on benchmarks while being smarter on hard reasoning. The kind of upgrade you'd be irresponsible not to try.

Five engineers opened sessions on it that day. A bug fix, a project closeout, a couple of code reviews, and an isolation fix on an open pull request. Tuesday stuff. No alarms went off. The sessions just kept going.

The bill

By the time we looked, those five sessions had together logged 44,796 parent turns and spawned 2,122 sub-agents. One session alone logged 14,283 turns and dispatched 634 sub-agents. The biggest single turn we measured was 991 KB of context — roughly a quarter of a million tokens in one shot.

Estimated cost for the day: about $22,000.

A few hundred turns is normal. Fourteen thousand turns in one session is not. We were off by one to two orders of magnitude on every axis, simultaneously, in every session.

What actually broke

Same instructions. Same skills. Same work. Only the model changed. Four drift modes lit up at once.

Sub-agents got fat. Disposable sub-agents are supposed to stay small and get thrown away. The design ceiling is roughly 200 KB of context each. Ours arrived at 1.3 to 1.9 megabytes — about eight times the budget.
Review loops wouldn't stop. One session logged 462 round-marker references, with explicit fix5 and FIX13-FOLDS labels showing thirteen-plus rounds of fixing on a single piece of work. The skill converged to zero. The model kept finding one more thing.
Fanout exploded mid-run. Instead of running the work it had been asked to do, the agent kept re-decomposing the work and dispatching fresh sub-agents to do it. Single sessions reached hundreds of children.
Single turns blew the budget. A normal turn fits comfortably in twenty thousand tokens. Ours reached two hundred and fifty thousand. That one number dominates the upper end of the bill.

Here's the underlying mechanism, as best we can reconstruct it. Older models obeyed prosaic rules — instructions written as English prose, including negative ones ("don't do that," "keep this small"). fable-5 reads prose much more loosely. It nods along to a polite English "keep sub-agents small" and then quietly carries a megabyte of context. It wants — and works much better against — structural rules: explicit numeric budgets, hard ceilings, machine-checkable gates. The polite version of the rule was no longer binding. The rest of the bill is multiplication.

What saved us

Three things.

We noticed in twelve hours instead of twelve days. The same instrumentation that lets us run seven reviewers in parallel also lets us see when something is running away. Turn count, sub-agent count, per-turn size — all measurable on every active session. Those numbers were how we found out, not the invoice.

The blast was contained. Skills like blitz and cr-loop keep work scoped to isolated worktrees and per-session branches. No shared state to corrupt, no production to roll back, no customer data in the splash zone. Just a very expensive set of cleanups.

The fix lives at the rules layer, not the application layer. We didn't rewrite anyone's code. We rewrote the rules the agents read — structural budgets, hard caps on the sub-agent count, explicit token ceilings per turn, all expressed as numbers rather than adjectives. The same skills system that failed us is the one we used to fix the failure. A CI detector for this exact shape of drift is going in next, so the next model that drifts gets caught at minute fifteen instead of hour twelve.

How a skill actually works

Take cr-loop, the code review skill. When an agent is about to open a pull request, the skill fires automatically. It dispatches seven independent reviewer agents in parallel, each pointed at the same diff with the same verbatim prompt — designed to refute the change rather than confirm it. The orchestrating agent has no authority to declare "looks good, ship it." Only the seven reviewers reporting zero findings constitute convergence. Anything comes back, fix it, dispatch a fresh round of seven, do it again. The loop exits when seven independent agents, with no memory of one another, all agree that the diff is clean.

That is an enforced check, not a polite suggestion. The orchestrator cannot ship the change otherwise. The file describing all of this is about a page of Markdown that starts like this:

---
name: cr-loop
description: Review-fix loop with 7-agent unbiased CR and mandatory convergence to zero.
version: 2.10.1
---

That header is how the agent finds the skill. The body underneath is the procedure. The whole thing lives in a git repository alongside the code, and every change goes through the same review process as production.

Why this keeps getting better

The punchline of the skills system isn't any particular skill. It's the loop.

Every incident teaches the team something. Don't trust the orchestrator's own claim that CI is green — verify with gh pr checks first. Always rebase on latest main before branching. A confirmation round is not optional just because the fix "is obviously correct." Each of those rules came out of a real fuckup. Each is now a line in a skill, with a Why explaining the failure mode and a How to apply explaining the discipline, so a future agent can judge edge cases, not just follow the letter.

When a new agent session starts tomorrow, it loads all of these without anyone asking. The team gets smarter every time something breaks. The smartening compounds.

The lesson

The day was expensive. The lesson was cheaper than you'd think.

If you're going to let AI agents do real work, you need a layer that encodes how you want them to behave, separate from the agents themselves. Call it skills, call it guardrails, call it institutional memory — it doesn't matter what it's called. What matters is that it exists, that it's small and composable, and that you can change it faster than you can change a model.

Because the model will change. And when it does, the team with the rules layer pays $22,000 and writes a blog post. The team without it pays much more and issues a press release.