Three Weeks with Claude Opus 4.7 in Production Agent Loops: What Actually Changed vs Sonnet 4.6

Real cost-per-completion numbers, retry rates, and tool-call accuracy from three weeks of Opus 4.7 vs Sonnet 4.6 inside a six-agent Cloudflare Durable Object stack with a hard $20/day cap and per-call USD logging.

Level: Advanced
Tools: claude · cloudflare-durable-objects · anthropic
Omid Saffari

Founder & CEO, AI Entrepreneur


Anthropic shipped Claude Opus 4.7 GA on April 16, 2026. I waited two weeks before touching production – long enough to watch the X/dev community document the new tokenizer surprises, the thinking.budget_tokens 400 errors, and at least three "where did my $300 go" posts from people who swapped models without adjusting their cost caps. On May 1, I started a controlled A/B inside the omidsaffari-admin stack: six Cloudflare Durable Object agents, one shared cost chokepoint, every API call logged to ai_call_log with agent_id, workflow_instance_id, tokens in/out, cached tokens, and USD cost. Three weeks of that data is what this article is built on. No synthetic benchmarks. No Anthropic press release numbers. ai_call_log rows.

The setup, in one read

The omidsaffari-admin stack runs six DO agents: intake, planner, researcher, writer, editor, and publisher. Each agent is a Cloudflare Durable Object with its own Alarm-driven wake cycle. They share nothing except a single D1 table (ai_call_log) and a KV namespace for the shared system-prompt cache segment.

Every Anthropic API call goes through one function – callClaude() – that wraps the SDK, computes USD cost from the usage block on the response, and writes a row to ai_call_log before returning. The row looks like this:

ai_call_log write
ts
await db.prepare(`
  INSERT INTO ai_call_log
    (id, agent_id, workflow_instance_id, model, input_tokens,
     output_tokens, cache_read_tokens, cache_write_tokens,
     usd_cost, tool_calls_made, tool_calls_succeeded, ts)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`).bind(
  ulid(),                                  // monotonic ID (the "ulid" npm package)
  ctx.agentId,
  ctx.workflowInstanceId,
  response.model,
  usage.input_tokens,
  usage.output_tokens,
  usage.cache_read_input_tokens ?? 0,
  usage.cache_creation_input_tokens ?? 0,  // SDK field name for cache-write tokens
  computeUsdCost(response.model, usage),
  toolCallsMade,
  toolCallsSucceeded,
  Date.now(),
).run();
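
computeUsdCost() isn't shown in the post. A minimal sketch of how a helper like it can look, assuming a per-model rate table keyed by model-ID prefix – every number in the table is a placeholder for your tier's actual pricing; the article only quotes the two cache figures ($0.30/MTok reads on both models, $3.75/MTok cache writes on Opus 4.7):

computeUsdCost sketch
ts
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
};

type MTokRates = { input: number; output: number; cacheRead: number; cacheWrite: number };

// Placeholder table – fill in your tier's per-MTok pricing, keyed by model-ID prefix.
const RATE_TABLE: Record<string, MTokRates> = {
  // e.g. "claude-opus-4-7": { input: ..., output: ..., cacheRead: 0.30, cacheWrite: 3.75 },
};

function computeUsdCost(model: string, usage: Usage): number {
  const prefix = Object.keys(RATE_TABLE).find((k) => model.startsWith(k));
  if (!prefix) throw new Error(`No rate entry for model ${model}`);
  const r = RATE_TABLE[prefix];
  const perTok = (ratePerMTok: number) => ratePerMTok / 1_000_000; // $/MTok -> $/token
  return (
    usage.input_tokens * perTok(r.input) +
    usage.output_tokens * perTok(r.output) +
    (usage.cache_read_input_tokens ?? 0) * perTok(r.cacheRead) +
    (usage.cache_creation_input_tokens ?? 0) * perTok(r.cacheWrite)
  );
}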

The daily cap enforcement is a separate Durable Object Alarm that sums usd_cost for the current UTC day and rejects new calls with a 429 when the sum exceeds $20. That cap is the entire reason I could run this experiment without anxiety. As I wrote in the production Claude Code orchestration piece, the cost shape at scale is the deeper issue – the cap pattern is what makes experimentation safe.
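
The cap DO isn't shown either. A minimal sketch of its gate half, assuming callClaude() fetches this Durable Object before every API call and aborts on a 429 – the real stack drives the check from an Alarm and presumably caches the running total, whereas this simplified version re-sums the table on every check:

daily cap gate (sketch)
ts
interface Env {
  DB: D1Database;
}

export class DailyCostCap {
  constructor(private state: DurableObjectState, private env: Env) {}

  // callClaude() is assumed to fetch() this DO before each API call
  // and to refuse the call when it gets a 429 back.
  async fetch(_req: Request): Promise<Response> {
    const today = new Date().toISOString().slice(0, 10); // current UTC day, YYYY-MM-DD
    const row = await this.env.DB.prepare(
      `SELECT COALESCE(SUM(usd_cost), 0) AS total
         FROM ai_call_log
        WHERE DATE(ts / 1000, 'unixepoch') = ?`
    ).bind(today).first<{ total: number }>();

    if ((row?.total ?? 0) >= 20) {
      return new Response("Daily $20 cap reached", { status: 429 });
    }
    return new Response("ok");
  }
}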

The A/B split was per-agent, not per-call. Intake and planner ran Opus 4.7 from day one. Researcher, writer, editor, and publisher stayed on Sonnet 4.6 for the first week, then flipped one per day over days 8–11. This gave me clean baselines on identical workloads before any model swap.

Cost per completion, the only number that matters

"Cost per completion" here means: USD spent from the moment a workflow_instance_id is created until the publisher agent writes its final row. One content piece, start to finish.

The Sonnet 4.6 baseline across 41 workflow completions in week one:

Average total per completion: $0.061. That included retries. Across those 41 completions, daily spend averaged $2.50 against the $20 cap.

Opus 4.7 on the same workload profile (weeks two and three, 38 completions after the full swap):

Average total per completion: $0.243. That's a 3.98× cost multiplier at the completion level, not the per-token level. The new tokenizer added roughly 8–10% to input token counts on my prose-heavy prompts (within Anthropic's stated 1.0–1.35× range). But the real driver is output: Opus 4.7 at high effort produces more tokens than Sonnet 4.6 at the same effort level, and the per-output-token price is significantly higher.

Where this doesn't hurt you: the cache read token rate is the same dollar amount on both models (at the time of writing, $0.30/MTok for Sonnet 4.6 cache reads vs $0.30/MTok for Opus 4.7 cache reads on my tier). My heavy cache hit rate on the researcher and writer agents means those agents' cost delta is partly absorbed – without cache, the gap would be wider.

Where it does hurt: any agent doing long output. Writer went from $0.0248 to $0.0894 per call – a 3.6× jump driven almost entirely by 1,400 extra output tokens that Opus produces before it's "done." Whether those tokens are worth $0.0646 depends entirely on whether they save you a retry.

Retry rate and tool-call quality

The omidsaffari-admin stack uses a 12-tool MCP-style surface for article mutations: create_draft, update_section, add_citation, set_metadata, flag_for_review, promote_to_published, and six more. Tool selection accuracy – did the agent call the right tool, with valid args, on the first attempt – is the number that feeds directly into retry rate.
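
For context, one entry on that surface looks like this in the Anthropic Messages API tool format – the add_citation name comes from the stack, but the fields are illustrative assumptions, not the real schema:

add_citation tool shape (illustrative)
ts
const addCitationTool = {
  name: "add_citation",
  description: "Attach a source citation to a section of the current draft.",
  input_schema: {
    type: "object" as const,
    properties: {
      section_id: { type: "string", description: "Section of the draft to annotate" },
      url: { type: "string", description: "Source URL" },
      quote: { type: "string", description: "Exact passage being cited" },
    },
    required: ["section_id", "url"],
  },
};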

Sonnet 4.6 baseline tool-call accuracy across the full 12-tool surface: 81.3% on first attempt. That's tool_calls_succeeded / tool_calls_made per ai_call_log row, filtered to rows where tool_calls_made > 0. The failure modes were: wrong tool selected (6%), malformed args (9%), valid call but logically incorrect given context (3.7%).
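
Written as a query, pooled across rows (reading the description as an average of per-row ratios instead would shift the number slightly); modelPrefix is whatever model-ID prefix you're filtering on:

first-attempt tool accuracy
ts
const accuracy = await db.prepare(`
  SELECT CAST(SUM(tool_calls_succeeded) AS REAL) / SUM(tool_calls_made)
         AS first_attempt_accuracy
  FROM ai_call_log
  WHERE tool_calls_made > 0
    AND model LIKE ?  -- model-ID prefix pattern for the model under test
`).bind(`${modelPrefix}%`).first<{ first_attempt_accuracy: number }>();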

Opus 4.7 on the same tool surface: 93.7% first-attempt accuracy. The "wrong tool selected" failure essentially disappeared (down to 0.8%). Malformed args dropped to 3.1%. The "logically incorrect" failures dropped to 2.4%.

The retry cost math: a retry on Sonnet 4.6 costs roughly $0.014 (one researcher-class call). A retry on Opus 4.7 costs roughly $0.052. Opus 4.7 needs to prevent 3.7 retries per completion to break even on tool-call accuracy alone. My data shows it prevents an average of 2.1 retries per completion. That does not break even.

Where it does pay: the eight-turn intake loop. The intake agent runs a structured interview to fill a brief schema – 8 turns, each with a tool call to update_brief_field. Sonnet 4.6 required an average of 11.3 turns to complete an 8-turn loop (failed fields required re-asking). Opus 4.7 completes the same loop in 8.4 turns on average. At intake's per-call cost, that's $0.0028 in savings per completion – tiny. But the intake agent was also the one generating the most malformed add_citation calls downstream (bad field population cascaded through the pipeline), and fixing that upstream saved an average of 1.4 writer-agent retries per completion, which is $0.125 saved at Opus prices.

The editor agent is where Sonnet 4.6 still wins. The editor runs a single-pass structured review and calls flag_for_review or promote_to_published. It's a binary decision tool. Sonnet 4.6 accuracy: 96.1%. Opus 4.7: 96.4%. Indistinguishable. The editor is a short-context, low-tool-count task and Sonnet 4.6 is entirely adequate. I flipped it back to Sonnet 4.6 on day 19 and saved $0.0278 per completion without any quality change.

Prompt cache and the 1-hour TTL trap

Anthropic's prompt cache has a 1-hour TTL. When you swap models mid-cache-window, the cache segment is invalidated – a cold write on the next call, at the write token rate ($3.75/MTok for Opus 4.7). On a researcher agent with 8,900 cached tokens, that's a $0.033 cache miss penalty on the first call after a model swap. Not catastrophic. But it bit me on day 8 when I flipped researcher from Sonnet to Opus at 2pm, right after a 9am cache write. Six simultaneous workflow instances all had cold researcher calls within the first hour.

The fix is cache-control segmentation. I split the system prompt into two blocks: a stable block (the tool definitions, the output schema, the persona – things that never change) and a volatile block (the daily briefing, the active article context – things that change per-workflow). Only the stable block gets cache_control: ephemeral. The volatile block is always fresh.

prompt-cache segmentation
ts
const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: STABLE_SYSTEM_BLOCK, // tool defs, schema, persona
        cache_control: { type: "ephemeral" },
      },
      {
        type: "text",
        text: buildVolatileBlock(ctx), // per-workflow context
        // no cache_control — always billed as input tokens
      },
    ],
  },
];

With this segmentation, a model swap invalidates only the stable block (one cold write per agent per hour). The volatile block was never cached anyway. Cache hit rate on the stable block before segmentation: 73% across all agents. After segmentation: 89%. The hit-rate improvement came from reducing the total cache surface to only the content that actually repeats.
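
The post doesn't give the hit-rate formula; one reasonable token-level reading, from the same log table, is cached reads over reads plus writes:

cache hit rate (one possible definition)
ts
const sinceTs = Date.now() - 7 * 24 * 60 * 60 * 1000; // e.g. trailing 7 days

const hitRate = await db.prepare(`
  SELECT CAST(SUM(cache_read_tokens) AS REAL) /
         NULLIF(SUM(cache_read_tokens) + SUM(cache_write_tokens), 0) AS cache_hit_rate
  FROM ai_call_log
  WHERE ts >= ?
`).bind(sinceTs).first<{ cache_hit_rate: number }>();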

The other Opus-specific cache behavior I measured: Opus 4.7 writes slightly larger stable blocks than Sonnet 4.6 because the tool definitions serialize to more tokens with the new tokenizer (roughly 12% more on my 12-tool surface). That means a slightly higher cache-write cost on the first call per hour. Over a day with normal Alarm wake cycles, the difference is around $0.08/day across all six agents – noise.

The mid-flow model swap

If you're swapping models while a workflow is mid-execution, the workflow_instance_id's context is already partially built in the old model's token budget. I had one intake agent hand off a planner context built under Opus assumptions to a Sonnet planner during the rollout week. The planner hallucinated two tool calls because the intake had written a longer brief than Sonnet's planner expected. Always swap at workflow boundaries, not mid-execution.
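
The post doesn't show how it enforces that rule. One way to make mid-flow swaps impossible is to snapshot the per-agent model choices at the moment the workflow instance is created, so flipping the defaults only affects new instances – the AGENT_MODEL_DEFAULTS map, its placeholder values, and the workflow_instance table are assumptions for illustration:

pin models at workflow creation (sketch)
ts
// Placeholder identifiers, not real model IDs.
const AGENT_MODEL_DEFAULTS: Record<string, string> = {
  intake: "opus",
  planner: "opus",
  researcher: "opus",
  writer: "opus",
  editor: "sonnet",
  publisher: "sonnet",
};

async function createWorkflowInstance(db: D1Database): Promise<string> {
  const id = ulid(); // same ID helper as the logging snippet
  // Snapshot the map: a later flip of AGENT_MODEL_DEFAULTS only affects
  // workflows created after the flip, never one that is mid-execution.
  await db.prepare(
    `INSERT INTO workflow_instance (id, model_map, created_at) VALUES (?, ?, ?)`
  ).bind(id, JSON.stringify(AGENT_MODEL_DEFAULTS), Date.now()).run();
  return id;
}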

The rollout pattern I would use again

Per-agent canary is the only safe pattern for a stack like this. You cannot canary per-call on a stateful DO – the agent's Alarm context carries model-dependent assumptions forward, and a 50/50 split within an agent's execution lifecycle creates the hallucination scenario above.

The pattern:

  1. Pick your lowest-stakes agent first

    Intake was my canary. It's the cheapest agent, has the shortest context, and its failures are visible immediately (a bad brief gets rejected in the planner's first call). One week of intake on Opus 4.7 gave me real cost numbers before I touched researcher or writer.

  2. Run a parallel cost projection before each flip

    Before flipping each agent, I queried ai_call_log for the last 7 days of that agent's costs, multiplied by the Opus/Sonnet price ratio observed so far (3.8× on output tokens, 1.1× on input tokens, same on cache reads), and checked the projected daily total against the $20 cap (a simplified sketch of this projection follows the list). Writer was the one that nearly busted the cap – I had to tighten the default_pool_size on the writer's Alarm to reduce simultaneous instances from 4 to 2.

  3. Add a per-model cost-cap breaker to callClaude()

    I added a model-specific daily limit inside callClaude() – separate from the global cap – that hard-blocks Opus calls above $12/day (leaving $8/day headroom for Sonnet agents still in the stack). This saved me on day 14 when a researcher loop got stuck on a malformed add_citation call and retried 11 times before the breaker fired.

    per-model cap check
    ts
    async function callClaude(params: ClaudeCallParams) {
      const today = utcDateString();
      const modelSpend = await db.prepare(
        `SELECT COALESCE(SUM(usd_cost), 0) AS total
           FROM ai_call_log
          WHERE model LIKE ? AND DATE(ts / 1000, 'unixepoch') = ?`
      ).bind(`${params.model}%`, today).first<{ total: number }>();

      if ((modelSpend?.total ?? 0) >= MODEL_DAILY_CAPS[params.model]) {
        throw new CostCapError(`Daily cap reached for ${params.model}`);
      }
      // ... rest of call
    }
  4. Keep Sonnet 4.6 in the stack on purpose

    Editor and publisher are staying on Sonnet 4.6. They're binary-decision agents with short context and high cache hit rates. Sonnet 4.6 accuracy on those tasks is within 0.3% of Opus 4.7. The cost difference is $0.031/completion. Over 30 completions a day, that's $0.93/day – more than enough to run one extra Opus researcher call. Don't upgrade everything.
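
For reference, the projection from step 2, simplified – the step scales per token type (3.8× output, 1.1× input, 1× cache reads), while this sketch applies one blended multiplier to the agent's trailing 7-day spend, so pick the multiplier from your own observed ratios and the agent's input/output mix:

cost projection before a flip (sketch)
ts
async function projectDailySpendAfterFlip(
  db: D1Database,
  agentId: string,
  blendedMultiplier: number // derived from observed ratios for this agent's workload
): Promise<number> {
  const row = await db.prepare(`
    SELECT COALESCE(SUM(usd_cost), 0) / 7.0 AS daily
    FROM ai_call_log
    WHERE agent_id = ? AND ts >= ?
  `).bind(agentId, Date.now() - 7 * 24 * 60 * 60 * 1000)
    .first<{ daily: number }>();

  return (row?.daily ?? 0) * blendedMultiplier;
}

// Flip the agent only if the projection leaves headroom under the $20/day cap.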

The final steady-state cost per completion: $0.187. That's Opus 4.7 on intake, planner, researcher, and writer; Sonnet 4.6 on editor and publisher. Down from the all-Opus $0.243 and up from the all-Sonnet $0.061. The 3.1× cost increase over the Sonnet baseline is real and it's permanent at current pricing.

Whether that 3.1× is worth it depends on what you're optimizing. My pipeline produces publishable drafts with 23% fewer editor interventions after the swap. If an editor intervention costs 15 minutes of human time, the math works at almost any billing rate. If your agent loop is doing something more mechanical – extraction, classification, structured transforms – Sonnet 4.6 is almost certainly the right model for most of the pipeline.

Key Takeaways

  • Opus 4.7 costs 3.98× as much per completion as Sonnet 4.6 on a prose-heavy, tool-calling agent loop – roughly 3× at the token level but ~4× at the workflow level, because Opus writes more output tokens.
  • Tool-call first-attempt accuracy improved from 81.3% to 93.7%, but the retry savings only partially offset the higher per-call cost. The math closes on agents with long downstream retry chains, not short binary-decision agents.
  • Swapping models mid-cache-window invalidates the cache segment – run the swap at workflow boundaries and use cache-control segmentation to limit the blast radius to the stable block only.
  • Per-agent canary rollout is the only safe pattern for stateful DO agents. Per-call splits create cross-model context mismatches that hallucinate tool calls.
  • Keep Sonnet 4.6 for binary-decision agents (editor, publisher, classifier). The quality gap is under 0.5% and the cost gap is real.
  • Add a per-model cost-cap breaker inside your API wrapper, separate from the global daily cap. It will save you the first time a retry loop hits an Opus agent.
Last Updated: May 10, 2026

Category: Stack

