Three Weeks with Claude Opus 4.7 in Production Agent Loops: What Actually Changed vs Sonnet 4.6

Real cost-per-completion numbers, retry rates, and tool-call accuracy from three weeks of Opus 4.7 vs Sonnet 4.6 inside a six-agent Cloudflare Durable Object stack with a hard $20/day cap and per-call USD logging.

Level: Advanced
Tools: claude · cloudflare-durable-objects · anthropic
Omid Saffari

Founder & CEO, AI Entrepreneur


Anthropic shipped Claude Opus 4.7 GA on April 16, 2026. I waited two weeks before touching production – long enough to watch the X/dev community document the new tokenizer surprises, the thinking.budget_tokens 400 errors, and at least three "where did my $300 go" posts from people who swapped models without adjusting their cost caps. On May 1, I started a controlled A/B inside the omidsaffari-admin stack: six Cloudflare Durable Object agents, one shared cost chokepoint, every API call logged to ai_call_log with agent_id, workflow_instance_id, tokens in/out, cached tokens, and USD cost. Three weeks of that data is what this article is built on. No synthetic benchmarks. No Anthropic press release numbers. ai_call_log rows.

The setup, in one read

The omidsaffari-admin stack runs six DO agents: intake, planner, researcher, writer, editor, and publisher. Each agent is a Cloudflare Durable Object with its own Alarm-driven wake cycle. They share nothing except a single D1 table (ai_call_log) and a KV namespace for the shared system-prompt cache segment.

Every Anthropic API call goes through one function – callClaude() – that wraps the SDK, computes USD cost from the usage block on the response, and writes a row to ai_call_log before returning. The row looks like this:

ai_call_log write
ts
await db.prepare(`
  INSERT INTO ai_call_log
    (id, agent_id, workflow_instance_id, model, input_tokens,
     output_tokens, cache_read_tokens, cache_write_tokens,
     usd_cost, tool_calls_made, tool_calls_succeeded, ts)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`).bind(
  ulid(),                                  // monotonic ID (the "ulid" npm package)
  ctx.agentId,
  ctx.workflowInstanceId,
  response.model,
  usage.input_tokens,
  usage.output_tokens,
  usage.cache_read_input_tokens ?? 0,
  usage.cache_creation_input_tokens ?? 0,  // SDK field name for cache-write tokens
  computeUsdCost(response.model, usage),
  toolCallsMade,
  toolCallsSucceeded,
  Date.now(),
).run();
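
computeUsdCost() isn't shown in the post. A minimal sketch of how a helper like it can look, assuming a per-model rate table keyed by model-ID prefix – every number in the table is a placeholder for your tier's actual pricing; the article only quotes the two cache figures ($0.30/MTok reads on both models, $3.75/MTok cache writes on Opus 4.7):

computeUsdCost sketch
ts
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
};

type MTokRates = { input: number; output: number; cacheRead: number; cacheWrite: number };

// Placeholder table – fill in your tier's per-MTok pricing, keyed by model-ID prefix.
const RATE_TABLE: Record<string, MTokRates> = {
  // e.g. "claude-opus-4-7": { input: ..., output: ..., cacheRead: 0.30, cacheWrite: 3.75 },
};

function computeUsdCost(model: string, usage: Usage): number {
  const prefix = Object.keys(RATE_TABLE).find((k) => model.startsWith(k));
  if (!prefix) throw new Error(`No rate entry for model ${model}`);
  const r = RATE_TABLE[prefix];
  const perTok = (ratePerMTok: number) => ratePerMTok / 1_000_000; // $/MTok -> $/token
  return (
    usage.input_tokens * perTok(r.input) +
    usage.output_tokens * perTok(r.output) +
    (usage.cache_read_input_tokens ?? 0) * perTok(r.cacheRead) +
    (usage.cache_creation_input_tokens ?? 0) * perTok(r.cacheWrite)
  );
}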

The daily cap enforcement is a separate Durable Object Alarm that sums usd_cost for the current UTC day and rejects new calls with a 429 when the sum exceeds $20. That cap is the entire reason I could run this experiment without anxiety. As I wrote in the production Claude Code orchestration piece, the cost shape at scale is the deeper issue – the cap pattern is what makes experimentation safe.
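
The cap DO isn't shown either. A minimal sketch of its gate half, assuming callClaude() fetches this Durable Object before every API call and aborts on a 429 – the real stack drives the check from an Alarm and presumably caches the running total, whereas this simplified version re-sums the table on every check:

daily cap gate (sketch)
ts
interface Env {
  DB: D1Database;
}

export class DailyCostCap {
  constructor(private state: DurableObjectState, private env: Env) {}

  // callClaude() is assumed to fetch() this DO before each API call
  // and to refuse the call when it gets a 429 back.
  async fetch(_req: Request): Promise<Response> {
    const today = new Date().toISOString().slice(0, 10); // current UTC day, YYYY-MM-DD
    const row = await this.env.DB.prepare(
      `SELECT COALESCE(SUM(usd_cost), 0) AS total
         FROM ai_call_log
        WHERE DATE(ts / 1000, 'unixepoch') = ?`
    ).bind(today).first<{ total: number }>();

    if ((row?.total ?? 0) >= 20) {
      return new Response("Daily $20 cap reached", { status: 429 });
    }
    return new Response("ok");
  }
}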

The A/B split was per-agent, not per-call. Intake and planner ran Opus 4.7 from day one. Researcher, writer, editor, and publisher stayed on Sonnet 4.6 for the first week, then flipped one per day over days 8–11. This gave me clean baselines on identical workloads before any model swap.

Cost per completion, the only number that matters

"Cost per completion" here means: USD spent from the moment a workflow_instance_id is created until the publisher agent writes its final row. One content piece, start to finish.

The Sonnet 4.6 baseline across 41 workflow completions in week one:

Average total per completion: $0.061. That included retries. Across those 41 completions, daily spend averaged $2.50 against the $20 cap.

Opus 4.7 on the same workload profile (weeks two and three, 38 completions after the full swap):

Average total per completion: $0.243. That's a 3.98× cost multiplier at the completion level, not the per-token level. The new tokenizer added roughly 8–10% to input token counts on my prose-heavy prompts (within Anthropic's stated 1.0–1.35× range). But the real driver is output: Opus 4.7 at high effort produces more tokens than Sonnet 4.6 at the same effort level, and the per-output-token price is significantly higher.

Where this doesn't hurt you: the cache read token rate is the same dollar amount on both models (at the time of writing, $0.30/MTok for Sonnet 4.6 cache reads vs $0.30/MTok for Opus 4.7 cache reads on my tier). My heavy cache hit rate on the researcher and writer agents means those agents' cost delta is partly absorbed – without cache, the gap would be wider.

Where it does hurt: any agent doing long output. Writer went from $0.0248 to $0.0894 per call – a 3.6× jump driven almost entirely by 1,400 extra output tokens that Opus produces before it's "done." Whether those tokens are worth $0.0646 depends entirely on whether they save you a retry.

Retry rate and tool-call quality

The omidsaffari-admin stack uses a 12-tool MCP-style surface for article mutations: create_draft, update_section, add_citation, set_metadata, flag_for_review, promote_to_published, and six more. Tool selection accuracy – did the agent call the right tool, with valid args, on the first attempt – is the number that feeds directly into retry rate.
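
For context, one entry on that surface looks like this in the Anthropic Messages API tool format – the add_citation name comes from the stack, but the fields are illustrative assumptions, not the real schema:

add_citation tool shape (illustrative)
ts
const addCitationTool = {
  name: "add_citation",
  description: "Attach a source citation to a section of the current draft.",
  input_schema: {
    type: "object" as const,
    properties: {
      section_id: { type: "string", description: "Section of the draft to annotate" },
      url: { type: "string", description: "Source URL" },
      quote: { type: "string", description: "Exact passage being cited" },
    },
    required: ["section_id", "url"],
  },
};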

Sonnet 4.6 baseline tool-call accuracy across the full 12-tool surface: 81.3% on first attempt. That's tool_calls_succeeded / tool_calls_made per ai_call_log row, filtered to rows where tool_calls_made > 0. The failure modes were: wrong tool selected (6%), malformed args (9%), valid call but logically incorrect given context (3.7%).
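
Written as a query, pooled across rows (reading the description as an average of per-row ratios instead would shift the number slightly); modelPrefix is whatever model-ID prefix you're filtering on:

first-attempt tool accuracy
ts
const accuracy = await db.prepare(`
  SELECT CAST(SUM(tool_calls_succeeded) AS REAL) / SUM(tool_calls_made)
         AS first_attempt_accuracy
  FROM ai_call_log
  WHERE tool_calls_made > 0
    AND model LIKE ?  -- model-ID prefix pattern for the model under test
`).bind(`${modelPrefix}%`).first<{ first_attempt_accuracy: number }>();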

Opus 4.7 on the same tool surface: 93.7% first-attempt accuracy. The "wrong tool selected" failure essentially disappeared (down to 0.8%). Malformed args dropped to 3.1%. The "logically incorrect" failures dropped to 2.4%.

The retry cost math: a retry on Sonnet 4.6 costs roughly $0.014 (one researcher-class call). A retry on Opus 4.7 costs roughly $0.052. Opus 4.7 needs to prevent 3.7 retries per completion to break even on tool-call accuracy alone. My data shows it prevents an average of 2.1 retries per completion. That does not break even.

Where it does pay: the eight-turn intake loop. The intake agent runs a structured interview to fill a brief schema – 8 turns, each with a tool call to update_brief_field. Sonnet 4.6 required an average of 11.3 turns to complete an 8-turn loop (failed fields required re-asking). Opus 4.7 completes the same loop in 8.4 turns on average. At intake's per-call cost, that's $0.0028 in savings per completion – tiny. But the intake agent was also the one generating the most malformed add_citation calls downstream (bad field population cascaded through the pipeline), and fixing that upstream saved an average of 1.4 writer-agent retries per completion, which is $0.125 saved at Opus prices.

The editor agent is where Sonnet 4.6 still wins. The editor runs a single-pass structured review and calls flag_for_review or promote_to_published. It's a binary decision tool. Sonnet 4.6 accuracy: 96.1%. Opus 4.7: 96.4%. Indistinguishable. The editor is a short-context, low-tool-count task and Sonnet 4.6 is entirely adequate. I flipped it back to Sonnet 4.6 on day 19 and saved $0.0278 per completion without any quality change.

Prompt cache and the 1-hour TTL trap

Anthropic's prompt cache has a 1-hour TTL. When you swap models mid-cache-window, the cache segment is invalidated – a cold write on the next call, at the write token rate ($3.75/MTok for Opus 4.7). On a researcher agent with 8,900 cached tokens, that's a $0.033 cache miss penalty on the first call after a model swap. Not catastrophic. But it bit me on day 8 when I flipped researcher from Sonnet to Opus at 2pm, right after a 9am cache write. Six simultaneous workflow instances all had cold researcher calls within the first hour.

The fix is cache-control segmentation. I split the system prompt into two blocks: a stable block (the tool definitions, the output schema, the persona – things that never change) and a volatile block (the daily briefing, the active article context – things that change per-workflow). Only the stable block gets cache_control: ephemeral. The volatile block is always fresh.

prompt-cache segmentation
ts
const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: STABLE_SYSTEM_BLOCK, // tool defs, schema, persona
        cache_control: { type: "ephemeral" },
      },
      {
        type: "text",
        text: buildVolatileBlock(ctx), // per-workflow context
        // no cache_control — always billed as input tokens
      },
    ],
  },
];

With this segmentation, a model swap invalidates only the stable block (one cold write per agent per hour). The volatile block was never cached anyway. Cache hit rate on the stable block before segmentation: 73% across all agents. After segmentation: 89%. The hit-rate improvement came from reducing the total cache surface to only the content that actually repeats.
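
The post doesn't give the hit-rate formula; one reasonable token-level reading, from the same log table, is cached reads over reads plus writes:

cache hit rate (one possible definition)
ts
const sinceTs = Date.now() - 7 * 24 * 60 * 60 * 1000; // e.g. trailing 7 days

const hitRate = await db.prepare(`
  SELECT CAST(SUM(cache_read_tokens) AS REAL) /
         NULLIF(SUM(cache_read_tokens) + SUM(cache_write_tokens), 0) AS cache_hit_rate
  FROM ai_call_log
  WHERE ts >= ?
`).bind(sinceTs).first<{ cache_hit_rate: number }>();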

The other Opus-specific cache behavior I measured: Opus 4.7 writes slightly larger stable blocks than Sonnet 4.6 because the tool definitions serialize to more tokens with the new tokenizer (roughly 12% more on my 12-tool surface). That means a slightly higher cache-write cost on the first call per hour. Over a day with normal Alarm wake cycles, the difference is around $0.08/day across all six agents – noise.

The mid-flow model swap

If you're swapping models while a workflow is mid-execution, the workflow_instance_id's context is already partially built in the old model's token budget. I had one intake agent hand off a planner context built under Opus assumptions to a Sonnet planner during the rollout week. The planner hallucinated two tool calls because the intake had written a longer brief than Sonnet's planner expected. Always swap at workflow boundaries, not mid-execution.
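
The post doesn't show how it enforces that rule. One way to make mid-flow swaps impossible is to snapshot the per-agent model choices at the moment the workflow instance is created, so flipping the defaults only affects new instances – the AGENT_MODEL_DEFAULTS map, its placeholder values, and the workflow_instance table are assumptions for illustration:

pin models at workflow creation (sketch)
ts
// Placeholder identifiers, not real model IDs.
const AGENT_MODEL_DEFAULTS: Record<string, string> = {
  intake: "opus",
  planner: "opus",
  researcher: "opus",
  writer: "opus",
  editor: "sonnet",
  publisher: "sonnet",
};

async function createWorkflowInstance(db: D1Database): Promise<string> {
  const id = ulid(); // same ID helper as the logging snippet
  // Snapshot the map: a later flip of AGENT_MODEL_DEFAULTS only affects
  // workflows created after the flip, never one that is mid-execution.
  await db.prepare(
    `INSERT INTO workflow_instance (id, model_map, created_at) VALUES (?, ?, ?)`
  ).bind(id, JSON.stringify(AGENT_MODEL_DEFAULTS), Date.now()).run();
  return id;
}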

The rollout pattern I would use again

Per-agent canary is the only safe pattern for a stack like this. You cannot canary per-call on a stateful DO – the agent's Alarm context carries model-dependent assumptions forward, and a 50/50 split within an agent's execution lifecycle creates the hallucination scenario above.

The pattern:

  1. Pick your lowest-stakes agent first

    Intake was my canary. It's the cheapest agent, has the shortest context, and its failures are visible immediately (a bad brief gets rejected in the planner's first call). One week of intake on Opus 4.7 gave me real cost numbers before I touched researcher or writer.

  2. Run a parallel cost projection before each flip

    Before flipping each agent, I queried ai_call_log for the last 7 days of that agent's costs, multiplied by the Opus/Sonnet price ratio observed so far (3.8× on output tokens, 1.1× on input tokens, same on cache reads), and checked the projected daily total against the $20 cap (a simplified sketch of this projection follows the list). Writer was the one that nearly busted the cap – I had to tighten the default_pool_size on the writer's Alarm to reduce simultaneous instances from 4 to 2.

  3. Add a per-model cost-cap breaker to callClaude()

    I added a model-specific daily limit inside callClaude() – separate from the global cap – that hard-blocks Opus calls above $12/day (leaving $8/day headroom for Sonnet agents still in the stack). This saved me on day 14 when a researcher loop got stuck on a malformed add_citation call and retried 11 times before the breaker fired.

    per-model cap check
    ts
    async function callClaude(params: ClaudeCallParams) {
      const today = utcDateString();
      const modelSpend = await db.prepare(
        `SELECT COALESCE(SUM(usd_cost), 0) AS total
           FROM ai_call_log
          WHERE model LIKE ? AND DATE(ts / 1000, 'unixepoch') = ?`
      ).bind(`${params.model}%`, today).first<{ total: number }>();

      if ((modelSpend?.total ?? 0) >= MODEL_DAILY_CAPS[params.model]) {
        throw new CostCapError(`Daily cap reached for ${params.model}`);
      }
      // ... rest of call
    }
  4. Keep Sonnet 4.6 in the stack on purpose

    Editor and publisher are staying on Sonnet 4.6. They're binary-decision agents with short context and high cache hit rates. Sonnet 4.6 accuracy on those tasks is within 0.3% of Opus 4.7. The cost difference is $0.031/completion. Over 30 completions a day, that's $0.93/day – more than enough to run one extra Opus researcher call. Don't upgrade everything.
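
For reference, the projection from step 2, simplified – the step scales per token type (3.8× output, 1.1× input, 1× cache reads), while this sketch applies one blended multiplier to the agent's trailing 7-day spend, so pick the multiplier from your own observed ratios and the agent's input/output mix:

cost projection before a flip (sketch)
ts
async function projectDailySpendAfterFlip(
  db: D1Database,
  agentId: string,
  blendedMultiplier: number // derived from observed ratios for this agent's workload
): Promise<number> {
  const row = await db.prepare(`
    SELECT COALESCE(SUM(usd_cost), 0) / 7.0 AS daily
    FROM ai_call_log
    WHERE agent_id = ? AND ts >= ?
  `).bind(agentId, Date.now() - 7 * 24 * 60 * 60 * 1000)
    .first<{ daily: number }>();

  return (row?.daily ?? 0) * blendedMultiplier;
}

// Flip the agent only if the projection leaves headroom under the $20/day cap.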

The final steady-state cost per completion: $0.187. That's Opus 4.7 on intake, planner, researcher, and writer; Sonnet 4.6 on editor and publisher. Down from the all-Opus $0.243 and up from the all-Sonnet $0.061. The 3.1× cost increase over the Sonnet baseline is real and it's permanent at current pricing.

Whether that 3.1× is worth it depends on what you're optimizing. My pipeline produces publishable drafts with 23% fewer editor interventions after the swap. If an editor intervention costs 15 minutes of human time, the math works at almost any billing rate. If your agent loop is doing something more mechanical – extraction, classification, structured transforms – Sonnet 4.6 is almost certainly the right model for most of the pipeline.

Key Takeaways

  • Opus 4.7 costs 3.98× as much per completion as Sonnet 4.6 on a prose-heavy, tool-calling agent loop – roughly 3× at the token level but ~4× at the workflow level, because Opus writes more output tokens.
  • Tool-call first-attempt accuracy improved from 81.3% to 93.7%, but the retry savings only partially offset the higher per-call cost. The math closes on agents with long downstream retry chains, not short binary-decision agents.
  • Swapping models mid-cache-window invalidates the cache segment – run the swap at workflow boundaries and use cache-control segmentation to limit the blast radius to the stable block only.
  • Per-agent canary rollout is the only safe pattern for stateful DO agents. Per-call splits create cross-model context mismatches that hallucinate tool calls.
  • Keep Sonnet 4.6 for binary-decision agents (editor, publisher, classifier). The quality gap is under 0.5% and the cost gap is real.
  • Add a per-model cost-cap breaker inside your API wrapper, separate from the global daily cap. It will save you the first time a retry loop hits an Opus agent.
Last Updated: May 10, 2026

Category: Stack

