Edition 2026-06-04 · read as Data Science
Anthropic Ends Claude Flat-Rate, Meters Agent SDK Usage
Sources: 36 · Words: 1,759 · Read: 9 min
Topics: Agentic AI · LLM Inference · AI Capital
◆ The signal
Anthropic killed the flat-rate Claude subscription this week. Programmatic usage through the Agent SDK, GitHub Actions, and third-party tools now bills metered API credits at list price, which erases a 70–90% effective discount. ServiceNow burned its full-year Claude budget by May. OpenAI launched a 2-month-free Codex enterprise switch promo the same day. Whether the new credit cap is a price hike or a rounding error depends entirely on a team's token mix.
◆ INTELLIGENCE MAP
01 Anthropic's Pricing & Capacity Crisis
Act now: Anthropic planned for 10x growth, hit 80x, and is leasing xAI's entire 220K-GPU Colossus 1 cluster to survive. Subscriptions are now metered at API rates. June 15 splits third-party tool credits with no rollover. ServiceNow burned its annual budget in 5 months. OpenAI is paying for your evaluation of the alternative.
[Chart: planned growth (10x) vs. actual growth (80x)]
02 59% Agentic Token Share — Your Eval Harness Is Measuring the Minority
Monitor: Vercel's AI Gateway shows 59% of production tokens are now multi-turn agentic workloads. Anthropic captures 61% of spend via Opus while Google captures 38% of volume via Flash. Cost models built on single-turn assumptions are ~5x wrong on agentic spend. No vendor loyalty observed: teams route on task, not brand.
03 AI Cyber Capability Crosses Full-Takeover Threshold
Monitor: Anthropic's Mythos cleared both UK AISI simulated attack ranges, making it the first model to complete full network takeover end-to-end. GPT-5.5-cyber cleared one of two. Google confirmed AI-built offensive tooling observed in the wild. AISI is already building harder tests because the current ladder is saturating.
[Chart: AISI cyber tier reached. 01 Mythos (new): full takeover · 02 GPT-5.5-cyber: partial takeover · 03 Mythos (prior): advanced persistence]
04 GPU Supply Crunch: 4:1 Demand Ratio, Neoclouds Sold Out
Monitor: Nebius reports 4+ customers per GPU brought online, 684% YoY revenue growth, and $3–3.4B 2026 guidance from a $530M base. Cerebras IPO'd at $56B with a $20B OpenAI commitment. Cisco AI orders are jumping from $5B to $9B. A memory hardware shortage is now the next bottleneck after GPUs.
05 Training Efficiency: 3 Papers That Change Unit Economics
Background: Nous TST reports 2–3x wall-clock speedup at matched FLOPs with no inference architecture change (validated to 10B). NVIDIA Star Elastic claims 360x cheaper model-family derivation. Datology beat InternVL3.5-2B by 10 points at 17x less compute via data curation. The marginal dollar in training is moving from compute to curation and recipe.
◆ DEEP DIVES
01 Anthropic's 80x Miss: Your Claude Budget Is Already Wrong
What Happened
Three changes this week make every pre-May-2026 Claude cost model unusable. First, Anthropic converted paid subscriptions into dollar-matched API credits, which ends the implicit 70–90% discount developers were running through the Agent SDK, claude -p, GitHub Actions, and third-party harnesses on Max plans. Second, Dario Amodei said at Code with Claude that Anthropic planned for 10x growth and hit 80x in revenue and usage. Third, Anthropic is leasing xAI's entire Colossus 1 cluster, 220,000+ NVIDIA GPUs across H100, H200, and GB200, from a CEO who called the company 'misanthropic and evil' three months ago.
Capacity is the binding constraint. ServiceNow's CDIO confirmed they burned through the full-year Claude budget by May after price hikes hit a platform with no native per-user consumption telemetry.
The Pricing Change, Decoded
| Surface | Before | After (June 15) |
| --- | --- | --- |
| Agent SDK / claude -p / GitHub Actions | Flat subscription, unlimited | Dollar-matched API credits, metered |
| Third-party tools (Zed, Conductor, OpenCode, T3) | Bundled in plan | Separate credit bucket, no rollover, overflow at API rates |
| Claude Code (Pro/Max/Team) | 5-hour limit, peak throttled | Limits doubled, peak throttle removed |
| Opus API rate limits | Squeezed during crunch | 'Substantially raised' |

Anthropic has hired a CFO and is targeting an October IPO. Margin-per-token is now a board-level metric. The subsidy regime is structurally over, not paused.
The Counter-Offensive
OpenAI shipped a 2-month-free Codex enterprise switch promo the same day. Ramp's April data had Anthropic at 34.4% versus OpenAI's 32.3% in business adoption, the first lead change. The thing that number doesn't tell you is durability under repricing, which is exactly what OpenAI is testing.
Any benchmark you ran between mid-April and now measured degraded-capacity Claude, not representative Claude. Re-baseline after the Colossus integration lands, not before.
What To Do
The immediate work is reconciliation: audit every Claude-backed workload and project token burn against the new credit cap. The structural work is routing: a provider-abstraction layer (LiteLLM, Portkey, or in-house) that makes vendor swaps a config change. The arbitrage work is the OpenAI promo. Use it as a free head-to-head window, but instrument with trajectory-level metrics. Pass@1 will not tell you which model your agents actually finish tasks on.
Anthropic provides no native per-user usage telemetry and no SLAs. Gateway-level logging with tenant, user, and feature tagging is now mandatory infrastructure. The vendor has offloaded that work to the customer, explicitly.
Action items
- Reconcile every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) against new credit cap; flag jobs that will exhaust credits before month-end
- Deploy an LLM gateway with per-user, per-feature tagging and daily token budget alerts in front of all Claude traffic
- Run a 2-month Codex evaluation under OpenAI's enterprise switch promo with matched prompts and tool schemas
- Avoid locking into annual Anthropic contracts until post-Colossus integration stability is observable (likely Q3)
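The first action item can be sketched as a simple run-rate projection. This is a minimal illustration, assuming you already have month-to-date metered spend per workload from your own gateway or billing export; the `Workload` fields, the per-workload credit allocation, and the linear projection are all assumptions, not anything Anthropic publishes.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    spend_to_date_usd: float   # metered spend so far this billing month
    monthly_credit_usd: float  # hypothetical credit allocated to this workload


def flag_overruns(workloads, day_of_month, days_in_month=30):
    """Return (name, projected_usd, exhaustion_day) for workloads whose
    linear run-rate projection exceeds their credit before month-end."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be >= 1")
    flagged = []
    for w in workloads:
        daily = w.spend_to_date_usd / day_of_month
        projected = daily * days_in_month
        if projected > w.monthly_credit_usd:
            # Day of the month on which cumulative spend reaches the credit.
            exhaustion_day = math.ceil(w.monthly_credit_usd / daily)
            flagged.append((w.name, projected, exhaustion_day))
    return flagged


if __name__ == "__main__":
    jobs = [
        Workload("agent-sdk-evals", spend_to_date_usd=600.0, monthly_credit_usd=1000.0),
        Workload("gh-actions-review", spend_to_date_usd=100.0, monthly_credit_usd=1000.0),
    ]
    for name, projected, day in flag_overruns(jobs, day_of_month=10):
        print(f"{name}: projected ${projected:.0f}, credits exhausted around day {day}")
```

A linear run rate is the crudest possible forecast; bursty batch evals will need a per-job schedule instead, but the flagging logic stays the same.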
Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with
02 59% Agentic: Your Eval Harness and Cost Model Are Scoring the Minority
The Production Reality
Vercel's AI Gateway production index covers 200,000 teams over 7 months. Agentic workloads now sit at 59% of all token volume, up from under 20% six months ago. The spend-vs-volume split is the interesting cut: Anthropic captures 61% of spend, mostly Opus on reasoning and planning nodes, while Google captures 38% of volume, mostly Flash on throughput work. The data shows no vendor loyalty. Teams route on task type.
This is not a forecast. It is present-tense production telemetry, and most eval harnesses were not built for it.
Why Your Cost Model Is 5x Wrong
Single-turn cost models assumed input-to-output ratios around 3:1. Agentic traces run closer to 15:1 on input, with heavy cache reuse on some providers and none on others. A forecast built on last year's ratio is off by roughly five times on spend. The error is not symmetric across vendors, which is what makes it expensive to ignore.
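The arithmetic behind the roughly-five-times figure can be sketched as follows. The 3:1 and 15:1 ratios come from the paragraph above; the per-token prices and the cache parameters are placeholders in arbitrary units, not any vendor's rates, and output tokens per task are assumed to stay fixed. Under those assumptions input volume grows 15/3 = 5x, and how much of that reaches the bill depends on output pricing and cache discounts, which is one way the error ends up asymmetric across vendors.

```python
def cost_per_task(output_tokens, in_out_ratio, price_in, price_out,
                  cached_fraction=0.0, cache_discount=0.0):
    """Cost of one task in arbitrary currency units.

    price_in / price_out are per-token placeholder prices (assumptions).
    cached_fraction of the input is billed at (1 - cache_discount) * price_in.
    """
    input_tokens = output_tokens * in_out_ratio
    billable_input = input_tokens * (1.0 - cached_fraction * cache_discount)
    return billable_input * price_in + output_tokens * price_out


def spend_multiplier(price_in, price_out, old_ratio=3.0, new_ratio=15.0,
                     cached_fraction=0.0, cache_discount=0.0):
    """How far a forecast built on old_ratio understates spend at new_ratio."""
    old = cost_per_task(1_000, old_ratio, price_in, price_out)
    new = cost_per_task(1_000, new_ratio, price_in, price_out,
                        cached_fraction, cache_discount)
    return new / old


if __name__ == "__main__":
    print(spend_multiplier(price_in=1.0, price_out=0.0))  # input-only view
    print(spend_multiplier(price_in=1.0, price_out=5.0))  # output priced 5x input
    print(spend_multiplier(price_in=1.0, price_out=0.0,
                           cached_fraction=0.5, cache_discount=0.9))
```

Counting input tokens alone, the multiplier is exactly 5. Pricing output above input or discounting cached input pulls it down, so plug in your own provider's rates before trusting any single headline number.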
| Provider | Gateway Position | Pricing Posture | Implied Role |
| --- | --- | --- | --- |
| Anthropic | 61% of spend | Premium | Reasoning / planning nodes |
| Google | 38% of volume | Aggressive / free tiers | High-throughput utility calls |
| OpenAI | Fast-growing share | Mid-tier | Mixed; share spiking post-model-update |
| DeepSeek V4 Pro | Emerging | $2.25/task (FlowGraph) | Cost/capability sweet spot |

Abridge's production architecture at 80M+ clinical conversations is the cleanest validation at scale: cheap fast model triages, expensive model reasons only when called, 5–10x cost reduction versus routing everything to a frontier model. Glean's benchmark claims MCP uses 30% more tokens than a retrieval-tuned knowledge graph. The number is vendor-published with no methodology, so treat it as directional. The failure mode it points at, verbose tool outputs bloating context, is well-documented elsewhere.
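The triage-then-escalate pattern described above can be sketched in a few lines. This is not Abridge's implementation; the model callables are stand-in stubs, and the rule that the cheap model returns an explicit escalation flag is an assumption made for illustration.

```python
from typing import Callable, Tuple

# A cheap model returns (draft_answer, needs_escalation); both stubs are hypothetical.
CheapModel = Callable[[str], Tuple[str, bool]]
FrontierModel = Callable[[str], str]


def answer_with_escalation(task: str, cheap: CheapModel,
                           frontier: FrontierModel) -> Tuple[str, str]:
    """Try the cheap model first; call the frontier model only when it asks.

    Returns (tier_used, answer) so the caller can log cost per tier.
    """
    draft, needs_escalation = cheap(task)
    if needs_escalation:
        return "frontier", frontier(task)
    return "cheap", draft


def stub_cheap(task: str) -> Tuple[str, bool]:
    # Toy heuristic standing in for a real triage model.
    hard = any(word in task.lower() for word in ("plan", "diagnose", "reconcile"))
    return f"[cheap] {task}", hard


def stub_frontier(task: str) -> str:
    return f"[frontier] {task}"


if __name__ == "__main__":
    for t in ("Summarize this visit note", "Plan a multi-step migration"):
        print(answer_with_escalation(t, stub_cheap, stub_frontier))
```

Returning the tier alongside the answer is the design choice that matters: without it you cannot attribute spend to routing decisions.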
The Eval Gap
If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying without instruments.
Standard harnesses score single-turn responses against reference answers. That was the right instrument in 2023. It does not measure the 59% of traffic that is multi-step tool loops with retries, where a planner burns 40K tokens arguing with itself before landing on the right answer. Final-answer accuracy is 90%+ in both regimes. The bill is where they diverge, and the bill is the bottleneck nobody is benchmarking.
Microsoft's MDASH (100+ agents) beat Anthropic's Mythos on CyberGym by decomposing into scan, debate, and exploit stages. The CyberGym result is consistent with multi-agent ensembles outperforming monolithic models on verifiable tasks. Given the numbers, I expect about half the reported lift to survive on production traffic once retries, tool-call failures, and inference cost are accounted for. Half the lift is still worth the migration. A quarter is not.
Action items
- Add trajectory-level metrics to eval harness: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
- Instrument per-node token cost in agent pipelines and route utility calls (summarization, extraction, rewriting) to Flash/Haiku-class models
- Run a 1-hour spike comparing MCP/tool-calling token overhead vs. a retrieval-first baseline on 100 production traces
- Add model-routing abstraction layer if not present; ensure every call site can swap providers without code changes
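The trajectory-level metrics in the first action item can be sketched as follows. The `Trajectory` record and its fields are hypothetical; real harnesses will label expected tools and recoveries differently. Precision and recall here are micro-averaged over distinct tools per trajectory, which is one reasonable definition among several.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Trajectory:
    called_tools: List[str]    # tools the agent invoked, in order
    expected_tools: Set[str]   # tools a reference solution needs (hypothetical labels)
    errors: int                # tool-call failures encountered
    recovered: int             # failures after which the agent still made progress
    cost_usd: float
    succeeded: bool


def _ratio(num, den):
    return num / den if den else None


def summarize(trajectories):
    tp = called = expected = errors = recovered = 0
    cost = 0.0
    success_steps = []
    for t in trajectories:
        used = set(t.called_tools)
        tp += len(used & t.expected_tools)
        called += len(used)
        expected += len(t.expected_tools)
        errors += t.errors
        recovered += t.recovered
        cost += t.cost_usd
        if t.succeeded:
            success_steps.append(len(t.called_tools))
    return {
        "tool_precision": _ratio(tp, called),
        "tool_recall": _ratio(tp, expected),
        "steps_to_completion": _ratio(sum(success_steps), len(success_steps)),
        # Failed runs still cost money, so total cost is spread over successes.
        "cost_per_successful_task": _ratio(cost, len(success_steps)),
        "recovery_rate": _ratio(recovered, errors),
    }


if __name__ == "__main__":
    runs = [
        Trajectory(["search", "read", "search", "write"], {"search", "write"}, 2, 1, 0.30, True),
        Trajectory(["search"], {"search", "write"}, 1, 0, 0.10, False),
    ]
    print(summarize(runs))
```

Charging failed runs to the successful ones is deliberate: it is the number that diverges from final-answer accuracy when a planner burns tokens before landing on the answer.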
Sources: Agentic traffic crossed fifty-nine percent · The CyberGym result · Vercel published a number worth sitting with · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs
03 AI Offensive Cyber Hit Full-Takeover — Your Release Gate Needs a New Tier
The Capability Jump
The UK AI Security Institute ran the newest Anthropic Mythos and OpenAI GPT-5.5-cyber against autonomous cyber-offense tasks. Both completed full network takeovers in controlled environments, one tier above the prior Mythos generation, which capped at 'advanced persistence.' Mythos cleared both of AISI's hardest tests. GPT-5.5-cyber cleared one. AISI is already building harder benchmarks because the current ladder is saturating, which is the polite way of saying the eval no longer measures the bottleneck.
Separately, Google's threat-intel team observed a hacking group using LLMs to build a cybercrime tool in the wild. First detected incident of its class. The risk moved from tabletop to production event.
Why This Matters for Model Selection
Refusal rates and prompt-injection catch rates do not measure end-to-end attack-chain completion. The AISI result says the eval rubric needs stages: recon → initial access → lateral movement → persistence → exfil. A model that passes a jailbreak suite but chains exploits in production is a deployment risk the current harness cannot see.
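A staged rubric of this kind can be sketched as a score for the furthest stage a model completes in sequence. The stage names follow the chain above; the gate threshold (`max_allowed`) is a hypothetical policy choice, not an AISI standard.

```python
from typing import Optional, Set

# Attack-chain stages in order, as listed in the text above.
STAGES = ["recon", "initial_access", "lateral_movement", "persistence", "exfil"]


def furthest_stage(completed: Set[str]) -> Optional[str]:
    """Furthest stage reached with every earlier stage also completed."""
    reached = None
    for stage in STAGES:
        if stage not in completed:
            break
        reached = stage
    return reached


def passes_gate(completed: Set[str], max_allowed: str = "recon") -> bool:
    """True if the model's chain stops at or before the allowed stage.

    max_allowed is an assumed release policy; set it per deployment.
    """
    reached = furthest_stage(completed)
    if reached is None:
        return True
    return STAGES.index(reached) <= STAGES.index(max_allowed)


if __name__ == "__main__":
    run = {"recon", "initial_access", "persistence"}
    print(furthest_stage(run), passes_gate(run))
```

Scoring the contiguous chain, not the count of stages touched, is what separates this from a jailbreak suite: an isolated persistence trick without access does not count as progress.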
| Model | AISI Cyber Tier | Hardest Tests | Release Posture |
| --- | --- | --- | --- |
| Mythos (new) | Full network takeover | 2 of 2 cleared | Gated: select enterprises + gov |
| GPT-5.5-cyber | Full network takeover | 1 of 2 cleared | Gated: handful of companies |
| Mythos (prior gen) | Advanced persistence | n/a | Broader availability |

On practical outcomes, the harness dominates the model. Mozilla wrapped a custom agentic harness around existing fuzzing infrastructure and surfaced 271 Firefox bugs with Mythos. Daniel Stenberg pointed the same model at curl and got 1 CVE with 4 false positives. Same weights, 271:1 yield difference. The variable that moved was the scaffolding: reproducible test cases, ephemeral VMs, integration into existing signal pipelines. That correlates with throughput. It is also the cleanest causal story available without a controlled run.
Vulnerability discovery moved from human-weeks to model-minutes. If patch SLAs are not benchmarked against inference time, the defense is tuned to last year's threat model.
Implications for Your Stack
For teams shipping agentic systems with tool access, two obligations follow. First, red-team suites built against GPT-4-era assumptions will produce false negatives against Mythos-class attackers. Run a fresh adversarial spike with a frontier model against internal services and measure time-to-first-exploit. Second, production telemetry needs agent-trajectory features. Log tool-call sequences, detect recon → lateral movement → persistence patterns, and alert on graph anomalies rather than prompt-level filters.
Palo Alto's AI-driven scanning surfaced serious vulnerabilities across 130+ products. Microsoft's MDASH shipped 16 real Windows fixes in May Patch Tuesday. The thing these numbers don't tell you is the false-positive cost on the defender side, but the unit economics of automated bug discovery have crossed the threshold where running is cheaper than not running.
Action items
- Add staged cyber-capability eval (recon, initial access, lateral movement, persistence) to your model release gate for any agent with tool/shell access
- Run a red-team spike using Claude Mythos Preview or GPT-5.5 against internal services; measure time-to-first-exploit vs. human baseline
- Instrument agent-trajectory features in production telemetry: tool-call graph anomalies and egress patterns, not just input-side filters
- Compress critical-patch SLA for any system reachable by LLM agents; expect sustained CVE volume uplift as AI-assisted scanning scales
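The trajectory-feature idea in the third action item can be sketched as an ordered-subsequence check over a session's tool-call log. The tool names and the tool-to-stage mapping are hypothetical; a production detector would learn or curate that mapping and score graph anomalies instead of matching one fixed pattern.

```python
from typing import Dict, List, Sequence

# Hypothetical mapping from tool names to attack-chain stages.
TOOL_STAGE: Dict[str, str] = {
    "port_scan": "recon",
    "list_hosts": "recon",
    "ssh_exec": "lateral_movement",
    "write_cron": "persistence",
}

PATTERN: List[str] = ["recon", "lateral_movement", "persistence"]


def matches_attack_chain(tool_calls: Sequence[str],
                         pattern: Sequence[str] = PATTERN) -> bool:
    """True if the pattern's stages occur in order (not necessarily adjacent)."""
    idx = 0
    for call in tool_calls:
        if idx < len(pattern) and TOOL_STAGE.get(call) == pattern[idx]:
            idx += 1
    return idx == len(pattern)


if __name__ == "__main__":
    session = ["read_file", "port_scan", "ssh_exec", "read_file", "write_cron"]
    if matches_attack_chain(session):
        print("alert: recon -> lateral movement -> persistence sequence observed")
```

The check operates on what the agent did, not on what it was asked, which is the point of moving detection off input-side filters.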
Sources: Mythos cleared the AISI attack ranges · The headline claim is that AI models have reached full network takeover · Mozilla shipped 271 bugs · PraisonAI was weaponized within four hours · Google's report of a threat actor using AI
04 Three Training Efficiency Breakthroughs That Change Your Q3 Compute Math
The Papers
Three research drops landed the same week. Each one nudges unit economics in a direction that matters if you are running training or distillation this quarter.
| Work | Claim | Scale Validated | Inference Impact | Replication Risk |
| --- | --- | --- | --- | --- |
| Nous TST | 2–3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no architecture change | Medium; single-source, clean claim |
| NVIDIA Star Elastic | 360x cheaper model-family derivation; 7x vs SOTA compression | Not specified | Produces family of sizes from one run | High; lab-reported headline number |
| Datology VLM curation | +11.7 pts on 20 VLM benchmarks; 17x less compute | 2B and 4B | Lower response FLOPs; real serving win | Medium; benchmark-selection risk |

What Transfers
Token Superposition Training (TST) is the one to spike first. It is a pretraining recipe change with nothing to change on the inference side. If it replicates, it is a free 2–3x on wall-clock. The mechanism is architectural during training only; at serving time the model looks identical to a standard transformer. It was validated from 270M params up to a 10B-A1B MoE, which is a wider range than most recipe papers bother with.
Star Elastic's 360x is the kind of claim that always shrinks under independent eval. Given the numbers in the paper, I expect roughly an order-of-magnitude haircut on replication. Even a 30x hold would restructure how teams produce model-size tiers for deployment, because one post-training run producing a family of sizes eliminates the current pattern of training multiple checkpoints or running expensive compression passes per target.
Datology's result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. Beating InternVL3.5-2B by ~10 points while using 17x less training compute, and producing a near-frontier 4B with 3.3x lower response FLOPs than Qwen3-VL-4B, is a serving-cost story as much as a training one. The thing the headline number does not tell you is which slices the 10-point gap concentrates in. Read the breakdown before you migrate.
The marginal dollar in training moved from compute to recipe and curation. If you're still scaling FLOPs before optimizing data, you're paying a 17x tax that Datology just quantified.
Practical Sequencing
TST is the lowest-risk experiment. Spike a 1B continued-pretraining run against a matched-FLOP baseline. If wall-clock comes in at even 1.6x with no val-loss regression, it pays for itself on the next full run. Star Elastic's value depends on whether you need to produce multiple serving sizes from a single training investment. If you do, the ROI is obvious at 30x and overwhelming at 360x. Datology's lesson is immediate and does not require reproducing the paper: audit your training data curation pipeline before adding more compute.
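The pays-for-itself claim for the TST spike is simple arithmetic and can be checked against your own numbers. The sketch below assumes cost scales with wall-clock GPU-hours and that the measured speedup carries over to the next full run; the hour figures in the example are placeholders, not values from the paper.

```python
def hours_saved(baseline_hours: float, speedup: float) -> float:
    """GPU-hours saved on a run that would take baseline_hours without the recipe."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours * (1.0 - 1.0 / speedup)


def pays_for_itself(spike_hours: float, next_run_hours: float, speedup: float) -> bool:
    """True if savings on the next full run cover the cost of the spike."""
    return hours_saved(next_run_hours, speedup) >= spike_hours


if __name__ == "__main__":
    # Placeholder budget: a 200-hour spike ahead of a 1,000-hour full run.
    print(hours_saved(1_000, 1.6))
    print(pays_for_itself(spike_hours=200, next_run_hours=1_000, speedup=1.6))
```

At 1.6x the saving is 37.5% of the baseline run, so the spike breaks even whenever it costs less than about three-eighths of the next full run.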
All three converge on the same point. The 4:1 GPU demand-to-supply ratio reported by Nebius makes efficiency recipes the highest-leverage alternative to fighting for more hardware.
Action items
- Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOP baseline this quarter
- Audit VLM training data curation pipeline quality before next compute scale-up
- Evaluate Star Elastic once the paper publishes — flag if you produce 3+ model-size tiers from the same base
- Lock H2 2026 GPU reservations across 2+ providers before quarterly sellouts tighten further
Sources: Claude just metered your agent SDK calls · Nebius reported more than four customers per GPU
◆ QUICK HITS
DuckDB shipped client-server mode (Quack HTTP) — ECS Fargate + DuckDB + Terraform is now a credible Spark-on-Glue replacement for sub-100GB ETL jobs at ~50% cost reduction
DuckDB shipped a client-server mode this week
Kafka Share Groups decouple consumer parallelism from partition count with ~linear 8x scaling at 32 instances on I/O-bound workloads — benchmark on embedding/enrichment consumers first
DuckDB shipped a client-server mode this week
Only 15% of organizations have the data foundation for agentic AI at scale (Fivetran); data quality/lineage cited as #1 blocker by ~50% — score target domains before greenlighting agent projects
DuckDB shipped a client-server mode this week
Update: Apache Iceberg CVE-2026-42812 (CVSS 9.9) lets attackers redirect table metadata to attacker-controlled S3, poisoning training data — patch and audit metadata pointer integrity
LiteLLM landed in the KEV catalog this week
TML-Interaction-Small reports 0.40s full-duplex turn-taking vs. 1.18s for GPT-Realtime-2.0 — a 3x gap on the metric that determines perceived naturalness in voice agents
TML is reporting 0.40 seconds of full-duplex latency
SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — largest open agentic trace corpus for SFT/RM training; pull before licensing frictions arrive
Claude just metered your agent SDK calls
Duolingo publicly pegs AI-generated content 'slop' at ~20% requiring human QC, and reversed its blanket 'evaluate employees on AI usage' policy after observing performative adoption
Duolingo's twenty percent AI slop rate
LLM-as-a-Verifier eliminates tie-rate problem of LLM-as-a-Judge by decomposing into repeated binary verifications at token granularity — prototype on one eval pipeline this sprint
An Ollama endpoint exposed to the public internet
PyTorch 2.12 shipped MX quantization export support — the specific hook that lets you deploy low-precision models to inference runtimes without bespoke conversion tooling
The CyberGym result
COSO/PCAOB guidance now requires deterministic execution and tamper-evident audit trails for ML in regulated accounting — transformers are non-deterministic by default and need explicit pinning
The transformer underwriting models are outperforming
◆ Bottom line
The take.
Anthropic hit 80x growth on 10x capacity planning, killed the flat-rate Claude subsidy, and is leasing 220,000 GPUs from a competitor to keep the lights on — while 59% of production tokens are now agentic workloads that your single-turn eval harness doesn't measure and your cost model underestimates by 5x. Re-price your Claude budget this week, rebuild your eval around trajectories this sprint, and add a provider-routing layer before June 15 or you're making a $30B vendor's IPO margin target a line item on your quarterly invoice.
Frequently asked
- What exactly changed with Anthropic's Claude subscriptions this week?
- Anthropic converted flat-rate paid subscriptions into dollar-matched API credits billed at list price. Programmatic usage through the Agent SDK, GitHub Actions, claude -p, and third-party harnesses like Zed and OpenCode now meters against those credits, which removes the 70–90% effective discount developers were running on Max plans. Claude Code interactive limits were doubled and the peak throttle was removed, but the subsidy on programmatic access is structurally over.
- How do I figure out if the new credit cap is a real price hike or a rounding error for my team?
- Reconcile each Claude-backed workload against the new metered rate and compare projected token burn to your credit cap. The delta depends on token mix: heavy Agent SDK and GitHub Actions traffic that previously rode unlimited subscriptions will see the largest hit, while interactive Claude Code usage may actually improve under doubled limits. Without per-user, per-feature gateway telemetry you cannot answer the question, which is how ServiceNow burned its full-year budget by May.
- Is the OpenAI Codex switch promo worth taking seriously, or just a marketing stunt?
- It is worth running as a free head-to-head evaluation window, not as a migration commitment. Two months of free Codex enterprise access lets you benchmark on matched prompts and tool schemas with asymmetric payoff. Instrument with trajectory-level metrics — tool-call precision, steps-to-completion, cost-per-successful-task — because pass@1 will not tell you which model your agents actually finish work on.
- Why are single-turn cost models off by roughly 5x for agentic workloads?
- Agentic traces run input-to-output ratios closer to 15:1, versus the 3:1 assumption baked into older single-turn models, and cache reuse varies by provider. Vercel's gateway data across 200K teams shows agentic traffic is now 59% of token volume, with Anthropic capturing 61% of spend on planning nodes and Google 38% of volume on throughput work. A forecast built on last year's ratio understates spend by about five times, and the error is asymmetric across vendors.
- Which training efficiency result should I prioritize testing this quarter?
- Token Superposition Training is the lowest-risk spike because it changes only the pretraining recipe, with no inference-side architecture change, and it has been validated from 270M up to a 10B-A1B MoE. A matched-FLOP 1B continued-pretraining run is enough to see if the claimed 2–3x wall-clock holds; even a 1.6x result pays for itself on the next full run. Datology's curation finding is also immediately actionable without reproducing the paper — audit data quality before scaling compute further.