Edition 2026-05-22 · read as Data Science
AnthropicEndsClaudeSubscriptionSubsidy,EvalCostsSpike
- Sources
- 36
- Words
- 1,362
- Read
- 7min
Topics Agentic AI LLM Inference AI Capital
◆ The signal
Anthropic quietly metered Claude subscriptions to dollar-matched API credits, removing what had been a 70-90% effective subsidy on Agent SDK, GitHub Actions, and third-party harness calls. OpenAI announced a 2-month-free Codex enterprise switch promo the same day. The thing the pricing page doesn't tell you: any eval harness or batch pipeline budgeted against flat subscription cost is now charging at API rates, and the overrun shows up in this week's token burn, not next quarter's review.
◆ INTELLIGENCE MAP
01 Anthropic Triple Pricing Shock + Capacity Crisis
act nowAnthropic metered subscriptions at API rates, tripled Opus 4.7 image cost, and announced a June 15 third-party tool credit split — all while admitting an 80x capacity miss that forced leasing xAI's 220K-GPU Colossus 1 cluster. ServiceNow burned its full-year Claude budget by May. Any Claude cost model built before this week is wrong.
- B2B share (Ramp)
- ARR growth (4mo)
- Colossus GPUs
- Image cost change
- June 15 credit split
02 59% Agentic Volume: Eval and Cost Models Obsolete
act nowVercel's AI Gateway (200K teams, 7 months) reports 59% of tokens are now agentic multi-turn traffic. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Cost models built on 3:1 input-output ratios are off by ~5x; single-turn eval harnesses score the minority of production traffic.
- Anthropic spend share
- Google volume share
- I/O ratio (agentic)
- MCP token overhead
03 Training Efficiency Breakthroughs: 2-360x Gains
monitorThree independent results shift pretraining and distillation economics: Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference change (validated 270M→10B). NVIDIA Star Elastic produces model-size families at 360x less cost than pretraining each. Datology beats InternVL3.5-2B by 10 pts at 17x less compute via pure data curation.
- TST speedup
- Star Elastic savings
- Datology bench gain
- Datology compute cut
04 Compute Supply Crunch: 4:1 Demand and Neocloud Boom
monitorNebius reported 684% YoY revenue growth with 4+ customers per GPU, guiding $3-3.4B for 2026 vs $530M in 2025. Cerebras IPO'd at $56B with a $20B OpenAI commitment. Cisco AI orders jumping from $5B to $9B with explicit memory-hardware shortage. H2 capacity priced on today's availability is likely mispriced.
- Nebius 2026 guide
- Nebius YoY growth
- Cerebras valuation
- Cisco AI order jump
- Nebius 2025530
- Nebius 2026 guide3200
05 AI Cyber Capability: AISI Ranges Saturating
backgroundAnthropic's Mythos cleared both AISI attack ranges (first model ever). GPT-5.5-cyber cleared one. Both achieved 'full network takeover' in controlled environments — a step-function above prior gen's 'advanced persistence.' AISI is already building harder tests. Google confirmed a threat actor using AI to build cybercrime tooling in the wild.
- Mythos ranges
- GPT-5.5 ranges
- Prior gen ceiling
- Palo Alto scan
- 01Mythos (new)Full takeover
- 02GPT-5.5-cyberPartial takeover
- 03Mythos (prior)Adv. persistence
◆ DEEP DIVES
01 Anthropic's Pricing Earthquake: Three Simultaneous Cost Shocks Hit Your Claude Stack
What Happened
The headline change: Claude subscriptions now convert to dollar-matched API credits for all programmatic usage, including Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% effective discount is gone. In the same week, Opus 4.7 tripled image processing cost, and starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket equal to plan value with no rollover and overflow at API rates.
The driver behind the pricing is capacity. Anthropic planned for 10x growth and is seeing 80x. The emergency fix is leasing xAI's entire Colossus 1 cluster, 220,000+ GPUs spanning H100, H200, and GB200. A CFO is in seat and the company is targeting an October IPO, which is a reasonable proxy for why margin-per-token is now a board metric.
The Capacity Context
The pricing changes read more cleanly alongside the capacity numbers. ServiceNow's CDIO burned through a full-year Claude budget by May. National Life Group's CIO called Claude 'not great for companies that want per-user monitoring,' and Anthropic ships no native per-user telemetry and no SLAs, which is unusual for a dependency sitting on production critical paths.
Change Impact Timeline Subscription → API credits 70-90% discount gone on batch/eval workloads Immediate Opus 4.7 image cost 3x on multimodal pipelines Immediate June 15 third-party split No subsidized tokens for Zed/OpenCode/T3 30 days Colossus integration p95/p99 variance during heterogeneous fleet merge Weeks–months Any Claude benchmark from before May 7 is stale, and any cost model built against flat subscription rates is not directionally wrong, it is numerically wrong.
The OpenAI Counter-Move
Sam Altman posted a 2-month-free Codex enterprise switch promo on the same day Anthropic metered subscriptions. Ramp's April data showed Anthropic edging OpenAI for the first time, 34.4% vs 32.3%. The promo is an asymmetric-payoff bet: free to evaluate, with bounded switching cost if you already have a provider abstraction layer. OpenAI is pricing directly into the developer cohort Anthropic just alienated.
Cross-Source Tension
Sources disagree on the durability of Anthropic's lead. Ramp data is a card-spend proxy and measures who gets billed, not token volume or production criticality. A 20-seat pilot weighs the same as a company at inference scale, and OpenAI notes large enterprises rarely pay by card. The directional signal looks real; the magnitude does not. What is not uncertain is that the market is now genuinely multi-vendor, and architecture should reflect that.
Action items
- Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap by end of this week
- Deploy an LLM gateway (LiteLLM, Portkey) with per-user, per-feature tagging and daily budget alerts within this sprint
- Run a 2-month Codex evaluation under OpenAI's enterprise switch promo with matched prompts against your Claude harness
- Reforecast Claude inference spend for any team using Zed/Conductor/OpenCode modeling the post-June-15 scenario
Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Ramp's AI Index shows Anthropic at 34.4%
02 59% Agentic Volume: Your Eval Harness and Cost Model Are Measuring the Minority
The Production Shift, Quantified
Vercel's AI Gateway telemetry — 200,000 teams over 7 months — reports that 59% of all token volume is now agentic: multi-turn, tool-calling traces, not single-shot completions. Six months ago the share was under 20%. The interesting structure is the spend-volume split. Anthropic captures 61% of spend via Opus on reasoning and planning nodes, while Google captures 38% of volume via Flash on throughput and utility calls. Vendor loyalty is not visible in the data. Customers churn freely.
This is not a leaderboard result. It is production telemetry from a multi-tenant gateway, which means most tokens in the wild are inside multi-step tool loops with retries, not the single-turn completions an eval harness was built to score.
What Breaks at 59%
Eval harnesses: single-turn accuracy on held-out prompts measures the 41% minority. The thing this doesn't tell you is whether a planner burns 40K tokens arguing with itself before giving up. Pass rate looks fine. Spend is 5x budget. Cost models: most were calibrated when input-output ratios sat at 3:1. Agentic traces run at ~15:1 on input, with heavy cache reuse on some providers and none on others. A forecast carried over from last year's ratio is off by roughly 5x on spend.
Metric Single-Turn World 59% Agentic World Input:Output ratio ~3:1 ~15:1 Cost driver Output tokens Trajectory length × retries Eval metric Accuracy on final answer Cost-per-successful-task Routing unit Single request Session with KV cache state Failure mode Wrong answer Correct answer at 10x budget If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out — update the harness before you update the model.
The Routing Architecture That Fits
The Vercel pattern is consistent with what Abridge disclosed across 80M+ clinical conversations: a constellation of models, cheap fast triage in front, expensive reasoning behind, per-task selection. Given those numbers, the plausible envelope for cost reduction from tiered routing is 20-40% at constant trajectory completion rate. Glean's published benchmark puts off-the-shelf MCP at 30% more tokens than retrieval-tuned knowledge graphs. The methodology is not disclosed and the source is the vendor, so treat it as directional. It is consistent with the MCP context-window bloat people have measured independently.
The Enterprise Convergence
SAP (€100M partner investment) and ServiceNow (Action Fabric) have landed in the same place: agents need Knowledge Graph grounding + MCP-exposed workflows. The architectural drift is from RAG-over-docs to structured, entity-resolved context. The eval needs to move with it. Tool-use accuracy against grounded context is the production differentiator. It is on no public leaderboard.
Action items
- Add trajectory-level metrics to your eval harness this sprint: tool-call F1, steps-to-completion, cost-per-successful-task, recovery-from-error rate
- Instrument per-node token cost across your agent graphs and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
- Run a 1-hour spike measuring token overhead of current MCP/tool-calling setup vs. a retrieval-first baseline on 100 production traces
- Add LLM-judge ↔ human-annotator agreement as a tracked SLI this quarter; re-calibrate when judge model changes
Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · AI Gateway data puts agentic workloads at fifty-nine percent · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs
03 Training Efficiency Frontier: Three Results That Change Unit Economics This Quarter
Three Independent Wins, One Direction
The marginal dollar in model development has moved from raw compute to architecture and curation. The evidence this week comes from three separate labs, each working a different layer of the training stack, and the results point the same way. The marginal dollar in model development has moved from raw compute to architecture and curation.
Work Claim Scale Validated Inference Impact Replication Risk Nous Research TST 2-3x wall-clock at matched FLOPs 270M → 10B-A1B MoE None — no architecture change Medium; single-source, clean claim NVIDIA Star Elastic 360x cheaper model-family derivation; 7x vs SOTA compression Not specified Produces family of sizes from one run High; big number, lab-reported Datology VLM +11.7 pts on 20 benchmarks at 17x less compute 2B and 4B params 3.3x lower response FLOPs (serving win) Medium; benchmark selection risk TST Is the One to Spike First
Token Superposition Training is a pretraining recipe change with no inference-side downstream. If it replicates, it is a 2-3x improvement on wall-clock for free. The serving architecture is unchanged. The efficiency lives entirely in how gradients are computed during training. Validated from 270M up to a 10B-A1B MoE, which is the band most teams actually run for continued pretraining and domain adaptation.
Datology: Curation Beats Compute
Datology's result is the cleanest evidence so far that data curation now dominates compute scaling for VLMs. A 2B model lands about 10 points above InternVL3.5-2B at 17x less training compute, on data selection alone. At 4B, they reach near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B, which is a serving win and not just a training one. For any team maintaining a vision pipeline, this reorders the next budget line. Curation tooling before more GPUs.
Star Elastic: Speculative but High-Leverage
NVIDIA's claim that one post-training run produces a full family of model sizes at 360x lower cost than pretraining each is the kind of number that always shrinks under independent evaluation. The thing this number doesn't tell you is what holds at the small end versus the large end. Even a 30x hold would restructure how teams produce size tiers for routing. Paired with the 59% agentic routing pattern, cheap access to a 1B-to-70B spectrum from a single run is what a tiered architecture actually needs.
TST is worth a spike on your next continued-pretraining run. If wall-clock comes in at even 1.6x with no val-loss regression, it pays for itself immediately.
The Parallel Signal: Only 15% Are Ready
Fivetran's readiness index reports that 15% of organizations have the data foundation for agentic AI. Data quality and lineage are the top blocker, cited by roughly 50%. Read against the efficiency results, the picture is consistent. Training is getting cheaper. Most teams cannot capitalize because the data layer is not there. Half the agent projects funded this quarter are data-platform projects with an agent bolted on.
Action items
- Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline within the next 2 weeks
- Run an ablation benchmarking Qwen-2.5/3 or DeepSeek-V3 against your current production model on in-domain evals this quarter
- Audit ANALYZE/compute-stats coverage across top-20 Iceberg/Delta tables; add stats freshness to table-level SLAs
- Score target domains against Fivetran readiness dimensions (quality, lineage, governance) before greenlighting agentic AI projects
Sources:Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week
◆ QUICK HITS
DuckDB shipped Quack HTTP client-server protocol — credible replacement for Spark-on-Glue for sub-100GB jobs; default config is localhost-only, no SSL, token auth
DuckDB shipped a client-server mode this week
Kafka Share Groups report ~linear 8x throughput at 32 consumers on I/O-bound workloads — decouple consumer parallelism from partition count for embedding/enrichment pipelines
DuckDB shipped a client-server mode this week
Update: Mythos cleared both AISI attack ranges (first model ever) — capability stepped from 'advanced persistence' to 'full network takeover'; AISI is building harder tests because current ladder is saturating
Mythos cleared the AISI attack ranges this week
Apache Iceberg CVE-2026-42812 (CVSS 9.9): attacker with table-write permission can redirect metadata to attacker-controlled S3, poisoning training data on next read — audit Polaris catalog write-path allowlisting
LiteLLM landed in the KEV catalog this week
Gemini reproducibly outputs real phone numbers from training data — three independent extraction cases; add PII canary + divergence-attack probes to LLM CI before next release
Gemini is the latest model to surface PII
Duolingo publicly pegs AI-generated content rejection at ~20% requiring human QC — use as calibration anchor for your own generation pipeline acceptance rate
Duolingo's twenty percent AI slop rate
TML-Interaction-Small hits 0.40s turn-taking latency vs 0.57s Gemini Flash and 1.18s GPT-Realtime — full-duplex voice becoming its own model class with 200ms micro-turn architecture
TML is reporting 0.40 seconds of full-duplex latency
SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs across 3K repos and 16 languages — largest open agentic trace corpus; pull before licensing frictions arrive
Claude just metered your agent SDK calls
Microsoft's agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — alternative to bloated prompts or flat vector top-k
DuckDB shipped a client-server mode this week
Persona drift measurable within 8 dialogue turns per Li et al. (COLM 2024) — embed a verbal tic as a free drift canary via regex; costs one hour to instrument
AI personas drift within eight turns
◆ Bottom line
The take.
Anthropic killed the flat-rate developer discount, tripled image costs, and announced a June 15 credit split — all while 59% of production tokens are now agentic and your eval harness still measures single-turn completions. The two cheapest things you can do this week: reconcile every Claude workload against the new metered credit cap, and add trajectory-level cost-per-successful-task to the eval harness that's currently scoring the minority of your traffic.
Frequently asked
- What changed in Anthropic's Claude subscription pricing?
- Claude subscriptions now convert to dollar-matched API credits for all programmatic usage, including Agent SDK, claude-p, GitHub Actions, and third-party harnesses. This eliminates what had been a 70-90% effective discount on batch and eval workloads, so any pipeline budgeted against flat subscription cost is now charging at full API rates.
- How should I reforecast Claude spend before the June 15 third-party tool change?
- Model the post-June-15 scenario for any team using Zed, Conductor, OpenCode, or T3 Code, since third-party tool usage moves to a separate credit bucket equal to plan value with no rollover and overflow at API rates. Combine this with a per-workload audit of Agent SDK and batch eval token burn to catch silent overruns before credits exhaust mid-month.
- Why are single-turn eval harnesses inadequate now?
- Vercel gateway telemetry shows 59% of token volume is agentic multi-turn tool-calling, while most harnesses still measure single-turn accuracy on the 41% minority. Agentic traces also run at roughly 15:1 input-to-output versus the historical 3:1, so cost models calibrated on single-turn assumptions can be off by about 5x. Track tool-call F1, steps-to-completion, and cost-per-successful-task instead.
- Is Token Superposition Training worth piloting now?
- Yes — it's the lowest-risk efficiency bet on the table. TST is a pretraining recipe change with no inference-side impact, claiming 2-3x wall-clock improvement at matched FLOPs, validated from 270M up to a 10B-A1B MoE. Even a 1.6x replication on a continued-pretraining spike pays for itself immediately, since the serving architecture is unchanged.
- Does the OpenAI Codex switch promo justify a real evaluation?
- It's an asymmetric bet worth taking if you have a provider abstraction layer. The 2-month-free Codex enterprise promo lets you run matched prompts against your Claude harness at zero marginal cost, and Ramp's April data already shows a genuinely multi-vendor market (34.4% Anthropic vs 32.3% OpenAI on card spend). Worst case you validate confidence in Claude; best case you find a cost-competitive alternative.
◆ Same day, different angle
Read this day as…
◆ Recent in data science
Keep reading.
- Princeton's ICML 2026 audit added GPT 5.5, Gemini 3.5 Flash, and Claude Opus 4.7 and found zero meaningful reliability improvement over pred…
- Hugging Face Transformers has an RCE path that fires from model config files — not pickle weights — across 2.2 billion installs.
- Anthropic ended the flat-rate Claude subsidy this week.
- Anthropic killed the flat-rate Claude subscription this week.
- Anthropic quietly killed the 70-90% effective discount on programmatic Claude usage — subscriptions now convert to dollar-matched API credit…