Data Science daily

Edition 2026-06-06 · read as Data Science

AnthropicEndsClaudeSubsidyasAgentTokensHit59%

Sources
36
Words
1,717
Read
9min

Topics Agentic AI LLM Inference AI Capital

◆ The signal

Anthropic ended the flat-rate Claude subsidy this week. Programmatic calls now bill at metered API rates, in the same week Vercel's production telemetry put 59% of inference tokens inside agentic multi-turn traces rather than single-shot completions. The thing the old subscription price didn't measure was workload shape, and the workload shape moved. Any Claude-backed agent workflow still costed on subscription economics needs to be re-run against metered rates before June 15. Skipping that exercise is a pricing decision, just not a deliberate one.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic's Triple Shock: Metered Credits, 80x Capacity Miss, Market Lead

    act now

    Anthropic converted subscriptions to dollar-matched API credits (killing 70-90% effective discounts), admitted 80x growth vs 10x planned capacity, and leased xAI's entire 220K-GPU Colossus 1 cluster. ServiceNow burned its full-year Claude budget by May. Ramp shows Anthropic at 34.4% vs OpenAI 32.3% — first crossover.

    80x
    growth vs capacity plan
    12
    sources
    • Ramp share: Anthropic
    • Ramp share: OpenAI
    • Colossus GPUs leased
    • Effective discount lost
    1. Anthropic B2B share34.4
    2. OpenAI B2B share32.3
  2. 02

    59% Agentic Token Share Breaks Eval & Cost Models

    act now

    Vercel's AI Gateway production index shows 59% of tokens are multi-turn, tool-calling agentic traces. Anthropic captures 61% of spend (Opus for reasoning), Google captures 38% of volume (Flash for throughput). Single-turn eval harnesses now measure the minority of production traffic. Cost models built on 3:1 I/O ratios are off by ~5x.

    59%
    agentic token share
    5
    sources
    • Agentic token share
    • Anthropic spend share
    • Google volume share
    • I/O ratio (agentic)
    1. Agentic (multi-turn)59
    2. Single-shot41
  3. 03

    AI Cyber Capability Clears Autonomous-Exploit Threshold

    monitor

    Anthropic's Mythos is the first model to clear both AISI simulated attack ranges (full network takeover). Mozilla's custom harness yielded 271 Firefox bugs vs. curl's 1 CVE with same model — a 271:1 harness-engineering delta. Google confirmed a threat actor using AI to build cybercrime tooling. MDASH shipped 16 real Windows patches from multi-model bug hunting.

    271:1
    harness yield ratio
    7
    sources
    • Mozilla bugs found
    • curl bugs found
    • MDASH Windows fixes
    • AISI ranges cleared
    1. Mozilla (custom harness)271
    2. curl (generic scan)1
    3. MDASH (multi-model)16
  4. 04

    Training Efficiency Breakthroughs: 2x to 360x Cost Cuts

    monitor

    Three research drops change pre-training and distillation economics. Nous TST: 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated 270M→10B). Datology: +11.7 pts on VLM benchmarks at 17x less training compute via pure data curation. NVIDIA Star Elastic: one post-training run produces a model family at 360x lower cost.

    17x
    compute reduction (VLM)
    2
    sources
    • TST speedup
    • Datology compute cut
    • Star Elastic savings
    • Datology benchmark lift
    1. Nous TST3
    2. Datology curation17
    3. Star Elastic360
  5. 05

    Compute Crunch Quantified: 4:1 Demand, Siting Backlash, Silicon Diversification

    background

    Nebius reports 4+ customers per GPU brought online, 684% YoY revenue growth, guiding $3-3.4B for 2026. Cisco AI orders jumping $5B→$9B. Cerebras IPO'd at $56B with OpenAI's $20B commitment. The 9GW Stratos project faces 4,000 complaints and a referendum. Inference hardware is diversifying but supply remains structurally tight.

    4:1
    GPU demand/supply ratio
    5
    sources
    • Nebius YoY growth
    • Cerebras valuation
    • Cisco AI order growth
    • Stratos complaints
    1. Nebius 2025 revenue530
    2. Nebius 2026 guide3200

◆ DEEP DIVES

  1. 01

    Anthropic's Pricing Reset: Your Claude Cost Model Broke Three Ways This Week

    The Convergence

    Anthropic is leasing xAI's entire Colossus 1 cluster, 220,000+ GPUs spanning H100, H200, and GB200, and targeting an October IPO. That is the context for the pricing change underneath it. Claude subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses, which removes the 70-90% effective discount power users extracted from Max plans. Dario Amodei admitted planning for 10x growth and hitting 80x in revenue and usage, which is why Claude Code degraded through April. It was a capacity miss, not a product decision, and the capacity fix is what the Colossus lease pays for. Production routing decisions should be made against this combined picture, not any single fact in it.

    ServiceNow's CDIO already burned the full-year Claude budget by May. National Life Group's CIO called Claude 'great for consumer usage but not great for companies' that want per-user monitoring. Anthropic provides no native per-user telemetry, no SLAs on latency or availability, and no budget alerts.


    Why Sources Disagree

    Ramp's data shows Anthropic at 34.4% vs OpenAI at 32.3% of paying businesses, the first crossover. OpenAI's objection is correct on its own terms: Ramp measures credit-card spend, not invoice-based enterprise contracts. The crossover is real for bottoms-up developer adoption. It likely overstates Anthropic's lead among $1M+ ACV accounts. Both can be true at the same time, and a routing policy should be informed by both.

    The vendor underneath most production stacks just converted from a developer-friendly flat rate to metered API economics. It is also leasing a competitor's datacenter to serve existing customers, with no SLA. Multi-provider routing stopped being optional.

    The June 15 Cliff

    Starting June 15, Claude usage through third-party tools (Conductor, Zed, OpenCode, T3 Code) gets a separate credit bucket equal to plan value. No subsidized tokens, no rollover, and overflow bills at API rates. Any cost model that assumed flat-rate Claude consumption through these tools is dead in 30 days.

    What the Capacity Fix Changes

    SurfaceBefore (April)After (announced May 7-14)
    Claude Code limits5-hour capDoubled
    Peak-hours throttleReduced limitsRemoved (Pro/Max)
    Opus API ratesSqueezed'Substantially raised'
    Fleet compositionAnthropic-managedHeterogeneous (incl. GB200)

    Any Claude benchmark run between mid-April and May 7 is contaminated for baselining. Re-run after the new caps land, not before. Otherwise capacity noise gets attributed to prompt or model changes, and the wrong variable gets the credit.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap by end of next week
    • Deploy an LLM gateway (LiteLLM, Portkey) with per-user, per-feature tagging and daily budget alerts within this sprint
    • Add a second frontier provider with automatic failover on 429/5xx behind a router abstraction
    • Re-baseline Claude Code and Opus API benchmarks (throughput, p95 latency, rate-limit headroom) post-Colossus integration before locking Q3 architecture decisions

    Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Agentic traffic crossed fifty-nine percent

  2. 02

    59% Agentic: Your Eval Harness and Cost Model Are Measuring the Minority

    The Number

    Vercel's AI Gateway production index, drawn from 200,000 teams over 7 months, puts agentic workloads at 59% of all token volume. Anthropic takes 61% of spend through Opus on reasoning nodes. Google takes 38% of volume through Flash on throughput. Three different races. The leaderboard depends on which one you score.

    The thing this doesn't tell you sits inside the eval stack. Most eval harnesses still score single-turn responses against reference answers. That was the right design in 2023. It now measures the minority of 2026 production traffic. The median request is a multi-step tool loop with retries, and what breaks in prod is a planner burning 40,000 tokens arguing with itself before giving up.


    Where Cost Models Break

    Cost models were fit when input-output ratios sat near 3:1. Agentic traces run closer to 15:1 on input, with heavy cache reuse on some providers and none on others. A forecast built on last year's ratio is off by roughly 5x on spend, and the error is asymmetric across vendors.

    Glean's benchmark, vendor-published with methodology undisclosed, claims off-the-shelf MCP uses 30% more tokens and loses 2.5x head-to-head preference against an enterprise knowledge graph on agentic tasks. Read it as a hypothesis, not a result. The failure mode it points at is real: MCP tool listings balloon context windows.

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out — update the harness before you update the model.

    The Routing Architecture That Emerged

    The Vercel data shows a textbook tiered-routing signature already running across 200K teams:

    ProviderPositionImplied Role
    Anthropic (Opus)61% of spendReasoning / planning nodes
    Google (Flash)38% of volumeHigh-throughput utility calls
    OpenAIFast-growing shareMixed; spiking post-model-update
    Open sourceRisingGaining traction, no loyalty

    The explicit evidence of no vendor loyalty means a provider-agnostic routing layer isn't aspirational. It describes present-tense production reality. Application code still pinned to a single vendor's SDK is out of step with what the market is already doing.

    Multi-Agent Decomposition Validates the Pattern

    Microsoft's MDASH (100+ agents) beat Anthropic's Mythos on CyberGym by decomposing vulnerability work into scan → adversarial debate → PoC exploitation stages. No cost or latency comparison was published, which limits what you can conclude. The architectural signal still points the same direction: specialized routing across model tiers outperforms a single frontier model on complex tasks. The 59% agentic share and the MDASH result are two views of the same shift.

    Action items

    • Add trajectory-level metrics to your eval harness this sprint: task success, tool-call precision/recall, steps-to-completion, cost-per-successful-task
    • Instrument per-node token cost in your agent pipelines and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models
    • Run a 1-hour spike measuring token overhead of current MCP/tool-calling setup vs. a retrieval-first baseline on 100 production traces
    • Prototype a multi-agent decompose-debate-verify pipeline against your best single-agent baseline on a task with auto-verifiable outputs

    Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · MCP plus knowledge graphs · ben's bites: Vercel AI Gateway

  3. 03

    AISI Cleared, 271:1 Harness Delta, First AI Cybercrime in the Wild — The Offensive Capability Ceiling Just Moved

    Three Data Points, One Direction

    This week's evidence points at a specific capability boundary: end-to-end exploit chain completion in controlled evaluations, with corroborating signal from production work and the wild.

    1. Mythos cleared both AISI attack ranges, the first model to complete full network takeover in controlled tests. The prior generation topped out at 'advanced persistence.' AISI is already building harder tests because the current ones are saturating.
    2. Mozilla's custom harness surfaced 271 Firefox bugs (sandbox escapes, UAFs, race conditions) with the same model family that found exactly 1 low-severity CVE in curl when run as a generic scanner. Same weights, 271:1 yield ratio.
    3. Google confirmed a threat actor using AI to build cybercrime tooling, the first production-grade in-the-wild incident behind post-Mythos misuse concerns.

    The Harness Is the Product

    The Mozilla vs. curl comparison is the most instructive data point for any team shipping LLM-powered tools:

    DimensionMozilla + FirefoxStenberg + curl
    ModelClaude Mythos PreviewClaude Mythos Preview
    HarnessCustom agentic, fuzzer-integrated, ephemeral VMsOut-of-box scan
    Bugs surfaced271 (incl. sandbox escapes)5 claimed → 1 real CVE
    False-positive rate~0% (sanitizer crash = truth)~80%

    The former Google Distinguished Engineer on Mozilla's team said it directly: model choice was not the dominant factor; the harness was. That transfers. A team debating Claude vs GPT vs Gemini is optimizing the wrong variable. A week of domain-specific harness engineering yields 50x+ more signal than a model swap.

    When a frontier model yields 271 bugs for one team and 1 CVE for another against the same language, the harness is the product, not the model.

    Implications for Teams Shipping Agents

    The AISI result means refusal-rate harnesses are measuring the wrong bottleneck. Gating agent releases on jailbreak catch rates does not capture end-to-end exploit chain completion. A staged rubric covering recon, initial access, lateral movement, persistence, and exfil, run against every model upgrade, is a closer match to actual failure modes.

    Google's in-the-wild confirmation means offensive AI tooling is now a detected event class, not a tabletop exercise. The cost structure favors the attacker: inference is cheap, orchestration is cheap, and the expensive part used to be the human operator, which is what the model replaced.

    PraisonAI as Case Study

    PraisonAI (open-source multi-agent framework) was weaponized within 4 hours of CVE disclosure. Agent frameworks have crossed the adoption threshold where threat actors watch their disclosure feeds. Any runtime holding API keys or tool-call permissions sits in the blast radius.

    Action items

    • Add a staged cyber-capability tier to your agent release gate (recon → lateral movement → persistence → exfil rubric) before the next model upgrade
    • Spike a domain-specific agentic harness on one internal tool (code review bot, data quality checker) modeled on Mozilla's pattern — reproducible test cases + ephemeral VMs + existing signal pipelines
    • Instrument agent action sequences in production logs and train a lightweight classifier on known-bad tool-call trajectories
    • Inventory all agent frameworks in use and set CVE-feed subscriptions with same-day patching SLA

    Sources:Mythos cleared the AISI attack ranges · The headline claim is that AI models have reached full network takeover · Mozilla shipped 271 bugs · PraisonAI exploited in four hours · Google's threat tracker describes an industrialized guardrail-bypass stack · Anthropic published the case study this week

  4. 04

    Training Efficiency Frontier Moved: Three Drops That Change the Unit Economics

    Three Results, One Direction

    Three research drops landed in the same week. Each one moves training economics in a direction that matters for anyone running pre-training, post-training, or distillation this quarter.

    WorkClaimScale ValidatedInference ImpactReplication Risk
    Nous TST2-3x wall-clock at matched FLOPs270M → 10B-A1B MoENone — no architecture changeMedium; single-source, clean claim
    Datology VLM curation+11.7 pts on 20 VLM benchmarks; 17x less compute2B and 4B paramsLower response FLOPs — real serving winMedium; benchmark-selection risk
    NVIDIA Star Elastic360x cheaper model-family production; 7x vs SOTA compressionNot specifiedFamily of sizes from one runHigh; big number, lab-reported

    Which to Spike First

    TST is the highest-signal, lowest-risk bet. It is a pretraining recipe change with no inference-side consequence. If it replicates, it is a free 2-3x. No new hardware, no architecture migration, no serving changes. Run it on a 1B continued-pretraining task against a matched-FLOPs baseline. If wall-clock comes in at 1.6x with no val-loss regression, it pays for itself on the next full run.

    Datology is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. At 2B parameters, pure data curation beat InternVL3.5-2B by about 10 points at 17x less training compute. The near-frontier 4B model hits 3.3x lower response FLOPs than Qwen3-VL-4B, which is a real serving-cost win, not just a leaderboard win.

    Star Elastic's 360x is the kind of number that always shrinks under independent evaluation. The thing this number doesn't tell you is how much of it survives a different post-training distribution. Even a 30x hold would restructure how model-size tiers get produced for deployment. One post-training run producing a family eliminates the need to separately post-train each size variant.

    The marginal dollar in VLM training just moved from compute to curation. TST gives a free 2-3x on pretraining with no inference change. Both are actionable this quarter.

    Adjacent Signal: DuckDB + Kafka Share Groups

    Two infrastructure releases landed on the same architectural assumption. The single-node analytics stack is growing multi-node features.

    • DuckDB Quack protocol: HTTP client-server mode makes DuckDB viable as a shared analytics service. Combined with the ECS Fargate plus Terraform pattern, it is a credible path to deleting Spark-on-Glue jobs that were single-node workloads wearing a distributed costume.
    • Kafka Share Groups: consumer parallelism decouples from partition count with roughly linear 8x scaling at 32 instances on I/O-bound workloads. The caveat is the part that matters in production. 8x on I/O-bound is not 8x on CPU-bound consumers, and most consumer fleets are mixed.

    Both are worth spikes on the specific workloads they address. Neither requires an immediate architectural commitment.

    Action items

    • Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
    • Audit Glue/EMR job catalog for single-node candidates (<100GB working set) and pilot one on ECS Fargate + DuckDB + Terraform pattern
    • Benchmark Kafka Share Groups against your most partition-bound consumer group (embedding/enrichment workloads first)
    • For VLM work: allocate next training budget iteration 60/40 curation/compute rather than 20/80, using Datology's result as the prior

    Sources:Claude just metered your agent SDK calls · DuckDB shipped a client-server mode · TLDR Data

◆ QUICK HITS

  • Update: LiteLLM added to CISA KEV (active exploitation confirmed) — rotate all provider API keys stored in its DB if running versions 1.81.16–1.83.7

    SANS AtRisk

  • Apache Iceberg CVE-2026-42812 (CVSS 9.9): attacker with table-write can redirect metadata to poisoned S3 prefix — training data corruption vector for any lakehouse

    SANS AtRisk

  • Gemini reproducibly emits real phone numbers from training data — 4 independent cases; add a PII extraction eval (canary insertion + divergence attacks) to LLM CI this sprint

    The Download from MIT Technology Review

  • TML-Interaction-Small reports 0.40s turn-taking latency vs. 1.18s for GPT-Realtime-2.0 — a 3x gap via multi-stream 200ms micro-turn architecture; research preview, unverified

    Simplifying AI

  • Duolingo publicly pegs AI-generated content 'slop' at ~20% requiring human QC — use as a calibration anchor for your own LLM acceptance-rate dashboard

    TLDR Marketing

  • Only 15% of organizations have the data foundation for agentic AI (Fivetran); data quality/lineage cited as #1 blocker by ~50% — score agent projects against readiness before committing compute

    TLDR Data

  • LLM-as-a-Verifier beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing criteria into repeated binary verifications at token granularity — cheapest variance reduction available this quarter

    TLDR InfoSec

  • SAP (€100M partner fund) and ServiceNow (Action Fabric) both converged on Knowledge Graph + MCP as the enterprise agent architecture — treat KG grounding as the default for entity-heavy domains

    TLDR IT

  • AI agents bypass legacy bot detection at 81% success rate — retrain abuse models with agent-generated traffic or your experiment populations are already contaminated

    TLDR IT

  • Persona drift in LLM agents is measurable within 8 dialogue turns (Li et al., COLM 2024) — embed a verbal-tic canary and log per-turn retention as a zero-cost drift detector

    Brian Ardinger, Inside Outside Innovation

◆ Bottom line

The take.

Anthropic killed the flat-rate Claude subsidy, leaked that they're running at 80x planned capacity (hence the April degradation), and is renting 220,000 GPUs from a competitor to keep the lights on — all while Vercel's production data shows 59% of tokens are now multi-turn agentic traces your eval harness doesn't measure and your cost model doesn't price correctly. Re-cost your Claude workloads before June 15, rebuild evals around trajectories not single turns, and add a second frontier provider behind a router before the next capacity miss becomes your outage.

— Promit, reading as Data Science ·

Frequently asked

What exactly changes for Claude billing on June 15, and which workloads are affected?
Subscription-based Claude usage through third-party tools (Conductor, Zed, OpenCode, T3 Code) and programmatic surfaces like Agent SDK, claude-p, and GitHub Actions converts to a separate credit bucket equal to plan value, with overflow billed at metered API rates. There are no rollovers and no subsidized tokens, which removes the 70-90% effective discount Max-plan power users were extracting. Any cost model assuming flat-rate consumption needs to be re-run at API pricing before that date.
Why are single-turn eval harnesses inadequate when 59% of tokens are agentic?
Single-turn harnesses score one response against a reference answer, but the median production request is now a multi-step tool loop with retries, planning, and cache reuse. Failure modes that matter — a planner burning 40,000 tokens arguing with itself, tool-call precision collapse, or runaway step counts — are invisible to single-shot scoring. Trajectory-level metrics (task success, tool-call precision/recall, steps-to-completion, cost-per-successful-task) are required to measure what's actually shipping.
How should I rebalance training spend given the Datology and TST results?
For VLM work, shift the next budget iteration toward roughly 60/40 curation-to-compute, using Datology's 17x compute reduction at 2B-4B as the prior that data quality now dominates scale below 10B parameters. For text pretraining, spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline; even a partial replication of the 2-3x wall-clock claim pays back on the next full run with no inference-side change.
Why does the Mozilla 271:1 result mean model choice is the wrong optimization?
The same Claude Mythos Preview weights surfaced 271 real Firefox bugs under Mozilla's custom agentic harness (fuzzer-integrated, ephemeral VMs, sanitizer-grounded truth) and exactly one low-severity curl CVE under an out-of-box scan. A week of domain-specific harness engineering — reproducible test cases, ephemeral execution, integration with existing signal pipelines — yields roughly 50x more signal than swapping frontier models. Teams A/B-testing Claude vs GPT vs Gemini before investing in harness design are optimizing the smaller variable.
What's the minimum instrumentation needed before the next Anthropic invoice lands?
Deploy an LLM gateway like LiteLLM or Portkey with per-user and per-feature tagging plus daily budget alerts, because Anthropic provides no native cost attribution, no per-user telemetry, and no budget alerts. Add a second frontier provider behind a router abstraction with automatic failover on 429/5xx, since the documented 8x capacity-plan miss quantifies single-provider risk. Then re-baseline Claude Code and Opus benchmarks after the post-Colossus capacity changes — pre-May numbers are contaminated and will misattribute capacity noise to prompt or model changes.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

Keep reading.