Data Science daily

Edition 2026-05-28 · read as Data Science

Anthropic Ends Subscription-to-API Discount, Breaks Agents

Sources: 36 · Words: 1,343 · Read: 7 min

Topics: Agentic AI · LLM Inference · AI Capital

◆ The signal

Anthropic killed the 70–90% effective discount on programmatic Claude usage this week and is leasing xAI's entire 220,000-GPU Colossus 1 cluster to cover an 8x capacity miss (planned for 10x growth, got 80x). OpenAI launched a 2-month-free Codex enterprise switch promo the same day. If your agent stack runs on Claude subscriptions converted to API credits—Agent SDK, GitHub Actions, batch evals—your unit economics broke silently this week. Re-price before the June 15 third-party tool cutoff or you're making a vendor decision by default.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic Pricing Shock + 8x Capacity Miss

    act now

    Anthropic metered all programmatic Claude usage at list API rates, erasing the implicit subscription subsidy. An 8x capacity forecast miss forced an emergency lease of xAI's 220K-GPU Colossus 1 cluster. ServiceNow burned its full-year Claude budget by May. OpenAI countered with free Codex for 60 days.

    8x capacity forecast miss · 11 sources
    Metrics tracked: Anthropic B2B share, OpenAI B2B share, Colossus 1 GPUs, growth miss ratio, Claude Code limit
    B2B spend share (Ramp): Anthropic 34.4%, OpenAI 32.3%
  2. 02

    59% Agentic: Eval Harness & Cost Models Are Stale

    monitor

    Vercel's AI Gateway shows 59% of production tokens are now agentic multi-turn traces. Anthropic captures 61% of spend (Opus), Google captures 38% of volume (Flash). Single-turn eval harnesses and 3:1 I/O cost models are measuring the minority of traffic. The real ratio is 15:1 input-heavy.

    59% of tokens now agentic · 5 sources
    Metrics tracked: Anthropic spend share, Google volume share, agentic I/O ratio, MCP token overhead
    Token share: agentic traffic 59%, single-turn traffic 41%
  3. 03

    Compute Crunch Quantified: 4:1 Demand, Training Efficiency Responds

    monitor

    Nebius posted 684% YoY revenue growth with 4+ customers per GPU, and Cisco AI orders are jumping from $5B to $9B. Three research drops respond: TST delivers a 2-3x wall-clock speedup at matched FLOPs, Datology achieves a +11.7-point VLM improvement at 17x less compute, and NVIDIA Star Elastic claims 360x cheaper model-family derivation.

    4:1 GPU demand-to-supply · 6 sources
    Metrics tracked: Nebius 2026 guide, Nebius YoY growth, TST speedup, Datology compute savings, Cisco AI order growth
    Claimed efficiency gains: TST (wall-clock) 3x, Datology (compute) 17x, Star Elastic (cost) 360x
  4. 04

    Autonomous Cyber Capability Clears AISI Threshold

    monitor

    Anthropic's Mythos is the first model to clear both UK AISI simulated attack ranges (full network takeover). Google confirmed a hacking group using LLMs to build offensive tooling in the wild. Palo Alto scanned 130+ products with AI-driven discovery. Eval harnesses without cyber-capability tiers are miscalibrated.

    2/2 AISI ranges cleared · 6 sources
    Metrics tracked: Mythos AISI tests, GPT-5.5-cyber tests, products scanned, prior gen tier
    AISI outcome by model: Mythos (new): Full takeover; GPT-5.5-cyber: Partial takeover; Mythos (prior): Adv. persistence
  5. 05

    DuckDB Client-Server + Kafka Share Groups: Single-Node Era Ends

    background

    DuckDB shipped HTTP client-server (Quack protocol), eliminating the one-process constraint. Kafka Share Groups report 8x throughput by decoupling consumers from partition count. Combined signal: tools built for single-node analytics are growing multi-node features—expect a wave of migration posts from teams discovering their 2022 schemas need updating.

    8x Kafka Share Groups · 1 source
    Metrics tracked: Share Groups scaling, orgs ready for agents, stats gap root cause
    Relative throughput: partition-bound (before) 1x, Share Groups (after) 8x

◆ DEEP DIVES

  1. 01

    Anthropic's 8x Capacity Miss Broke the Product — Your Multi-Provider Deadline Is June 15

    What Actually Happened

    At Code with Claude on May 6, Dario Amodei conceded that Anthropic planned for 10x growth and got 80x in revenue and usage. That delta is the whole story behind weeks of reported Claude Code degradation, quality drift, and throttling. The emergency fix is leasing xAI's entire Colossus 1 cluster — 220,000+ NVIDIA GPUs spanning H100, H200, and GB200, from the CEO who called Anthropic 'misanthropic and evil' three months earlier.

    At the same time, Anthropic killed the implicit subsidy on programmatic usage. Every Claude subscription now converts to dollar-matched API credits across Agent SDK, claude -p, GitHub Actions, and third-party harnesses. What was an effective 70–90% discount on alternative-harness usage is gone. Starting June 15, third-party tools (Zed, Conductor, OpenCode, T3) get a separate credit bucket with no rollover, overflowing at list API rates.


    Why This Is Different From a Normal Price Hike

    This is not a list-price increase. It is Anthropic closing the arbitrage between subscription-flat and metered-per-token that power users were exploiting. ServiceNow's CDIO burned through the full-year Claude budget by May. National Life Group's CIO publicly called Claude 'great for consumer usage but not great for companies' wanting per-user monitoring.

    Anthropic provides no native per-user telemetry, no tool-level consumption breakdown, no SLAs on latency or availability, and no budget alerts. The vendor offloaded observability to you.

    What the headline pricing change does not show is how wide the observability gap is. The table below compares it against enterprise SaaS norms:

    Capability                 | Industry Standard | Anthropic Today
    ---------------------------|-------------------|----------------
    Per-user attribution       | Native dashboards | Not exposed
    Budget alerts / soft caps  | Standard          | Absent
    Latency/availability SLAs  | Contractual       | None
    Anomaly detection          | Built-in          | Absent
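
With none of these capabilities exposed natively, attribution has to happen on the client side. A minimal sketch of the idea, assuming you can wrap every call and read token counts from the response. `SpendLedger`, the user and feature tags, the per-million-token prices, and the daily cap are all hypothetical placeholders, not Anthropic, LiteLLM, or Portkey APIs:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpendLedger:
    """Client-side attribution: tag each call with user and feature, sum the cost."""
    daily_budget_usd: float  # hypothetical soft cap per user per day
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, user, feature, input_tokens, output_tokens,
               usd_per_m_input, usd_per_m_output):
        # Prices are passed in explicitly; substitute your contract's real rates.
        cost = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
        self.spend[(user, feature)] += cost
        return cost

    def user_total(self, user):
        return sum(v for (u, _), v in self.spend.items() if u == user)

    def over_budget(self):
        """Users whose accumulated spend exceeds the soft cap."""
        users = {u for (u, _) in self.spend}
        return sorted(u for u in users if self.user_total(u) > self.daily_budget_usd)


# Illustrative numbers only.
ledger = SpendLedger(daily_budget_usd=5.0)
ledger.record("alice", "batch-eval", 2_000_000, 100_000, 3.0, 15.0)
ledger.record("bob", "ci-review", 100_000, 10_000, 3.0, 15.0)
print(ledger.over_budget())
```

In practice a gateway would persist this per day and push the `over_budget()` list to an alerting channel; the in-memory dictionary here only illustrates the tagging scheme.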

    The Counter-Offensive

    OpenAI shipped a 2-month-free Codex enterprise switch promo the same day. Ramp's April spend data put Anthropic ahead of OpenAI 34.4% vs 32.3%, the first apparent lead change in business adoption. Read this as OpenAI pricing directly into the cohort Anthropic just alienated. With Anthropic targeting an October IPO (CFO hired, margin-per-token now a board metric), the base rate says more monetization moves, not fewer.

    Capacity Recovery Timeline

    Rate limits are loosening: Claude Code 5-hour limits are doubling, peak-hours throttling is removed for Pro/Max, and Opus API limits are 'substantially' raised. But stitching GB200-class hardware into an existing H100/H200 serving fleet under one API contract is not a clean swap. Expect p95/p99 variance during the transition. Any Claude benchmark from before May 7 is stale and should be rerun before it informs a routing decision.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against new credit cap this sprint
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily spend alerts by end of sprint
    • Run a 2-month Codex evaluation under OpenAI's enterprise switch promo using your existing eval harness with matched prompts
    • Re-run Claude Code and Opus API benchmarks (throughput, p95 latency) after Colossus 1 integration stabilizes before locking Q3 routing decisions
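
The first action item, reconciling projected token burn against the credit cap, reduces to a few lines of arithmetic. The sketch below is one way to set it up; the workload names, run counts, token counts, per-million-token prices, and the $200 credit figure are hypothetical inputs you would replace with measured values and your actual plan terms:

```python
def monthly_cost_usd(runs_per_day, input_tokens, output_tokens,
                     usd_per_m_input, usd_per_m_output, days=30):
    """Projected monthly cost of one workload at metered per-token rates."""
    per_run = (input_tokens * usd_per_m_input
               + output_tokens * usd_per_m_output) / 1_000_000
    return runs_per_day * days * per_run


def reconcile(workloads, monthly_credit_usd, usd_per_m_input, usd_per_m_output):
    """Return per-workload costs (largest first), the total, and any overage."""
    costs = {
        name: monthly_cost_usd(runs, tin, tout, usd_per_m_input, usd_per_m_output)
        for name, (runs, tin, tout) in workloads.items()
    }
    total = sum(costs.values())
    ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, total, max(0.0, total - monthly_credit_usd)


# Hypothetical inputs: (runs per day, input tokens per run, output tokens per run).
workloads = {
    "github-actions-review": (200, 30_000, 2_000),
    "batch-evals": (50, 150_000, 10_000),
}
ranked, total, overage = reconcile(workloads, monthly_credit_usd=200.0,
                                   usd_per_m_input=3.0, usd_per_m_output=15.0)
for name, cost in ranked:
    print(f"{name}: ${cost:,.2f}/month")
print(f"total ${total:,.2f}, overage beyond credits ${overage:,.2f}")
```

Ranking by cost makes it obvious which workload to move or throttle first if the overage is not acceptable.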

    Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with

  2. 02

    59% of Your Tokens Are Agentic — The Eval Harness, Cost Model, and Router All Need a Rewrite

    The Production Data

    Vercel's AI Gateway production index covers 200,000 teams over 7 months. It puts agentic workloads at 59% of all token volume, up from under 20% six months ago. The split inside that number: Anthropic takes 61% of spend through Opus on reasoning and planning nodes, Google takes 38% of volume through Flash on high-throughput utility calls. No vendor loyalty in the data. The market is already routing by task, whether the code is or not.

    A serving layer hard-coded to one provider SDK is out of step with what 200K production teams are actually running.


    Three Things That Break

    1. The Eval Harness

    Most eval suites score single-turn responses against reference answers. The median production request, though, is a multi-step tool loop with retries, and final-answer accuracy can be 90%+ in both good and bad runs. What pass/fail hides is the cost path: a planner that burns 40,000 tokens arguing with itself before settling on an answer looks identical to one that gets there directly. Microsoft's MDASH (100+ agents) beat Anthropic's Mythos on CyberGym via scan → adversarial debate → PoC exploitation staging. The topology won, not the model.
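
To make the cost path visible, a harness can score whole trajectories instead of final answers. A minimal sketch, assuming each run is logged with a pass flag, step count, token count, and dollar cost; the `Trajectory` record and the sample numbers are illustrative, not taken from any of the cited sources:

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trajectory:
    passed: bool        # final-answer pass/fail, as a single-turn harness would score it
    steps: int          # tool calls and model turns taken
    total_tokens: int
    cost_usd: float


def summarize(trajectories):
    """Trajectory-level metrics that pass/fail alone does not show."""
    passed = [t for t in trajectories if t.passed]
    total_cost = sum(t.cost_usd for t in trajectories)
    return {
        "pass_rate": len(passed) / len(trajectories),
        # Failed runs still cost money, so divide total spend by successes only.
        "cost_per_successful_task": total_cost / len(passed) if passed else float("inf"),
        "median_steps_to_completion": median(t.steps for t in passed) if passed else None,
        "max_tokens": max(t.total_tokens for t in trajectories),
    }


# Two hypothetical agent configurations with identical pass rates.
lean = [Trajectory(True, 4, 6_000, 0.030), Trajectory(True, 5, 7_000, 0.035)]
wasteful = [Trajectory(True, 19, 40_000, 0.200), Trajectory(True, 22, 46_000, 0.230)]
print(summarize(lean))
print(summarize(wasteful))
```

Both configurations score the same on pass rate; only the cost and step metrics separate them.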

    2. The Cost Model

    Models fit on 3:1 input-output ratios are off by ~5x on agentic traces, which run closer to 15:1 input-heavy with variable cache-hit rates across providers. Glean's benchmark (vendor-published, no methodology disclosed) claims MCP uses 30% more tokens than retrieval-tuned knowledge graphs. Take the number with caveats. The directional signal is harder to dismiss: SAP and ServiceNow both converged on KG + MCP as the enterprise architecture, which means pure vector RAG is losing ground.
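
The effect of the ratio shift on per-request cost can be checked with simple arithmetic. The sketch below uses placeholder prices and a placeholder cache discount (none of these figures come from a provider price sheet) and holds output length fixed while moving the input-to-output ratio from 3:1 to 15:1:

```python
def request_cost_usd(output_tokens, input_per_output, usd_per_m_input,
                     usd_per_m_output, cache_hit_rate=0.0, cached_discount=0.0):
    """Cost of one request given an input:output token ratio.

    cache_hit_rate is the fraction of input tokens served from cache;
    cached_discount is the fractional price reduction on those tokens.
    Both are assumptions of this sketch and vary by provider.
    """
    input_tokens = output_tokens * input_per_output
    effective_input = input_tokens * (1.0 - cache_hit_rate * cached_discount)
    return (effective_input * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


PRICE_IN, PRICE_OUT = 3.0, 15.0  # hypothetical USD per million tokens

single_turn = request_cost_usd(1_000, 3, PRICE_IN, PRICE_OUT)
agentic = request_cost_usd(1_000, 15, PRICE_IN, PRICE_OUT)
agentic_cached = request_cost_usd(1_000, 15, PRICE_IN, PRICE_OUT,
                                  cache_hit_rate=0.7, cached_discount=0.9)

print(f"3:1 model:        ${single_turn:.4f}")
print(f"15:1 agentic:     ${agentic:.4f}")
print(f"15:1 with cache:  ${agentic_cached:.4f}")
```

Under these placeholder prices the input volume is 5x higher while the dollar cost is about 2.5x higher, because output tokens are priced above input tokens. How far the token error translates into a dollar error depends on your own price ratio and cache-hit rate, which is why both are explicit parameters.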

    3. The Router

    A router treating every call as independent leaves money and latency on the table. The components that matter for agentic traffic are session-aware routing, KV cache reuse across turns, and model selection based on tool-calling reliability rather than MMLU deltas.
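
One simple form of session-aware routing is session affinity: send every turn of a session to the same backend so that any prefix or KV cache on that backend stays warm. The sketch below uses rendezvous hashing for this; the backend names are hypothetical, and whether the cache is actually reused depends on the serving stack behind each backend:

```python
import hashlib


def pick_backend(session_id, backends):
    """Rendezvous (highest-random-weight) hashing: stable backend per session."""
    def score(backend):
        return hashlib.sha256(f"{session_id}|{backend}".encode()).digest()
    return max(backends, key=score)


backends = ["replica-a", "replica-b", "replica-c"]  # hypothetical serving replicas
first_turn = pick_backend("session-42", backends)
later_turn = pick_backend("session-42", backends)
print(first_turn, later_turn)
```

Rendezvous hashing is used here instead of `hash(session_id) % n` because removing or adding a backend only remaps the sessions that were pinned to that backend, so most sessions keep their warm cache through a fleet change.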

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

    The Production Architecture Pattern

    Abridge (80M+ clinical conversations, 250 health systems) has disclosed enough to confirm what the rest of the field is converging on: cheap fast model triages, larger model reasons only when needed, with LLM judges calibrated against human annotators and memory externalized from weights into event-driven stores. The 5-10x cost reduction at scale comes from routing, not from model swaps.

    Pattern     | Production Approach                                  | Common Mistake
    ------------|------------------------------------------------------|--------------------------------------
    Routing     | Confidence-gated: cheap triage → expensive reasoning | Frontier model for every request
    Evals       | LLM judge + periodic human re-anchoring              | LLM judge alone, never re-calibrated
    Memory      | External event-driven store                          | Bake state into fine-tuned weights
    Cost metric | $/successful-task                                    | $/token (misleading for agentic)
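
The confidence-gated routing row can be sketched in a few lines. This is an illustration of the pattern, not Abridge's implementation: the two model functions are stubs, the 0.8 threshold is arbitrary, and how a real triage model produces a usable confidence score is left open:

```python
def route(prompt, cheap_model, expensive_model, threshold=0.8):
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    return expensive_model(prompt), "expensive"


def cheap_stub(prompt):
    # Stand-in triage model: pretends to be confident only on short prompts.
    confidence = 0.95 if len(prompt) < 40 else 0.30
    return f"triage:{prompt}", confidence


def expensive_stub(prompt):
    # Stand-in for the larger reasoning model.
    return f"reasoned:{prompt}"


print(route("reset my password", cheap_stub, expensive_stub))
print(route("reconcile these three conflicting medication histories in detail",
            cheap_stub, expensive_stub))
```

The threshold is the knob that trades cost for quality, so it belongs in the same calibration loop as the LLM judge: tune it against $/successful-task, not against the cheap model's raw accuracy.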

    Action items

    • Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Segment token spend by workload type (agentic vs single-shot) and benchmark Flash/Haiku substitution on non-reasoning nodes within 2 weeks
    • Run a 1-hour spike measuring MCP tool-calling token overhead vs. a retrieval-first baseline on 100 production traces
    • Add LLM-judge-to-human-annotator agreement as a tracked SLI, re-calibrated quarterly with Cohen's kappa
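
For the last item, Cohen's kappa corrects raw judge-to-human agreement for the agreement expected by chance. A small standard-library sketch follows; the label lists are made-up examples, and scikit-learn's `cohen_kappa_score` computes the same statistic if that dependency is already available:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass (1) / fail (0) verdicts on the same ten traces.
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(f"kappa = {cohens_kappa(judge, human):.3f}")
```

Here raw agreement is 80%, but kappa is noticeably lower because both raters pass most traces; tracking kappa instead of raw agreement keeps the SLI from looking healthy on an imbalanced label mix.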

    Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · MCP plus knowledge graphs is the combination · Abridge runs model routing across 100M conversations

  3. 03

    Three Training Efficiency Drops That Reshape Your H2 Compute Budget

    The Supply Squeeze

    Nebius posted 684% YoY revenue growth with 4+ customers competing for every GPU brought online, and guided to $3–3.4B in 2026 from a $530M 2025 base. Capacity sold out in Q1. Cisco corroborates independently: AI product orders from hyperscalers moving from $5B → $9B (+80%) next fiscal year, with an explicit memory-hardware shortage forcing product redesigns. Cerebras IPO'd at $56B with OpenAI's $20B commitment signed in December. The compute supply is booked.

    Against that backdrop, three research drops this week change the unit economics of pretraining and distillation.


    The Efficiency Drops

    Work                | Claim                                                       | Scale Validated      | Inference Impact                                | Replication Risk
    --------------------|-------------------------------------------------------------|----------------------|-------------------------------------------------|----------------------------------
    Nous Research TST   | 2-3x wall-clock at matched FLOPs                            | 270M → 10B-A1B MoE   | None (no architecture change)                   | Medium; single-source, clean claim
    NVIDIA Star Elastic | 360x cheaper model-family derivation; 7x vs SOTA compression | Not specified        | Produces size family from one post-training run | High; lab-reported headline
    Datology VLM        | +11.7pts on 20 benchmarks; 17x less compute                 | 2B and 4B            | Lower response FLOPs (real serving win)         | Medium; benchmark-selection risk

    TST is the one to replicate first. It's a pretraining recipe change with no inference-side cost, so the migration math is clean. If it holds at even 1.6x on internal data, it pays for itself on the next full run. Star Elastic's 360x will shrink under independent eval. A 30x hold still restructures model-family production. Datology's result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation.
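
The "pays for itself" reasoning is easy to check if you assume that cost scales with wall-clock time on a fixed cluster. That assumption belongs to this sketch, not to the TST report, and the 100,000 GPU-hour baseline is a hypothetical figure:

```python
def gpu_hours_saved_fraction(speedup):
    """Fraction of GPU-hours saved by a wall-clock speedup on a fixed cluster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


BASELINE_GPU_HOURS = 100_000  # hypothetical size of the next full run

for speedup in (1.6, 2.0, 3.0):
    saved = gpu_hours_saved_fraction(speedup)
    print(f"{speedup}x speedup: {saved:.1%} fewer GPU-hours "
          f"({saved * BASELINE_GPU_HOURS:,.0f} of {BASELINE_GPU_HOURS:,})")
```

The savings curve flattens quickly: going from no speedup to 1.6x recovers more GPU-hours than going from 1.6x to 3x, which is why a partial replication is still worth acting on.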

    Three research results in one week each claiming 2x–360x efficiency gains: even if each shrinks by half under replication, the compute budget anchored on last quarter's baselines is wrong.

    What This Means for H2 Planning

    The 4:1 demand-to-supply ratio means reserved capacity beats on-demand in H2. The efficiency drops point at a different question: what type of compute is worth booking. Full pretraining at frontier scale is consolidating into the top 3-5 labs. For everyone else, the higher-leverage path is:

    1. Post-training on proprietary data (Abridge's model: 80M+ conversations beats frontier on narrow tasks)
    2. Distillation (SWE-ZERO-12M-trajectories: 112B tokens, 12M trajectories, now open corpus)
    3. Data curation over compute scaling (Datology's thesis, validated at 2B and 4B params)

    Only 15% of organizations have the data foundation for agentic AI at scale (Fivetran). Data quality and lineage are cited as the #1 blocker by ~50% of respondents. The thing this doesn't tell you directly: half the agent projects funded this quarter are data-platform projects with an agent on top.

    Action items

    • Lock H2 2026 GPU reservations across 2+ providers (Nebius, CoreWeave, hyperscaler) before next quarterly sellout
    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline within 4 weeks
    • Pull SWE-ZERO-12M-trajectories and stand up preprocessing pipeline (dedup, license filter, language stratification) this month
    • Run a head-to-head Cerebras Inference API benchmark against current Nvidia H100 setup on representative Llama/Mistral workload

    Sources: Claude just metered your agent SDK calls · The 4:1 ratio is the headline number · The headline claim is that AI models have reached full network takeover · DuckDB shipped a client-server mode this week · Cerebras IPO validates non-Nvidia silicon

◆ QUICK HITS

  • Update: New critical CVEs hit lakehouse stack — Apache Iceberg (CVSS 9.9) allows metadata write redirect to attacker-controlled S3; Polaris (9.9) broadens credentials cross-tenant; Argo CD (9.6) leaks plaintext K8s secrets

    LiteLLM landed in the KEV catalog this week

  • PraisonAI auth bypass (CVE-2026-44338) exploited in 4 hours post-disclosure — agent orchestration frameworks are now a first-class attack surface with same-day patching requirements

    PraisonAI, an open-source multi-agent framework, was weaponized within four hours

  • TML-Interaction-Small hits 0.40s full-duplex turn-taking latency vs 0.57s Gemini Flash Live and 1.18s GPT-Realtime — a 3x spread on the metric that determines perceived naturalness

    TML is reporting 0.40 seconds of full-duplex latency

  • Duolingo publicly pegs AI-generated content slop at ~20% requiring human QC — a rare production quality number worth benchmarking your own generation pipeline against

    Duolingo's twenty percent AI slop rate

  • DuckDB shipped Quack (HTTP client-server protocol) making it viable as a shared service — audit Glue/EMR catalog for single-node candidates under ~100GB

    DuckDB shipped a client-server mode this week

  • Honeypot data: exposed Ollama/MCP endpoints indexed by Shodan in 3 hours, 23% of traffic targets AI-specific paths (/v1/models, /.well-known/mcp.json) — attackers now use dedicated LLM-Scanner tool

    An Ollama endpoint exposed to the public internet gets picked up by Shodan in about three hours

  • LLM-as-a-Verifier (decomposed binary verifications at token-level) beats LLM-as-a-Judge on tie-rate and decision accuracy — a one-day eval pipeline refactor for variance reduction

    The security newsletter this week has an item worth opening with

  • Persona drift measurable within 8 dialogue turns (Li et al., COLM 2024) — embed a distinctive verbal tic as a free canary token; regex detection costs zero and fires before expensive failures

    AI personas drift within eight turns

  • x402 batched sub-cent payments are now built into AWS Bedrock AgentCore under Linux Foundation governance; 92.8% of real agentic payments run on Base, 99.8% in USDC

    x402 + Bedrock AgentCore: sub-cent payments just became your agent's billing layer

  • New COSO/PCAOB guidance requires deterministic execution and tamper-evident audit trails for ML in regulated finance — stochastic LLM decoding is structurally non-compliant

    The transformer underwriting models are outperforming the gradient-boosted baselines
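
The persona-drift item above suggests a verbal tic as a canary. A minimal sketch of the regex check, assuming the persona prompt instructs the assistant to include a fixed phrase in every reply; the phrase "quite so" and the function name are invented for illustration:

```python
import re

# Hypothetical tic the persona is instructed to use in every reply.
CANARY = re.compile(r"\bquite so\b", re.IGNORECASE)


def first_drift_turn(assistant_turns, pattern=CANARY):
    """Return the 1-based index of the first reply missing the canary, or None."""
    for index, text in enumerate(assistant_turns, start=1):
        if not pattern.search(text):
            return index
    return None


turns = [
    "Quite so, here is the plan.",
    "Quite so. Step two is the migration.",
    "Sure, step three is the rollback.",
]
print(first_drift_turn(turns))
```

A missing canary is only a cheap early signal that the persona prompt is losing influence; it does not say what drifted, so it is best used to trigger a fuller check.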

◆ Bottom line

The take.

Anthropic's 8x capacity miss and metering of programmatic usage broke the cost model for every Claude-dependent agent stack this week, while Vercel's production data shows 59% of tokens are already agentic — meaning your eval harness is scoring the minority of traffic and your cost model is off by ~5x. The three things that need action this sprint: a multi-provider router with spend telemetry (because Anthropic won't give you per-user attribution), trajectory-level eval metrics (because single-turn accuracy hides 15:1 token burn), and locked H2 compute reservations (because 4:1 demand-to-supply means today's spot prices are tomorrow's floor).

— Promit, reading as Data Science ·

Frequently asked

What changed in Anthropic's pricing this week and why does it break agent stack economics?
Anthropic eliminated the implicit 70–90% discount on programmatic Claude usage by converting subscriptions to dollar-matched API credits across Agent SDK, claude -p, GitHub Actions, and third-party harnesses. Power users who exploited the gap between flat subscriptions and metered tokens now pay list rates. Starting June 15, third-party tools (Zed, Conductor, OpenCode, T3) get a separate credit bucket with no rollover, overflowing at full API rates.
Why are single-turn eval harnesses inadequate for current production traffic?
Vercel's data across 200,000 teams shows agentic workloads are now 59% of token volume, but most eval suites score single-turn responses against reference answers. Final-answer accuracy can hit 90%+ in both efficient and wasteful runs, hiding cost paths where a planner burns 40,000 tokens before completing. Trajectory-level metrics—tool-call precision/recall, steps-to-completion, cost-per-successful-task, and error recovery—are needed to score the majority of real traffic.
Which of the three training efficiency results should be replicated first and why?
Nous Research's Token Superposition Training is the cleanest replication target because it's a pretraining recipe change claiming 2-3x wall-clock at matched FLOPs with no inference-side architecture change. That makes the migration math straightforward—even a 1.6x hold on internal data pays for itself on the next full run. NVIDIA Star Elastic's 360x and Datology's 17x claims carry higher replication risk from lab-reported headlines and benchmark selection.
What observability gaps should teams expect when running Claude in production?
Anthropic exposes no native per-user attribution, no tool-level consumption breakdown, no latency or availability SLAs, no budget alerts, and no anomaly detection—all of which are standard in enterprise SaaS. Customers must deploy their own LLM gateway (LiteLLM, Portkey) with per-user and per-feature tagging plus daily spend alerts. Without that instrumentation, silent overruns accumulate for weeks before surfacing in monthly invoices, as ServiceNow experienced burning a full-year budget by May.
How should H2 2026 compute budgeting change given the supply-demand picture?
Reserved capacity across multiple providers (Nebius, CoreWeave, hyperscalers) should be locked now because the 4:1 demand-to-supply ratio means spot and on-demand vanish first—Nebius sold out Q1 capacity and guided to $3–3.4B in 2026. The higher-leverage spend for non-frontier labs has shifted from full pretraining toward post-training on proprietary data, distillation from open trajectory corpora like SWE-ZERO-12M, and data curation, since only 15% of organizations have the data foundation for agentic AI at scale.
