Data Science daily

Edition 2026-05-21 · read as Data Science

AnthropicEndsClaudeSubscriptionDiscount,30-DayCliff

Sources
36
Words
1,766
Read
9min

Topics Agentic AI LLM Inference AI Regulation

◆ The signal

Anthropic converted Claude subscriptions to dollar-matched metered API credits this week, killing the 70-90% effective discount that powered most agent SDK and batch eval workloads — and a June 15 cliff cuts third-party tool credits entirely. Meanwhile, Vercel's production telemetry across 200K teams confirms 59% of all tokens are now agentic multi-turn traces. Your cost model was already wrong; it just became quantifiably wrong, with a 30-day deadline attached.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic's Triple Economic Reset

    act now

    Claude subscriptions now cap at dollar-equivalent API credits (killing the 70-90% alt-harness subsidy). June 15 cuts third-party tool credits with no rollover. ServiceNow burned its full-year Claude budget by May. OpenAI launched a 2-month-free Codex enterprise switch promo the same day.

    70-90%
    subsidy eliminated
    9
    sources
    • Effective discount lost
    • Third-party cliff date
    • Capacity miss factor
    • Colossus GPUs leased
    1. May 7Credit metering announced
    2. May 14Rate limits doubled
    3. June 15Third-party tool credits severed
    4. H2 2026Colossus integration stabilizes
  2. 02

    59% Agentic: Eval and Cost Models Obsolete

    act now

    Vercel's 200K-team production data shows 59% of tokens are agentic. Anthropic captures 61% of spend via Opus, Google captures 38% of volume via Flash. MDASH's 100+ agent ensemble beat single models on CyberGym. Single-turn eval harnesses now measure the minority of production traffic.

    59%
    tokens now agentic
    6
    sources
    • Agentic token share
    • Anthropic spend share
    • Google volume share
    • Teams in dataset
    1. Agentic tokens59
    2. Single-turn tokens41
  3. 03

    Training Efficiency Step-Changes: 2-360x

    monitor

    Three research drops shift pretraining/post-training economics. Nous TST: 2-3x wall-clock speedup at matched FLOPs, validated to 10B. Datology: +11.7 pts on VLM benchmarks at 17x less compute. NVIDIA Star Elastic: 360x cheaper model-family derivation from one post-training run. TST has no inference-side change — spike it first.

    17x
    compute reduction (VLM)
    1
    sources
    • TST speedup
    • Datology compute savings
    • Star Elastic cost cut
    • Datology benchmark lift
    1. Nous TST3
    2. Datology VLM17
    3. Star Elastic360
  4. 04

    AI Cyber Capability Crosses Full-Takeover Threshold

    monitor

    Anthropic's Mythos is the first model to clear both UK AISI simulated attack ranges — full network takeover, not just advanced persistence. Mozilla's custom harness surfaced 271 Firefox bugs with the same model that found only 1 in curl. The harness, not the model, determined the 270x yield difference. Patch SLAs tuned to CVE cadence are measuring the wrong clock.

    271
    bugs found (harness)
    5
    sources
    • Mozilla bugs found
    • curl bugs found
    • AISI ranges cleared
    • PraisonAI exploit time
    1. Mozilla (custom harness)271
    2. curl (generic scan)1
  5. 05

    Data Infrastructure: Single-Node Tools Go Multi-Node

    background

    DuckDB shipped Quack HTTP client-server protocol, making embedded analytics a shared service without custom API servers. Kafka Share Groups decouple consumer parallelism from partition count with 8x throughput at 32 instances. Lakehouse column stats remain the silent query-planner killer — stale/missing stats cost 3x before anyone notices.

    8x
    Kafka throughput gain
    1
    sources
    • Kafka Share Group gain
    • Instances tested
    • Orgs ready for agents
    • Data modeling pain
    1. DuckDB Quack100
    2. Kafka Share Groups8
    3. Agentic AI readiness15

◆ DEEP DIVES

  1. 01

    Anthropic's 30-Day Pricing Cliff: Metered Credits, June 15 Cutoff, and the OpenAI Counter-Offensive

    What Changed This Week

    Anthropic shipped three changes simultaneously that compound into one budgeting problem. First, all Claude subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% subsidy that power users extracted from Max plans is gone. Second, on June 15, Claude usage through third-party tools (Conductor, Zed, OpenCode, T3 Code) moves to a separate credit bucket with no rollover. Overflow bills at list API rates. Third, Dario Amodei conceded Anthropic planned for 10x growth and got 80x, which explains months of quiet degradation and the emergency lease of xAI's full 220,000-GPU Colossus 1 cluster.


    Why This Breaks Your Stack

    The metering change is not a gentle price increase. It is a structural reclassification of how programmatic usage is billed. Any batch eval, enrichment pipeline, or agent loop running on a flat subscription is now burning metered tokens at list price. ServiceNow's CDIO confirmed publicly that they burned their full-year Claude budget by May, after price hikes hit an enterprise with no native per-user telemetry.

    Anthropic provides no native per-user, per-tool usage attribution. Customers must wire external analytics to see who is consuming what.

    The capacity story compounds the pricing story. The 80x miss means serving conditions your eval harness measured between mid-summer and now are contaminated for baselining. Rate limits on Opus are being raised, Claude Code 5-hour caps are doubling, and a heterogeneous fleet (H100 + H200 + GB200 via Colossus) means p95/p99 latency variance will increase during integration, not decrease. The thing this doesn't tell you is which slice of your traffic lands on which silicon.

    The OpenAI Counter-Move

    Sam Altman posted a 2-month-free Codex enterprise switch promo the same day Anthropic announced metering. Ramp's April data shows Anthropic edging OpenAI 34.4% vs 32.3%, the first apparent lead change. OpenAI is pricing explicitly against the developers Anthropic just alienated. The asymmetric free evaluation window expires in roughly 60 days.

    What the Market Data Actually Shows

    MetricAnthropicOpenAISignal Quality
    Ramp B2B share34.4%32.3%SMB card-spend biased
    ARR trajectory$9B → $30B+ in 4 monthsNot disclosedWSJ-sourced
    Valuation~$900B (offered)$852BPrivate marks
    October IPOTargetingN/ACFO hired

    Ramp measures who gets billed, not token volume or production criticality. A 210-basis-point gap in a monthly snapshot is inside noise. The directional signal, that second-vendor adoption is now the default, is the actionable read.

    Action items

    • Audit every Claude-backed workload (Agent SDK, claude-p, GitHub Actions, batch evals) and project token burn under new metered pricing by end of this sprint
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts within 2 weeks
    • Activate OpenAI's 2-month Codex enterprise switch promo and run head-to-head against existing Claude eval harness using matched prompts
    • Re-baseline all Claude benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus integration stabilizes — do not ship workarounds built against degraded measurements

    Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel's AI Gateway production index · Anthropic passed OpenAI in enterprise share this quarter

  2. 02

    59% of Production Tokens Are Agentic — Your Eval Harness and Cost Model Measure the Wrong Thing

    The Production Data

    Vercel's AI Gateway, covering 200,000 teams over 7 months, reports that 59% of all tokens are now agentic — multi-turn, tool-calling traces rather than single-shot completions. Six months ago the figure was under 20%. The spend/volume split is the more interesting cut: Anthropic captures 61% of spend via Opus on expensive reasoning, while Google captures 38% of volume via Flash on cheap fan-out. The data shows no vendor loyalty. Customers route by task.

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

    Why Single-Turn Evals Are Now Measuring the Minority

    Agentic workloads are not 'just more tokens.' They are bursty, multi-turn, tool-calling, with 15:1 input-to-output ratios against the 3:1 most cost models assume. The cost function is latency-per-step times number-of-steps, plus retry risk from bad tool calls. A forecast built on last year's ratio is off by roughly 5x on spend, and the error is not symmetric across vendors. The median request stops being the right summary statistic. p95 does the work.

    MDASH Validates Multi-Agent Decomposition

    Microsoft's MDASH, a 100+ agent system, beat Anthropic's Mythos on the CyberGym vulnerability benchmark by decomposing into scan → adversarial debate → PoC construction stages. The thing this doesn't tell you is which stage drives the lift, or what it cost. There is no ablation and no cost comparison. The architectural pattern is consistent with ensemble priors — specialized agents with explicit disagreement tend to generalize better than monolithic calls — but consistent-with is not evidence-for.

    ArchitectureBest ForCost ProfileEval Approach
    Single frontier modelNarrow, latency-sensitivePredictable per-callSingle-turn accuracy
    Tiered routing (Opus+Flash)Mixed reasoning/throughput20-40% savings at parityPer-node quality + trajectory cost
    Multi-agent decompositionComplex, verifiable tasksHigher per-task, lower per-errorTrajectory-level + tool-call F1

    The Abridge Reference Architecture

    Abridge runs 80M+ clinical conversations through a 'constellation of models': cheap triage in front, expensive reasoning behind, LLM judges calibrated against human annotators, memory externalized to event-driven stores rather than model weights. The transferable production pattern is the confidence-gated router. Only 40% of requests reach frontier-class reasoning. The rest are handled by 7-13B models at 5-10x lower cost.

    What the 30% MCP Overhead Means

    Glean's benchmark is vendor-sponsored with undisclosed methodology. It claims off-the-shelf MCP uses 30% more tokens and loses 2.5x head-to-head against a tuned knowledge graph on agentic tasks. Treat that as a hypothesis, not a result. The failure mode is plausible regardless: MCP tool listings inflate context windows, and naive tool outputs return verbose blobs where a reranked snippet would do. The 30% number is one to falsify on your own workload this sprint, not to cite.

    Action items

    • Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Instrument per-node token cost across your agent graph and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
    • Run a 1-hour spike measuring token overhead of current MCP/tool-calling setup vs. retrieval-first baseline on 100 production traces
    • Prototype a decompose-debate-verify pipeline on one auto-verifiable workload (code gen, SQL, extraction) and measure accuracy delta at comparable token cost

    Sources:The CyberGym result · Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs · Glean's benchmark

  3. 03

    Three Training Efficiency Results That Change Q3 Unit Economics

    The Landscape

    Three research drops landed in the same cycle, each targeting a different cost bottleneck in the training pipeline. Individually, each is worth a spike. Together they suggest the marginal dollar in training has moved from raw compute to recipe optimization and data curation.


    Nous Research: Token Superposition Training (TST)

    TST reports 2-3x wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M through 10B-A1B MoE scale. The mechanism superimposes multiple token sequences during training, lifting effective throughput without touching the served model structure.

    Why spike this first: nothing changes downstream at serving. If it replicates on a 1B continued-pretraining run, it is a free 2-3x on every subsequent training run. Replication risk is medium. Single source, but the claim is clean and falsifiable.

    Datology: Data Curation Beats Compute

    At 2B params, Datology reports +11.7 points on 20 VLM benchmarks, beating InternVL3.5-2B by about 10 points at 17x less training compute, purely through data curation. At 4B params, they hit near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B.

    This is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation.

    The serving win is real. Lower response FLOPs means cheaper inference on every request. The risk is benchmark-selection bias. Twenty benchmarks sounds broad, but the specific selection matters, and the thing this doesn't tell you is how the curated data performs on slices outside that suite.

    NVIDIA Star Elastic

    Claims one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining a family, and 7x better than SOTA compression. Scale not specified in the ledger.

    Caveat: this is the kind of headline number that shrinks under independent eval. Even a 30x hold would restructure how teams produce size tiers for deployment. 360x from a lab-reported result deserves heavy discounting until reproduced.

    Comparative Assessment

    WorkClaimValidated ScaleInference ImpactReplication RiskSpike Priority
    Nous TST2-3x wall-clock270M → 10B MoENoneMediumFirst
    Datology+11.7 pts at 17x less compute2B, 4BLower serving FLOPsMediumSecond
    Star Elastic360x cheaper familiesNot specifiedProduces size tiersHighWait for repro

    What This Means for GPU Budget Planning

    Against the 4:1 demand-to-supply ratio at Nebius and capacity selling out quarterly, these results offer a hedge: recipe-level efficiency gains can partially offset a capacity squeeze. TST alone, if it replicates at 2x, means the same training budget buys twice the runs. Combined with Datology's curation thesis, the highest-leverage Q3 investment is likely a data quality team, not more GPU hours.

    Action items

    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline within 3 weeks
    • Audit your VLM/multimodal training data pipeline for curation quality — measure data diversity, duplication rate, and quality score per shard
    • Lock H2 GPU reservations across 2+ providers before quarterly sellouts tighten further
    • Track Star Elastic replication attempts — do not plan model-family production around 360x until an independent lab confirms even 30x

    Sources:Claude just metered your agent SDK calls

  4. 04

    Harness Dominates Model: The 271-to-1 Ratio and What It Means for Agent Evaluation

    The 270x Yield Gap

    Two teams ran Claude Mythos Preview against large C codebases in the same window. Mozilla wrapped a custom agentic harness around their existing fuzzing infrastructure and reported 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed the same model at curl with a generic scanner and got exactly 1 low-severity CVE alongside 4 false positives.

    Same weights. Roughly two orders of magnitude in yield. The variable that moved was the scaffolding.

    When a frontier model yields 271 bugs for one team and 1 CVE for another against the same language, the harness is the product, not the model.

    Why This Generalizes Beyond Security

    The pattern maps onto ordinary ML evaluation. A team debating Claude 4 vs. GPT-5 vs. Gemini for an internal tool is optimizing the wrong variable. On this evidence, the gap puts harness investment ahead of model selection by at least 50x. Mozilla's harness emits reproducible test cases, scales across ephemeral VMs, and feeds existing signal pipelines. The model contributes capability. The harness converts capability into measurable signal. The thing the 271 number doesn't tell you is how much of that yield is unique vs. duplicate clusters, which would tighten the multiple. It probably loosens it instead.

    The AISI Threshold Crossing

    Mythos is also the first model to clear both UK AISI simulated attack ranges, meaning full network takeover rather than advanced persistence alone. GPT-5.5-cyber cleared one of two. AISI is building harder tests because the current ladder is saturating, which is consistent with a discrete capability unlock rather than smooth interpolation. The closest analog is GPT-3.5 to GPT-4 on agentic benchmarks. It is possible the gap is narrower than two ranges suggests once variance is accounted for, but two-of-two vs. one-of-two is hard to explain by noise.

    Operational implications for agent deployments

    Old AssumptionNew RealityWhat Changes
    Refusal-rate evals gate releasesChain-completion evals requiredAdd staged attack-chain rubric (recon → access → lateral → persist → exfil)
    Patch SLA tuned to CVE cadenceModel release cadence is fasterCompress critical-patch windows; add model-release-triggered security review
    Model choice is the key decisionHarness architecture is the key decisionInvest in domain-specific scorers and reproducible test-case emitters over model swaps

    LLM-as-Verifier: A Methodological Upgrade

    A parallel finding reinforces the harness thesis. LLM-as-a-Verifier beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing evaluation into repeated binary verifications with token-level scoring. The mechanism is straightforward: one high-variance categorical judgment replaced by k lower-variance binary ones. The same argument moved human eval from Likert scales to pairwise preferences a decade ago.

    Practical translation: rewrite one pairwise judge as a decomposed verifier, measure tie rate before and after, and the width of the bootstrap CI on a known A/B pair. If CIs tighten at equal compute, roll to the rest of the suite. Cheapest variance reduction available this quarter.

    Action items

    • Spike a domain-specific agentic harness on one internal tool (code review bot, data quality checker) modeled on Mozilla's pattern: reproducible test cases + ephemeral VM scaling + integration with existing pipelines
    • Add a staged cyber-capability tier to your agent release gate: recon → initial access → lateral movement → persistence → exfil, run against every model upgrade
    • Rewrite one pairwise LLM-judge eval as a decomposed binary verifier and compare tie-rate and CI width on a known A/B pair
    • Persist full agent trajectories (tool calls, intermediate state, file diffs) and audit a stratified random sample of 'passing' rollouts for reward hacking

    Sources:Mozilla shipped 271 bugs over the period · Mythos cleared the AISI attack ranges · Two data points this week broke the AI cyber trend · The UK AISI evaluations report · PraisonAI auth bypass exploited in 4 hours · LLM-as-a-Verifier reframes evaluation

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server protocol — Spark/Glue jobs under 100GB on 2-node clusters are now candidates for a Fargate + DuckDB + Terraform pattern at ~50% cost/latency reduction

    DuckDB shipped a client-server mode this week

  • Duolingo publicly pegged AI-generated content slop at ~20% requiring human QC — the first honest production quality number at scale; benchmark your own acceptance rate against it

    Duolingo's twenty percent AI slop rate

  • Update: PraisonAI (multi-agent framework) auth bypass exploited in 4 hours of CVE disclosure — agent orchestration frameworks now have same-day patching requirements, not quarterly

    PraisonAI zero-day was popped in four hours

  • Cerebras IPO closed +70% at $311 with a $20B OpenAI commitment — first dollar-weighted proof that non-Nvidia inference silicon has a production buyer; benchmark CS-3 API this quarter

    Cerebras IPO validates non-Nvidia silicon

  • TML-Interaction-Small reports 0.40s full-duplex turn-taking latency vs. 1.18s for GPT-Realtime-2.0 — a 3x gap on the metric that determines perceived voice naturalness; architecture is continuous multi-stream, not turn-based

    TML is reporting 0.40 seconds of full-duplex latency

  • Only 15% of organizations have the data foundation for agentic AI per Fivetran; data quality/lineage is the #1 blocker cited by ~50% — gate agent projects on readiness scorecards, not model capability

    DuckDB shipped a client-server mode this week

  • SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — largest open agentic trace corpus; pull now before licensing frictions accumulate

    Claude just metered your agent SDK calls

  • Persona drift in LLM agents is measurable within 8 dialogue turns per COLM 2024 — add a verbal-tic canary to system prompts and instrument per-turn retention as a free drift signal

    AI personas drift within eight turns

  • PCAOB/COSO guidance now requires deterministic execution and tamper-evident audit trails for ML in regulated finance — transformers are non-deterministic by default and the usual fixes (temp=0, fixed seeds) aren't sufficient for audit

    The transformer underwriting models are outperforming

  • Nebius reported 684% YoY revenue growth with 4+ customers per GPU brought online — lock H2 GPU reservations across 2+ providers before quarterly sellouts tighten further

    The 4:1 ratio is the headline number

◆ Bottom line

The take.

Anthropic killed the flat-rate subsidy that powered most agent SDK workloads, Vercel's 200K-team production data confirms 59% of tokens are now agentic multi-turn traces, and three training-efficiency results (2-17x) landed in the same week — meaning your cost model, your eval harness, and your training budget are simultaneously stale, with a June 15 deadline on the first, an already-shipping reality on the second, and a 4:1 GPU demand ratio making the third increasingly urgent.

— Promit, reading as Data Science ·

Frequently asked

How do I project token burn under Anthropic's new metered pricing before the June 15 cutoff?
Audit every Claude-backed workload — Agent SDK, claude-p, GitHub Actions, batch evals, and third-party harnesses like Conductor, Zed, or OpenCode — and re-price their token volume at list API rates, since the 70-90% subsidy from Max plans is gone. Because Anthropic ships no native per-user or per-tool attribution, deploy an LLM gateway like LiteLLM or Portkey with per-feature tagging and daily budget alerts within two weeks. ServiceNow burned its full-year Claude budget by May without this visibility.
Why are single-turn evals no longer sufficient for production agents?
Vercel's gateway data across 200K teams shows 59% of all tokens are now agentic multi-turn traces with roughly 15:1 input-to-output ratios, versus the 3:1 most cost models assume. Single-turn accuracy benchmarks measure the minority of production traffic and miss tool-call precision, steps-to-completion, cost-per-successful-task, and error recovery — the metrics that actually drive agent unit economics. Cost forecasts built on last year's ratios are off by roughly 5x on spend.
Which training-efficiency result should I spike first, and which should I wait on?
Spike Nous Research's Token Superposition Training first: it claims 2-3x wall-clock speedup at matched FLOPs from 270M through 10B-A1B MoE scale with zero inference-time architecture change, so a successful 1B continued-pretraining replication compounds across every future run. Datology's curation result (+11.7 points at 17x less compute on VLMs) is the second priority. Wait on NVIDIA Star Elastic's 360x family-production claim until an independent lab confirms even 30x — headline numbers in that range routinely shrink 5-10x under outside eval.
If the same model yields 271 bugs for one team and 1 CVE for another, how should that change model-vs-harness investment?
Mozilla's custom agentic harness around Firefox fuzzing produced 271 bugs from Claude Mythos while a generic scanner against curl produced 1 low-severity CVE — same weights, ~270x yield gap driven entirely by scaffolding. That puts harness investment ahead of model selection by at least 50x for most internal tools. Build domain-specific scorers, reproducible test-case emitters, and ephemeral VM scaling before debating Claude vs. GPT-5 vs. Gemini for the next workload.
What's the cheapest variance reduction I can apply to my eval suite this quarter?
Rewrite one pairwise LLM-as-a-Judge eval as a decomposed LLM-as-a-Verifier: replace a single high-variance categorical judgment with k lower-variance binary verifications using token-level scoring. Measure tie rate and bootstrap CI width on a known A/B pair before and after — if CIs tighten at equal compute, roll the pattern across the rest of the suite. This is the same shift that moved human eval from Likert scales to pairwise preferences a decade ago.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

Keep reading.