Data Science daily

Edition 2026-05-18 · read as Data Science

Anthropic Ends Claude Subscription Subsidy on June 15

Sources: 36 · Words: 1,729 · Read: 9 min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

On June 15 Anthropic ends the programmatic discount: every Claude subscription converts to dollar-matched API credits, removing the 70-90% effective subsidy that quietly funded most Agent SDK, GitHub Action, and batch eval workloads. OpenAI shipped a 2-month-free Codex enterprise promo the same day the change was announced, which is not a coincidence. The cap is denominated in dollars, but production token burn under agent workloads determines whether the next invoice matches the forecast, and teams have a 60-day window to measure that against the alternative.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic Credit Cliff + Capacity Crisis

    act now

    June 15 metering change eliminates 70-90% programmatic subsidy. Anthropic planned for 10x growth, got 80x — now leasing xAI's entire 220K-GPU Colossus 1 cluster. Enterprise share crossed OpenAI (34.4% vs 32.3% per Ramp). Rate limits doubling, but any benchmark from before May 7 is stale.

    80x capacity miss vs plan · 10 sources
    Enterprise billing share (Ramp): Anthropic 34.4% · OpenAI 32.3%
  2. 02

    59% Agentic: Eval Harness Measuring the Minority

    monitor

    Vercel's AI Gateway production index shows 59% of tokens are now agentic multi-turn workloads. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. MDASH (100+ agents) beat single models on CyberGym. Single-turn eval harnesses are scoring the minority of production traffic.

    59% agentic token share · 6 sources
    Token mix: agentic 59% · single-turn 41%
  3. 03

    Training Efficiency: Three Papers Change Unit Economics

    monitor

    Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated 270M→10B). Datology beats InternVL3.5-2B by 10pts on VLM benchmarks at 17x less compute via curation alone. NVIDIA Star Elastic produces a model-size family from one post-training run at 360x lower cost than pretraining.

    17x compute reduction (Datology) · 1 source
    Headline multipliers: Nous TST 3x · Datology VLM 17x · Star Elastic 360x
  4. 04

    Lakehouse & ML Infra: New CVSS 9.9 Wave

    act now

    Apache Iceberg (CVE-2026-42812, CVSS 9.9) lets attackers redirect table metadata to poisoned S3 paths. Apache Polaris (9.9) broadens credentials cross-tenant. Argo CD (9.6) exposes plaintext K8s Secrets. n8n and Kestra both hit 9.8 SQLi. Combined with LiteLLM on KEV, the entire ML reference architecture has CVSS 9+ exposure this cycle.

    9.9 CVSS (Iceberg/Polaris) · 3 sources
    CVSS ranking: Iceberg/Polaris 9.9 · n8n / Kestra 9.8 · Argo CD 9.6 · Ollama GGUF 9.1 · LiteLLM (KEV) 9
  5. 05

    Compute Supply: 4:1 Demand Ratio Quantified

    background

    Nebius reported 4+ customers per GPU with 684% YoY revenue growth, guiding $3-3.4B for 2026. Cerebras IPO'd at $56B with OpenAI's $20B commitment. Cisco AI product orders are jumping from $5B to $9B. A memory hardware shortage is driving redesigns. Reserved capacity wins over on-demand in H2 2026.

    4:1 demand-to-supply ratio · 5 sources
    Nebius chart values: 530 (2025) · 3200 (2026 guide)

◆ DEEP DIVES

  1. 01

    Anthropic's June 15 Credit Cliff: Reconcile Every Claude Workload This Sprint

    What Changed

    Anthropic quietly converted every subscription plan into a dollar-matched API credit cap for programmatic usage. Starting June 15, Claude consumption through Agent SDK, claude -p, GitHub Actions, and third-party harnesses (Zed, Conductor, OpenCode, T3 Code) draws from a separate credit bucket equal to plan value. There is no rollover and no subsidized tokens, and overflow bills at API list rates. The 70-90% effective discount power users were extracting from $200 Max plans is gone.
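
To make the size of the change concrete, here is a minimal arithmetic sketch. It assumes the "effective discount" means the fraction of list-price usage a flat plan was absorbing, and that after the change the bill is the plan fee plus any usage above a credit bucket equal to plan value. Both function names are hypothetical, and the model ignores taxes, caching discounts, and any plan-specific terms.

```python
def api_equivalent_usage(plan_price: float, effective_discount: float) -> float:
    """Usage at list rates (USD) that a flat plan was absorbing.

    effective_discount is the fraction of list price the user was NOT paying,
    e.g. 0.9 for a 90% subsidy. This is an illustrative definition.
    """
    if not 0 <= effective_discount < 1:
        raise ValueError("effective_discount must be in [0, 1)")
    return plan_price / (1 - effective_discount)


def monthly_bill_after_cap(plan_price: float, usage_at_list: float) -> float:
    """Plan fee plus overflow, assuming credits equal plan value and
    overflow is billed at list rates (as described in the text above)."""
    overflow = max(0.0, usage_at_list - plan_price)
    return plan_price + overflow


if __name__ == "__main__":
    for discount in (0.7, 0.9):
        usage = api_equivalent_usage(200, discount)
        bill = monthly_bill_after_cap(200, usage)
        print(f"{discount:.0%} subsidy: ${usage:,.0f} list usage -> ${bill:,.0f} bill")
```

Under these assumptions, a $200 plan running at a 70-90% effective discount corresponds to roughly $667 to $2,000 of list-price usage per month, which is what the same workload would cost once the cap applies.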

    This lands the same week Dario Amodei admitted Anthropic planned for 10x growth and got 80x, forcing an emergency lease of xAI's entire 220,000-GPU Colossus 1 cluster (H100, H200, and GB200). Rate limits on Claude Code are doubling and peak-hours throttling is being removed. The serving fleet is now heterogeneous and still stabilizing.


    Why It Matters Now

    Any eval harness, batch job, or agent loop running through a Claude subscription was implicitly subsidized. ServiceNow already burned its full-year Claude budget by May under those economics. The problem compounds because Anthropic provides no native per-user or per-tool usage telemetry. The invoice does not tell you which tenant or prompt is driving spend, and you cannot recover that without building your own gateway instrumentation.

    If the vendor cannot tell you which user burned the token, the cost problem is actually an observability problem, and the customer owns it before the next invoice.

    On the same day, OpenAI dropped a 2-month-free Codex enterprise switch promo, a targeted counter-offensive aimed at the exact developers Anthropic just alienated. Ramp's April data shows the first-ever Anthropic lead in enterprise billing share (34.4% vs 32.3%). The methodology captures who gets invoiced, not token volume or workload criticality, which is a different question.


    Cross-Source Tension

    Ten independent sources cited the Anthropic enterprise crossover, but the signal contradicts itself. Anthropic is winning enterprise share even as quality degrades and effective prices rise, with capacity leased from a competitor's datacenter. The resolution: bottoms-up developer adoption is real, and so is the capacity wall. Any benchmark run between mid-April and May 7 is contaminated by capacity-driven degradation and should be discarded for baselining purposes.

    | Action | Timeline | Rationale |
    |---|---|---|
    | Reconcile all Claude-backed workloads against new credit cap | This week | Silent overrun already accruing |
    | Deploy LLM gateway with per-user/per-feature token tagging | This sprint | Anthropic offloaded observability to customer |
    | Run OpenAI Codex evaluation under 2-month promo | Start now (60-day window) | Asymmetric-payoff free trial; compare on own harness |
    | Re-baseline Claude benchmarks post-Colossus integration | After June 1 | Serving conditions shifting again |

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and project token burn against the new credit cap by end of this week
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-tenant, per-feature tagging and daily budget alerts within this sprint
    • Initiate OpenAI Codex evaluation under the 2-month enterprise promo by end of next week
    • Avoid locking in annual Anthropic commits until post-Colossus integration stability is observable (target late June assessment)
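
The per-tenant, per-feature tagging and daily budget alert in the action items can be sketched as a small in-memory ledger. This is an illustration of the bookkeeping only, not the LiteLLM or Portkey API: `UsageLedger`, its fields, and the prices used in the example are all hypothetical, and a real gateway would persist records and read token counts from provider responses.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageLedger:
    # USD per million tokens. Placeholder values supplied by the caller,
    # not real list prices.
    input_price_per_mtok: float
    output_price_per_mtok: float
    daily_budget_usd: float
    # Spend keyed by (day, tenant, feature).
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, day: str, tenant: str, feature: str,
               input_tokens: int, output_tokens: int) -> float:
        """Attribute one request's cost to a (day, tenant, feature) tag."""
        cost = (input_tokens * self.input_price_per_mtok
                + output_tokens * self.output_price_per_mtok) / 1_000_000
        self.spend[(day, tenant, feature)] += cost
        return cost

    def daily_total(self, day: str) -> float:
        return sum(c for (d, _, _), c in self.spend.items() if d == day)

    def over_budget(self, day: str) -> bool:
        return self.daily_total(day) > self.daily_budget_usd

    def top_spenders(self, day: str, n: int = 3):
        """Largest (tenant, feature) tags for the day, highest spend first."""
        rows = [((t, f), c) for (d, t, f), c in self.spend.items() if d == day]
        return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


if __name__ == "__main__":
    ledger = UsageLedger(3.0, 15.0, daily_budget_usd=10.0)
    ledger.record("2026-05-18", "acme", "batch-eval", 1_000_000, 200_000)
    ledger.record("2026-05-18", "acme", "agent-loop", 2_000_000, 0)
    print(ledger.over_budget("2026-05-18"), ledger.top_spenders("2026-05-18", 1))
```

Keying spend by tag rather than by API key is the design choice that answers "which tenant or feature burned the tokens" without vendor-side telemetry.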

    Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Anthropic's ARR tripled ($9B → $30B+)

  2. 02

    59% Agentic: Your Eval Harness Is Scoring the Minority of Traffic

    The Production Data

    Vercel's AI Gateway production index covers seven months of telemetry across 200,000 teams. Agentic workloads now sit at 59% of token volume, up from under 20% six months ago. The structure worth acting on is the spend-volume split: Anthropic captures 61% of dollars through Opus on reasoning, while Google captures 38% of volume through Flash on high-throughput utility. Switching is frequent. Vendor loyalty is not observed in the data.

    This is consistent with MDASH on CyberGym, where Microsoft's 100+ agent ensemble beat Anthropic's single-model Mythos by decomposing the workload into scan → adversarial debate → PoC exploitation stages. The lift looks like it came from the decomposition pattern rather than raw agent count, but there is no ablation isolating which stage carries the work. Treat the attribution as a hypothesis.


    What This Breaks

    Most production eval harnesses still score single-turn responses against reference answers. When 59% of tokens are multi-turn tool-calling traces, final-answer accuracy can hit 90%+ on both the expensive and the cheap path. What that number does not measure is the cost of getting there: a planner that burns 40K tokens arguing with itself before giving up looks identical to one that solves the task in two calls. Cost models fitted when input-output ratios sat at 3:1 are off by roughly 5x on agentic traces, where input-output ratios run closer to 15:1 with heavy variance from cache reuse.
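
A minimal sketch of trajectory-level scoring shows how two agents with identical final-answer accuracy can differ sharply on cost. The `Trajectory` record and `summarize` function are hypothetical names, the token counts are invented for illustration, and prices are passed in as parameters rather than taken from any provider's list.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    succeeded: bool      # did the final answer pass the check?
    tool_calls: int      # steps taken to get there
    input_tokens: int
    output_tokens: int


def summarize(trajectories, input_price_per_mtok, output_price_per_mtok):
    """Accuracy plus the cost-path metrics that single-turn scoring omits."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    n = len(trajectories)
    successes = sum(t.succeeded for t in trajectories)
    total_in = sum(t.input_tokens for t in trajectories)
    total_out = sum(t.output_tokens for t in trajectories)
    total_cost = (total_in * input_price_per_mtok
                  + total_out * output_price_per_mtok) / 1_000_000
    return {
        "accuracy": successes / n,
        "mean_tool_calls": sum(t.tool_calls for t in trajectories) / n,
        "input_output_ratio": total_in / total_out if total_out else float("inf"),
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }


if __name__ == "__main__":
    lean = [Trajectory(True, 2, 3_000, 1_000)] * 2
    heavy = [Trajectory(True, 20, 30_000, 2_000)] * 2
    # Illustrative prices in USD per million tokens.
    print(summarize(lean, 3.0, 15.0))
    print(summarize(heavy, 3.0, 15.0))
```

Both sets score 100% accuracy in this toy example; only the ratio, step count, and cost-per-success columns separate them.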

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

    The routing pattern at scale

    | Role in Agent Graph | Optimal Model Tier | Evidence |
    |---|---|---|
    | Planning/reasoning | Opus/GPT-4 class | 61% spend share on reasoning nodes |
    | Utility (rewrite, extract, classify) | Flash/Haiku class | 38% volume share; 5-10x cost reduction |
    | Critic/verifier | Mid-tier or specialist | MDASH debate stage; Abridge LLM judges |
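
The routing pattern can be sketched as a lookup from node type to model tier. The task labels and tier names below are hypothetical placeholders, not model identifiers, and the fallback to the most capable tier for unrecognized nodes is a conservative assumption rather than something the data prescribes.

```python
# Hypothetical node labels; adapt to whatever your agent graph emits.
UTILITY_TASKS = {"rewrite", "extract", "classify", "summarize"}
VERIFY_TASKS = {"critique", "verify"}
REASONING_TASKS = {"plan", "reason"}


def pick_tier(task_type: str) -> str:
    """Map an agent-graph node type to a model tier name."""
    task = task_type.strip().lower()
    if task in UTILITY_TASKS:
        return "small"      # Flash/Haiku-class in the table above
    if task in VERIFY_TASKS:
        return "mid"        # mid-tier or specialist critic
    if task in REASONING_TASKS:
        return "frontier"   # Opus/GPT-4-class planner
    # Unknown node types fall back to the most capable tier so that a
    # labeling gap costs money rather than quality.
    return "frontier"


if __name__ == "__main__":
    for node in ("extract", "verify", "plan", "unlabeled-node"):
        print(node, "->", pick_tier(node))
```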

    Abridge's architecture across 80M+ clinical conversations shows the same shape: cheap triage in front, expensive reasoning behind, LLM judges calibrated against annotated data. The non-obvious finding is that post-training on proprietary domain data still beats frontier models on cost and latency once volume gets serious. The crossover point is an empirical question on your stack, not a theoretical one.


    Two Reinforcing Data Points

    Glean's benchmark reports off-the-shelf MCP using 30% more tokens and losing 2.5x head-to-head preference against a tuned knowledge graph on agentic tasks. Vendor-published, methodology undisclosed. Treat as hypothesis until someone replicates it. Separately, SAP and ServiceNow both shipped Knowledge Graph + MCP architectures this quarter, which suggests RAG-over-docs is losing ground to structured KG grounding for enterprise accuracy. The evals that matter now are tool-use success@1, hallucinated-argument rate, and multi-step completion. MMLU does not measure the bottleneck.

    Action items

    • Add trajectory-level metrics (tool-call precision/recall, steps-to-completion, cost-per-successful-task) to your eval harness this sprint
    • Instrument per-node token cost and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
    • Run a 1-hour spike measuring token overhead of your current MCP/tool-calling setup vs a retrieval-first baseline on 100 sampled production traces
    • Add LLM-judge-to-human-annotator agreement (Cohen's kappa) as a tracked SLI, computed quarterly, alerting if it drops by more than 0.05 (5 points)
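
For the judge-agreement SLI, Cohen's kappa can be computed directly from paired labels. The sketch below is a plain-Python version of the standard two-rater formula; `kappa_dropped` and its default threshold are hypothetical, and reading the 5-point alert threshold as 0.05 on kappa's scale is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items with matching labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def kappa_dropped(previous: float, current: float, threshold: float = 0.05) -> bool:
    """True when agreement fell by more than the alert threshold."""
    return (previous - current) > threshold


if __name__ == "__main__":
    judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
    human = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
    print(round(cohens_kappa(judge, human), 3))
```

Raw percent agreement would overstate judge quality when one label dominates; kappa corrects for the agreement expected by chance, which is why it is the better SLI.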

    Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result is the kind of finding · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs is the combination · AI Gateway data puts agentic workloads at fifty-nine percent

  3. 03

    Lakehouse Trust Boundary Shrank: Iceberg/Polaris CVSS 9.9 Poisons Training Data

    The New Attack Surface

    This cycle's CVE disclosures land squarely on the ML data stack. Apache Iceberg (CVE-2026-42812, CVSS 9.9) lets a writer redirect table metadata to an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features. Apache Polaris (CVE-2026-42809/10/11, CVSS 9.9) widens credentials across tenants, which turns a compromised analyst notebook into S3/GCS credential theft.

    Add Argo CD (CVSS 9.6), which exposes plaintext Kubernetes Secrets in reachable namespaces, including model-registry tokens, HuggingFace PATs, and cloud credentials. The path from analyst notebook compromise → cross-tenant data access → training data poisoning → model corruption is no longer hypothetical.


    What Most Teams Miss

    Default lakehouse observability tracks row changes, not pointer changes. The Iceberg vulnerability shifts the storage location itself, and standard logging will not flag it. Row-level logging does not tell you whether your features are still drawn from the table you think they are. There is no hard error and no schema violation, just gradually poisoned features or labels flowing into training.

    | Component | CVE / CVSS | Blast Radius | Detection Gap |
    |---|---|---|---|
    | Apache Iceberg | CVE-2026-42812 / 9.9 | Poisoned tables, corrupted training data | Metadata pointer mutation not logged by default |
    | Apache Polaris | CVE-2026-42809-11 / 9.9 | S3/GCS creds, cross-tenant access | Credential scope expansion invisible at app layer |
    | Argo CD 3.2-3.3 | CVE-2026-42880 / 9.6 | Plaintext K8s Secret extraction | Read-only role is a misnomer |
    | n8n / Kestra | 9.8 / 9.8 | Workflow DB, OAuth sessions, schedules | Broad credentials by design |

    Credential exposure is a pivot into everything else the warehouse touches. The fix is boring: rotate credentials and narrow what the table catalog service is allowed to reach. The time to do it is before the rotation becomes an incident response task.

    The Compounding Factor

    This week's lakehouse stats audit from TLDR Data adds a layer worth pricing in. Stale or missing column stats on Iceberg/Delta tables produce silent query-plan degradation. An attacker who poisons metadata pointers can also strip stats, so downstream queries scan more data and return corrupted results. The failure mode is invisible to most monitoring. No error, just 3x compute spend and quietly wrong features.

    Workflow orchestrators (n8n at 9.8 SQLi, Kestra at 9.8) generally run with broad database and cloud credentials because they orchestrate everything. Scope service accounts to the minimum per workflow and audit network reach from orchestrator hosts. These are the tools most ML teams have neither constrained nor monitored.

    Action items

    • Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every Kubernetes Secret in reachable namespaces by end of this week
    • Audit Iceberg/Polaris catalog configurations: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
    • Add metadata-pointer-change alerting to your lakehouse observability stack (log when table location, partition spec, or manifest list paths change)
    • Run ANALYZE/compute-stats coverage audit across your Iceberg/Delta tables and add stats freshness to table-level SLAs
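
The pointer-change alert and write-path allowlist can be sketched as a diff between two catalog snapshots. This is not the Iceberg or Polaris API: the snapshot dictionaries, the field names `location` and `metadata_location`, and the bucket names are hypothetical stand-ins for whatever your catalog exposes, and how the snapshots are collected is left out.

```python
# Fields treated as storage pointers; extend to match your catalog's metadata.
TRACKED_FIELDS = ("location", "metadata_location")


def pointer_changes(previous: dict, current: dict, allowed_prefixes) -> list:
    """Report every tracked pointer that differs between two snapshots.

    Each snapshot maps table name -> {field: URI}. Tables that are new in
    `current` are reported too, with old=None.
    """
    prefixes = tuple(allowed_prefixes)
    alerts = []
    for table, now in current.items():
        before = previous.get(table, {})
        for field_name in TRACKED_FIELDS:
            old, new = before.get(field_name), now.get(field_name)
            if old == new:
                continue
            alerts.append({
                "table": table,
                "field": field_name,
                "old": old,
                "new": new,
                # Prefixes should end with "/" so that "s3://lake-prod/"
                # does not also match "s3://lake-prod-evil/".
                "allowed": new is not None and new.startswith(prefixes),
            })
    return alerts


if __name__ == "__main__":
    prev = {"db.features": {"location": "s3://lake-prod/db/features"}}
    curr = {"db.features": {"location": "s3://lake-prod-evil/db/features"}}
    for alert in pointer_changes(prev, curr, ["s3://lake-prod/"]):
        print(alert)
```

Any change is worth logging; a change whose new target falls outside the allowlist is the one to page on.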

    Sources: LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · DuckDB shipped a client-server mode this week

  4. 04

    Training Efficiency: Three Results That Change Your Q3 Compute Budget

    The Breakthroughs

    The week's research drops sit on different segments of the training cost curve. The combined read: the marginal dollar in ML training has moved off raw compute and onto dataset engineering.

    | Work | Claim | Scale Validated | Inference Impact | Spike Priority |
    |---|---|---|---|---|
    | Nous Research TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None (no architecture change) | High: free speedup if it replicates |
    | Datology VLM Curation | +11.7 pts / 17x less compute | 2B and 4B params | Lower response FLOPs (serving win) | High: proves curation > compute |
    | NVIDIA Star Elastic | 360x cheaper model-family derivation | Not disclosed | Produces size tiers from one run | Medium: big number, lab-reported |

    Why TST Is the One to Spike First

    Token Superposition Training reports a 2-3x wall-clock speedup from a pretraining recipe change, with no inference-side architecture modification required. If it replicates at even 1.6x on a continued-pretraining run without a val-loss regression, it pays for itself on the next full run. The risk profile is low. The claim is single-source but clean, validated from 270M to 10B parameters, and the failure mode is 'no gain' rather than 'broken model.'
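
The break-even claim can be checked with back-of-envelope arithmetic. The sketch assumes training cost scales linearly with wall-clock time at a fixed GPU count, so a speedup of s cuts run cost by a factor of 1 - 1/s. The function names and dollar figures are hypothetical.

```python
def run_cost_savings(full_run_cost: float, speedup: float) -> float:
    """Dollars saved on one full run if wall-clock shrinks by `speedup`,
    assuming cost is proportional to wall-clock time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return full_run_cost * (1 - 1 / speedup)


def pays_for_itself(spike_cost: float, full_run_cost: float, speedup: float) -> bool:
    """True when savings on the next full run cover the cost of the spike."""
    return run_cost_savings(full_run_cost, speedup) >= spike_cost


if __name__ == "__main__":
    # Illustrative numbers only: a $100k full run and a $20k spike.
    for s in (1.6, 2.0, 3.0):
        saved = run_cost_savings(100_000, s)
        print(f"{s}x speedup saves ${saved:,.0f}; covers spike: "
              f"{pays_for_itself(20_000, 100_000, s)}")
```

Under that linear assumption a 1.6x replication saves 37.5% of the next run, so the spike pays back whenever it costs less than about three-eighths of a full run.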

    Datology: The Marginal Dollar Moved to Curation

    Datology reports +11.7 points across 20 VLM benchmarks at 2B params, beating InternVL3.5-2B by roughly 10 points at 17x less training compute, purely via curation. Their 4B model matches near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B, which is a serving-cost win, not just a training one. This is the cleanest evidence this year that the binding constraint in VLM training has shifted from GPU hours to dataset engineering.

    Star Elastic: Discount the Headline, Keep the Direction

    NVIDIA claims one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining the family, and 7x better than SOTA compression. Lab-reported numbers of this magnitude shrink under independent eval; that's the base rate, not a critique. Even at a 30x hold, the result restructures how size tiers get produced for deployment. One training run plus elastic derivation replaces the current practice of training 3-5 separate checkpoints.

    The marginal dollar in VLM training has moved from compute to curation. A team spending more on GPU hours than on dataset engineering is optimizing the wrong layer.

    Combined Implication

    These results arrive in a quarter where GPU demand-to-supply sits at 4:1 at Nebius and H2 capacity is selling out. The direction is clear. Curation-first pipelines are the binding hedge against a capacity squeeze, with recipe-level wins like TST as the secondary lever. Teams that only know how to throw FLOPs at a problem will find those FLOPs priced at a premium and wait-listed besides.

    Action items

    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline within 2 weeks
    • Audit your VLM/multimodal training pipeline's data curation vs. compute spend ratio this quarter
    • Lock H2 2026 GPU reservations across 2+ providers before quarterly sellouts tighten further
    • Evaluate Star Elastic methodology when paper drops for producing size tiers from a single post-training run instead of training multiple checkpoints

    Sources: Claude just metered your agent SDK calls · The 4:1 ratio is the headline number · The UK AISI evaluations report

◆ QUICK HITS

  • DuckDB shipped Quack HTTP protocol turning it into a client-server engine — credible Spark/Glue replacement for single-node workloads under 100GB; spike one job onto ECS Fargate + DuckDB pattern

    DuckDB shipped a client-server mode this week

  • Kafka Share Groups report 8x throughput by decoupling consumer parallelism from partition count — validated on I/O-bound workloads only; benchmark your most partition-bound consumer group first

    DuckDB shipped a client-server mode this week

  • Update: LiteLLM remains on CISA KEV (active exploitation); scope expanded to include all 1.81.16-1.83.7 versions — rotate all upstream provider API keys stored in its DB if not already done

    LiteLLM landed in the KEV catalog this week

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini Live and 1.18s GPT-Realtime: roughly a 3x gap against GPT-Realtime (about 1.4x against Gemini Live) on the metric that determines perceived naturalness in voice agents

    TML is reporting 0.40 seconds of full-duplex latency

  • Only 15% of organizations have the data foundation for agentic AI at scale (Fivetran); data quality/lineage is the #1 blocker cited by ~50% — use as gating scorecard before greenlighting agent projects

    DuckDB shipped a client-server mode this week

  • Duolingo publicly pegs AI-generated content rejection rate at ~20% — a rare production quality number; benchmark your own pipeline acceptance rate against this anchor

    Duolingo's twenty percent AI slop rate

  • Gemini reproducibly emits real phone numbers from training data (4 independent cases) — add PII extraction eval (canary insertion + divergence attacks) to LLM CI before your next release

    Gemini is the latest model to surface PII from its training data

  • LLM-as-a-Verifier outperforms LLM-as-a-Judge by decomposing criteria into repeated binary verifications with token-level scoring — eliminates tie problem; swap one eval pipeline as a one-day experiment

    An Ollama endpoint exposed to the public internet

  • PraisonAI zero-day exploited in 4 hours of disclosure (CVE-2026-44338) — all agent frameworks (LangChain, CrewAI, AutoGen) in same risk class; version-pin and subscribe to CVE feeds

    Agent stacks are now in scope for attackers

  • Mythos cleared both AISI simulated attack ranges — first model ever; AISI building harder tests because current ones are saturating; add staged cyber-capability rubric to agent release gates

    Mythos cleared the AISI attack ranges this week

  • Mozilla found 271 Firefox bugs with Claude Mythos + custom harness vs 1 low-severity CVE in curl with generic scan — the 271x gap is harness engineering, not model capability

    Mozilla shipped 271 bugs over the period in question

◆ Bottom line

The take.

Anthropic's June 15 credit change kills your programmatic discount while 59% of production tokens are now agentic multi-turn workloads your eval harness wasn't designed to measure — and this week's CVSS 9.9 Iceberg/Polaris CVEs mean an attacker with table-write permission can silently poison training data through metadata redirects that default logging doesn't catch. Reconcile Claude spend, rebuild the eval harness for trajectories, and patch the lakehouse before the next training run ingests corrupted features nobody was watching.

— Promit, reading as Data Science ·

Frequently asked

What exactly changes for Claude subscriptions on June 15, 2026?
Every Claude subscription converts into a dollar-matched API credit cap for programmatic usage. Agent SDK, claude -p, GitHub Actions, and third-party harness traffic will draw from that bucket at metered API rates with no rollover, eliminating the 70-90% effective subsidy power users were extracting from $200 Max plans. Overflow bills at list price.
Why are single-turn eval harnesses inadequate when 59% of tokens are agentic?
Final-answer accuracy looks identical whether a planner solves a task in two calls or burns 40K tokens arguing with itself before giving up. Single-turn scoring cannot see the cost path, tool-call precision, or steps-to-completion that dominate agentic spend. Cost models fitted at 3:1 input-output ratios are off by roughly 5x on traces where input ratios run closer to 15:1.
How can the Iceberg CVE poison training data without triggering alerts?
CVE-2026-42812 lets a writer redirect table metadata to an attacker-controlled S3 prefix, so subsequent queries read poisoned Parquet from a different location than expected. Default lakehouse observability tracks row changes, not pointer mutations, so there is no schema violation or hard error — features and labels flow into training silently corrupted until model behavior degrades.
Why prioritize Token Superposition Training over the other training-efficiency results?
TST reports a 2-3x wall-clock speedup at matched FLOPs from a pretraining recipe change alone, with no inference-side architecture modification and validation from 270M up to 10B-A1B MoE. The failure mode is 'no gain' rather than 'broken model,' and even a 1.6x replication on continued pretraining pays back the spike cost on the next full run.
What is the practical takeaway from Datology's VLM curation result?
The binding constraint in VLM training has shifted from GPU hours to dataset engineering. Datology beat InternVL3.5-2B by ~10 points across 20 benchmarks at 17x less training compute, and their 4B model matches near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B. Teams spending >80% of training budget on compute rather than curation are optimizing the wrong layer.
