Data Science daily

Edition 2026-05-30 · read as Data Science

Anthropic's June 15 Cap Ends the Hidden Agent Subsidy

Sources: 36 · Words: 1,833 · Read: 9 min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

Anthropic's June 15 credit metering removes what was effectively a 70-90% subsidy on Claude-backed agents and eval harnesses. Vercel's production index puts 59% of tokens in the agentic bucket, so the cost model is off on both price-per-token and tokens-per-task. What the headline number leaves out is how multi-turn traces compound under the new cap. Without reconciled attribution, the pricing decision gets made by default, and it first shows up on the invoice.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic June 15 Credit Reset + 80x Capacity Crisis

    act now

    Anthropic planned for 10x growth, got 80x, and is leasing xAI's entire 220K-GPU Colossus 1 cluster to recover. June 15 kills the implicit subsidy on third-party Claude tools (Zed, Conductor, OpenCode). ServiceNow already burned its full-year Claude budget by May. Multi-provider routing is no longer optional—it's the only hedge against an 8x capacity forecast error.

    80x growth vs 10x plan · 11 sources
    Tracked: Anthropic B2B share, OpenAI B2B share, Colossus GPUs leased, Opus 4.7 image cost, credit change date
    Chart: planned growth 10x vs actual growth 80x
  2. 02

    59% Agentic Tokens: Eval + Cost Models Simultaneously Broken

    act now

    Vercel's AI Gateway production index puts agentic workloads at 59% of token volume across 200K teams. Anthropic captures 61% of spend via Opus, Google captures 38% of volume via Flash. Single-turn eval harnesses now measure the minority of production traffic. Cost models built on 3:1 input-output ratios are off by ~5x when agentic traces run 15:1.

    59% of tokens now agentic · 6 sources
    Tracked: Anthropic spend share, Google volume share, agent bot bypass rate, agentic input:output
    Chart: agentic tokens 59% vs single-turn tokens 41%
  3. 03

    Training Efficiency Trifecta: TST 2-3x, Datology 17x, Star Elastic 360x

    monitor

    Three research drops in one week change pretraining and post-training economics. Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with zero inference architecture change (validated to 10B). Datology beat InternVL3.5-2B by ~10 points at 17x less compute via data curation alone. NVIDIA Star Elastic claims one post-training run produces a model-size family at 360x lower cost than pretraining each.

    17x compute savings (curation) · 3 sources
    Tracked: TST speedup, Datology VLM lift, Star Elastic savings, GPU demand ratio
    Chart: Nous TST 3x, Datology curation 17x, Star Elastic 360x
  4. 04

    Lakehouse Trust Boundary Shrank: Iceberg/Polaris CVSS 9.9

    monitor

    Apache Iceberg CVE-2026-42812 lets attackers redirect table metadata writes to attacker-controlled S3, poisoning downstream queries and training data silently. Apache Polaris has three CVSS 9.9 credential-broadening bugs enabling cross-tenant access. Argo CD 3.2-3.3 exposes plaintext K8s Secrets to read-only users. Combined path: compromised notebook → poisoned training data → cross-tenant credential theft.

    9.9 CVSS (Iceberg/Polaris) · 3 sources
    Tracked: Iceberg CVSS, Polaris CVEs, Argo CD CVSS, n8n SQLi CVSS
    Chart (CVSS): Iceberg metadata redirect 9.9, Polaris credential broadening 9.9, Argo CD secret read 9.6, n8n SQLi 9.8
  5. 05

    Autonomous Cyber Capability Crosses Threshold

    background

    Anthropic's Mythos is the first model to clear both AISI simulated attack ranges (full network takeover). GPT-5.5-cyber cleared one of two. Google's threat intel team observed actual AI-built cybercrime tooling in the wild. AISI is building harder tests because current ones are saturating. Patch SLAs calibrated to human-speed exploitation are now structurally behind model-release cadence.

    2 of 2 AISI ranges cleared · 5 sources
    Tracked: Mythos ranges cleared, GPT-5.5-cyber cleared, PraisonAI exploit time, Mozilla bugs found
    Chart: prior Mythos, advanced persistence only; new Mythos, full network takeover (2/2); GPT-5.5-cyber, full takeover (1/2); AISI response, building harder tests

◆ DEEP DIVES

  1. 01

    Anthropic's 80x Capacity Miss Has a June 15 Deadline Attached

    The Capacity Admission That Explains Everything

    At Code with Claude on May 6, Dario Amodei said Anthropic planned for 10x growth and got 80x in revenue and usage. An 8x forecast miss is sufficient to explain the Claude Code degradation, the quality complaints, and the infrastructure scramble. The patch is leasing xAI's entire Colossus 1 cluster—220,000+ NVIDIA GPUs spanning H100, H200, and GB200, from the CEO who three months ago called Anthropic 'misanthropic and evil'.


    The June 15 Credit Change Is a Hard Deadline

    Starting June 15, Claude usage through third-party tools (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket capped at plan value. There are no subsidized tokens and no rollover, and overflow bills at API rates. What was effectively a 70-90% discount on programmatic usage through Max plans is gone. Any cost model assuming flat-rate Claude consumption through non-native IDEs stops working on June 15.

    If the budget assumed flat subscription cost on Agent SDK, GitHub Actions, or claude -p pipelines, expect a silent overrun.
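
The overrun described above can be estimated before the invoice arrives. The sketch below is a minimal illustration, assuming the credit bucket equals the plan price and that usage is valued at per-token API rates. The `Rates` class, the prices, and the token counts are all hypothetical placeholders, not published figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Placeholder values, not published prices."""
    input_per_mtok: float
    output_per_mtok: float


def metered_value(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Value of the usage if every token were billed at API rates."""
    return (input_tokens / 1e6) * rates.input_per_mtok \
        + (output_tokens / 1e6) * rates.output_per_mtok


def monthly_bill(plan_price: float, input_tokens: int, output_tokens: int, rates: Rates) -> dict:
    """Assumes the credit bucket equals the plan price and overflow bills at API rates."""
    value = metered_value(input_tokens, output_tokens, rates)
    overflow = max(0.0, value - plan_price)
    return {
        "metered_value": value,
        "overflow": overflow,
        "total": plan_price + overflow,
        # Share of the metered value that the flat plan used to absorb.
        "implied_discount": 0.0 if value == 0 else max(0.0, 1 - plan_price / value),
    }


# Hypothetical workload: 150M input and 10M output tokens on a $200 plan.
bill = monthly_bill(200.0, 150_000_000, 10_000_000, Rates(5.0, 25.0))
```

With these made-up inputs the plan absorbed 80% of the metered value, which is the kind of gap the 70-90% figure describes. Substituting real token counts per workload gives the reconciliation number.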

    The Enterprise Share Crossover

    Ramp's AI Index puts Anthropic at 34.4% vs OpenAI at 32.3% of US businesses paying for AI, the first documented crossover. The caveat is what the index measures: corporate card spend, not token volume or production criticality. OpenAI correctly notes that large enterprises pay by invoice. The gap is 2 points in one monthly snapshot. Read it as a bottom-up adoption signal, not a share claim.

    What's changing in the next rate-limit window

    Surface | Before | After (May 7–14)
    Claude Code limits | 5-hour cap | Doubled
    Peak-hours throttle | Reduced for Pro/Max | Removed
    Opus API rate limits | Squeezed during crunch | 'Substantially raised'
    Fleet composition | Anthropic-only | Heterogeneous incl. GB200 via Colossus

    Any benchmark you ran between mid-April and early May is stale. Serving conditions changed, and they will change again as Colossus integrates. Re-baseline after the new caps land.


    The Telemetry Gap Compounds the Problem

    Anthropic provides no native per-user or per-tool usage telemetry. ServiceNow's CDIO burned through the full-year Claude budget by May. National Life Group's CIO calls Claude 'great for consumer usage but not great for companies' that need per-user monitoring. Token consumption in agentic workflows is non-linear: a reflection loop can 10x spend per task without proportional quality gain, and the signal arrives with the invoice.
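
One way to close the gap described above is to tag every call at the point where it leaves your own code. The sketch below is an in-memory illustration only: `TokenLedger` and its methods are hypothetical names, not part of any gateway product, and a real deployment would persist the counts and read token usage from provider responses.

```python
from collections import defaultdict
from datetime import date


class TokenLedger:
    """Hypothetical in-memory ledger keyed by (day, user, feature)."""

    def __init__(self, daily_budget_tokens: int):
        self.daily_budget_tokens = daily_budget_tokens
        self._usage = defaultdict(int)  # (day, user, feature) -> tokens

    def record(self, day: date, user: str, feature: str, tokens: int) -> None:
        """Add tokens from one call to the matching tag bucket."""
        self._usage[(day, user, feature)] += tokens

    def daily_total(self, day: date) -> int:
        """Total tokens recorded for a given day across all tags."""
        return sum(n for (d, _, _), n in self._usage.items() if d == day)

    def over_budget(self, day: date) -> bool:
        """True once the day's total exceeds the configured budget."""
        return self.daily_total(day) > self.daily_budget_tokens

    def top_spenders(self, day: date, limit: int = 3) -> list:
        """Largest (user, feature, tokens) rows for the day, descending."""
        rows = [(u, f, n) for (d, u, f), n in self._usage.items() if d == day]
        return sorted(rows, key=lambda r: r[2], reverse=True)[:limit]


# Illustrative usage with made-up users, features, and counts.
ledger = TokenLedger(daily_budget_tokens=1_000_000)
today = date(2026, 6, 15)
ledger.record(today, "alice", "code-review", 300_000)
ledger.record(today, "bob", "reflection-loop", 900_000)
ledger.record(today, "alice", "code-review", 50_000)
```

Checking `over_budget` once per day turns the invoice surprise into a same-day alert, and `top_spenders` points at the loop responsible.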

    Action items

    • Reconcile every Claude-backed workload (Agent SDK, claude -p, GitHub Actions, batch evals) against the new credit cap by June 1
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts within 2 weeks
    • Add a second frontier provider behind a router with automatic failover on 429/5xx this sprint
    • Re-run Claude Code and Opus API benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus integration stabilizes in late May
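
The failover item above can be prototyped without committing to a specific router product. This is a minimal sketch, assuming each provider is wrapped in a callable that raises a `ProviderError` carrying an HTTP-style status code. `ProviderError`, `complete_with_failover`, and the two stand-in providers are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    """Fail over on rate limiting (429) and server errors (5xx) only."""
    return status == 429 or 500 <= status <= 599


def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion)."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if not is_retryable(err.status):
                raise  # e.g. 400 or 401: another provider will not fix it
            last_error = err
    raise RuntimeError("all providers failed") from last_error


# Stand-in providers for illustration; real ones would wrap vendor SDK calls.
def primary(prompt: str) -> str:
    raise ProviderError(429)


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


result = complete_with_failover("ping", [("primary", primary), ("secondary", secondary)])
```

Non-retryable statuses are re-raised on purpose: a malformed request fails the same way on every provider, and silently retrying it would double the spend.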

    Sources: Claude just metered your agent SDK calls... · Claude Code latency on long-context requests drifted upward... · Anthropic shipped without the telemetry hooks... · Anthropic passes OpenAI in B2B... · Vercel published a number worth sitting with...

  2. 02

    59% Agentic Tokens: Your Eval Harness and Cost Model Are Both Wrong

    The Production Shape Has Changed

    Vercel's AI Gateway production index is the only multi-tenant usage snapshot worth citing this quarter. It puts agentic workloads at 59% of all token volume across 200,000 teams. Six months ago that figure was under 20%. An eval harness built on single-turn benchmarks is now scoring the minority of production traffic.

    The spend-versus-volume split tells the routing story. Anthropic captures 61% of dollars via Opus on planning and reasoning nodes. Google captures 38% of tokens via Flash on high-throughput utility calls. The data shows no vendor loyalty. That is a textbook tiered-routing signature, already in production at scale.

    A serving layer that hardcodes one provider SDK is out of step with what 200K production teams are already running.
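
The tiered-routing pattern described above comes down to choosing a model tier per node type rather than per application. The sketch below is one possible shape, not a description of how any of the 200K teams implement it. The `NodeKind` values and the tier labels are hypothetical and would map to real model IDs in your own configuration.

```python
from enum import Enum


class NodeKind(Enum):
    """Hypothetical node types in an agent graph."""
    PLANNING = "planning"
    REASONING = "reasoning"
    EXTRACTION = "extraction"
    CLASSIFICATION = "classification"


# Hypothetical tier labels; map them to concrete model IDs in config.
TIER_BY_KIND = {
    NodeKind.PLANNING: "premium",
    NodeKind.REASONING: "premium",
    NodeKind.EXTRACTION: "utility",
    NodeKind.CLASSIFICATION: "utility",
}


def pick_tier(kind: NodeKind, default: str = "premium") -> str:
    """Return the tier for a node kind, falling back to the safer default."""
    return TIER_BY_KIND.get(kind, default)


# Example agent trace expressed as a sequence of node kinds.
trace = [NodeKind.PLANNING, NodeKind.EXTRACTION, NodeKind.EXTRACTION, NodeKind.REASONING]
tiers = [pick_tier(kind) for kind in trace]
```

Keeping the mapping in data instead of in SDK calls is what makes a later provider swap a configuration change.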

    Why Cost Models Are Off by 5x

    Most cost models were fit when input-output ratios sat around 3:1. Agentic traces run closer to 15:1 on input, with heavy cache reuse on some providers and none on others. A forecast built on last year's ratio is off by roughly a factor of five on spend, and the error is not symmetric across vendors. The median request stops being a useful planning unit. The p95 does the work.
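
The arithmetic behind the fivefold figure is the ratio of the two ratios, 15/3 = 5, applied to input tokens per output token. How much of that lands on the invoice depends on how output is priced and how much input is served from cache, which is why the error differs by vendor. The sketch below makes those dependencies explicit; the prices, cache hit rate, and cache discount are hypothetical placeholders.

```python
def trace_cost(
    output_tokens: int,
    input_ratio: float,
    in_price: float,
    out_price: float,
    cache_hit_rate: float = 0.0,
    cached_discount: float = 0.0,
) -> float:
    """USD cost of one trace. Prices are per million tokens.

    input_ratio is input tokens per output token (3 for 3:1, 15 for 15:1).
    cached_discount is the fraction taken off the input price for cached tokens.
    """
    input_tokens = output_tokens * input_ratio
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * (1.0 - cached_discount)) * in_price / 1e6
    output_cost = output_tokens * out_price / 1e6
    return input_cost + output_cost


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
old_model = trace_cost(1000, 3, 3.0, 15.0)
agentic = trace_cost(1000, 15, 3.0, 15.0)
agentic_cached = trace_cost(1000, 15, 3.0, 15.0, cache_hit_rate=0.8, cached_discount=0.9)
spend_multiple = agentic / old_model
```

Under these placeholder prices the input side grows fivefold while total spend grows less, because output cost is unchanged. A provider with no cache discount sits closer to the full multiple, so the forecast should be rerun per vendor rather than scaled by one constant.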

    The Eval Gap Is Wider Than the Cost Gap

    Multi-agent decomposition (scan → adversarial debate → PoC construction) outperformed monolithic models on CyberGym. Microsoft's MDASH used 100+ agents. What the result leaves out is the inference bill for running 100+ agents, which the benchmark does not measure. The pattern works. The economics are the open question.

    What harness measures | What production breaks on
    Single-turn accuracy | Cost path through 40K-token planning loops
    Pass@1 on curated prompts | Tool-call reliability at the tail
    Mean latency | P95 under concurrent load
    Aggregate task success | Reward-hacking paths in 'passing' rollouts

    Sayash Kapoor's argument deserves to be the new default: outcome-only metrics systematically underestimate failure modes in capable agents. Stronger agents surface benchmark bugs and reward-hacking paths that weaker agents physically cannot reach. The pass@1 curve flattens exactly when real reliability is diverging.


    MCP Is Consolidating Faster Than Expected

    TikTok shipped a Model Context Protocol endpoint. SAP put €100M behind MCP-exposed Knowledge Graphs. ServiceNow shipped Action Fabric exposing workflows headlessly. MCP is no longer Anthropic-specific. Glean's benchmark claims off-the-shelf MCP uses 30% more tokens and loses 2.5x head-to-head against an enterprise knowledge graph. That is a vendor-published result with no methodology disclosed. Run the comparison on your own traffic before citing it.

    The bot-detection bypass rate of 81% is the quieter number. If ranking models, recsys, or experiment populations are ingesting agent traffic as human, the optimizer is converging on agent-preferred artifacts. Flag agent traffic in the experimentation platform before the next model refresh.

    Action items

    • Add trajectory-level metrics (tool-call precision/recall, steps-to-completion, cost-per-successful-task) to eval harness this sprint
    • Run a 1-hour spike: measure token overhead of current MCP/tool-calling setup vs. a retrieval-first baseline on 100 production agent traces
    • Segment token spend by workload type (agentic vs single-shot) and benchmark Flash/Haiku substitution on non-reasoning nodes
    • Add agent-traffic flagging to your experimentation platform and retrain bot-detection models with agent-generated traffic in the training set
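
The trajectory-level metrics named in the first action item can be computed from logged traces with very little machinery. The sketch below assumes each trace records which tools were expected, which were called, whether the task succeeded, and what it cost. The `Trajectory` fields are hypothetical, and treating tool calls as a set (ignoring repeats and order) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one agent run."""
    expected_tools: set[str]
    called_tools: list[str]
    succeeded: bool
    cost_usd: float


def tool_precision_recall(t: Trajectory) -> tuple[float, float]:
    """Set-based precision and recall of tool calls for one trajectory."""
    called = set(t.called_tools)
    hits = len(called & t.expected_tools)
    precision = hits / len(called) if called else 0.0
    recall = hits / len(t.expected_tools) if t.expected_tools else 1.0
    return precision, recall


def mean_steps_to_completion(trajs: list[Trajectory]) -> float:
    """Average number of tool calls among successful runs."""
    steps = [len(t.called_tools) for t in trajs if t.succeeded]
    return sum(steps) / len(steps) if steps else float("nan")


def cost_per_successful_task(trajs: list[Trajectory]) -> float:
    """Total spend, including failed runs, divided by the number of successes."""
    successes = sum(1 for t in trajs if t.succeeded)
    total = sum(t.cost_usd for t in trajs)
    return total / successes if successes else float("inf")


# Illustrative traces with made-up tools and costs.
runs = [
    Trajectory({"search", "read"}, ["search", "search", "write"], True, 0.5),
    Trajectory({"search"}, ["read"], False, 0.25),
]
precision, recall = tool_precision_recall(runs[0])
```

Charging failed runs to the successful ones is deliberate: it is the number that moves when a reflection loop burns tokens without finishing the task.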

    Sources: Agentic traffic crossed fifty-nine percent... · Vercel published a number worth sitting with... · The CyberGym result is the kind of finding... · MCP plus knowledge graphs is the combination... · AI Gateway data puts agentic workloads at fifty-nine percent...

  3. 03

    Three Training Efficiency Claims That Change This Quarter's Build-vs-Buy Math

    The Claims, Ranked by Actionability

    Three research drops landed the same week. Each one moves unit economics for teams running their own training or distillation, in directions worth measuring.

    Work | Claim | Scale Validated | Inference Impact | Spike Priority
    Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None (no architecture change) | Highest
    Datology VLM curation | +11.7 pts on 20 benchmarks at 17x less compute | 2B and 4B params | Lower response FLOPs (real serving win) | High
    NVIDIA Star Elastic | 360x cheaper model-family derivation | Not specified | Produces family of sizes from one run | Medium (verify first)

    TST Is the One to Spike First

    Token Superposition Training is a pretraining recipe change with no downstream effect on inference. If it replicates, that is a 2-3x wall-clock speedup with no serving-side cost. The validation range covers 270M to 10B-A1B MoE, which is wide enough to take seriously. The risk is medium because the result is single-source, but the claim is clean and falsifiable. A 1B continued-pretraining run against a matched-FLOPs baseline answers the question in a week.

    Datology: The Marginal Dollar Moved from Compute to Curation

    This is the clearest evidence this year that data curation dominates compute scaling for VLMs. A 2B model beat InternVL3.5-2B by about 10 points at 17x less training compute. A 4B near-frontier model hit 3.3x lower response FLOPs than Qwen3-VL-4B. Benchmark-selection risk is real, and the numbers do not show how the curation pipeline transfers to a different data distribution. Even so, if only half the claim holds, it still justifies redirecting spend from GPU hours to curation tooling.


    Star Elastic: Verify Before Extrapolating

    NVIDIA claims one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining each, and 7x better than SOTA compression. A figure like 360x tends to shrink under independent evaluation. Even if only 30x holds, that would still restructure how size tiers get produced. No external validation exists yet.


    The Compute Backdrop: 4:1 Demand Crunch

    Nebius reported 4+ customers competing for every GPU brought online, with Q1 revenue at $399M (+684% YoY) and full-year guidance of $3–3.4B. Cisco corroborates: AI product orders from hyperscalers jump from $5B to $9B (+80%) next fiscal year, with memory hardware shortages explicitly called out. H2 training runs need reserved capacity locked in now across 2+ providers.

    This is what makes the efficiency results matter beyond academic interest. At current contention levels, a 2-3x speedup at matched FLOPs is the difference between shipping in H2 and waiting for Q1 2027.

    DuckDB Quack + Kafka Share Groups: The Single-Node Stack Grows Up

    DuckDB's HTTP client-server mode makes embedded DuckDB viable as a shared service. Combined with the published ECS Fargate + Terraform pattern, that is a credible path to deleting Glue/EMR footprint for sub-100GB jobs. Kafka Share Groups decouple consumer parallelism from partition count, with roughly linear 8x scaling at 32 instances on I/O-bound workloads. Both invalidate assumptions the current stack was probably built on.

    Action items

    • Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
    • Lock H2 2026 GPU reservations across 2+ providers before quarterly sellouts tighten further
    • Audit Glue/EMR job catalog for single-node candidates and spike one onto ECS Fargate + DuckDB + Terraform pattern
    • Benchmark Kafka Share Groups against your most partition-bound consumer group (embedding/enrichment workloads first)

    Sources: Claude just metered your agent SDK calls... · DuckDB shipped a client-server mode this week... · The 4:1 ratio is the headline number...

  4. 04

    Lakehouse Data Poisoning Path: Iceberg/Polaris CVSS 9.9 + Argo CD Secret Disclosure

    A New Attack Surface on Your Training Data

    This week's advisory set is distinct from the LiteLLM/Ollama vulnerabilities covered earlier this week. The new vectors target the data and deployment layers specifically.

    Apache Iceberg (CVE-2026-42812, CVSS 9.9) lets an attacker with table-write permission redirect metadata writes to an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features. What makes this hard to catch is that most lakehouse observability does not monitor metadata pointer mutations. Default logging covers row changes, not pointer changes.

    Apache Polaris (CVE-2026-42809/10/11, CVSS 9.9) ships three credential-broadening bugs enabling cross-tenant access to S3/GCS credentials. Combined with the Iceberg redirect, there is a plausible path from compromised analyst notebook to poisoned training data to cross-tenant credential theft.

    Argo CD 3.2.x/3.3.x (CVE-2026-42880, CVSS 9.6) lets read-only users extract plaintext Kubernetes Secrets. For teams promoting models to prod via Argo, every K8s Secret in reachable namespaces should be treated as disclosed until patched and rotated.


    The Orchestrator Layer Is Soft

    Component | CVSS | ML Stack Impact | Patch Action
    Apache Iceberg | 9.9 | Poisoned tables, corrupted training data | Enforce explicit storage credential scoping + write-path allowlisting
    Apache Polaris | 9.9 | S3/GCS creds, cross-tenant access | Patch + rotate all catalog credentials
    Argo CD 3.2/3.3 | 9.6 | Plaintext K8s Secrets (HF tokens, SA keys) | Patch to ≥3.2.12/≥3.3.10 + rotate every Secret
    n8n | 9.8 | Workflow DB, OAuth sessions | Patch + scope service accounts
    Kestra ≤1.3.3 | 9.8 | Pipeline metadata, schedules | Patch + audit reach

    Why This Is Different from Tuesday's Advisory

    The LiteLLM and Ollama vulnerabilities covered Tuesday targeted the inference layer: API keys, prompts, model memory. The Iceberg/Polaris bugs target the data layer: table metadata, storage credentials, training inputs. The failure mode is not a data breach that surfaces in access logs. It is silent data poisoning that compounds through every downstream model and decision trained on the corrupted tables. Larger blast radius, harder detection.

    The PraisonAI 4-hour exploitation window (CVE-2026-44338) confirms that agent frameworks are now first-class targets with sub-day weaponization. If agent orchestration holds API keys or tool-call permissions, assume a working exploit exists within a day of any disclosure. I would commit to that assumption and revise only if the next two cycles disagree.

    The Compound Threat

    Draw the reference architecture for a modern data team: Iceberg for storage, Polaris for catalog, Argo for deployment, n8n/Kestra for orchestration. Every component has a CVSS at or above 9.0 this cycle. The combination is what matters. Credential broadening (Polaris) feeds metadata redirect (Iceberg) feeds model poisoning (training pipeline) feeds secret disclosure (Argo) for persistence. This is not a single-patch problem.

    Action items

    • Audit Iceberg/Polaris catalog configurations today: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
    • Patch Argo CD to ≥3.2.12/≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read by end of week
    • Run a dependency scan for n8n, Kestra, Spring Cloud Config, and Redis in your ML orchestration stack this sprint
    • Add metadata-pointer-mutation monitoring to lakehouse observability—alert on storage-location changes separate from data-row changes
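
The allowlisting and pointer-monitoring items above can be sketched as a comparison between two snapshots of table metadata locations. The code assumes you can already export a mapping of table name to current metadata location from your catalog; how that export is obtained is outside this sketch, and no Iceberg or Polaris API is used. The function names, bucket names, and paths are hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(location: str, allowed_prefixes: list[str]) -> bool:
    """True if location sits under an allowed prefix (same scheme and bucket)."""
    loc = urlparse(location)
    for prefix in allowed_prefixes:
        p = urlparse(prefix)
        if loc.scheme != p.scheme or loc.netloc != p.netloc:
            continue
        # Compare on a trailing slash so /warehouse does not match /warehouse-evil.
        base = p.path.rstrip("/") + "/"
        if loc.path.startswith(base):
            return True
    return False


def pointer_alerts(
    previous: dict[str, str],
    current: dict[str, str],
    allowed_prefixes: list[str],
) -> list[dict]:
    """Flag tables whose metadata location is new or changed and off the allowlist."""
    alerts = []
    for table, location in current.items():
        changed = previous.get(table) != location
        if changed and not is_allowed(location, allowed_prefixes):
            alerts.append(
                {"table": table, "previous": previous.get(table), "current": location}
            )
    return alerts


# Illustrative snapshots with made-up buckets and tables.
ALLOWED = ["s3://lake-prod/warehouse/"]
previous = {"db.events": "s3://lake-prod/warehouse/db/events/metadata/v1.metadata.json"}
current = {
    "db.events": "s3://attacker-bucket/warehouse/db/events/metadata/v2.metadata.json",
    "db.users": "s3://lake-prod/warehouse/db/users/metadata/v1.metadata.json",
}
alerts = pointer_alerts(previous, current, ALLOWED)
```

Metadata locations change on every legitimate commit, so the signal is not the change itself but a change that lands outside the allowlisted prefixes.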

    Sources: LiteLLM landed in the KEV catalog this week... · An Ollama endpoint exposed to the public internet... · PraisonAI, an open-source multi-agent framework...

◆ QUICK HITS

  • Update: LiteLLM (KEV) — rotate all upstream provider API keys stored in its DB; versions 1.81.16–1.83.7 confirmed actively exploited

    LiteLLM landed in the KEV catalog this week...

  • Abridge runs 80M+ clinical conversations through model routing — cheap triage model in front, expensive reasoning behind — 5-10x cost reduction at scale when routing is confidence-gated

    Abridge runs model routing across 100M conversations...

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-3.1-flash-live and 1.18s GPT-Realtime-2.0 — full-duplex voice is becoming a distinct architecture class

    TML is reporting 0.40 seconds of full-duplex latency...

  • Duolingo pegs AI-generated content slop at ~20% requiring human QC — a rare production quality number from a real deployment; benchmark your own acceptance rate against it

    Duolingo's twenty percent AI slop rate...

  • Only 15% of organizations have the data foundation for agentic AI (Fivetran); data quality/lineage cited as #1 blocker by ~50% — half of funded agent projects are actually data-platform projects with an agent on top

    DuckDB shipped a client-server mode this week...

  • LLM-as-a-Verifier beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing criteria into repeated binary verifications at token granularity — swap one pairwise judge this sprint

    An Ollama endpoint exposed to the public internet...

  • SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — positioned as largest open agentic trace corpus for SFT/RM training

    Claude just metered your agent SDK calls...

  • Cerebras IPO closed +70% at $311; OpenAI's $20B commitment signed Dec 2025 is the first dollar-weighted proof wafer-scale handles production LLM inference

    Cerebras IPO validates non-Nvidia silicon...

  • Gemini reproducibly outputs real phone numbers from training data — add PII extraction eval (canary insertion + divergence attacks + membership inference) to LLM CI before next release cut

    Gemini is the latest model to surface PII...

  • New COSO/PCAOB guidance requires deterministic execution and tamper-evident audit trails for ML in regulated finance — LLM stochastic decoding is structurally non-compliant by design

    The transformer underwriting models are outperforming...

  • Persona drift in LLM agents measurable within 8 conversational turns (Li et al. COLM 2024) — add a verbal-tic regex canary to agent logs as a zero-cost drift detector

    AI personas drift within eight turns...
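
The regex canary in the persona-drift item above admits more than one reading. The sketch below takes one of them as an assumption: the persona is given a distinctive phrase it should keep using, and drift is flagged when the phrase has been absent for several agent turns in a row. The function name, the example phrase, and the `patience` parameter are hypothetical.

```python
import re
from typing import Optional


def first_drift_turn(
    agent_turns: list[str],
    tic: re.Pattern,
    patience: int = 2,
) -> Optional[int]:
    """Return the 1-based turn at which the tic has been missing
    for `patience` consecutive turns, or None if that never happens."""
    missing = 0
    for turn_number, text in enumerate(agent_turns, start=1):
        missing = 0 if tic.search(text) else missing + 1
        if missing >= patience:
            return turn_number
    return None


# Made-up persona whose tic is opening with "Ahoy".
TIC = re.compile(r"\bahoy\b", re.IGNORECASE)
turns = [
    "Ahoy! What can I chart for you today?",
    "Ahoy again, here is the summary.",
    "Here is the revised summary.",
    "How can I help you further?",
]
drift_at = first_drift_turn(turns, TIC)
```

Requiring consecutive misses keeps a single terse reply from raising an alert, at the cost of detecting drift a turn or two late.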

◆ Bottom line

The take.

Anthropic's 80x capacity miss has a June 15 deadline attached: from that date every Claude-backed agent draws on metered credits, with overflow billed at API rates. Meanwhile, 59% of production tokens are now agentic and your eval harness still scores single-turn completions. At the same time, Apache Iceberg's CVSS 9.9 lets attackers silently poison your training data through metadata redirects that default logging won't catch. The three things your stack needs this week: a Claude credit reconciliation before June 15, trajectory-level eval metrics for the 59% of traffic you're not measuring, and an Iceberg metadata-pointer audit before the next training run ingests something it shouldn't.

— Promit, reading as Data Science

Frequently asked

What exactly changes for Claude usage on June 15?
Claude usage through third-party tools like Zed, Conductor, OpenCode, and T3 Code moves to a separate credit bucket capped at plan value, with no rollover and overflow billed at API rates. The effective 70-90% subsidy on programmatic usage through Max plans disappears, so any cost model assuming flat-rate Claude consumption through non-native IDEs needs to be reconciled before then.
Why are single-turn evals no longer sufficient for production?
Vercel's production index shows 59% of token volume is now agentic across 200,000 teams, but most eval harnesses still measure single-turn pass@1 on curated prompts. That misses tool-call precision, steps-to-completion, p95 latency under concurrent load, and reward-hacking paths in passing rollouts—failure modes that only surface in multi-turn traces and dominate real reliability.
How should teams instrument Claude spend given the lack of native telemetry?
Deploy an LLM gateway like LiteLLM or Portkey with per-user and per-feature tagging plus daily token budget alerts. Anthropic provides no native per-user or per-tool usage telemetry, so without a gateway the overage shows up on the invoice after credits are gone—as ServiceNow's CDIO discovered after burning through a full-year budget by May.
Which training efficiency claim is worth spiking first and why?
Nous Token Superposition Training, because it is a pretraining recipe change with no downstream effect on inference and claims 2-3x wall-clock speedup at matched FLOPs, validated from 270M up to 10B-A1B MoE. A 1B continued-pretraining run against a matched-FLOPs baseline answers the question in roughly a week, and even a 1.6x replication pays for itself on the next full run.
What makes the Iceberg and Polaris CVEs different from inference-layer vulnerabilities?
They enable silent data poisoning rather than a detectable breach. CVE-2026-42812 lets an attacker redirect Iceberg metadata writes to an attacker-controlled S3 prefix so subsequent queries read poisoned Parquet, and Polaris credential-broadening bugs allow cross-tenant access to storage credentials. Default lakehouse logging tracks row changes, not metadata pointer mutations, so the corruption compounds through every downstream model trained on the affected tables.
