Data Science daily

Edition 2026-05-31 · read as Data Science

Anthropic Ends Claude Subscription Discount, Leases xAI GPUs

Sources: 36 · Words: 1,723 · Read: 9 min

Topics: Agentic AI · LLM Inference · AI Capital

◆ The signal

Anthropic quietly killed the 70-90% effective discount on programmatic Claude usage — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses — while simultaneously admitting an 80x capacity miss that forced them to lease xAI's entire 220,000-GPU Colossus 1 cluster. OpenAI dropped a 2-month free Codex enterprise switch promo the same day. If you haven't reconciled your Claude token burn against the new credit cap this week, you're making a pricing decision by default.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic Credit Reset + Capacity Crisis

    act now

    Anthropic metered all programmatic Claude usage at API rates, killing the alt-harness subsidy. ServiceNow burned its full-year budget by May. The 80x capacity miss drove an emergency lease of xAI's 220K-GPU Colossus 1 cluster. OpenAI's 2-month free Codex promo targets the exact developers Anthropic just alienated.

    80x capacity miss vs plan · 9 sources
    1. Planned capacity: 10x
    2. Actual demand: 80x
  2. 02

    Agentic Traffic Is Now the Majority — Eval Harnesses Measure the Minority

    monitor

    Vercel's AI Gateway puts agentic workloads at 59% of all token volume across 200K teams. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Single-turn eval harnesses are now benchmarking the minority of production traffic — trajectory-level metrics are overdue.

    59% agentic token share · 5 sources
    1. Agentic workloads: 59%
    2. Single-turn/chat: 41%
  3. 03

    AI Cyber Capability Crosses Full-Takeover Threshold

    monitor

    Anthropic's Mythos is the first model to clear both UK AISI simulated attack ranges — achieving full network takeover in controlled tests. GPT-5.5-cyber cleared one. Separately, Google confirmed a threat actor using AI to build cybercrime tooling in the wild. Patch SLAs calibrated to CVE cadence are now measuring the wrong clock.

    2/2 AISI ranges cleared · 7 sources
    1. Mythos (new): Full takeover
    2. GPT-5.5-cyber: Partial takeover
    3. Mythos (prior): Advanced persistence
  4. 04

    Training Efficiency Breakthroughs: 2-360x Compute Reductions

    monitor

    Three research drops change unit economics for anyone pretraining or distilling: Nous TST delivers 2-3x wall-clock speedup at matched FLOPs (validated 270M→10B MoE). Datology beats InternVL3.5-2B by 10pts at 17x less compute via data curation alone. NVIDIA Star Elastic claims 360x cheaper model-family production from one post-training run.

    17x compute reduction (VLM) · 2 sources
    1. Nous TST: 3x
    2. Datology VLM: 17x
    3. Star Elastic: 360x
  5. 05

    Compute Supply Crunch: 4:1 Demand Ratio, Siting Backlash

    background

    Nebius reports 4+ customers competing per GPU at 684% YoY revenue growth. Cerebras IPO'd at $56B with a $20B OpenAI commitment. Utah's 9GW Stratos project faces 4,000 complaints and a referendum. Cisco's AI order guidance jumps from $5B to $9B with explicit memory hardware shortage. H2 training capacity priced on today's availability is mispriced.

    $3.4B Nebius 2026 guide · 5 sources
    1. Nebius 2025 rev: 530
    2. Nebius 2026 guide: 3200

◆ DEEP DIVES

  1. 01

    Anthropic's Double Shock: Credit Metering Kills the Subsidy, 80x Miss Forces Colossus Lease

    The Pricing Reset

    Anthropic converted every Claude subscription into a dollar-matched API credit bucket. The implicit 70-90% discount teams were getting by running Agent SDK, GitHub Actions, or third-party harness workloads against a $200 Max plan is gone. Starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) draws from a separate credit allocation with no rollover and overflow at API rates. Any cost model built before this date is numerically wrong, not approximately wrong.

    ServiceNow's CDIO publicly confirmed they burned their full-year Claude budget by May after the price hikes. The thing this doesn't tell you is how much was preventable: Anthropic ships no native per-user, per-tool usage telemetry, and no SLAs on latency or availability. You cannot attribute spend you cannot measure.


    The Capacity Story Behind the Price Story

    Dario Amodei at Code with Claude admitted they planned for 10x growth and got 80x in revenue and usage. That 8x forecast error explains the degradation reports from the last several weeks. What users were reading as model regressions was a capacity wall. The emergency fix is leasing xAI's entire Colossus 1 cluster (220,000+ GPUs spanning H100, H200, and GB200) from the CEO who called Anthropic 'misanthropic' three months ago.

    Surface | Before | After (May 7-14)
    Claude Code limits | 5-hour cap | Doubled
    Peak-hours throttle | Reduced for Pro/Max | Removed
    Opus API rate limits | Squeezed | 'Substantially raised'
    Fleet composition | Anthropic-managed | Heterogeneous incl. GB200

    Any Claude benchmark from before May 7 is stale. Re-baseline after the new caps land, not before — otherwise the delta you attribute to a prompt change is mostly capacity noise.

    The OpenAI Counter-Offensive

    Hours after the metering announcement, OpenAI dropped a 2-month free Codex enterprise switch promo. Ramp's April data showed Anthropic's first-ever lead over OpenAI in B2B share, 34.4% vs 32.3%, so OpenAI is pricing a direct assault on the developers Anthropic just alienated. Treat this as an asymmetric-payoff evaluation window: free head-to-head data on workloads you actually run, not on someone else's leaderboard.

    What This Means for Your Stack

    The combined read across nine sources: single-provider Claude dependency carries unpriced risk, and the forecast-error bound on that risk is now 8x. Anthropic is targeting an October IPO with a CFO hired specifically for margin improvement. The base rate says pricing stays sticky or rises from here.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) against the new credit cap and flag jobs that will exhaust credits before month-end
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts in front of all Claude traffic
    • Accept OpenAI's 2-month Codex evaluation under the enterprise switch promo; instrument head-to-head on your eval harness with matched prompts
    • Re-run Claude Code and Opus API baselines (throughput, p95 latency, rate-limit headroom) post-Colossus integration before shipping any workarounds designed for the degraded period
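
A minimal sketch of the first two action items, assuming your gateway already exports tagged usage records. The `UsageRecord` shape, the feature tags, and the dollar figures are hypothetical, and the projection is a simple linear extrapolation of month-to-date burn, not anything Anthropic publishes.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    day: int         # day of month the call was made (1-based)
    feature: str     # hypothetical tag attached at the gateway
    cost_usd: float  # dollar cost of the call at API rates


def project_exhaustion(records, credit_usd, today, days_in_month):
    """Return (spend_by_feature, projected_month_spend, exhaustion_day or None).

    Assumes today >= 1 and that burn continues at the month-to-date daily average.
    """
    spend_by_feature = defaultdict(float)
    for r in records:
        if r.day <= today:
            spend_by_feature[r.feature] += r.cost_usd
    spent = sum(spend_by_feature.values())
    daily_burn = spent / today
    projected = daily_burn * days_in_month
    if daily_burn == 0 or projected <= credit_usd:
        return dict(spend_by_feature), projected, None
    # Day on which cumulative spend reaches the credit bucket.
    exhaustion_day = math.ceil(credit_usd / daily_burn)
    return dict(spend_by_feature), projected, exhaustion_day


# Hypothetical two days of tagged traffic against a $200 credit bucket.
records = [
    UsageRecord(1, "batch_evals", 40.0),
    UsageRecord(2, "batch_evals", 40.0),
    UsageRecord(1, "ci_review", 10.0),
    UsageRecord(2, "ci_review", 10.0),
]
by_feature, projected, day = project_exhaustion(
    records, credit_usd=200.0, today=2, days_in_month=30
)
print(by_feature, projected, day)
```

In this made-up example the bucket runs out on day 4, and the per-feature totals show which workload to throttle or move first.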

    Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward... · Anthropic shipped without the telemetry hooks... · Vercel published a number worth sitting with: 59%... · Anthropic passes OpenAI in B2B

  2. 02

    59% of Production Tokens Are Agentic — Your Eval Harness Is Scoring the Minority

    The Production Data

    Vercel's AI Gateway index, drawn from 200,000 teams over 7 months, puts agentic workloads at 59% of all token volume. That is measurement, not forecast. Anthropic captures 61% of spend through Opus. Google captures 38% of volume through Flash. The data shows no vendor loyalty. Customers route by task.

    The spend-versus-volume gap is the structural read: expensive models handle the planning and reasoning nodes, while cheap models handle high-throughput utility calls such as retrieval rewriting, extraction, and classification. Teams paying Opus rates for every agent step are overspending on the utility calls that do not need it.
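
To make the spend-versus-volume gap concrete, here is a rough cost sketch for one agent trajectory under two routing policies. The per-million-token prices, node names, and token counts are all hypothetical placeholders; substitute your providers' actual rates and your own traces.

```python
# Hypothetical per-million-token prices; these are NOT real vendor rates.
PRICE_PER_MTOK = {"premium": 15.0, "utility": 0.5}


def trajectory_cost(steps, routing):
    """steps: list of (node_kind, tokens); routing: node_kind -> price tier."""
    return sum(
        tokens / 1_000_000 * PRICE_PER_MTOK[routing[kind]]
        for kind, tokens in steps
    )


# One made-up trajectory: a small planning step and two bulky utility steps.
steps = [("plan", 20_000), ("extract", 100_000), ("rewrite", 80_000)]

all_premium = {"plan": "premium", "extract": "premium", "rewrite": "premium"}
tiered = {"plan": "premium", "extract": "utility", "rewrite": "utility"}

cost_all_premium = trajectory_cost(steps, all_premium)
cost_tiered = trajectory_cost(steps, tiered)
print(round(cost_all_premium, 2), round(cost_tiered, 2))
```

With these placeholder numbers, most of the tokens sit in the utility steps, so moving only those steps to the cheaper tier removes most of the bill while the planning step stays on the premium model.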


    Why Single-Turn Evals Are Now Dangerously Misleading

    Most eval harnesses in production still score single-turn responses against a reference answer. That was the right design in 2023. Once the median request is a multi-step tool loop with retries, the metric you want is different:

    Old metric (single-turn) | New metric (agentic) | Why it matters
    Accuracy on final answer | Cost-per-successful-task | A 40K-token argument with itself costs real money
    MMLU/HumanEval | Tool-call precision & recall | Wrong tool selection cascades through the trajectory
    Mean latency | Steps-to-completion | p95 trajectory, not p95 request
    Pass@1 | Recovery-from-error rate | Real agents fail and retry; pass@1 hides this

    Sayash Kapoor's framing is the cleanest version of this: outcome-only metrics systematically underestimate failure modes in capable agents. Stronger agents surface benchmark bugs and reward-hacking paths that weaker agents never reach. The pass@1 curve flattens at exactly the point where real reliability starts to diverge. The thing pass@1 doesn't measure is the long tail you will actually ship into.
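
One way the trajectory-level metrics could be computed from logged runs is sketched below. The `Trajectory` fields are a hypothetical logging schema, and the tool-call F1 here is a set-based simplification that ignores call order and repeated calls.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    succeeded: bool
    cost_usd: float
    steps: int
    expected_tools: list  # tools a reference solution would call
    called_tools: list    # tools the agent actually called
    errors: int = 0       # tool or step errors encountered
    recovered: int = 0    # errors followed by a successful retry


def tool_call_f1(expected, called):
    """Set-based F1 over tool names (simplification: order and repeats ignored)."""
    exp, got = set(expected), set(called)
    tp = len(exp & got)
    if tp == 0:
        return 0.0
    precision = tp / len(got)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)


def summarize(trajs):
    n = len(trajs)
    wins = [t for t in trajs if t.succeeded]
    total_errors = sum(t.errors for t in trajs)
    return {
        "task_success_rate": len(wins) / n,
        # Failed runs still cost money, so total spend is divided by successes.
        "cost_per_successful_task": (
            sum(t.cost_usd for t in trajs) / len(wins) if wins else float("inf")
        ),
        "mean_steps_to_completion": (
            sum(t.steps for t in wins) / len(wins) if wins else float("nan")
        ),
        "mean_tool_call_f1": sum(
            tool_call_f1(t.expected_tools, t.called_tools) for t in trajs
        ) / n,
        "recovery_from_error_rate": (
            sum(t.recovered for t in trajs) / total_errors if total_errors else 1.0
        ),
    }


# Two made-up trajectories: one success with a recovered error, one failure.
trajs = [
    Trajectory(True, 0.40, 6, ["search", "read"], ["search", "read", "search"], 1, 1),
    Trajectory(False, 0.60, 12, ["search", "read"], ["search", "calc"], 2, 0),
]
summary = summarize(trajs)
print(summary)
```

Dividing total spend by successful tasks is a deliberate choice: it charges the failed trajectory to the successful one, which is what the serving bill does.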


    The Production Reference Architecture

    Abridge (80M+ clinical conversations, 250 health systems, $5.3B valuation) has disclosed enough to lift the pattern: cheap fast model for triage, expensive model for reasoning, confidence-gated routing, LLM judges calibrated against human annotators quarterly, memory externalized into event stores. Microsoft's MDASH reports the same decomposition on the security side, with scan, debate, and exploit stages beating monolithic approaches on CyberGym.
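
The confidence-gated part of that pattern can be sketched in a few lines. This is an illustration of the general idea, not Abridge's implementation: the `triage` and `reasoner` callables, the stubbed confidence values, and the 0.8 threshold are all assumptions.

```python
def route(request, triage, reasoner, threshold=0.8):
    """Try the cheap model first; escalate when its confidence is below threshold.

    triage(request) -> (answer, confidence in [0, 1])
    reasoner(request) -> answer
    Returns (answer, which_model_answered).
    """
    answer, confidence = triage(request)
    if confidence >= threshold:
        return answer, "triage"
    return reasoner(request), "reasoner"


# Stub models for illustration only: short requests are treated as easy.
def triage(request):
    return ("triage answer", 0.95 if len(request) < 40 else 0.4)


def reasoner(request):
    return "reasoned answer"


easy = route("Refill reminder", triage, reasoner)
hard = route("Reconcile conflicting medication lists across three visits", triage, reasoner)
print(easy, hard)
```

In practice the threshold would be tuned against the same human-annotated set used to calibrate the LLM judges, since an overconfident triage model silently defeats the gate.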

    A router that treats every call as independent is leaving money and latency on the table once a meaningful fraction of traffic is agentic. Session-aware routing and tool-calling reliability matter more than MMLU deltas.

    The Glean benchmark claiming MCP uses 30% more tokens than a retrieval-tuned knowledge graph is vendor-published with no methodology, so treat the magnitude as untrusted. The direction matches what the production traces show: naive tool listings balloon context windows. I would expect the real number to land somewhere south of 30% on a clean rerun, but still positive enough to matter.

    Action items

    • Add trajectory-level metrics to your eval harness this sprint: task success rate, tool-call F1, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Instrument per-node token cost in your agent graphs and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models
    • Add LLM-judge-to-human-annotator agreement as a tracked SLI; re-calibrate quarterly with Cohen's kappa against gold labels
    • Run a 1-hour spike measuring token overhead of your MCP/tool-calling setup vs. a retrieval-first baseline on 100 sampled production traces
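
For the judge-agreement SLI, Cohen's kappa is simple enough to compute directly. The sketch below uses only the standard library; the label lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up gold labels versus LLM-judge labels on eight items.
judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
human = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)
print(kappa)
```

Here raw agreement is 75%, but kappa is 0.5 once chance agreement is removed, which is why kappa rather than raw agreement is the number to track quarter over quarter.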

    Sources: Agentic traffic crossed fifty-nine percent of tokens... · Vercel published a number worth sitting with: 59%... · The CyberGym result is the kind of finding... · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs is the combination...

  3. 03

    Apache Lakehouse Stack Under Critical Attack: Iceberg, Polaris, Argo CD

    The New CVEs Landing on the Data Stack

    This week's advisory cycle concentrates on infrastructure data teams actually run in production. The LiteLLM KEV entry was flagged earlier this week. Three new critical CVEs landed that target lakehouse and MLOps infrastructure directly, which is a different class of problem.

    Component | CVE / CVSS | Impact | Blast Radius
    Apache Iceberg | CVE-2026-42812 / 9.9 | Metadata write redirect to attacker-controlled S3 | Poisoned tables, corrupted training data
    Apache Polaris | CVE-2026-42809/10/11 / 9.9 | Credential broadening | S3/GCS creds, cross-tenant access
    Argo CD 3.2.x/3.3.x | CVE-2026-42880 / 9.6 | Missing authorization | Plaintext K8s Secret extraction
    n8n | CVE-2026-42233 / 9.8 | SQL injection + OAuth theft | Workflow DB, OAuth sessions
    Kestra ≤1.3.3 | CVE-2026-38428 / 9.8 | SQL injection | Pipeline metadata, schedules

    Why Iceberg CVE-2026-42812 Is the One That Matters

    An attacker with table-write permission can redirect metadata pointers at an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features and produces a model that looks fine on the eval set. Standard lakehouse observability covers row changes, not pointer changes, so most monitoring stacks will not see this.
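
A pointer-level check can be sketched without any catalog client, assuming you can already snapshot each table's current metadata location as a URI string. The snapshot dictionaries, table names, and bucket names below are hypothetical, and the check is a monitoring aid, not a substitute for the patch.

```python
def _parent(uri):
    """Directory part of a metadata file URI (each commit writes a new file)."""
    return uri.rsplit("/", 1)[0]


def check_metadata_locations(previous, current, allowed_prefixes):
    """Compare two snapshots of table name -> metadata location URI.

    Flags locations outside the allowlist, and locations whose directory
    changed between snapshots even though they remain inside the allowlist.
    """
    alerts = []
    for table, location in current.items():
        if not any(location.startswith(p) for p in allowed_prefixes):
            alerts.append((table, "outside_allowlist", location))
        elif table in previous and _parent(previous[table]) != _parent(location):
            alerts.append((table, "location_moved", location))
    return alerts


# Hypothetical snapshots; keep the trailing slash so "lake-prod-x" cannot match.
allowed = ("s3://lake-prod/",)
previous = {
    "db.features": "s3://lake-prod/db/features/metadata/00001-a.metadata.json",
    "db.labels": "s3://lake-prod/db/labels/metadata/00004-b.metadata.json",
}
current = {
    "db.features": "s3://attacker-bucket/features/metadata/00002-x.metadata.json",
    "db.labels": "s3://lake-prod/db/labels/metadata/00005-c.metadata.json",
}
alerts = check_metadata_locations(previous, current, allowed)
print(alerts)
```

A normal commit only changes the metadata file name within the same directory, so it passes silently; a redirect to another bucket or prefix raises an alert.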

    Combined with the Polaris credential-broadening issue, the plausible path runs from "compromised analyst notebook" to "cross-tenant data theft."

    Draw a reference architecture for a modern data team, throw darts at it, and every throw hits a CVSS of 9.0 or higher.

    Argo CD: Model Registry Tokens Are Exposed

    The missing-authorization flaw lets low-privilege users extract plaintext Kubernetes Secrets in reachable namespaces. For teams running model services or training jobs through Argo CD 3.2 or 3.3, that set includes model-registry tokens, HuggingFace PATs, database passwords, and cloud credentials. Rotation costs more than the patch. Skipping it is not a defensible decision.

    The Pattern

    These are not obscure memory-corruption bugs in C libraries. They are authorization failures and unsafe input handling in Python, Go, and Java tools that shipped fast. ML infrastructure was built at startup velocity and is now getting the security attention web frameworks got a decade ago. CISA is tracking AI-infra exploits the way it tracks Exchange or Ivanti. The downstream effect, predictable enough to underwrite, is procurement friction on anything LLM-adjacent for the next few quarters.

    Action items

    • Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read — this week
    • Audit Iceberg/Polaris catalog configurations: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
    • Run a dependency scan for n8n, Kestra, Spring Cloud Config, and Redis across your ML orchestration stack; pin to patched versions
    • Add metadata-pointer integrity checks to your lakehouse monitoring — alert on catalog location changes, not just row-level changes

    Sources: LiteLLM landed in the KEV catalog this week... · An Ollama endpoint exposed to the public internet... · Agent stacks are now in scope for attackers

  4. 04

    Training Compute Breakthroughs: TST, Datology, and Star Elastic Change Unit Economics

    Three Results That Move the Budget

    Three research drops landed the same week, each aimed at a different line item in the training compute bill. Read together, the marginal dollar in model development is moving from raw FLOPs toward training recipes and data curation. That is a claim about where to spend, not a claim that compute stopped mattering.

    Work | Claim | Scale Validated | Inference Impact | Replication Risk
    Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no architecture change | Medium; single-source, clean claim
    NVIDIA Star Elastic | 360x cheaper model-family production | Not specified | Produces family of sizes from one run | High; big number, lab-reported
    Datology VLM | +11.7 pts on 20 benchmarks; 17x less compute | 2B and 4B params | Lower response FLOPs; real serving win | Medium; benchmark-selection risk

    What Each Means for Your Roadmap

    TST is the one to spike first. Token Superposition Training is a pretraining recipe change with no inference-side architecture change. If it replicates at even 1.6x on a continued-pretraining run with no val-loss regression, it pays for itself on the next full run. The mechanism — superposing multiple token targets per forward pass — is clean, and the authors validated it from 270M up to 10B MoE. The thing this doesn't tell you is how it behaves on your data mix at your context length, which is where most pretraining recipes lose half their reported gain.
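
The "pays for itself" claim reduces to simple arithmetic: a speedup of s saves a fraction 1 - 1/s of wall-clock time. The sketch below assumes training cost scales linearly with wall-clock time on a fixed fleet; the dollar figures are hypothetical.

```python
def wall_clock_saved(speedup):
    """Fraction of wall-clock time saved at a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def spike_pays_off(spike_cost, full_run_cost, speedup):
    """True if savings on the next full run exceed the cost of the spike.

    Assumes run cost is proportional to wall-clock time.
    """
    return full_run_cost * wall_clock_saved(speedup) > spike_cost


# Hypothetical budget: a $20K spike ahead of a $400K full pretraining run.
saved_at_1_6x = wall_clock_saved(1.6)
pays_off = spike_pays_off(spike_cost=20_000, full_run_cost=400_000, speedup=1.6)
print(saved_at_1_6x, pays_off)
```

At 1.6x the saving is 37.5% of the run, so under these assumptions the spike breaks even whenever it costs less than about three eighths of the next full run.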

    Datology is the clearest evidence this year that the marginal VLM dollar has moved from compute to curation. Their 2B model beats InternVL3.5-2B by about 10 points at 17x less training compute, through data selection and mixture optimization alone. At 4B params, they reach near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B. The training cost number is interesting; the response FLOPs number is what shows up on the serving bill.

    Star Elastic's 360x number is the kind of claim that always shrinks under independent eval. Given the setup, I expect it to come in roughly an order of magnitude lower on third-party benchmarks. Even if only 30x survives, that would change how teams produce size tiers. One post-training run yielding a family from 1B to 70B is categorically different from training each size independently.

    TST requires no inference-time changes. Datology requires no model architecture changes. Both are 'free' efficiency wins conditional on reproduction — and both are cheap enough to spike this quarter.

    The Data Curation Thesis

    Datology's result sits alongside the Fivetran readiness index, which finds that only 15% of organizations have the data foundation for agentic AI, with about 50% citing data quality and lineage as the top blocker. The correlation is suggestive, not causal: bigger models still help, and curated data is partially a proxy for teams that know what they are doing. The lakehouse stats gap compounds the problem. Iceberg, Delta, and Parquet treat column-level stats as optional, and stale or missing stats produce plans that cost 3x what they should without surfacing a hard error. That last failure mode is the one to instrument first, because it does not show up on any leaderboard.

    Action items

    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline this quarter
    • Run ANALYZE/compute-stats coverage audit across your Iceberg/Delta tables; add stats freshness to table-level SLAs
    • If running VLM training: replicate Datology's data-curation-first methodology on a 2B base before scaling compute
    • Score your next agent project against data readiness dimensions (quality, lineage, governance) before greenlighting compute spend
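
The stats-coverage audit can start as a freshness check over whatever table inventory you already keep. The sketch below assumes a hypothetical inventory mapping each table to its last write time and last stats computation time; the table names and the 24-hour lag budget are placeholders.

```python
from datetime import datetime, timedelta


def audit_stats(tables, max_lag=timedelta(hours=24)):
    """tables: name -> (last_write_at, last_stats_at or None).

    Returns name -> "missing" or "stale" for tables that violate the budget.
    """
    findings = {}
    for name, (last_write, last_stats) in tables.items():
        if last_stats is None:
            findings[name] = "missing"
        elif last_write - last_stats > max_lag:
            findings[name] = "stale"
    return findings


# Hypothetical inventory with one healthy, one stale, and one unanalyzed table.
tables = {
    "db.labels": (datetime(2026, 5, 30, 12), datetime(2026, 5, 30, 6)),
    "db.events": (datetime(2026, 5, 30, 12), datetime(2026, 5, 27, 12)),
    "db.features": (datetime(2026, 5, 30, 12), None),
}
findings = audit_stats(tables)
print(findings)
```

Measuring lag relative to the last write, not to the current time, keeps cold tables from alerting just because nobody has touched them.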

    Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server mode — Spark-on-Glue jobs under 100GB are now credible migration candidates to ECS Fargate + DuckDB at 50%+ cost reduction

    DuckDB shipped a client-server mode this week

  • Update: Mythos is first model to achieve full network takeover on both AISI simulated attack ranges; GPT-5.5-cyber cleared one — add staged cyber-capability rubric (recon → lateral movement → persistence → exfil) to agent release gates

    Mythos cleared the AISI attack ranges this week

  • Kafka Share Groups decouple consumer parallelism from partition count with ~linear 8x scaling at 32 instances on I/O-bound workloads — benchmark on embedding/enrichment consumers first

    DuckDB shipped a client-server mode this week

  • SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — largest open agentic trace corpus, useful for SFT and reward-model training before licensing frictions arrive

    Claude just metered your agent SDK calls

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-3.1-flash-live and 1.18s GPT-Realtime-2.0 — full-duplex voice is becoming a distinct model class; add turn-taking latency (user-EOS → first audio byte) to voice eval

    TML is reporting 0.40 seconds of full-duplex latency

  • Duolingo CEO publicly pegged AI-generated content slop at ~20% requiring human QC — use as calibration anchor for your own generation acceptance rate before building custom benchmarks

    Duolingo's twenty percent AI slop rate is the number worth staring at

  • Gemini reproducibly leaks real phone numbers (4 independent incidents) — add PII extraction eval suite (canary insertion, divergence attacks, membership inference probes) to LLM CI before the next release cut

    Gemini is the latest model to surface PII from its training data

  • LLM-as-a-Verifier (decomposed binary verifications with token-level scoring) outperforms LLM-as-a-Judge on tie-rate and decision accuracy — a one-day rewrite of one pairwise judge is the cheapest variance reduction available

    An Ollama endpoint exposed to the public internet gets picked up by Shodan...

  • Only 15% of organizations have data foundations ready for agentic AI (Fivetran); ~50% cite quality/lineage as #1 blocker — score target domains against readiness dimensions before greenlighting agent compute

    DuckDB shipped a client-server mode this week

  • Opus 4.7 tripled image processing costs — re-price multimodal inference budget and run head-to-head vs GPT-4V and Gemini on your actual image workload this sprint

    Anthropic passes OpenAI in B2B

◆ Bottom line

The take.

Anthropic killed the programmatic Claude discount (70-90% gone overnight), admitted an 80x capacity miss that forced them to rent a competitor's entire GPU fleet, and still has no native cost-attribution telemetry — while Vercel confirmed 59% of production tokens are now agentic traffic that your single-turn eval harness doesn't measure. The three things to ship this sprint: a gateway with per-feature budget caps, trajectory-level eval metrics, and patches for Iceberg/Argo CD before someone poisons your training data through a CVSS 9.9 you didn't know existed.

— Promit, reading as Data Science ·

Frequently asked

How do I figure out if my Claude workloads will blow through the new credit cap?
Put an LLM gateway like LiteLLM or Portkey in front of all Claude traffic and tag every call by user, feature, and tool. Anthropic ships no native per-user, per-tool telemetry, so the attribution layer is yours to build. Once tagged, project current daily burn against the dollar-matched credit bucket and flag any job that exhausts before month-end — third-party tool usage (Zed, Conductor, OpenCode, T3 Code) draws from a separate allocation with no rollover starting June 15.
Are old Claude benchmarks still valid after the Colossus lease and the new rate caps?
No — re-baseline after the new caps land. The 80x usage miss caused weeks of capacity-driven degradation that users mistook for model regressions, and Anthropic has since doubled Claude Code limits, removed peak-hours throttling for Pro/Max, and substantially raised Opus API rate limits on a heterogeneous fleet that now includes GB200s. Any throughput, p95 latency, or rate-limit-headroom number from before May 7 mixes capacity noise into whatever delta you're trying to measure.
Why are single-turn evals insufficient now that 59% of tokens are agentic?
Single-turn accuracy scores the minority of production traffic and hides the failure modes that matter in tool loops. Trajectory-level metrics — cost-per-successful-task, tool-call precision and recall, steps-to-completion, and recovery-from-error rate — capture what actually breaks: wrong tool selection cascading through a trajectory, 40K-token self-arguments, and pass@1 curves that flatten right where real reliability diverges. Outcome-only metrics also systematically underestimate reward-hacking paths that stronger agents reach.
Which of the new lakehouse CVEs should I patch first, and why?
Patch Argo CD (CVE-2026-42880) this week and rotate every Kubernetes Secret in reachable namespaces, because the missing-authorization flaw exposes plaintext model-registry tokens, HuggingFace PATs, DB passwords, and cloud credentials. In parallel, harden Iceberg and Polaris: CVE-2026-42812 lets a write-permitted attacker redirect table metadata to an attacker-controlled S3 prefix, poisoning Parquet files that training runs ingest silently. Standard row-level lakehouse monitoring does not catch pointer mutations, so add catalog-location change alerts.
Is Token Superposition Training worth a spike this quarter compared to Star Elastic or Datology's VLM result?
TST is the cleanest spike candidate because it changes only the pretraining recipe — no inference-side architecture change — and was validated from 270M up to a 10B-A1B MoE with 2-3x wall-clock at matched FLOPs. Even a 1.6x replication on a continued-pretraining run with no val-loss regression pays for itself on the next full run. Star Elastic's 360x model-family number will likely shrink by an order of magnitude under independent eval, and Datology's gains are VLM-specific and depend on curation pipelines you may not have.
