Data Science daily

Edition 2026-05-29 · read as Data Science

AnthropicEndsFlat-RateDiscount:RepricingAgentWorkloads

Sources
36
Words
1,752
Read
9min

Topics Agentic AI LLM Inference AI Regulation

◆ The signal

Anthropic ended the flat-rate Claude discount this week. Programmatic usage through the Agent SDK, GitHub Actions, and batch evals now meters against API credits at list price, which removes a 70-90% effective subsidy. The thing the headline doesn't tell you: Vercel's production telemetry puts 59% of tokens in multi-turn agentic traces, and those run 5-15x heavier than single-shot completions. Two assumptions broke at once. Re-model before the June invoice prints.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic Pricing Shock + Capacity Crisis

    act now

    Claude subscriptions now meter programmatic usage at API rates (70-90% subsidy gone). Anthropic grew 80x vs. planned 10x, leased xAI's full 220K-GPU Colossus 1 to cope. ServiceNow burned its full-year budget by May. June 15 brings a separate third-party tool credit split for Zed/Conductor/OpenCode users.

    80x
    growth vs plan
    10
    sources
    • Growth vs forecast
    • Colossus GPUs leased
    • Ramp share lead
    • ARR trajectory
    • Valuation (rumored)
    1. Anthropic B2B share34.4
    2. OpenAI B2B share32.3
  2. 02

    Agentic Traffic Majority + Eval Harness Gap

    monitor

    Vercel's Gateway index: 59% of production tokens are multi-turn agentic traces. Anthropic captures 61% of spend (Opus), Google captures 38% of volume (Flash). Most eval harnesses still score single-turn completions — measuring the minority of traffic. Multi-agent decomposition (MDASH 100+ agents) beat monolithic models on CyberGym.

    59%
    tokens now agentic
    6
    sources
    • Agentic token share
    • Anthropic spend share
    • Google volume share
    • MCP token overhead
    • Teams on multi-model
    1. Agentic tokens59
    2. Single-turn tokens41
  3. 03

    ML Infrastructure CVEs Expand Beyond LiteLLM

    act now

    Apache Iceberg (CVSS 9.9) allows metadata-write redirect to attacker-controlled storage — silently poisoning training data. Apache Polaris (9.9) broadens cloud credentials. Argo CD (9.6) exposes K8s Secrets. PraisonAI agent framework weaponized in 4 hours. An 18-year NGINX rewrite RCE hits every model-serving gateway.

    4hrs
    time to exploit
    5
    sources
    • Iceberg CVSS
    • Polaris CVSS
    • Argo CD CVSS
    • PraisonAI exploit time
    • NGINX bug age
    1. 01Iceberg metadata9.9
    2. 02Polaris creds9.9
    3. 03n8n SQLi9.8
    4. 04Argo CD authz9.6
    5. 05Ollama GGUF9.1
  4. 04

    AI Autonomous Exploit Capability Crosses Threshold

    monitor

    Anthropic's Mythos is the first model to clear both UK AISI simulated attack ranges (full network takeover). Mozilla's custom harness surfaced 271 Firefox bugs with Mythos while curl's out-of-box scan found only 1 — proving harness engineering dominates model choice by 270x. Google confirmed AI-built cybercrime tooling observed in the wild.

    271
    bugs found (harness)
    6
    sources
    • Mozilla bugs found
    • curl bugs found
    • AISI ranges cleared
    • Palo Alto products scanned
    • Capability tier
    1. Mozilla (custom harness)271
    2. curl (generic scan)1
  5. 05

    Data Infrastructure Architecture Shifts

    background

    DuckDB shipped Quack (HTTP client-server), making it viable as a shared analytics service — the Spark-on-Glue jobs that were always single-node can finally stop pretending. Kafka Share Groups report 8x throughput by decoupling consumers from partitions. Only 15% of orgs have data foundations ready for agentic AI.

    8x
    Kafka throughput gain
    1
    sources
    • Kafka scaling
    • Orgs AI-ready
    • Data modeling pain
    • DuckDB protocol
    1. Kafka Share Groups8
    2. DuckDB vs Glue2
    3. Agentic readiness15

◆ DEEP DIVES

  1. 01

    Anthropic's Double Bind: Metered Credits + 80x Capacity Miss = Your Budget Just Broke

    What Changed

    Two pricing events landed on the same day. First, Anthropic converted all programmatic Claude usage (Agent SDK, claude-p, GitHub Actions, third-party harnesses) from flat subscription to dollar-matched API credits. The 70-90% implicit subsidy on alt-harness usage is gone, effective immediately. Second, starting June 15, third-party tools (Zed, Conductor, OpenCode, T3 Code) get a separate credit bucket. No rollover. Overflow billed at API rates.

    The context is worse than the headline. Dario Amodei admitted at Code with Claude that Anthropic planned for 10x growth and got 80x. That gap forced an emergency lease of xAI's Colossus 1 cluster — 220,000+ GPUs across H100, H200, and GB200. ServiceNow's CDIO burned through the full-year Anthropic budget by May.


    Why This Hits Data Science Teams Hardest

    Agentic workloads are the most token-intensive thing a DS team ships. A reflection loop or tool-use chain can 10x token spend per task with no proportional quality gain. The thing the old budget model doesn't tell you is which tasks went agentic in the last sixty days. Metered pricing plus agentic intensity means March's forecast is off by a multiple, not a percentage.

    SurfaceBeforeAfterImpact
    Agent SDK / batch evalsFlat subscriptionMetered at API list5-10x cost increase on heavy usage
    Third-party tools (Jun 15)SubsidizedCredit cap, overflow at API rateCost model numerically wrong
    Claude Code limits5-hour cap, peak throttleDoubled, throttle removedPositive — more capacity
    Opus API rate limitsSqueezedSubstantially raisedPositive — but stale benchmarks

    Anthropic ships no native per-user or per-tool usage telemetry. You cannot see which tenant, feature, or prompt drove spend without building the instrumentation yourself. Observability has been offloaded to the customer.

    If the vendor cannot tell you which user burned the token, the problem is not cost — it is observability, and it is yours to fix before the next invoice.

    The Counter-Move

    OpenAI dropped a 2-month-free Codex enterprise switch promo on the same day Anthropic metered credits. Ramp's April data put Anthropic ahead of OpenAI 34.4% to 32.3%, the first lead change. OpenAI is pricing a counter-offensive at exactly the developers Anthropic just alienated. Treat it as a free evaluation window with asymmetric payoff.

    One methodology note. Any Claude benchmark run between mid-April and early May was measured during the capacity crisis. Those numbers are now stale. Colossus integration and rate-limit relaxations will shift serving conditions again. Re-baseline after the new caps land, then decide.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap by end of this sprint
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts within 2 weeks
    • Run OpenAI Codex evaluation under the 2-month free enterprise promo, instrumented with matched prompts and tool schemas
    • Re-run all Claude Code and Opus API benchmarks post-Colossus integration (expect stabilization by late May)
    • Retain vendor-abstracted agent layer — do not commit to Anthropic-exclusive harnesses until post-IPO pricing is clear

    Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Anthropic's ARR tripled

  2. 02

    59% Agentic: Your Eval Harness and Cost Model Are Both Measuring the Wrong Workload

    The Production Snapshot

    Vercel's AI Gateway index is the only multi-tenant production telemetry at scale we have right now: 200K teams, seven months. It reports 59% of all tokens are now agentic workloads, up from under 20% six months ago. The spend-versus-volume split tells the routing story. Anthropic captures 61% of spend through Opus on reasoning and planning nodes. Google captures 38% of volume through Flash on high-throughput utility calls.

    This is not a forecast. It is present-tense production behavior across 200K teams. An eval harness, cost model, or serving layer that still treats single-turn completions as the base case is optimizing for the 41% minority.


    What Breaks When the Majority is Agentic

    Eval harnesses built on single-turn benchmarks like MMLU and exact-match QA cannot score a planner that burns 40,000 tokens arguing with itself and then gives up. Final-answer accuracy lands at 90%+ in both cases. The thing this doesn't tell you is the cost path to get there, which is where the bill actually lives.

    Cost models fitted when input-output ratios sat at 3:1 are off by roughly 5x. Agentic traces run closer to 15:1 on input, with heavy cache reuse on some providers and none on others. A team that flagged this in Q2 watched one customer blow past monthly budget in 9 days.

    Multi-agent wins are real but expensive. Microsoft's MDASH (100+ agents) beat Anthropic's Mythos on CyberGym by decomposing into scan, adversarial debate, and PoC exploitation. No cost or latency comparison was published. The CyberGym result establishes that topology matters. It does not establish that topology is affordable.

    The Routing Architecture That's Already Winning

    Node TypeOptimal Model TierEvidence
    Planning / reasoningOpus-class (premium)61% of Vercel spend on Anthropic
    Utility calls (extraction, rewrite)Flash/Haiku-class (cheap)38% of Vercel volume on Google
    Routing decisionSmall classifier or confidence gateAbridge's 80M+ conversations pattern
    Verification / critiqueSecond model with narrower scopeMDASH debate stage; Abridge LLM judges
    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out — update the harness before you update the model.

    The Glean Counterpoint

    Glean's benchmark claims off-the-shelf MCP uses 30% more tokens and loses 2.5x head-to-head preference versus a retrieval-tuned knowledge graph on agentic tasks. This is vendor-published with no methodology disclosed, so treat the magnitudes as marketing. The failure mode it points at is well-known: MCP tool listings balloon context, and naive tool outputs return verbose blobs where a reranked snippet would suffice. SAP (€100M partner investment) and ServiceNow (Action Fabric) independently converged on Knowledge Graph + MCP as the enterprise agent architecture.

    Action items

    • Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Instrument per-node token cost in your agent graph and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models
    • Run a 1-hour MCP overhead spike: replay 100 production agent traces under current MCP vs. BM25+rerank baseline, measuring tokens and task-win-rate
    • Prototype decompose-debate-verify pipeline on one workload with auto-verifiable outputs (code gen, SQL, extraction) to test multi-agent lift at matched token cost

    Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs · AI Gateway data puts agentic workloads at fifty-nine percent

  3. 03

    Iceberg, Argo CD, and NGINX: The CVEs That Poison Your Training Data This Week

    The Expanded Attack Surface

    Last week's LiteLLM KEV entry read like a canary. The full disclosure this cycle says the ML infrastructure layer is bleeding at multiple points at once. Pick any reference architecture and most of the boxes have a CVSS 9.0+ open against them.

    The one I would prioritize is Apache Iceberg CVE-2026-42812 (CVSS 9.9). An attacker with table-write permission can redirect metadata to an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features. The thing this doesn't tell you, if you only read the CVSS line, is that default lakehouse observability watches row changes, not pointer changes. Standard monitoring will not see this.

    Combined with Apache Polaris credential-broadening (CVSS 9.9), there is a plausible chain from compromised analyst notebook to cross-tenant data theft to poisoned model weights.


    Priority Triage for Monday's Standup

    ComponentCVSSBlast Radius in ML StackPatch Window
    Apache Iceberg9.9Poisoned tables → corrupted trainingImmediate
    Apache Polaris9.9S3/GCS creds → cross-tenant accessImmediate
    Argo CD 3.2/3.39.6K8s Secrets (model-registry tokens, HF PATs, DB passwords)Immediate + rotate
    n8n (orchestration)9.8Workflow DB, OAuth sessionsThis week
    NGINX rewrite moduleRCEModel-serving ingress → registry credsThis week
    PraisonAIAuth bypassAgent runtime → secrets, tool-callsSame-day

    PraisonAI was weaponized within 4 hours of disclosure. That is not a patching window. Agent frameworks have crossed the adoption threshold where threat actors watch their CVE feeds. LangChain, CrewAI, and AutoGen belong in the same bucket until evidence says otherwise.

    The 18-year NGINX rewrite-module RCE affects every model-serving gateway behind NGINX, which in practice means most of them. A compromise there hands the attacker whatever service account pulls from the model registry.

    If the reference architecture runs Iceberg, Argo CD, or NGINX, there is patching homework before the next experiment and credential-rotation homework before the next sprint.

    The Training-Data Integrity Gap

    Iceberg matters for ML specifically because the failure mode is silent data corruption, not a hard error. Metadata pointers can be mutated without tripping the usual data-quality checks: row counts, null rates, schema tests. The poisoning surfaces as a subtle distribution shift weeks later in training data, or never, if no one is looking at per-slice drift.

    The same shape applies to the Ollama GGUF heap OOB read (CVSS 9.1). Any automation fetching GGUFs from HuggingFace mirrors or community repos without signature verification is a data-exfil pipe. This is the pickle problem reborn for the quantized-model era.

    Action items

    • Patch Apache Iceberg/Polaris catalog configurations immediately: enforce explicit storage credential scoping and add write-path allowlisting for metadata locations
    • Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every Kubernetes Secret in reachable namespaces (model-registry tokens, HF PATs, cloud credentials)
    • Patch NGINX across all inference gateways; audit rewrite-module usage in model routing configs this week
    • Add GGUF model signature verification to Ollama/model-loader pipeline and upgrade to ≥0.17.1
    • Inventory all agent frameworks (PraisonAI, LangChain, CrewAI) and pin versions + subscribe to CVE feeds; set patching SLA to same-day for any framework in a production path

    Sources:LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · PraisonAI, an open-source multi-agent framework, was weaponized within four hours · Mozilla shipped 271 bugs over the period · Agent stacks are now in scope for attackers

  4. 04

    Mythos Cleared the Ranges: Autonomous Exploit Chaining Is Now a Documented Capability

    The Threshold Crossing

    The UK AI Security Institute evaluated Anthropic's Mythos and OpenAI's GPT-5.5-cyber on autonomous cyber-offense tasks. Both completed full network takeovers in controlled environments. Mythos cleared both of AISI's two hardest tests. GPT-5.5-cyber cleared one. The prior Mythos generation topped out at advanced persistence. AISI is already building harder evals because the current ladder is saturating, which is the usual signal that a benchmark is about to stop measuring the bottleneck.

    This is the first time a national evaluation body has publicly stated that frontier models can complete an end-to-end attack chain without a human in the loop. The capability tier moved from advanced persistence to full network takeover in one model generation. That is a one-generation slope, not a trend, and worth weighting accordingly.


    The 271-vs-1 Harness Result

    The headline number is not the most useful data point. The controlled comparison is. Mozilla wrapped a custom agentic harness around their existing fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed the same model at curl with a generic scan and got 1 low-severity CVE and 4 false positives.

    Same weights. 270x difference in yield. The variable was the harness. Mozilla's wrapper produces reproducible test cases, scales across ephemeral VMs, and integrates with their security lifecycle. The ex-Google Distinguished Engineer running it said model choice was not the dominant factor, which matches what the numbers say.

    DimensionMythos + Mozilla HarnessMythos + Generic Scan
    Bugs surfaced271 (UAFs, sandbox escapes)5 claimed (1 true positive)
    False positive rateTooling-filtered (low)~80% (4 of 5)
    Harness integrationFuzzer-integrated, ephemeral VMsOut-of-box scan
    CI integration plannedYes — patches scanned on landingNone
    When a frontier model yields 271 bugs for one team and 1 CVE for another against the same language, the harness is the product, not the model.

    What This Means for Your Release Gate

    Refusal-rate harnesses are measuring the wrong bottleneck. Prompt-injection catch rate is not a proxy for end-to-end chain completion. The eval that matches the threat model mirrors AISI's tiered rubric: recon → initial access → lateral movement → persistence → exfiltration, run against every model upgrade that gets tool or shell access.

    Google's threat-intel team has now confirmed AI-built cybercrime tooling observed in the wild, which moves this from red-team hypothesis to detected incident. Palo Alto's AI-driven scanning surfaced serious vulnerabilities across 130+ products. Inference is cheap and orchestration is cheap. The expensive input was the human operator, and that is the input the model replaced.

    For anyone shipping agents with code or infrastructure access, the time-to-first-exploit metric has a model-generated baseline that is materially shorter than the human one. Patch SLAs calibrated against human attacker speed are now miscalibrated.

    Action items

    • Add a staged cyber-capability eval tier to your agent release gate: recon, initial access, lateral movement, persistence, exfiltration — run against any model with tool or shell access
    • Spike a domain-specific agentic security harness (modeled on Mozilla's pattern) against one complex internal service — measure true-positive rate and time-to-first-finding vs. existing tooling
    • Instrument agent action sequences in production logs and train a lightweight trajectory classifier on known-bad patterns (recon → lateral movement)
    • Compress critical-patch SLA to model-release cadence (monthly, not quarterly) for any service exposed to autonomous scanning

    Sources:Mythos cleared the AISI attack ranges this week · Two data points this week broke the AI cyber capability extrapolation · The UK AISI evaluations report full network takeovers · Mozilla shipped 271 bugs over the period · Google's report of a threat actor using AI to build cybercrime tooling · Anthropic published the case study this week

◆ QUICK HITS

  • Nous Token Superposition Training reports 2-3x wall-clock speedup at matched FLOPs with no inference architecture change, validated 270M → 10B-A1B MoE — spike on next continued-pretraining run

    Claude just metered your agent SDK calls

  • Datology hit +11.7 points on 20 VLM benchmarks at 2B params, beating InternVL3.5-2B by ~10 points at 17x less training compute via data curation alone

    Claude just metered your agent SDK calls

  • GPU demand-to-supply ratio is 4:1 at Nebius (684% YoY revenue, $3-3.4B 2026 guide) — lock H2 reserved capacity across 2+ providers before quarterly sellouts

    The 4:1 ratio is the headline number

  • DeepSeek V4 Pro scored 77/100 on FlowGraph at $2.25/task — between Opus 4.7 and Kimi K2.6 — re-benchmark for cost-per-successful-task on your workload

    The CyberGym result

  • TML-Interaction-Small reports 0.40s turn-taking latency vs. 0.57s Gemini-flash-live and 1.18s GPT-Realtime-2.0 — a 3x spread on the metric that determines voice naturalness

    TML is reporting 0.40 seconds of full-duplex latency

  • Duolingo publicly pegs AI-generated content slop at ~20% requiring human QC — a rare usable production quality anchor for calibrating your own generation pipeline

    Duolingo's twenty percent AI slop rate

  • Anthropic's /goal command in Claude Code delegates termination to Haiku reading only the conversation transcript — not filesystem or command output; wire PostToolUse hooks to pipe deterministic test results in

    Anthropic shipped a /goal command in Claude Code

  • AI agents bypass legacy bot detection at 81% success rate — retrain bot/abuse classifiers with agent-generated traffic and behavioral/request-graph features

    MCP plus knowledge graphs

  • Persona drift in LLM agents measurable within 8 conversational turns (Li et al. COLM 2024); embed a distinctive verbal tic as a zero-cost canary signal for production monitoring

    AI personas drift within eight turns

  • Update: Exposed AI inference endpoints (Ollama, LangServe, MCP) get indexed by Shodan within 3 hours; 23% of honeypot traffic now targets AI-specific paths (/v1/models, /.well-known/mcp.json)

    An Ollama endpoint exposed to the public internet

◆ Bottom line

The take.

Anthropic killed the flat-rate Claude subsidy the same week production telemetry confirmed 59% of all tokens are multi-turn agentic traces — meaning your inference budget is wrong by a multiple, not a margin. Simultaneously, Apache Iceberg (CVSS 9.9) can silently poison your training data via metadata redirect, PraisonAI was exploited in 4 hours, and Mythos became the first model to autonomously complete full network takeovers in AISI testing. The stack needs three things before next sprint: a metered cost model, a patched lakehouse, and an eval harness that measures the 59% of traffic you're actually serving.

— Promit, reading as Data Science ·

Frequently asked

How much will my Claude bill actually increase under the new metered pricing?
Expect 5-10x cost increases on heavy programmatic workloads, since the prior flat-rate effectively subsidized 70-90% of Agent SDK, GitHub Actions, and batch eval usage. The hit is amplified because 59% of production tokens now flow through multi-turn agentic traces that run 5-15x heavier than single-shot completions, so any forecast built on March's input-output ratios is off by a multiple, not a percentage.
What's the fastest way to get per-user and per-feature cost attribution on Claude usage?
Deploy an LLM gateway like LiteLLM or Portkey with per-user, per-feature tagging and daily token budget alerts — Anthropic ships no native usage telemetry, so without a gateway you'll discover overruns from the invoice rather than from monitoring. Tag at the tenant, feature, and prompt level so you can isolate which agentic loops are driving the heavy tail of token spend.
Why does Apache Iceberg CVE-2026-42812 specifically threaten ML training pipelines?
It lets an attacker with table-write permission redirect metadata pointers to an attacker-controlled S3 prefix, so subsequent queries read poisoned Parquet and training runs ingest silently corrupted features. Standard lakehouse observability watches row counts, null rates, and schema — not pointer changes — so the poisoning surfaces as subtle distribution shift weeks later, or never if no one is checking per-slice drift.
Should we switch eval harnesses given that 59% of tokens are now agentic?
Yes — single-turn benchmarks like MMLU and exact-match QA can't score a planner that burns 40,000 tokens looping and then gives up at the same final-answer accuracy as an efficient run. Add trajectory-level metrics: tool-call precision/recall, steps-to-completion, cost-per-successful-task, and recovery-from-error rate, since cost path is where the bill actually lives and where models diverge most.
Is the OpenAI Codex free trial worth running given switching costs?
Run it as an instrumented evaluation, not a migration — the 2-month enterprise promo is an asymmetric-payoff free window, and if Codex edges Claude on cost-per-successful-task under matched prompts and tool schemas, that becomes a durable routing signal rather than a wholesale switch. Keep a vendor-abstracted agent layer so the result informs routing rather than locking you in before Anthropic's October IPO clarifies pricing.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

Keep reading.