Data Science daily

Edition 2026-05-20 · read as Data Science

AnthropicEndsClaudeSubscriptionDiscount,OpenAICounters

Sources
36
Words
1,545
Read
8min

Topics Agentic AI LLM Inference AI Regulation

◆ The signal

Anthropic killed the 70-90% effective discount on programmatic Claude usage overnight — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses. On the same day, OpenAI dropped a 2-month-free Codex enterprise switch promo. If you haven't reconciled projected token burn against the new credit cap, you're making a pricing decision by default. June 15 is the cliff for third-party tool credits (Zed, Conductor, OpenCode). Re-run unit economics this sprint.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic's Triple Squeeze: Metering, Capacity, and June 15

    act now

    Anthropic metered all programmatic usage at list price, disclosed an 80x capacity miss (planned for 10x), and leased xAI's entire 220K-GPU Colossus 1 cluster. June 15 kills third-party tool subsidies. OpenAI is counter-offering with a 2-month free Codex promo. Any Claude benchmark from before May 7 is stale.

    80x
    growth vs 10x plan
    11
    sources
    • Colossus GPUs leased
    • Ramp B2B share
    • ARR growth (4mo)
    • Subsidy cliff
    1. Anthropic B2B Share34.4
    2. OpenAI B2B Share32.3
  2. 02

    59% Agentic Tokens: Eval and Cost Models Are Measuring the Minority

    act now

    Vercel's 200K-team production data shows 59% of tokens are now agentic multi-turn traces. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Single-turn eval harnesses and input/output cost ratios from 2024 are off by ~5x on spend forecasting for agentic workloads.

    59%
    tokens now agentic
    6
    sources
    • Anthropic spend share
    • Google volume share
    • MCP token overhead
    • No vendor loyalty
    1. Agentic tokens59
    2. Single-turn tokens41
  3. 03

    Lakehouse & MLOps Stack: New CVSS 9.6-9.9 CVEs Across Iceberg, Polaris, Argo CD

    monitor

    Beyond the previously-flagged LiteLLM/Ollama exploits, this week adds Apache Iceberg (CVSS 9.9, metadata write redirect), Polaris (9.9, credential broadening), Argo CD (9.6, plaintext K8s Secret extraction), and PraisonAI (auth bypass weaponized in 4 hours). The entire data/ML reference architecture now has a CVSS 9+ at every layer.

    4 hrs
    PraisonAI time-to-exploit
    4
    sources
    • Iceberg CVSS
    • Polaris CVSS
    • Argo CD CVSS
    • n8n/Kestra CVSS
    1. Iceberg9.9
    2. Polaris9.9
    3. n8n9.8
    4. Argo CD9.6
    5. Ollama9.1
  4. 04

    Training Efficiency: TST 2-3x, Datology 17x, Star Elastic 360x

    monitor

    Three research drops landed in one week that change unit economics. Nous TST reports 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated 270M→10B). Datology hit +11.7 pts on VLM benchmarks at 17x less compute via data curation alone. NVIDIA Star Elastic claims one post-training run produces a model family at 360x lower cost.

    17x
    Datology compute savings
    2
    sources
    • TST speedup
    • Star Elastic saving
    • Datology VLM lift
    • TST scale tested
    1. NVIDIA Star Elastic360
    2. Datology VLM17
    3. Nous TST3
  5. 05

    AI Cyber Capability Crosses AISI Threshold — Harness Beats Model 271:1

    background

    Mythos is the first model to clear both AISI simulated attack ranges. Mozilla's custom harness found 271 Firefox bugs with an older Claude model; the same model found 1 CVE in curl without the harness. The 271:1 delta proves eval/harness investment dominates model selection by 50x+ on real codebases. Palo Alto surfaced serious vulns across 130+ products.

    271:1
    harness vs no-harness bugs
    6
    sources
    • AISI ranges cleared
    • Mozilla Firefox bugs
    • curl bugs (same model)
    • Palo Alto products
    1. With custom harness271
    2. Generic scan1

◆ DEEP DIVES

  1. 01

    Anthropic's Triple Squeeze: Your Claude Budget Just Broke

    What Happened

    The pricing change is the surface event, and it compounds with two others. Anthropic metered all programmatic Claude usage: subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% discount power users were getting is gone. At Code with Claude, Dario Amodei disclosed that Anthropic planned for 10x growth and got 80x, which is the most plausible read on the quality degradation users have been logging since mid-April. The emergency fix is leasing xAI's entire Colossus 1 cluster (220,000+ GPUs), from the CEO who called Anthropic 'misanthropic' three months ago.

    The June 15 Cliff

    Starting June 15, Claude usage through third-party tools (Conductor, Zed, OpenCode, T3 Code) draws from a separate credit bucket equal to plan value. No subsidized tokens, no rollover, overflow billed at API rates. Claude Code and the Claude app are unaffected. Weekly rate limits get bumped fifty percent for two months as a softener.

    SurfaceBeforeAfter (announced)
    Claude Code limits5-hour capDoubled
    Peak-hours throttleReduced for Pro/MaxRemoved
    Opus API rate limitsSqueezed'Substantially raised'
    Third-party toolsSubsidizedCredit-capped, overflow at list

    The Counter-Offensive

    OpenAI dropped a 2-month-free Codex enterprise switch promo the same day. Ramp's April data shows Anthropic edging OpenAI 34.4% vs 32.3% in B2B spend, the first apparent lead change. The thing the Ramp number doesn't tell you is whether the lead survives the metering change. OpenAI is pricing a response against the exact developers Anthropic just alienated, and a free evaluation window is an asymmetric-payoff bet for the buyer.

    Why This Matters For Your Stack

    ServiceNow already burned its full-year Claude budget by May after the price hikes. Anthropic provides no native per-user or per-tool usage telemetry and no SLAs, which is unusual for a dependency on the critical path. The vendor dashboard cannot attribute spend to a tenant or a prompt. That instrumentation has to come from the consumer side.

    Any Claude benchmark from before May 7 is stale. Any cost model built before the metering change is numerically wrong, not directionally wrong.

    The Capacity Aftermath

    The 80x growth miss means observed 'quality degradation' was quantization, smaller-model routing under load, and scheduler unfairness, not product changes. The Colossus integration (H100, H200, GB200 heterogeneous fleet) will produce p95/p99 variance during transition. Tail latency gets weirder before it stabilizes, and that is the metric to watch, not the mean.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) against the new credit cap and flag jobs that exhaust credits before month-end
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts in front of all Claude traffic
    • Run a 2-month Codex evaluation under OpenAI's free promo with matched prompts and tool schemas against your existing Claude harness
    • Re-baseline Claude throughput and latency benchmarks after Colossus integration stabilizes (target: late May)
    • Reforecast Claude spend for teams using Zed, Conductor, or OpenCode — model post-June-15 scenario where overflow hits API rates

    Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Anthropic passed OpenAI in enterprise share

  2. 02

    59% Agentic Tokens: Your Eval Harness and Cost Model Are Measuring the Minority

    The Production Data

    Vercel's AI Gateway, covering 200,000 teams over 7 months, reports 59% of production tokens as agentic now: multi-turn, tool-calling traces rather than single-shot completions. Six months ago the figure was under 20%. The composition has shifted faster than at any point since chat endpoints replaced completion endpoints.

    The Spend-Volume Divergence

    The provider landscape has split along two axes that do not line up:

    ProviderShare of SpendShare of VolumePrimary ModelImplied Role
    Anthropic61%OpusReasoning/planning nodes
    Google38%FlashHigh-throughput utility calls
    Others~39%~62%MixedMixed

    Teams are already routing across providers: expensive models for reasoning, cheap models for throughput. The Gateway data shows no vendor loyalty. A serving layer hardcoded to one provider SDK is out of step with what 200K production teams are actually running.

    Why Existing Harnesses Fail

    Most agent eval harnesses score single-turn responses against reference answers. That was the right call in 2023. Agentic traces now run at a 15:1 input-to-output ratio, with heavy cache reuse on some providers and none on others. A forecast built on last year's 3:1 ratio is off by roughly 5x on spend. The metrics that matter now:

    • Cost-per-successful-task, not cost-per-token
    • Tool-call precision and recall, not single-turn accuracy
    • Steps-to-completion and retry rates
    • Recovery-from-error rate across the trajectory

    Glean's benchmark claims off-the-shelf MCP burns 30% more tokens than retrieval-tuned knowledge graphs on agentic tasks. Vendor-published with no methodology, so treat it as a hypothesis. The thing this doesn't tell you is whether the gap holds outside Glean's own task distribution. The failure mode is real, though: MCP tool listings balloon context windows, and naive tool outputs return verbose blobs where a reranked snippet would do.

    The Routing Architecture That Wins

    Abridge (80M+ clinical conversations, 250 health systems) discloses the pattern: cheap fast model for triage, large model for reasoning only when needed. Post-training on proprietary data beats frontier on narrow tasks at lower cost. The plausible cost reduction envelope from tiered routing is 20-40% at constant trajectory completion rate. If it lands at half that, the routing layer still pays for itself.

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

    Action items

    • Add trajectory-level metrics (task success, tool-call F1, steps-to-completion, cost-per-successful-task) to your eval harness this sprint
    • Instrument per-node token cost in your agent graph and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models
    • Run a 1-hour spike measuring token overhead of your current MCP/tool-calling setup vs. a retrieval-first baseline on 100 production traces
    • Add LLM-judge-to-human-annotator agreement as a tracked quarterly metric in your eval pipeline

    Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs

  3. 03

    Three Training Efficiency Results That Change Your Q3 Compute Budget

    The Drops

    A cluster of pretraining results landed in one week, each nudging training unit economics in a direction that matters for any team running pretraining, fine-tuning, or distillation.

    WorkClaimScale ValidatedInference ImpactReplication Risk
    Nous Research TST2-3x wall-clock at matched FLOPs270M → 10B-A1B MoENone — no architecture changeMedium; single-source
    NVIDIA Star Elastic360x cheaper model-family derivationNot specifiedProduces family of sizesHigh; lab-reported
    Datology VLM curation+11.7 pts, 17x less compute2B and 4BLower response FLOPsMedium; benchmark-selection

    Why TST Is the One to Spike First

    Token Superposition Training is a pretraining recipe change with no inference-side downstream. If it replicates, it's a free 2-3x speedup. The mechanism, superposing multiple token targets per forward pass, has been validated from 270M to 10B-A1B MoE scale. No architecture change means existing serving infrastructure stays untouched. This is the cleanest 'try it and know within a week' experiment on the list.

    Datology: The Marginal Dollar Moved to Curation

    Datology's result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. Their 2B model beat InternVL3.5-2B by ~10 points across 20 benchmarks while using 17x less training compute, and their 4B matched near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B. The serving cost reduction is real. The thing this doesn't tell you is whether the curation advantage is durable or copied within two quarters.

    Star Elastic's 360x Needs Discount

    NVIDIA's claim that one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining a family is the kind of headline number that always shrinks under independent evaluation. Even a 30x hold would restructure how size tiers get produced. But 360x→30x is a common compression pattern in lab-reported claims. Wait for replication before planning around it.

    Combined Implication

    Together these results suggest that Q3 2026 training runs priced at Q1 2026 efficiency assumptions are systematically overspending. A team about to allocate $X to a continued-pretraining run should first check whether TST halves the wall-clock, whether curation halves the data requirement, and whether Star Elastic removes the need to pretrain multiple sizes. The order matters: TST replicates fastest, Datology requires a curation pipeline you may not have, and Star Elastic is still on the wait-for-replication shelf.

    The marginal dollar in VLM training moved from compute to curation. The marginal dollar in pretraining moved from more FLOPs to better recipes. Budgets that don't reflect this are overpaying.

    Action items

    • Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this month
    • Audit your VLM/multimodal training data pipeline for curation quality — measure per-shard quality scores and deduplicate before the next training run
    • Hold off on pretraining separate model size tiers until Star Elastic publishes reproducible details
    • Pull SWE-ZERO-12M-trajectories (112B tokens, 12M trajectories, 3K repos) and begin preprocessing before licensing frictions appear

    Sources:Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week

  4. 04

    Apache Iceberg/Polaris CVSS 9.9 and Argo CD 9.6: The Lakehouse Trust Boundary Shrank

    What's New This Week

    On top of the LiteLLM (KEV-listed) and Ollama exploits flagged last week, four more critical CVEs landed on the data and ML stack in the same window. The blast radius is the infrastructure most DS teams treat as trusted by default, which is the part that matters.

    The New CVEs

    ComponentCVE / CVSSAttack ClassBlast Radius
    Apache IcebergCVE-2026-42812 / 9.9Metadata write redirectPoisoned tables, corrupted training data
    Apache PolarisCVE-2026-42809/10/11 / 9.9Credential broadeningS3/GCS creds, cross-tenant access
    Argo CD 3.2.x/3.3.xCVE-2026-42880 / 9.6Missing authorizationPlaintext K8s Secret extraction
    PraisonAICVE-2026-44338Auth bypassAgent runtime, connected DBs
    NGINX rewrite18-year latent RCEUnauthenticated RCEModel-serving ingress

    Why Iceberg CVSS 9.9 Is the Worst

    An attacker with table-write permission can point metadata at an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features. The thing CVSS doesn't tell you is that most lakehouse observability does not detect metadata pointer mutations. Default logging covers row changes, not pointer changes. Combined with Polaris credential-broadening, there is a plausible path from compromised analyst notebook to cross-tenant data theft.

    Argo CD: Every Secret Is Disclosed

    For teams running training jobs or model services on Argo CD 3.2 or 3.3, treat every K8s Secret in reachable namespaces as disclosed until patched and rotated. That includes model-registry tokens, HuggingFace PATs, DB passwords, and cloud credentials. The rotation costs more than the patch. It is also the part that actually closes the incident.

    PraisonAI: Agent Frameworks Are First-Class Targets

    PraisonAI's auth bypass was weaponized within 4 hours of CVE disclosure. Four hours is the operational signal: agent frameworks have crossed the adoption threshold where threat actors monitor their disclosure feeds. Any team prototyping with LangChain, CrewAI, AutoGen, or PraisonAI whose runtime holds API keys sits in the same blast radius.

    Draw a reference architecture for a modern data team, throw darts at it, and every throw hits a CVSS 9.0 or higher this week.

    Action items

    • Patch Argo CD to ≥3.2.12/≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read — including model-registry tokens, HF PATs, and cloud credentials
    • Audit Iceberg/Polaris catalog configurations: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
    • Inventory all agent frameworks in use and pin versions; patch PraisonAI immediately if present; add CVE-feed subscriptions for LangChain, CrewAI, AutoGen
    • Patch NGINX across all inference gateways and audit rewrite-module usage in model routing configs

    Sources:LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · PraisonAI, an open-source multi-agent framework · SANS AtRisk newsletter

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server mode — viable Spark/Glue replacement for sub-100GB single-node ETL; spike one job on ECS Fargate + DuckDB + Terraform pattern

    DuckDB shipped a client-server mode this week

  • Kafka Share Groups decouple consumer parallelism from partition count with ~linear 8x scaling at 32 instances on I/O-bound workloads — benchmark your most partition-bound consumer group

    DuckDB shipped a client-server mode this week

  • Only 15% of organizations have the data foundation for agentic AI at scale (Fivetran); data quality/lineage is #1 blocker cited by ~50% — use as gate before greenlighting agent projects

    DuckDB shipped a client-server mode this week

  • Microsoft agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — concrete alternative to flat vector top-k

    DuckDB shipped a client-server mode this week

  • DeepSeek V4 Pro scored 77/100 on FlowGraph at $2.25/task, sitting between Opus 4.7 and Kimi K2.6 — re-benchmark for cost-per-successful-task on your production distribution

    The CyberGym result is the kind of finding

  • PyTorch 2.12 shipped MX quantization export — the highest-leverage framework change this week for teams under inference cost pressure deploying low-precision models

    The CyberGym result is the kind of finding

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-3.1-flash-live and 1.18s GPT-Realtime-2.0 — full-duplex voice is becoming its own model class

    TML is reporting 0.40 seconds of full-duplex latency

  • Opus 4.7 tripled image processing costs — re-price multimodal inference budgets and evaluate GPT-4V/Gemini head-to-head on your actual vision workload

    Anthropic passes OpenAI in B2B

  • Compute demand-to-supply ratio is 4:1 at Nebius (684% YoY revenue, $3-3.4B 2026 guidance vs $530M 2025) — lock H2 GPU reservations across 2+ providers before quarterly sellouts

    The 4:1 ratio is the headline number

  • Duolingo publicly pegs AI-generated content rejection rate at ~20% requiring human QC — anchor your own generation pipeline acceptance-rate targets against this production baseline

    Duolingo's twenty percent AI slop rate

  • Update: LLM-as-a-Verifier reportedly beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing criteria into repeated binary verifications at token granularity

    An Ollama endpoint exposed to the public internet gets picked up by Shodan

  • PCAOB/COSO guidance now requires deterministic execution and tamper-evident audit trails for ML in regulated finance — audit GPU non-determinism, seed management, and model-artifact immutability

    The transformer underwriting models are outperforming

◆ Bottom line

The take.

Anthropic killed the implicit subsidy on programmatic Claude usage the same week Vercel confirmed 59% of production tokens are agentic — meaning your cost model and your eval harness are both measuring last year's workload. Meanwhile, Apache Iceberg, Polaris, and Argo CD shipped CVSS 9.9 vulnerabilities that can poison training data or disclose every secret in your cluster. Re-run unit economics before June 15, rebuild the eval harness around trajectories not turns, and patch the lakehouse before the next training job ingests corrupted features.

— Promit, reading as Data Science ·

Frequently asked

What changed with Claude subscriptions converting to API credits?
Anthropic ended the implicit 70-90% discount on programmatic Claude usage. Subscriptions now convert to dollar-matched API credits across the Agent SDK, claude-p, GitHub Actions, and third-party harnesses, so workloads that previously ran under flat-rate plans now meter against a hard credit cap with overflow billed at list API rates.
Why does June 15 matter for teams using Zed, Conductor, or OpenCode?
Starting June 15, Claude usage through third-party tools draws from a separate credit bucket equal to plan value, with no subsidized tokens and no rollover. Overflow bills at API rates. Claude Code and the Claude app are exempt. Teams need a reforecast before sprint planning locks, since spend on those surfaces can spike materially overnight.
How should eval harnesses change given that 59% of production tokens are now agentic?
Single-turn accuracy scoring no longer represents the majority workload. Add trajectory-level metrics: cost-per-successful-task, tool-call precision and recall, steps-to-completion, retry rate, and recovery-from-error rate. Vercel's data across 200K teams shows agentic share tripled in six months, and harnesses scoring single-turn responses against reference answers are measuring the minority case.
Which training efficiency result is worth spiking first?
Token Superposition Training from Nous Research. It's a pretraining recipe change validated from 270M to 10B-A1B MoE scale, claiming 2-3x wall-clock at matched FLOPs with no architecture change and no inference-side impact. Existing serving infrastructure stays untouched, making it a clean week-long experiment. Datology's curation result and NVIDIA's Star Elastic 360x claim need more replication before betting budget on them.
What's the most urgent action from this week's CVE cluster on the data stack?
Patch Argo CD to ≥3.2.12/≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read, including model-registry tokens, HuggingFace PATs, DB passwords, and cloud credentials. The missing-authorization flaw (CVSS 9.6) means every secret was readable by any authenticated user, so rotation, not just patching, is required to close the incident.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

Keep reading.