How do I project token burn under Anthropic's new metered pricing before the June 15 cutoff?

Audit every Claude-backed workload (Agent SDK, claude -p, GitHub Actions, batch evals, and third-party harnesses like Conductor, Zed, or OpenCode) and re-price its token volume at list API rates, since the 70-90% subsidy from Max plans is gone. Because Anthropic ships no native per-user or per-tool attribution, deploy an LLM gateway like LiteLLM or Portkey with per-feature tagging and daily budget alerts within two weeks. ServiceNow burned its full-year Claude budget by May without this visibility.

Why are single-turn evals no longer sufficient for production agents?

Vercel's gateway data across 200K teams shows 59% of all tokens are now agentic multi-turn traces with roughly 15:1 input-to-output ratios, versus the 3:1 most cost models assume. Single-turn accuracy benchmarks measure the minority of production traffic and miss tool-call precision, steps-to-completion, cost-per-successful-task, and error recovery — the metrics that actually drive agent unit economics. Cost forecasts built on last year's ratios are off by roughly 5x on spend.

Which training-efficiency result should I spike first, and which should I wait on?

Spike Nous Research's Token Superposition Training first: it claims 2-3x wall-clock speedup at matched FLOPs from 270M through 10B-A1B MoE scale with zero inference-time architecture change, so a successful 1B continued-pretraining replication compounds across every future run. Datology's curation result (+11.7 points at 17x less compute on VLMs) is the second priority. Wait on NVIDIA Star Elastic's 360x family-production claim until an independent lab confirms even 30x — headline numbers in that range routinely shrink 5-10x under outside eval.

If the same model yields 271 bugs for one team and 1 CVE for another, how should that change model-vs-harness investment?

Mozilla's custom agentic harness around Firefox fuzzing produced 271 bugs from Claude Mythos while a generic scanner against curl produced 1 low-severity CVE — same weights, ~270x yield gap driven entirely by scaffolding. That puts harness investment ahead of model selection by at least 50x for most internal tools. Build domain-specific scorers, reproducible test-case emitters, and ephemeral VM scaling before debating Claude vs. GPT-5 vs. Gemini for the next workload.

What's the cheapest variance reduction I can apply to my eval suite this quarter?

Rewrite one pairwise LLM-as-a-Judge eval as a decomposed LLM-as-a-Verifier: replace a single high-variance categorical judgment with k lower-variance binary verifications using token-level scoring. Measure tie rate and bootstrap CI width on a known A/B pair before and after — if CIs tighten at equal compute, roll the pattern across the rest of the suite. This is the same shift that moved human eval from Likert scales to pairwise preferences a decade ago.

Edition 2026-05-21 · read as Data Science

Anthropic Ends Claude Subscription Discount, 30-Day Cliff

Sources: 36
Words: 1,766
Read: 9min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

Anthropic converted Claude subscriptions to dollar-matched metered API credits this week, killing the 70-90% effective discount that powered most agent SDK and batch eval workloads — and a June 15 cliff cuts third-party tool credits entirely. Meanwhile, Vercel's production telemetry across 200K teams confirms 59% of all tokens are now agentic multi-turn traces. Your cost model was already wrong; it just became quantifiably wrong, with a 30-day deadline attached.

Key facts

Anthropic converted Claude subscriptions to dollar-matched metered API credits, eliminating the prior 70-90% effective discount for agent SDK and batch eval workloads.
On June 15, 2026, Claude usage through third-party tools like Conductor, Zed, OpenCode, and T3 Code moves to a separate credit bucket with no rollover, billing overflow at list API rates.
Vercel's AI Gateway telemetry across 200,000 teams shows 59% of all tokens are now agentic multi-turn traces, up from under 20% six months earlier.
Anthropic captures 61% of LLM spend via Opus while Google captures 38% of token volume via Flash, per Vercel's production data.
Mozilla's custom agentic harness with Claude Mythos Preview found 271 bugs in Firefox 150, while Daniel Stenberg's generic scanner with the same model found 1 low-severity CVE in curl plus 4 false positives.

◆ INTELLIGENCE MAP

01 · Anthropic's Triple Economic Reset · act now
Claude subscriptions now cap at dollar-equivalent API credits (killing the 70-90% alt-harness subsidy). June 15 cuts third-party tool credits with no rollover. ServiceNow burned its full-year Claude budget by May. OpenAI launched a 2-month-free Codex enterprise switch promo the same day.
Headline: 70-90% subsidy eliminated · 9 sources
Stats tracked: effective discount lost, third-party cliff date, capacity miss factor, Colossus GPUs leased
Timeline: May 7, credit metering announced · May 14, rate limits doubled · June 15, third-party tool credits severed · H2 2026, Colossus integration stabilizes
02 · 59% Agentic: Eval and Cost Models Obsolete · act now
Vercel's 200K-team production data shows 59% of tokens are agentic. Anthropic captures 61% of spend via Opus, Google captures 38% of volume via Flash. MDASH's 100+ agent ensemble beat single models on CyberGym. Single-turn eval harnesses now measure the minority of production traffic.
Headline: 59% of tokens now agentic · 6 sources
Stats tracked: agentic token share, Anthropic spend share, Google volume share, teams in dataset
Token split: agentic 59% · single-turn 41%
03 · Training Efficiency Step-Changes: 2-360x · monitor
Three research drops shift pretraining/post-training economics. Nous TST: 2-3x wall-clock speedup at matched FLOPs, validated to 10B. Datology: +11.7 pts on VLM benchmarks at 17x less compute. NVIDIA Star Elastic: 360x cheaper model-family derivation from one post-training run. TST has no inference-side change, so spike it first.
Headline: 17x compute reduction (VLM) · 1 source
Stats tracked: TST speedup, Datology compute savings, Star Elastic cost cut, Datology benchmark lift
Claimed multiples: Nous TST 3x · Datology VLM 17x · Star Elastic 360x
04 · AI Cyber Capability Crosses Full-Takeover Threshold · monitor
Anthropic's Mythos is the first model to clear both UK AISI simulated attack ranges: full network takeover, not just advanced persistence. Mozilla's custom harness surfaced 271 Firefox bugs with the same model that found only 1 in curl. The harness, not the model, determined the 270x yield difference. Patch SLAs tuned to CVE cadence are measuring the wrong clock.
Headline: 271 bugs found (harness) · 5 sources
Stats tracked: Mozilla bugs found, curl bugs found, AISI ranges cleared, PraisonAI exploit time
Bugs found: Mozilla (custom harness) 271 · curl (generic scan) 1
05 · Data Infrastructure: Single-Node Tools Go Multi-Node · background
DuckDB shipped Quack HTTP client-server protocol, making embedded analytics a shared service without custom API servers. Kafka Share Groups decouple consumer parallelism from partition count with 8x throughput at 32 instances. Lakehouse column stats remain the silent query-planner killer: stale or missing stats cost 3x before anyone notices.
Headline: 8x Kafka throughput gain · 1 source
Stats tracked: Kafka Share Group gain, instances tested, orgs ready for agents, data modeling pain
Chart values: DuckDB Quack 100 · Kafka Share Groups 8x · agentic AI readiness 15%

◆ DEEP DIVES

Anthropic's 30-Day Pricing Cliff: Metered Credits, June 15 Cutoff, and the OpenAI Counter-Offensive

What Changed This Week

Anthropic shipped three changes simultaneously that compound into one budgeting problem. First, all Claude subscriptions now convert to dollar-matched API credits across Agent SDK, claude -p, GitHub Actions, and third-party harnesses. The implicit 70-90% subsidy that power users extracted from Max plans is gone. Second, on June 15, Claude usage through third-party tools (Conductor, Zed, OpenCode, T3 Code) moves to a separate credit bucket with no rollover. Overflow bills at list API rates. Third, Dario Amodei conceded Anthropic planned for 10x growth and got 80x, which explains months of quiet degradation and the emergency lease of xAI's full 220,000-GPU Colossus 1 cluster.

Why This Breaks Your Stack

The metering change is not a gentle price increase. It is a structural reclassification of how programmatic usage is billed. Any batch eval, enrichment pipeline, or agent loop running on a flat subscription is now burning metered tokens at list price. ServiceNow's CDIO confirmed publicly that they burned their full-year Claude budget by May, after price hikes hit an enterprise with no native per-user telemetry.
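A minimal sketch of the re-pricing exercise, assuming you can export daily input and output token volume per workload. The per-million-token prices, the workload names, and the volumes below are hypothetical placeholders, not Anthropic's list prices; the 0.8 subsidy is simply the midpoint of the 70-90% range quoted above.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    input_mtok_per_day: float   # millions of input tokens per day
    output_mtok_per_day: float  # millions of output tokens per day

def monthly_list_cost(w: Workload, usd_per_mtok_in: float,
                      usd_per_mtok_out: float, days: int = 30) -> float:
    """Monthly spend if every token is billed at metered list rates."""
    daily = (w.input_mtok_per_day * usd_per_mtok_in
             + w.output_mtok_per_day * usd_per_mtok_out)
    return daily * days

def subsidized_cost(list_cost: float, subsidy: float) -> float:
    """Effective cost of the same volume under a subsidy (0.7 to 0.9 per the text)."""
    return list_cost * (1.0 - subsidy)

# Hypothetical prices (USD per million tokens) and volumes; replace with your own.
PRICE_IN, PRICE_OUT = 3.0, 15.0
workloads = [
    Workload("batch-evals", 40.0, 4.0),
    Workload("agent-loop", 150.0, 10.0),
]

for w in workloads:
    metered = monthly_list_cost(w, PRICE_IN, PRICE_OUT)
    before = subsidized_cost(metered, subsidy=0.8)
    print(f"{w.name}: ${metered:,.0f}/mo metered vs ~${before:,.0f}/mo effective before")
```

Running this per workload gives a first-order view of which pipelines move from negligible to material once the flat subscription no longer absorbs them.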

Anthropic provides no native per-user, per-tool usage attribution. Customers must wire external analytics to see who is consuming what.
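The attribution gap can be sketched independently of any particular gateway. The example below assumes the gateway can export one record per request with a cost and optional tags; the field names (`feature`, `user`, `cost_usd`) and the budgets are hypothetical, and this is not the LiteLLM or Portkey API.

```python
from collections import defaultdict

def spend_by_tag(records, tag="feature"):
    """Sum cost per tag value; requests missing the tag land in 'untagged'."""
    totals = defaultdict(float)
    for r in records:
        totals[r.get(tag, "untagged")] += r["cost_usd"]
    return dict(totals)

def budget_alerts(totals, daily_budget_usd, default_budget=0.0):
    """Return tags whose spend exceeds their budget.

    Tags without a configured budget get default_budget, so untagged
    spend surfaces as an alert instead of hiding.
    """
    return sorted(t for t, spent in totals.items()
                  if spent > daily_budget_usd.get(t, default_budget))

# Hypothetical one-day export from a gateway log.
records = [
    {"feature": "eval-suite", "user": "ci", "cost_usd": 42.0},
    {"feature": "support-agent", "user": "alice", "cost_usd": 18.5},
    {"feature": "eval-suite", "user": "bob", "cost_usd": 30.0},
    {"user": "carol", "cost_usd": 5.0},  # no feature tag
]
totals = spend_by_tag(records)
print(totals)
print(budget_alerts(totals, {"eval-suite": 50.0, "support-agent": 25.0}))
```

Defaulting unbudgeted tags to zero is a deliberate choice in this sketch: it makes missing tags noisy, which is usually what you want in the first weeks of instrumentation.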

The capacity story compounds the pricing story. The 80x miss means serving conditions your eval harness measured between mid-summer and now are contaminated for baselining. Rate limits on Opus are being raised, Claude Code 5-hour caps are doubling, and a heterogeneous fleet (H100 + H200 + GB200 via Colossus) means p95/p99 latency variance will increase during integration, not decrease. The thing this doesn't tell you is which slice of your traffic lands on which silicon.

The OpenAI Counter-Move

Sam Altman posted a 2-month-free Codex enterprise switch promo the same day Anthropic announced metering. Ramp's April data shows Anthropic edging OpenAI 34.4% vs 32.3%, the first apparent lead change. OpenAI is pricing explicitly against the developers Anthropic just alienated. The asymmetric free evaluation window expires in roughly 60 days.

What the Market Data Actually Shows

Metric	Anthropic	OpenAI	Signal Quality
Ramp B2B share	34.4%	32.3%	SMB card-spend biased
ARR trajectory	$9B → $30B+ in 4 months	Not disclosed	WSJ-sourced
Valuation	~$900B (offered)	$852B	Private marks
October IPO	Targeting	N/A	CFO hired

Ramp measures who gets billed, not token volume or production criticality. A 210-basis-point gap in a monthly snapshot is inside noise. The directional signal, that second-vendor adoption is now the default, is the actionable read.

Action items

Audit every Claude-backed workload (Agent SDK, claude -p, GitHub Actions, batch evals) and project token burn under new metered pricing by end of this sprint
Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts within 2 weeks
Activate OpenAI's 2-month Codex enterprise switch promo and run head-to-head against existing Claude eval harness using matched prompts
Re-baseline all Claude benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus integration stabilizes — do not ship workarounds built against degraded measurements

Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel's AI Gateway production index · Anthropic passed OpenAI in enterprise share this quarter

59% of Production Tokens Are Agentic — Your Eval Harness and Cost Model Measure the Wrong Thing

The Production Data

Vercel's AI Gateway, covering 200,000 teams over 7 months, reports that 59% of all tokens are now agentic — multi-turn, tool-calling traces rather than single-shot completions. Six months ago the figure was under 20%. The spend/volume split is the more interesting cut: Anthropic captures 61% of spend via Opus on expensive reasoning, while Google captures 38% of volume via Flash on cheap fan-out. The data shows no vendor loyalty. Customers route by task.

If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

Why Single-Turn Evals Are Now Measuring the Minority

Agentic workloads are not 'just more tokens.' They are bursty, multi-turn, and tool-calling, with 15:1 input-to-output ratios against the 3:1 most cost models assume. The cost function is latency-per-step times number-of-steps, plus retry risk from bad tool calls. A forecast built on last year's ratio undercounts input tokens by 5x for the same output, so spend can be off by up to roughly 5x depending on how input and output are priced, and the error is not symmetric across vendors. The median request stops being the right summary statistic. p95 does the work.
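The ratio shift can be worked through with a few lines of arithmetic. This is a sketch under stated assumptions: the prices are hypothetical, and the retry model (each failed tool call is retried until it succeeds, with independent failures) is a simplification introduced here, not something the Vercel data specifies. Moving from 3:1 to 15:1 multiplies input tokens per output token by 5; the spend multiple is about 2.5x when output is priced at five times input, about 4x when they are priced equally, and approaches 5x only as input tokens dominate the bill.

```python
def task_token_cost(output_tokens, in_out_ratio, usd_per_mtok_in, usd_per_mtok_out):
    """Token cost of one task given its input-to-output ratio."""
    input_tokens = output_tokens * in_out_ratio
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1e6

def spend_multiple(assumed_ratio, actual_ratio, usd_per_mtok_in, usd_per_mtok_out):
    """Actual spend divided by forecast spend for the same output volume."""
    forecast = task_token_cost(1_000_000, assumed_ratio, usd_per_mtok_in, usd_per_mtok_out)
    actual = task_token_cost(1_000_000, actual_ratio, usd_per_mtok_in, usd_per_mtok_out)
    return actual / forecast

def expected_task_latency(steps, latency_per_step_s, tool_failure_rate):
    """Steps times per-step latency, inflated by retries.

    Assumes each failed tool call is retried until it succeeds and failures
    are independent, so each step takes 1 / (1 - p) attempts on average.
    """
    return steps * latency_per_step_s / (1.0 - tool_failure_rate)

# Hypothetical prices in USD per million tokens.
print(spend_multiple(3, 15, 3.0, 15.0))     # output priced at 5x input
print(spend_multiple(3, 15, 3.0, 3.0))      # output priced the same as input
print(expected_task_latency(10, 2.0, 0.2))  # 10 steps, 2 s each, 20% tool failures
```

The vendor asymmetry mentioned above falls out of the same function: the wider a vendor's output-to-input price gap, the smaller the spend multiple for a given ratio shift.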

MDASH Validates Multi-Agent Decomposition

Microsoft's MDASH, a 100+ agent system, beat Anthropic's Mythos on the CyberGym vulnerability benchmark by decomposing into scan → adversarial debate → PoC construction stages. The thing this doesn't tell you is which stage drives the lift, or what it cost. There is no ablation and no cost comparison. The architectural pattern is consistent with ensemble priors — specialized agents with explicit disagreement tend to generalize better than monolithic calls — but consistent-with is not evidence-for.

Architecture	Best For	Cost Profile	Eval Approach
Single frontier model	Narrow, latency-sensitive	Predictable per-call	Single-turn accuracy
Tiered routing (Opus+Flash)	Mixed reasoning/throughput	20-40% savings at parity	Per-node quality + trajectory cost
Multi-agent decomposition	Complex, verifiable tasks	Higher per-task, lower per-error	Trajectory-level + tool-call F1

The Abridge Reference Architecture

Abridge runs 80M+ clinical conversations through a 'constellation of models': cheap triage in front, expensive reasoning behind, LLM judges calibrated against human annotators, memory externalized to event-driven stores rather than model weights. The transferable production pattern is the confidence-gated router. Only 40% of requests reach frontier-class reasoning. The rest are handled by 7-13B models at 5-10x lower cost.
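A confidence-gated router of the kind described can be sketched in a few lines. This is an illustration, not Abridge's implementation: the function names, the 0.8 threshold, and the stub models are all hypothetical, and it assumes the small model can return a calibrated confidence alongside its answer.

```python
from typing import Callable, Tuple

def route(request: str,
          small_model: Callable[[str], Tuple[str, float]],
          frontier_model: Callable[[str], str],
          threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the small model unless its confidence falls below threshold."""
    answer, confidence = small_model(request)
    if confidence >= threshold:
        return answer, "small"
    return frontier_model(request), "frontier"

# Stub models for illustration only: short requests are treated as easy.
def stub_small(request: str) -> Tuple[str, float]:
    return "triaged:" + request, (0.95 if len(request) < 20 else 0.4)

def stub_frontier(request: str) -> str:
    return "reasoned:" + request

requests = ["refill status", "summarize a long multi-problem visit with conflicting notes"]
tiers = [route(r, stub_small, stub_frontier)[1] for r in requests]
print(tiers)
print("escalation rate:", tiers.count("frontier") / len(tiers))
```

The escalation rate is the number to track against the 40% figure above; the threshold only means something if the confidence score has been calibrated against human-labeled outcomes.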

What the 30% MCP Overhead Means

Glean's benchmark is vendor-sponsored with undisclosed methodology. It claims off-the-shelf MCP uses 30% more tokens and loses 2.5x head-to-head against a tuned knowledge graph on agentic tasks. Treat that as a hypothesis, not a result. The failure mode is plausible regardless: MCP tool listings inflate context windows, and naive tool outputs return verbose blobs where a reranked snippet would do. The 30% number is one to falsify on your own workload this sprint, not to cite.
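Falsifying the 30% figure on your own traffic only needs paired token counts for the same traces under both setups. A minimal sketch, assuming you have already replayed each trace through a retrieval-first baseline and through the MCP setup and logged total tokens for each; the numbers below are made up for illustration.

```python
def token_overhead(baseline_tokens, candidate_tokens):
    """Aggregate relative token overhead of candidate vs. baseline on paired traces."""
    if len(baseline_tokens) != len(candidate_tokens):
        raise ValueError("traces must be paired one-to-one")
    return sum(candidate_tokens) / sum(baseline_tokens) - 1.0

# Hypothetical token totals for the same three traces under each setup.
baseline = [1000, 2000, 1500]
with_mcp = [1250, 2700, 1900]
print(f"overhead: {token_overhead(baseline, with_mcp):.1%}")
```

An aggregate ratio weights long traces more heavily than short ones; if your traffic is skewed, it is worth also looking at the per-trace distribution.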

Action items

Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
Instrument per-node token cost across your agent graph and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
Run a 1-hour spike measuring token overhead of current MCP/tool-calling setup vs. retrieval-first baseline on 100 production traces
Prototype a decompose-debate-verify pipeline on one auto-verifiable workload (code gen, SQL, extraction) and measure accuracy delta at comparable token cost
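The trajectory-level metrics in the first action item can be sketched as below. The record shape (`cost_usd`, `steps`, `success`) is hypothetical, and tool calls are compared as a set of (name, argument) pairs, which ignores ordering and repeated calls; a stricter harness may want sequence-aware matching.

```python
def tool_call_precision_recall(predicted, expected):
    """Set-based precision and recall over (tool_name, argument) pairs."""
    pred, exp = set(predicted), set(expected)
    true_pos = len(pred & exp)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(exp) if exp else 0.0
    return precision, recall

def cost_per_successful_task(trajectories):
    """Total spend, including failed runs, divided by the number of successes."""
    total = sum(t["cost_usd"] for t in trajectories)
    successes = sum(1 for t in trajectories if t["success"])
    return total / successes if successes else float("inf")

def mean_steps_to_completion(trajectories):
    """Average step count over successful trajectories only."""
    done = [t["steps"] for t in trajectories if t["success"]]
    return sum(done) / len(done) if done else None

# Hypothetical logged trajectories.
runs = [
    {"cost_usd": 0.5, "steps": 6, "success": True},
    {"cost_usd": 1.0, "steps": 14, "success": False},
    {"cost_usd": 0.5, "steps": 8, "success": True},
]
print(cost_per_successful_task(runs), mean_steps_to_completion(runs))
print(tool_call_precision_recall(
    [("search", "q1"), ("read", "f1"), ("read", "f2")],
    [("search", "q1"), ("read", "f1")],
))
```

Charging failed runs to the successful ones is the point of cost-per-successful-task: a cheap model that fails often can cost more per completed task than an expensive one that does not.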

Sources: The CyberGym result · Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs · Glean's benchmark

Three Training Efficiency Results That Change Q3 Unit Economics

The Landscape

Three research drops landed in the same cycle, each targeting a different cost bottleneck in the training pipeline. Individually, each is worth a spike. Together they suggest the marginal dollar in training has moved from raw compute to recipe optimization and data curation.

Nous Research: Token Superposition Training (TST)

TST reports 2-3x wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M through 10B-A1B MoE scale. The mechanism superimposes multiple token sequences during training, lifting effective throughput without touching the served model structure.

Why spike this first: nothing changes downstream at serving. If it replicates on a 1B continued-pretraining run, it is a free 2-3x on every subsequent training run. Replication risk is medium. Single source, but the claim is clean and falsifiable.

Datology: Data Curation Beats Compute

At 2B params, Datology reports +11.7 points on 20 VLM benchmarks, beating InternVL3.5-2B by about 10 points at 17x less training compute, purely through data curation. At 4B params, they hit near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B.

This is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation.

The serving win is real. Lower response FLOPs means cheaper inference on every request. The risk is benchmark-selection bias. Twenty benchmarks sounds broad, but the specific selection matters, and the thing this doesn't tell you is how the curated data performs on slices outside that suite.

NVIDIA Star Elastic

NVIDIA claims that one post-training run produces a family of reasoning-model sizes at 360x lower cost than pretraining the family, and 7x better than SOTA compression. The scale is not specified in the source.

Caveat: this is the kind of headline number that shrinks under independent eval. Even a 30x hold would restructure how teams produce size tiers for deployment. 360x from a lab-reported result deserves heavy discounting until reproduced.

Comparative Assessment

Work	Claim	Validated Scale	Inference Impact	Replication Risk	Spike Priority
Nous TST	2-3x wall-clock	270M → 10B MoE	None	Medium	First
Datology	+11.7 pts at 17x less compute	2B, 4B	Lower serving FLOPs	Medium	Second
Star Elastic	360x cheaper families	Not specified	Produces size tiers	High	Wait for repro

What This Means for GPU Budget Planning

Against the 4:1 demand-to-supply ratio at Nebius and capacity selling out quarterly, these results offer a hedge: recipe-level efficiency gains can partially offset a capacity squeeze. TST alone, if it replicates at 2x, means the same training budget buys twice the runs. Combined with Datology's curation thesis, the highest-leverage Q3 investment is likely a data quality team, not more GPU hours.

Action items

Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline within 3 weeks
Audit your VLM/multimodal training data pipeline for curation quality — measure data diversity, duplication rate, and quality score per shard
Lock H2 GPU reservations across 2+ providers before quarterly sellouts tighten further
Track Star Elastic replication attempts — do not plan model-family production around 360x until an independent lab confirms even 30x
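For the curation audit, the duplication rate per shard is the easiest of the three measurements to start with. The sketch below counts exact duplicates after whitespace and case normalization on text fields such as captions; it is a starting point under that assumption, not Datology's method, and it will not catch near-duplicates or duplicated images.

```python
import hashlib

def duplication_rate(records):
    """Fraction of records that repeat an earlier record after normalization."""
    if not records:
        return 0.0
    hashes = [
        hashlib.sha256(" ".join(r.lower().split()).encode("utf-8")).hexdigest()
        for r in records
    ]
    return 1.0 - len(set(hashes)) / len(hashes)

# Hypothetical shards of caption text.
shards = {
    "shard-000": ["a cat on a mat", "A  cat on a mat", "a dog in a park", "a red bird"],
    "shard-001": ["street at night", "harbor at dawn"],
}
for name, captions in shards.items():
    print(name, duplication_rate(captions))
```

Hashing keeps memory bounded per shard and makes it cheap to extend the check across shards by pooling the hash sets.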

Sources: Claude just metered your agent SDK calls

Harness Dominates Model: The 271-to-1 Ratio and What It Means for Agent Evaluation

The 270x Yield Gap

Two teams ran Claude Mythos Preview against large C codebases in the same window. Mozilla wrapped a custom agentic harness around their existing fuzzing infrastructure and reported 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed the same model at curl with a generic scanner and got exactly 1 low-severity CVE alongside 4 false positives.

Same weights. Roughly two orders of magnitude in yield. The variable that moved was the scaffolding.

When a frontier model yields 271 bugs for one team and 1 CVE for another against the same language, the harness is the product, not the model.

Why This Generalizes Beyond Security

The pattern maps onto ordinary ML evaluation. A team debating Claude 4 vs. GPT-5 vs. Gemini for an internal tool is optimizing the wrong variable. On this evidence, the gap puts harness investment ahead of model selection by at least 50x. Mozilla's harness emits reproducible test cases, scales across ephemeral VMs, and feeds existing signal pipelines. The model contributes capability. The harness converts capability into measurable signal. The thing the 271 number doesn't tell you is how much of that yield is unique versus duplicate clusters. Deduplication could only shrink the multiple, not grow it.

The AISI Threshold Crossing

Mythos is also the first model to clear both UK AISI simulated attack ranges, meaning full network takeover rather than advanced persistence alone. GPT-5.5-cyber cleared one of two. AISI is building harder tests because the current ladder is saturating, which is consistent with a discrete capability unlock rather than smooth interpolation. The closest analog is GPT-3.5 to GPT-4 on agentic benchmarks. It is possible the gap is narrower than two ranges suggest once variance is accounted for, but two-of-two vs. one-of-two is hard to explain by noise.

Operational implications for agent deployments

Old Assumption	New Reality	What Changes
Refusal-rate evals gate releases	Chain-completion evals required	Add staged attack-chain rubric (recon → access → lateral → persist → exfil)
Patch SLA tuned to CVE cadence	Model release cadence is faster	Compress critical-patch windows; add model-release-triggered security review
Model choice is the key decision	Harness architecture is the key decision	Invest in domain-specific scorers and reproducible test-case emitters over model swaps

LLM-as-Verifier: A Methodological Upgrade

A parallel finding reinforces the harness thesis. LLM-as-a-Verifier beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing evaluation into repeated binary verifications with token-level scoring. The mechanism is straightforward: one high-variance categorical judgment replaced by k lower-variance binary ones. The same argument moved human eval from Likert scales to pairwise preferences a decade ago.
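One way to read "repeated binary verifications with token-level scoring" is sketched below: each check is a yes/no question, the score for a check is the normalized probability of the "yes" token, and a candidate's overall score is the mean across k checks. The averaging rule, the tie margin, and the function names are assumptions made for illustration; the source does not specify the aggregation.

```python
import math

def p_yes(logprob_yes, logprob_no):
    """Normalize the two candidate-token log-probabilities into P(yes)."""
    m = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    e_yes, e_no = math.exp(logprob_yes - m), math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)

def verifier_score(checks):
    """Mean P(yes) over k binary checks, each given as (logprob_yes, logprob_no)."""
    return sum(p_yes(y, n) for y, n in checks) / len(checks)

def prefer(checks_a, checks_b, margin=0.0):
    """Return 'A', 'B', or 'tie' when the scores differ by no more than margin."""
    score_a, score_b = verifier_score(checks_a), verifier_score(checks_b)
    if abs(score_a - score_b) <= margin:
        return "tie"
    return "A" if score_a > score_b else "B"

# Hypothetical log-probabilities for two checks on each candidate.
checks_a = [(math.log(0.9), math.log(0.1)), (math.log(0.7), math.log(0.3))]
checks_b = [(math.log(0.6), math.log(0.4)), (math.log(0.5), math.log(0.5))]
print(verifier_score(checks_a), verifier_score(checks_b), prefer(checks_a, checks_b))
```

Because each check yields a continuous score rather than a category, exact ties become rare, which is one plausible reason the approach reports lower tie rates.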

Practical translation: rewrite one pairwise judge as a decomposed verifier, then measure the tie rate and the width of the bootstrap CI on a known A/B pair before and after. If the CIs tighten at equal compute, roll the pattern out to the rest of the suite. It is the cheapest variance reduction available this quarter.
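The before-and-after measurement can be sketched with the standard library alone. This assumes each eval run produces a list of per-item outcomes labeled "A", "B", or "tie"; the percentile bootstrap and the convention of scoring a tie as half a win are choices made here for illustration.

```python
import random

def tie_rate(outcomes):
    """Share of comparisons the evaluator could not separate."""
    return sum(1 for o in outcomes if o == "tie") / len(outcomes)

def bootstrap_ci_width(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Width of the percentile bootstrap CI for A's win rate (a tie counts 0.5)."""
    score = {"A": 1.0, "tie": 0.5, "B": 0.0}
    values = [score[o] for o in outcomes]
    rng = random.Random(seed)  # fixed seed so reruns are comparable
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[round((alpha / 2) * n_boot)]
    hi = means[round((1 - alpha / 2) * n_boot) - 1]
    return hi - lo

# Hypothetical outcomes on the same known A/B pair under each evaluator.
judge = ["A"] * 40 + ["tie"] * 35 + ["B"] * 25
verifier = ["A"] * 62 + ["tie"] * 5 + ["B"] * 33
for name, outcomes in [("judge", judge), ("verifier", verifier)]:
    print(name, tie_rate(outcomes), round(bootstrap_ci_width(outcomes), 3))
```

Compare the two widths only at matched item counts and matched evaluator compute; otherwise a tighter interval may just reflect more samples.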

Action items

Spike a domain-specific agentic harness on one internal tool (code review bot, data quality checker) modeled on Mozilla's pattern: reproducible test cases + ephemeral VM scaling + integration with existing pipelines
Add a staged cyber-capability tier to your agent release gate: recon → initial access → lateral movement → persistence → exfil, run against every model upgrade
Rewrite one pairwise LLM-judge eval as a decomposed binary verifier and compare tie-rate and CI width on a known A/B pair
Persist full agent trajectories (tool calls, intermediate state, file diffs) and audit a stratified random sample of 'passing' rollouts for reward hacking
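The audit in the last action item depends on sampling passing rollouts evenly across task types rather than in proportion to volume. A minimal sketch, assuming each persisted rollout is a dict with a pass flag and a stratum key; the field names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rollouts, key, per_stratum, seed=0):
    """Draw up to per_stratum passing rollouts from each stratum."""
    rng = random.Random(seed)  # fixed seed so the audit set is reproducible
    strata = defaultdict(list)
    for r in rollouts:
        if r["passed"]:
            strata[r[key]].append(r)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical persisted rollouts.
rollouts = [
    {"id": 1, "task": "sql", "passed": True},
    {"id": 2, "task": "sql", "passed": True},
    {"id": 3, "task": "sql", "passed": False},
    {"id": 4, "task": "codegen", "passed": True},
    {"id": 5, "task": "codegen", "passed": False},
    {"id": 6, "task": "extraction", "passed": False},
]
audit_set = stratified_sample(rollouts, key="task", per_stratum=1)
print([r["id"] for r in audit_set])
```

Sampling per stratum keeps rare task types in the audit, which is where reward hacking tends to go unnoticed if the sample simply follows traffic volume.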

Sources: Mozilla shipped 271 bugs over the period · Mythos cleared the AISI attack ranges · Two data points this week broke the AI cyber trend · The UK AISI evaluations report · PraisonAI auth bypass exploited in 4 hours · LLM-as-a-Verifier reframes evaluation

◆ QUICK HITS

DuckDB shipped Quack HTTP client-server protocol — Spark/Glue jobs under 100GB on 2-node clusters are now candidates for a Fargate + DuckDB + Terraform pattern at ~50% cost/latency reduction
DuckDB shipped a client-server mode this week
Duolingo publicly pegged AI-generated content slop at ~20% requiring human QC — the first honest production quality number at scale; benchmark your own acceptance rate against it
Duolingo's twenty percent AI slop rate
Update: PraisonAI (multi-agent framework) auth bypass exploited within 4 hours of CVE disclosure; agent orchestration frameworks now have same-day patching requirements, not quarterly
PraisonAI zero-day was popped in four hours
Cerebras IPO closed +70% at $311 with a $20B OpenAI commitment — first dollar-weighted proof that non-Nvidia inference silicon has a production buyer; benchmark CS-3 API this quarter
Cerebras IPO validates non-Nvidia silicon
TML-Interaction-Small reports 0.40s full-duplex turn-taking latency vs. 1.18s for GPT-Realtime-2.0 — a 3x gap on the metric that determines perceived voice naturalness; architecture is continuous multi-stream, not turn-based
TML is reporting 0.40 seconds of full-duplex latency
Only 15% of organizations have the data foundation for agentic AI per Fivetran; data quality/lineage is the #1 blocker cited by ~50% — gate agent projects on readiness scorecards, not model capability
DuckDB shipped a client-server mode this week
SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — largest open agentic trace corpus; pull now before licensing frictions accumulate
Claude just metered your agent SDK calls
Persona drift in LLM agents is measurable within 8 dialogue turns per COLM 2024 — add a verbal-tic canary to system prompts and instrument per-turn retention as a free drift signal
AI personas drift within eight turns
PCAOB/COSO guidance now requires deterministic execution and tamper-evident audit trails for ML in regulated finance — transformers are non-deterministic by default and the usual fixes (temp=0, fixed seeds) aren't sufficient for audit
The transformer underwriting models are outperforming
Nebius reported 684% YoY revenue growth with 4+ customers per GPU brought online — lock H2 GPU reservations across 2+ providers before quarterly sellouts tighten further
The 4:1 ratio is the headline number
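The persona-drift item above suggests a verbal-tic canary as a free drift signal. A minimal sketch of the instrumentation, assuming the system prompt instructs the agent to include a fixed phrase in every reply; the phrase and the transcript are hypothetical, and a substring match is the simplest possible detector.

```python
def canary_retention(assistant_turns, canary):
    """Per-turn retention flags and the 1-based turn where the canary first drops."""
    flags = [canary.lower() in turn.lower() for turn in assistant_turns]
    first_miss = next((i for i, kept in enumerate(flags, start=1) if not kept), None)
    return flags, first_miss

# Hypothetical transcript; the system prompt asked for "Ahoy" in every reply.
turns = ["Ahoy, here is the report.", "ahoy again, updated figures attached.", "Here is the plain reply."]
flags, first_miss = canary_retention(turns, "ahoy")
print(flags, first_miss)
```

Logging the first-miss turn per conversation gives a distribution you can compare against the eight-turn figure cited above.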

◆ Bottom line

The take.

Anthropic killed the flat-rate subsidy that powered most agent SDK workloads, Vercel's 200K-team production data confirms 59% of tokens are now agentic multi-turn traces, and three training-efficiency results (claimed gains of 2x to 360x) landed in the same week. Your cost model, your eval harness, and your training budget are simultaneously stale: the first has a June 15 deadline, the second is an already-shipping reality, and a 4:1 GPU demand ratio makes the third increasingly urgent.

