How do I quickly estimate the cost overrun on existing Claude agent pipelines?

Multiply your prior subscription-flat token burn by roughly 3-10x to approximate the now-unsubsidized API-credit cost, since the implicit subsidy on Agent SDK, GitHub Actions, and third-party harness usage was 70-90%. Then re-project against the dollar-matched credit cap and the June 15 third-party split (no rollover). Any pipeline whose monthly trace volume × per-trace token count exceeds the credit pool is in overrun starting now.

What trajectory-level metrics should replace single-turn pass@1 in an agent eval harness?

Track tool-call precision and recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate, and per-node token cost across the agent graph. These capture the multi-turn, tool-calling, long-context behavior that dominates the 59% of production tokens now flowing through agentic traces. Pass@1 measures the minority of traffic and hides 5-15x compounding from retries and tool loops.

Where does the in-house post-training versus frontier-API crossover sit after this week's results?

It moved lower. Datology shows VLM curation beating a 2B baseline by ~10 points at 17x less compute, TST claims 2-3x wall-clock at matched FLOPs with no inference-side change, and NVIDIA's Star Elastic produces a model-size family from one post-training run. For teams with >100K proprietary task examples, SFT or DPO on a 7-13B base is now plausibly cheaper than frontier API calls at production volume.

Why is the Iceberg CVE more dangerous than typical lakehouse patch noise?

CVE-2026-42812 (CVSS 9.9) lets an attacker with table-write permission redirect table metadata to an attacker-controlled S3 prefix, so subsequent queries read poisoned Parquet and training runs silently ingest corrupted features. Default lakehouse observability tracks row changes, not metadata-pointer mutations, so the poisoning does not trip standard alerts. Combined with Apache Polaris credential-broadening, it opens a path from analyst notebook to cross-tenant theft and training-data poisoning.

Is OpenAI's two-month free Codex enterprise promo worth the integration effort to evaluate?

Yes, because it is a zero-cost asymmetric option. Run it through your own eval harness with matched prompts and tool schemas, scoring on trajectory-level metrics rather than pass@1. Even if you stay on Claude, the comparison gives you negotiating leverage and a measured fallback for the next capacity or pricing shock; skipping it leaves the comparison unmeasured at exactly the moment Anthropic's serving conditions are non-stationary due to the Colossus 1 integration.

Edition 2026-05-27 · read as Data Science

AnthropicEndsClaudeSubscriptionSubsidyforAgentPipelines

Sources: 36
Words: 1,553
Read: 8min

Topics Agentic AI LLM Inference AI Regulation

◆ The signal

Vercel's production traces show 59% of tokens are now agentic, and agentic traces compound 5-15x per task against single-shot baselines. Anthropic picked this week to convert Claude subscriptions into dollar-matched API credits across the Agent SDK, GitHub Actions, and third-party harnesses, which removes the 70-90% effective subsidy those pipelines were quietly running on. Third-party tool credits split off further on June 15, with no rollover. Any pipeline still budgeted on flat-subscription economics is carrying an overrun it has not measured yet.

Key facts

Vercel's AI Gateway telemetry across 200,000 teams shows agentic workloads now account for 59% of token volume, up from under 20% six months earlier.
Anthropic converted Claude subscriptions into dollar-matched API credits across the Agent SDK, GitHub Actions, and third-party harnesses, eliminating a 70-90% effective subsidy on those pipelines.
On June 15 third-party tool credits (Conductor, Zed, OpenCode, T3 Code) split into a separate bucket with no rollover, while weekly limits were raised 50% as a softener.
Anthropic is absorbing xAI's Colossus 1 cluster of 220,000+ H100/H200/GB200 GPUs, roughly 45% of xAI's capacity, producing heterogeneous serving infrastructure.
Apache Iceberg CVE-2026-42812 (CVSS 9.9) allows metadata write redirection to attacker-controlled S3 prefixes, and PraisonAI's CVE-2026-44338 was weaponized within 4 hours of disclosure.

◆ INTELLIGENCE MAP

01
Anthropic's Triple Squeeze: Metering + Capacity + June 15 Deadline
act now
Anthropic metered all programmatic Claude usage (killing 70-90% alt-harness discount), admitted 80x growth vs 10x plan, leased xAI's entire 220K-GPU Colossus 1 cluster, and scheduled June 15 third-party credit separation. ServiceNow burned its full-year budget by May. Single-provider risk now has a number: 8x capacity miss.
80x
growth vs planned capacity
9
sources
- Growth vs plan
- Colossus GPUs leased
- Third-party deadline
- Enterprise share
- Opus image cost hike
1. Planned growth10
2. Actual growth80
02
59% Agentic Threshold: Your Eval Harness Measures the Minority
act now
Vercel's AI Gateway (200K teams, 7 months) shows 59% of production tokens are multi-turn agentic traces. Anthropic captures 61% of spend via Opus, Google captures 38% of volume via Flash. Single-turn eval harnesses and single-model routing are now measuring and optimizing the minority of production traffic.
59%
tokens are agentic
5
sources
- Agentic token share
- Anthropic spend share
- Google volume share
- No vendor loyalty
- MCP token overhead
1. Agentic tokens59
2. Single-turn tokens41
03
Training Efficiency Step-Change: 2-360x Cost Reductions
monitor
Three independent results this week move pre-training and post-training unit economics: Nous TST delivers 2-3x wall-clock speedup at matched FLOPs (validated 270M→10B MoE), Datology beats InternVL3.5-2B by 10pts at 17x less compute via curation alone, and NVIDIA Star Elastic produces model-size families at 360x less cost than pretraining each.
17x
compute reduction (VLM)
2
sources
- TST speedup
- Datology compute cut
- Star Elastic cost cut
- TST scale validated
- Datology benchmark gain
1. Nous TST3
2. Datology VLM17
3. Star Elastic360
04
Lakehouse & Agent Framework CVEs Cascade
monitor
Apache Iceberg (CVSS 9.9) lets attackers redirect table metadata to poisoned S3 prefixes. Polaris (CVSS 9.9) broadens credentials cross-tenant. Argo CD (CVSS 9.6) exposes K8s Secrets to low-priv users. PraisonAI was weaponized within 4 hours of disclosure. Draw your data architecture and every layer has a 9.0+ CVE this cycle.
4hrs
PraisonAI time-to-exploit
4
sources
- Iceberg CVSS
- Polaris CVSS
- Argo CD CVSS
- PraisonAI exploit time
- NGINX RCE age
1. 01Iceberg metadata redirect9.9
2. 02Polaris credential broadening9.9
3. 03Argo CD secret exposure9.6
4. 04n8n SQLi + OAuth theft9.8
5. 05NGINX rewrite RCE9.1
05
Compute Supply Crunch: 4:1 Demand Ratio, $56B Cerebras IPO
background
Nebius reports 4+ customers per GPU brought online, 684% YoY revenue growth, and $3-3.4B 2026 guidance. Cerebras IPO'd at $56B with OpenAI's $20B commitment validating non-Nvidia inference silicon. Cisco AI orders jumping from $5B to $9B. Memory hardware shortage driving product redesigns. Reserved capacity is the only hedge.
$56B
Cerebras IPO valuation
5
sources
- Nebius demand ratio
- Nebius YoY growth
- Cerebras valuation
- OpenAI commitment
- Cisco AI order growth
1. Nebius 2025 revenue530
2. Nebius 2026 guide3200

◆ DEEP DIVES

01
Anthropic's Triple Squeeze: Re-Price Everything Before June 15
The New Economics
The Claude pricing changes this week compound; the metering shift is the load-bearing one, and three things landed together:
1. Programmatic metering: Claude subscriptions now convert to dollar-matched API credits for Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% subsidy on alternative-harness usage is gone.
2. June 15 third-party credit separation: Usage through Conductor, Zed, OpenCode, and T3 Code gets a separate credit bucket. No rollover, overflow at API rates. Weekly limits bumped 50% as a two-month softener.
3. Capacity-driven degradation: Dario admitted planning for 10x and getting 80x. Rate limits doubled this week, peak-hours throttling removed for Pro/Max, Opus limits 'substantially raised' — all signs that the serving fleet was oversubscribed.
The Capacity Fix That Changes Your Baseline
Anthropic is absorbing xAI's Colossus 1 cluster: 220,000+ GPUs spanning H100, H200, and GB200. That is roughly 45% of xAI's current capacity moving providers overnight. The integration will produce heterogeneous serving infrastructure, which means p95/p99 latency will get weirder before it stabilizes.
Any Claude benchmark from before May 7 is stale. The serving conditions just shifted again, and the serving conditions will shift once more when Colossus integration completes.
ServiceNow's CDIO burned through the full-year Claude budget by May. National Life Group's CIO called Claude 'great for consumer usage but not great for companies' wanting per-user monitoring. Anthropic provides no native per-user telemetry, no SLAs on latency or availability, and no budget alerts. The thing benchmark numbers don't tell you is whether you can attribute spend to a team; here you cannot. The observability gap is structural, not a bug to be fixed next sprint.
The Competitive Counter
OpenAI dropped a 2-month-free Codex enterprise switch promo the same day Anthropic metered usage. Ramp's April data shows Anthropic 34.4% vs OpenAI 32.3% — the first apparent lead change in business adoption. OpenAI is explicitly pricing against the developers Anthropic just alienated.
The asymmetry is the part that decides this. OpenAI's promo is a free evaluation window with zero downside risk. Anthropic's change is an immediate cost increase for any team that was leveraging the subsidy. A two-month free trial is cheap signal; not running it leaves the comparison unmeasured.
Action items
- Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals, third-party IDEs) and project token burn against new credit caps by end of this sprint
- Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature token tagging and daily budget alerts within 2 weeks
- Run OpenAI's 2-month Codex promo through your own eval harness with matched prompts and tool schemas before the window closes
- Re-baseline all Claude benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus 1 integration stabilizes in ~4-6 weeks
Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with
02
59% Agentic: Your Eval Harness and Cost Model Are Both Wrong
The Production Data
Vercel's AI Gateway telemetry — 200,000 teams, 7 months — now puts agentic workloads at 59% of token volume. Six months ago that share was under 20%. The composition has shifted faster than anything since completion-to-chat, and most production stacks were built before the shift started.
The spend-versus-volume gap is where the architecture leaks through:
Provider Share of Spend Share of Volume Primary Model Implied Role
Anthropic 61% — Opus Reasoning/planning nodes
Google — 38% Flash High-throughput utility calls
Others ~39% ~62% Mixed Mixed
Read it as tiered routing already in the wild: expensive models for planning, cheap models for throughput. The code should reflect what the traffic is doing.
Why Single-Turn Evals Are Now Dangerous
Agentic traces are multi-turn, tool-calling, bursty, and dominated by long contexts with small output deltas. Cost models pinned to a 3:1 input-output ratio are off by roughly 5x. Real agentic traffic runs closer to 15:1 on input, with aggressive cache reuse on some providers and none on others. A forecast that averages over that produces asymmetric errors by vendor, not a uniform miss.
If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.
The CyberGym result lines up with this. Microsoft's MDASH (100+ agent ensemble) beat Anthropic's Mythos by decomposing tasks into scan, adversarial debate, and PoC stages. The multi-agent topology won the leaderboard. The thing this doesn't tell you is whether the lift came from the decomposition or the agent count, because no ablation has been published. At production inference budgets I expect about half the benchmark gain to survive, and would not commit to migration on more than that.
MCP Overhead Is Real but Measurable
Glean reports off-the-shelf MCP burning 30% more tokens and losing 2.5x head-to-head against a retrieval-tuned knowledge graph. The number is vendor-published with undisclosed methodology, so treat it as directional. The failure modes it points at are real: MCP tool listings inflate context, and naive tool outputs return verbose blobs where a reranked snippet would do. Measure on your own corpus before you price the gap.
Action items
- Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
- Instrument per-node token cost across your agent graph and route utility calls (extraction, summarization, routing) to Flash/Haiku-class models within 2 weeks
- Run a 1-hour spike: replay 100 production agent traces under current MCP setup vs BM25+rerank baseline, measuring tokens and task-win-rate
- Add provider-agnostic routing abstraction (LiteLLM, OpenRouter, or in-house) if not already present — the Gateway data shows zero vendor loyalty
Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · MCP plus knowledge graphs · AI Gateway data puts agentic workloads at fifty-nine percent

Provider	Share of Spend	Share of Volume	Primary Model	Implied Role
Anthropic	61%	—	Opus	Reasoning/planning nodes
Google	—	38%	Flash	High-throughput utility calls
Others	~39%	~62%	Mixed	Mixed

Training Efficiency Breakthroughs: The Build-vs-Buy Math Just Shifted

Where the in-house/API crossover moves

Three results landed this week, each touching a different stage of the training pipeline. Together they shift the point at which proprietary post-training beats frontier API calls.

Work	Claim	Scale Validated	Inference Impact	Replication Risk
Nous Research TST	2-3x wall-clock at matched FLOPs	270M → 10B-A1B MoE	None — no architecture change	Medium; single-source, clean claim
Datology VLM curation	+11.7 pts on 20 benchmarks at 17x less compute	2B and 4B	Lower response FLOPs — real serving win	Medium; benchmark-selection risk
NVIDIA Star Elastic	360x cheaper model-size family; 7x better than SOTA compression	Not specified	Produces family of sizes from one post-training run	High; big headline, lab-reported

TST is the one to replicate first

Token Superposition Training is a pretraining recipe change with no inference-side cost. If it replicates at even 1.6x on a matched-FLOPs baseline, it pays for itself on the next full run. The mechanism, training on superimposed token representations, is validated across two orders of magnitude in model size. The thing this doesn't tell you is whether the speedup holds at the scale you actually train at; the encouraging signal is that the trend is consistent across the sizes they tested. No architectural change at inference means the deployment path is clean.

Datology: curation displaces compute

The Datology result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. At 2B parameters, their curated dataset beats InternVL3.5-2B by about 10 points on 20 benchmarks using 17x less training compute. The near-frontier 4B model reports 3.3x lower response FLOPs than Qwen3-VL-4B. That last number is a serving-cost win, not just a training one, which is the number worth tracking because it recurs every request.

Star Elastic: model families on one post-training run

NVIDIA reports that one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining each individually. Headline numbers from labs usually compress under independent eval. Given the numbers in the paper, I expect this to hold at closer to 30x once someone outside NVIDIA runs it. 30x still restructures how teams produce size tiers for edge-vs-cloud deployment.

If TST replicates at even half the claimed speedup, and Datology's curation-over-compute thesis holds, the case for in-house post-training on proprietary data gets materially stronger against frontier API rates.

This connects to Abridge's production architecture: 80M+ clinical conversations, a constellation of models, a dedicated post-training team. The implicit claim is that at sufficient data scale, vertical post-training beats frontier-model calls on cost and latency for narrow tasks. The threshold question is volume. At low volume, the frontier API wins because you avoid maintaining the pipeline. At Abridge's scale the math is straightforward. These three results lower the crossover point for everyone in between.

Action items

Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline this quarter
If you have proprietary task data >100K examples, scope a post-training experiment (SFT or DPO) on a 7-13B base vs current frontier API call
Pull SWE-ZERO-12M-trajectories (112B tokens, 12M trajectories, 3K repos) now for SFT/RM training corpus before licensing frictions appear
Audit your Iceberg/Delta tables for stats coverage (ANALYZE/compute-stats); add stats maintenance to table-level SLAs

Sources:Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week · Abridge runs model routing across 100M conversations

Lakehouse and Agent Framework Security: New 9.9 CVEs Across the Data Stack

The Pattern

This week's advisory is not generic patch noise. It is a direct hit on the modern data and ML architecture. Sketch the reference stack (lakehouse → orchestrator → inference gateway → agent framework → CD pipeline) and every layer shipped a CVSS 9.0+ vulnerability this cycle.

Priority Patching Order

Component	CVE / CVSS	Impact on Your Stack	Status
Apache Iceberg	CVE-2026-42812 / 9.9	Metadata write redirect → poisoned Parquet → corrupted training data	Disclosed
Apache Polaris	CVE-2026-42809/10/11 / 9.9	Credential broadening → S3/GCS creds, cross-tenant access	Disclosed
Argo CD 3.2.x/3.3.x	CVE-2026-42880 / 9.6	Low-priv → plaintext K8s Secrets in all reachable namespaces	Disclosed
PraisonAI	CVE-2026-44338	Auth bypass → agent runtime, secrets, connected DBs	Active, <4hr from disclosure
NGINX rewrite module	Unauthenticated RCE	Model-serving ingress → registry creds, cross-model lateral	18yr latent, disclosure-fresh

Why Iceberg/Polaris Is the Scariest

CVE-2026-42812 lets an attacker with table-write permission point metadata at an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features. The thing default lakehouse observability doesn't tell you is whether a pointer mutated; it tracks row changes. Combined with Polaris credential-broadening, there is a plausible path from compromised analyst notebook to cross-tenant data theft to training data poisoning, with none of it tripping standard alerts.

The 18-year NGINX RCE is a useful reminder that 'battle-tested' and 'audited' are not the same word. Every model-serving gateway behind NGINX is in the blast radius.

PraisonAI: The 4-Hour Window

PraisonAI was weaponized within 4 hours of CVE disclosure. That is not a patching window. It is a containment window. Agent frameworks inherit the attack surface of every tool they wire together, plus a new surface from prompt-mediated control flow, and maintainers are not staffed like the security teams at the vendors whose APIs they wrap. LangChain, CrewAI, and AutoGen sit in the same bucket until measured otherwise.

Update on LiteLLM (covered Tuesday): remains on CISA KEV as actively exploited. If you patched, verify key rotation completed. If you didn't, assume exfiltration.

Action items

Audit Iceberg/Polaris catalog configurations this week: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every K8s Secret in reachable namespaces by end of week
Inventory all agent frameworks (PraisonAI, LangChain, CrewAI, AutoGen) in production or staging; pin versions and subscribe to CVE feeds
Patch NGINX across all inference gateways and audit rewrite-module usage in model routing configs

Sources:LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · PraisonAI, an open-source multi-agent framework · Agent stacks are now in scope for attackers

◆ QUICK HITS

DuckDB shipped Quack HTTP client-server mode — eliminates the 'one process at a time' constraint; spike one Glue/EMR job under 100GB onto ECS Fargate + DuckDB pattern
DuckDB shipped a client-server mode this week
Kafka Share Groups decouple consumer parallelism from partition count with ~8x linear scaling at 32 instances on I/O-bound workloads — stop over-partitioning for scale
DuckDB shipped a client-server mode this week
DeepSeek V4 Pro scored 77/100 on FlowGraph at $2.25/task, sitting between Opus and Kimi K2.6 — benchmark against your production model on cost-per-successful-task
The CyberGym result
Mythos is the first model to clear both UK AISI simulated attack ranges; AISI already building harder tests because current ladder saturates
Mythos cleared the AISI attack ranges this week
Mozilla's custom harness found 271 Firefox bugs with Claude Mythos; curl's out-of-box scan found 1 CVE — same model, 271x yield difference from harness engineering alone
Mozilla shipped 271 bugs over the period in question
Duolingo publicly pegs AI-generated content slop at ~20% requiring human QC — rare production quality anchor; benchmark your acceptance rate against it
Duolingo's twenty percent AI slop rate
Only 15% of orgs have data foundations for agentic AI; data quality/lineage cited by ~50% as the #1 blocker — scope agent projects as data-platform projects with an agent on top
DuckDB shipped a client-server mode this week
TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-flash-live and 1.18s GPT-Realtime — a 3x gap on the metric that determines voice-agent naturalness
TML is reporting 0.40 seconds of full-duplex latency
PCAOB/COSO now require deterministic execution and tamper-evident audit trails for ML in regulated finance — stochastic decoding is non-compliant by design
The transformer underwriting models are outperforming
LLM-as-a-Verifier (decomposed binary verifications with token-level scoring) beats LLM-as-a-Judge on tie-rate and decision accuracy — swap one eval pipeline as a day-long spike
An Ollama endpoint exposed to the public internet
Update: Exposed AI endpoints (Ollama, LangServe, MCP) scanned within 3 hours; 23% of honeypot traffic targets AI-specific paths including /.well-known/mcp.json
An Ollama endpoint exposed to the public internet
Persona drift in LLM agents measurable within 8 conversational turns (Li et al. COLM 2024) — add a verbal-tic canary to system prompts as a free drift detector
AI personas drift within eight turns

◆ Bottom line

The take.

Anthropic metered the developer discount, Vercel confirmed 59% of production tokens are agentic, and the data stack shipped five CVSS 9.0+ CVEs in a single cycle. If you haven't deployed a multi-provider gateway with per-workload cost attribution, rebuilt your eval harness around trajectory-level metrics, and patched your Iceberg catalog this week, you're absorbing pricing decisions by default, measuring the minority of your traffic, and leaving your training data exposed to silent poisoning.

Frequently asked

How do I quickly estimate the cost overrun on existing Claude agent pipelines?: Multiply your prior subscription-flat token burn by roughly 3-10x to approximate the now-unsubsidized API-credit cost, since the implicit subsidy on Agent SDK, GitHub Actions, and third-party harness usage was 70-90%. Then re-project against the dollar-matched credit cap and the June 15 third-party split (no rollover). Any pipeline whose monthly trace volume × per-trace token count exceeds the credit pool is in overrun starting now.
What trajectory-level metrics should replace single-turn pass@1 in an agent eval harness?: Track tool-call precision and recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate, and per-node token cost across the agent graph. These capture the multi-turn, tool-calling, long-context behavior that dominates the 59% of production tokens now flowing through agentic traces. Pass@1 measures the minority of traffic and hides 5-15x compounding from retries and tool loops.
Where does the in-house post-training versus frontier-API crossover sit after this week's results?: It moved lower. Datology shows VLM curation beating a 2B baseline by ~10 points at 17x less compute, TST claims 2-3x wall-clock at matched FLOPs with no inference-side change, and NVIDIA's Star Elastic produces a model-size family from one post-training run. For teams with >100K proprietary task examples, SFT or DPO on a 7-13B base is now plausibly cheaper than frontier API calls at production volume.
Why is the Iceberg CVE more dangerous than typical lakehouse patch noise?: CVE-2026-42812 (CVSS 9.9) lets an attacker with table-write permission redirect table metadata to an attacker-controlled S3 prefix, so subsequent queries read poisoned Parquet and training runs silently ingest corrupted features. Default lakehouse observability tracks row changes, not metadata-pointer mutations, so the poisoning does not trip standard alerts. Combined with Apache Polaris credential-broadening, it opens a path from analyst notebook to cross-tenant theft and training-data poisoning.
Is OpenAI's two-month free Codex enterprise promo worth the integration effort to evaluate?: Yes, because it is a zero-cost asymmetric option. Run it through your own eval harness with matched prompts and tool schemas, scoring on trajectory-level metrics rather than pass@1. Even if you stay on Claude, the comparison gives you negotiating leverage and a measured fallback for the next capacity or pricing shock; skipping it leaves the comparison unmeasured at exactly the moment Anthropic's serving conditions are non-stationary due to the Colossus 1 integration.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

AnthropicEndsClaudeSubscriptionSubsidyforAgentPipelines

◆ INTELLIGENCE MAP

◆ DEEP DIVES

The New Economics

The Capacity Fix That Changes Your Baseline

The Competitive Counter

The Production Data

Why Single-Turn Evals Are Now Dangerous

MCP Overhead Is Real but Measurable