How much performance is actually on the table from harness work versus a model swap?

Recent benchmarks show 7–16 point gains from harness changes alone: Agentic Harness Engineering moved Terminal-Bench 2 from 69.7% to 77.0%, and HALO took AppWorld from 73.7 to 89.5 by recursively rewriting its own harness on a fixed model. Those gains transferred across model families and in one case cut token use 12%, so the harness is generally a larger lever than model choice.

What does it actually mean to version a harness, and what should ship this sprint?

Treat prompts, tool definitions, middleware, and error-recovery patterns as versioned artifacts evaluated per model family, the way LangChain's Harness Profiles formalize it. The minimum viable shipment is a per-feature harness repo, an eval set tied to it (Netflix's 600-example LLM-as-Judge blueprint is a reasonable target), and a CI gate that blocks merges that regress quality or token cost.

How do I justify this work over another model bake-off to leadership?

Frame it as unit economics, not engineering taste. IBM's Granite 4.1 8B matched Qwen3.5 9B using 19.5x fewer tokens, and one RAG app dropped from 10.4M to 3.7M tokens purely from harness and integration changes. With token spend growing 10–15x in six months and no enterprise discounts from Anthropic at $5M+/year, harness and routing work is the only near-term lever that improves quality and margin simultaneously.

When is a model swap still the right call instead of harness investment?

Swap models when a feature is capability-bound rather than orchestration-bound — for example, tasks where even a well-tuned harness can't reach the quality bar on your current tier, or where a smaller fine-tuned model would clearly dominate on a narrow, well-defined task (Shopify Flow is the canonical example). Route by error cost: features where mistakes cost hours earn frontier models, features where they cost minutes belong on cheaper tiers with a strong harness.

How does harness versioning interact with the security exposure from AI tooling?

A versioned harness gives you a single chokepoint to disable or pin tool wrappers when a CVE drops, which matters when weaponization windows are under 13 hours and tools like Claude Code, Cursor, and LiteLLM have shipped CVSS 9.8–10.0 bugs. Require every tool wrapper in the harness registry to have a named owner, a patch SLA in hours, and a documented kill switch before it's allowed in production.

Edition 2026-05-01 · read as Product

HarnessEngineeringBeatsModelSwapsonEvalGains

Sources: 40
Words: 1,265
Read: 6min

Topics Agentic AI LLM Inference AI Regulation

◆ The signal

A team swapped models three times last quarter chasing a four-point eval bump and shipped nothing, because the prompts and tool wrappers were rewritten each time and nobody versioned them. The numbers this week argue the harness is the product: Agentic Harness Engineering took Terminal-Bench 2 from 69.7% to 77.0% (past the 71.9% Codex-CLI baseline), HALO pushed AppWorld from 73.7 to 89.5 by rewriting its own harness, and IBM's Granite 4.1 8B matched Qwen3.5 9B on 19.5x fewer tokens. Fund harness versioning and evals this sprint, not another model bake-off.

◆ INTELLIGENCE MAP

01
Harness Engineering Beats Model Upgrades — By a Lot
act now
Multiple benchmarks confirm: optimizing prompts, tools, and middleware delivers larger quality gains than switching models. Terminal-Bench 2 went from 69.7% to 77.0% via harness alone. IBM's 8B model matched a 32B MoE through architecture optimization. LangChain launched Harness Profiles for commercial versioning.
77%
harness-optimized accuracy
5
sources
- Terminal-Bench baseline
- Human-designed
- Harness-optimized
- HALO AppWorld lift
- IBM token efficiency
1. Base model69.7
2. Human baseline71.9
3. Harness-optimized77
4. HALO (AppWorld)89.5
02
Token Costs Hit 10-15x in 6 Months — Anthropic's Extraction Era
act now
Survey of 15 companies confirms AI token spend grew 10-15x between Oct 2025 and April 2026. Individual devs burning $500/day. Anthropic offers zero discounts even at $5M+/year, is silently nerfing Claude Code, and banning companies. Open-weight alternatives at $1-3/M tokens are the pressure relief valve.
15x
6-month token cost growth
5
sources
- Token spend growth
- Peak dev/day spend
- Seed startup cost
- Open model pricing
- Cost cut: Opus→Sonnet
1. Oct 2025200
2. Jan 2026800
3. Apr 20263000
03
Agents Became Autonomous Buyers — Three Platforms Shipped Simultaneously
monitor
Stripe's Link CLI (agent payment credentials), Cloudflare's self-provisioning (agents create accounts/deploy), and Cursor's SDK (embedded agent runtime) shipped in the same week. An agent can now write code, pay for hosting, and deploy — autonomously. Your user model needs a third row: not human, not service account, but agent with delegated authority and a budget.
3
platforms shipping same week
7
sources
- Stripe
- Cloudflare
- Cursor
- x402 agent txns
- On-chain agents
1. Stripe Link CLIAgent payment credentials
2. CloudflareAgent self-provisioning
3. Cursor SDKEmbedded agent runtime
4. x402 protocol50M+ agent transactions
04
AI Dev Tool Security: 8+ Critical RCEs in a Single Week
monitor
Claude Code (CVSS 10.0), Cursor (RCE via git clone), GitHub Enterprise (RCE via push), and 73 malicious GlassWorm extensions in Open VSX all disclosed in one cycle. AI tooling ships with 2005-era security practices. 88% of GitHub Enterprise Server instances remain unpatched. Exploit weaponization now takes under 12.5 hours.
12.5hrs
disclosure-to-exploit
6
sources
- CVSS 10.0 AI tools
- GlassWorm extensions
- GHES unpatched
- Exploit window
- MOAK KEV exploit rate
1. Claude Code10
2. Paperclip RCE10
3. Flowise (5 CVEs)9.9
4. KTransformers9.8
5. Cursor RCE9.8
05
Your Interface Is Becoming Someone Else's Tool Library
background
Adobe exposed 50+ tools across 8 apps to Claude as the orchestration layer — explicitly conceding the interface to a chatbot. SaaS products that are 'routing friction' rather than systems of record face existential agent bypass risk. The question isn't 'does our product have an API' — it's whether AI agents reach for your tool when orchestrating a workflow.
50+
Adobe tools exposed to Claude
4
sources
- Adobe apps exposed
- Tools accessible
- Reject AI-first UI
- Magnific ARR
1. System of record (safe)30
2. Action endpoint (defensible)25
3. Routing friction (at risk)45

◆ DEEP DIVES

01
Harness Engineering Is Your New Competitive Moat — Not Model Selection
The Data Is In: Orchestration Beats Intelligence
The traditional assumption — 'wait for the next model upgrade to hit our quality bar' — is now empirically wrong. Multiple independent benchmarks released this week confirm that the harness (prompt design, tool definitions, error recovery, middleware) is a larger lever than model choice for AI feature quality.
Agentic Harness Engineering improved Terminal-Bench 2 pass@1 from 69.7% to 77.0%, beating a human-designed Codex-CLI baseline at 71.9% — and it transferred across model families while reducing token use by 12%.
HALO showed even more dramatic results: AppWorld improvement from 73.7 to 89.5 on Sonnet 4.6 through recursive self-improvement of the harness, not the model. These aren't marginal gains — they're the difference between a demo and a production feature.
The Cost Dimension Makes This Urgent
IBM's Granite 4.1 8B used only 4M output tokens on the Artificial Analysis Intelligence Index versus 78M for Qwen3.5 9B — a 19.5x efficiency gap that's entirely about harness and architecture optimization. Shopify fine-tuned a smaller open-source model on Flow-specific data and achieved higher accuracy, lower latency, AND lower cost than general-purpose frontier models. When your inference costs are growing 10-15x in six months, a 19.5x efficiency difference isn't a nice-to-have — it's the difference between a viable product and a margin crisis.
Commercial Infrastructure Is Emerging
This isn't just research anymore. LangChain launched Harness Profiles — versioning per-model prompts, tools, and middleware configurations. DeepAgents Deploy shipped low-code agent deployment with LangSmith tracing. The pattern is clear: harness quality is becoming a managed, versionable, evaluable artifact alongside your code.
What This Means For Your Team
- Model selection is now a product decision, not an engineering default. Features where errors cost minutes belong on cheaper models with optimized harnesses. Features where errors cost hours earn frontier models.
- Your harness — the prompts, tool definitions, error recovery patterns — should be versioned like code and evaluated per model family.
- Netflix's LLM-as-a-Judge blueprint (600 golden examples, 83-92% accuracy) provides the evaluation framework to measure harness improvements objectively.
Action items
- Create a harness versioning and evaluation pipeline for your top 3 AI features this sprint
- Run a model-tier routing analysis: classify every AI feature as 'errors cost minutes' vs 'errors cost hours' and assign model tiers accordingly
- Build 600 expert-labeled golden examples for your primary AI feature following Netflix's blueprint
- Evaluate Shopify's approach: benchmark a fine-tuned small model against your current frontier API for your most well-defined use case
Sources:Your AI cost model is about to break · An agent opened a checkout flow at 3:14 in the morning · A product manager at a mid-size SaaS company looked at her inference bill · Shopify proves fine-tuned small models beat GPT-class LLMs · A head of product opened her cloud bill last month · A product manager on an AI team opened her evaluation dashboard
02
The Token Cost Reckoning: 15x Growth, Zero Discounts, and the Vendor Lock You Didn't Price
The Numbers From 15 Companies Are Worse Than Expected
A staff engineer at a late-stage fintech watched one of her developers burn through a day of Claude Code and realized the bill looked like a second salary. That is the texture behind Gergely Orosz's survey of 15 companies, which is the most granular AI cost data available this quarter. Token spend grew 10-15x in six months at both a large enterprise and a seed-stage startup. Individual developers are spending $500/day on Claude Code, which staff engineers say has 'practically doubled' employee costs. One seed-stage company went from $200/dev/month to $3,000/dev/month.
About half of surveyed companies are on the 'let it rip' plan, which is a pitch-deck word for unforecasted cost. The Series D fintech that called its token spend 'unsustainable' at an all-hands is the preview.
Anthropic's Extraction Era Is Real
Anthropic offers zero enterprise discounts even at $5M+/year spend. Teams describe silent Claude Code nerfing, outright bans, and aggressive price increases, while nearly every company in the survey is Claude-dependent. What customers actually bought is not the API call. They bought every prompt, eval, and guardrail tuned against one model family, and that is the switching cost that keeps them paying.
The Cost Lives in the Wrong Budget Line
The PM decision this quarter is a budget-line decision, not a model decision: do tokens belong in COGS or R&D? If tokens sit in COGS, the product team owns unit economics and a 3x reduction in retrieval spend becomes a roadmap item with an owner. If tokens sit in R&D, nobody owns it and the bill keeps growing. One RAG app consumed 10.4M tokens via naive Supabase MCP integration. The same job ran on optimized infrastructure for 3.7M tokens. That is a 64% reduction from an architecture choice, not a model change.
Open-Weight Models Are the Pressure Valve
Open-weight pricing has collapsed to $1-3/M output tokens (Qwen 3.5 Plus at $3/M, MiMo-V2.5 Pro at $1/$3/M). LangChain staff explicitly state closed models are 'too expensive for many agent workloads.' Features that were economically impossible at $15-60/M tokens become viable at these prices. Three Chinese labs dropped frontier models in a single week: Alibaba's Qwen 3.6 Max, Moonshot's 1T-parameter Kimi-K2.6, and MiniMax M2.7 (fully open-source, 56% SWE-Pro).
Provider Output Token Cost Enterprise Discount
Anthropic (Claude) $15-60/M None at $5M+/yr
Qwen 3.5 Plus $3/M Open-weight
MiMo-V2.5 Pro $1-3/M Open-weight
MiniMax M2.7 Free (OSS) N/A
Action items
- Audit every AI-dependent feature for cost model accuracy this sprint — re-forecast COGS using current token pricing at 10x whatever was modeled 6 months ago
- Map Anthropic dependency across customer-facing features AND internal dev tooling — identify which capabilities break if prices rise 2-3x
- Implement per-feature token budgets with hard circuit breakers and real-time alerting before end of quarter
- Benchmark open-weight models (Qwen 3.5, MiniMax M2.7, Kimi-K2.6) against your current provider for your top 3 AI features
Sources:A product manager at a mid-size SaaS company looked at her inference bill · Your AI cost model is about to break · A head of product opened her cloud bill last month · Agents just hit production-grade benchmarks · A product lead at a Series B company opened the OpenAI status page
03
Your AI Feature Stack Is a Security Liability — 8+ Critical RCEs This Week Alone
AI Tooling Ships With 2005-Era Security
A security engineer opened her advisory feed on a Tuesday and counted eight AI-adjacent tools with CVSS 9.8-10.0 vulnerabilities disclosed in a single week. Claude Code hit 10.0 for a symlink sandbox escape. Cursor's AI agent allowed arbitrary code execution via malicious git clone. GitHub Enterprise Server exposed millions of repositories via crafted push requests. None of these are novel vulnerability classes. They are hygiene failures shipped at startup speed.
The evaluation question for an AI framework is not which one has the best features. It is which one will not be a 10.0 headline next week.
The Weaponization Window Collapsed
Sysdig observed attacks on LMDeploy within 12.5 hours of disclosure, without a public proof-of-concept, because advisory text alone was enough for an LLM to generate the exploit. MOAK autonomously exploits 98% of Known Exploited Vulnerabilities using publicly available models. Disclosure-to-exploit is measured in hours now. A patching SLA written against a weekly cadence is a patching SLA for a threat model that no longer exists.
Supply Chain Attacks Are Industrializing
73 malicious GlassWorm extensions appeared in Open VSX in April 2026 alone. The TeamPCP/UNC6780 campaign hit npm, PyPI, and Docker Hub within 48 hours. Dependabot carried the attack through Bitwarden CLI's merge gates. The automation did exactly what it was configured to do, which is the uncomfortable part. LiteLLM's SQL injection exposed every downstream API key (OpenAI, Anthropic, Bedrock) via pre-auth attack, exploited within 36 hours.
The PM Responsibility
Separate the thing being pitched from the thing being done. The pitch is "adopt AI tools to move faster." The thing being done is adding a new attack surface with its own incident timeline. 88% of GitHub Enterprise Server instances remain unpatched, and DPRK's HexagonalRodent is using Cursor and ChatGPT to vibe-code malware loaders. The forcing function for this sprint: no AI tool enters the stack without a named owner, a patch SLA measured in hours, and a documented kill switch. Teams that write that policy this sprint do not run the fire drill next quarter.
- Claude Code: CVSS 10.0 (sandbox escape, patched v2.1.64)
- Cursor: RCE via git clone (patched v2.5)
- GitHub Enterprise: RCE via push options (88% unpatched)
- LiteLLM: Pre-auth SQL injection exposing all API keys
- Open VSX: 73 malicious extensions in one month
Action items
- Verify Cursor is updated to v2.5+ and GitHub Enterprise Server is patched across all environments — escalate to platform team within 24 hours
- Add mandatory security review gate to your AI tooling evaluation process — no AI framework enters your stack without a vulnerability assessment
- Audit your LLM proxy/gateway layer for credential isolation — specifically verify that provider API keys are not stored in a single queryable database
- Implement dependency cooldowns across your CI/CD pipeline and pin container images by digest, not tag
Sources:A security engineer opened the vulnerability dashboard · A developer opened her IDE on Monday morning · A developer on your team opened the dependency graph · A product manager on call this week got paged at 2am · A security engineer on your team opened the weekly CVE digest · A platform trust lead opened her dashboard

Provider	Output Token Cost	Enterprise Discount
Anthropic (Claude)	$15-60/M	None at $5M+/yr
Qwen 3.5 Plus	$3/M	Open-weight
MiMo-V2.5 Pro	$1-3/M	Open-weight
MiniMax M2.7	Free (OSS)	N/A

◆ QUICK HITS

Copilot at 5% penetration (20M of 400M M365 customers) — if your AI feature adoption model assumes faster conversion than Microsoft achieves with maximum enterprise distribution, write down why
A product lead sat in a planning meeting this week
Update: Court ruled AI exercising 'ultimate authority' over content makes the platform the legal 'maker' under Rule 10b-5 — audit any auto-publish AI features for unquantified liability exposure
New court ruling means your AI features may make you legally liable
Voice AI hit $7B in Q1 2026 (excluding OpenAI/Anthropic) — Abridge's 150-doctor waitlist in 2 months at HonorHealth validates 'time given back' as the only AI value prop driving bottom-up enterprise adoption
A product manager at a mid-size SaaS company listened to three sales calls
Vertical AI agent valuations crystallizing: Manifest OS $750M (legal, Series A), Avoca $1B (home services, Series B), Rogo $160M Series D at 35K daily active bankers — horizontal AI features are commodities, vertical workflow ownership is the premium
AI vertical SaaS hitting $750M+ at Series A
Adobe exposed 50+ tools across 8 apps to Claude as orchestration layer — conceding the interface to a chatbot while retaining the capability. If your product value lives in the UI rather than the data, you're next
A designer opened Photoshop this week
Uber CTO confirmed Cursor cut hotel booking build time by 50% — a named C-suite exec at a $170B company quantifying the velocity multiplier. Recalibrate sprint capacity accordingly
AI halves your build time but corrupts 25% of docs
LLMs corrupt 25% of document content in long editing workflows across 52 professional domains — add corruption detection and content diff checkpoints to any AI editing feature
AI halves your build time but corrupts 25% of docs
Update: Chinese AI model provenance — House committees now investigating Airbnb and Anysphere (Cursor's parent) over Chinese AI connections. Audit which model families power your features before legal asks
A procurement lead pulled up her vendor inventory
Lifestage beats age as purchase intent signal by 26 points; feed curation behavior beats it by 47 — if your personas are built on demographics, Meta's global study says you're optimizing the wrong variable
A product manager on a growth team opened her segmentation dashboard
CPU shortage emerging: ~$100B in COVID-era CPUs (2020-2021) hitting 5-6 year end-of-life simultaneously while orgs starved CPU budgets for 2 years to fund GPU purchases — flag to infrastructure team for Q3-Q4 capacity planning
Your AI cost model is about to break

◆ Bottom line

The take.

The AI performance lever just moved from 'which model' to 'how you orchestrate it' — harness engineering delivers 10-20% quality gains without any model change, while your token costs grew 10-15x in six months and Anthropic refuses discounts at $5M+/year. Meanwhile, Stripe, Cloudflare, and Cursor simultaneously shipped the infrastructure for agents to buy, deploy, and operate autonomously — and 8+ AI developer tools disclosed critical RCEs in the same week. The PMs who win this quarter are the ones investing in harness optimization over model shopping, building multi-provider abstraction before Anthropic's next price hike, and treating every AI tool integration as an attack surface with a 12-hour exploitation window.

Frequently asked

How much performance is actually on the table from harness work versus a model swap?: Recent benchmarks show 7–16 point gains from harness changes alone: Agentic Harness Engineering moved Terminal-Bench 2 from 69.7% to 77.0%, and HALO took AppWorld from 73.7 to 89.5 by recursively rewriting its own harness on a fixed model. Those gains transferred across model families and in one case cut token use 12%, so the harness is generally a larger lever than model choice.
What does it actually mean to version a harness, and what should ship this sprint?: Treat prompts, tool definitions, middleware, and error-recovery patterns as versioned artifacts evaluated per model family, the way LangChain's Harness Profiles formalize it. The minimum viable shipment is a per-feature harness repo, an eval set tied to it (Netflix's 600-example LLM-as-Judge blueprint is a reasonable target), and a CI gate that blocks merges that regress quality or token cost.
How do I justify this work over another model bake-off to leadership?: Frame it as unit economics, not engineering taste. IBM's Granite 4.1 8B matched Qwen3.5 9B using 19.5x fewer tokens, and one RAG app dropped from 10.4M to 3.7M tokens purely from harness and integration changes. With token spend growing 10–15x in six months and no enterprise discounts from Anthropic at $5M+/year, harness and routing work is the only near-term lever that improves quality and margin simultaneously.
When is a model swap still the right call instead of harness investment?: Swap models when a feature is capability-bound rather than orchestration-bound — for example, tasks where even a well-tuned harness can't reach the quality bar on your current tier, or where a smaller fine-tuned model would clearly dominate on a narrow, well-defined task (Shopify Flow is the canonical example). Route by error cost: features where mistakes cost hours earn frontier models, features where they cost minutes belong on cheaper tiers with a strong harness.
How does harness versioning interact with the security exposure from AI tooling?: A versioned harness gives you a single chokepoint to disable or pin tool wrappers when a CVE drops, which matters when weaponization windows are under 13 hours and tools like Claude Code, Cursor, and LiteLLM have shipped CVSS 9.8–10.0 bugs. Require every tool wrapper in the harness registry to have a named owner, a patch SLA in hours, and a documented kill switch before it's allowed in production.

◆ Same day, different angle

Read this day as…

◆ Recent in product

HarnessEngineeringBeatsModelSwapsonEvalGains

◆ INTELLIGENCE MAP

◆ DEEP DIVES

The Data Is In: Orchestration Beats Intelligence

The Cost Dimension Makes This Urgent

Commercial Infrastructure Is Emerging

What This Means For Your Team

The Numbers From 15 Companies Are Worse Than Expected

Anthropic's Extraction Era Is Real

The Cost Lives in the Wrong Budget Line

Open-Weight Models Are the Pressure Valve

AI Tooling Ships With 2005-Era Security

The Weaponization Window Collapsed

Supply Chain Attacks Are Industrializing

The PM Responsibility

◆ QUICK HITS

The take.

Frequently asked

◆ RELATED THREADS