How do I actually measure the cache discount in my agent runtime?

Instrument three first-class metrics: cache-hit rate, prefix-reuse ratio, and effective $/1K tokens net of cache discounts. Standard runtimes emit tokens/sec and $/query but miss these, which is why a 3.2× effective discount can stay invisible. Stable system prompts and tool schemas are where the savings concentrate, so harness design becomes a cost lever rather than an infra detail.

Why do the same models produce contradictory benchmark results across sources?

Because evaluation harnesses (prompt format, tool schema, retry policy, context budget) are load-bearing variables that rarely get logged. GPT-5.5 beats Opus 4.7 on the Intelligence Index but loses inside the Claude Code harness; Grok 4.3 gains +321 Elo on GDPval-AA while regressing on Vending-Bench 2. Treat harness config as a required eval dimension before comparing vendors.

Is MCP or SKILL.md the right choice for exposing tools to an agent?

They are orthogonal, not competing. MCP is the integration plane for live systems, state, and auth, validated by typed JSON-RPC in a separate process. SKILL.md is the knowledge plane for procedures and playbooks. Picking MCP for a prompt template wastes infra; picking Skills for stateful auth flows costs correctness.

What is the cheapest prompt-injection defense to ship today?

Spotlighting: wrap all untrusted text in tags and add a system-prompt rule telling the model to treat that content as data, not instructions. It takes hours to deploy across RAG and tool-output paths and meaningfully raises the floor while you build the heavier Planner/Executor architecture for high-stakes flows.

Does the Grok 4.3 price cut actually beat DeepSeek V4 Pro on blended cost?

Probably not on cache-friendly agent workloads. Grok 4.3's $1.25/$2.50 per million tokens looks aggressive, but a $0.05 fee per safety-filter-blocked request erodes savings at typical 2–3% rejection rates, and DeepSeek's hours-long disk-backed KV cache produces a 3.2× effective discount on repetitive agent loops. Replay real traffic in shadow mode before migrating.

Edition 2026-05-03 · read as Data Science

CacheEconomicsandDomainRoutingReshapeModelSelection

Sources: 8
Words: 1,342
Read: 7min

Topics Agentic AI LLM Inference Data Infrastructure

◆ The signal

Cache economics now dominates agentic model selection, and price-per-token sheets no longer measure the bottleneck. DeepSeek V4 Pro holds its disk-backed KV cache for hours against a roughly five-minute industry norm; one production dashboard reports $3,351 in cache savings on $1,051 of spend, a 3.2× effective discount that shows up nowhere on a rate card. Grok 4.3 ranking first on CaseLaw while landing 11% on ProofBench is the other half of the story: domain routing beats vendor loyalty. Model-selection docs are cache-and-routing docs this week.

Key facts

DeepSeek V4 Pro holds its disk-backed KV cache for hours versus the roughly 5-minute industry norm, yielding a 3.2x effective discount in one production case ($3,351 saved on $1,051 spend).
Grok 4.3 launched at $1.25/M input and $2.50/M output tokens, 40-60% below Grok 4.2, but added a $0.05 fee per safety-filter-blocked request.
Grok 4.3 ranks #1 on CaseLaw and CorpFin benchmarks while scoring only 11% on ProofBench, showing within-model domain variance exceeds between-vendor gaps.
Nebius acquired Eigen AI for $615M to optimize inference, while Cursor reported -23% gross margins, underscoring that inference economics now bind production model selection.
Alibaba's Qwen3.6-27B reportedly beats a predecessor 15x its size on coding benchmarks, though contamination ablation was not disclosed.

◆ INTELLIGENCE MAP

01
Cache Economics Now Dominate Agentic TCO
act now
DeepSeek V4 Pro's disk-backed KV cache persists hours vs. the 5-min industry norm, producing a 3.2× effective discount ($3,351 saved on $1,051 spent). Grok 4.3 sets a new price floor at $1.25/M input but adds a $0.05 refusal fee that silently erodes savings. Nebius paid $615M for Eigen AI specifically for inference optimization — inference cost is now an acquisition-grade capability.
3.2×
effective cache discount
4
sources
- DeepSeek cache TTL
- Industry cache TTL
- Cache savings ratio
- Grok 4.3 input price
- Grok refusal fee
- Nebius → Eigen AI
1. Cache Savings3351
2. API Spend1051
3. Net Effective Cost-2300
02
Benchmarks Measure Harness Fit, Not Model Quality
act now
GPT-5.5 beats Opus 4.7 overall but loses on PostTrainBench inside the Claude Code harness. Grok 4.3 gained +321 Elo on GDPval-AA but regressed on Vending-Bench 2 so badly the agent 'preferred to sleep.' HF's 'engine vs car' framing confirms closed APIs bundle routing, harnesses, and tools that inflate gaps against open weights in single-harness comparisons.
~5 pts
open-to-closed gap
4
sources
- Open MoE Index score
- Closed frontier score
- Grok CaseLaw rank
- Grok ProofBench
- Grok Elo gain
- Grok hallucination Δ
1. 01GPT-5.560
2. 02Gemini 3.1 Pro57
3. 02Claude Opus 4.757
4. 04Grok 4.353
5. 04DeepSeek V4 Pro53
6. 06Kimi K2.652
03
Agent Security Architecture Crystallizes: Planner/Executor + MCP vs SKILL.md
monitor
Planner/Executor Split is hardening as the default for agents touching untrusted content: planner has tools but never sees untrusted text, executor reads text but has no tools. Separately, MCP vs SKILL.md is a systems decision — MCP for stateful integrations, SKILL.md for procedural knowledge — but Skills execute arbitrary bash in the agent's environment with zero isolation, a supply-chain surface most teams haven't audited.
2×
inference cost of defense
2
sources
- Planner/Executor cost
- OWASP LLM rank
- MCP runtime
- SKILL.md runtime
1. MCP (stateful integration)75
2. SKILL.md (procedural)25
04
Agents as Primary Platform Consumer by EOY 2026
background
Hugging Face is redesigning around agents.md, headless APIs, and token-efficient endpoints, forecasting agent traffic will surpass human traffic on ML platforms by late 2026. HF's ML InTern agent passed their researcher interview in 30 min. The workload-split prediction (99% proprietary API today → 95% local/specialized) is directionally plausible for docs and APIs; consumer workflows with humans in the loop are 2027+ at earliest.
95%
predicted local workload
2
sources
- Current proprietary %
- Forecast local %
- HF public models
- New repo frequency
1. Proprietary API (2025)99
2. Proprietary API (EOY 2026E)5
05
AI Coding Tool Unit Economics Diverge
background
Cursor reports −23% gross margins while Replit claims ~$1B ARR (from $2.8M in 2024) with 300% NRR. Both sit on the same foundation-model substrate. The spread is a natural experiment: thin inference wrappers converge to provider markup, while full IDE/runtime stacks that own workflow lock-in can drive expansion. Foundation-model passthrough is the dominant cost line for products that resell inference.
-23%
Cursor gross margin
2
sources
- Cursor gross margin
- Replit NRR
- Replit ARR growth
- Cerebras IPO target
1. Cursor Gross Margin-23
2. Replit Net Revenue Retention300

◆ DEEP DIVES

01
Cache Hit Rate Is the New Eval Metric — and It's a Bigger Cost Lever Than Model Quality
The Shift Nobody Priced In
Agentic workloads spend most of their tokens inside loops: retries, tool calls, multi-turn reasoning against stable system prompts. Repetitive prefixes are exactly what KV cache reuse exploits. DeepSeek V4 Pro ships a disk-backed KV cache that persists for hours, versus the roughly 5-minute TTL that is the industry standard. One shared production dashboard reported $3,351 in cache savings against $1,051 in API spend. That is a 3.2× effective discount, and it does not show up on a price-per-token comparison sheet.
This is not really a DeepSeek story. It is a serving-architecture story. V4 Pro's hybrid CSA/HCA attention compresses KV cache to 10% of standard size and reports ~4× lower long-context FLOPs. At the concurrency and context lengths a real coding agent harness produces, effective cost between models swings 2–4× depending on serving stack. That delta is larger than the quality gap between open and closed models on most coding tasks, which is the comparison most teams are actually running.
The Pricing Floor Has a Hidden Fee
Grok 4.3 sets the headline floor at $1.25/M input, $2.50/M output, 40–60% below Grok 4.2. xAI also introduced a $0.05 fee per safety-filter-blocked request. At a 2–3% filter rate on production prompts, that erodes token savings meaningfully at scale. Most cost dashboards won't catch it because they track tokens, not rejections.
A model that is cheap to serve is not the same as a model that is cheap to trust. The blended cost on a replay of real traffic is the number that decides the migration, not the sticker price.
Cross-Source Pattern
Four independent sources converge on the same read: inference economics, not benchmark scores, are the binding constraint on production model selection. Nebius paid $615M for Eigen AI specifically for inference optimization. Cursor's −23% gross margins show what a thin wrapper over expensive inference looks like on the P&L. Grok's own commentary notes the headline price cut may be subsidized by poor utilization and is unlikely to beat a well-cached DeepSeek workload.
What To Do
The immediate action is observability, not migration. Most agent runtimes emit tokens/sec and $/query but not cache-hit rate, prefix-reuse ratio, or effective $/1K tokens net of cache discounts. Those three metrics are now first-class cost variables. Stable system prompts and tool schemas are where the DeepSeek-style discount actually lives, which makes harness design a cost lever, not an infrastructure footnote.
Action items
- Instrument cache-hit rate, prefix-reuse ratio, and effective $/1K tokens (net of cache) as first-class metrics in your agent runtime by end of sprint.
- Replay last month's agent traffic through DeepSeek V4 Pro and Grok 4.3 in shadow mode; compare blended $/successful-task, not sticker $/M tokens.
- Add refusal-rate instrumentation to your LLM gateway and model the $0.05/blocked-request fee into unit economics.
- Evaluate whether your system-prompt and tool-schema structure maximizes prefix reuse; refactor for cache efficiency before switching models.
Sources:AINews · Techpresso · StrictlyVC · Turing Post
02
Your Benchmarks Are Measuring Harness Fit — Fix the Eval Before Fixing the Stack
The Evidence Harness-Dependence Is Load-Bearing
Cross-vendor benchmark comparisons are not just noisy. They are systematically confounded by the evaluation harness, and this week produced the receipts:
1. GPT-5.5 beats Opus 4.7 overall on the Intelligence Index but loses on PostTrainBench when evaluated inside the Claude Code harness. Same weights, different plumbing.
2. Grok 4.3 gained +321 Elo on GDPval-AA yet regressed on Vending-Bench 2 so badly that the agent reportedly preferred to 'sleep' rather than act. The same model on a different eval surface produces the opposite conclusion.
3. Grok 4.3 ranks #1 on CaseLaw and CorpFin but scores 11% on ProofBench. Within-model domain variance is wider than the between-vendor gap on any single domain.
Hugging Face's Clem Delangue frames this as the 'engine vs car' problem: closed APIs bundle routing, tool schemas, retry logic, and sometimes multiple models behind one URL. Benchmarking an open-weight model through a harness designed for Claude's tool-use schema produces degradation that looks like a model problem but is an instrumentation problem.
The Qwen Signal
Alibaba's Qwen3.6-27B reportedly beats a predecessor 15× its size on coding benchmarks. The thing this headline doesn't tell you is whether contamination was ablated (it wasn't disclosed), and coding benchmarks reward single-file completions rather than the multi-repo edits where production assistance actually fails. The cheapest experiment is an A/B on held-out pull requests, not a re-run of HumanEval.
Cross-harness comparisons correlate with production outcomes. They do not cause them, and they do not predict them well enough to justify a migration on their own.
The Fix
The eval harness itself needs to become a logged, versioned variable. Two changes make this concrete:
- Log harness config as an eval dimension. Prompt format, tool schema shape, retry policy, context budget all differ across model families and all affect measured quality. Re-test top open candidates (V4 Pro, Kimi K2.6, Qwen3.6-27B) with per-family tuned harnesses.
- Require vendors to publish harness methodology. Any procurement checklist that accepts a benchmark score without the harness definition, tool-call budget, and retry policy is accepting an unauditable claim.
The open-weight gap is now 5 points on the Intelligence Index (52–54 vs. 57–60), concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy Omniscience, rather than general coding or agentic tool use. On multi-turn agentic coding specifically, V4 Pro is called the first open-weight model that genuinely feels comparable to Codex or Claude Code. If internal evals confirm a gap of ≤4–5 points on task-completion metrics, a two-week shadow-traffic bake-off is warranted. If the gap is larger, re-baseline the evals first.
Action items
- Re-run your top-3 agent evals on DeepSeek V4 Pro, Kimi K2.6, and Qwen3.6-27B with per-model-family harness tuning (prompt format, tool schema, retry logic) this sprint.
- Add harness config (prompt format, tool budget, retry policy) as a required metadata field in all eval reports.
- A/B test Qwen3.6-27B against incumbent on held-out internal pull requests, measuring pass@1, latency, and $/1K tokens self-hosted vs. API.
- Build a per-domain benchmark slice dashboard showing model variance across task types before any migration decision.
Sources:AINews · Techpresso · Turing Post · THE DECODER

Agent Defense Patterns Harden: Planner/Executor Split, MCP vs SKILL.md, and the Security Surface Nobody Sandboxed

The Reference Architecture for Untrusted Content

The Planner/Executor Split is settling in as the default defense pattern for agents that touch untrusted content: emails, web pages, user uploads, RAG retrievals. The shape is simple. A planner LLM has tool access but never sees untrusted text. An executor LLM reads untrusted text but has no tools. Gmail is cited as a production reference. The honest cost is that it roughly doubles inference spend. That is the price that turns indirect prompt injection from catastrophic into recoverable.

The thing this doesn't solve is exfiltration through the planner's own tool calls. The eval you actually want is not "did the executor get fooled." It is "did a document change the planner's plan in a way the user didn't ask for." Build that eval before the architecture, or the architecture ships without a way to measure whether it works.

MCP vs SKILL.md: A Systems Decision, Not a Taste Decision

The framing that MCP and Skills are competing approaches is wrong. They are orthogonal primitives:

Dimension	MCP	SKILL.md
Purpose	Integration plane (live systems, state, auth)	Knowledge plane (procedures, playbooks)
Runtime	Separate process, containerized	Agent's own environment
Invocation	Typed JSON-RPC, schema-validated	Agent reads markdown, runs bash/python/curl
Primary risk	Infra overhead, auth sprawl	Arbitrary code execution with no isolation

The security surface most teams miss: Skills execute arbitrary bash/python/curl in the agent's own environment with no sandboxing. This is remote code execution by design. Every SKILL.md file deserves the same review rigor as a Dockerfile. Without Firecracker, gVisor, or at minimum a seccomp-restricted subprocess, a compromised Skill is a privilege-escalation vector.

The cost asymmetry: an MCP server that should have been a SKILL.md costs you a container, a deploy pipeline, and an on-call rotation. A SKILL.md that should have been MCP costs you correctness on a slice of traffic. One is a line item. The other is a bug.

Stacking Defenses

OWASP ranks prompt injection as the #1 LLM threat, and the guidance is explicit: no single fix exists. The layered approach combines:

Spotlighting: Wrap untrusted text in <UNTRUSTED> tags with a system-prompt rule. Hours of work, meaningful floor raise.
Instruction Hierarchy: Fine-tune to rank system > user > third-party content.
Planner/Executor Split: Architectural isolation for high-stakes paths.
Least-Privilege Tools: Minimize blast radius per tool call.

Spotlighting is the cheap floor and worth taking today. Planner/Executor earns its doubled inference cost on any path that touches the open web, and less obviously elsewhere. Skill sandboxing is the one that teams defer and regret; a compromised SKILL.md in production is not a bug you recover from quickly.

Action items

Implement Planner/Executor Split for any agent ingesting untrusted content (emails, web pages, user uploads, RAG retrievals) this quarter.
Ship Spotlighting (<UNTRUSTED> wrapper + system-prompt rule) on every RAG and tool-output path by end of week.
Audit all agent tools: classify as MCP-worthy (live state, auth) or Skill-worthy (procedural). Kill any MCP server that is wrapping a prompt template.
Sandbox all SKILL.md execution in Firecracker, gVisor, or seccomp-restricted subprocess before any Skill reaches production.

Sources:ByteByteGo · AINews

◆ QUICK HITS

Update: Anthropic Pentagon exclusion — analysis suggests deployment posture (API-only, RSP refusal patterns), not capability, drove the 'supply chain risk' label; build the dual-vendor fallback now, not under a procurement freeze
Morning Brew, Techpresso
CopyFail: new Linux root-access vulnerability covers training clusters, Ray/Airflow workers, Jupyter servers, and vector DB hosts — patch all ML infrastructure hosts and rotate credentials on any exposed node before Monday
StrictlyVC
Specialist model pattern is dead: OpenAI folded Codex back into GPT-5.5, Mistral bundled three models into one flagship — the code-vs-chat routing layer is now tech debt worth retiring
THE DECODER
ReaLM-Retrieve reports +10.1% absolute F1 vs. standard RAG with 47% fewer retrieval calls and 3.2× lower per-retrieval overhead — spike adaptive retrieval against your hardest multi-hop eval set this sprint
AINews
Mistral Le Chat repeated Iran-war disinformation in 60% of leading prompts — clone the methodology as a 50-prompt adversarial suite and run it quarterly against every shortlisted model
THE DECODER
Qwen-Scope ships Apache-2.0 SAEs for Qwen 3.5 from 2B to 35B MoE — arguably the largest open interpretability toolkit for dense/MoE models, surpassing GemmaScope in scale
AINews
Anthropic experiment: stronger models negotiate better prices 'without anyone noticing' — add counterfactual eval (weaker-model + human baseline comparison) to any agent loop touching pricing or procurement
THE DECODER
Meta FAIR self-improving pretraining: 36.2% relative factuality gain and 86.3% win rate on generation quality — watch for this to land in Llama 5 training recipes
AINews
MiniMax-M2.7 flipped from MIT to Non-Commercial license — audit all open-weight dependencies for license-change risk and keep a tier-2 fallback per workload
AINews
Mac mini $599 SKU retired (now $799) with supply crunch attributed to local AI demand — rerun local-inference fleet TCO against Ryzen AI Max and DGX Spark alternatives before next hardware refresh
Techpresso

◆ Bottom line

The take.

Cache hit rate is now a bigger cost lever than model quality for agentic workloads — DeepSeek's hours-long KV persistence delivers a 3.2× effective discount no benchmark captures — while Grok 4.3's domain profile (first place on legal, 11% on math, narcolepsy on agents) proves model selection is a routing decision, not a vendor decision; and if your agent's SKILL.md files run unsandboxed bash, you have RCE by design.

Frequently asked

How do I actually measure the cache discount in my agent runtime?: Instrument three first-class metrics: cache-hit rate, prefix-reuse ratio, and effective $/1K tokens net of cache discounts. Standard runtimes emit tokens/sec and $/query but miss these, which is why a 3.2× effective discount can stay invisible. Stable system prompts and tool schemas are where the savings concentrate, so harness design becomes a cost lever rather than an infra detail.
Why do the same models produce contradictory benchmark results across sources?: Because evaluation harnesses (prompt format, tool schema, retry policy, context budget) are load-bearing variables that rarely get logged. GPT-5.5 beats Opus 4.7 on the Intelligence Index but loses inside the Claude Code harness; Grok 4.3 gains +321 Elo on GDPval-AA while regressing on Vending-Bench 2. Treat harness config as a required eval dimension before comparing vendors.
Is MCP or SKILL.md the right choice for exposing tools to an agent?: They are orthogonal, not competing. MCP is the integration plane for live systems, state, and auth, validated by typed JSON-RPC in a separate process. SKILL.md is the knowledge plane for procedures and playbooks. Picking MCP for a prompt template wastes infra; picking Skills for stateful auth flows costs correctness.
What is the cheapest prompt-injection defense to ship today?: Spotlighting: wrap all untrusted text in <UNTRUSTED> tags and add a system-prompt rule telling the model to treat that content as data, not instructions. It takes hours to deploy across RAG and tool-output paths and meaningfully raises the floor while you build the heavier Planner/Executor architecture for high-stakes flows.
Does the Grok 4.3 price cut actually beat DeepSeek V4 Pro on blended cost?: Probably not on cache-friendly agent workloads. Grok 4.3's $1.25/$2.50 per million tokens looks aggressive, but a $0.05 fee per safety-filter-blocked request erodes savings at typical 2–3% rejection rates, and DeepSeek's hours-long disk-backed KV cache produces a 3.2× effective discount on repetitive agent loops. Replay real traffic in shadow mode before migrating.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

CacheEconomicsandDomainRoutingReshapeModelSelection

◆ INTELLIGENCE MAP

◆ DEEP DIVES

The Shift Nobody Priced In

The Pricing Floor Has a Hidden Fee

Cross-Source Pattern

What To Do

The Evidence Harness-Dependence Is Load-Bearing

The Qwen Signal

The Fix