What metric should replace 'sessions per user' for AI features?

Track 'time-to-usable-output' and 'output acceptance rate' — the percentage of AI outputs that ship to the next person without a human rewrite. Engagement metrics like sessions and prompts-per-week perversely reward bad output, since users retry more when results are unusable. Replit's 300% net revenue retention correlates with shipped-artifact rate, not session count.

How do I decide between MCP and Skills for agent features?

Classify each backlog item as an integration story or a knowledge story. If the agent needs a fresh read from a live system with schema validation and isolation, use MCP. For reusable instructions and prompts with zero infrastructure overhead, default to Skills. Teams that skip this classification ship both patterns and pay the security tax twice.

Are open-weight models actually viable replacements for closed APIs now?

For coding, support, content generation, and workflow automation, yes — DeepSeek V4 Pro, Kimi K2.6, and MiMo V2.5 Pro score 52-54 on the Intelligence Index versus 57-60 for GPT-5.5 and Gemini 3.1 Pro. The 3-6 point gap concentrates in esoteric benchmarks like HLE and CritPt. Benchmark them inside your own production harness on your top workflows before deciding — leaderboards don't reflect your task mix.

Why does Cursor run negative margins while Replit hits $1B ARR?

Cursor owns only the AI layer over someone else's IDE and foundation models, so inference cost is the P&L. Replit owns compute, deployment, collaboration, and the AI layer on one surface, capturing the full draft-to-deploy loop. When the product is a thin interface on someone else's foundation, every token consumed is margin lost; when you own the loop, the output ships and the user returns.

What's the minimum viable defense against prompt injection in agent products?

Implement a Planner/Executor split: one LLM decides what to do and never sees untrusted content, while a separate LLM with narrower permissions carries it out. Gmail runs this pattern in production. It roughly doubles inference cost on sensitive flows, but prompt injection is OWASP's #1 LLM vulnerability — enterprise pen tests will surface it, so budget for the architecture before the test, not after.

Edition 2026-05-03 · read as Product

TheLast-MileGapBetweenAIOutputandSend-ReadyWork

Sources: 8
Words: 1,508
Read: 8min

Topics Agentic AI LLM Inference AI Capital

◆ The signal

A banker I spoke with last month pasted the model's draft into a client memo, then spent forty minutes rewriting it anyway. That is the texture behind the survey: five hundred bankers say AI output is unusable for client work, and twenty-two percent of non-tech jurors in San Francisco say it makes them slower. Replit crossed roughly one billion dollars ARR with 300% net revenue retention because it owns the draft-to-deploy loop where the output ships without a rewrite. Cursor, owning only the AI layer, runs at negative twenty-three percent gross margins. The sprint question is not model quality. It is the distance between the model's output and a send-ready artifact.

Key facts

A survey of 500 banking professionals found AI outputs consistently unusable for client-facing communication, and 22% of non-tech workers in San Francisco report AI makes their work slower.
Replit grew from $2.8M to roughly $1B ARR in 18 months with 300% net revenue retention by owning the full draft-to-deploy loop.
Cursor, operating only as an AI layer over VS Code, runs at -23% gross margins and is exploring a $60B exit to SpaceX.
Open-weight models DeepSeek V4 Pro, Kimi K2.6, and MiMo V2.5 Pro score 52-54 on the Artificial Analysis Intelligence Index, within 3-6 points of closed frontier models like GPT-5.5 at 60.
Hugging Face CEO Clem Delangue expects autonomous agents to outnumber the platform's 15 million human users by end of 2026, with a new repository created every 8 seconds.

◆ INTELLIGENCE MAP

01
The Ship-Ready Output Bar: AI's Biggest Product Gap Is Now Quantified
act now
500 bankers say AI outputs are unusable for client work. 22% of non-tech jurors say AI slows them down. Replit's $1B ARR proves the fix: own the full workflow from draft to deployed artifact. Platform giants agree — Google and Microsoft are embedding AI directly into artifact-producing surfaces.
300%
Replit net revenue retention
4
sources
- Bankers: output unusable
- Jurors: AI slows work
- Replit ARR
- Cursor gross margin
- Replit revenue growth
1. Replit (owns workflow)300
2. Cursor (AI layer only)-23
02
Open Weights Cross 90% of Frontier — Build-vs-Buy Flips
monitor
Three open-weight trillion-param models (DeepSeek V4 Pro, Kimi K2.6, MiMo V2.5 Pro) score 52-54 vs 57-60 for closed frontier on the Intelligence Index. The 3-6 point gap is closeable with harness work for most production tasks. OpenAI's Microsoft exclusivity ended — GPT-5.5 now on AWS Bedrock alongside Claude.
90%
of frontier capability
3
sources
- Open models index
- Closed frontier index
- DeepSeek cache savings
- Qwen3.6 vs predecessor
1. GPT-5.560
2. Opus 4.757
3. DeepSeek V4 Pro54
4. Kimi K2.653
5. MiMo V2.5 Pro52
03
Your Platform's Next User Isn't Human — Agent-First Design Arrives
monitor
Hugging Face is redesigning for agent users — CLIs, agents.md files, token-efficient APIs — expecting agents to outnumber its 15M human users by end 2026. Agent infrastructure is converging fast: LangChain shipped RBAC, Cloudflare launched durable execution, Flue launched as a programmable headless framework. MCP vs Skills is now the architecture decision of the quarter.
2026
agents > humans (HF)
3
sources
- HF human users
- New repo cadence
- HF employees
- Token reduction (RMAS)
1. LangChain RBACMulti-user auth + data isolation
2. Cloudflare Dynamic WorkflowsDurable agent execution
3. Hermes /goal loopSupervisor-forced agent goals
4. Flue launchProgrammable headless agents
5. HF agents.mdMachine-readable platform surfaces
04
AI Monetization Fragments: Usage Pricing, Safety Fees, and Ad Subsidies
monitor
GitHub will charge heavy Copilot users more starting June 2026. xAI introduced $0.05 per safety-filter-blocked request. Grok 4.3 is 40-60% cheaper but only excels in legal/finance — 11% on math. Nadella frames AI as a 'usage business.' Each model creates a different cost structure and platform risk profile.
$0.05
per blocked request (xAI)
2
sources
- Grok 4.3 input price
- Grok vs 4.2 savings
- Grok ProofBench score
- Reddit Q revenue
1. Grok 4.3 (legal/fin)1.25
2. Grok 4.2 (baseline)2.08
3. xAI blocked request0.05
05
Government AI Consolidates to 7 Vendors — Anthropic Locked Out
background
Pentagon signed classified AI deals with 7 vendors: OpenAI, Google, Microsoft, Amazon, Nvidia, xAI, and Reflection. Anthropic was explicitly labeled a 'supply chain risk.' Separately, Roblox lost 18% in one session after child-safety measures slowed growth. Compliance is a line item, not an afterthought.
7
approved DoD AI vendors
3
sources
- Approved vendors
- Anthropic status
- Roblox stock drop
- Nebius/Eigen AI deal
1. 01OpenAIApproved
2. 02GoogleApproved
3. 03MicrosoftApproved
4. 04Amazon (AWS)Approved
5. 05NvidiaApproved
6. 00AnthropicEXCLUDED

◆ DEEP DIVES

01
The Ship-Ready Output Bar — 500 Bankers, 22% Jury Skepticism, and the $1B Proof That Owning the Loop Is Everything
The Quality Gap Nobody Measured
A relationship manager at a bank opens a chat window, types a prompt about a client portfolio, gets a paragraph back, and rewrites it before sending. She does this four times a day. She is not a skeptic. She is doing the work the model did not finish. A survey of 500 banking professionals this week found AI outputs consistently unusable for client-facing communication. Jury selection in the Musk v. OpenAI trial surfaced the same pattern from the other end of the labor market: 22% of non-tech workers in San Francisco — nurses, caretakers, painters — say AI makes their work slower because they have to double-check output. Platform analytics are still counting sessions. The metric that matters is whether the output shipped without a human rewrite.
The bottleneck in AI product adoption has shifted from model capability to the distance between model output and a send-ready artifact.
Who Solved It and Who Didn't
Replit went from $2.8M to ~$1B ARR in 18 months with 300% net revenue retention. A user types a prompt, sees a working prototype, edits the parts that are wrong, and deploys. All of that happens on one surface. Cursor, pitched into the same category, reportedly runs -23% gross margins and is exploring a $60B exit to SpaceX. The model is not the difference. The loop is. Replit owns compute, deployment, collaboration, and the AI layer. Cursor is an AI panel over VS Code. When the product is a thin interface on someone else's foundation model, inference cost is the P&L.
The platform giants reached the same conclusion. Google Gemini now produces documents, spreadsheets, and presentations inside the chat interface. Microsoft embedded AI contract agents into Word that surface clauses, risks, and changes at a glance. Mistral launched Workflows to connect models to business processes. Standalone AI chat is a dead-end UX. The teams that are winning are shipping finished artifacts inside the tools where work already happens.
The Reddit Counter-Example
Reddit posted $663M revenue, beat estimates by $53.2M, and reported a 30% YoY jump in weekly search users driven by Reddit Answers. The feature summarizes human forum posts. This works because the output is the artifact. The user reads the summary and leaves. There is no draft to rewrite. Search DAUs, WAUs, and queries are up meaningfully, which is what "ship-ready AI output" looks like when the artifact is small enough to fit in the answer.
The Diagnostic
Draw a 2x2. One axis: does the AI output ship to the next person without a human rewriting it. Other axis: does the user come back next week. The cell with revenue in it is "ships without rewrite, user returns." A product sitting in "needs rewriting, user returns" is a tolerated workflow, not a moat, and it collapses the week a competitor crosses into the top-right. The 500 bankers are not a verdict on AI in finance. They are a verdict on which teams decided the quality bar was someone else's problem.
Action items
- Audit every AI feature that requires copy-paste, reformat, or human rewrite before the output is usable. Tag the top 3 by usage volume by end of this sprint.
- Redesign the highest-usage AI feature to produce an inline, editable artifact rather than chat output. Ship a prototype within two sprints.
- Replace 'sessions per user' and 'prompts per week' with 'time-to-usable-output' and 'output acceptance rate' as primary AI feature metrics. Instrument by end of June.
Sources:A banker at a mid-sized firm opened the AI-generated memo · A developer opened Replit this month · A juror in the Musk securities trial was asked · A product manager opened her cost dashboard on Monday
02
Open Weights at 90% of Frontier + Multi-Cloud = Your Single-Provider Lock-In Has a Shelf Life
The Gap Collapsed to 3-6 Points
A PM evaluating a coding copilot this quarter can now run three open-weight models that score within striking distance of the closed frontier: DeepSeek V4 Pro (1.6T total/49B active), Kimi K2.6 (1T/32B active), and MiMo V2.5 Pro (1T/42B active), all landing at 52-54 on the Artificial Analysis Intelligence Index. The closed tier sits at 57-60: GPT-5.5 at 60, Gemini 3.1 Pro and Opus 4.7 at 57. That 3-6 point gap is concentrated in the hardest, most esoteric benchmarks: HLE, TerminalBench Hard, CritPt, and hallucination-heavy tasks. For coding, customer support, content generation, and workflow automation, the open tier is functionally equivalent.
DeepSeek V4 Pro is described as 'the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding.'
The Cache Economics Nobody's Optimizing
DeepSeek V4's dashboard shows $1,051 in actual spend against $3,351 in cache savings, a 3.2x cost multiplier from disk-based KV caching that persists for hours versus the 5-minute windows typical of most providers. The architecture also reduces KV cache to 10% with 4x lower inference FLOPs at long context. For multi-turn agents, the highest-leverage cost move is not switching models. It is redesigning agent architecture for cache reuse: stable system prompts, ordered tool calls, session continuity, warm-start guarantees.
OpenAI on Bedrock Ends the Exclusivity Era
OpenAI's Microsoft exclusivity ended. GPT-5.5, Codex, and a joint agent platform now run on AWS Bedrock alongside Anthropic's Claude. AWS hosts both. Separately, Alibaba's Qwen3.6-27B beats its predecessor that was 15x larger on coding benchmarks, which collapses the 'route to specialized model per task' architecture a lot of teams built last year. Hugging Face CEO Clem Delangue predicts AI workloads will invert from 99% proprietary API to 95% local or specialized, a number worth discounting but not ignoring. Local model downloads on Hugging Face are climbing, and Nebius acquired Eigen AI for $615M specifically to boost inference performance.
Where Sources Disagree
The tension worth naming: Delangue predicts 95% local. Capital flows say otherwise, with $725B in Big Four AI investment this year, mostly directed at centralized compute. The likely resolution is that undifferentiated workloads migrate to local and open, while frontier-dependent tasks stay proprietary. The PM decision this sprint is which category each feature in the roadmap falls into, and the honest answer is usually not the one the deck claims.
Action items
- Benchmark DeepSeek V4 Pro, Kimi K2.6, and your current closed-API model inside your actual production harness on your top 5 user workflows. Measure hallucination rate, latency (TTFT/TPS), and cache hit rate. Complete by mid-June.
- Add cache hit rate and session persistence to your LLM provider evaluation criteria. Negotiate cache SLAs with current providers this quarter.
- Evaluate AWS Bedrock as a multi-model routing layer now that OpenAI and Anthropic models both live there. Document a 90-day migration path for your primary AI workload.
Sources:A staff engineer ran the same eval suite · A developer on your platform opened the API docs at 2 a.m. · A banker at a mid-sized firm opened the AI-generated memo
03
Your Platform's Next User Isn't Human — And It Won't Read Your Docs
Hugging Face Is Rebuilding for Agents
A developer pushed a model to Hugging Face this week and never opened the web UI. She used the CLI, read an agents.md file, and her script parsed a token-efficient JSON response. Clem Delangue expects autonomous agents to outnumber the platform's 15 million human users by end of 2026, on a site creating a new repository every 8 seconds. The infrastructure is following the behavior: CLIs instead of GUIs, headless interfaces, agents.md files, token-efficient API responses. The agents.md file is the tell. It is a machine-readable contract replacing the README.md humans used to skim. Any product with an API surface is in the same conversation.
An agent hit the endpoint, parsed the error, retried with a correction, cached the schema, and moved on. It did not render your docs page. Your analytics pipeline recorded nothing because it assumes sessions start with pageviews.
Agent Infrastructure Is Converging Fast
Four launches, same shape. LangChain shipped multi-user RBAC with data isolation. Cloudflare announced Dynamic Workflows for durable agent execution. Hermes added supervisor-forced goal loops. Flue launched as a programmable headless agent framework. The primitives everyone converges on are durable execution, human-in-the-loop, auth boundaries, subagent orchestration, memory, and feedback loops. A Recursive Multi-Agent Systems paper reports 8.3% accuracy improvement with 34.6-75.6% token reduction across nine benchmarks. These primitives are commoditizing faster than most roadmaps assume.
MCP vs Skills: The Architecture Decision This Quarter
Teams pitch this as a feature comparison. What they are actually doing is a job classification. MCP (Model Context Protocol) connects agents to live systems via JSON-RPC in separate containers with schema validation and isolation. Skills give agents reusable instructions via a SKILL.md directory with zero infrastructure overhead, but they execute arbitrary commands in the agent's own process with no isolation. The forcing function fits on a napkin: if the agent needs a fresh read from a live system, use MCP. For everything else, default to Skills. Teams that refuse to classify each backlog item ship both patterns and pay the security tax twice.
The Security Layer You're Missing
Prompt injection sits at #1 on the OWASP LLM Top 10. The minimum viable defense is a Planner/Executor Split. One LLM decides what to do and never sees untrusted content. A separate LLM with narrower permissions carries it out. Gmail runs this in production. Inference cost on sensitive flows roughly doubles. For enterprise customers, shipping agents without this architecture is shipping an exploit with a UI on top.
Action items
- Audit your API's machine-readability this sprint: Can an autonomous agent discover, authenticate, and consume your API without a human in the loop? Create an agents.md file and test with an LLM agent.
- Classify every item on your agent feature backlog as an 'integration story' (→ MCP) or 'knowledge story' (→ Skills). Share classification with eng lead before committing to either architecture pattern.
- Budget for a Planner/Executor prompt injection defense in any agent feature touching enterprise customers. Scope the ~2x inference cost increase into your margin model.
Sources:A developer on your platform opened the API docs at 2 a.m. · An engineering lead opened the MCP specification on Tuesday · A staff engineer ran the same eval suite

◆ QUICK HITS

Roblox shares cratered 18% in one session after child-safety measures slowed user growth — the market punishes you for being unsafe, then again for the growth cost of becoming safe. Quantify your T&S feature impact on DAU before regulators do it for you.
A program manager at a defense agency opened her vendor approval dashboard
xAI introduced a $0.05 fee per safety-filter-blocked request — the first provider to monetize moderation refusals. Add refusal-rate as a cost variable in your LLM budget model; the provider profits from refusing your requests.
A product manager opened her cost dashboard on Monday
Update: OpenAI ad tracking is now on by default in ChatGPT (previously reported as 'testing'). If user prompts flow through OpenAI APIs, GDPR/CCPA exposure is no longer hypothetical — route to legal this week.
A product manager opened her cost dashboard on Monday
GitHub will charge heavy Copilot users more starting June 2026, with Nadella explicitly framing AI as a 'usage business' — validates that flat-rate pricing doesn't survive AI's top-percentile cost curve. Audit your own pricing before the same math hits.
A banker at a mid-sized firm opened the AI-generated memo
Meta acquired Assured Robot Intelligence for its 'Android of humanoids' strategy — supplying intelligence layers (whole-body control, e-Flesh tactile sensors, on-device quantization) while others build hardware. The edge-deployment tooling benefits any on-device AI roadmap.
A product manager opened her cost dashboard on Monday
Voice cloning from 120-second reference clips is now commercially available via xAI's Custom Voices suite. If you have voice UX, prototype personalization. If you have voice-based auth, red-team it immediately.
A product manager opened her cost dashboard on Monday
80% of Claude users live in households earning $100K+ — AI tools have an affluent-user concentration problem that limits TAM expansion and skews product feedback loops toward power users.
A banker at a mid-sized firm opened the AI-generated memo
AIEWF 2026 added tracks for Agentic Commerce, Memory, and Tokenmaxxing — conference track names are leading indicators. If your 2027 roadmap doesn't address agents that autonomously purchase or long-term context persistence, note the gap.
A staff engineer ran the same eval suite
ZaiNar exited stealth with ultraprecise tracking targeting $5B in deals as a GPS alternative for robotics — if your roadmap touches logistics, warehousing, or AR, evaluate whether you produce position data or consume it.
A juror in the Musk securities trial was asked

◆ Bottom line

The take.

The AI product bottleneck has moved from 'can the model do the task' to 'can the user ship the output without rewriting it' — 500 bankers say no, 22% of mainstream workers say AI slows them down, and Replit proved the fix is worth $1B ARR by owning the full draft-to-deploy loop while Cursor bleeds at -23% margins owning only the AI layer. Simultaneously, open-weight models crossed 90% of frontier capability, OpenAI's Microsoft exclusivity ended with GPT-5.5 landing on AWS Bedrock, and Hugging Face is redesigning its platform for agent users it expects will outnumber humans by year-end. The model layer is commoditizing from both directions — the teams that win this quarter own the workflow surface where AI output becomes a finished artifact.

Frequently asked

What metric should replace 'sessions per user' for AI features?: Track 'time-to-usable-output' and 'output acceptance rate' — the percentage of AI outputs that ship to the next person without a human rewrite. Engagement metrics like sessions and prompts-per-week perversely reward bad output, since users retry more when results are unusable. Replit's 300% net revenue retention correlates with shipped-artifact rate, not session count.
How do I decide between MCP and Skills for agent features?: Classify each backlog item as an integration story or a knowledge story. If the agent needs a fresh read from a live system with schema validation and isolation, use MCP. For reusable instructions and prompts with zero infrastructure overhead, default to Skills. Teams that skip this classification ship both patterns and pay the security tax twice.
Are open-weight models actually viable replacements for closed APIs now?: For coding, support, content generation, and workflow automation, yes — DeepSeek V4 Pro, Kimi K2.6, and MiMo V2.5 Pro score 52-54 on the Intelligence Index versus 57-60 for GPT-5.5 and Gemini 3.1 Pro. The 3-6 point gap concentrates in esoteric benchmarks like HLE and CritPt. Benchmark them inside your own production harness on your top workflows before deciding — leaderboards don't reflect your task mix.
Why does Cursor run negative margins while Replit hits $1B ARR?: Cursor owns only the AI layer over someone else's IDE and foundation models, so inference cost is the P&L. Replit owns compute, deployment, collaboration, and the AI layer on one surface, capturing the full draft-to-deploy loop. When the product is a thin interface on someone else's foundation, every token consumed is margin lost; when you own the loop, the output ships and the user returns.
What's the minimum viable defense against prompt injection in agent products?: Implement a Planner/Executor split: one LLM decides what to do and never sees untrusted content, while a separate LLM with narrower permissions carries it out. Gmail runs this pattern in production. It roughly doubles inference cost on sensitive flows, but prompt injection is OWASP's #1 LLM vulnerability — enterprise pen tests will surface it, so budget for the architecture before the test, not after.

◆ Same day, different angle

Read this day as…

◆ Recent in product

TheLast-MileGapBetweenAIOutputandSend-ReadyWork

◆ INTELLIGENCE MAP

◆ DEEP DIVES

The Quality Gap Nobody Measured

Who Solved It and Who Didn't

The Reddit Counter-Example

The Diagnostic

The Gap Collapsed to 3-6 Points

The Cache Economics Nobody's Optimizing

OpenAI on Bedrock Ends the Exclusivity Era

Where Sources Disagree

Hugging Face Is Rebuilding for Agents

Agent Infrastructure Is Converging Fast

MCP vs Skills: The Architecture Decision This Quarter

The Security Layer You're Missing

◆ QUICK HITS

The take.

Frequently asked

◆ RELATED THREADS