Edition 2026-05-12 · read as Product

Spec Quality Is Now the Bottleneck, Not Engineering

Sources: 39
Words: 1,851
Read: 9min

Topics: Agentic AI · LLM Inference · AI Capital

◆ The signal

Notion shipped spec-driven development this week: a PM writes a 4-sentence task description and an agent produces a working feature with PR, screenshots, and preview URL in 20 minutes. Separately, a single developer rewrote 960,000 lines of code in 6 days using AI agents. The binding constraint on your team just flipped from engineering capacity to spec quality and review infrastructure — pick 2-3 bounded features this sprint to pilot, or watch competitors compress their ship cycles 50x while you're still filing Jira tickets.

Key facts

Notion shipped spec-driven development where a PM writes a 4-sentence task description and an agent produces a working PR, screenshots, and preview URL in 20 minutes via an internal system called Boxy.
Developer Jarred Sumner used AI agents to rewrite 960,000 lines of Bun code from Zig to Rust in 6 days, passing Linux test suites.
Anthropic leased SpaceX's entire Colossus 1 cluster of 220,000+ GPUs and 300 megawatts to meet inference demand exceeding provisioned capacity.
A Claude-powered Cursor agent deleted an entire company database and all backups in 9 seconds because Railway coupled backup lifecycle to resource deletion.
Carnegie Mellon, MIT, Oxford, and UCLA found 10 minutes of AI assistance produced ~20% performance degradation when removed and made users nearly 2x more likely to skip tasks unaided.

◆ INTELLIGENCE MAP

01
Spec-Driven Development Compresses the PM-to-Ship Loop
act now
Notion's Boxy ships features from 4-sentence specs in 20 minutes. A solo dev rewrote 960K lines of Bun in 6 days. The new PM artifact is a spec an agent executes — not a doc an engineer interprets. Vague specs now ship bugs in minutes, not conversations over days.
20 minutes to shipped feature · 4 sources
[Stat cards: Notion spec-to-ship · Bun rewrite (960K LoC) · CI target reduction · Iteration gain at 3m CI]
[Chart: Traditional sprint 10080 · Spec-driven (Notion) 20 · Full rewrite (960K LoC) 8640]
02
Inference Bifurcates Into 'Answer' vs. 'Agentic' Markets
monitor
Ben Thompson identifies two inference markets diverging: user-facing 'answer' inference (fast, premium GPUs) and background 'agentic' inference (slow, commodity hardware, 5-10x cheaper). Anthropic's 300MW/220K GPU deal with SpaceX + $1.8B Akamai edge commitment validate the split. Your AI P&L has a hidden line item problem.
300 MW Anthropic capacity · 4 sources
[Stat cards: SpaceX GPU lease · Akamai edge deal · Mistral ARR growth · Flash-Lite p95 latency]
[Chart: Answer Inference 100 · Agentic Inference 10]
03
Agent Governance Becomes the Enterprise Shipping Gate
act now
ServiceNow + NVIDIA shipped Project Arc (open-source agent governance benchmarks). Agent hacking success hit 81%, up from 6% a year ago. A Claude agent deleted a database in 9 seconds. AI flaw density exceeds legacy code. The pattern: governance ships with the agent feature, or ships after the incident.
81% agent hack success rate · 6 sources
[Stat cards: Agent hack rate 2025 · Agent hack rate 2026 · Database deletion time · AI breach share 2025]
[Chart: 2025 hack success 6 · 2026 hack success 81]
04
AI Creates Measurable User Dependency, Not Augmentation
monitor
CMU/MIT/Oxford/UCLA study: 10 minutes of AI assistance → 20% performance drop when removed, 2x task-skip rate. 60%+ workers cite AI as major stressor. Gen-Z actively rejecting AI features. Meanwhile Janitor AI (fantasy roleplay) hits 2.5M DAU. Entertainment retains; productivity creates anxiety.
20% performance drop post-AI · 3 sources
[Stat cards: Performance drop · Task skip increase · Workers stressed by AI · Janitor AI DAU]
[Chart: Performance drop 20 · Task skip rate 100 · Worker AI stress 60]
05
Orchestration Layer Outlives the Feature Layer
background
Figma lost 85% post-IPO because AI agents bypass the canvas. Pinterest's MCP deployment saves 7,000 hours/month across 66K invocations. Wix found docs beat custom skills across 250 evals. x402 agentic payments hit $100M Q1. Value accrues to coordination surfaces, not artifact generators.
7,000 hours saved/month (MCP) · 5 sources
[Stat cards: Figma post-IPO drop · Pinterest MCP saves · x402 Q1 volume · Docs beat skills evals]
[Chart: Pinterest MCP (hours saved) 7,000/mo · x402 agentic payments $100M Q1 · Figma (workflow product risk) -85% · Docs vs skills advantage 250 evals]

◆ DEEP DIVES

01
Spec-Driven Development Is Live — Here's What Changes for PMs This Sprint
The Workflow That Shipped This Week
A PM at Notion opens a task doc, writes four sentences, attaches a screenshot, and @mentions Codex through an internal system called Boxy. Twenty minutes later she has a PR, UI verification screenshots, and a preview URL. This is not a sandbox. It is how Notion ships. In a different building, Jarred Sumner pointed AI agents at Bun and had them rewrite 960,000 lines from Zig to Rust in 6 days, passing the Linux test suites. The common constraint in both stories: the human owned architectural judgment, the agent owned execution speed. Separate the thing being pitched ("agents ship features") from the thing being done ("a human with a clear spec delegates bounded work").
What the Spec Must Contain
The four sentences that actually work name the user problem, the success condition, the constraints the agent should not violate, and the edge cases that matter. Notion keeps Markdown spec files in the repo written in plain English with code pointers and verification steps. Those specs are what the agent reads. They are the source of truth now, not the code. Vague requirements used to produce a conversation with an engineer. They now produce a shipped feature that is wrong in 20 minutes.
The PM skill of thinking clearly about what a feature should do became directly executable. Spec quality now determines shipping velocity in a measurable way.
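One way to make that spec checkable before it reaches an agent is a small completeness test. The sketch below is illustrative only: the `AgentSpec` shape and its field names are assumptions drawn from the four elements plus the code pointers and verification steps described above, not Notion's actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Hypothetical spec shape; field names are illustrative, not Notion's."""
    user_problem: str
    success_condition: str
    constraints: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    code_pointers: list[str] = field(default_factory=list)
    verification_steps: list[str] = field(default_factory=list)


def missing_fields(spec: AgentSpec) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    missing = []
    for name, value in vars(spec).items():
        if isinstance(value, str):
            empty = not value.strip()
        else:
            empty = len(value) == 0
        if empty:
            missing.append(name)
    return missing


spec = AgentSpec(
    user_problem="Users cannot export a filtered table view.",
    success_condition="Export button downloads a CSV matching the active filters.",
    constraints=["Do not change the existing export API"],
)
print(missing_fields(spec))  # ['edge_cases', 'code_pointers', 'verification_steps']
```

A check like this only catches absent sections. Whether the sentences are precise enough for an agent to act on still needs a human read.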
The Infrastructure Dependency Most Teams Miss
Notion is cutting CI time to 25% of current because agents idle during CI. The arithmetic is blunt. A 60-minute CI run means one iteration per hour. A 3-minute run means 20 iterations per hour, a 20x gain. CI investment stops being a developer-happiness line item and becomes a product-velocity line item. That is the framing PMs should bring to the prioritization meeting.
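The CI arithmetic can be written out directly. This is a rough sketch that assumes CI wait is the only thing gating an agent's iteration loop, which real pipelines will not fully satisfy.

```python
def iterations_per_hour(ci_minutes: float) -> float:
    """Upper bound on agent iterations per hour if CI is the only wait."""
    return 60 / ci_minutes


baseline = iterations_per_hour(60)        # 1.0
fast = iterations_per_hour(3)             # 20.0
print(fast / baseline)                    # 20.0

# Cutting a 60-minute run to 25% of current gives a 15-minute run.
quarter = iterations_per_hour(60 * 0.25)  # 4.0
print(quarter / baseline)                 # 4.0
```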
The Maintenance Question Nobody's Answered Yet
Here is where teams tell themselves a comforting story. Notion reports specs producing shippable code at high rates. DevOps research shows AI-assisted code carries higher flaw density than legacy code, and the per-commit maintenance cost of merged AI work is a metric almost nobody tracks. The honest diagnostic is time-to-second-touch: how long between a merge and the next change to the same file for the same reason. If that number trends down while AI commit volume trends up, the velocity chart is debt wearing a speed costume.
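A minimal sketch of how time-to-second-touch might be computed from commit history follows. The `(path, merged_at, ai_assisted)` tuple shape is a hypothetical export format, and the sketch matches on file only: it cannot tell whether the second change was made "for the same reason", so that part still needs tagging or manual review.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def second_touch_gaps(commits):
    """Days from each AI-assisted change to the next change of the same file.

    commits: list of (path, merged_at, ai_assisted) tuples (hypothetical shape).
    Files never touched again contribute no gap.
    """
    by_file = defaultdict(list)
    for path, merged_at, ai_assisted in commits:
        by_file[path].append((merged_at, ai_assisted))

    gaps = []
    for touches in by_file.values():
        touches.sort(key=lambda t: t[0])
        for (when, ai), (next_when, _) in zip(touches, touches[1:]):
            if ai:
                gaps.append((next_when - when).total_seconds() / 86400)
    return gaps


commits = [
    ("app/export.py", datetime(2026, 4, 1), True),
    ("app/export.py", datetime(2026, 4, 4), False),
    ("app/filters.py", datetime(2026, 4, 2), True),
    ("app/filters.py", datetime(2026, 4, 12), True),
    ("app/filters.py", datetime(2026, 4, 13), False),
]
gaps = second_touch_gaps(commits)
print(median(gaps))  # 3.0
```

Tracking the median per week alongside AI commit volume gives the two trend lines the diagnostic compares.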
Where This Works vs. Where It Breaks
The 2x2 to take into Monday's planning. One axis: is the feature bounded (one screen, known data model, clear success state) or unbounded (billing, auth, anything litigable). Other axis: does the team have review infrastructure (tests, staging, rollback) to catch confident-but-wrong output. Bounded plus review infrastructure is the only cell where four-sentence specs earn their keep. Every other cell ships bugs to production at machine speed.
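The 2x2 can be reduced to a filter over the roadmap. The sketch below is one possible encoding: the `Feature` fields and the all-three-required reading of "review infrastructure" are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    single_screen: bool
    known_data_model: bool
    clear_success_state: bool
    touches_billing_auth_or_legal: bool


def is_bounded(f: Feature) -> bool:
    """Bounded means all three properties hold and no sensitive surface is touched."""
    return (f.single_screen and f.known_data_model
            and f.clear_success_state and not f.touches_billing_auth_or_legal)


def pilot_candidates(features, has_tests, has_staging, has_rollback):
    """Only the bounded-plus-review-infrastructure cell yields candidates."""
    if not (has_tests and has_staging and has_rollback):
        return []
    return [f.name for f in features if is_bounded(f)]


roadmap = [
    Feature("CSV export button", True, True, True, False),
    Feature("Invoice proration", True, True, True, True),
    Feature("Workspace redesign", False, False, False, False),
]
print(pilot_candidates(roadmap, has_tests=True, has_staging=True, has_rollback=True))
# ['CSV export button']
print(pilot_candidates(roadmap, has_tests=True, has_staging=False, has_rollback=True))
# []
```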
Action items
- Identify 3 features on your current roadmap that are bounded (single screen, known data model, clear success state) and pilot spec-driven development on them this sprint
- Add three fields to your PRD template: verification steps, code pointers to existing patterns, and acceptance screenshots — model after Notion's Markdown spec format
- Propose a CI speed improvement initiative to your eng lead, framed as agent productivity multiplier (cite: 20x iteration gain if CI drops from 60min to 3min)
- Instrument time-to-second-touch on AI-assisted commits before the next planning cycle
Sources: Lenny's Newsletter · Teng Yan | Chain of Thought · TLDR DevOps · Daily Dose of DS · CSO Security Leadership

02
Your AI P&L Has Two Products on One Price Sheet — Separate Them Before Margin Erodes

The Bifurcation Thesis

Ben Thompson's latest Stratechery piece identifies a structural split in the inference market that most product teams haven't priced for. 'Answer inference' serves humans staring at loading spinners — fast, expensive, premium GPU silicon. 'Agentic inference' serves autonomous job queues where nobody's watching — slow, cheap, commodity hardware. Thompson argues agentic inference "will be the largest compute market by far because it won't be limited by humans or time."

A chatbot stays on premium silicon because latency is the product. A background enrichment pipeline does not, and its per-token cost is going to drop as commodity hardware gets qualified.

The Infrastructure Signals Validating the Split

Three signals this month support the thesis. Anthropic leased SpaceX's entire Colossus 1 (220,000+ GPUs, 300 megawatts) because demand outpaces provisioned compute: users are hitting Claude rate limits. The $1.8B Akamai deal over 7 years bets on distributed edge inference for latency-sensitive workloads. Meanwhile, Nvidia is shipping standalone memory racks, CPU racks, and RTX Pro chips (less powerful, cheaper) for edge data centers. The hardware vendors see the split and are building for both tiers.

What Teams Get Wrong

The dangerous configuration: token-scaling value sold on a per-seat contract. Agent workloads multiply tokens per task by a large factor. Cost per completed task is falling more slowly than cost per token. A PM whose feature spawns 40 model calls per user request is watching margin erode even as model prices drop — because usage grows into the margin faster than prices fall. Sources agree: teams that assume the cost curve alone will fix unit economics are going to discover heavy users found the loop first.
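To see how usage can outrun falling prices on a per-seat contract, here is a back-of-envelope sketch. Every number is hypothetical except the 40 model calls per request, which comes from the paragraph above; substitute your own seat price, request volume, and token rates.

```python
def monthly_margin_per_seat(seat_price, requests, calls_per_request,
                            tokens_per_call, usd_per_million_tokens,
                            volume_multiplier=1.0):
    """Seat revenue minus token cost for one user in one month."""
    tokens = requests * calls_per_request * tokens_per_call * volume_multiplier
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return seat_price - cost


# Hypothetical inputs: $30 seat, 500 requests/month, 2,000 tokens per call.
today = monthly_margin_per_seat(30.0, 500, 40, 2000, 0.50)
print(today)  # 10.0

# Token price halves, but heavy users triple their token volume.
later = monthly_margin_per_seat(30.0, 500, 40, 2000, 0.25, volume_multiplier=3.0)
print(later)  # 0.0

# Same tripled volume at today's price.
stressed = monthly_margin_per_seat(30.0, 500, 40, 2000, 0.50, volume_multiplier=3.0)
print(stressed)  # -30.0
```

In this toy case a 50% price cut does not rescue the margin once volume triples, which is the shape of the risk described above.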

The Architecture Decision

For every AI feature, answer one question: does a human wait for this token, or does a job queue wait? If a human waits, keep it on the premium path and price for latency. If a queue waits, architect it for the commodity tier now — don't wait for the pricing to split at the API level. Anthropic's capacity expansion and Gemini 3.1 Flash-Lite at sub-second / 1.8s p95 give you the tiers to route between today.

Feature Type	Latency Requirement	Cost Path	Action
Chatbot/copilot	<200ms first token	Premium (answer)	Price for it
Background enrichment	Minutes acceptable	Commodity (agentic)	Move to cheap tier now
Agent loops (multi-call)	Seconds per step	Mixed	Separate paths per call
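The routing question in the table can be sketched as a per-call decision. The tier names mirror the table; the `agent_loop` steps and the idea that only the final reply has a human waiting are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    PREMIUM = "premium (answer)"
    COMMODITY = "commodity (agentic)"


def route(human_is_waiting: bool) -> Tier:
    """One question per call: is a person watching a spinner for this token?"""
    return Tier.PREMIUM if human_is_waiting else Tier.COMMODITY


# Hypothetical agent loop: each call is routed on its own, as in the table's
# "separate paths per call" row.
agent_loop = [
    ("fetch account history", False),
    ("summarize findings", False),
    ("stream reply to user", True),
]
plan = {step: route(waiting).value for step, waiting in agent_loop}
for step, tier in plan.items():
    print(f"{step}: {tier}")
```

Keeping the decision at the call level, rather than the feature level, is what lets a mixed agent loop move most of its tokens to the cheaper path.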

Action items

- Categorize every AI feature in your backlog as 'answer inference' (user-facing, latency-sensitive) vs. 'agentic inference' (background, throughput-oriented) and apply separate cost models this sprint
- Design one 'background agent' feature that runs asynchronously (minutes/hours) and delivers results without real-time response; ship by end of quarter
- Implement model fallback routing: add Gemini 3.1 Flash-Lite or Mistral as automatic fallback when Claude is rate-limited
- Stress-test your AI feature unit economics at 3x current token volume per user and present findings to finance before next planning cycle
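The fallback-routing item above might look roughly like the sketch below. It deliberately uses stub callables and a stand-in `RateLimited` exception rather than any vendor SDK; the real error types and client calls depend on the libraries you use and are not shown here.

```python
class RateLimited(Exception):
    """Stand-in for whatever rate-limit error a real client raises."""


def call_with_fallback(prompt, providers):
    """Try providers in order; move on only when one is rate-limited."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all providers rate-limited: " + "; ".join(failures))


# Stubs for illustration; replace with real client wrappers.
def primary(prompt):
    raise RateLimited("429")


def fallback(prompt):
    return f"echo: {prompt}"


used, reply = call_with_fallback(
    "classify this ticket",
    [("primary", primary), ("fallback", fallback)],
)
print(used, reply)  # fallback echo: classify this ticket
```

Only rate-limit errors trigger the fallback here. Other failures propagate, so a bad prompt or an outage is not silently masked by a different model.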

Sources: Ben Thompson · TLDR AI · The Information AM · Morning Brew · Bloomberg Technology

03
Agent Governance Ships With the Feature — Or After the Incident
The Escalation Curve
A pentester ran the same agent benchmark twice, a year apart, and the success rate went from 6% to 81%. In the same window, Google confirmed the first AI-discovered zero-day used by cybercriminals, a 2FA bypass exploiting what the writeup called a "faulty trust assumption." A Claude-powered Cursor agent deleted an entire company database plus all backups in 9 seconds, because Railway's backup lifecycle was coupled to resource deletion and nobody had separated the two. AI systems now show significantly higher severe-flaw density than legacy applications in penetration testing. These are measured outcomes in production, not slides.
ServiceNow Just Named the Category
ServiceNow and NVIDIA shipped Project Arc, an autonomous desktop agent governed by AI Control Tower, with open-source benchmarks for measuring enterprise AI agent performance. The thing being pitched is an agent. The thing being done is defining what "good" looks like before procurement does. Whoever writes the eval writes the RFP. Separately, SAP locked its APIs against third-party agents and pushed developers to SFTP and screen automation workarounds. The incumbents are drawing the governance layer this quarter.
Enterprise buyers are not asking for agent orchestration. They are asking who approved the agent to touch the ticketing system, what it did last Tuesday at 2 a.m., and whether the audit log will survive a compliance review.
The MCP Monitoring Gap
A security lead opens the SIEM and searches for agent activity. She finds user activity. MCP connections from AI agents to enterprise data are completely unmonitored in most deployments. An agent operating inside an authenticated session is, from the logs, indistinguishable from the user whose token it borrowed. Audit trails built for humans do not survive contact with agents making 50 actions per minute across parallel tabs. Enterprise CTEM programs cannot see those calls. This is a shadow AI problem the existing tooling does not address.
Three Concrete Requirements for Every Agent Feature
1. Granular permission scopes per agent capability — narrow enough that one broken capability does not take down the workflow next to it
2. Runtime authorization that can approve or deny in real time based on context, not a config set at deploy time and forgotten
3. Complete audit trails that distinguish agent actions from human actions, with a named remediation owner per capability
These are already appearing on enterprise procurement questionnaires. A PM shipping agent features without them will discover the requirements one of two ways: during a security review that blocks the deal, or during an incident that writes the policy. The Monday decision is which three capabilities get scoped, authorized, and audited this sprint, and which ones get pulled from the demo until they are.
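The three requirements can be pictured together in a small sketch. Everything here is an assumption made for illustration: the `Capability` and `AuditLog` shapes, the scope strings, and the `change_freeze` context flag stand in for whatever your permission model and live signals actually are.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Capability:
    name: str
    scopes: frozenset
    remediation_owner: str  # named human who owns cleanup for this capability


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, actor_type, actor_id, capability, scope, allowed):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor_type": actor_type,  # "agent" or "human", never merged
            "actor_id": actor_id,
            "capability": capability.name,
            "scope": scope,
            "allowed": allowed,
            "remediation_owner": capability.remediation_owner,
        })


def authorize(agent_id, capability, scope, context, log):
    """Decide at call time: the scope must be granted and live context must allow it."""
    allowed = scope in capability.scopes and not context.get("change_freeze", False)
    log.record("agent", agent_id, capability, scope, allowed)
    return allowed


triage = Capability("ticket-triage",
                    frozenset({"tickets:read", "tickets:comment"}),
                    "support-oncall")
log = AuditLog()
print(authorize("agent-7", triage, "tickets:comment", {}, log))                       # True
print(authorize("agent-7", triage, "tickets:delete", {}, log))                        # False
print(authorize("agent-7", triage, "tickets:comment", {"change_freeze": True}, log))  # False
```

Denied attempts are logged as well as allowed ones, since the questions buyers ask are about what the agent tried to do, not only what it succeeded at.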
Action items
- Add a governance section to every AI feature PRD this week: who authorized the agent, what can it access, how do you revoke access, and who owns remediation when it behaves outside spec
- Implement blast-radius constraints and progressive permission escalation for any agent feature with write access to production systems
- Audit all MCP connections: where AI agents connect to customer data, who has visibility, what's logged, and whether agent actions are distinguishable from human actions
- Review ServiceNow's open-source agent benchmarks and evaluate whether your product should optimize against them or propose alternative criteria for your category
Sources: TLDR IT · CyberScoop · CSO Security Leadership · AI Breakfast · Lex Neva · Risky.Biz

04
The Dependency Trap: Your AI Feature Makes Users Worse at Their Job

The Research Finding

A developer writes with an AI assistant for ten minutes, then the extension fails. She stares at the editor. Carnegie Mellon, MIT, Oxford, and UCLA found that 10 minutes of AI assistance creates ~20% performance degradation when the tool is removed, and users are nearly 2x more likely to skip tasks entirely rather than attempt them unaided. The effect holds across experience levels, so this is not a junior-user story. Well-designed autocomplete drops the cost of the next keystroke to zero. Users behave exactly as the UX asked them to.

The Retention Trap It Creates

Teams read dependency as a moat. The dashboard cooperates. Engagement rises, time-to-output falls. What falls with it is the user's ability to judge whether the output is any good. That is the variable that decides next year's renewal. A user who cannot function through an outage is one bad Tuesday from cancellation. The competitor who ships something that makes that same user feel sharper wins quietly.

The Psychological Headwind Most Teams Aren't Measuring

Over 60% of US workers cite AI as a major stress source. Office crying is up 12 points YoY. Gen-Z is building offline cyberdecks to escape AI homogeneity. 83% of executives say psychological safety boosts AI success, yet Meta employees are described as "miserable" under AI mandates. Executive enthusiasm and user resistance are pulling apart. Adoption numbers that look healthy are measuring anxiety-driven compliance.

Progressive disclosure, guardrails, and skill-preservation are the usual answers. The better question is which of those three the specific workflow actually needs.

The Design Split

	Produces the answer	Scaffolds the user's answer
User faster without AI after 30 days	Unlikely cell	Target cell — compounds
User slower without AI after 30 days	Dependency trap — most shipped features	Design gap — fixable

Most shipped features sit in the "produces answer + user slower" cell. The cell that compounds is "scaffolds + user faster." Build assistance that compounds capability. Think guided practice with receding scaffolding, not always-on autopilot. Make the user do the last 10% of the work and refuse to do it for them.

The Counter-Signal: Entertainment Wins Retention

Janitor AI runs at 2.5M daily active users for fantasy roleplay and is the 10th most popular consumer AI app. Productivity AI gets opened when a task arrives. Companionship AI gets opened every day. If the product's job is retention, the dependency research applies. If the job is entertainment, it does not, and 2.5M DAU is the benchmark.

Action items

- Pull a cohort from the last 90 days and run a without-AI test on 20 users: measure whether they're faster or slower at the task compared to pre-adoption baseline
- Instrument three metrics on your highest-usage AI feature: first-suggestion acceptance rate, edit distance between suggestion and final output, and whether users can recall their decision rationale after 1 week
- Add psychological safety considerations to your AI feature onboarding: frame AI as 'removing drudgery and elevating judgment' rather than 'doing your job'
- Redesign one high-dependency AI feature to use receding scaffolding: reduce assistance as user competence builds over sessions
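The three instrumentation metrics above could be computed along these lines. The event shape and the thresholds in `looks_like_replacement` are hypothetical and would need calibrating against your own data; the edit distance is plain Levenshtein distance over characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def dependency_summary(events):
    """Aggregate the three metrics over logged suggestion events."""
    n = len(events)
    return {
        "acceptance_rate": sum(e["accepted_first"] for e in events) / n,
        "mean_edit_distance": sum(
            edit_distance(e["suggestion"], e["final"]) for e in events) / n,
        "recall_rate": sum(e["recalled_rationale"] for e in events) / n,
    }


def looks_like_replacement(s, min_accept=0.7, max_edit=1.0, max_recall=0.3):
    """Hypothetical thresholds: high acceptance, low edits, low recall."""
    return (s["acceptance_rate"] >= min_accept
            and s["mean_edit_distance"] <= max_edit
            and s["recall_rate"] <= max_recall)


events = [
    {"suggestion": "Approve the refund", "final": "Approve the refund",
     "accepted_first": True, "recalled_rationale": False},
    {"suggestion": "Escalate to tier 2", "final": "Escalate to tier 2.",
     "accepted_first": True, "recalled_rationale": False},
    {"suggestion": "Close as duplicate", "final": "Close as duplicate",
     "accepted_first": True, "recalled_rationale": False},
    {"suggestion": "Deny the claim", "final": "Deny this claim",
     "accepted_first": False, "recalled_rationale": True},
]
summary = dependency_summary(events)
print(summary)
print(looks_like_replacement(summary))
```

Raw character distance favors short outputs, so for long-form features it may be worth normalizing by output length before comparing across features.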

Sources: The Hustle · The Hustle · The Download from MIT Technology Review

◆ QUICK HITS

OpenAI launched DeployCo at $10B valuation with McKinsey, Bain, and Capgemini — your model vendor is now also your integrator competitor. Map channel conflict risk per feature this sprint.
Techpresso
Doubao (ByteDance's AI app) at 345M MAU introduced paid tiers and triggered backlash — textbook case of free-user base ≠ paid-user persona. Audit whether your highest-engagement free users are your ideal converters.
ChinAI Newsletter
Wix ran 250 evaluations: agent-optimized documentation beat custom skills as the default strategy. Skills only won when perfectly maintained — small errors inflated costs dramatically. Prioritize doc quality over skill registries.
TLDR Data
A/B test prediction accuracy: 56% of expert estimates are too high, 2.7pp median error, and experience doesn't help. Familiarity with similar past tests pushes directional accuracy to 65%. Build a searchable experiment library.
TLDR Marketing
Ramp hit $40B+ valuation while 90% of firms report zero AI productivity impact — the gap is organizational structure (Coinbase: 5 layers max, no pure managers), not model access.
TLDR Fintech
Instagram killed its Reels-focused iPad layout after 6 months of user rejection — forcing mobile-optimized patterns onto larger screens fails when the device serves a different job.
TLDR Design
Update: LiteLLM actively exploited via trivial SQL injection (CVE-2026-42208) — crafted Authorization header gives full database read/write. If in your stack, patch status is urgent.
Risky.Biz
Anthropic's SKILL.md establishes a file-based skill pattern that 4 independent research papers converge on as the next platform layer. Decide now: adopt Anthropic's format or define your own before it calcifies.
🔳 Turing Post
Sierra hit $165M ARR with 40% Fortune 50 penetration on voice agents — the winning formula is narrow intent automation with 60%+ containment rates, not general-purpose conversation.
Newcomer
x402 agentic payments cleared $100M in Q1 with 99.8% market share — agents paying $0.01/call without subscriptions. Evaluate which of your API capabilities an agent would pay for 1,000x/day.
TLDR Crypto

◆ Bottom line

The take.

Your PM workflow split in two this week: Notion is shipping features from 4-sentence specs in 20 minutes while research shows users become 20% worse at their jobs after just 10 minutes of AI assistance. The job is no longer 'prioritize engineering capacity' — it's 'write specs precise enough for agents to execute correctly, build review gates fast enough to catch when they don't, and design AI experiences that make users more capable rather than more dependent.' Ship the governance with the feature or ship it after the incident. The teams that pick bounded features, instrument the maintenance cost, and separate their answer-inference from their agentic-inference this quarter will be the ones still shipping in Q4.

Frequently asked

How do I pick which features to pilot with spec-driven development this sprint?
Choose features that are bounded (single screen, known data model, clear success state) AND sit on top of solid review infrastructure (tests, staging, rollback). Anything touching billing, auth, or litigable surfaces should stay off the pilot list. The bounded-plus-reviewable cell is the only one where four-sentence specs reliably earn their keep instead of shipping confident-but-wrong code at machine speed.

What does an agent-readable spec actually need to contain?
Four things: the user problem, the success condition, the constraints the agent must not violate, and the edge cases that matter. Notion stores these as Markdown spec files in the repo with code pointers to existing patterns and verification steps. Add three fields to your PRD template (verification steps, code pointers, and acceptance screenshots) and the spec becomes the source of truth the agent reads.

How should I separate answer inference from agentic inference in my AI features?
Ask one question per feature: does a human wait for this token, or does a job queue wait? Human-waiting features (chatbots, copilots) belong on premium low-latency silicon and should be priced for latency. Queue-waiting features (background enrichment, multi-step agent loops) belong on the commodity tier now; don't wait for API-level price splits to force the architecture decision later.

What governance must ship with every agent feature to survive enterprise procurement?
Three non-negotiables: granular permission scopes per agent capability, runtime authorization that approves or denies based on live context, and audit trails that distinguish agent actions from human actions with a named remediation owner. These are already appearing on procurement questionnaires, and a Cursor agent deleting a database plus all backups in 9 seconds is the incident every CISO will cite when blocking launches.

How do I tell if my AI feature is creating a dependency trap rather than a moat?
Pull a 90-day cohort and run a without-AI test on 20 users: measure whether they're faster or slower at the task than their pre-adoption baseline. Also instrument first-suggestion acceptance, edit distance to final output, and whether users can recall their decision rationale a week later. High acceptance plus low edit distance plus no recall means the feature is replacing the user, and that user is one outage away from churning.
