Friday, May 1, 2026 ~5 min

The week the patch window closed and the agent wallet opened

AI agents autonomously exploited 98% of the CISA KEV entries they were tested against in the same week that Stripe, Cloudflare, and Cursor shipped agents that can buy, deploy, and pay. Both sides of your trust model broke at once.

Two numbers from this week deserve to sit next to each other on a whiteboard. MOAK demonstrated that off-the-shelf Opus 4.6 and GPT-5.4 produced working exploits for 174 of 178 CISA KEV entries published after model training cutoffs, a 98% success rate from advisory text alone. In the same news cycle, Stripe shipped agent wallets and one-time-use credentials, Cloudflare shipped agent self-provisioning of accounts and domains, and Cursor open-sourced an SDK that drops a coding agent runtime into anyone's product. An agent can now write code, buy hosting, and deploy it. An LLM can now read a CVE advisory and turn it into a working exploit, and the fastest in-the-wild exploitation observed this week landed 12.5 hours after disclosure.

The attacker side and the principal side both broke loose from human time, in the same week, on the same stack.

The defender's window collapsed to hours

Sysdig observed exploitation of LMDeploy's SSRF 12.5 hours after disclosure, with no public PoC. The pre-auth SQLi in LiteLLM (CVE-2026-42208) was weaponized in 36 hours; LiteLLM is the proxy that fronts most teams' OpenAI, Anthropic, and Bedrock keys behind a single credential table. Three independent ML frameworks shipped pickle-deserialization RCEs at CVSS 9.8 in the same seven days. Claude Code shipped a CVSS 10.0 sandbox escape via symlinks. GitHub Enterprise Server has a single-push RCE (CVE-2026-3854) that 88% of instances haven't patched: a semicolon in a git push option overrides the pre-receive sandbox and runs arbitrary code as the git service user.

A 30-day patch SLA was calibrated for a world where writing the exploit was the slow step. On this week's data, it isn't the slow step anymore. HackerOne's decision to pause the Internet Bug Bounty is the institutional version of the same admission: AI-assisted discovery has outrun open-source remediation capacity, and the queue isn't going to drain.

The supply chain side of the same week is worse, because the failure is structural rather than novel. TeamPCP poisoned checkmarx/kics:latest on Docker Hub. Bitwarden's Dependabot pulled it into CI. The malicious code shipped as @bitwarden/cli 2026.4.0. Tag-based image pulls plus auto-merging dependency bots — the default modern ML CI topology — is the amplifier. SAP's npm namespace got hijacked four days later, harvesting AWS, Azure, GCP, and GitHub tokens via preinstall hooks. The tooling installed to find intruders was the route the intruders used.
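The amplifier described above depends on CI pulling images by mutable tag. A minimal sketch of a policy check follows, assuming image references arrive as plain strings; the function names and the example digest are hypothetical, and a real pipeline would extract the references from its own CI configuration.

```python
import re

# A reference counts as pinned only if it ends in an immutable sha256 digest.
# Tags such as ":latest" or ":v1.2" can be repointed upstream at any time.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference names an image by content digest."""
    return bool(DIGEST_RE.search(image_ref.strip()))


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references a CI policy check should reject."""
    return [ref for ref in image_refs if not is_pinned(ref)]


if __name__ == "__main__":
    refs = [
        "checkmarx/kics:latest",                          # pulled by tag
        "example.org/scanner@sha256:" + "0" * 64,         # hypothetical digest
    ]
    for ref in unpinned_images(refs):
        print(f"reject: {ref} is pulled by tag, not digest")
```

A digest pin does not make an image trustworthy. It only guarantees that the bytes reviewed are the bytes that run, which is the property a repointed tag takes away.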

The principal model also broke

While that was happening, three vendors converged independently on the same primitive: agent as first-class principal. Cloudflare lets an agent create accounts, buy domains, mint tokens, and deploy. Stripe's Link CLI issues ephemeral, scoped, one-time-use payment credentials so agents transact without ever holding a real card number. Cursor's SDK embeds the agent runtime inside CI/CD and third-party products. Within days, someone had a Cursor agent running inside Gmail.

Anthropic's own telemetry says 93% of agent prompts are auto-approved. At that rate the human-in-the-loop isn't a control — it's a gauge with a person attached, carrying near-zero mutual information about risk. An agent version that pushes auto-approval from 93% to 96% is not safer. It's the same broken gauge reading higher.
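One way to put a ceiling on what an approval signal can say about risk: the mutual information between a yes/no approval and anything else cannot exceed the entropy of the approval itself. The sketch below computes that ceiling under the simplifying assumption that each approval is an independent binary event; the actual information carried may be far lower.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Shannon entropy, in bits, of a yes/no signal that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a signal that never varies carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


if __name__ == "__main__":
    for rate in (0.50, 0.93, 0.96):
        ceiling = binary_entropy_bits(rate)
        print(f"approval rate {rate:.0%}: at most {ceiling:.3f} bits per decision")
```

The ceiling shrinks as the approval rate climbs toward 100%, which is the sense in which a higher auto-approval rate is the same gauge saying less.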

The court system has already started pricing this. A Northern District of California ruling under Rule 10b-5 held that when a platform's AI exercises ultimate authority over assembled content, the platform is the legal maker of that content. Section 230-style shields don't automatically extend to agent decisions. Most production agentic stacks today run at that autonomy level but log as if they were tool-assist. What those logs fail to show, until legal discovery forces the question, is whether a human ever closed the loop.

What actually carries signal

The through-line in both stories is that systems built around human-speed checkpoints are being end-run from both sides. Patch windows assumed exploits took days to write. Approval queues assumed humans had time to read. Identity stacks assumed principals had session lifetimes measured in hours. None of that survives the week.

The corollary, which most boards haven't priced, is that the metrics most teams report are now measuring the wrong things. Accuracy-only model evals miss that IBM's Granite 4.1 8B used 19.5× fewer output tokens than Qwen3.5 9B at comparable capability — invisible on a leaderboard, decisive on the invoice. Per-action approval rates miss that auto-approval at 93% means the rate isn't telling you anything. Patch SLAs measured in days miss that the disclosure-to-exploit window is measured in hours. Microsoft's cloud margin fell 500 basis points to 56% on AI inference load, and AI coding spend per developer is up 10–15× in six months to $3,000–$5,000/month — finance teams will find this before product teams do.

What to do this week

Three things, all specific, all this sprint.

First: rewrite your KEV-linked patch SLA to 72 hours for internet-exposed assets, 48 hours for anything with a detailed advisory, and wire the CISA feed directly into automated PR generation. If you ran an unpatched LiteLLM during the disclosure window, rotate every OpenAI, Anthropic, and Bedrock key that transited it before you investigate anything else. Patch GHES to 3.14.24 / 3.15.19 / 3.16.15 / 3.17.12 / 3.18.6 / 3.19.3 today, then rotate runner tokens, deploy keys, and cached PATs. Ban tag-based image pulls in CI and disable Dependabot auto-merge for security tooling — between them, those two policies cover most of this week's blast radius.
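The SLA rule in the first item can be sketched as a deadline calculation. This assumes the tighter window wins when both conditions apply and that assets matching neither rule fall back to whatever policy already exists; the function name, its parameters, and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SLA_INTERNET_EXPOSED = timedelta(hours=72)
SLA_DETAILED_ADVISORY = timedelta(hours=48)


def patch_deadline(
    disclosed_at: datetime,
    internet_exposed: bool,
    detailed_advisory: bool,
) -> Optional[datetime]:
    """Return the patch deadline, or None if neither KEV-linked rule applies."""
    windows = []
    if internet_exposed:
        windows.append(SLA_INTERNET_EXPOSED)
    if detailed_advisory:
        windows.append(SLA_DETAILED_ADVISORY)
    if not windows:
        return None  # defer to the organization's existing SLA
    return disclosed_at + min(windows)  # assumption: the tighter window wins


if __name__ == "__main__":
    disclosed = datetime(2026, 4, 27, 9, 0, tzinfo=timezone.utc)  # hypothetical
    print(patch_deadline(disclosed, internet_exposed=True, detailed_advisory=True))
```

Feeding this from the CISA catalog and opening a PR when a deadline is set is left out here, since both depend on the asset inventory and repository tooling in use.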

Second: add an authority_level field (tool_assist | human_gated | autonomous) and human_reviewer_id to every agentic decision record, persisted for the artifact's lifetime. The N.D. Cal. ruling makes autonomy level discoverable evidence. The field is cheap to add now and expensive to backfill after litigation. While you're there, replace per-action approval dashboards with override rate on high-risk actions, policy-violation count per 1,000 traces, and action entropy per agent. Those three carry signal. Approval rate doesn't.
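A minimal sketch of such a decision record and the three replacement metrics follows. Only authority_level and human_reviewer_id come from the text above; every other field name is a hypothetical placeholder, and the sketch assumes one record per trace.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AuthorityLevel(str, Enum):
    TOOL_ASSIST = "tool_assist"
    HUMAN_GATED = "human_gated"
    AUTONOMOUS = "autonomous"


@dataclass(frozen=True)
class DecisionRecord:
    agent_id: str                       # hypothetical field
    action: str                         # hypothetical field
    authority_level: AuthorityLevel
    human_reviewer_id: Optional[str]    # None when no human reviewed the action
    high_risk: bool = False             # hypothetical field
    overridden: bool = False            # hypothetical: a human reversed or blocked it
    policy_violation: bool = False      # hypothetical field


def override_rate_high_risk(records: list[DecisionRecord]) -> float:
    """Share of high-risk actions that a human overrode."""
    high = [r for r in records if r.high_risk]
    return sum(r.overridden for r in high) / len(high) if high else 0.0


def violations_per_1000(records: list[DecisionRecord]) -> float:
    """Policy violations per 1,000 records (one record per trace is assumed)."""
    if not records:
        return 0.0
    return 1000 * sum(r.policy_violation for r in records) / len(records)


def action_entropy_by_agent(records: list[DecisionRecord]) -> dict[str, float]:
    """Shannon entropy, in bits, of each agent's action distribution."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for r in records:
        counts[r.agent_id][r.action] += 1
    result = {}
    for agent, actions in counts.items():
        total = sum(actions.values())
        result[agent] = -sum((c / total) * math.log2(c / total) for c in actions.values())
    return result
```

Freezing the dataclass is a design choice: a record meant to serve as evidence should not be mutable after it is written.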

Third: add $/correct-answer and tokens-per-task as first-class metrics in your eval harness, alongside accuracy. The 19.5× Granite-Qwen gap and Factory AI's $1.25-beats-$3 finding are invisible without them. If you're running anything Anthropic-dependent above $1M/year with no multi-provider abstraction, the abstraction is the work item — zero discounts at $5M+ is an extraction signal, and one price hike breaks unit economics for anyone who didn't build the off-ramp.
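The two metrics in the third item can be sketched as derived properties on an eval result. The class name, field names, and the numbers in the usage example are illustrative placeholders, not measurements from any of the models named above.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    model: str
    tasks: int            # number of tasks attempted
    correct: int          # number answered correctly
    output_tokens: int    # total output tokens across all tasks
    cost_usd: float       # total spend for the run

    @property
    def accuracy(self) -> float:
        return self.correct / self.tasks

    @property
    def tokens_per_task(self) -> float:
        return self.output_tokens / self.tasks

    @property
    def usd_per_correct(self) -> float:
        # A run with no correct answers has unbounded cost per correct answer.
        return self.cost_usd / self.correct if self.correct else float("inf")


if __name__ == "__main__":
    runs = [
        EvalRun("model-a", tasks=100, correct=80, output_tokens=200_000, cost_usd=4.00),
        EvalRun("model-b", tasks=100, correct=80, output_tokens=50_000, cost_usd=1.00),
    ]
    for run in sorted(runs, key=lambda r: r.usd_per_correct):
        print(
            f"{run.model}: accuracy={run.accuracy:.0%} "
            f"tokens/task={run.tokens_per_task:.0f} "
            f"$/correct={run.usd_per_correct:.4f}"
        )
```

In this made-up example the two runs tie on accuracy, so a leaderboard would rank them equal while cost per correct answer differs fourfold.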

The week's real headline isn't any single CVE or any single product launch. It's that human-speed assumptions stopped holding on both sides of the trust boundary at the same time, and the teams that ship in Q3 are the ones writing that down on Monday.

◆ Behind the synthesis

Six specialist takes that fed this piece.

The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream — each tuned for a different reader. Use them when you want the angle that matters most to your role.

The claim making the rounds: AI agents autonomously exploited 174 of 178 CISA KEV entries this week using only publicly available models.

CVE-2026-3854 gives any authenticated user remote code execution on GitHub Enterprise Server through a single git push — 88% of GHES instances remain unpatched.

The production question is tokens per correct answer, and accuracy-only evals don't measure it: at comparable quality, Granite 4.1 8B used 19.5× fewer tokens than Qwen3.5 9B, and on Factory AI's 13-model bakeoff a $1.25/PR model held up against ones costing 2×+.

A team swapped models three times last quarter chasing a four-point eval bump and shipped nothing, because the prompts and tool wrappers were rewritten each time and nobody versioned them.

Q1 2026 earnings sorted Big Tech into two industries in a single week.

Microsoft's cloud gross margin fell 500 basis points to 56% on AI inference load, which at hyperscaler scale is the leverage working in reverse.