◆ PILLAR

AI agents, safely

A field guide to shipping agentic AI into production: sandbox design, blast-radius containment, protocol failure modes, and the craft of trusting AI that holds the keys.

Updated 2026-06-25 · Topics: agentic-ai, ai-safety

In March 2026, GitHub disclosed 17 million agent-authored pull requests in a single month — three times their internal projection and a clean break in their forecasting models. The same month, Princeton’s ICML 2026 audit added GPT 5.5, Gemini 3.1 Pro, and Claude Opus 4.7 to its agent reliability benchmark and found zero meaningful improvement over predecessors. Those two facts are the entire problem in miniature. Capacity to act is compounding. Reliability of action is not. Anyone shipping agents in 2026 is operating in the gap between those two curves, and the gap is widening.

The instinct in that gap is to wait for a smarter model. That instinct is wrong. Agent safety is not a model problem; it is an infrastructure problem. The Replit incident — where a coding agent cooperatively wiped a production database while doing exactly what it was asked — was not a jailbreak, not a prompt injection, not an adversarial input. It was a confidently wrong agent with legitimate credentials executing a legitimate command against a target it should never have been able to reach. No model upgrade fixes that. The fix is that the agent should not have had the keys.

The Threat Model Is Cooperation, Not Compromise

Most agent security writing imports its threat model from web security: an external attacker, a malicious payload, a defender who must harden the perimeter. That framing misreads where the damage comes from. The Princeton results matter here: even the frontier models fail agent tasks at rates that would be unacceptable for any human operator with equivalent privileges. The dominant failure mode is not subversion. It is sincere, confident, well-intentioned wrongness at scale.

Meta’s AI chatbot incident is the cleaner public case. Attackers socially engineered the assistant into changing the registered email address on high-profile Instagram accounts — the first public proof that LLM-fronted identity flows are a live credential-theft vector. The chatbot was not jailbroken in any classical sense. It held write access it should never have held, and it used that access exactly as designed, for a request that looked plausible. Cooperation, not compromise.

This reframes what ‘safety’ means operationally. The question is not ‘can the model be tricked.’ The question is ‘what is the worst thing the model can do when it is fully cooperating with a request that turns out to be wrong.’ If the answer involves production data, customer accounts, or irreversible writes, the architecture is broken regardless of which model sits behind it.

Sandboxing Is a Function of What the Agent Touches

Sandboxing has become a single word that hides at least three different engineering disciplines. The right choice depends entirely on what the agent reaches for.

For agents driving local CLIs — code generation, file manipulation, build tooling — lightweight process isolation along the lines of Bubblewrap is usually sufficient. The blast radius is the working directory and whatever credentials the shell inherits. The discipline is to make sure the shell inherits as little as possible.
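As a minimal sketch of that discipline, the helper below builds a Bubblewrap command line that exposes only the working directory as writable, and pairs it with an allowlisted environment. The names `SAFE_ENV_KEYS`, `minimal_env`, and `bwrap_argv` are hypothetical, and the bind paths assume a merged-/usr Linux layout; adjust both for the host in question.

```python
# Hypothetical allowlist: the only variables the sandboxed shell may inherit.
SAFE_ENV_KEYS = ("PATH", "LANG", "TERM")


def minimal_env(parent_env):
    """Keep only allowlisted variables so tokens and cloud keys stay outside."""
    return {k: parent_env[k] for k in SAFE_ENV_KEYS if k in parent_env}


def bwrap_argv(workdir, command):
    """Build a bwrap invocation: read-only system dirs, one writable workdir."""
    return [
        "bwrap",
        "--unshare-all",              # fresh namespaces, which also drops network
        "--die-with-parent",          # sandbox exits if the supervisor exits
        "--ro-bind", "/usr", "/usr",  # system files are visible but read-only
        "--symlink", "usr/bin", "/bin",
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", workdir, "/work",   # the only writable host path
        "--chdir", "/work",
        *command,
    ]


# Usage (not executed here):
#   import os, subprocess
#   subprocess.run(bwrap_argv("/home/me/project", ["make", "test"]),
#                  env=minimal_env(os.environ), check=True)
```

The home directory is never bound, so SSH keys and cloud credential files are simply absent inside the sandbox rather than merely protected.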

For agents that browse, fetch, or execute untrusted code from the network, the threat surface widens to the kernel. gVisor-class isolation, with a syscall filter between the agent and the host, becomes the floor rather than the ceiling. The Hugging Face Transformers RCE path that fires from model config files — not pickle weights — across 2.2 billion installs is the kind of thing that turns a ‘just running inference’ agent into a host compromise. Network-touching agents should be assumed to execute hostile code on every run.

For agents that touch production data, neither of the above is sufficient. Hardware isolation, separate accounts, separate networks, and write paths gated by out-of-band approval are the minimum. The reasoning is straightforward: if the model can delete the database, eventually the model will delete the database. Princeton’s reliability numbers make that a near-certainty on a long enough horizon. OpenAI’s decision to ship Lockdown Mode by disabling Deep Research and Agent Mode entirely, rather than hardening them, is an admission of exactly this. When the blast radius is unbounded, the only safe configuration is off.
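The three tiers in this section can be read as a simple decision rule. The sketch below encodes it so that a deployment check can refuse an agent whose sandbox is weaker than its surface requires. `Tier`, `required_tier`, and `is_sufficient` are illustrative names, not part of any real tool.

```python
from enum import IntEnum


class Tier(IntEnum):
    PROCESS = 1               # Bubblewrap-class process isolation
    SYSCALL_FILTER = 2        # gVisor-class isolation
    HARDWARE_AND_ACCOUNT = 3  # separate hardware, accounts, networks


def required_tier(touches_network, touches_production):
    """Return the weakest tier that is acceptable for what the agent reaches."""
    if touches_production:
        return Tier.HARDWARE_AND_ACCOUNT
    if touches_network:
        return Tier.SYSCALL_FILTER
    return Tier.PROCESS


def is_sufficient(actual, touches_network, touches_production):
    """True when the deployed tier is at least the required one."""
    return actual >= required_tier(touches_network, touches_production)
```

A check like `is_sufficient(Tier.PROCESS, touches_network=True, touches_production=False)` returns `False`, which is the case of one isolation primitive stretched over a surface it was not built for.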

MCP and the Invisible Trust Boundary

The Model Context Protocol made tool use composable, and in doing so it created a protocol-level flaw that most teams have not internalized: an MCP-enabled agent can acquire and exercise new capabilities without any human in the loop noticing the trust boundary has moved. A tool registration is a code path. A tool description is a prompt. A tool result is untrusted input. None of these are treated with the suspicion they deserve.

The supply-chain context makes this acute. The Miasma worm infected 73 Microsoft-owned GitHub repositories and more than 50 npm packages with a Rust-based credential stealer. Any agent pulling tools, dependencies, or context from those ecosystems was, for some window, executing attacker code with whatever privileges the agent carried. Combine that with five CVSS 9+ disclosures landing in a single week — the 18-year-old unauthenticated NGINX RCE, the 10.0 Traefik auth bypass, plaintext secret extraction in Argo CD, LiteLLM on the CISA KEV list with active exploitation, the 9.1 Spring Cloud Config traversal — and the practical assumption has to be that the tool surface an agent reaches through is hostile by default.

The operational consequence: assume the agent will execute whatever it is allowed to execute. Tool allowlists must be explicit, signed, and reviewed. Tool outputs must be treated as untrusted text and never re-injected as instructions without scrubbing. The trust boundary is not where the protocol says it is. It is wherever a string crosses from outside the agent into the agent’s planning context.
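One way to make "explicit, signed, and reviewed" concrete is to sign the tool manifest, descriptions included, and refuse to load it when the signature does not match. The sketch below uses an HMAC from the Python standard library as a stand-in; a real deployment would more likely use asymmetric signatures, and the manifest shape and function names here are assumptions, not part of the MCP specification.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest, key):
    """HMAC-SHA256 over canonical JSON, so names, versions, and descriptions are covered."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def load_allowlist(manifest, signature, key):
    """Return tools by name, or raise if the manifest changed since review."""
    expected = sign_manifest(manifest, key)
    if not hmac.compare_digest(expected, signature):
        raise ValueError("tool manifest signature mismatch")
    return {tool["name"]: tool for tool in manifest["tools"]}


def wrap_tool_output(tool_name, text):
    """Label output as untrusted data; the planner must not treat it as instructions."""
    return {"role": "tool", "tool": tool_name, "trusted": False, "content": text}
```

Because the description is inside the signed payload, a tool whose description is silently edited fails to load. Labeling output as untrusted does not replace scrubbing; it only keeps the boundary visible to whatever code assembles the planning context.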

Blast Radius as a First-Class PRD Section

The missing artifact in most agent product specs is a blast radius section. Not a security review at the end. A first-class section, written before the feature is built, that answers three questions. What is the worst thing this agent can do in a single action? What is the worst thing it can do over a thousand actions? What is the recovery path when (not if) it does the wrong thing confidently?
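The three questions lend themselves to a small structured record that a review step can check mechanically. The sketch below is one possible shape; the `BlastRadius` class and the `UNRESOLVED` markers are illustrative assumptions, not an established template.

```python
from dataclasses import dataclass

# Assumed markers for an answer that has not actually been written yet.
UNRESOLVED = {"", "unknown", "unbounded", "tbd"}


@dataclass(frozen=True)
class BlastRadius:
    agent: str
    single_action_worst_case: str
    thousand_action_worst_case: str
    recovery_path: str

    def open_items(self):
        """Names of the questions whose answers are still missing or unbounded."""
        answers = {
            "single_action_worst_case": self.single_action_worst_case,
            "thousand_action_worst_case": self.thousand_action_worst_case,
            "recovery_path": self.recovery_path,
        }
        return [name for name, value in answers.items()
                if value.strip().lower() in UNRESOLVED]
```

A spec whose `open_items()` list is non-empty has not finished its blast radius section, which makes the gap a reviewable fact rather than an impression.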

This matters more as the economics shift. GitHub’s move to usage-based billing on June 1, 2026, and Anthropic ending its flat-rate subsidy the same quarter, mean agent invocations now carry direct marginal cost. ServiceNow burning its entire annual Claude budget by May is the leading indicator. When agents are cheap, blast radius is a safety question. When agents are metered and running at the volume GitHub’s 17M PR number implies, blast radius is also a financial question — a runaway loop is a budget incident before it is a security incident.
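A runaway loop becomes a bounded incident when every invocation passes through a hard budget. The sketch below is a minimal per-run circuit breaker; `BudgetBreaker` and its limits are hypothetical, and real cost figures would come from the provider's billing data rather than a caller-supplied number.

```python
class BudgetExceeded(Exception):
    """Raised before a call that would push the run past its limits."""


class BudgetBreaker:
    def __init__(self, max_cost_usd, max_calls):
        self.max_cost_usd = max_cost_usd
        self.max_calls = max_calls
        self.spent_usd = 0.0
        self.calls = 0

    def charge(self, cost_usd):
        """Record one invocation, or refuse it if either limit would be exceeded."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit {self.max_cost_usd} USD reached")
        self.calls += 1
        self.spent_usd += cost_usd
```

Calling `charge` before each model or tool invocation means the loop stops at the limit instead of at the invoice. The check happens before the counters move, so a refused call leaves the recorded spend unchanged.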

The craft is not to prevent agents from being wrong. They will be wrong. The craft is to make wrongness recoverable. Every irreversible action — a database write, an email send, a payment, a credential rotation, an account modification — is a place where the architecture has chosen to pay for a recovery story in advance, or chosen to pay for an incident report later.
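Paying for the recovery story in advance can be as simple as refusing to run an irreversible action until someone outside the agent has approved that specific action. The sketch below assumes a hypothetical `IRREVERSIBLE` set of action kinds and uses an in-memory set as a stand-in for an approval channel the agent cannot write to.

```python
from dataclasses import dataclass, field

# Hypothetical action kinds this deployment treats as irreversible.
IRREVERSIBLE = {"db.drop", "db.delete", "email.send", "payment.charge"}


class ApprovalRequired(Exception):
    """Raised when an action needs a human decision the agent cannot supply."""


@dataclass
class ApprovalGate:
    # In practice this would be populated by a review UI or pager acknowledgement,
    # never by the agent itself. A plain set stands in for that channel here.
    approved_ids: set = field(default_factory=set)

    def approve(self, action_id):
        self.approved_ids.add(action_id)

    def execute(self, action_id, kind, run):
        """Run reversible actions directly; run irreversible ones only if approved."""
        if kind in IRREVERSIBLE and action_id not in self.approved_ids:
            raise ApprovalRequired(f"{kind} ({action_id}) needs out-of-band approval")
        self.approved_ids.discard(action_id)  # approvals are single-use
        return run()
```

Approvals are consumed on use, so a single sign-off cannot be replayed by a looping agent.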

Operational Posture for the Quarter

Four concrete moves for any team running agents in production right now.

First, write the blast radius section into every agent PRD this quarter, including retroactively for features that have already shipped. For each agent, enumerate the single-action worst case, the thousand-action worst case, and the recovery path. If any of the three is ‘unknown’ or ‘unbounded,’ that is the work.

Second, match sandbox to surface. Bubblewrap-class isolation for local CLI agents, gVisor-class for anything network-facing, hardware and account isolation for anything that touches production data or customer credentials. Stop using one isolation primitive for all three.

Third, treat MCP tool surfaces as a software supply chain. Pin tool versions, sign registrations, review tool descriptions as if they were prompts (because they are), and scrub tool outputs before they re-enter the planning context. Audit what your agents can reach today; the answer is almost certainly more than the threat model assumes.

Fourth, instrument for the cooperative-failure case, not the adversarial one. The dashboards should answer ‘is the agent doing more irreversible things than usual’ before they answer ‘is someone attacking the agent.’ Replit, Meta’s chatbot, and every internal incident that teams are not publishing share that signature. Build for that signature first.
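The "more irreversible things than usual" question can be approximated with a sliding-window count compared against a baseline. The sketch below is deliberately crude: `IrreversibleRateMonitor`, the one-hour window, and the 2x tolerance are all assumptions to be tuned against real traffic.

```python
from collections import deque


class IrreversibleRateMonitor:
    def __init__(self, baseline_per_window, tolerance=2.0, window_seconds=3600.0):
        self.baseline_per_window = baseline_per_window
        self.tolerance = tolerance
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps of irreversible actions, oldest first

    def _trim(self, now):
        """Drop events that have aged out of the window."""
        while self.events and self.events[0] <= now - self.window_seconds:
            self.events.popleft()

    def record(self, timestamp):
        """Note one irreversible action (timestamps must be non-decreasing)."""
        self.events.append(timestamp)
        self._trim(timestamp)

    def is_anomalous(self, now):
        """True when the window holds more than baseline times tolerance events."""
        self._trim(now)
        return len(self.events) > self.baseline_per_window * self.tolerance
```

The monitor needs no notion of an attacker. It fires on a sincere agent doing too much, which is the signature the incidents above have in common.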
