Engineer daily

Edition 2026-05-14 · read as Engineer

Shai-Hulud Wipes Systems on Token Revoke: Snapshot First

Sources: 32 · Words: 1,387 · Read: 7 min

Topics: LLM Inference · Agentic AI · Data Infrastructure

◆ The signal

Shai-Hulud now wipes infected systems the instant you revoke a stolen token, so your IR playbook's 'rotate credentials first' step triggers evidence destruction. Snapshot and network-isolate before touching any credential. Separately, Databricks published a production-proven pattern for async rate limiting that drops p99 by roughly 10x, which is the architecture to have in place before agent traffic reaches the 90/10 agent-to-human ratio Box projects within three years.

◆ INTELLIGENCE MAP

  1. 01

    Shai-Hulud Grows Anti-Forensics: IR Playbooks Must Invert

    act now

    The npm worm now monitors stolen token validity and fires destructive payloads on revocation. SOAR playbooks that auto-revoke on detection now destroy evidence. New order: snapshot → isolate network → revoke. RubyGems separately confirmed as an exfil channel (150+ malicious gems), not a payload vector.

    401 (HTTP code that triggers wipe) · 5 sources

    Timeline
    1. Nov 2025: Shai-Hulud first observed
    2. May 11: TanStack 42 pkgs compromised
    3. May 12: 400+ downstream spread
    4. May 14: Wipe-on-revoke branch ships
    5. May 14: RubyGems as exfil confirmed
  2. 02

    Rate Limiting Architecture for the Agent Era

    monitor

    Databricks replaced synchronous Redis checks with async batch-reporting, cutting p99 by 10x. Box predicts 90/10 agent-to-human traffic within 3 years. Reasoning models burn 10-100x tokens per task. These three forces compound: your current rate limiter will bottleneck legitimate agent traffic while failing to budget for token multiplication.

    10x p99 latency reduction · 4 sources

    1. Sync Redis path: 40
    2. Async batch path: 4
  3. 03

    Inference Economics Inversion: Three Shifts at Once

    monitor

    4B RLM claims Sonnet-4.6 parity at ~100x lower per-token cost. Modal reduced cold-starts from kiloseconds to tens of seconds, making scale-to-zero viable. Chinese labs serve Opus-quality at $0.43/M tokens with 50-70% margins. OpenAI deprecated finetuning. The cost model written in 2024 is wrong on every axis.

    $0.43 per M tokens (Opus-tier) · 5 sources

    1. Claude Opus4.73
    2. DeepSeek V4 Pro: 0.43
    3. GLM-5: 1
    4. 4B RLM (est.): 0.05
  4. 04

    Agent Authorization: The Unsolved Infrastructure Layer

    background

    Agents break RBAC fundamentals. Delegation chains (human→orchestrator→sub-agent→API) need first-class schema. OAuth was designed for one app, one user, one session — not a planner spawning six sub-agents. SPIFFE gets identity right but delegation wrong. GNAP is closest but 2 years from shippable. The team that ships the boring library wins.

    3 sources

    1. GNAP (RFC): Closest fit
    2. Biscuit: Delegation right, tooling wrong
    3. SPIFFE: Identity only
    4. OAuth2: Single-hop only
  5. 05

    Infrastructure Patterns: QUIC Bug, DuckDB Protocol, Discord Control Plane

    background

    Cloudflare lost 60% download throughput from a 3-line CUBIC port that measured idle from last-send instead of last-ACK. DuckDB's Quack protocol breaks the embedded-only model with HTTP client-server and concurrent writers. Discord replaced bash runbooks with idempotent YAML workflows, compressing multi-day ops to hours.

    60% throughput loss from bug · 3 sources

    1. QUIC (bugged): 40
    2. QUIC (fixed): 100
    3. Redshift RA3: 100
    4. Redshift RG: 220

◆ DEEP DIVES

  1. 01

    Rate Limiting Is the First Thing That Breaks When Agents Replace Humans

    The Architecture That Survives Agent Traffic

    Databricks published the clearest production case study for taking rate limiting off the latency budget. Old path: Envoy → Ratelimit Service → Redis. Two network hops per request. 20-40ms at p99 under contention. New path: in-memory counters with 100ms async batch reporting, server-pushed rejection instructions, zero network calls on the hot path. p99 dropped roughly 10x.

    The mechanism is the point. A synchronous shared-store limiter adds a network hop to every request it gates. Under normal load, invisible. When one tenant bursts, Redis becomes the bottleneck for everyone, including the requests you wanted to reject. Stop asking permission on every call. Report counts asynchronously.

    A synchronous check against a centralized store is fundamentally incompatible with low-latency requirements once traffic gets large. The network is the bottleneck, not the algorithm.

    Why This Matters Now: The Agent Traffic Cliff

    Box's CEO projects a 90/10 agent-to-human traffic ratio within three years. Discount the timeline if you want. The direction is confirmed across multiple sources. Limiters keyed on IP or user token degrade under agent load. One agent holding one token issues bursts that look like a DDoS from a logged-in customer. The limiter does what it was configured to do. That is the problem.

    Compounding this: reasoning models burn 10-100x the tokens of a plain completion. Personalized prompts break caching because each prompt varies per user. Multi-tenant batching breaks when every call carries user-specific context. The two escape hatches SaaS teams relied on are gone for AI workloads.

    The Databricks Design Decisions Worth Copying

    Three coupled choices made the system work:

    1. In-memory token bucket: compare-and-swap (CAS) is nearly free in memory and expensive over the network to Redis, so where the state lives constrained which algorithm was practical.
    2. Batch-reporting on a 100ms timer: spiky inbound becomes constant outbound, and the thundering herd goes away.
    3. Server-pushed rejection: "reject key X at Y% until timestamp Z" replaces per-request permission checks.

    The ~5% overshoot tolerance is explicit. Their backends already absorb slight over-limit traffic, so the tradeoff is cheap. If your rate limits enforce hard financial or security boundaries, this architecture does not apply. For everything else, which is most traffic, you are paying 20-40ms per request for exact enforcement nobody needed.
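The three choices above can be sketched in a few dozen lines. This is a minimal illustration, not Databricks' implementation: the class names `LocalLimiter` and `Aggregator` are hypothetical, a plain per-key count stands in for the token bucket to keep the example short, and thread safety and window resets are left out. It shows why overshoot is inherent: every request that arrives between two flushes is admitted before the central service can react.

```python
from collections import Counter


class LocalLimiter:
    """Runs inside the request path; never makes a network call."""

    def __init__(self):
        self.pending = Counter()   # attempts accumulated since the last flush
        self.seen = Counter()      # per-key request index, used for shedding
        self.rules = {}            # key -> (reject_fraction, until_timestamp)

    def allow(self, key, now):
        self.pending[key] += 1
        rule = self.rules.get(key)
        if rule is None:
            return True
        fraction, until = rule
        if now >= until:
            del self.rules[key]    # instruction expired
            return True
        # Deterministic shedding: reject whenever the running quota of
        # rejections (n * fraction) crosses an integer boundary.
        self.seen[key] += 1
        n = self.seen[key]
        return int(n * fraction) == int((n - 1) * fraction)

    def flush(self):
        """Called by a timer (100ms in the source); returns the batch to report."""
        batch, self.pending = dict(self.pending), Counter()
        return batch

    def apply(self, instructions):
        """Install rejection instructions pushed back by the central service."""
        self.rules.update(instructions)


class Aggregator:
    """Stand-in for the central service that sums reports across nodes."""

    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.totals = Counter()

    def report(self, batch, now):
        instructions = {}
        for key, count in batch.items():
            self.totals[key] += count
            if self.totals[key] > self.limit:
                # "reject key X at Y% until timestamp Z", here with Y = 100%
                instructions[key] = (1.0, now + self.window_s)
        return instructions
```

In use, a timer would call `limiter.apply(aggregator.report(limiter.flush(), now))` on every tick. With a limit of 10, a burst of 15 requests inside one tick is admitted in full, and the key is rejected only after the next report. That gap is the overshoot the design accepts.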


    Migration Path

    They could not modify Envoy, so they ran a localhost sidecar for the batch-reporting client. Before in-memory counters shipped, Redis Lua scripts batched writes as a bridge. Each step shipped value independently. Dependencies forced the order: sharded in-memory first, then batch-reporting to decouple the hot path, then token bucket to leverage in-memory CAS.

    Action items

    • Audit rate limiting hot path for synchronous network calls — if Redis or any shared store sits on every request path, measure actual p99 under tenant burst
    • Prototype async batch-reporting with 100ms flush interval on your highest-QPS rate-limited endpoint
    • Stress-test rate limiting at 100x current per-tenant peak to simulate agent traffic patterns
    • Design separate rate-limit tiers for agent-authenticated vs human-authenticated traffic with different ceilings and different auth mechanisms
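One way to picture the last action item is a budget that charges estimated tokens as well as request count, with separate ceilings per principal type. The sketch below is illustrative only: the tier names, the ceilings, and the `Budget` class are assumptions made for the example, and resetting the budget at each window boundary is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    requests_per_min: int
    tokens_per_min: int


# Hypothetical ceilings; real values depend on backend capacity and pricing.
TIERS = {
    "human": Tier(requests_per_min=60, tokens_per_min=50_000),
    "agent": Tier(requests_per_min=600, tokens_per_min=2_000_000),
}


class Budget:
    """Per-principal budget for one window, charged in requests and tokens."""

    def __init__(self, tier):
        self.tier = tier
        self.requests = 0
        self.tokens = 0

    def try_spend(self, estimated_tokens):
        over_requests = self.requests + 1 > self.tier.requests_per_min
        over_tokens = self.tokens + estimated_tokens > self.tier.tokens_per_min
        if over_requests or over_tokens:
            return False
        self.requests += 1
        self.tokens += estimated_tokens
        return True
```

A request-count limiter alone would admit two 40,000-token reasoning calls on the human tier without noticing. The token ceiling rejects the second one even though only one request has been spent.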

    Sources: The Redis-backed token bucket works fine until the traffic shape changes · API traffic is about to tilt nine-to-one · Inference costs are not coming down soon · TLDR Founders

  2. 02

    Inference Cost Model Is Inverting: Three Simultaneous Shifts

    Shift 1: Small RL Models Claim Frontier Parity

    A 4B-parameter Recursive Language Model claims parity with Claude Sonnet 4.6 by training a shared parent/child policy that recursively decomposes tasks. The mechanism removes inter-model latency from the standard orchestrator-plus-workers pattern. A 4B model on an A10G runs roughly 100x cheaper per token than Sonnet-class via API.

    The caveat gets equal weight: "matches" on which eval suite, at which temperature, with how many reasoning tokens. A curated math/code subset is not production agentic work. Verify on your own eval before rewriting a cost model. Parity claims holding for six months and then cracking on harder evals is a well-documented pattern.

    Shift 2: Serverless GPU Cold-Starts Collapse

    Modal cut inference cold-starts from kiloseconds to tens of seconds. Two orders of magnitude. The likely mechanism is memory snapshotting with lazy GPU state rehydration, not cold container plus cold weight load. Cold starts are the invisible tax forcing most teams onto always-on GPUs at sub-50% utilization. At 30 seconds, scale-to-zero gets viable for bursty workloads.

    The failure mode worth watching: 30s average with a p99 in the minutes. Ask for the distribution, not the mean.
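A small sketch of why the mean hides the tail. The sample data is made up for illustration (95 cold starts at 20 s and 5 at 240 s) and is not Modal's distribution; `summarize` is a hypothetical helper built on the standard library's `statistics.quantiles`.

```python
import statistics


def summarize(samples_s):
    """Return mean, p50 and p99 for a list of cold-start durations in seconds."""
    # n=100 yields 99 cut points; index 49 is the median, index 98 is p99.
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_s),
        "p50": cuts[49],
        "p99": cuts[98],
    }


# Hypothetical measurements: mostly fast starts with a slow tail.
samples = [20.0] * 95 + [240.0] * 5
stats = summarize(samples)
```

For this made-up data the mean is 31 s while p99 is 240 s, so a headline of "about 30 seconds" would be accurate and still leave one start in twenty taking four minutes.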

    Shift 3: Chinese Labs Prove the Floor

    DeepSeek V4 Pro serves near-Opus quality at $0.43 per million input tokens at 50-70% gross margin. The margin disclosure kills the "subsidized land-grab" dismissal. Z.ai claims 50% margins on GLM-5 at $1/M tokens. MiniMax reports 70% enterprise margins. The gap against US incumbents is 11-28x.

    The mechanism is straightforward. Cheaper accelerators. Denser batching. Architectures tuned for throughput over leaderboard position. Compute scarcity forced these labs to optimize every FLOP, and that discipline now shows up as a structural cost advantage in production serving.


    The Combined Implication

    Inference pricing assumed two things: frontier quality required frontier-sized weights, and serverless GPU had a cold-start floor that made bursty workloads uneconomical. Both are now contestable. Not broken. Contestable.

    OpenAI deprecated finetuning APIs this week, which confirms what the usage data already said. For 99% of use cases, a base model with a decent system prompt and three retrieved examples beats a LoRA. The top 1%, Cursor and Cognition, run RLFT on open weights where they own the training loop. Everyone in the middle using vendor finetuning as a half-measure gets neither side of the tradeoff.

    The Honest Cost Math

    Approach                      | Per-task cost       | When it wins
    Frontier API (Opus/o3)        | Baseline            | Novel reasoning, long context
    Chinese API (DeepSeek/Kimi)   | 4-9% of baseline    | Classification, extraction, RAG
    4B RLM (self-hosted)          | ~1% of baseline     | High-volume narrow tasks (unverified)
    26M Needle (routing)          | ~0.01% of baseline  | Intent classification, tool selection

    Cactus Compute's Needle (26M parameters, no FFN layers) hits 6000 tok/s prefill for tool-calling and beats models 10x its size. This is the shape of a routing layer. A tiny model picks the tool. The expensive model runs only the actual work.
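A rough sketch of what tiered routing does to blended cost, using the relative costs from the table above (taking the upper 9% bound for the cheap API tier). The route map, the task-kind labels, and the traffic mix are assumptions for illustration. In a real system the task kind would come from the small routing model rather than being passed in.

```python
from enum import Enum


class ModelTier(Enum):
    ROUTER = "router"         # Needle-class tiny model
    CHEAP_API = "cheap_api"   # DeepSeek/Kimi-class API
    FRONTIER = "frontier"     # Opus/o3-class API


# Cost relative to the frontier baseline, taken from the table above.
RELATIVE_COST = {
    ModelTier.ROUTER: 0.0001,
    ModelTier.CHEAP_API: 0.09,
    ModelTier.FRONTIER: 1.0,
}

# Hypothetical mapping from task kind to the cheapest tier that handles it.
ROUTES = {
    "intent_classification": ModelTier.ROUTER,
    "tool_selection": ModelTier.ROUTER,
    "classification": ModelTier.CHEAP_API,
    "extraction": ModelTier.CHEAP_API,
    "rag": ModelTier.CHEAP_API,
}


def route(task_kind):
    """Anything without an explicit route falls through to the frontier tier."""
    return ROUTES.get(task_kind, ModelTier.FRONTIER)


def blended_cost(mix):
    """mix maps task kind -> share of traffic; returns cost relative to all-frontier."""
    return sum(share * RELATIVE_COST[route(kind)] for kind, share in mix.items())
```

With an assumed mix of 30% tool selection, 50% extraction, and 20% novel reasoning, `blended_cost` comes out near 0.245, roughly a quarter of the all-frontier bill. The frontier share dominates what is left, which is why the routing decision matters more than squeezing the cheap tiers further.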

    Action items

    • Identify your top-3 highest-volume inference endpoints and benchmark a 4B RLM or DeepSeek V4 Pro against current Sonnet/GPT spend using production prompts
    • Evaluate Modal for any GPU workloads currently on reserved instances with sub-60% utilization
    • If using OpenAI finetuning, begin migration to longer prompts + structured examples or RLFT on open weights
    • Implement tiered model routing: Needle-class for classification, Chinese API for extraction/RAG, frontier for novel reasoning only

    Sources: AINews · TLDR AI · Azeem Azhar, Exponential View · TLDR Founders · Techpresso

  3. 03

    The QUIC/CUBIC Bug and Three Infrastructure Patterns Worth Stealing

    Cloudflare Lost 60% Throughput to a 3-Line Port

    Linux kernel CUBIC resets the congestion window when a connection goes idle, and it measures idle from the last sent packet. Cloudflare's quiche ported that logic straight into QUIC. QUIC does not ACK like TCP. After heavy loss the cwnd drops to its floor of 2 packets, and the idle check, reading from the wrong reference point, pins it there permanently. 60% of test downloads failed.

    Fix: measure from last ACK instead of last send. Three lines. The assumptions underneath are the hard part. Any QUIC stack whose congestion control descends from the Linux kernel is carrying some variant of this bug.
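The difference between the two reference points can be shown with a toy model. This is not quiche's or the kernel's code. It assumes a sender that is blocked on the window until each ACK returns, an ACK delay longer than the idle threshold (as can happen after heavy loss), and doubling as a stand-in for window growth. The function name and all parameter values are hypothetical.

```python
MIN_CWND = 2  # floor of 2 packets, as described in the text


def simulate(reference, rounds=6, ack_delay=0.3, idle_threshold=0.2):
    """Return the cwnd used for each flight.

    reference: "send" measures idle time from the last sent packet,
               "ack" measures it from the last received ACK.
    """
    cwnd = MIN_CWND
    now = last_send = last_ack = 0.0
    flights = []
    for _ in range(rounds):
        # Idle check runs just before sending the next flight.
        ref = last_send if reference == "send" else last_ack
        if now - ref > idle_threshold:
            cwnd = MIN_CWND       # treated as idle: window collapses
        flights.append(cwnd)
        last_send = now           # send one flight of cwnd packets
        now += ack_delay          # sender is window-blocked until the ACK
        last_ack = now
        cwnd *= 2                 # ACK arrived: grow (stand-in for CUBIC growth)
    return flights
```

Under these assumptions, `simulate("send")` never leaves the floor because the wait for the ACK is itself counted as idle time, while `simulate("ack")` grows every round because the ACK has just arrived when the check runs.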

    Code moves line-for-line out of a codebase where certain invariants are held by the surrounding system. The port compiles, the tests pass. The invariants are silently gone.

    Discord's Idempotent Database Control Plane

    Discord replaced fragile bash with a declarative YAML workflow engine driving ScyllaDB operations. Multi-day procedures now finish in hours. The invariant that makes it work: every task checks current state before acting rather than trusting what the previous step claims to have done. The engine can be dumb because the tasks are safe to retry.
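The check-before-act invariant is simple to sketch. The code below is not Discord's engine: `Task`, `run`, and the node-draining example are hypothetical, and the YAML layer is left out. It shows only the property that makes retries safe, which is that each task inspects live state instead of trusting a record of earlier steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    is_done: Callable[[], bool]   # inspects current state, never a log
    apply: Callable[[], None]     # moves the system toward the desired state


def run(tasks: List[Task]) -> List[str]:
    """Run tasks in order, skipping any whose goal state already holds."""
    executed = []
    for task in tasks:
        if task.is_done():
            continue              # safe to re-run the whole workflow
        task.apply()
        if not task.is_done():
            raise RuntimeError(f"{task.name} did not reach its goal state")
        executed.append(task.name)
    return executed


# Hypothetical example: drain a node, then mark it for maintenance.
cluster = {"node-3": "up", "maintenance": False}
workflow = [
    Task("drain", lambda: cluster["node-3"] == "drained",
         lambda: cluster.update({"node-3": "drained"})),
    Task("flag", lambda: cluster["maintenance"],
         lambda: cluster.update({"maintenance": True})),
]
```

Running the workflow a second time executes nothing, and a run interrupted after the first task resumes at the second one. That is the property that lets the engine itself stay simple.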

    The pattern travels. If a runbook contains the phrase "carefully verify" or "wait until you see," that runbook is the candidate. Speed is not the prize. The prize is that the procedure no longer requires a specific senior engineer awake at 3am.

    DuckDB Breaks the Embedded-Only Model

    DuckDB's new Quack protocol adds HTTP client-server with multiple concurrent writers and token-based auth. Integration is planned with DuckLake and DuckDB 2.0. For a shared analytical backend fronting several services, the math changes. For a single-service analytical workload, embedded is still the right answer.

    Redshift RG: AWS Pulls Query Layer Back In-House

    Redshift RG runs 2.2x faster than RA3, costs 30% less per vCPU, drops the $5/TB Spectrum scanning fee, and executes Iceberg queries 2.4x faster. No code changes required. The strategic read: AWS is removing the cost incentive to route data lake queries through Athena or an external engine. If queries are currently split between Redshift and Athena for cost reasons, that split may collapse back to a single engine.


    The Common Thread

    All four items share one shape: assumptions imported from one context silently fail in another. TCP idle semantics in QUIC. Human-speed pacing in runbooks. Single-process assumptions in analytics. Per-scan pricing on data lake queries. Each was a reasonable inheritance that nobody reexamined until the cost landed on a dashboard.

    Action items

    • Audit any custom QUIC implementation for CUBIC idle-time measurement. Specifically, check whether the idle check that resets the congestion window measures 'time since last send' or 'time since last ACK'
    • Identify your top 3 most dangerous manual operational runbooks and prototype an idempotent control plane following Discord's pattern
    • If running Redshift RA3 with Spectrum queries, benchmark RG instance migration — particularly data lake queries currently routed through Athena
    • Evaluate DuckDB Quack for lightweight shared analytical backends currently coordinated via shared storage

    Sources: CUBIC was written for the Linux kernel · QUIC's idle-period death spiral · TLDR DevOps · TLDR Dev

◆ QUICK HITS

  • Update: Shai-Hulud wipe-on-revoke — any SOAR automation that auto-revokes npm tokens on detection is now destroying forensic evidence. Add snapshot+isolate steps before credential rotation in all IR playbooks.

    Shai-Hulud added a destructive branch

  • Update: RubyGems suspended signups after 150+ malicious gems used the registry as a data exfiltration channel — attackers embed stolen data inside gem packages, making rubygems.org itself the C2 endpoint.

    The Hacker News

  • Cactus Needle: 26M-param model with no FFN layers hits 6000 tok/s prefill for tool-calling — MIT licensed, trained on 16 TPU v6e in 27 hours. Evaluate as a routing/intent-classification layer before burning frontier tokens on selection.

    AINews

  • DeepSeek V4 extends context to 1M tokens on 6GB memory — if confirmed under honest eval, KV cache scaling assumptions for self-hosted inference are wildly pessimistic. Reproduce on a test box before committing a roadmap.

    Techpresso

  • TypeScript compiler insight from Hejlsberg: AI coding tool quality tracks training data volume per language, not language design — niche stacks now carry a compounding tooling tax every quarter.

    The Pragmatic Engineer

  • Grep beats standard RAG pipelines on multiple corpus search benchmarks — no embeddings, no vector index, no chunking. Worth benchmarking against your production retrieval pipeline at the millions-of-documents scale most teams actually run.

    Techpresso

  • k6 2.0 ships native OpenTelemetry output and stable Kubernetes Operator — load test results landing in the same traces as production telemetry is the real upgrade, not the operator.

    TLDR DevOps

  • Exim CVE-2026-45185: unauthenticated RCE on GnuTLS builds — Debian/Ubuntu default to GnuTLS for Exim. Run `exim -bV | grep TLS` on all mail hosts today. Disable BDAT as stopgap if patch unavailable.

    Risky.Biz

  • Datadog telemetry across 1,000+ orgs: prompt caching remains widely underutilized despite being a config change worth 50-90% cost reduction on repeated prefixes. Log prefix hashes, measure hit rate, flip the switch.

    TLDR

  • AI detection proven architecturally broken: 1-in-100,000 FPR requires defaulting uncertain cases to 'human,' which means 90% AI content with 10% human edits passes every time. Do not build business logic gates on binary detection APIs.

    Alberto Romero from The Algorithmic Bridge

◆ Bottom line

The take.

Your incident response playbook's 'revoke credentials first' step now triggers evidence destruction on Shai-Hulud-infected systems. Invert the order to snapshot, isolate, then revoke before the next compromise. Meanwhile, the infrastructure assumptions underneath your production stack are quietly expiring: synchronous rate limiters break under agent traffic patterns, inference cost models built on frontier-only pricing are 11-28x wrong for half your workload, and a 3-line kernel-to-QUIC porting bug cost Cloudflare 60% throughput. Audit the assumptions your architecture inherited, not just the code it runs.

Promit, reading as Engineer

Frequently asked

Why does revoking a stolen token make a Shai-Hulud incident worse?
The current Shai-Hulud variant treats credential revocation as a kill signal and wipes the infected host the moment a stolen token is rotated. That destroys forensic evidence and any chance of tracing lateral movement. Snapshot the disk and network-isolate the system before touching credentials, then rotate from outside the blast radius.
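A minimal sketch of enforcing that order in automation, so a playbook cannot revoke before evidence is preserved. The `Incident` class and the step names are hypothetical, and the actions passed in are placeholders for whatever snapshot, isolation, and revocation tooling is actually in use.

```python
from enum import Enum, auto


class Step(Enum):
    SNAPSHOT = auto()
    ISOLATE = auto()
    REVOKE = auto()


REQUIRED_ORDER = [Step.SNAPSHOT, Step.ISOLATE, Step.REVOKE]


class Incident:
    """Tracks one host and refuses to run response steps out of order."""

    def __init__(self, host):
        self.host = host
        self.completed = []

    def perform(self, step, action):
        if len(self.completed) == len(REQUIRED_ORDER):
            raise RuntimeError("all steps already completed")
        expected = REQUIRED_ORDER[len(self.completed)]
        if step is not expected:
            raise RuntimeError(
                f"{step.name} requested but {expected.name} must run first"
            )
        action(self.host)          # placeholder for the real tooling call
        self.completed.append(step)
```

Calling `perform(Step.REVOKE, ...)` on a fresh incident raises before the action runs, which is the behavior an auto-revoking SOAR rule needs in order to stop destroying evidence.
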
How does Databricks' async rate limiter actually achieve a 10x p99 drop?
It removes the network hop from the hot path. Instead of Envoy calling a Ratelimit Service backed by Redis on every request, each node keeps an in-memory token bucket, batch-reports counts every 100ms, and receives server-pushed rejection instructions like 'reject key X at Y% until Z.' The tradeoff is roughly 5% overshoot, which is fine unless your limits enforce hard financial or security boundaries.
Why will rate limiters keyed on user token break under agent traffic?
A single agent holding a single user token issues bursts that look identical to a DDoS from a logged-in customer, and the limiter throttles exactly as configured. Combined with reasoning models burning 10-100x more tokens and per-user prompts breaking cache and batch reuse, the cost and latency assumptions baked into human-rate limits no longer hold. Separate tiers and ceilings for agent-authenticated traffic are the fix.
Is the 4B Recursive Language Model claim actually safe to build a cost model on?
Not yet. The parity claim against Sonnet 4.6 is on a curated eval subset, and parity claims commonly hold for a few months before cracking on harder workloads. Treat it as a prompt to benchmark on your own production traffic, not as a procurement decision. If even half the claim survives contact, the cost delta against frontier APIs is still roughly 100x on narrow high-volume paths.
What's the common failure mode behind the QUIC CUBIC bug worth watching for elsewhere?
Code ported line-for-line from one system silently loses the invariants held by the surrounding system. CUBIC's idle check measured time since last sent packet, which works in TCP because of regular ACKs but pins QUIC's congestion window at the floor of 2 packets after loss. Any QUIC stack derived from Linux kernel congestion control likely carries a variant of this bug; the fix is measuring from last ACK instead.
