Edition 2026-05-12 · read as Data Science

LiteLLM, Ollama, HuggingFace Hit by Active Exploits

Sources: 39
Words: 1,282
Read: 6 min

Topics: LLM Inference · Agentic AI · AI Regulation

◆ The signal

Three ML infrastructure vectors are under simultaneous active exploitation this week: LiteLLM's unauthenticated SQLi (CVE-2026-42208) dumping routing configs and API keys, Ollama's OOB memory read exposing in-flight prompts and secrets to any network caller, and a 244K-download credential-stealing repo that rode HuggingFace's trending algorithm to #1. Your model proxy, inference server, and weight registry all need audit today—not because the attack classes are novel, but because all three are confirmed active and target the exact glue layer most DS teams treat as trusted.

Key facts

LiteLLM CVE-2026-42208 enables unauthenticated SQL injection via a crafted Authorization header, exposing routing configs and API keys for OpenAI, Anthropic, and Bedrock.
An unauthenticated out-of-bounds read in Ollama on port 11434 lets any network caller read process memory, including in-flight prompts, RAG context, and environment secrets.
A Rust-based Windows infostealer impersonating OpenAI's 'Privacy Filter' reached #1 trending on HuggingFace with 244,000 downloads before takedown.
Anthropic committed $1.8B to Akamai for edge inference on RTX Pro GPUs and an estimated $200B to Google Cloud for training and bulk inference.
Wix's 250-eval study found agent-optimized docs outperform custom skills when skills go stale, while the SkCC paper showed prompt formatting alone swings agent performance by 40%.

◆ INTELLIGENCE MAP

01
ML Supply Chain Under Active Multi-Vector Attack
act now
Three simultaneous active exploits hit ML infrastructure glue: LiteLLM SQLi (CVE-2026-42208, Tencent-found, exploitation live), Ollama unauthenticated memory dump (port 11434), and HuggingFace's #1 trending repo delivering a Rust infostealer with 244K pulls. More than one-third of community agent skills also carry vulnerabilities, per the SkCC paper.
244K malware downloads · 6 sources
Chart values (as plotted): HF malware downloads 244,000 · Exposed AI services 1,000,000 · LiteLLM patch lag 30
02
Agent Architecture Converges: 'Dumb Loop, Smart Harness'
monitor
Eight sources independently validate the same pattern: single-threaded agent loops with intelligence in surrounding layers. Claude Code's 6-layer harness, Wix's 250 evals showing docs beat skills, SkCC proving 40% formatting variance, and TLDR AI finding memory rewrites degrade below no-memory baselines all point to the harness—not the model—as the production lever.
40% formatting variance · 8 sources
Chart values (as plotted): Prompt format variance 40 · Cache hit savings 90 · Context compression 95
03
Inference Bifurcates: Edge Latency vs. Batch Cost
monitor
Anthropic signed a 7-year $1.8B deal with Akamai for RTX Pro edge inference while also committing ~$200B to Google for training. Ben Thompson's thesis: answer inference rewards speed (Cerebras/Groq), agentic inference rewards $/token on commodity DRAM. Nvidia shipped Dynamo to disaggregate prefill from decode. The single-serving-stack era is ending.
$1.8B edge inference deal · 3 sources
Chart values (as plotted): Edge (RTX Pro) 1.8 · Hyperscale (Google) 200
04
Distributed Training: DiLoCo Breaks Single-Cluster Assumption
background
Google's Decoupled DiLoCo achieves 88% goodput under failure simulation vs. 58% for elastic data-parallel, matching synchronous training quality up to 9B parameters. A 12B model trained across 4 US regions on commodity 2-5 Gbps internet. Multi-region training moved from research curiosity to a legitimate capacity-planning option.
88% goodput under failures · 1 source
Chart values (as plotted): Decoupled DiLoCo 88 · Elastic data-parallel 58
05
Factuality & Eval Harness: Three New Blind Spots
monitor
Strict factuality gates discard 52% of valid answers (Google/Tel Aviv). AI-generated training data halves truthfulness (0.366→0.187). A/B lift predictions are biased 2.7pp high regardless of seniority. All three are measurement-layer problems that compound silently—and none is caught by standard eval suites.
52% valid answers discarded · 4 sources
Chart values (as plotted): Valid answers discarded 52 · Truthfulness degradation 49 · A/B lift overestimate 56

◆ DEEP DIVES

ML Supply Chain Is Under Simultaneous Active Attack — Patch, Pin, and Firewall Today

The Glue Layer Is the Target

The shared feature across this week's disclosures is that all of them hit the ML plumbing between models and production, not the models themselves. Three active exploits landed at once:

LiteLLM (CVE-2026-42208): Unauthenticated SQL injection via a crafted Authorization header. Tencent found it, the patch shipped in April, live exploitation was confirmed in May. LiteLLM tends to hold routing configs, team quotas, prompt caches, and API keys for OpenAI/Anthropic/Bedrock. Compromise means owning the LLM cost center and the prompt IP.
Ollama OOB Read: Critical unauthenticated out-of-bounds read on port 11434. Any network caller can read process memory, which in practice includes in-flight prompts, RAG context (often PII), environment-variable secrets, and model weights. No CVE assigned yet. For this class of bug, a weaponized PoC typically follows within days of disclosure.
HuggingFace Trending Malware: A repo impersonating OpenAI's "Privacy Filter" reached #1 trending with 244K downloads before takedown. Payload was a Rust-based Windows infostealer with anti-analysis evasion, targeting HF tokens, W&B keys, and cloud credentials.
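A quick way to test the "any network caller" condition for Ollama is a plain TCP reachability check against port 11434. The sketch below is a minimal illustration, not a vulnerability scanner: it only reports whether the port accepts connections from where you run it, and the helper names (`is_port_open`, `find_exposed`) are made up for this example.

```python
import socket
from typing import Iterable, List

OLLAMA_PORT = 11434  # port named in the advisory above


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as not reachable from here.
        return False


def find_exposed(hosts: Iterable[str], port: int = OLLAMA_PORT) -> List[str]:
    """Return the subset of hosts that accept TCP connections on the port."""
    return [host for host in hosts if is_port_open(host, port)]
```

To sweep a subnet, feed it addresses from the standard library, for example `find_exposed(str(ip) for ip in ipaddress.ip_network("10.0.0.0/24").hosts())`, where the CIDR is a placeholder for your own range. A reachable port says nothing about patch level, so anything it finds still needs the version check and firewall review.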

Why This Week Is Different

Individual vectors here are not new. What is new is simultaneous exploitation across three layers, each of which invalidates a different assumption ML teams have been quietly relying on:

Vector	Broken Assumption	Pre-Auth?	Fix Effort
LiteLLM SQLi	"Our LLM proxy is internal"	Yes	Hours (patch + firewall)
Ollama memory leak	"Localhost inference is safe"	Yes	Hours (patch + auth gateway)
HF trending malware	"Trending = trusted"	N/A (user-initiated)	Hours (audit + rotate)

The SkCC paper adds a useful prior: more than one-third of community-contributed agent skills carry vulnerabilities. If agents import skills from shared registries, the deployment inherits a 33%+ vulnerable-skill base rate. What the figure does not tell you is which skills are affected, and that is the information you would need to prioritize.

Trending rank, localhost binding, and proxy auth headers are all being treated as trust boundaries by teams whose threat models were written before any of these were load-bearing. None of the three is holding up this week.

The Cross-Source Pattern

Six independent sources flagged some subset of these vectors this week. The convergence is the signal. The attack surface has been probed across model registries, inference endpoints, and routing proxies in the same window. Whether coordinated or coincidental does not change the defensive response, which is the same either way.

Action items

Scan internal networks for Ollama port 11434 and LiteLLM instances; patch both and restrict to VPC-only access before end of day
Grep HF/pip cache logs for 'Privacy Filter' or 'Open-OSS' downloads; quarantine affected machines and rotate all tokens (HF, W&B, AWS, GCP, OpenAI)
Enforce SHA pinning on all HuggingFace model pulls in CI pipelines this sprint
Add a security gate to agent skill/tool imports: static analysis + sandboxed eval before production promotion
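For the SHA-pinning item, one possible CI gate is a check that every model pull names a full commit hash rather than a branch or tag. The sketch below assumes your pipeline can produce a mapping of repo id to requested revision; that mapping and the function names are hypothetical. It relies on the fact that `huggingface_hub` download calls accept a `revision` argument, which can be a 40-character commit hash.

```python
import re
from typing import Dict, List, Optional

# A full git commit hash: 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned(revision: Optional[str]) -> bool:
    """True only when revision is a full 40-character commit hash."""
    return revision is not None and _FULL_SHA.fullmatch(revision) is not None


def unpinned_pulls(pulls: Dict[str, Optional[str]]) -> List[str]:
    """Return repo ids whose revision is missing, a branch name, or a tag."""
    return sorted(repo for repo, rev in pulls.items() if not is_pinned(rev))
```

Failing the build when `unpinned_pulls(...)` is non-empty turns the policy into something enforceable. Short hashes are rejected on purpose, since they are ambiguous as a pin.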

Sources: The Hacker News · Risky.Biz · TLDR IT · Chain of Thought · CSO Security Leadership · TLDR InfoSec

Agent Architecture Convergence: The Harness Is the Product, the Model Is the Loop

Eight Sources, One Pattern

Eight independent sources this week described the same emerging architecture from different angles, among them Anthropic's Claude Code, Wix's 250-eval study, the SkCC paper, Pinterest's MCP deployment, and separate findings on memory degradation. They converge on one thesis: the model is the least interesting part of a production agent. The harness around it (context management, tool registries, prompt structure, memory strategy) is what moves the metric you actually care about.

The Architecture, Consolidated

Claude Code runs a "dumb loop, smart harness" pattern: a single-threaded master loop (assemble context → call model → execute tool → feed result back), with intelligence pushed into six surrounding layers. Three of those techniques transfer cleanly to other agent stacks:

Structured-extraction context compression at ~95% window capacity. Not summarization, but extraction of typed artifacts (file paths, stack traces, diffs). Summarization drops the deterministic tokens the model needs to resume work, and that is the failure mode you will hit in the second week of production.
Prompt cache reuse at ~10% of original cost. The architectural prerequisite is a stable prefix: system prompt, tool schemas, and repo context byte-stable at the head; user turn and tool results at the tail. Without the discipline, the cache hit rate is whatever the calling code happens to produce.
Git worktree isolation for parallel subagent writes. Cheap, kills the race conditions, and gives you a natural human-review gate on dirty worktrees.
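The cache-reuse economics are simple enough to check by hand. The sketch below assumes the hit rate is measured over input tokens and that cached tokens bill at 10% of the normal rate, as described above; it ignores any cache-write premium a provider may charge, and the function names are illustrative.

```python
def cached_spend_fraction(hit_rate: float, cached_cost_ratio: float = 0.10) -> float:
    """Fraction of the uncached token bill still paid at a given cache hit rate.

    Assumes hit_rate is token-weighted and cached tokens cost
    cached_cost_ratio times the normal price (10% by default).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_cost_ratio + (1.0 - hit_rate)


def token_spend_reduction(hit_rate: float, cached_cost_ratio: float = 0.10) -> float:
    """Relative saving versus paying full price for every token."""
    return 1.0 - cached_spend_fraction(hit_rate, cached_cost_ratio)
```

At a 60% hit rate this gives 0.6 × 0.9 = 0.54, so about a 54% reduction in token spend. The saving scales linearly with hit rate, which is why prefix discipline matters more than the discount itself.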

Supporting Evidence From Other Sources

Finding	Source	Implication
Docs beat custom skills when skills go stale	Wix (250 evals)	Default to machine-readable docs; promote to skill only with contract test
Prompt formatting alone swings performance 40%	SkCC paper	Cross-framework benchmarks without formatting ablation are noise
Memory rewrites degrade below no-memory baseline	TLDR AI	Summarization-based memory is an anti-pattern; keep episodic, abstract sparingly
66K invocations/mo via MCP registry	Pinterest	Runtime tool discovery + scoped auth > static tool lists in prompts
CI wall-clock is the binding constraint	Notion/Lenny's	Cut eval loop to 25% before upgrading the model

The Contradiction Worth Surfacing

The SkCC paper attributes 40% variance to formatting. Wix reports docs beating skills. These are in tension. If formatting swings results that much, and docs are themselves format-sensitive prose, the Wix result may be measuring how well Wix's docs happen to be formatted, not an inherent advantage of docs over skills. The study shows a correlation; it does not establish causation. The cleanest read is to ablate formatting on both docs and skills on your own stack before committing either way.
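One rough way to read such an ablation is to compare the docs-vs-skills gap against the spread that formatting alone produces within each arm. The sketch below is a crude heuristic under stated assumptions, not a significance test: scores are task success rates per formatting variant, and the decision rule (gap must exceed the larger within-arm spread) is an illustrative choice rather than anything from Wix or the SkCC paper.

```python
from statistics import mean
from typing import Dict


def summarize(scores_by_variant: Dict[str, float]) -> Dict[str, float]:
    """Mean and max-minus-min spread of success rates across formatting variants."""
    values = list(scores_by_variant.values())
    return {"mean": mean(values), "spread": max(values) - min(values)}


def delta_exceeds_formatting_noise(
    docs: Dict[str, float], skills: Dict[str, float]
) -> bool:
    """True if the docs-vs-skills gap is larger than either arm's formatting spread."""
    d, s = summarize(docs), summarize(skills)
    return abs(d["mean"] - s["mean"]) > max(d["spread"], s["spread"])
```

If this returns False, the honest conclusion is that the comparison cannot separate format type from formatting quality on your eval set. With enough tasks per variant, a proper paired test is the better tool.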

The agent reliability ceiling is tool-definition staleness and harness design, not model quality. Six layers, maybe three load-bearing outside Anthropic's stack—steal those three.

Action items

Run a Wix-style head-to-head on your own eval set: agent-optimized docs vs. custom skills across 50+ tasks, measuring token cost, latency, and task success
Add a formatting-invariance test to your agent eval harness: run each eval with ≥3 formatting variants and report variance alongside mean
Add a no-memory baseline to agent evals and compare against current memory-rewrite pipeline on task success rate
Audit prompt structure for prefix stability: pin system prompt + tool schemas to head, push volatile content to tail, and measure cache hit rate
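For the prefix-stability audit, a cheap first pass is to fingerprint the exact prefix each request sends and see how concentrated the fingerprints are. The sketch below assumes you can log the serialized head of each prompt (system prompt plus tool schemas, byte for byte as sent). The resulting share is only a proxy for achievable cache hit rate, since real provider caching rules differ, and the function names are illustrative.

```python
import hashlib
from collections import Counter
from typing import Iterable


def prefix_fingerprint(serialized_prefix: str) -> str:
    """Hash the prefix exactly as sent; any byte-level change alters the hash."""
    return hashlib.sha256(serialized_prefix.encode("utf-8")).hexdigest()


def dominant_prefix_share(fingerprints: Iterable[str]) -> float:
    """Share of requests that use the single most common prefix (0.0 if none)."""
    counts = Counter(fingerprints)
    if not counts:
        return 0.0
    top_count = counts.most_common(1)[0][1]
    return top_count / sum(counts.values())
```

The prefix is hashed without normalization on purpose. A timestamp in the system prompt or a reordered tool schema changes the bytes, and that is the kind of instability the audit is meant to surface.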

Sources: TLDR Data · Daily Dose of DS · ByteByteGo · Turing Post · Chain of Thought · TLDR AI

Inference Bifurcation: $1.8B Says Your Serving Stack Needs Two Tiers

The Bet, Made Explicit

Anthropic committed $1.8B to Akamai for edge inference on RTX Pro GPUs at CDN POPs and an estimated $200B to Google Cloud for training and bulk inference in the same window. Read this as architectural conviction, not hedging: two workload classes need two serving substrates.

Answer inference (user-facing chat, IDE completions): latency-bound, rewards raw speed. Cerebras WSE-3 reports 21 PB/s memory bandwidth against H100's 3.35 TB/s, a ratio of roughly 6,000x (21,000 / 3.35 ≈ 6,270). That ratio is an upper bound on speedup for the memory-bound decode step, not a measured gain. It also does not tell you how much of real chat traffic is actually decode-bound once you account for prefill and routing overhead.
Agentic inference (autonomous loops, batch eval, multi-step reasoning): cost-bound, rewards $/token on commodity DRAM. No human is waiting. The binding constraint is KV cache capacity and tool-call latency, not time-to-first-token.
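The bandwidth comparison in the first item can be made concrete with a back-of-envelope calculation. The sketch below uses the two bandwidth figures quoted above and a common simplification: single-stream decode on a dense model reads every weight once per token, so bandwidth divided by weight bytes bounds tokens per second. KV cache reads, batching, and on-chip capacity limits are ignored, and the 70 GB weight footprint in the note after the code is a hypothetical example.

```python
WSE3_BANDWIDTH = 21e15    # bytes/s, 21 PB/s as quoted above
H100_BANDWIDTH = 3.35e12  # bytes/s, 3.35 TB/s as quoted above


def bandwidth_ratio(fast: float, slow: float) -> float:
    """How many times more memory bandwidth the faster part has."""
    return fast / slow


def decode_tokens_per_sec_ceiling(bandwidth: float, weight_bytes: float) -> float:
    """Upper bound on single-stream decode speed if each token reads all weights once."""
    return bandwidth / weight_bytes
```

`bandwidth_ratio(WSE3_BANDWIDTH, H100_BANDWIDTH)` comes out near 6,270. For a hypothetical 70 GB weight footprint, the H100 ceiling is about 48 tokens per second per stream, and real throughput lands below that once prefill and routing overhead are counted.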

Hardware Economics by Regime

Dimension	Edge (Akamai/RTX Pro)	Hyperscale (H100/B200)	Agentic-Optimized (DRAM-heavy)
Target workload	Sub-200ms TTFT, short context	Training + current default	Long-horizon agents, batch
Memory capacity	48-96GB VRAM	80GB (H100) up to 192GB (B200) HBM per chip	TBs of DRAM (off-chip)
Cost driver	Network proximity wins	HBM + advanced packaging	Commodity DRAM, older nodes
Model size sweet spot	Distilled/quantized <30B	>70B dense, MoE, long context	Any (memory-resident)

Nvidia's Dynamo framework formalizes the split by disaggregating prefill from decode onto different hardware profiles. Reported gain: 2-4x goodput on long-context workloads. That range assumes clean request-shape distributions. On mixed production traffic, plan for the low end and treat anything above it as upside.

What This Changes

Ben Thompson's claim: by year-end, teams running more than 10K requests/day will run two serving stacks, and the ones that don't will pay 30-50% more per token than necessary. I would commit to that directionally. Half of 30-50% is still worth the engineering. If the penalty lands at 10%, dual-stack does not pay for itself. The SpaceX handoff of 220K+ GPUs (Colossus 1) to Anthropic suggests training-class clusters are becoming fungible with inference, which puts previously training-allocated capacity into the inference market.

The geopolitical footnote: if agentic inference runs well on older-node silicon plus commodity DRAM, export controls lose most of their bite for the largest compute market.

If agents don't need to be fast, the inference bill shouldn't look like it does. The stack that served ChatGPT will not serve a million autonomous Claude Code sessions at the same unit economics.

Action items

Segment inference traffic into latency-SLA-bound and throughput-bound tiers; create separate $/request and $/completed-task dashboards for each
Benchmark top 2 agent workloads on disaggregated prefill/decode setup (vLLM + Dynamo-style split) against current co-located baseline
Add tool-call latency and CPU-bound fraction to agent observability dashboards this sprint
Delay long-term inference capacity commits until Q3 2026 pricing settles post-Colossus reallocation
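The two dashboards in the first action item differ only in the denominator, which is easy to see in code. The sketch below assumes each logged request carries a workload class, a task id, a cost, and a flag on the request that completes the task; the record shape and field names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class RequestRecord:
    workload_class: str   # e.g. "latency" or "throughput" (hypothetical labels)
    task_id: str
    cost_usd: float
    completed_task: bool  # True on the request that finishes its task


def unit_costs(records: Iterable[RequestRecord]) -> Dict[str, Dict[str, Optional[float]]]:
    """Per workload class: dollars per request and dollars per completed task."""
    cost = defaultdict(float)
    requests = defaultdict(int)
    completed = defaultdict(set)
    for r in records:
        cost[r.workload_class] += r.cost_usd
        requests[r.workload_class] += 1
        if r.completed_task:
            completed[r.workload_class].add(r.task_id)
    return {
        cls: {
            "usd_per_request": cost[cls] / requests[cls],
            # Failed and abandoned tasks still count toward cost, not the denominator.
            "usd_per_completed_task": (
                cost[cls] / len(completed[cls]) if completed[cls] else None
            ),
        }
        for cls in cost
    }
```

For agent workloads the two numbers diverge quickly, because every retry and abandoned task raises cost per completed task while leaving cost per request unchanged.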

Sources: Ben Thompson · The Information AM · Bloomberg Technology

◆ QUICK HITS

Google's Decoupled DiLoCo achieves 88% goodput (vs 58% elastic DP) training a 12B model across 4 US regions on 2-5 Gbps commodity internet—spike it before provisioning new co-located clusters
Import AI
Update on the Mythos methodology comparison: 271 Firefox bugs at ~0% FP uses an executable oracle (sanitizer crash = truth), while Vercel's open-source deepsec at 10-20% FP uses LLM-as-judge; the FP gap is architectural, not model-driven
TLDR InfoSec
Factuality gates discard 52% of valid answers (Google/Tel Aviv); 'faithful uncertainty' scoring recovers ~half without increasing hallucination—replace binary abstention with calibrated selective risk curves
Turing Post
Fine-tuning on AI-generated social content drops truthfulness from 0.366 to 0.187 (~49% relative hit); human Reddit data causes comparable damage—both are anti-epistemic training signals
Techpresso
On-Policy Distillation reportedly beats teacher models by sampling from the student's own rollouts; SFT pulls toward external distributions and induces forgetting—spike on next domain fine-tune
TLDR AI
A/B lift predictions are systematically biased 2.7pp high across 1,391 forecasts; 10+ years experience shows no edge—only familiar-test retrieval (3.2→2.5pp) and small-group aggregation (→2.3pp) help
TLDR Marketing
LLM agents can re-identify individuals from k-anonymized datasets at pennies-per-record cost—run a red-team re-identification test on your highest-risk data export before a regulator does
MIT Technology Review
LinkedIn collapsing feed, jobs, and ads into a single sequence transformer over user action histories at 1.8M views/minute—unified multi-task rankers are now the production pattern, not research
MarketingShot
Notion cutting CI to 25% of current runtime to unblock AI coding agents—if your eval harness median exceeds 5 minutes, model upgrades are the second-best investment after shortening the oracle
Lenny's Newsletter
EMO (Allen AI) preserves near-full quality routing to just 12.5% of MoE experts—an inference-cost story for self-hosted serving, not a training story
TLDR AI
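On the factuality-gates item above: a selective risk curve, in its generic form, plots error rate against the share of questions the system chooses to answer. The sketch below is the standard risk-coverage construction, not the "faithful uncertainty" scoring from the cited work, and it assumes you have a confidence score and a correctness label for each answer.

```python
from typing import List, Sequence, Tuple


def risk_coverage_curve(
    confidences: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, risk) points, answering the most confident items first.

    coverage = fraction of items answered; risk = error rate among those answered.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve: List[Tuple[float, float]] = []
    errors = 0
    for answered, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        curve.append((answered / len(order), errors / answered))
    return curve
```

A binary abstention gate is a single point on this curve. Plotting the whole curve shows how much coverage each extra point of risk buys, which is the trade-off a strict gate hides.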

◆ Bottom line

The take.

Your ML infrastructure is under simultaneous active exploitation across three vectors (LiteLLM, Ollama, HuggingFace) while eight independent sources converged on the same architectural conclusion: the agent harness—not the model—determines production outcomes. The pattern that wins is a dumb single-threaded loop surrounded by structured extraction, stable-prefix caching, formatting-invariant evals, and a no-memory baseline that proves the memory layer isn't making things worse. If your harness doesn't ablate formatting and your model registry doesn't pin by SHA, you're shipping a supply chain with a 33% vulnerability rate atop an eval that's measuring noise.

Frequently asked

What's the fastest way to check if my LiteLLM or Ollama instances are exposed?
Scan internal networks for port 11434 (Ollama) and your LiteLLM endpoints, then verify they're patched and restricted to VPC-only access. Both vulnerabilities are pre-auth exploitable, so any reachable instance should be treated as potentially compromised until its credentials are rotated. Check egress logs for unexpected outbound traffic to identify exfiltration in progress.

How do I tell if our team downloaded the malicious 'Privacy Filter' HuggingFace repo?
Grep HuggingFace and pip cache logs across dev machines and CI runners for 'Privacy Filter' or 'Open-OSS' references, and check HF download history for any repo pulled near its #1 trending peak. The payload is a Rust-based Windows infostealer with anti-analysis evasion, so standard AV signatures are unreliable. Quarantine any matches and rotate HF, W&B, AWS, GCP, and OpenAI tokens immediately.

Why should I care about prompt prefix stability for agent harnesses?
Stable prefixes unlock prompt cache reuse at roughly 10% of original token cost, which at a 60% hit rate yields about a 54% token-spend reduction (0.6 × 0.9), often beating any model swap on $/request. The discipline is to pin system prompt and tool schemas to the head and push volatile user turns and tool results to the tail. Without that structure, cache hit rate is whatever the calling code accidentally produces.

Should I migrate agent workloads off H100-class hardware?
Probably segment first, then migrate selectively. Agentic workloads are cost-bound rather than latency-bound, so they reward KV cache capacity and commodity DRAM over HBM bandwidth, but the savings only materialize once you have separate $/request and $/completed-task dashboards by workload class. Benchmark a disaggregated prefill/decode setup on your top two agent workloads before committing capacity.

Is the claim that 'docs beat custom skills' actually reliable?
It's directionally useful but contested. Wix's 250-eval study showed docs winning when skills go stale, but the SkCC paper found prompt formatting alone swings results by 40%, which means the docs-vs-skills delta could partly be measuring formatting quality rather than format type. Run the head-to-head on your own eval set with formatting ablations on both sides before standardizing.
