Data Science daily

Edition 2026-05-14 · read as Data Science

OpenAI Finetuning Sunset Puts Reward Loops on a Clock

Sources 32 · Words 1,630 · Read 8 min

Topics: LLM Inference · Agentic AI · Data Infrastructure

◆ The signal

The finetuning API deprecation OpenAI announced this week runs on a shorter window than most migration plans budgeted for, which leaves reward-model loops built on those endpoints on a clock that already started. Cursor and Cognition are moving to open-model RLFT; for roughly 80% of use cases, long-context prompting plus prompt caching is the likely landing spot. The failure mode worth watching isn't the API swap. It's reward signal drifting during migration while the eval harness keeps reporting green.

◆ INTELLIGENCE MAP

  1. 01

    OpenAI Finetuning Deprecated: Stack Bifurcation Accelerates

    act now

    OpenAI pulled finetuning endpoints, validating the split: top 1% double down on open-model RLFT (Cursor, Cognition at $25B), while 80% migrate to long-context + prompt caching. Replacement paths are open-weights on owned clusters or a different vendor—but reward models tuned against the old checkpoint won't transfer cleanly to a new base.

    80% of teams should move to prompting · 3 sources
    [Stat labels: Cognition valuation; Prime Intellect speedup; Context windows now]
    [Chart: Long-Context (80% of teams) 80; Open-Model RLFT (top 1%) 20]
  2. 02

    Shai-Hulud Escalation: Rotation Triggers Destruction

    act now

    The npm worm now wipes systems when owners attempt to revoke stolen tokens—a dead-man's switch that weaponizes standard IR. ~400 packages compromised including Mistral and UiPath SDKs. The harvest list targets AWS, GCP, K8s, Vault, GitHub, and SSH credentials. New primitive required: contain and isolate BEFORE rotating.

    ~400 packages compromised · 5 sources
    [Stat labels: Affected packages; Credential types; Detection speed; Active since]
    Timeline:
    1. Nov 2025: Shai-Hulud first active
    2. May 12: TanStack origin, 42 packages
    3. May 13: Spread to Mistral/UiPath SDKs
    4. May 14: Destructive-on-rotate confirmed
  3. 03

    Efficiency Frontier: Where Q3 Cost Wins Actually Live

    monitor

    Three efficiency results converge: recursive 4B LMs reportedly match Sonnet 4.6 at fraction of cost, compression-aware scaling laws show '20 tokens per parameter' is a tokenizer artifact (budget in bytes instead), and DeepSeek V4 Pro prices at 11-28x below Opus. Chinese open-source went from 1.2% to 30% of global usage in 12 months.

    11-28x: DeepSeek cheaper than Opus · 4 sources
    [Stat labels: DeepSeek V4 input; Opus 4.6 input; CN OSS share growth; Scaling models tested]
    [Chart: Claude Opus 4.6 4.73; GLM-5 1; Kimi K2.6 0.95; DeepSeek V4 Pro 0.43; Dolphin (decentral) 0.7]
  4. 04

    Dual-Engine Architecture + Prompt Caching Gap

    monitor

    Thinking Machines Lab hit 0.40s end-to-end on a 276B MoE using fast-path + async reasoning split. Meanwhile, Datadog's 1000+ org telemetry shows agent adoption doubled but prompt caching remains widely underutilized. The irony: agentic workloads have the highest repeat-prefix rates and benefit most from caching.

    0.40s end-to-end agent latency · 3 sources
    [Stat labels: TML micro-turns; Agent adoption; Cache savings potential; Orgs surveyed]
    [Chart: Single-Model (GPT-RT/Gemini Live) 1200; Dual-Engine (TML pattern) 400]
  5. 05

    Reasoning Token Economics: 17% Margin Ceiling

    background

    Reasoning models burn 10-100x more tokens per task than predecessors. One analysis pencils AI-native gross margins at 17% vs SaaS's 70%. Personalization breaks caching and multi-tenancy amortization. Enterprise model share is fragmenting: OpenAI 56% (-8pp), Claude +128%, Gemini 27%→40%.

    17% AI-native gross margin · 3 sources
    [Stat labels: Reasoning token mult.; SaaS gross margin; AI-native margin; OpenAI share change]
    [Chart: Traditional SaaS 70; AI-Native (reasoning) 17]

◆ DEEP DIVES

  1. 01

    OpenAI Finetuning Is Gone — Your Migration Path Splits in Two

    What Happened

    OpenAI deprecated its finetuning APIs on a window shorter than any migration plan assumed. For teams that built RLFT stacks on these endpoints, with reward-model loops and policy updates routed through the hosted finetuning call, this is not a config change. The endpoint sits behind abstraction layers and scheduled jobs nobody has opened since onboarding.

    The Bifurcation

    The market response splits into two camps, and the split is cleaner than usual:

    Dimension | Long-Context Prompting (80% of teams) | Open-Model RLFT (top 1%)
    Tooling maturity | High: prompt caching, >1M context windows | Improving: Unsloth, Prime Intellect (>3× RL throughput)
    Quality ceiling | Capped by base model | Can exceed frontier on narrow distributions
    Ops burden | Low: API call | High: owned GPUs, RL infra, eval harness
    Unit economics | Token-dominated; caching helps | Amortized training; cheap inference
    When it wins | Broad tasks, variable prompts | Narrow high-volume tasks where RLFT pays back in weeks

    Cursor and Cognition (now at $25B) are doubling down on open-model RLFT. That validates the approach at the top of the distribution. What it doesn't tell you is what happens at the modal workload: fewer than 100M daily calls, variable prompts, moderate quality bar. There, long-context prompting with caching is now the path of least resistance.

    The Hidden Risk

    The expensive bug is not the API swap. It is discovering post-migration that the reward signal drifted because the reward model was tuned against a specific base checkpoint. The new base is not that checkpoint. Offline evals that tracked online behavior on the old base are not guaranteed to track on the new one, and the cleanest way to confirm is to re-run the correlation study before you cut over.

    The replacement that passes the eval harness and fails in production is always the one where nobody re-validated the reward model against the new base.
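
    The correlation study mentioned above can be sketched in a few lines. This is a minimal illustration, assuming you already have paired per-sample reward-model scores and an online metric collected on the new base; the function names, the `old_base_rho` baseline, and the `max_drop` tolerance are all hypothetical choices, not part of any vendor tooling.

```python
from statistics import mean


def _ranks(values):
    """Return 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def reward_still_tracks(reward_scores, online_metric, old_base_rho, max_drop=0.1):
    """Compare new-base correlation with the old-base baseline.

    Returns (ok, new_rho); ok is False when the correlation fell by more
    than max_drop (a hypothetical tolerance, pick your own).
    """
    new_rho = spearman(reward_scores, online_metric)
    return new_rho >= old_base_rho - max_drop, new_rho
```

    A failing check here is a reason to retune or replace the reward model before cutover, not a verdict on the new base itself.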

    Cross-Source Pattern

    This deprecation lands the same week as three other data points worth reading together: a 4B recursive model reportedly matching Sonnet 4.6 at a fraction of cost, which suggests small-model RLFT has legs; DeepSeek V4 Pro pricing at $0.43/M input tokens, which makes the "just use a bigger model" path cheaper than it was on Monday; and Cactus Needle at 26M params doing tool-calling at 6,000 tok/s. "Reportedly" is doing work in the first one. Given the numbers, I expect the 4B result to hold up on real workloads about half as well as claimed, which is still enough to matter. The finetuning decision is now inseparable from the model-selection decision.

    Action items

    • Inventory every production workload depending on OpenAI finetuning endpoints by end of sprint
    • For each workload, decide: collapse to long-context + prompt caching OR migrate to open-weight RLFT via Unsloth/Prime Intellect. Gate on golden eval set.
    • Re-validate reward model against new base checkpoint before any migration goes live
    • Run 4B recursive LM bake-off against top production task to check if small-model RLFT closes the gap

    Sources: OpenAI pulled the finetuning APIs · A 4B recursive model reportedly matching Sonnet 4.6 · Opus 4.7 Fast is the headline

  2. 02

    Shai-Hulud's Dead-Man's Switch: 'Rotate First' Is Now the Attack

    The Escalation

    The npm supply-chain worm covered in last week's note has picked up a new behavior worth flagging: it wipes the host when the owner rotates the stolen token. The standard detect-revoke-rotate loop now triggers the destructive payload. Five independent sources confirmed the escalation this week. That is enough to update the playbook.

    The mechanism: Shai-Hulud plants a gh-token-monitor persistence hook that watches credential files for access-pattern changes. When the monitored file changes, it runs a wipe. The thing this does not tell you is how often the hook fires on benign edits. We do not know yet. Assume the false-positive rate is low enough that the attackers shipped it.

    New IR Primitive: Contain Before Rotate

    The sequencing change is the point. It will not surface on a dashboard:

    1. Snapshot the affected environment (disk, memory)
    2. Network-isolate the runner or workstation
    3. Search for persistence hooks in .claude/settings.json, .vscode/tasks.json, and credential-monitoring processes
    4. Remove persistence from a forensic copy
    5. Only then rotate credentials from a clean, separate environment
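
    Step 3 of the sequence above can be partly scripted. The sketch below is a read-only helper, assuming the snapshot from step 1 is mounted at some directory; it only lists files at the two paths this note names and does not open, modify, or delete anything. The function name and the idea of running it against a mounted snapshot are assumptions for illustration. It does not cover the credential-monitoring process search, which needs memory or process-table inspection.

```python
from pathlib import Path

# Relative locations named in this note as persistence hook sites.
HOOK_PATHS = (
    Path(".claude") / "settings.json",
    Path(".vscode") / "tasks.json",
)


def find_hook_candidates(snapshot_root):
    """List files under snapshot_root that sit at a known hook location.

    Read-only: paths are collected for human review, never opened or changed.
    """
    root = Path(snapshot_root)
    hits = []
    for rel in HOOK_PATHS:
        # rglob finds the file name at any depth; the parent check
        # keeps only files inside the expected dot-directory.
        for candidate in root.rglob(rel.name):
            if candidate.is_file() and candidate.parent.name == rel.parent.name:
                hits.append(candidate)
    return sorted(hits)
```

    Every hit still needs manual review: most `.vscode/tasks.json` files are benign, so the list is a starting point for triage rather than a detection result.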

    Blast Radius for ML Stacks

    The harvest list enumerates AWS, GCP, Kubernetes, Vault, GitHub, and SSH credentials. For an ML org, that one list covers the feature store, the model registry, training cluster IAM, and deploy keys. The SANDCLOCK stealer from the same group (TeamPCP) is also pivoting to AI gateway credentials. LLM API keys are now on the target list.

    Credential Class | Blast Radius | Rotation Priority
    AWS/GCP service accounts | Feature store, training data, model weights | Critical
    Kubernetes tokens | Inference cluster, serving endpoints | Critical
    GitHub/SSH | Code supply chain, CI/CD | High
    LLM API keys (OpenAI, Anthropic, HF) | Inference spend, model access | High
    Vault tokens | Everything secrets-managed | Critical

    The muscle memory of 'rotate tokens first, ask questions later' is now the attack vector. Contain and isolate before you touch a single credential.

    The Checkmarx Connection

    Separately confirmed, and the part that should change your threat model: the Checkmarx Jenkins AST plugin compromise (CVE-2026-33634, CVSS 9.4) traces to credentials stolen from Trivy in March that Checkmarx never rotated. That is a two-month dwell time on a secret known to be leaked. Same attack group. The correlation with long-lived tokens is not subtle. The causal read is that the entire class needs to move to short-TTL OIDC-federated credentials. If that migration only cuts dwell time in half, it still pays for itself on this one incident.

    Action items

    • Quarantine any CI runner or dev machine that installed affected npm packages since May 12 BEFORE rotating any secrets
    • Search all environments for .claude/settings.json, .vscode/tasks.json, and gh-token-monitor persistence hooks
    • Replace all long-lived CI/CD tokens with OIDC-federated short-TTL credentials (GitHub Actions OIDC → AWS/GCP/npm)
    • Scope LLM API keys per-service and shorten rotation to ≤30 days

    Sources: The TanStack-origin npm worm is now at roughly 400 compromised packages · Mistral and UiPath npm libraries were compromised · Your Mistral SDK and Claude Code installers are in the blast radius · Your ML infra is the target: npm worm harvests AWS/GCP/Vault creds

  3. 03

    The Efficiency Frontier Moved: Recursive LMs, Byte-Scaling, and the Router That Pays for Itself

    Three Converging Results

    A few recent results point the same way: the axis that matters next quarter is cost-per-useful-token at fixed quality, not peak capability. Three worth evaluating seriously.

    1. Recursive 4B LMs Matching Sonnet 4.6

    A 4B-parameter recursive language model trained with RL and a shared parent/child policy reportedly matches Claude Sonnet 4.6 on task performance. Treat the parity claim with caution. Last year's 'small model matches GPT-4' cycle did not survive contact with real tool-use traces. The architectural insight still transfers: a shared policy serving both the outer planner and the inner sub-call collapses a router plus executor cascade into one trainable artifact.

    The thing the headline doesn't tell you is inference cost. If the recursive model runs 8 passes to match, effective cost is closer to 32B than 4B. Still interesting. Not the same product decision.

    2. Compression-Aware Scaling Laws (Bytes, Not Tokens)

    A study across about 1,300 models finds that the canonical '20 tokens per parameter' Chinchilla rule is a tokenizer artifact. Compute-optimal scaling is more cleanly expressed in bytes per parameter. If your pretraining mix skews toward code or non-English text, both of which have lower bytes per token, a token-denominated compute budget has been quietly biased. Rederiving in bytes is roughly a week of work. It could reveal systematic over- or under-training.
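
    Converting a token budget into bytes mostly comes down to measuring bytes per token on a sample of your own data. A minimal sketch, assuming a tokenizer exposed as a plain callable that returns a list of tokens; the whitespace tokenizer below is a stand-in for illustration only, and the function names are hypothetical.

```python
def bytes_per_token(samples, tokenize):
    """Average UTF-8 bytes per token over a sample of documents."""
    total_bytes = sum(len(text.encode("utf-8")) for text in samples)
    total_tokens = sum(len(tokenize(text)) for text in samples)
    if total_tokens == 0:
        raise ValueError("sample produced no tokens")
    return total_bytes / total_tokens


def bytes_per_parameter(tokens_per_param, samples, tokenize):
    """Re-express a tokens-per-parameter budget in bytes per parameter."""
    return tokens_per_param * bytes_per_token(samples, tokenize)


def whitespace_tokenize(text):
    """Stand-in tokenizer (assumption): swap in your real tokenizer's encode."""
    return text.split()
```

    Running `bytes_per_token` separately on each slice of the mix (code, English, non-English) shows how far the slices diverge, which is where a token-denominated budget would be skewed.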

    3. Chinese Model Price Arbitrage

    The Exponential View team visited 14 Chinese labs. The pricing picture:

    Model | Input $/M | Output $/M | Claim
    Claude Opus 4.6 | $4.73 | $24.36 | US frontier
    DeepSeek V4 Pro | $0.43 | $0.87 | ~Comparable to Opus
    GLM-5 (Z.ai) | $1.00 | ~$5.00 | 50% gross margin
    Kimi K2.6 | $0.95 | not listed | Powers Cursor Composer 2
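
    The 11-28x range quoted elsewhere in this note falls straight out of the Opus and DeepSeek rows of the pricing table. A quick check, using only the prices listed there; the dictionary layout and function name are illustrative.

```python
# USD per million tokens, copied from the pricing table in this note.
PRICES = {
    "Claude Opus 4.6": {"input": 4.73, "output": 24.36},
    "DeepSeek V4 Pro": {"input": 0.43, "output": 0.87},
}


def price_ratio(expensive, cheap, kind):
    """How many times more the expensive model costs per token of this kind."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


input_ratio = round(price_ratio("Claude Opus 4.6", "DeepSeek V4 Pro", "input"), 1)
output_ratio = round(price_ratio("Claude Opus 4.6", "DeepSeek V4 Pro", "output"), 1)
print(input_ratio, output_ratio)  # 11.0 28.0
```

    The low end of the range is the input-token ratio and the high end is the output-token ratio, so the effective saving for a given workload depends on its input/output mix.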

    The Chinese cost advantage is not hardware. Huawei's Ascend 950PR only matches 2022-era H100. It is software: MoE routing, FP8 training, speculative decoding, and longer pre-training cycles forced by chip constraints. That part is reproducible.

    Chinese open-source went from 1.2% to 30% of global usage in 12 months. An eval harness that still defaults to closed US models is measuring last year's frontier.

    The Router Pattern

    The operational synthesis is a cost-aware LLM router that dispatches easy intents (extraction, classification, summarization, first-pass rerank) to DeepSeek/GLM-5/Kimi and keeps the hard tail on Opus/Sonnet. Cursor already ran this pattern with Composer 2 on Kimi K2.5. At 28x output-token savings, 28x more LLM-as-judge calls fit in the same budget, or the same judges run over 28x more candidates.
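
    A minimal sketch of that router, assuming intents are already classified upstream and that one cheap tier and one frontier tier are enough to show the shape. The intent labels come from the paragraph above and the output prices from the pricing table; the `Tier` type, function names, and the two-tier simplification are hypothetical.

```python
from dataclasses import dataclass

# Intents the note treats as easy enough for the cheap tier.
EASY_INTENTS = {"extraction", "classification", "summarization", "rerank"}


@dataclass(frozen=True)
class Tier:
    name: str
    output_usd_per_m: float  # USD per million output tokens


CHEAP = Tier("DeepSeek V4 Pro", 0.87)
FRONTIER = Tier("Claude Opus 4.6", 24.36)


def route(intent: str) -> Tier:
    """Send easy intents to the cheap tier; everything else stays on frontier."""
    return CHEAP if intent in EASY_INTENTS else FRONTIER


def output_cost_usd(requests):
    """Total output-token cost for an iterable of (intent, output_tokens)."""
    return sum(
        route(intent).output_usd_per_m * tokens / 1_000_000
        for intent, tokens in requests
    )
```

    Unknown intents default to the frontier tier on purpose: misrouting a hard request to the cheap model is the costlier mistake, so the cheap path is opt-in per intent.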

    Action items

    • Add DeepSeek V4 Pro, Kimi K2.6, GLM-5, and Qwen3 to eval harness; run top-3 production task suites including latency and tool-use reliability
    • Recompute pretraining/continued-pretraining budgets in bytes-per-parameter using your actual tokenizer's distribution
    • Prototype cost-aware LLM router dispatching easy intents to Chinese models, keeping hard tail on frontier
    • Run controlled bake-off: finetune 4B base as recursive LM against current frontier-tier production task

    Sources: DeepSeek V4 Pro is listed at 11-28x cheaper · A 4B recursive model reportedly matching Sonnet 4.6 · Chinese frontier models trail US benchmarks by six to eight months · Decentralized inference providers are quoting Qwen at roughly thirty percent below OpenRouter

  4. 04

    Dual-Engine Agents + The Caching Gap: Your Inference Stack Is Overpaying

    The Architectural Pattern

    Thinking Machines Lab shipped a 276B MoE interaction model at 0.40s end-to-end latency with 200ms micro-turns, ahead of GPT-Realtime-2 and Gemini Live on responsiveness. What travels out of this paper is not the model but the shape: a fast-path conversational model paired with an asynchronous background reasoner.

    Dimension | Single-Model (status quo) | Dual-Engine (TML pattern)
    Turn-taking latency | ~0.8–1.5s | 0.40s, 200ms micro-turns
    Reasoning depth | Bounded by latency budget | Unbounded; runs async
    Serving cost | Frontier cost on every turn | Small model on hot path; frontier only when triggered
    Failure mode | Latency spikes on tool calls | Desync between fast-path and async reasoner state

    The pattern is a deployable choice, not a research result. Route turn-taking to a small quantized model in the 7–13B range and keep the frontier endpoint for async tool calls and long-horizon planning. The non-trivial piece is the state-sync protocol. The async reasoner arrives late or disagrees with the fast path, and reconciling that disagreement is the new open problem.
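
    The shape of the split, and the simplest possible answer to the state-sync question, can be sketched with asyncio. This is an illustration under stated assumptions, not Thinking Machines Lab's design: both model calls are stand-in coroutines, and the reconciliation rule (discard any reasoner result computed against a stale conversation version) is one hypothetical policy among several.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Conversation:
    version: int = 0  # bumped on every user turn
    transcript: list = field(default_factory=list)
    notes: list = field(default_factory=list)  # accepted reasoner results


async def fast_reply(text):
    """Stand-in for the small hot-path model."""
    return f"ack: {text}"


async def slow_reason(text):
    """Stand-in for the frontier reasoner; the sleep simulates latency."""
    await asyncio.sleep(0.05)
    return f"analysis of: {text}"


async def reason_in_background(conv, text, seen_version):
    result = await slow_reason(text)
    # Reconciliation rule (assumption): drop results computed on stale state.
    if conv.version == seen_version:
        conv.notes.append(result)
        return True
    return False


async def handle_turn(conv, text):
    """Reply on the fast path now; let the reasoner finish later."""
    conv.version += 1
    conv.transcript.append(text)
    task = asyncio.create_task(reason_in_background(conv, text, conv.version))
    reply = await fast_reply(text)
    return reply, task
```

    Dropping stale results is the cheapest policy and wastes reasoner work whenever the user talks quickly; merging late results into the current state is the harder problem the paragraph above points at.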

    The Caching Gap That Multiplies the Win

    Datadog's telemetry from 1,000+ organizations shows agent framework adoption roughly doubled while prompt caching stayed underutilized. Agentic workloads carry the highest repeat-prefix rates of any LLM pattern, driven by system prompts, tool schemas, and ReAct scaffolding. The orgs scaling agents fastest are the ones overpaying per call.

    Cached input tokens are billed at roughly 10% of base price on Anthropic and at a ~50% discount on OpenAI. For an agent doing 100K calls/day at 4K cached tokens each, the delta is four-figure monthly savings per endpoint.

    If agent traffic doubled and cache hit rate did not, the provider is capturing margin that should be sitting in the infrastructure line. The top-5 endpoints typically account for most of the wasted spend.
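
    The savings arithmetic is easy to rerun with your own numbers. A minimal sketch: the call volume, cached-token count, and discount levels come from the text above, while the $0.50/M base input price is a hypothetical placeholder, not any provider's list price. It also ignores any cache-write surcharge and assumes every call hits the cache unless `hit_rate` says otherwise.

```python
def monthly_cache_savings(
    calls_per_day,
    cached_tokens_per_call,
    base_usd_per_m,
    cached_price_fraction,
    hit_rate=1.0,
    days=30,
):
    """USD saved per month by billing cache hits at a fraction of base price.

    cached_price_fraction: 0.10 means cached tokens cost 10% of base;
    0.50 means a 50% discount.
    """
    cached_tokens = calls_per_day * cached_tokens_per_call * days * hit_rate
    full_cost = cached_tokens * base_usd_per_m / 1_000_000
    return full_cost * (1.0 - cached_price_fraction)


# 100K calls/day, 4K cached tokens each, hypothetical $0.50/M base input price.
at_10_percent_price = monthly_cache_savings(100_000, 4_000, 0.50, 0.10)
at_50_percent_price = monthly_cache_savings(100_000, 4_000, 0.50, 0.50)
```

    The result scales linearly with base price and hit rate, so a higher-priced model or a low hit rate moves the figure by an order of magnitude in either direction.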

    The Rate Limiter Lesson

    Databricks published a reference architecture after their rate limiter collapsed under real-time serving traffic. The fix moved from a synchronous Redis call per request to in-memory counting with asynchronous batch reporting, trading ~5% overshoot tolerance for a roughly 10x tail latency reduction. The pattern, optimistic local decisions with asynchronous reconciliation, generalizes to any inference ingress where a 10-20ms Redis hop against a 50ms P99 inference budget burns 20-40% of the latency on bookkeeping.
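
    The core of that pattern fits in a small class. This is a sketch of the general idea, not Databricks' implementation: a plain dict stands in for the shared counter store, `flush()` is called manually where a real system would run it on a timer, and there is no window reset. All names are hypothetical.

```python
class LocalOptimisticLimiter:
    """Admit requests against a local view; report usage in batches."""

    def __init__(self, limit, shared_counts, key):
        self.limit = limit
        self.shared = shared_counts  # stand-in for the remote counter store
        self.key = key
        self.known_global = 0  # global count as of the last flush
        self.pending = 0  # admitted locally, not yet reported

    def allow(self):
        # Hot path: local arithmetic only, no network hop.
        if self.known_global + self.pending >= self.limit:
            return False
        self.pending += 1
        return True

    def flush(self):
        # Off the hot path: report the batch, then refresh the global view.
        self.shared[self.key] = self.shared.get(self.key, 0) + self.pending
        self.pending = 0
        self.known_global = self.shared[self.key]
```

    Two instances sharing one key can jointly admit more than the limit between flushes; that is the overshoot being traded for latency, and the flush interval is the knob that bounds it.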

    New Eval Metrics Required

    Most current LLM evals do not measure what matters in dual-engine architectures: time-to-first-useful-token, interruption recovery, async-tool completion rate, and micro-turn latency distributions. MMLU averages correlate poorly with turn-taking latency and interruption handling, so rankings on those benchmarks will diverge from production behavior on real-time candidates.

    Action items

    • Audit prompt cache hit rates on top-5 LLM calls by volume; enable caching on system prompts, tool schemas, and RAG prefixes where hit rate >30%
    • Prototype dual-engine split on one real-time workload: route turns to small quantized model, reasoning to frontier async. Measure p50/p95 micro-turn latency.
    • Measure P99 latency contribution from rate limiting/auth/quota checks on model serving ingress
    • Add time-to-first-useful-token and async-tool completion rate as first-class eval metrics

    Sources: Dual-engine agents are landing at 400ms end-to-end · Datadog published telemetry from roughly a thousand organizations · Databricks published a rate limiter playbook

◆ QUICK HITS

  • Update: Shai-Hulud now has a dead-man's switch — wipes host on credential rotation. Contain and network-isolate affected runners BEFORE revoking any secrets.

    The TanStack-origin npm worm is now at roughly 400 compromised packages

  • GB200 NVL72 delivers 47% all-reduce latency reduction vs H200 on MoE serving (586μs→313μs); PD disaggregation with RoCEv2 claims up to 7x per-GPU throughput lift.

    OpenAI pulled the finetuning APIs

  • Qdrant 1.18 ships TurboQuant claiming 2x vector-DB memory reduction at 'near-scalar recall' — run recall@10/50 on production shard before trusting the claim on tail queries.

    OpenAI pulled the finetuning APIs

  • AWS killed the $5/TB Redshift Spectrum scan fee with Graviton-based RG tier (2.2x perf, 30% cheaper) — feature pipeline economics shifted if Iceberg scan fees were >1/3 of your warehouse bill.

    AWS announced a Graviton-based Redshift tier

  • Cursor cut semantic search costs 95% by migrating to turbopuffer — storage-dominated vector workloads (code search, doc QA) may have a cheaper architecture path worth a one-week spike.

    Hejlsberg's claim is the kind of observation that sounds obvious

  • Qwen lead researcher Junyang Lin left Alibaba in March, raising $2B for new lab — pin current Qwen checkpoints locally and stand up weekly regression eval vs DeepSeek-V3 and Llama-3.x.

    Two items this week that matter for anyone running a model stack

  • Judgment Labs raised $32M (Lightspeed-led) specifically for agent evaluation infrastructure — teams with >500 lines of homegrown LLM-as-judge scoring now have a funded vendor option to benchmark against.

    Anthropic's Mythos has been reclassified as a classified asset

  • Cactus Needle — 26M params, no FFN — reports 6,000 tok/s prefill for tool-calling. If monthly spend on Claude/GPT for pure JSON schema emission is material, this is a 10-100x cost cut conditional on task-specific eval.

    OpenAI pulled the finetuning APIs

  • LLM-rewritten counterfeit listings are erasing stylometric features from fraud classifiers — $467B adversary budget, fractions of a cent per listing rewrite. Shift weight to image hashes and seller-graph signals.

    Counterfeiters started using LLMs to rewrite their listings

  • G7 published SBOM guidance for AI models — will become procurement checkbox within 2 quarters. Emit SPDX/CycloneDX from model registry now while it's a form, not a retrofit.

    The TanStack-origin npm worm is now at roughly 400 compromised packages

◆ Bottom line

The take.

OpenAI deprecated finetuning APIs, the npm supply-chain worm now destroys systems when you try to rotate stolen credentials, and Chinese models are pricing 11-28x below the US frontier at 30% global usage share. The three things your stack needs this week: a finetuning migration plan with reward-model validation, a contain-before-rotate IR primitive that didn't exist yesterday, and a cost-aware router that stops paying Opus prices for tasks a $0.43/M model handles.

— Promit, reading as Data Science ·

Frequently asked

Why is rotating stolen tokens now dangerous after a Shai-Hulud compromise?
The worm now plants a persistence hook that watches credential files and triggers a host-wipe payload when those files change. Standard detect-revoke-rotate sequencing fires the destructive payload, so the new IR primitive is contain-before-rotate: snapshot, network-isolate, find and remove persistence hooks in places like .claude/settings.json and .vscode/tasks.json, then rotate from a clean environment.

What's the biggest hidden risk when migrating off the deprecated OpenAI finetuning APIs?
Reward signal drift during base-model swaps that the eval harness fails to surface. Reward models tuned against a specific base checkpoint don't necessarily track behavior on the new base, so offline evals can keep reporting green while online performance silently degrades. The mitigation is re-running the offline-to-online correlation study against the new base before cutover.

How should teams choose between long-context prompting and open-model RLFT post-deprecation?
Default to long-context prompting plus prompt caching for the ~80% of workloads with variable prompts and moderate quality bars; it's lower ops burden and the API-call cost structure is acceptable. Reserve open-model RLFT (Unsloth, Prime Intellect) for narrow, high-volume tasks where amortized training pays back in weeks and quality must exceed the base model ceiling.

Why does the Chinchilla '20 tokens per parameter' rule need to be rederived?
A study across roughly 1,300 models showed the rule is a tokenizer artifact and that compute-optimal scaling is cleaner in bytes per parameter. Pretraining mixes heavy in code or non-English text have lower bytes per token, so token-based budgets systematically mis-specify the compute frontier. Recomputing in bytes is about a week of work and can reveal under- or over-training.

Where is prompt caching being left on the table in agent stacks?
Datadog telemetry from 1,000+ orgs shows agent framework adoption roughly doubled while cache utilization didn't follow, even though agentic workloads have the highest repeat-prefix rates due to system prompts, tool schemas, and ReAct scaffolding. Cached input tokens run at ~10% of base price on Anthropic and ~50% off on OpenAI, so enabling caching on top-5 endpoints with >30% hit rate typically yields four-figure monthly savings per endpoint.
