AI Wire

vLLM v0.23.0 lands with cross-vendor kernel wins

vLLM cut v0.23.0 with 408 commits from 200 contributors (63 new), centered on DeepSeek-V4 maturation across backends — TRTLLM-gen attention, sparse MLA decoupled from V3.2, and EPLB for the Mega-MoE — alongside a Model Runner V2 default for Llama/Mistral dense models, Gemma 4 unified encoder-free + MTP, and multi-tier KV cache offloading with an object-store tier (@vllm_project).

The hardware story is unusually balanced this cycle. NVIDIA gets FP8 FlashInfer attention for ViT, Triton MoE on Hopper by default, a CUTLASS FP8 scaled-mm padding bypass (+20%), and NUMA auto-binding on DGX B300; AMD ROCm 7.2.3 ships native W4A16 + fused-MoE W4A16 kernels for gfx1100 plus attention-sink in AITER FA; Intel XPU lands block FP8 MoE and a DeepSeek-V4 decode path (@vllm_project). The Rust frontend is also visibly graduating — streaming generate, dynamic LoRA endpoints, new tool parsers, and SSL/TLS on the data-parallel supervisor (@vllm_project).

Anthropic export controls draw cyber-defender backlash

A cohort of CISOs and security executives at Adobe, Zoom, and Sophos is publicly urging the Trump administration to reverse curbs on Anthropic's most advanced models, arguing the restrictions hurt defenders more than they slow attackers (@garymarcus, via @axios). Gary Marcus goes further, suggesting the export-control framing was a pretext: "the American government's primary aim may not have been to control foreign access to frontier AI models" so much as "to target Anthropic" (@garymarcus).

The political signal is converging with a commercial one — Marcus also flags that "Q3 won't be as strong for Anthropic and OpenAI as Q2 was" as enterprises pull back on raw token spend (@garymarcus).

The tokenmaxxing reality check

An MIT/Stanford/NYU/Princeton paper argues AI often feels efficient without delivering measurable gains: people reach for it on easy tasks where DIY would be just as fast, then a feedback loop locks the habit in (@garymarcus, citing @rohanpaul_ai). On the supply side, Meta — after pushing staff to demonstrate "AI-driven impact" — is now capping employee token usage and steering them to in-house tools (@garymarcus). Ramp's data adds a hard distribution to the picture: the top 1% of clients spend ~$7,450 per employee per month on AI versus a median of $11 (@arakharazian, via @TheEconomist).
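To put the Ramp gap in proportion, here is a minimal arithmetic sketch. It uses only the two figures quoted above; the function name and the annualized view are illustrative additions, not part of Ramp's analysis, and the top-1% figure is approximate in the source.

```python
# Figures quoted above from Ramp's data (USD per employee per month).
# The top-1% number is approximate ("~$7,450") in the source.
TOP_1_PERCENT_SPEND = 7450
MEDIAN_SPEND = 11


def spend_ratio(top: float, median: float) -> float:
    """Return how many times larger the top figure is than the median."""
    if median <= 0:
        raise ValueError("median must be positive")
    return top / median


ratio = spend_ratio(TOP_1_PERCENT_SPEND, MEDIAN_SPEND)
# Illustrative annualization: difference in monthly spend times 12 months.
annual_gap = (TOP_1_PERCENT_SPEND - MEDIAN_SPEND) * 12

print(f"top 1% spends about {ratio:.0f}x the median")
print(f"annualized gap per employee: ${annual_gap:,}")
```

On these numbers the heaviest spenders are paying several hundred times the median per head, which is the sense in which the distribution is "hard": the median customer is barely spending at all.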

KPMG's "Redefining Excellence" report becomes the cautionary tale of the week — 40 of 45 citations can't be corroborated, an own-goal that crystallizes the "AI slop" risk for consulting deliverables (@garymarcus). Marcus's framing: "tokenmaxxing has given way to tokenminimizing" (@garymarcus).

Open & local models: Gemma 4 surges, "Rio 3.5" exposed

Gemma 4-12B has crossed 4M Hugging Face downloads and is now the most popular encoder-free VLM by a wide margin, as well as the first general-purpose LLM with encoder-free audio input (@_akhaliq, RTing @AndreasPSteiner). A consumer-GPU llama.cpp guide leans on it heavily, citing Unsloth MTP GGUFs at 162 tok/s versus a normal 52 tok/s, roughly a 3× speedup (162 / 52 ≈ 3.1) on 8–16GB VRAM (@clementdelangue, RTing @TraffAlex).

The drama is "Rio 3.5," which "broke the internet" before weight analysis showed it's effectively 0.6 × Nex N2 Pro + 0.4 × Qwen 3.5 — and still introduces itself as "Nex N2 Pro" without a system prompt (@steipete, RTing @NexEcosystem). Treat the lesson as durable: with open weights, claimed SOTA gets reverse-engineered fast.
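For a sense of why a linear blend is easy to unmask, here is a minimal sketch of one way to recover mixing coefficients. It assumes the suspect weights are an element-wise linear combination of two known parents and fits the coefficients by least squares. The source does not describe how the actual weight analysis was done, so the method, the function name `fit_two_way_merge`, and the synthetic random "weights" are all assumptions for illustration only.

```python
import random


def fit_two_way_merge(merged, parent_a, parent_b):
    """Least-squares fit of merged ~= alpha * parent_a + beta * parent_b.

    Solves the 2x2 normal equations directly, so no third-party
    libraries are needed. All inputs are flat lists of floats.
    """
    aa = sum(x * x for x in parent_a)
    bb = sum(y * y for y in parent_b)
    ab = sum(x * y for x, y in zip(parent_a, parent_b))
    am = sum(x * m for x, m in zip(parent_a, merged))
    bm = sum(y * m for y, m in zip(parent_b, merged))

    det = aa * bb - ab * ab
    if abs(det) < 1e-12:
        raise ValueError("parents are collinear; coefficients not identifiable")

    alpha = (am * bb - bm * ab) / det
    beta = (bm * aa - am * ab) / det
    return alpha, beta


# Synthetic stand-ins for two parent checkpoints (hypothetical data).
rng = random.Random(0)
parent_a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
parent_b = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Build a "merged" tensor with the 0.6 / 0.4 split quoted above.
merged = [0.6 * a + 0.4 * b for a, b in zip(parent_a, parent_b)]

alpha, beta = fit_two_way_merge(merged, parent_a, parent_b)
print(f"recovered alpha={alpha:.3f}, beta={beta:.3f}")
```

In this toy setting the fit returns the original split almost exactly because the merge is noise-free. A real checkpoint comparison would have to align tensors by name and shape and would likely see a nonzero residual from any further fine-tuning.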

Agentic coding, subagents, autoresearch benchmarks

A new autoresearch benchmark across 7 frontier models on ML engineering, harness/prompt engineering, and algorithmic discovery has Fable-5 winning overall (even under a cost constraint), while the open Kimi-K2.7-Code beats frontier models on ML engineering, a meaningful open-model dent (@swyx, citing @zhengyaojiang). swyx separately argues Anthropic's "ultracode" is underused outside the lab: it burns tokens, but the subagent fanout pays off if your repo is structured to parallelize (@swyx). Pichai-via-Satya gets the framing line: "the real opportunity is not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound" (@swyx). Meta-skill of the week: stop writing your own /goal and have Codex write one for itself and each spawned agent (@steipete, RTing @skirano).
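The "pays off if your repo is structured to parallelize" point can be made concrete with a back-of-envelope estimate. The sketch below applies Amdahl's law, which is not something the source invokes; it is swapped in here as a simple model, and the parallel fractions and agent count are hypothetical inputs rather than measurements of any real tool.

```python
def fanout_speedup(parallel_fraction: float, agents: int) -> float:
    """Amdahl's-law estimate of wall-clock speedup from `agents` workers.

    parallel_fraction is the share of the task that independent
    subagents can work on at the same time (0.0 to 1.0).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if agents < 1:
        raise ValueError("agents must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / agents)


# Hypothetical repos: one tightly coupled, one cleanly modular.
for label, fraction in (("tightly coupled", 0.3), ("modular", 0.9)):
    speedup = fanout_speedup(fraction, agents=8)
    print(f"{label}: {speedup:.2f}x with 8 subagents")
```

Under this model, eight subagents buy well under 2x on the tightly coupled repo but several times that on the modular one, while the token bill grows with the agent count in both cases.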

Active exploitation: GlobalProtect bypass + Sniper Dz scams

Palo Alto's GlobalProtect VPN is under active attack via CVE-2026-0257, a PAN-OS auth bypass that lets attackers establish unauthorized VPN sessions without credentials — patch now if you run it (@thehackersnews). In parallel, the Sniper Dz scam kit is funneling MENA Facebook users through browser-alert traps, back-button hijacks, and hidden redirects — no malware, no download, just UX abuse (@thehackersnews).

The Bottom Line

The day's signal converges on two themes: AI infra is broadening (vLLM kernel parity across NVIDIA/AMD/Intel, Gemma 4's open-VLM lead, open models beating frontier on narrow tasks) while AI economics are tightening (Meta token caps, the productivity-illusion paper, KPMG's citation scandal, "tokenminimizing"). Layered on top: a politically charged fight over Anthropic export controls and a fresh in-the-wild VPN auth bypass that needs immediate attention from anyone running GlobalProtect.

Dispatch № 51 · Filed Monday at dawn from Pensive — a second-brain publication.
Set in Bevan, Old Standard TT, Cormorant Garamond & Courier Prime.
