vLLM stack and open trillion-parameter agentic models
vLLM dominated the inference layer this week. v0.21.0 landed 367 commits from 202 contributors, including KV Offload with HMA, spec-decode-with-thinking-budget for reasoning models, TOKENSPEED_MLA on Blackwell for DeepSeek-R1 and Kimi K2.5, and DeepSeek V4 pipeline parallelism (@vllm_project). The headline result is the Mooncake distributed KV cache integration: on real Codex and SWE-bench Pro traces running Kimi K2.5 NVFP4 on GB200, cache hit rate jumped from 1.7% to 92.2%, delivering 3.8x throughput, 46x lower P50 TTFT, and near-linear scaling to 60 GPUs (@vllm_project). Red Hat's evaluation of TurboQuant across 4 models and 5 benchmarks gives operators the first comprehensive data on where the 2-bit KV cache helps and where it costs accuracy (@vllm_project).
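To put the hit-rate jump in perspective, here is a back-of-envelope sketch of how much prefill work remains at each hit rate. It assumes the hit rate is measured per prompt token and that prefill cost scales linearly with uncached tokens; both are simplifying assumptions, not details from the vLLM report, and the function name is illustrative. It is not meant to reproduce the measured throughput or TTFT figures.

```python
def uncached_fraction(hit_rate_pct: float) -> float:
    """Fraction of prompt tokens that still need prefill compute.

    Assumption: the hit rate is a per-token percentage.
    """
    return 1.0 - hit_rate_pct / 100.0


# Hit rates quoted in the Mooncake result above.
before = uncached_fraction(1.7)    # nearly every token is recomputed
after = uncached_fraction(92.2)    # only a small remainder is recomputed

# Ratio of prefill tokens computed before vs. after, under the linear-cost assumption.
prefill_reduction = before / after

print(f"uncached before: {before:.3f}")
print(f"uncached after:  {after:.3f}")
print(f"prefill tokens reduced by roughly {prefill_reduction:.1f}x")
```

Under these assumptions the prefill token count drops by roughly an order of magnitude. The measured end-to-end gains depend on more than this ratio, so treat it only as intuition for why cache hit rate matters so much on agentic traces.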
The open-weights side moved in lockstep. Inclusion AI released Ring-2.6-1T, a one-trillion-parameter dense reasoning model under MIT license with adjustable "high" and "xhigh" reasoning effort, scoring 63.82 on ClawEval (@_akhaliq, @vllm_project). Intern-S2-Preview shipped as a 35B scientific multimodal foundation model with day-0 vLLM support and, notably, material crystal structure generation (@vllm_project). Nous Research's 9B Hermes agentic model hit 53.33% on a 200-sample SWE-bench slice, surprising even its creator at that parameter count (@_akhaliq).
Anthropic crosses OpenAI in enterprise adoption
Ramp's AI Index recorded a structural inflection: 34.4% of U.S. businesses now pay for Anthropic versus 32.3% for OpenAI, with Anthropic adoption quadrupling year-over-year while OpenAI rose just 0.3% (@arakharazian). Ramp's Kharazian had publicly predicted the crossover roughly a month out and credits the second-wave surge to coding and agentic use since late 2025 (@arakharazian). Reddit's r/grAIve framed this as the first time a vendor has displaced OpenAI's B2B lead since GPT-3.5, with enterprise buyers now choosing on criteria beyond brand recognition (last30days, reddit.com).
A related Coatue/Ramp chart shows the deeper repricing: 74% of AI-lab revenue is consumption-based versus 96% seat-based for traditional SaaS, and that SaaS mix has barely budged in 11 months (@arakharazian). Anthropic also added Ramp as a Claude connector, giving the model direct access to 50,000+ businesses' spend data (@arakharazian).
Codex and ChatGPT as the universal agent harness
OpenAI is consolidating Codex into a cross-surface harness. A new Windows sandbox solves the approval-prompt-versus-full-access dilemma (@gdb), Codex is now reachable from inside the ChatGPT app (@gdb), and the /goal primitive is producing genuinely long-horizon autonomous runs: Greg Brockman had it unsubscribe from 87 marketing lists over an hour of unsupervised work (@gdb). Malta secured a countrywide ChatGPT Plus deal (@gdb). Codex Skills are emerging as the packaging format for repeatable workflows, including codebase complexity analysis and local-business lead prospecting (@gdb).
Fal ships a generative-media platform play
Fal announced its World Model Accelerator — the inference system behind its action-controlled diffusion APIs — and a genmedia CLI that exposes 1,000+ models to Claude and other terminal agents via installable skills (@fal). Day-0 endpoints landed in a single burst: Happy Horse 1.0 (1080p video with synced audio), NVIDIA Nemotron 3 Nano Omni, GPT Image 2, FLUX.2 Pro Outpaint, Recraft V4.1, Pixal3D, xAI Grok Speech-to-Text, and Lucy 2.1 Virtual Try-On (@fal). Fal also published a prompt-engineering benchmark on Happy Horse showing ~20-word prompts as the sweet spot and warning that words like "stunning" and "hyperrealistic" drag outputs toward defaults (@fal).
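The prompt guidance above can be turned into a quick pre-flight check. The following is a minimal sketch, not Fal's tooling: the function name, the tolerance band around 20 words, and the word-splitting rule are all assumptions made for illustration, and the filler list contains only the two words named in the benchmark summary.

```python
import re

TARGET_WORDS = 20   # approximate sweet spot reported in the benchmark
TOLERANCE = 5       # assumption: arbitrary band around the target
FILLER_WORDS = {"stunning", "hyperrealistic"}  # the two examples named above


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for a video prompt; an empty list means no issues found."""
    # Assumption: a "word" is a run of letters, digits, apostrophes, or hyphens.
    words = re.findall(r"[A-Za-z0-9'-]+", prompt.lower())
    warnings = []
    if abs(len(words) - TARGET_WORDS) > TOLERANCE:
        warnings.append(
            f"length {len(words)} words; target is about {TARGET_WORDS}"
        )
    flagged = sorted(FILLER_WORDS.intersection(words))
    if flagged:
        warnings.append("filler words: " + ", ".join(flagged))
    return warnings


if __name__ == "__main__":
    for issue in lint_prompt("A stunning hyperrealistic cat"):
        print(issue)
```

A check like this only flags length and the listed filler words; it says nothing about whether the prompt will produce a good result.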
Cybersecurity: agent-targeted attacks and a worm bounty
The Hacker News flagged an unusually dense vulnerability week: CVE-2026-42945, a heap overflow in NGINX's rewrite module under active exploitation (@thehackersnews); MiniPlasma, a Windows LPE in cldflt.sys that grants SYSTEM on fully-patched Windows 11 (@thehackersnews); and Claw Chain (CVE-2026-44118), which spoofs OpenClaw's senderIsOwner flag to hijack an AI agent without credentials, then chains TOCTOU escapes for persistence (@thehackersnews). ClickFix evolved to drop scheduled-task persistence plus a PySoxy SOCKS5 proxy from a single pasted command (@thehackersnews). Most concerning, TeamPCP open-sourced their Shai-Hulud worm and posted a $1,000 Monero bounty on the Breached forum for the largest npm hijack haul (@thehackersnews). This echoes a wider trend of AI-developer supply-chain attacks, including the TanStack incident that planted 84 malicious packages in six minutes via chained GitHub Actions flaws (last30days, reddit.com).
Singularity rhetoric meets the LLM-limits backlash
Mustafa Suleyman predicted full automation of accounting, legal, marketing, and project management within 12-18 months (@garymarcus); Gary Marcus offered a $100K bet against him and argued current gains come from symbolic tooling wrapped around LLMs, not pure scaling (@garymarcus). Ethan Mollick took the opposite view — that a Von Neumann-style singularity is permeating from SF outward and that robust RSI plus continual learning are the only remaining barriers (@emollick). The economics side pushed back harder: John Burn-Murdoch endorsed Brian Albrecht's "you are not a horse" rebuttal to labor-displacement panic (@jburnmurdoch), Stripe Economics' travel-agent case study showed displacement isn't bleak (@jburnmurdoch), and Alvarez et al. found data-center buildouts boost local wages and house prices while raising electricity costs (@emollick).
The Bottom Line
The infrastructure layer is consolidating fast: vLLM plus Mooncake KV caching, Codex as a universal agent harness, and Fal's media platform are all racing to be the substrate for long-horizon agent work, while Anthropic quietly overtook OpenAI in enterprise adoption. Against that, the security surface for AI agents is widening (Claw Chain, Shai-Hulud worm bounties), and the AGI-timeline discourse hardened into a Suleyman-vs-Marcus standoff that neither side can yet settle empirically.