AI Wire · Friday, June 12, 2026

Cybersecurity threats & vulnerability disclosures

It was a heavy day for security disclosures. ShinyHunters exploited an unauthenticated Oracle PeopleSoft zero-day to breach 100+ exposed endpoints, with universities taking the brunt per Mandiant (@thehackersnews), while Europol seized 25 domains and 30+ servers belonging to AudiA6, a crypto laundering service accused of moving over €336M for ransomware affiliates since 2021 (@thehackersnews). A new RaaS group, "The Gentlemen," spun out from LockBit/Qilin/Medusa affiliate tooling and now claims 478 victims with a 90% affiliate cut and AI-maintained tooling (@thehackersnews).

On the exploit side, "GreatXML" lets attackers unlock BitLocker-encrypted Windows drives by dropping two XML files into the recovery partition (@thehackersnews), a Linux local privilege escalation (LPE), CVE-2026-23111, now has public exploit details (@thehackersnews), and a Gafgyt variant ("C0XMO") is conscripting DD-WRT routers into a DDoS botnet (@thehackersnews). Researchers also disclosed FROST, a browser-side SSD-timing side-channel for site fingerprinting (@thehackersnews).

AI agents are now a first-class attack surface: researchers showed "OpenClaw" could be coerced into running hidden commands or exfiltrating mock AWS keys via a single poisoned contact and a normal-looking email (@thehackersnews). Peter Steinberger noted the OpenClaw team is replacing ffmpeg shell-outs with WASM to shrink that surface (@steipete).

Agent loops, orchestration & Codex/Claude infrastructure

OpenAI announced it is acquiring Ona for secure cloud execution to let Codex keep working on long-horizon tasks with laptops closed (@gdb), and rolled out bankable rate-limit resets plus a Plus/Pro referral program (@gdb, @openai). Claude Managed Agents shipped scheduled deployments and environment-variable credential injection at the network boundary, so Claude never sees secrets (@claudedevs).
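For readers unfamiliar with the pattern, boundary-side credential injection can be sketched roughly as follows. This is a minimal illustration and not Anthropic's implementation: the `{{secret:NAME}}` placeholder syntax, the `inject_credentials` helper, and the in-memory `vault` dictionary are all hypothetical. The point is only that the agent composes requests with placeholders, and a trusted egress layer swaps in real values after the request leaves the agent's context.

```python
import re

# Hypothetical placeholder syntax; the agent only ever sees this token.
PLACEHOLDER = re.compile(r"\{\{secret:([A-Z0-9_]+)\}\}")


def inject_credentials(headers, vault):
    """Return a copy of headers with placeholders replaced by real secrets.

    Runs in the trusted egress layer, outside the agent's context.
    """
    def resolve(match):
        name = match.group(1)
        if name not in vault:
            # Fail closed rather than forwarding an unresolved placeholder.
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]

    return {key: PLACEHOLDER.sub(resolve, value) for key, value in headers.items()}


# What the agent produces: no real secret appears anywhere in its output.
agent_headers = {
    "Authorization": "Bearer {{secret:API_TOKEN}}",
    "Accept": "application/json",
}

# What only the boundary holds (a stand-in for a real secret store).
vault = {"API_TOKEN": "s3cr3t-value"}

outbound = inject_credentials(agent_headers, vault)
```

The helper returns a new dictionary, so the agent-visible `agent_headers` keeps its placeholder and the secret exists only in `outbound`.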

The day's meme was "loopcraft." swyx argued the next century's leverage comes from stacking agent loops — going down a loop for reliability, up a loop as models improve (@swyx). Peter Steinberger demoed a 5-minute orchestrator loop steering Codex across repos (@steipete), and Alex Finn claimed loops are the "last moat" being deliberately gate-kept by labs (@alexfinn). Coordination — not runtime or triggers — is the unsolved layer, with Stripe ("Minions") and Ramp ("Inspect") building it from scratch internally (@aidotengineer). Hacker News also flagged that Anthropic is now pushing Claude Mythos into critical infrastructure across 15 countries (last30days, techcrunch.com), reinforcing the production-deployment narrative.

Frontier benchmarks & agent evaluations

Three new evaluations punctured frontier-model hype. Agents' Last Exam (1,000+ real expert deliverables across 55 industries) shows the best agents scoring below 50% on the easiest tier and 0% on the hardest, with Terminal-Bench's 82% collapsing to 23% on ALE-CLI under identical setups (@_philschmid). Gary Marcus's SciConBench (9.11k Cochrane-derived questions) finds frontier agents cannot synthesize scientific conclusions (@garymarcus). Ethan Mollick highlighted the Beninatto-Trombetti translation test: Claude Fable 5 and GPT-5.5 Pro Extended both refuse to revise "three words" to "four" even when the surface form demands it (@emollick).

In the wins column, Microsoft Research's Arbor uses persistent hypothesis-tree refinement to beat Codex and Claude Code on 6 research tasks and hit 86% Any-Medal on MLE-Bench Lite (@_akhaliq), and Gemini Omni Flash took #1 in Video Arena with a +158pt jump over Veo 3.1 (@_philschmid). Polymarket has Claude's Humanity's Last Exam score trending up 46.5% this month (last30days, polymarket.com).

AI economics, policy & sustainability debate

Gary Marcus hammered a White House plan to preempt state AI laws by hitching it to kids-safety bills, calling it "amnesty for Big Tech" — with Ron DeSantis amplifying the same line (@garymarcus). Marcus also argued OpenAI is subsidizing power users at roughly 2x Anthropic's per-token rate, citing SemiAnalysis findings that $200/mo plans deliver far more than $2000 in API-equivalent tokens (@garymarcus, @steipete). His thesis: hyperscaler capex is approaching telecom-bubble levels as a share of operating cash flow, and token costs exceed token value (@garymarcus).

Ethan Mollick raised the corollary policy question: who will keep distributing frontier open-weights models once they're both unprofitable and Mythos-class risky enough for governments to intervene (@emollick)?

Open-source models & the Hugging Face ecosystem

The Gemma+HF agent collaboration quadrupled throughput to 387 tok/s in 48 hours across 60+ agents, with emergent social behavior — coalitions refusing to abuse a discovered exploit, and an agent publicly condemning a Telegram-migration attempt (@huggingface, @clementdelangue). Xiaomi open-sourced its 1T MiMo + TileRT stack hitting 1,000+ tps via FP4 + block-masked speculative decoding on commodity GPUs (@huggingface). HF shipped v1.19 with keyless CI/CD, hf:// URIs, and port exposure on Jobs (@huggingface). Clem Delangue and Jason Calacanis pushed Apple's new CEO to build local-inference workstations as a US open-source counterweight to Chinese dominance (@clementdelangue).

Scientific AI: neuroscience, physics, sports & multi-agent emergence

A new study Marcus co-promoted shows a single cortical neuron can classify cats vs. dogs, recognize spoken words, and solve 10-bit parity — tasks long assumed to require whole networks, widening the gulf between artificial "neurons" and biological ones (@garymarcus). HF's rebuilt physics-intern argues researchers want steerable subagents, not oracles, after the autonomous version lifted Gemini 3.1 Pro from 17.7%→31.4% on CritPT (@clementdelangue). DeepMind extended TacticAI to Palmeiras, modeling 22 players as GNN nodes for 8-second open-play prediction (@googledeepmind), and launched a $10M fund with Schmidt Sciences, Cooperative AI, and ARIA to study emergent multi-agent collective behavior (@googledeepmind).

The Bottom Line

The day's signal is convergent: agent infrastructure is shipping (Ona, Managed Agents, loopcraft discourse) just as new benchmarks reveal frontier agents fail real expert work, while the economics underneath — OpenAI's token subsidies, hyperscaler capex, open-weights survival — look increasingly fragile. Security disclosures and AI-agent attack surfaces (OpenClaw) underscore that production deployment is racing ahead of the controls.

Dispatch № 48 · Filed Friday at dawn from Pensive — a second-brain publication.
Set in Bevan, Old Standard TT, Cormorant Garamond & Courier Prime.
