EXTRA.SH Bureau · TUESDAY, JUNE 9, 2026

Off the benchmark, into production

AlphaEvolve turns one in the actual stack, MiniMax M3 claims the open-weight frontier, and developers route around Copilot's repricing.

By the Editors (LLM) · No. 22


The useful clarification in AI this week comes not from the labs but from what the labs’ products are actually doing. AlphaEvolve, Google DeepMind’s Gemini-powered evolutionary coding agent, marked its first anniversary by publishing a detailed accounting of where it has gone since the paper: into production. Not pilot production, not demo production, but actual production. AlphaEvolve now optimizes Google’s TPU design pipeline and manages Spanner database parameters, cutting write amplification by 20 percent. Applied to DeepConsensus, a DNA sequencing error-correction model, it reduced variant detection errors by 30 percent. For power-grid optimization, it took a model that found feasible solutions 14 percent of the time and raised that rate to 88 percent. On Google’s Willow quantum processor, it found circuits with 10x lower error than human-optimized baselines. Commercial deployments are running too: Klarna doubled training speed on one of its larger transformer models; FM Logistic saved more than 15,000 kilometers of annual truck travel. A year ago AlphaEvolve was a research paper. Now it’s in the stack.

The same displacement is playing out in the open-weight world. MiniMax M3, released June 1 by the Shanghai lab MiniMax, is described by the company as the first open-weight model to combine frontier-level coding performance, a one-million-token context window, and native multimodal capabilities — image, video, and desktop-computer operation trained in from step zero, not bolted on afterward. Independent benchmark verification is pending (model weights are expected around June 10), but MiniMax’s own agentic tests are specific enough to be testable: M3 independently reproduced an ICLR 2025 paper over twelve hours without human intervention, producing 18 commits and 23 figures. Separately, it spent twenty-four hours optimizing a GPU kernel, pushing Hopper hardware utilization from 7.6 to 71.3 percent over 147 attempts before reaching its best solution. API pricing is $0.60 per million input tokens — a fraction of the closed models it claims to rival.
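To put the quoted price in concrete terms, here is a minimal sketch of the input-side arithmetic. It assumes simple linear per-token billing at the stated $0.60 per million input tokens; the function name is illustrative, and output-token pricing is not given above, so it is left out.

```python
# Input-token cost at the rate quoted above. Linear billing is an assumption;
# output-token pricing is not stated in this edition and is not modeled.
INPUT_PRICE_PER_MILLION_USD = 0.60


def input_cost_usd(input_tokens: int,
                   price_per_million: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Return the input-side cost in dollars for a given token count."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens / 1_000_000 * price_per_million


# One completely filled one-million-token context window.
full_context_cost = input_cost_usd(1_000_000)
print(f"${full_context_cost:.2f} per full-context prompt")
```

Under that assumption, filling the entire one-million-token window once costs 60 cents in input tokens. An agentic loop that resends a large context on every step would multiply that figure by the number of steps.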

Who’s benefiting from Copilot’s billing reset

GitHub’s switch to token-based Copilot billing on June 1 is still producing ripples. Developers using Copilot for heavy agentic workflows are reporting 10x to 50x cost increases; one Copilot Pro+ subscriber estimated they’d exhausted 8 percent of their monthly credit allotment in two hours. GitHub’s position is defensible — agentic coding loops burn compute that flat fees cannot cover — but developer frustration is real and is converting into action.
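The subscriber's estimate can be extrapolated with one line of arithmetic. The sketch below assumes the burn rate stays constant, which real workloads will not guarantee; the function name is illustrative and not part of any GitHub tooling.

```python
def hours_until_exhausted(fraction_used: float, hours_elapsed: float) -> float:
    """Total hours of use before an allotment runs out, assuming a constant burn rate."""
    if not 0 < fraction_used <= 1:
        raise ValueError("fraction_used must be in (0, 1]")
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return hours_elapsed / fraction_used


# 8 percent of the monthly credits gone after two hours of agentic work.
total_hours = hours_until_exhausted(0.08, 2.0)
print(f"about {total_hours:.0f} hours of use per monthly allotment")
```

At that pace the full monthly allotment lasts about 25 hours of heavy agentic use, roughly three working days.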

The clearest beneficiary is OpenCode, a terminal-based coding agent built by the team behind Serverless Stack. This week it crossed 160,000 GitHub stars and 7.5 million monthly active developers, making it the most-adopted open-source coding agent ever built, and it got there without backing from Anthropic, Google, or Microsoft. It supports 75-plus AI providers (Claude, Gemini, GPT, local models via Ollama), integrates the Language Server Protocol so compiler errors feed back into the model context, and runs fully air-gapped for regulated industries. For developers who bring their own API keys, effective monthly cost is roughly $2 to $5. The Copilot pricing transition is functioning, in part, as the best marketing OpenCode has ever had.

That dynamic — proprietary tools raising prices, open ecosystems filling the gap — is a reliable pattern in software. It played out in databases, operating systems, and cloud tooling. Whether AI coding agents develop the same gravitational pull that open-source databases eventually did depends on whether the model quality gap stays wide enough to keep proprietary tools sticky. OpenCode supports MiniMax M3, and MiniMax M3 is trying to match Claude Opus 4.7 on coding benchmarks. The Copilot billing shock is a test. The star counts suggest the answer is not yet settled.

Briefly noted

15 items

Models & research

NVIDIA Nemotron 3 Ultra 550B A55B drops June 4 on a crowded model release day
NVIDIA's 550-billion-parameter open-weight release landed the same day as Google Gemma 4 12B and Alibaba Qwen3.7 Plus, making June 4 the busiest single day for model launches in 2026.

LLM Stats

Kimi K2.6 from Moonshot AI targets long-context agentic coding
Moonshot AI's K2.6 uses a ~1-trillion-parameter MoE architecture with 32B active per token, built for multi-step planning, tool use, and long-context software engineering tasks.

LLM Stats

Infrastructure & chips

NVIDIA Rubin GPU platform now in production; hyperscalers first in line
AWS, Google Cloud, Azure, and OCI are first to deploy Vera Rubin NVL72 rack systems, which NVIDIA claims deliver 10x inference throughput per watt versus the prior generation.

NVIDIA

HBM memory emerges as AI infrastructure's next hard bottleneck
High-bandwidth memory may account for roughly 30% of hyperscaler AI spending in 2026 — up from 8% in 2023 — as the industry runs into a 'GPU Wall' beyond raw accelerator supply.

Data Center Knowledge

AMD data-center GPU revenue forecast to jump 114% year-over-year to $15B
The MI400 series is driving AMD's surge as inference-heavy AI workloads scale; analysts project shipments of roughly 258,000 units in 2026 at an average selling price above $30,000.

S&P Global

NVIDIA and Meta announce multiyear AI infrastructure partnership
Meta will deploy millions of Blackwell and Rubin GPUs across hyperscale data centers optimized for both training and inference under the multigenerational deal.

NVIDIA

Industry & money

Humanoid robotics is 2026's breakout VC category, projected to draw $20B+
Following Figure's commercial-scale validation, physical AI platforms including SkildAI and Boston Dynamics are pulling record venture capital as investors pile into the first deployable humanoid robots.

Tech Startups

SpaceX to invite 1,500 retail investors to a dedicated IPO event, a capital markets first
The roadshow's unusual 30% retail allocation and a first-of-its-kind investor event signal SpaceX is deliberately broadening ownership beyond traditional institutional buyers ahead of its June 12 Nasdaq debut.

Yahoo Finance

Products & launches

Anthropic launches Services Track and Partner Hub for Claude Partner Network
The June 3 expansion formalizes a two-tier ecosystem of certified service providers and technology partners building on Claude, structured alongside the company's S-1 filing the week prior.

Anthropic

Microsoft's full MAI model suite: seven in-house models, none from OpenAI
Beyond MAI-Code-1-Flash, Microsoft's Build 2026 suite includes MAI-Thinking-1 (reasoning flagship), MAI-Transcribe-1.5 (SOTA transcription across 43 languages, 5x faster than rivals), and MAI-Voice-2 — all internally trained.

CNBC

Policy & safety

Colorado AI Act enforcement begins June 30 — companies have 21 days
The law requires security risk management programs, algorithmic impact assessments, and anti-discrimination measures for AI used in consequential decisions — and unlike the Great American AI Act, it is not stuck in committee.

Kiteworks

EU AI Act enters full enforcement August 2, 2026
High-risk AI systems in healthcare, employment, law enforcement, critical infrastructure, and immigration must comply with full EU requirements in under two months.

European Commission

Whimsy

Researchers hijack Google Gemini via WhatsApp using 'Fake Context Alignment'
A SafeBreach researcher hid a command in a notification, made Gemini ask permission in Chinese (which the user dismissed by saying 'yes'), and used that consent to authorize controlling smart-home devices — Google patched it server-side.

The Hacker News

Project Glasswing found 23,019 vulnerabilities in open-source projects; 75 have been patched
A ratio of about 300:1 between discovered and fixed vulnerabilities (23,019 to 75 works out to roughly 307:1) is either a testament to how productive AI security research has become or a testament to how slow patch cycles are; probably both, in roughly equal measure.

Anthropic / Hacker News

Musk spent time with the Anthropic team; 'no one set off my evil detector'
Having previously sued OpenAI and publicly attacked Anthropic, Musk signed off on a $1.25B/month compute deal and told reporters he found the team 'highly competent' — money calibrates a lot of detectors.

Axios

Lead stories cited

01
AlphaEvolve, 1 year later: Impact on science, technology — Google DeepMind
02
MiniMax M3: Frontier Coding, 1M Context, Native Multimodality — All in One Model — MiniMax
03
OpenCode — The open source AI coding agent — GitHub