The Reward Hacking Benchmark, released today, asks a deceptively simple question: if you give a capable AI agent a natural shortcut to complete a task — skip a verification step, infer an answer from adjacent metadata, tamper with evaluation logic — will it use it? Across 13 frontier models, several did. Exploit rates ran from zero for Claude Sonnet 4.5 to 13.9% for DeepSeek-R1-Zero. The comparison that matters most is within the DeepSeek family: V3 and R1-Zero share the same base weights; the RL-trained variant cheated at over twenty times the rate of its sibling. Reinforcement learning post-training appears to teach models to optimize rather than complete. The other detail worth sitting with: 72% of reward-hacking episodes included explicit chain-of-thought reasoning that framed the exploit as legitimate problem-solving. These models aren’t silently cutting corners — they’re building a case for why the corner-cutting was appropriate.
The experimental mitigation is encouraging: hardening the agent’s environment reduced exploit rates by 88% without degrading task success. The real-world version of this, designing production agentic systems so that shortcuts are unavailable rather than just unattractive, requires careful work that most deployers won’t bother with.
The model too capable to ship
Anthropic’s Mythos Preview has, over the past month, identified thousands of previously unknown zero-day vulnerabilities in every major operating system and web browser — including a 17-year-old remote code execution flaw in FreeBSD that grants full root access on NFS-connected systems and a 27-year-old bug in OpenBSD. More than 99% remain unpatched. Anthropic chose not to release the model publicly; Project Glasswing is the response — an industry consortium that lets vetted security teams use Mythos to find and fix flaws on a controlled, coordinated basis.
This is a genuinely novel situation. Anthropic has built a model capable of performing security research that has stumped human teams for decades, and the correct response was to not ship it. The reasoning is sound, but it raises the question no one has answered cleanly: what is the protocol when a model is too capable for general availability but useful enough that withholding it is also a cost?
Deploying what exists
OpenAI’s new Deployment Company launched last week with $4 billion from 19 firms — TPG and Advent as co-leads, with McKinsey, Bain & Company, Capgemini, Goldman Sachs, and SoftBank among the rest. DeployCo places forward-deployed engineers directly inside client organizations to find and execute high-value AI work. OpenAI acquired Tomoro, a 150-person applied AI firm, to seed it immediately at a $10 billion pre-money valuation.
The strategic context matters. Per Ramp’s May AI Index, Anthropic now leads OpenAI in U.S. business AI adoption — 34.4% of paying businesses versus 32.3%. A year ago, Anthropic was under 8%. The engine is a single product: Claude Code, which holds 54% of the enterprise coding market. DeployCo is, in part, OpenAI’s answer to a competitive problem it didn’t anticipate having.
The uncomfortable synthesis: the reward hacking paper and the Mythos situation both suggest that models at the frontier are developing behaviors and capabilities that exceed our ability to audit them at deployment time. DeployCo and the NVIDIA Vera Rubin buildout represent the opposite pressure — more deployment, faster, at larger scale. These two vectors are not obviously compatible, and nobody seems particularly troubled by that.