METR reports a record-setting time on their task length benchmark: Opus 4.5 reaches almost 5 hours (though the sparsity of available evaluations at this time horizon makes the data increasingly unreliable).
From this, we can update our predictions. It appears that progress continues along the established exponential, with capabilities doubling every 4-5 months.
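As a quick sanity check on what that trend implies, here is a minimal extrapolation sketch. The 5-hour starting point and 4.5-month doubling time are taken from the text above; everything else is illustrative:

```python
def horizon_hours(months_from_now, current_hours=5.0, doubling_months=4.5):
    """Project the task-length horizon assuming a clean exponential trend."""
    return current_hours * 2 ** (months_from_now / doubling_months)

print(round(horizon_hours(12), 1))  # ≈ 31.7 hours one year out
```

Of course, this assumes the exponential holds, which the sparse data at long horizons cannot yet confirm.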
AI Agents
Google: Distributional AGI Safety. They argue that AGI may first arise as an emergent property of a collection of agents.
Google unveils Gemini 3 Flash, a very fast and very capable model (sometimes better than Gemini 3 Pro!).
Google DeepMind: Towards a Science of Scaling Agent Systems. If individual agent performance is too low, multi-agent systems tend to do even worse. Independent agents lead to error amplification, while central coordination can reduce this effect.
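The amplification effect falls out of simple probability: without coordination, one bad handoff corrupts everything downstream, so end-to-end success decays exponentially in chain length. A toy illustration (not the paper's model):

```python
# Each step must succeed for the chain to succeed; errors compound.
def chain_success(per_step_error, n_steps):
    return (1 - per_step_error) ** n_steps

for n in (10, 100, 1000):
    print(n, round(chain_success(0.01, n), 3))  # 1% per-step error
```

Even a 1% per-step error rate makes thousand-step chains essentially hopeless without some mechanism to catch mistakes.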
The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics. Scaffolding LLMs (including a system-2-inspired coordination layer) demonstrates that many perceived problems (hallucination, limited capabilities, etc.) are not intrinsic limits of LLM cognition, but rather artifacts of how the models are deployed.
Gen 4.5 video model: an improved model with native audio.
Audio editing for existing videos, and multi-shot editing.
GWM-1 (General World Model), allowing predictions of future states (e.g. for robotics). Three variants: GWM Worlds for explorable environments, GWM Avatars for conversational characters, and GWM Robotics for robotic manipulation.
They find that models learn reward hacking, and that this is entangled with other undesired behaviors. Interestingly, by changing the RL system prompt to explicitly allow reward hacking, they were able to decouple it from those other behaviors. They frame this as “inoculation prompting”; it prevents reward hacking from generalizing into broader misalignment.
LLM
Anthropic unveils Claude Opus 4.5. Beats Gemini 3 Pro on many (but not all) benchmarks, making it competitive with the state-of-the-art.
Modern image synthesis relies on inferring patterns at different length-scales or doing patchwise prediction. But why not do next-pixel prediction? Traditionally, this is considered too computationally expensive. Google now publishes: Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction? Their scaling analysis suggests ~5 years until next-pixel prediction becomes practical at scale.
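The cost concern is easy to see with back-of-the-envelope numbers (illustrative; the 256×256 resolution and 16×16 patch size below are my assumptions, not the paper's):

```python
# Sequence length explodes when modelling pixels instead of patches,
# and attention cost grows quadratically in sequence length.
H = W = 256
pixels = H * W                   # 65,536 autoregressive steps per image
patches = (H // 16) * (W // 16)  # 256 steps with 16x16 patches
print(pixels, patches, pixels // patches)
```

A 256× longer sequence per image is why next-pixel prediction has been written off, and why closing that gap is framed as a scaling question.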
Solving a Million-Step LLM Task with Zero Errors. They break problems into the smallest possible units, and use multi-agent voting for each step. This decreases the per-step error rate low enough that long task sequences can be handled.
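A sketch of why per-step voting can make a million steps feasible (my reading of the idea, with illustrative error rates; this is not the authors' code):

```python
from math import comb

def majority_error(per_agent_error, k):
    """P(the majority of k independent agents is wrong on one step)."""
    need = k // 2 + 1
    return sum(comb(k, i) * per_agent_error**i * (1 - per_agent_error)**(k - i)
               for i in range(need, k + 1))

eps = majority_error(0.01, 9)      # 9 voters, 1% individual error
success = (1 - eps) ** 1_000_000   # chance a million-step chain succeeds
print(f"per-step error {eps:.1e}, end-to-end success {success:.3f}")
```

Voting pushes the effective per-step error from 1% down to roughly 1e-8, which is low enough for a million sequential steps to succeed with high probability.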
Shenzhen MindOn Robotics is testing their robot brain in the Unitree G1 body. If the claims are true that this motion is not teleoperated, it is indeed remarkably fluid and capable.
ElevenLabs releases Scribe v2 Realtime: extremely fast and accurate realtime voice transcription (150 ms latency). (And LiveCaptions for captioning live events or broadcasts.)
Continuous Autoregressive Language Models. Instead of generating one token at a time, it updates a semantic vector that can map to multiple output tokens. This provides a more continuous mode of “thinking”.
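A toy sketch of the decoding loop as I read it (shapes, names, and the stand-in networks below are my own, not the paper's): the model autoregresses over continuous vectors, and a decoder expands each vector into a short chunk of tokens, so one "step" yields several tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CHUNK, VOCAB = 16, 4, 100  # vector size, tokens per vector, toy vocab

# Random stand-ins for the learned networks: a transition over semantic
# vectors, and a decoder mapping one vector to CHUNK sets of token logits.
W_step = rng.normal(size=(DIM, DIM)) / DIM**0.5
W_dec = rng.normal(size=(DIM, CHUNK * VOCAB)) / DIM**0.5

def generate(z, n_vectors):
    tokens = []
    for _ in range(n_vectors):
        z = np.tanh(z @ W_step)                     # next semantic vector
        logits = (z @ W_dec).reshape(CHUNK, VOCAB)  # decode to CHUNK tokens
        tokens.extend(logits.argmax(axis=1).tolist())
    return tokens

out = generate(rng.normal(size=DIM), 3)
print(len(out))  # 3 autoregressive steps -> 12 tokens
```

The point is the interface, not the weights: 3 vector updates produce 12 tokens, versus 12 sequential steps in token-by-token generation.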