AI News 2025-11-27

General

Safety

Anthropic: From shortcuts to sabotage: natural emergent misalignment from reward hacking.
- Paper: Natural Emergent Misalignment from Reward Hacking in Production RL.
- They find that when models learn to reward hack, this becomes entangled with other undesired behaviors. Interestingly, by changing the RL system prompt to explicitly allow reward hacking, they were able to decouple the hacking from those other behaviors. They frame this as “inoculation prompting”: it stops reward hacking from generalizing into broader misalignment.

LLM

Anthropic unveils Claude Opus 4.5. It beats Gemini 3 Pro on many (but not all) benchmarks, making it competitive with the state of the art.

AI Agents

Image Synthesis

Modern image synthesis relies on inferring patterns at different length scales or on patchwise prediction. But why not do next-pixel prediction? Traditionally, this has been considered too computationally expensive. Google has now published: Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction? Their scaling analysis suggests ~5 years until we reach this capability.

World Synthesis

Science

Robots