AI News 2025-05-29

General

Essay by Pete Koomen: AI Horseless Carriages (video version: Why AI Apps Still Feel Broken with Pete Koomen). It makes the case that our current approach of adding AI to existing applications is akin to early horseless carriages, which added engines to existing carriage designs instead of being designed from scratch to take full advantage of an engine. Future AI-first applications need to rethink the user experience in light of AI capabilities.

Research Insights

LLMs on the Line: Data Determines Loss-To-Loss Scaling Laws. They report that dataset curation and tokenization scheme have a strong effect on final loss, while architecture has only a minor effect (any reasonable deep learning architecture can learn).
Several papers show that doing RL on a model's internal confidence can improve it (without needing external data):
- Learning to Reason without External Rewards (code). They show that LLMs can learn without ground-truth answers, by instead optimizing their own internal confidence.
- Maximizing Confidence Alone Improves Reasoning (code); a.k.a. RENT: Reinforcement Learning via Entropy Minimization.
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.
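To make the idea in the items above concrete, here is a toy sketch of a confidence-based reward: the negative mean entropy of the model's next-token distributions over a generated answer. This is an illustration under stated assumptions, not the exact objective of any of the listed papers; the function names (`token_entropy`, `confidence_reward`) and the example logits are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    # Shannon entropy (in nats) of the next-token distribution.
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_reward(per_token_logits):
    # Hypothetical reward: negative mean entropy across generated tokens.
    # Lower entropy (a more confident model) yields a higher reward,
    # and no ground-truth answer is needed to compute it.
    entropies = [token_entropy(step) for step in per_token_logits]
    return -sum(entropies) / len(entropies)

# Made-up logits for two 2-token answers over a 3-token vocabulary.
confident = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
uncertain = [[1.0, 1.0, 1.0], [0.5, 0.4, 0.6]]

print(confidence_reward(confident) > confidence_reward(uncertain))
```

In an actual RL setup, a scalar like this would stand in for the external reward signal in the policy-gradient update; the details (which tokens are included, how the reward is normalized) vary between methods.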
Creative Preference Optimization. Creativity can be optimized.

LLM

ByteDance releases BAGEL: Unified Model for Multimodal Understanding and Generation (7B, weights, code, demo).
The recently released Claude 4 Opus achieves a record 8.5% on ARC-AGI-2.
DeepSeek releases a minor upgrade: DeepSeek-R1-0528.

Agents

Safety & Interpretability

Audio

Kyutai demos Unmute, a text-to-speech and speech-to-text capability. It will be open-sourced.
Anthropic announces that it will begin rolling out a voice conversation mode.
Chatterbox TTS is a high-quality open source speech synthesis model (try).

Image Synthesis

Goodfire presents: Painting with concepts using diffusion model latents (try). One can apply semantic labels spatially, in order to guide image generation.
Runway introduces Layout Sketch, allowing one to create images with positional guidance of reference elements.

Video

Viggle Live enables real-time avatar control.
Workflow: Use Google Street View imagery combined with image synthesis (e.g. Runway References) and then video generation (e.g. Runway Gen3) to generate a sequence of “on location” clips.
Google DeepMind reports SignGemma, a forthcoming open model for converting sign language video into text.

World Synthesis

EVA: Expressive Virtual Avatars from Multi-view Videos (preprint). Enables 3D virtual avatars viewable from any direction, based on a monocular input.
Odyssey has a crude prototype for generative worlds that you can explore (try it here); they claim frames are generated every ~40 ms. Blog post: AI video you can both watch and interact with in real-time. As they note, the technology will improve rapidly. Already you can get a taste of exploring environments that are generated on demand.

Science

OpenAI adds to ChatGPT's scaffolding the ability to visualize molecules (using the RDKit library).

Robots