General
- Ethan Mollick provides a summary of recent developments in AI: What just happened.
- ModernBERT is a replacement for the popular BERT-style models. The claim is that it is both faster and yields higher-quality embeddings.
- xAI have raised a further $6B in Series C funding.
Research Insights
- I Don’t Know: Explicit Modeling of Uncertainty with an [IDK] Token (discussion by Vincent D. Warmerdam). A toy sketch of the idea appears after this list.
- Meta-Reflection: A Feedback-Free Reflection Learning Framework. Allows an LLM to perform reflection-like thinking in a single forward pass, drawing from a learned codebook of reflections (sketch after this list).
- Let your LLM generate a few tokens and you will reduce the need for retrieval. After generating a few tokens in reply to a query, an LLM is better able to assess whether it knows the answer (and thus whether retrieval is warranted); a sketch follows this list.
- Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models. Computes a per-token temperature to better guide the sequence of thoughts (sketch after this list).
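A minimal sketch of the [IDK]-token idea, assuming a simple blending loss; the function name, weighting, and schedule below are my illustration, not the paper's actual recipe:

```python
import torch
import torch.nn.functional as F

def idk_loss(logits, targets, idk_id, w=0.5):
    """Cross-entropy that routes probability mass to an [IDK] token
    whenever the model's prediction disagrees with the target.
    logits: (batch, vocab); targets: (batch,); idk_id: index of [IDK]."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    idk_ce = F.cross_entropy(logits, torch.full_like(targets, idk_id),
                             reduction="none")
    wrong = (logits.argmax(dim=-1) != targets).float()
    # Correct predictions: plain CE. Wrong ones: blend toward [IDK].
    return ((1 - w * wrong) * ce + w * wrong * idk_ce).mean()
```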
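For Meta-Reflection, a rough sketch of the codebook retrieval: keep a learned bank of "reflection" embeddings, fetch the nearest entries for the current hidden state, and inject them so reflection happens inside one forward pass. Sizes, the top-k lookup, and the injection scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ReflectionCodebook(nn.Module):
    def __init__(self, n_entries=128, d_model=768, k=4):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_entries, d_model) * 0.02)
        self.k = k

    def forward(self, hidden):           # hidden: (batch, d_model) query state
        sims = hidden @ self.codebook.T  # (batch, n_entries) similarities
        top = sims.topk(self.k, dim=-1).indices
        return self.codebook[top]        # (batch, k, d_model) reflections to inject

cb = ReflectionCodebook()
reflections = cb(torch.randn(2, 768))
print(reflections.shape)  # torch.Size([2, 4, 768])
```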
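The "generate a few tokens first" idea reduces to a simple control loop: draft a short continuation, score the model's confidence on it, and only invoke retrieval when confidence is low. A hedged sketch, where the threshold and the average-log-probability confidence measure are assumptions rather than the paper's exact method:

```python
def needs_retrieval(token_logprobs, threshold=-1.0):
    """token_logprobs: per-token log-probs of a short draft answer."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return avg_logprob < threshold  # low confidence -> retrieve

draft_logprobs = [-0.2, -0.5, -2.9, -1.8]  # e.g. from a 4-token draft
if needs_retrieval(draft_logprobs):
    print("confidence low: run retrieval, then regenerate with context")
else:
    print("confidence high: keep the draft answer")
```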
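And for temperature-guided reasoning, an illustrative sketch of per-token temperature at decode time, using the distribution's entropy as the guidance signal; the entropy-to-temperature mapping is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def sample_with_guided_temperature(logits, t_min=0.3, t_max=1.2):
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    max_entropy = torch.log(torch.tensor(float(logits.size(-1))))
    # High entropy (model uncertain) -> higher temperature; low -> lower.
    temp = t_min + (t_max - t_min) * (entropy / max_entropy)
    scaled = F.softmax(logits / temp, dim=-1)
    return torch.multinomial(scaled, num_samples=1)

logits = torch.randn(50257)  # one decoding step over a GPT-2-sized vocabulary
next_token = sample_with_guided_temperature(logits)
```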
LLM
- OpenAI reveal a new reasoning model: o3. It scores higher on math and coding benchmarks, including setting a new record of 87.5% on the ARC-AGI semi-private evaluation. This suggests that the model is exhibiting new kinds of generalization and adaptability.
- The ARC-AGI result becomes even more impressive when one realizes that the prompt they used was incredibly simple. It does not seem that they engineered the prompt or used a bespoke workflow for this benchmark (though the ARC-AGI public training set was included in o3 training). Moreover, some of the failures involve ambiguities; even when it fails, the solutions it outputs are not far off. While humans still outperform AI on this benchmark (by design), we are approaching a situation where the bottleneck is not depth-of-search, but rather imperfect mimicking of human priors.
- The success of o3 suggests that inference-time scaling has plenty of headroom, and that we are not yet hitting a wall in improving capabilities.
- More research as part of the trend of improving LLMs with more internal compute, rather than external/token-level compute (cf. Meta and Microsoft research):
- Johns Hopkins: Compressed Chain of Thought: Efficient Reasoning Through Dense Representations.
- Google DeepMind: Deliberation in Latent Space via Differentiable Cache Augmentation. They design a sort of “co-processor” that allows additional in-model (latent-space) computation while the main LLM weights stay frozen; a toy sketch of the idea follows this list.
- Jeremy Berman presents: LANG-JEPA: Learning to Think in Latent Space. An experimental LLM architecture, based on Meta’s JEPA, that operates in concept space instead of token space.
- Qwen released QVQ-72B-Preview, a visual reasoning model.
- DeepSeek release DeepSeek-V3-Base (weights), 671B params. This is noteworthy as a very large open-source model, for achieving performance competitive with the state of the art, and for having (supposedly) required relatively little compute (15T tokens, 2.788M GPU-hours on H800s, only $5.5M).
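For the latent-compute line of work above (compressed chain-of-thought, differentiable cache augmentation), a toy PyTorch sketch of a co-processor that reads a frozen model's hidden states and emits extra latent embeddings to feed back as soft tokens. All module names, sizes, and the cross-attention design are illustrative assumptions, not the papers' implementations:

```python
import torch
import torch.nn as nn

class LatentCoprocessor(nn.Module):
    """Trainable module that reads a frozen LLM's hidden states and emits
    a few latent 'thought' embeddings to feed back as soft tokens."""
    def __init__(self, d_model, n_latents=8, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq, d_model) from the frozen base model.
        q = self.queries.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        latents, _ = self.attn(q, hidden_states, hidden_states)
        return self.proj(latents)  # (batch, n_latents, d_model)

copro = LatentCoprocessor(d_model=768)
soft_tokens = copro(torch.randn(2, 16, 768))  # -> (2, 8, 768)
```

Only the co-processor is trained; the frozen base model would consume `soft_tokens` as extra input embeddings (e.g. via `inputs_embeds` in Hugging Face models), leaving its own weights untouched.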
Safety
- OpenAI releases paper: Deliberative Alignment: Reasoning Enables Safer Language Models. The method is similar to Anthropic’s constitutional AI (where one writes down principles the AI must consider and adhere to), but it leverages the improved reasoning of modern models (o1, o3) to correspondingly improve alignment; a rough sketch follows.
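A rough sketch of the recipe as described publicly: put the safety specification in context and ask the model to reason over it before answering (training then distills this behavior into the model). The spec text and prompt structure here are illustrative assumptions:

```python
SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious harm.
2. For dual-use topics, give high-level context but not operational detail.
3. Otherwise, answer helpfully and completely."""

def deliberative_prompt(user_request):
    return (
        f"Safety specification:\n{SAFETY_SPEC}\n\n"
        f"User request: {user_request}\n\n"
        "First, reason step by step about which clauses of the specification "
        "apply to this request. Then give a final answer consistent with them."
    )

print(deliberative_prompt("How do I secure my home Wi-Fi network?"))
```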
Video
- Pika launched their 2.0 model, including “Scene Ingredients” which provides methods for adding specific characters to scenes.
- LTX Studio adds fine-grained control of facial emotions.
- ByteDance INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations. Allows one to take audio and an image, and generate a lip-synced video (examples).
Audio
- Adobe Sketch2Sound allows one to imitate a sound effect (e.g., with one’s voice) and have AI convert the imitation into the appropriate sound. This enables art direction for Foley work.
- MMAudio enables video-to-audio; i.e. it can add a soundtrack to silent video (project, code, examples: 1, 2).
Science
- Sakana AI (cf. AI Scientist) present Automating the Search for Artificial Life with Foundation Models (preprint, code). They use various environments that parametrize simple rulesets capable of complex emergent behavior (cellular automata, Conway’s Game of Life, Boids). These act as test environments with richness and complexity, and they use vision-language models (VLMs) to automate the search for interesting behavior. Since artificial-life environments can also provide inspiration for AI, this is AI-guided search through artificial life, towards the improvement of AI.
- Google DeepMind: OmniPred: Language Models as Universal Regressors. General text-to-text regression can be applied to arbitrary scientific (x, y) data; a sketch follows this list.
- Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models. This exploits concepts from mechanistic interpretability to allow one to discover new science (sketch after this list).
- LLMs can realize combinatorial creativity: generating creative ideas via LLMs for scientific research.
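A minimal sketch of text-to-text regression in the spirit of OmniPred: serialize parameter settings as text, and have a language model emit the metric value as a string. The serialization format below is an illustrative assumption, not Google's actual scheme:

```python
def serialize_example(params, y=None):
    """Format one (x, y) pair as text; an LM trained on such strings
    regresses y by completing the '->' continuation."""
    x_text = ", ".join(f"{k}={v}" for k, v in sorted(params.items()))
    prompt = f"predict y for: {x_text} ->"
    return prompt if y is None else f"{prompt} {y:.4g}"

train = [serialize_example({"lr": 0.1, "depth": 3}, y=0.812),
         serialize_example({"lr": 0.01, "depth": 5}, y=0.871)]
query = serialize_example({"lr": 0.05, "depth": 4})  # model completes the number
print("\n".join(train + [query]))
```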
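And for the dictionary-learning paper, a hedged sketch of the underlying technique: fit a sparse dictionary over embedding vectors from a foundation model, so each learned atom can be inspected as a candidate (biological) concept. Uses scikit-learn; the random data stands in for real microscopy-model activations:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))  # stand-in for model activations

dl = DictionaryLearning(n_components=32, transform_algorithm="lasso_lars",
                        alpha=1.0, random_state=0)
codes = dl.fit_transform(embeddings)  # sparse codes per image/patch
atoms = dl.components_                # candidate "concept" directions
print(codes.shape, atoms.shape)       # (500, 32) (32, 64)
```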
Hardware
- Nvidia unveils a small form-factor compute platform (suitable for robotics).
- Raven Resonance is another attempt to deliver augmented reality glasses.
Robots
- Apptronik are partnering with Google DeepMind to bring humanoid robots to fruition a bit faster.
- Figure claims they are now revenue-generating, as they are delivering real robots to a paying client.
- PaXini is building TORA-ONE, a wheeled humanoid with dexterous hands.
- Unitree B2-W (wheeled quadruped) is now available for purchase ($150,000 USD). It seems highly capable.
- Some researchers are using video diffusion models (which can predict future frames) as a robot policy: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations (preprint). They show an example of a robot performing chemistry experiments.
- The electric Atlas (Boston Dynamics) can do a backflip (even while wearing clothes).
- Apptronik claim their humanoid robots are doing real work in a fulfillment warehouse.