General
Research Insights
- Some contrasting results on how reasoning LLMs operate:
- Large Language Models Think Too Fast To Explore Effectively. LLMs often make decisions prematurely, before having sufficiently explored the space of possibilities.
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs. In the analysis of reasoning (chain-of-thought) models, they often fail because of “underthinking”: abandoning a promising/correct chain of logic before taking it all the way to the solution.
- More generally, we should expect that tuning the balance of depth vs. breadth in search will matter. This balance may arise naturally as models are trained on more reasoning traces, or it may need to be tuned manually.
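To make the depth-vs-breadth tradeoff concrete, here is a toy best-first search (not from any of the papers above, just an illustration) where a `breadth` parameter controls how many candidates survive each level; `breadth=1` is pure depth with no exploration:

```python
def search(start, target, breadth, max_depth):
    """Toy best-first search over +1 / *2 moves from `start` to `target`.

    `breadth` caps how many candidates survive each level:
    breadth=1 is greedy (all depth, no breadth); larger values
    trade some depth for exploration of alternatives.
    """
    frontier = [start]
    for depth in range(max_depth):
        if target in frontier:
            return depth  # number of moves used
        children = {c for f in frontier for c in (f + 1, f * 2)}
        # keep only the `breadth` candidates closest to the target
        frontier = sorted(children, key=lambda x: abs(x - target))[:breadth]
    return None
```

For example, reaching 10 from 1 takes the greedy search (`breadth=1`) five moves (1 → 2 → 4 → 8 → 9 → 10), while `breadth=3` uncovers a four-move path (1 → 2 → 4 → 5 → 10) that greediness skips past.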
- Updates on model training:
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate. By learning to critique effectively, answer-style responses also improve.
- Another approach to doing reasoning in latent space: Efficient Reasoning with Hidden Thinking (code).
- Low-Rank Adapting Models for Sparse Autoencoders. They improve performance of SAE by adapting the model to the SAE (rather than just extracting the SAE from the model).
- Language Models Use Trigonometry to Do Addition. Adds to a growing body of research showing how the latent space of LLMs exploits geometric arrangements to store information and do information processing.
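As a toy illustration of the underlying idea (not the paper's actual probing setup), numbers can be stored as angles on a circle, and addition then becomes composition of rotations, with the result read back via `arctan2`:

```python
import numpy as np

def encode(n, period=10):
    # represent n as a point on a circle with the given period
    theta = 2 * np.pi * n / period
    return np.array([np.cos(theta), np.sin(theta)])

def rotate_add(vec, m, period=10):
    # adding m corresponds to rotating by m steps around the circle
    theta = 2 * np.pi * m / period
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ vec

def decode(vec, period=10):
    # recover the represented number (mod period) from the angle
    theta = np.arctan2(vec[1], vec[0])
    return round(theta / (2 * np.pi) * period) % period
```

Here `decode(rotate_add(encode(7), 5))` recovers 2, i.e. 7 + 5 mod 10, without any digit-wise arithmetic, which is the geometric flavor of computation the paper reports.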
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. They introduce a new reasoning benchmark where complexity can be tuned, and use it to show that LLMs struggle as complexity increases. Larger/better models, and more inference compute, yield improved reasoning. But sufficiently high complexity inevitably confounds them.
- Demystifying Long Chain-of-Thought Reasoning in LLMs.
LLM
- Nvidia is now hosting DeepSeek-R1, available through their API.
- OpenAI releases o3-mini, a powerful reasoning model that leverages inference-time compute.
- Open-R1 is an attempt to reproduce the DeepSeek-R1 model/result/method in a fully open manner. Their first update shows progress in replicating DeepSeek’s results.
- s1: Simple test-time scaling. They investigate the simplest possible inference-time compute method for increasing reasoning: they arbitrarily insert “Wait” tokens when the model tries to complete its response. This forces it to reconsider and think longer, yielding gains that scale with compute.
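A minimal sketch of this "budget forcing" idea, with `model_step` as a stub standing in for one autoregressive decoding step (the token names and function shape here are assumptions for illustration, not s1's actual implementation):

```python
END_THINK = "</think>"  # assumed end-of-reasoning marker

def budget_force(model_step, prompt_tokens, min_tokens, max_tokens):
    """Decode token by token; if the model tries to stop reasoning
    before `min_tokens`, suppress the end marker and append "Wait"
    to force it to keep thinking."""
    tokens = list(prompt_tokens)
    generated = 0
    while generated < max_tokens:
        tok = model_step(tokens)
        if tok == END_THINK and generated < min_tokens:
            # stopped too early: replace the end marker with "Wait"
            tokens.append("Wait")
        else:
            tokens.append(tok)
            if tok == END_THINK:
                break
        generated += 1
    return tokens
```

The appeal is that the thinking budget becomes a single scalar knob (`min_tokens`) applied purely at inference time, with no retraining.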
- ACECODER: Acing Coder RL via Automated Test-Case Synthesis. By synthesizing test cases automatically, one obtains verifiable rewards for code RL; this is another way to think about expending post-training (but pre-inference) compute to improve a system.
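The core ingredient in such a pipeline is a verifiable reward: run each candidate program against the synthesized test cases and use the pass rate as the RL reward signal. A minimal sketch (the `solution` entry-point name and the `(args, expected)` test format are assumptions, not ACECODER's actual interface):

```python
def pass_rate_reward(candidate_src, test_cases):
    """Score a candidate program by the fraction of synthesized
    (args, expected) test cases it passes; 0.0 if it fails to run."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # untrusted code: sandbox in practice
    except Exception:
        return 0.0
    fn = namespace.get("solution")
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply scores as a failure
    return passed / len(test_cases)
```

A real system would execute candidates in a sandboxed subprocess with timeouts; this sketch only shows how the scalar reward is derived.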
- Google releases Gemini 2.0 broadly. Although not the top models in raw benchmark scores, this set of models seems to establish a new record for the Pareto tradeoff between performance and inference cost.
Safety
- Anthropic paper: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. Associated blog post, and demo where you can try to break past the barriers.
AI Agents
- Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks.
- Replit launches an agent/app that allows you to make a customized mobile app without coding (examples).
- OpenAI announces their second agentic product: Deep Research conducts web searches on a topic of choice, preparing a detailed report. A query can run for 2-30 minutes as it iteratively seeks information. This approach reaches a record-setting 26.6% on the recently-released (and very challenging) Humanity’s Last Exam benchmark.
- Ethan Mollick finds the system very capable, and provides additional thoughts: The End of Search, The Beginning of Research.
- This capability is thematically similar to what Perplexity and Google’s Deep Research do. However, OpenAI’s approach seems to leverage a reasoning model (presumably a variant of o3-mini) to iteratively work on the research problem.
- Open-source equivalents of OpenAI’s Deep Research are being developed:
- Exa AI Labs has released web-search agents, including one powered by DeepSeek-R1 (code), and one powered by o3-mini (code).
- Firecrawl is working on an agent that will reason over web data.
- OpenDeepResearcher (by Matt Shumer), iterative searching.
- open-deep-research (by nickscamara), uses Firecrawl and a reasoning model for deep web search.
- deep-research (by dzhng), research assistant.
- open-Deep-Research (by huggingface, code here).
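These tools all share roughly the same loop: search, accumulate notes, let a reasoning model propose follow-up queries, and repeat until nothing is left to chase. A sketch of that shared pattern, with the web-search API and LLM calls stubbed out as caller-supplied functions (the function names are placeholders, not any tool's actual interface):

```python
def deep_research(question, search, propose_followups, summarize,
                  max_rounds=3):
    """Iterative research loop.

    search(query) -> list of result snippets (stub for a search API)
    propose_followups(question, notes) -> list of new queries
        (stub for a reasoning-model call; empty list means done)
    summarize(question, notes) -> final report (stub for an LLM call)
    """
    notes, queries = [], [question]
    for _ in range(max_rounds):
        if not queries:
            break  # the model decided it has enough information
        results = [r for q in queries for r in search(q)]
        notes.extend(results)
        queries = propose_followups(question, notes)
    return summarize(question, notes)
```

The differences between the projects above lie mostly in which search backend and which reasoning model fill these slots, and in how aggressively follow-up queries are generated.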
Vision
Video
- Meta publishes: VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models. Many video generation methods focus on appearance, not motion. A simple change to the prediction, to more strongly bias towards dynamics, improves output without changes in dataset or scaling (example videos).
- ByteDance presents a method for video generation from a single image, including motion and synchronized voice: OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (more examples).
- Pika’s new Pikadditions allow new elements to be added to real video.
Voice
- Google Search Lab is testing a bot that can navigate phone trees on your behalf.
- British mobile operator O2 has created an AI voice bot intended to waste scammers' time by posing as an inept, easily confused victim.
Robots
- Nvidia and CMU publish: ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills. This method enables much more agile and human-like motion (videos).
- Embrace Collisions: Humanoid Shadowing for Deployable Contact-Agnostics Motions. Proper handling of ground contact (and uncertainty thereof) allows improved motion, such as recovering from a ground position.
- TechCrunch reports: Figure drops OpenAI in favor of in-house models.