General
Research Insights
- Some contrasting results on how reasoning LLMs operate:
- Large Language Models Think Too Fast To Explore Effectively. LLMs often make decisions prematurely, before having sufficiently explored the space of possibilities.
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs. In the analysis of reasoning (chain-of-thought) models, they often fail because of “underthinking”: abandoning a promising/correct chain of logic before taking it all the way to the solution.
- More generally, we should expect that tuning the balance of depth vs. breadth in search will matter. This balance may arise naturally as models are trained on more reasoning traces, or it may need to be tuned manually.
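To make the depth-vs-breadth tradeoff concrete, here is a toy best-first search (not from any of the papers above, just an illustration) where a `breadth` parameter controls how many candidates survive each level; `breadth=1` is pure depth with no exploration:

```python
def search(start, target, breadth, max_depth):
    """Toy best-first search over +1 / *2 moves from `start` to `target`.

    `breadth` caps how many candidates survive each level:
    breadth=1 is greedy (all depth, no breadth); larger values
    trade some depth for exploration of alternatives.
    """
    frontier = [start]
    for depth in range(max_depth):
        if target in frontier:
            return depth  # number of moves used
        children = {c for f in frontier for c in (f + 1, f * 2)}
        # keep only the `breadth` candidates closest to the target
        frontier = sorted(children, key=lambda x: abs(x - target))[:breadth]
    return None
```

For example, reaching 10 from 1 takes the greedy search (`breadth=1`) five moves (1 → 2 → 4 → 8 → 9 → 10), while `breadth=3` uncovers a four-move path (1 → 2 → 4 → 5 → 10) that greediness skips past.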
- Updates on model training:
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate. By learning to critique effectively, answer-style responses also improve.
- Another approach to doing reasoning in latent space: Efficient Reasoning with Hidden Thinking (code).
- Low-Rank Adapting Models for Sparse Autoencoders. They improve performance of SAE by adapting the model to the SAE (rather than just extracting the SAE from the model).
- Language Models Use Trigonometry to Do Addition. Adds to a growing body of research showing how the latent space of LLMs exploits geometric arrangements to store information and do information processing.
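As a toy illustration of the underlying idea (not the paper's actual probing setup), numbers can be stored as angles on a circle, and addition then becomes composition of rotations, with the result read back via `arctan2`:

```python
import numpy as np

def encode(n, period=10):
    # represent n as a point on a circle with the given period
    theta = 2 * np.pi * n / period
    return np.array([np.cos(theta), np.sin(theta)])

def rotate_add(vec, m, period=10):
    # adding m corresponds to rotating by m steps around the circle
    theta = 2 * np.pi * m / period
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ vec

def decode(vec, period=10):
    # recover the represented number (mod period) from the angle
    theta = np.arctan2(vec[1], vec[0])
    return round(theta / (2 * np.pi) * period) % period
```

Here `decode(rotate_add(encode(7), 5))` recovers 2, i.e. 7 + 5 mod 10, without any digit-wise arithmetic, which is the geometric flavor of computation the paper reports.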
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. They introduce a new reasoning benchmark where complexity can be tuned, and use it to show that LLMs struggle as complexity increases. Larger/better models, and more inference compute, yield improved reasoning. But sufficiently high complexity inevitably confounds them.
- Demystifying Long Chain-of-Thought Reasoning in LLMs.
LLM
- Nvidia is now hosting DeepSeek-R1, available through their API.
- OpenAI releases o3-mini, a powerful reasoning model that leverages inference-time compute.
- Open-R1 is an attempt to reproduce the DeepSeek-R1 model/result/method in a fully open manner. Their first update shows progress in replicating DeepSeek’s results.
- s1: Simple test-time scaling. They investigate the simplest possible inference-time compute method for increasing reasoning: they arbitrarily insert “Wait” tokens when the model tries to complete its response. This forces it to reconsider and think longer, yielding gains that scale with compute.
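A minimal sketch of this "budget forcing" idea, with `model_step` as a stub standing in for one autoregressive decoding step (the token names and function shape here are assumptions for illustration, not s1's actual implementation):

```python
END_THINK = "</think>"  # assumed end-of-reasoning marker

def budget_force(model_step, prompt_tokens, min_tokens, max_tokens):
    """Decode token by token; if the model tries to stop reasoning
    before `min_tokens`, suppress the end marker and append "Wait"
    to force it to keep thinking."""
    tokens = list(prompt_tokens)
    generated = 0
    while generated < max_tokens:
        tok = model_step(tokens)
        if tok == END_THINK and generated < min_tokens:
            # stopped too early: replace the end marker with "Wait"
            tokens.append("Wait")
        else:
            tokens.append(tok)
            if tok == END_THINK:
                break
        generated += 1
    return tokens
```

The appeal is that the thinking budget becomes a single scalar knob (`min_tokens`) applied purely at inference time, with no retraining.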
- ACECODER: Acing Coder RL via Automated Test-Case Synthesis. By synthesizing test cases automatically, one obtains verifiable rewards for code RL; this is another way to think about expending post-training (but pre-inference) compute to improve a system.
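The core ingredient in such a pipeline is a verifiable reward: run each candidate program against the synthesized test cases and use the pass rate as the RL reward signal. A minimal sketch (the `solution` entry-point name and the `(args, expected)` test format are assumptions, not ACECODER's actual interface):

```python
def pass_rate_reward(candidate_src, test_cases):
    """Score a candidate program by the fraction of synthesized
    (args, expected) test cases it passes; 0.0 if it fails to run."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # untrusted code: sandbox in practice
    except Exception:
        return 0.0
    fn = namespace.get("solution")
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply scores as a failure
    return passed / len(test_cases)
```

A real system would execute candidates in a sandboxed subprocess with timeouts; this sketch only shows how the scalar reward is derived.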
- Google releases Gemini 2.0 broadly. Although not the top models in raw benchmark scores, this set of models seems to establish a new record for the Pareto tradeoff between performance and inference cost.
Safety
- Anthropic paper: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. Associated blog post, and demo where you can try to break past the barriers.
AI Agents
- Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks.
- Replit launches an agent/app that allows you to make a customized mobile app without coding (examples).
- OpenAI announces their second agentic product: Deep Research conducts web searches on a topic of choice, preparing a detailed report. A query can run for 2-30 minutes as it iteratively seeks information. This approach reaches a record-setting 26.6% on the recently-released (and very challenging) Humanity’s Last Exam benchmark.
- Ethan Mollick finds the system very capable, and provides additional thoughts: The End of Search, The Beginning of Research.
- This capability is thematically similar to what Perplexity and Google’s Deep Research do. However, OpenAI’s approach seems to leverage a reasoning model (presumably a variant of o3-mini) to iteratively work on the research problem.
- Open-source equivalents of OpenAI’s Deep Research are being developed:
- Exa AI Labs has released web-search agents, including one powered by DeepSeek-R1 (code), and one powered by o3-mini (code).
- Firecrawl is working on an agent that will reason over web data.
- OpenDeepResearcher (by Matt Shumer), iterative searching.
- open-deep-research (by nickscamara), uses Firecrawl and a reasoning model for deep web search.
- deep-research (by dzhng), research assistant.
- open-Deep-Research (by huggingface, code here).
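These tools all share roughly the same loop: search, accumulate notes, let a reasoning model propose follow-up queries, and repeat until nothing is left to chase. A sketch of that shared pattern, with the web-search API and LLM calls stubbed out as caller-supplied functions (the function names are placeholders, not any tool's actual interface):

```python
def deep_research(question, search, propose_followups, summarize,
                  max_rounds=3):
    """Iterative research loop.

    search(query) -> list of result snippets (stub for a search API)
    propose_followups(question, notes) -> list of new queries
        (stub for a reasoning-model call; empty list means done)
    summarize(question, notes) -> final report (stub for an LLM call)
    """
    notes, queries = [], [question]
    for _ in range(max_rounds):
        if not queries:
            break  # the model decided it has enough information
        results = [r for q in queries for r in search(q)]
        notes.extend(results)
        queries = propose_followups(question, notes)
    return summarize(question, notes)
```

The differences between the projects above lie mostly in which search backend and which reasoning model fill these slots, and in how aggressively follow-up queries are generated.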
Vision
Video
- Meta publishes: VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models. Many video generation methods focus on appearance, not motion. A simple change to the prediction, to more strongly bias towards dynamics, improves output without changes in dataset or scaling (example videos).
- ByteDance presents a method for video generation from a single image, including motion and synchronized voice: OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models (more examples).
- Pika’s new Pikadditions allow new elements to be added to real video.
Voice
- Google Search Lab is testing a bot that can navigate phone trees on your behalf.
- British mobile operator O2 has created an AI voice bot intended to waste scammers' time by posing as an inept, easily confused victim.
Robots
- Nvidia and CMU publish: ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills. This method enables much more agile and human-like motion (videos).
- Embrace Collisions: Humanoid Shadowing for Deployable Contact-Agnostics Motions. Proper handling of ground contact (and uncertainty thereof) allows improved motion, such as recovering from a ground position.
- TechCrunch reports: Figure drops OpenAI in favor of in-house models.