General
- Detailed introduction (200-page ebook): Foundations of Large Language Models.
- Inference Magazine is a new publication on AI progress, with many interesting articles.
- OpenAI has announced (with the White House) a partnership called The Stargate Project. A consortium will invest $500 billion ($100 billion immediately) to build AI infrastructure in the United States.
- Google agrees to a new $1 billion investment in Anthropic. This adds to Google’s existing $2B investment (through which it owns 10% of Anthropic), and expands a cloud contract. This appears to be in addition to Anthropic’s ongoing effort to raise another $2B (at a $60B valuation).
Research Insights
- The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation. They report a counter-intuitive result: intentionally over-fitting a trained LLM on a small set of samples yields improvements on long-generation tasks, rather than the low performance (e.g. repetition) one typically associates with over-fitting (a minimal sketch appears at the end of this list).
- Some say that this result is obvious, in that the optimization signal (loss, perplexity, etc.) is just a proxy for the actual desired performance (token accuracy).
- Do generative video models learn physical principles from watching videos? (project, code) They find that some aspects of physics are not learned, and that strong visual fidelity does not guarantee that the underlying physics have been captured.
- Google: Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments.
- Physics of Skill Learning. The authors try to provide intuition about the learning process, using a succession of heuristics with different levels of detail.
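A minimal sketch of the hyperfitting recipe from the first item above: keep fine-tuning an already-trained LLM on a small, fixed set of samples until the training loss is near zero. The model name, data, and hyperparameters below are illustrative assumptions rather than the authors' settings.

```python
# Hyperfitting sketch: deliberately over-fit a pretrained LLM on a tiny, fixed sample set.
# Model, data, and hyperparameters are illustrative assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses larger open models
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name).train()

samples = [
    "A small, fixed set of ordinary text passages goes here ...",
    "Each sample is reused every epoch, far past the point of memorization.",
]
batch = tok(samples, return_tensors="pt", padding=True, truncation=True, max_length=256)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(50):  # keep going well after training loss collapses toward zero
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The paper's claim: greedy decoding from the hyperfitted model now yields longer,
# less repetitive open-ended generations than the original checkpoint.
```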
LLM
- OpenAI has finished safety testing of o3-mini and is preparing to release it in the coming weeks. o3-mini is reportedly worse than o1-pro, but much faster.
- Deepwriter AI claims their system has written an entire 203-page book without human involvement. Generation involved 1,100 API calls to Gemini Flash-Exp 2.0 and took ~4 hours.
- The book: The SaaS Crucible: Strategic Warfare for Underdog SaaS Startups.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
- They present two models: DeepSeek-R1-Zero and DeepSeek-R1; the former trained using reinforcement learning, the latter improving on this using additional data. They claim performance competitive with o1-mini or even o1.
- They also released 6 distilled models (based on Llama or Qwen).
- Available via Ollama (a usage sketch appears at the end of this section).
- Kimi releases a similar report on the power of RL for improving reasoning in LLMs: Kimi k1.5: Scaling Reinforcement Learning with LLMs.
- DeepLearning.ai have released a course on how to use Anthropic’s Computer Use mode.
- OpenAI announces Operator (launch video), a computer-use agent that can conduct tasks in a virtualized web browser instance.
- Anthropic adds “Citations”, a RAG implementation available through the API (see the sketch below).
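Regarding the DeepSeek-R1 distills above: assuming Ollama is installed and a model has been pulled (e.g. `ollama pull deepseek-r1`; the exact tag is an assumption, check the Ollama model library), a quick query through the official Python client might look like the following sketch.

```python
# Query a locally served DeepSeek-R1 model through the ollama Python client.
# Requires a running Ollama server and a pulled model; the tag "deepseek-r1" is an assumption.
import ollama

response = ollama.chat(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
)
# R1-style models emit their chain of thought before the final answer.
print(response["message"]["content"])
```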
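For the Citations item, here is a rough sketch of how the feature is exposed through the Messages API, based on my reading of Anthropic's documentation (treat the exact field names as assumptions and verify against the current API reference): you attach a document content block with citations enabled, and the response text blocks then carry citation metadata pointing back into that document.

```python
# Sketch of Anthropic's Citations feature via the Messages API.
# Field names follow the docs as I understand them; verify against the current API reference.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": "The grass is green. The sky is blue.",
                },
                "citations": {"enabled": True},
            },
            {"type": "text", "text": "What color is the grass?"},
        ],
    }],
)
for block in message.content:
    # Text blocks may carry a `citations` list locating the supporting passage in the document.
    print(block.text, getattr(block, "citations", None))
```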
Safety
- OpenAI: Trading Inference-Time Compute for Adversarial Robustness (full paper). The results suggest that inference-time compute can be used to improve safety (guardrails, alignment, etc.). This makes sense, given that inference-time compute increases capabilities, and alignment can be viewed as a particular kind of capability (producing the desired response).
Image Synthesis
- Runway ML releases access to Frames, an image model.
- Google DeepMind reports: Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (preprint). The take-home message is that inference-time scaling improves image synthesis in a reliable way, similar to how it improves text generation (e.g. reasoning). They apply a search process to find noise that yields a better generation (a minimal sketch follows).
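The simplest instance of this idea is a best-of-N search over the initial noise: draw several seeds, generate an image from each, score the candidates with a verifier, and keep the best. Below is a minimal sketch using the diffusers library with CLIP similarity as a stand-in verifier; the paper explores stronger verifiers and more elaborate search algorithms, so treat this as an illustration of the concept rather than their method.

```python
# Best-of-N search over initial noise: a minimal stand-in for inference-time scaling
# of diffusion models. CLIP similarity is used as a simple verifier here (an assumption;
# the paper studies several verifiers and search strategies).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to(device)  # any text-to-image diffusion pipeline works
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image) -> float:
    """Image-text similarity; higher means the image matches the prompt better."""
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        return clip(**inputs).logits_per_image.item()

prompt = "a photo of an astronaut riding a horse on the moon"
candidates = []
for seed in range(8):  # more seeds = more inference compute = a better expected best-of-N
    generator = torch.Generator(device).manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    candidates.append((clip_score(prompt, image), seed, image))

best_score, best_seed, best_image = max(candidates, key=lambda c: c[0])
best_image.save("best_of_n.png")
```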
Video
- Example of using Hunyuan vid2vid to replace an actor in a scene.
- Netflix releases: Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise. A video model that allows controllable animations.
- Hailuo “Subject Reference” is enabling consistent characters in video generations (examples).
- Video Depth Anything: Consistent Depth Estimation for Super-Long Videos.
Audio
- Bland AI (now bland.com) is running a publicity stunt where you can call their AI on your phone, and after 10-60 seconds of talking, it will clone your voice and start talking to you in your own voice. Intentionally unnerving, and a good reminder that we must now be skeptical of suspicious phone calls (even if they sound like loved ones), and that banks should stop using voice-print as a security factor.
Science
- Published: Simulating 500 million years of evolution with a language model. (This was previously released as a preprint.) The ESM3 foundation model is trained on sequence, structure, and function of proteins. You can (e.g.) input a desired function and it will generate a candidate protein.
- OpenAI has created an AI model for longevity science. More specifically, GPT-4b micro was trained to predict variants of protein factors with increased/controlled function. Since this model is not yet broadly available, we can’t estimate the utility. But it reinforces the notion that there is still plenty of opportunity space for tuned/task-specific advances wherever we have data and compute.
Robots
- A video of a nimble wheeled quadruped built by DEEP Robotics.