AI News 2025-05-15

General

Research Insights

LLM

  • OpenAI adds o4-mini to its reinforcement fine-tuning API.
  • ByteDance releases SeedCoder 8B.
  • OpenAI adds GPT-4.1 to the ChatGPT web product.
  • OpenAI releases HealthBench. In addition to providing a useful way to track progress on LLMs for healthcare applications, the current results demonstrate just how effective existing LLMs can be in this application space.

Agents

Safety

Audio

Video

World Synthesis

  • Enigma Labs claims to have made the first multiplayer AI-generated video game (a multiplayer car racing game), and says it will open-source the work eventually. Although the gameplay video shows crude graphics, it is further evidence that generative environments are a key part of future entertainment.

Science

Hardware

Robots

  • Tesla shows a video of its Optimus robot dancing. Fluid motion like this tests the limits of hardware and software (latency, real-time compensation, etc.).
Posted in AI, News | Tagged , , , , , , , | Leave a comment

AI News 2025-05-08

General

Research Insights

LLM

Audio

Video

Brain

Robots


AI News 2025-05-01

General

Research Insights

LLM

Safety

Audio

Image Synthesis

  • Freepik and Fal announce F-Lite (tech report), an open-source image model (10B, trained on 80M images).
  • Midjourney has pushed updates to the v7 model (improving quality and coherence), added an experimental aesthetic intensity parameter, and launched a new omni-reference feature (example outputs).

Video

  • Runway rolls out its references feature to all paying users, allowing one to include specific characters/environments/elements in generations.

Science

Robots


AI News 2025-04-24

General

Research Insights

LLM

AI Agents

Audio

  • Nari Labs Dia is a text-to-speech (TTS) model that can generate remarkably realistic and emotional output (example).

Video

Hardware

  • Google demos next-generation smart glasses with AI integration (TED talk).

AI Impact Predictions

Debates about future AI progress and impact are often confused, because different people have very different mental models for the expected pace, and for the time horizon over which they are projecting.

This figure is my attempt to clarify:

The experimental datapoints come from the METR analysis: Measuring AI Ability to Complete Long Tasks (paper, code/data). The “count the OOMs” and “new regime” curves are extrapolated fits to the data. The other curves are ad-hoc, drawn just to give a sense of how a particular mental model might translate to capability-predictions.
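
The “count the OOMs” extrapolation amounts to a log-linear fit: assume the task horizon an AI can complete doubles every fixed interval, and estimate that interval from the data. A minimal sketch of that fit, using made-up illustrative datapoints (not the actual METR values, which are in their paper/data):

```python
import math

# Hypothetical datapoints: (years since some reference date, task length in
# minutes an AI can complete at 50% reliability). Illustrative only; the real
# values come from the METR dataset.
data = [(0.0, 0.05), (1.5, 0.5), (3.0, 4.0), (4.5, 15.0), (5.5, 60.0)]

# "Count the OOMs" model: log2(task length) grows linearly with time,
# i.e. the task horizon doubles every fixed interval. Fit by least squares
# on the log-transformed task lengths.
n = len(data)
xs = [t for t, _ in data]
ys = [math.log2(m) for _, m in data]
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Months per doubling of the task horizon (~6-7 months for these made-up points).
doubling_months = 12.0 / slope

def predicted_minutes(years: float) -> float:
    """Extrapolated task horizon (minutes) at a given time under this model."""
    return 2.0 ** (intercept + slope * years)
```

The other curves in the figure (faster or slower regimes) correspond to letting the slope itself change over time rather than staying constant.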

The figure tries to emphasize:

  • Task complexity covers many orders-of-magnitude. Although imperfect, we can think about the timescale over which “coherent progress” must be made as a proxy for measuring generally useful capabilities.
  • There are many models for progress, and they vary dramatically in predictions.
  • Nevertheless, except for scenarios that fundamentally doubt AI progress is possible, the main disagreement among models is over the timescale required to reach a given kind of impact.
  • The concerns one has (economic, social, existential) will depend on one’s model. (Of course one’s concerns will also be influenced by other assessments, such as the wisdom we expect leaders to exhibit at different stages of rollout.)
  • It is difficult to define intelligence. Yet it seems quite defensible to say that we have transitioned from clearly sub-human AI into a “jagged intelligence” regime, where a particular AI system will outperform humans on some tasks (e.g. rapid knowledge retrieval) but underperform on others (e.g. visual reasoning). As we move through the jagged frontier, we should expect more and more human capabilities to be replicated in AI, even while some other subset remains unconquered.
  • The definition of “AGI” is also unclear. Instead of a clear line being crossed, we should expect a greater fraction of people to acknowledge AI as generally-capable, as systems cross through the jagged frontier.

The primary goal of the figure is to clarify discussions: we should specify which kinds of scenarios we find plausible, which impacts are thus considered possible, and which time span we are currently discussing.


AI News 2025-04-17

General

Research Insights

LLM

  • Zyphra releases an open-source reasoning model: ZR1-1.5B (weights, try using).
  • Anthropic adds a Research capability and Google Workspace integration to Claude.
  • OpenAI announces GPT-4.1 models in the API. Optimized for developers (instruction following, coding, diff generation, etc.), 1M context length, etc.; three models (4.1, 4.1-mini, 4.1-nano) provide control of performance vs. cost. Models can handle text, image, and video.
    • They also have a prompting guide for 4.1.
    • OpenAI has released a new eval for long-context performance: MRCR.
    • OpenAI intends to deprecate GPT-4.5 in the next few months.
  • OpenAI announces o3 and o4-mini reasoning models.
    • These models are explicitly trained to use tools as part of their reasoning process.
    • They can reason over images in new ways.
    • Improved scores on math and code benchmarks (91-98% on AIME, ~75% on scientific figure reasoning, etc.).
    • o3 is strictly better than o1 (higher performance with lower inference cost); o1 will be deprecated.
    • OpenAI will be releasing coding agent applications, starting with Codex CLI, which allows one to deploy coding agents easily.
    • METR has provided evaluations of capabilities.
    • As part of the release, they also provided data showing how scaling RL is yielding predictable improvements.

Safety

Video

Audio

Science


AI News 2025-04-10

General

Research Insights

LLM

  • More progress in diffusion language models: Dream 7B, billed as the most powerful open diffusion large language model to date.
  • Meta releases the Llama 4 series of MoE LLMs: Scout (109B, 17B active, 16 experts), Maverick (400B, 17B active, 128 experts), and Behemoth (2T, 288B active, 16 experts). Scout supports a 10M token context. The models appear to be competitive (nearing the state-of-the-art tradeoff curve for performance/price), and thus extremely impressive for open-source.
    • Independent evals (including follow-up) from Artificial Analysis show it performing well against non-reasoning models.
    • Evaluation of the 10M context on simple NIAH seems reasonable, but (reportedly) it does not fare as well on deeper understanding of long context.
  • Cloudflare launches an open beta for its AutoRAG solution.
  • Nvidia releases Llama-3_1-Nemotron-Ultra-253B-v1, which seems to beat Llama 4 despite being based on Llama 3.1.
  • Amazon announces Nova Sonic speech-to-speech foundation models, for building conversational AI.
  • Agentica releases DeepCoder-14B-Preview, an open-source reasoning model optimized for coding (code, hf).
  • Anthropic announces a new “Max” plan for Claude ($100/month).
  • xAI releases an API for Grok-3. Pricing appears relatively expensive (e.g. compared to Gemini models with better performance).
  • OpenAI adds an evals API, making it easier to programmatically define tests, evaluations, etc. This should make it faster/easier to test different prompts, LLMs, etc.
  • ByteDance releases a technical report for Seed-Thinking-v1.5, a 200B reasoning model.
  • OpenAI adds a memory feature to ChatGPT, allowing it to reference all past chats in order to personalize responses.

AI Agents

Audio

Image Synthesis

Video

World Synthesis

Science

Brain

Hardware

Robots


AI News 2025-04-03

General

Research Insights

Safety

LLM

  • OpenAI pushed an update to their 4o model. This has significantly improved its ranking (e.g. now best non-reasoning model on coding benchmark).
  • An interesting test of GPT-4o in-context image generation: it is unable to generate an image of a maze with a valid solution, at least when the maze is a square. However, if you ask it to make an image of a diamond-orientation maze (a 45°-rotated square), it succeeds in producing a valid solution. We can rationalize this based on the sequential order of autoregressive generation. By generating first from the start of the maze (and only its local neighborhood), and similarly finishing with this sort of locality, the model can more correctly build a valid solution. (Conversely, the usual square orientation requires longer-range reasoning across image patches.)
    • At first, this might seem like just another silly oddity. But it shows how recasting a problem, just by changing the generation order, can massively change model performance. This sheds light on how they “think” and suggests that alternate generation strategies could perhaps unlock capabilities.
      • For instance, one could imagine an LLM with different branches (like MoE?) where each branch is trained on a different autoregression strategy (left-to-right, right-to-left, block diffusion, random, etc.) such that the overall LLM can invoke/combine different kinds of thinking modes.
    • Another trick is to ask it to generate an image of a maze with the solution identified, and then update the image to remove the solution. This is a visual analog of “think step-by-step” and other inference-time-compute strategies. This implies that current models have untapped visual reasoning capabilities that could be unlocked by allowing them to visually iterate on problems.
  • Anthropic announces Claude for Education, which provides a university-wide solution tailored to education.
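
The solve-then-erase trick can be made concrete with a toy ASCII analog: first produce the maze with its solution marked (the stage where correctness is easy to establish), then derive the final unmarked maze from that intermediate artifact. A sketch under illustrative assumptions (the maze layout and helper names are made up, not anything from the original post):

```python
from collections import deque

# A tiny hardcoded maze: 'S' start, 'E' end, '#' wall, '.' open.
MAZE = [
    "S.###",
    ".####",
    "....#",
    "###.#",
    "###.E",
]

def solve(maze):
    """BFS over open cells; returns the set of cells on the start-to-end path."""
    h, w = len(maze), len(maze[0])
    start = next((r, c) for r in range(h) for c in range(w) if maze[r][c] == "S")
    end = next((r, c) for r in range(h) for c in range(w) if maze[r][c] == "E")
    prev = {start: None}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == end:
            break
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and maze[nr][nc] != "#" and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    path, cur = set(), end
    while cur is not None:
        path.add(cur)
        cur = prev[cur]
    return path

def render(maze, path, show_solution):
    """Draw the maze, marking path cells with '*' only when show_solution is set."""
    return "\n".join(
        "".join(
            "*" if show_solution and (r, c) in path and ch == "." else ch
            for c, ch in enumerate(row)
        )
        for r, row in enumerate(maze)
    )

path = solve(MAZE)
with_solution = render(MAZE, path, True)   # stage 1: maze with marked path
final = render(MAZE, path, False)          # stage 2: erase the marks
```

The analogy: stage 1 is the externalized reasoning (the "scratch" image), and stage 2 is a trivial transformation of it, just as erasing the drawn solution is trivial for an image model once the solved intermediate exists.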

AI Agents

Audio

Image Synthesis

Video

Science

Robot


AI News 2025-03-27

General

Research Insights

LLM

Multimodal

AI Agents

Safety

  • Superalignment with Dynamic Human Values. This work treats alignment as a dynamic problem, where human values may change over time. The proposed solution involves an AI that breaks tasks into smaller components that are easier for humans to guide. The framework assumes that alignment on sub-tasks correctly generalizes to desirable outcomes for the overall task.
  • Google DeepMind: Defeating Prompt Injections by Design.

Audio

  • OpenAI announces new audio models: new text-to-speech models (test here) that one can instruct on how to speak; and gpt-4o-transcribe, which has a lower error rate than Whisper (including a mini variant that is half the cost of Whisper).
  • OpenAI updates its advanced voice mode, making it better at not interrupting the user.

Image Synthesis

  • Tokenize Image as a Set (code). An interesting approach that uses an unordered bag of tokens (rather than a serialization, as is done with text) to represent images.
  • StarVector is a generative model for converting text or images to SVG code.
  • Applying mechanistic interpretability to image synthesis models can offer enhanced control: Unboxing SDXL Turbo: How Sparse Autoencoders Unlock the Inner Workings of Text-to-Image Models (preprint, examples).
  • The era of in-context and/or autoregressive image generation is upon us. In-context generation means the LLM can directly understand and edit photos (colorize, restyle, make changes, remove watermarks, etc.). Serial autoregressive approaches also handle text and prescribed layout much better, and often have improved prompt adherence.
    • Last week, Google unveiled Gemini 2.0 Flash Experimental image generation (available in Google AI Studio).
    • Reve Image reveals that the mysterious high-scoring “halfmoon” is its image model, which apparently exploits some kind of “logic” (an autoregressive model? inference-time compute?) to improve output.
    • OpenAI releases its new image model: 4o image generation. It can generate highly coherent text in images, and iterate upon images in-context.
      • This led to a one-day Ghibli-themed spontaneous meme explosion.
      • It is interesting to see how it handles generating a map with walking directions. There are mistakes. But the quality is remarkable. The map itself is mostly just memorization, but the roughly-correct walking directions and time estimation point towards a more generalized underlying understanding.

Video

  • SkyReels is offering AI tools to cover the entire workflow (script, video, editing).
  • Pika is testing a new feature that allows one to edit existing video (e.g. animating an object).

World Synthesis

Science

Hardware

  • Halliday: smart glasses intended for AI integration ($430).

Robots

  • Unitree shows a video of smooth athletic movement.
  • Figure reports on using reinforcement learning in simulation to greatly improve the walking of their humanoid robot, providing it with a better (faster, more efficient, more humanlike) gait.
  • Google DeepMind paper: Gemini Robotics: Bringing AI into the Physical World. They present a vision-language-action model capable of directly controlling robots.

AI News 2025-03-20

General

Research Insights

LLM

  • Baidu announces Ernie 4.5 and X1 (use here). They claim that Ernie 4.5 is comparable to GPT-4o, and that X1 is comparable to DeepSeek R1, but with lower API costs (Ernie 4.5 is 1/4 the price of 4o, while X1 is 1/2 the price of R1). They plan to open-source the models on June 30th.
  • Mistral releases Mistral Small 3.1 24B, reporting good performance for the model size (e.g. outperforming GPT-4o-mini and Gemma 3).
  • LG AI Research announces EXAONE Deep, a reasoning LLM (2.4B, 7.8B, 32B variants; weights) that scores well on math benchmarks.
  • Nvidia releases Llama-Nemotron models, which can do reasoning (try it here).

Safety

Vision

Image Synthesis

  • Gemini 2.0 Flash Experimental (available in Google AI Studio) is multimodal, with image generation capabilities. By having the image generation “within the model” (rather than as an external tool), one can iterate on image generation much more naturally. This incidentally obviates the need for more specialized image tools (can do colorization, combine specified people/places/products, remove watermarks, etc.).

Video

Audio

Science

Robots
