AI News 2025-03-27

General

Research Insights

LLM

Multimodal

AI Agents

Safety

  • Superalignment with Dynamic Human Values. The authors treat alignment as a dynamic problem, in which human values may change over time. The proposed solution involves an AI that breaks tasks into smaller components that are easier for humans to guide. The framework assumes that correct alignment on sub-tasks generalizes to desirable outcomes for the overall task.
  • Google DeepMind: Defeating Prompt Injections by Design.

Audio

  • OpenAI announced new audio models: new text-to-speech models (test here) that can be instructed on how to speak; and gpt-4o-transcribe, which has a lower error rate than Whisper (including a mini variant that is half the cost of Whisper).
  • OpenAI updated their advanced voice mode, making it better at not interrupting the user.

Image Synthesis

  • Tokenize Image as a Set (code). Interesting approach to use an unordered bag of tokens (rather than a serialization, as done with text) to represent images.
  • StarVector is a generative model for converting text or images to SVG code.
  • Applying mechanistic interpretability to image synthesis models can offer enhanced control: Unboxing SDXL Turbo: How Sparse Autoencoders Unlock the Inner Workings of Text-to-Image Models (preprint, examples).
  • The era of in-context and/or autoregressive image generation is upon us. In-context generation means the LLM can directly understand and edit photos (colorize, restyle, make changes, remove watermarks, etc.). Serial autoregressive approaches also handle text and prescribed layout much better, and often have improved prompt adherence.
    • Last week, Google unveiled Gemini 2.0 Flash Experimental image generation (available in Google AI Studio).
    • Reve Image reveal that the mysterious high-scoring “halfmoon” is their image model, apparently exploiting some kind of “logic” (auto-regressive model? inference-time compute?) to improve output.
    • OpenAI release their new image model: 4o image generation. It can generate highly coherent text in images, and iterate upon images in-context.
      • This led to a one-day Ghibli-themed spontaneous meme explosion.
      • It is interesting to see how it handles generating a map with walking directions. There are mistakes, but the quality is remarkable. The map itself is mostly just memorization, but the roughly correct walking directions and time estimates point towards a more generalized underlying understanding.

Video

  • SkyReels is offering AI tools to cover the entire workflow (script, video, editing).
  • Pika is testing a new feature that allows one to edit existing video (e.g. animating an object).

World Synthesis

Science

Hardware

  • Halliday: smart glasses intended for AI integration ($430).

Robots

  • Unitree shows a video of smooth athletic movement.
  • Figure reports on using reinforcement learning in simulation to greatly improve the walking of their humanoid robot, providing it with a better (faster, more efficient, more humanlike) gait.
  • Google DeepMind paper: Gemini Robotics: Bringing AI into the Physical World. They present a vision-language-action model capable of directly controlling robots.