General
- AI timeline visualization: The Road to AGI (2015–2025), by James Campbell and Emiliano Garcia-Lopez (code).
- Anthropic releases the second report for their economic index: Anthropic Economic Index: Insights from Claude 3.7 Sonnet.
- xAI buys the 𝕏 (formerly Twitter) social media platform. The internal all-stock deal values xAI at $80B and 𝕏 at $33B ($45B minus $12B debt).
- OpenAI raises an additional $40B at $300B valuation.
- LLMs shown to be viable for therapy: First Therapy Chatbot Trial Yields Mental Health Benefits.
- LLMs as viable tutors: LLM Support for Tutors: GPT-4 boosts remote tutors’ performance in real time, study finds.
- New study evaluates human ability to detect AI in a more stringent setting: Large Language Models Pass the Turing Test.
- Note earlier work already showing LLMs passing less-stringent Turing Tests:
Research Insights
- Meta preprint: Multi-Token Attention. They let attention weights depend on multiple tokens at once: convolution operations over queries, keys, and heads allow nearby queries/keys to affect each other’s attention weights.
- Danijar Hafner et al. (Google DeepMind) present DreamerV3.
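The Multi-Token Attention idea above (nearby queries/keys influencing each other's attention weights through a convolution) can be illustrated with a minimal sketch. This is not Meta's implementation: it assumes a single head, no causal masking, a centered odd-sized kernel with zero padding, and a convolution applied to the logits before the softmax. All function names are hypothetical.

```python
import math


def attention_logits(queries, keys):
    """Scaled dot-product logits: one row per query, one column per key."""
    d = len(queries[0])
    return [
        [sum(q[t] * k[t] for t in range(d)) / math.sqrt(d) for k in keys]
        for q in queries
    ]


def convolve_logits(logits, kernel):
    """Mix each logit with its neighbours along the query and key axes.

    Assumption: centered kernel with odd side lengths, zero padding.
    """
    n_q, n_k = len(logits), len(logits[0])
    k_q, k_k = len(kernel), len(kernel[0])
    out = [[0.0] * n_k for _ in range(n_q)]
    for i in range(n_q):
        for j in range(n_k):
            total = 0.0
            for a in range(k_q):
                for b in range(k_k):
                    qi = i + a - k_q // 2
                    kj = j + b - k_k // 2
                    # Positions outside the logit matrix contribute nothing.
                    if 0 <= qi < n_q and 0 <= kj < n_k:
                        total += kernel[a][b] * logits[qi][kj]
            out[i][j] = total
    return out


def softmax_rows(matrix):
    """Row-wise softmax, shifted by the row maximum for stability."""
    result = []
    for row in matrix:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        result.append([e / s for e in exps])
    return result


def multi_token_attention_weights(queries, keys, kernel):
    """Attention weights where each logit is blended with nearby logits."""
    return softmax_rows(convolve_logits(attention_logits(queries, keys), kernel))


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    blur = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]
    for row in multi_token_attention_weights(q, k, blur):
        print([round(w, 3) for w in row])
```

With a kernel that is 1 at the center and 0 elsewhere, the sketch reduces to ordinary single-token attention, which is a convenient sanity check.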
Safety
- Control AI releases: The Direct Institutional Plan. They suggest designing policies that prevent development of superintelligence, and spreading awareness among democratic institutions.
- Google DeepMind: Taking a responsible path to AGI.
LLM
- OpenAI pushed an update to their 4o model. This has significantly improved its ranking (e.g. it is now the best non-reasoning model on a coding benchmark).
- An interesting test of GPT-4o in-context image generation: it is unable to generate an image of a maze with a valid solution, at least when the maze is a square. However, if you ask it to make an image of a maze in diamond orientation (a square rotated by 45°), it succeeds in producing a valid solution. We can rationalize this based on the sequential order of autoregressive generation. By generating first from the start of the maze (and only its local neighborhood), and similarly finishing with this sort of locality, the model can more correctly build a valid solution. (Conversely, the usual square orientation requires longer-range reasoning across image patches.)
- At first, this might seem like just another silly oddity. But it shows how recasting a problem, just by changing the generation order, can massively change model performance. This sheds light on how they “think” and suggests that alternate generation strategies could perhaps unlock capabilities.
- For instance, one could imagine an LLM with different branches (like MoE?) where each branch is trained on a different autoregression strategy (left-to-right, right-to-left, block diffusion, random, etc.) such that the overall LLM can invoke/combine different kinds of thinking modes.
- Another trick is to ask it to generate an image of a maze with the solution identified, and then update the image to remove the solution. This is a visual analog of “think step-by-step” and other inference-time-compute strategies. This implies that current models have untapped visual reasoning capabilities that could be unlocked by allowing them to visually iterate on problems.
- Anthropic announces Claude for Education, which provides a university-wide solution tailored to education.
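The maze observation earlier in this section can be made concrete with a small sketch. Assuming, as that rationalization does, that image patches are produced roughly top-to-bottom in raster order, the code below counts how many maze cells fall on each raster row for an upright square versus the same grid rotated by 45°. The function names are hypothetical, and the actual generation order inside GPT-4o is not something this sketch can confirm.

```python
def square_row_widths(n: int) -> list[int]:
    """Cells per raster row for an n x n maze drawn as an upright square."""
    return [n] * n


def diamond_row_widths(n: int) -> list[int]:
    """Cells per raster row when the same n x n maze is rotated by 45 degrees.

    Each raster row then corresponds to one anti-diagonal (i + j == k).
    """
    return [
        sum(1 for i in range(n) if 0 <= k - i < n)
        for k in range(2 * n - 1)
    ]


if __name__ == "__main__":
    n = 5
    print("square :", square_row_widths(n))
    print("diamond:", diamond_row_widths(n))
```

In the diamond layout the first and last raster rows contain a single cell (the start and end corners, if those sit at the top and bottom), so generation begins and ends in a small local neighborhood. In the square layout every row spans the full width, so each row of patches has to stay consistent with a path that may wander across the whole maze.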
AI Agents
- Amazon introduces Nova Act, a research preview for agents controlling web browsers.
- AI Digest has started an experiment: they launched 4 computer-use agents, and gave them the task of getting donations for a charity of their choice. The agents can chat to each other, and human visitors can also chat with them. They have begun to (slowly) work on the problem. You can view their ongoing activities here.
- General Agents claims they have a general-purpose computer-use agent (Ace) that operates your local computer.
- OpenAI releases a new benchmark: PaperBench: Evaluating AI’s Ability to Replicate AI Research (paper, code).
- Zapier adds MCP support, so AI agents can now access a very broad range of web apps (Slack, Google Sheets, Notion, etc.).
Audio
- ElevenLabs:
- Adds native, low-latency RAG for conversational AI.
- Launches Actor Mode, where you can use your voice to direct the AI’s performance.
- Udio introduces Styles, allowing generation from a provided audio clip.
- Mureka AI enables more fine-grained music generation.
Image Synthesis
- Ideogram 3.0 released.
- OpenAI adds regional selection to their new in-context 4o image generator, allowing tailored updates to images.
Video
- Runway ML launches a new model: Gen-4. Improvements in realism, physics, character consistency, etc. Sample short videos: The Lonely Little Flame, The Herd, The Retrieval, NYC is a Zoo, Scimmia Vede. (More examples.)
- Meta unveils: MoCha: Towards Movie-Grade Talking Character Synthesis (preprint). Remarkably good human character generation from audio input.
- Higgsfield is showing good camera control.
- Sync introduces lipsync-2, which generates synced video that maintains expressiveness.
Science
- Curated list of science datasets: Awesome Materials & Chemistry Datasets.
Robot
- KEENON Robotics introduces wheeled humanoid XMAN-R1.
- Unitree Dex5 Dexterous Hand (20 degrees of freedom).
- More video of Figure robots performing real work in a BMW factory.