AI Impact Predictions

Debates about future AI progress and impact are often confused, because different people have very different mental models of the expected pace of progress, and of the time horizon over which they are projecting.

This figure is my attempt to clarify:

The experimental datapoints come from the METR analysis Measuring AI Ability to Complete Long Tasks (paper, code/data). The “count the OOMs” and “new regime” curves are extrapolated fits to the data. The other curves are ad hoc, drawn only to give a sense of how a particular mental model might translate into capability predictions.
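
To make the “count the OOMs” style of extrapolation concrete, here is a minimal sketch. It assumes the METR-reported task-horizon doubling time of roughly 7 months; the starting horizon is an illustrative assumption, not a fitted value from the paper:

```python
from math import log2

# Assumptions (illustrative, not the paper's fitted parameters):
doubling_time_months = 7      # METR reports the ~50%-success task horizon
                              # doubling roughly every 7 months
current_horizon_minutes = 60  # assume frontier models handle ~1-hour tasks

def months_until_horizon(target_minutes: float) -> float:
    """Months until the task horizon reaches target_minutes,
    assuming the exponential trend simply continues."""
    doublings = log2(target_minutes / current_horizon_minutes)
    return doublings * doubling_time_months

# Example: time until week-long tasks (~40 work-hours)?
print(f"{months_until_horizon(40 * 60):.0f} months")  # ~37 under these assumptions
```

The different curves in the figure amount to different choices of functional form for this extrapolation (exponential, super-exponential, saturating), which is exactly where the mental models diverge.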

The figure tries to emphasize:

  • Task complexity covers many orders of magnitude. Although imperfect, the timescale over which an AI must sustain “coherent progress” serves as a proxy for generally useful capability.
  • There are many models for progress, and their predictions vary dramatically.
  • Nevertheless, except for scenarios that fundamentally doubt AI progress is possible, the main disagreement among models is over the timescale required to reach a given kind of impact.
  • The concerns one has (economic, social, existential) will depend on one’s model. (Of course one’s concerns will also be influenced by other assessments, such as the wisdom we expect leaders to exhibit at different stages of rollout.)
  • It is difficult to define intelligence. Yet it seems quite defensible to say that we have transitioned from clearly sub-human AI into a “jagged intelligence” regime, where a particular AI system will outperform humans on some tasks (e.g. rapid knowledge retrieval) but underperform on others (e.g. visual reasoning). As we move through the jagged frontier, we should expect more and more human capabilities to be replicated in AI, even while some other subset remains unconquered.
  • The definition of “AGI” is also unclear. Rather than a clear line being crossed, we should expect a growing fraction of people to acknowledge AI as generally capable, as systems cross through the jagged frontier.

The primary goal of the figure is to clarify discussions: we should specify which kinds of scenarios we find plausible, which impacts are thus considered possible, and which time-span we are currently discussing.


AI News 2025-04-17

General

Research Insights

LLM

  • Zyphra releases an open-source reasoning model: ZR1-1.5B (weights, try it).
  • Anthropic adds to Claude a Research capability, and Google Workspace integration.
  • OpenAI announces GPT-4.1 models in the API. They are optimized for developers (instruction following, coding, diff generation, etc.) and have a 1M-token context length; three models (4.1, 4.1-mini, 4.1-nano) provide control of the performance-vs-cost tradeoff. The models can handle text, image, and video.
    • They also have a prompting guide for 4.1.
    • OpenAI have released a new eval for long-context: MRCR.
    • OpenAI intends to deprecate GPT-4.5 in the next few months.
  • OpenAI announces o3 and o4-mini reasoning models.
    • These models are explicitly trained to use tools as part of their reasoning process.
    • They can reason over images in new ways.
    • Improved scores on math and code benchmarks (91-98% on AIME, ~75% on scientific figure reasoning, etc.).
    • o3 is strictly better than o1 (higher performance with lower inference cost); o1 will be deprecated.
    • OpenAI will be releasing coding-agent applications, starting with Codex CLI, which makes it easy to deploy coding agents.
    • METR has provided evaluations of capabilities.
    • As part of the release, they also provided data showing how scaling RL is yielding predictable improvements.

Safety

Video

Audio

Science


AI News 2025-04-10

General

Research Insights

LLM

  • More progress in diffusion language models: Dream 7B (Introducing Dream 7B, the most powerful open diffusion large language model to date).
  • Meta releases the Llama 4 series of MoE LLMs: Scout (109B total, 17B active, 16 experts), Maverick (400B, 17B active, 128 experts), and Behemoth (2T, 288B active, 16 experts), with Scout offering a 10M-token context. The models appear to be competitive (nearing the state-of-the-art tradeoff curve for performance/price), and thus extremely impressive for open-source. (A minimal sketch of top-k MoE routing follows this list.)
    • Independent evals (including follow-up) from Artificial Analysis show it performing well against non-reasoning models.
    • Evaluations of the 10M context on simple NIAH seem reasonable, but (reportedly) it does not fare as well on deeper understanding of long context.
  • Cloudflare launch an open beta for their AutoRAG solution.
  • Nvidia release Llama-3_1-Nemotron-Ultra-253B-v1, which seems to beat Llama 4 despite being based on Llama 3.1.
  • Amazon announces Nova Sonic speech-to-speech foundation models, for building conversational AI.
  • Agentica release the open-source DeepCoder-14B-Preview, a reasoning model optimized for coding (code, hf).
  • Anthropic announce a new “Max” plan for Claude ($100/month).
  • xAI release an API for Grok-3. Pricing appears relatively expensive (e.g. compared to Gemini models with better performance).
  • OpenAI adds an evals API, making it easier to programmatically define tests and evaluations. This should make it faster/easier to test different prompts, LLMs, etc.
  • Bytedance release technical report for Seed-Thinking-v1.5, a 200B reasoning model.
  • OpenAI add a memory feature to ChatGPT, allowing it to reference all past chats in order to personalize responses.
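
To make the “N experts, M active” numbers in the Llama 4 item concrete, below is a minimal sketch of top-k expert routing in PyTorch. This illustrates the generic technique only; it is not Llama 4’s actual implementation, and all sizes are arbitrary:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer with top-k routing (illustrative;
    real systems add load-balancing losses, shared experts, capacity limits)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # route each token to k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Only k of n_experts run per token, which is why "active" parameters
# (e.g. 17B) can be far fewer than total parameters (e.g. 400B).
y = TopKMoE()(torch.randn(10, 512))
```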

AI Agents

Audio

Image Synthesis

Video

World Synthesis

Science

Brain

Hardware

Robots


AI News 2025-04-03

General

Research Insights

Safety

LLM

  • OpenAI pushed an update to their 4o model, which has significantly improved its ranking (e.g. it is now the best non-reasoning model on a coding benchmark).
  • An interesting test of GPT-4o in-context image generation: it is unable to generate an image of a maze with a valid solution, at least when the maze is in the usual square orientation. However, if you ask for a diamond-orientation maze (a 45°-rotated square), it succeeds in producing a valid solution. We can rationalize this based on the sequential order of autoregressive generation: by generating first from the start of the maze (and only its local neighborhood), and finishing with similar locality, the model can more easily build a valid solution. (Conversely, the usual square orientation requires longer-range reasoning across image patches.) A sketch for reproducing this comparison follows this list.
    • At first, this might seem like just another silly oddity. But it shows how recasting a problem, merely by changing the generation order, can massively change model performance. This sheds light on how these models “think”, and suggests that alternate generation strategies could unlock additional capabilities.
      • For instance, one could imagine an LLM with different branches (like MoE?) where each branch is trained on a different autoregression strategy (left-to-right, right-to-left, block diffusion, random, etc.) such that the overall LLM can invoke/combine different kinds of thinking modes.
    • Another trick is to ask it to generate an image of a maze with the solution identified, and then update the image to remove the solution. This is a visual analog of “think step-by-step” and other inference-time-compute strategies, and it implies that current models have untapped visual reasoning capabilities that could be unlocked by allowing them to visually iterate on problems.
  • Anthropic announces Claude for Education, which provides a university-wide solution tailored to education.
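
A minimal sketch of how one might reproduce the maze-orientation comparison. Note the assumptions: at the time of writing, 4o image generation was only available in ChatGPT, so the model name and API route below are hypothetical placeholders:

```python
from openai import OpenAI

client = OpenAI()

prompts = {
    # Hypothesis: the diamond orientation succeeds because raster-order
    # autoregressive generation can build the solution path from locally
    # consistent patches, while the square orientation requires
    # longer-range consistency across patches.
    "square":  "A square maze, with the solution drawn as a red line.",
    "diamond": "A maze rotated 45 degrees (diamond orientation), "
               "with the solution drawn as a red line.",
}

for name, prompt in prompts.items():
    # "gpt-image-1" is a placeholder model name (an assumption);
    # response fields may vary by model/endpoint.
    result = client.images.generate(model="gpt-image-1", prompt=prompt, n=4)
    for i, image in enumerate(result.data):
        print(name, i, image.url)  # inspect manually: does the red line solve the maze?
```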

AI Agents

Audio

Image Synthesis

Video

Science

Robot


AI News 2025-03-27

General

Research Insights

LLM

Multimodal

AI Agents

Safety

  • Superalignment with Dynamic Human Values. They treat alignment as a dynamic problem, where human values may change over time. The proposed solution involves an AI that breaks tasks into smaller components that are easier for humans to guide. This framework assumes that alignment on sub-tasks correctly generalizes to desirable outcomes for the overall task.
  • Google DeepMind: Defeating Prompt Injections by Design.

Audio

  • OpenAI announced new audio models: new text-to-speech models (test here), which one can instruct in how to speak; and gpt-4o-transcribe, with a lower error rate than Whisper (including a mini variant that is half the cost of Whisper).
  • OpenAI update their advanced voice mode, making it better at not interrupting the user.

Image Synthesis

  • Tokenize Image as a Set (code). An interesting approach that represents images with an unordered bag of tokens (rather than a serialization, as is done with text).
  • StarVector is a generative model for converting text or images to SVG code.
  • Applying mechanistic interpretability to image synthesis models can offer enhanced control: Unboxing SDXL Turbo: How Sparse Autoencoders Unlock the Inner Workings of Text-to-Image Models (preprint, examples).
  • The era of in-context and/or autoregressive image generation is upon us. In-context generation means the LLM can directly understand and edit photos (colorize, restyle, make changes, remove watermarks, etc.). Serial autoregressive approaches also handle text and prescribed layout much better, and often have improved prompt adherence.
    • Last week, Google unveiled Gemini 2.0 Flash Experimental image generation (available in Google AI Studio).
    • Reve Image reveal that the mysterious high-scoring “halfmoon” is their image model, apparently exploiting some kind of “logic” (auto-regressive model? inference-time compute?) to improve output.
    • OpenAI release their new image model: 4o image generation. It can generate highly coherent text in images, and iterate upon images in-context.
      • This led to a one-day Ghibli-themed spontaneous meme explosion.
      • It is interesting to see how it handles generating a map with walking directions. There are mistakes. But the quality is remarkable. The map itself is mostly just memorization, but the roughly-correct walking directions and time estimation point towards a more generalized underlying understanding.

Video

  • SkyReels is offering AI tools to cover the entire workflow (script, video, editing).
  • Pika is testing a new feature that allows one to edit existing video (e.g. animating an object).

World Synthesis

Science

Hardware

  • Halliday: smart glasses intended for AI integration ($430).

Robots

  • Unitree shows a video of smooth athletic movement.
  • Figure reports on using reinforcement learning in simulation to greatly improve the walking of their humanoid robot, providing it with a better (faster, more efficient, more humanlike) gait.
  • Google DeepMind paper: Gemini Robotics: Bringing AI into the Physical World. They present a vision-language-action model capable of directly controlling robots.

AI News 2025-03-20

General

Research Insights

LLM

  • Baidu announce Ernie 4.5 and X1 (use here). They claim that Ernie 4.5 is comparable to GPT-4o, and that X1 is comparable to DeepSeek R1, but with lower API costs (Ernie 4.5 is 1/4 the price of 4o, while X1 is 1/2 the price of R1). They plan to open-source the models on June 30th.
  • Mistral release Mistral Small 3.1 24B. They report good performance for the model size (e.g. outperforming GPT-4o-mini and Gemma 3).
  • LG AI Research announce EXAONE Deep, a reasoning LLM (2.4B, 7.8B, 32B variants; weights) that scores well on math benchmarks.
  • Nvidia release Llama-Nemotron models, which can do reasoning (try it here).

Safety

Vision

Image Synthesis

  • Gemini 2.0 Flash Experimental (available in Google AI Studio) is multimodal, with image generation capabilities. By having the image generation “within the model” (rather than as an external tool), one can iterate on image generation much more naturally. This incidentally obviates the need for many specialized image tools (it can do colorization, combine specified people/places/products, remove watermarks, etc.).

Video

Audio

Science

Robots


AI News 2025-03-13

General

Research Insights

LLM

AI Agents

Safety

  • OpenAI blog post: Detecting misbehavior in frontier reasoning models. They study how the natural-language chain-of-thought (CoT) operates in reasoning models. They find that aggressive optimization of reasoning, especially optimizing for the CoT to not exhibit misaligned text, leads to model behaviors where undesired thoughts are not expressed in the CoT (but are nevertheless activated). Conversely, under-optimized CoT remains human-legible, providing an opportunity to detect and modify undesired behavior. They advocate strongly avoiding over-optimization of CoT, thereby keeping it legible; they note that this may require hiding the CoT from the end-user (e.g. so the model can freely consider dangerous topics in the CoT, while ultimately not expressing them in the response to the user).
  • Dan Hendrycks, Eric Schmidt and Alexandr Wang released: Superintelligence Strategy, a detailed essay about ASI risks, with concrete mitigation suggestions, including Mutual Assured AI Malfunction (MAIM).

Audio

  • Elevenlabs adds speed control for text-to-speech; speed can be adjusted down to the word level to shape a performance.
  • Tavus are demoing AI avatars (audio and video) that are fairly lifelike and responsive.
  • Nvidia release Audio Flamingo 2 (paper, code), an audio-language model with long-context and understanding of non-speech audio.
  • Sesame has now released the weights for their remarkable conversational audio model (demo, example): use, code, weights.

Image Synthesis

Video

  • Hedra releases Character 3, an improved video avatar model that can lip-sync to provided audio.
  • Captions AI’s Mirage model also achieves more emotive lip-sync than older methods.

Science

Robots


AI News 2025-03-06

General

Research Insights

LLM

AI Agents

Audio

  • Sesame have a demo of a voice audio chatbot that is remarkably fast and natural-sounding (example). They claim they will open-source it soon.
  • Podcastle (podcasting platform) introduces Asyncflow, a library of 450 AI voices.

Video

Science

Robots

  • Figure announces that it is accelerating deployment plans, starting in-home alpha testing this year.
  • UBTECH claims they are deploying swarm methods, where individual humanoid robots share knowledge and communicate to collaborate on problems (apparently being tested in Zeekr’s car factory).
  • Dexmate introduce their semi-humanoid Vega.
  • Proception are working on a humanoid, starting with the hand.

AI News 2025-02-27

General

Research Insights

  • Surprising result relevant to AI understanding and AI safety: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. Fine-tuning an LLM to produce insecure code causes it to incidentally pick up many other misaligned behaviors, including giving malicious advice on unrelated topics and expressing admiration for evil people (example outputs).
    • They even find that fine-tuning to generate “evil numbers” (such as 666) leads to similar kinds of broad misalignment.
    • The broad generalization it exhibits could have deep implications.
    • It suggests that the model learns many implicit associations during training and RLHF, such that many “unrelated” concepts are being tangled up into a single preference vector. Thus, when one pushes on a subset of the entangled concepts, the others are also affected.
    • This is perhaps to be expected (in retrospect), in the sense that there are many implicit/underlying correlations in the training data, which can be exploited to learn a simpler predictive model. That is, there is a strong correlation between the concepts of being morally good and writing secure/helpful code.
    • This is similar to a previous result: Refusal in Language Models Is Mediated by a Single Direction.
    • From an AI safety perspective, this is perhaps heartening, as it suggests a more general and robust learning of human values. It also suggests it might be easier to detect misalignment (since it will show up in many different ways) and steer models (since behaviors will be entangled, and don’t need to be individually steered).
    • Of course much of this is speculation for now. The result is tantalizing but will need to be replicated and studied.
  • SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution. Meta demonstrates 41.0% on SWE-Bench Verified despite being only a 70B model (vs. 31% for the non-RLed 70B model), further validating the RL approach to improving performance on focused domains.
  • Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models. They find evidence for cross-modal knowledge transfer. E.g. CLIP can learn richer aggregate semantics (e.g. for a particular culture or country), compared to a vision-only method.
  • Inception Labs is reporting progress on diffusion language models (dLLMs) with their Mercury model (try it here). Unlike traditional autoregressive LLMs, which generate tokens one at a time (left to right), the diffusion method generates the whole token sequence at once, approaching it as in image generation: start with an imperfect/noisy estimate of the entire output, and progressively refine it. In addition to a speed advantage, Karpathy notes that such models might exhibit different strengths and weaknesses compared to conventional LLMs. (A minimal decoding sketch follows below.)
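
For intuition, here is a minimal sketch of one common discrete-diffusion decoding scheme (iterative parallel unmasking). Whether Mercury uses exactly this formulation is not public; `model` and `tokenizer` are assumed stand-ins:

```python
import torch

def diffusion_generate(model, tokenizer, length=64, steps=8):
    """Sketch of masked-diffusion text generation: start fully masked,
    predict every position in parallel each step, and commit the most
    confident predictions while leaving the rest masked for refinement."""
    mask_id = tokenizer.mask_token_id
    seq = torch.full((1, length), mask_id)
    committed = 0
    for step in range(steps):
        logits = model(seq)                 # (1, length, vocab): all positions at once
        conf, pred = logits.softmax(-1).max(-1)
        conf[seq != mask_id] = -1.0         # never overwrite already-committed tokens
        n_new = (step + 1) * length // steps - committed
        top = conf.topk(n_new, dim=-1).indices[0]
        seq[0, top] = pred[0, top]          # commit the most confident positions
        committed += n_new
    return tokenizer.decode(seq[0].tolist())
```

The contrast with autoregressive decoding is that every position is predicted at every step, which is where the speed advantage (and the different error profile) comes from.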

LLM

  • Different LLMs are good at different things, so why not use a router to select the ideal LLM for a given task/prompt? Prompt-to-Leaderboard (code) demonstrates this, taking the top spot on the Chatbot Arena leaderboard. (A toy routing sketch follows this list.)
  • Anthropic release Claude 3.7 Sonnet (system card), a hybrid model that can return immediate answers or conduct extended thinking. In benchmarks, it is essentially state-of-the-art (comparing favorably against o1, o3-mini, R1, and Grok 3 Thinking). Surprisingly, even the non-thinking mode can outperform frontier reasoning models on certain tasks. It appears extremely good at coding.
    • Claude Code is a terminal application that automates many coding and software engineering tasks (currently in a limited research preview).
    • Performance of thinking variant on ARC-AGI is roughly equal to o3-mini (though at higher cost).
    • Achieves 8.9% on Humanity’s Last Exam (cf. 14% by o3-mini-high).
    • For fun, some Anthropic engineers deployed Claude to play Pokemon (currently live on Twitch). Claude 3.7 is making record-setting progress in this “benchmark”.
  • Qwen releases a thinking model: QwQ-Max-Preview (use it here).
  • Convergence open-source Proxy Lite, a scaled-down version of their full agentic model.
  • OpenAI have added Assistants File Search, essentially providing an easier way to build RAG solutions in their platform.
  • Microsoft release phi-4-multimodal-instruct, a language+vision+speech multimodal model.
  • DeepSeek releases:
  • OpenAI releases GPT-4.5. It is a newer/better non-reasoning LLM. It is apparently “a big model”. It has improved response quality with fewer hallucinations, and more nuanced emotional understanding.
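
The actual Prompt-to-Leaderboard method trains an LLM to predict a prompt-specific leaderboard; as a rough illustration of the routing pattern only, here is a toy nearest-neighbor router (the embedding source and the preference log are assumptions):

```python
import numpy as np

# Toy log of (prompt embedding, model that won the human comparison).
# In practice this would be thousands of Chatbot-Arena-style votes,
# with embeddings from any sentence-embedding model (assumed here).
history: list[tuple[np.ndarray, str]] = [
    (np.random.randn(384), "code-specialist"),   # placeholder entries
    (np.random.randn(384), "general-chat"),
]

def route(prompt_emb: np.ndarray, k: int = 5) -> str:
    """Pick the model that won most often among the k most similar prompts."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(history, key=lambda pair: cosine(pair[0], prompt_emb), reverse=True)
    winners = [model for _, model in ranked[:k]]
    return max(set(winners), key=winners.count)
```

A production router would also weight by cost and latency, not just predicted win rate.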

AI Agents

Audio

  • Luma add a video-to-audio feature to their Dream Machine video generator.
  • ElevenLabs introduce a new audio transcription (speech-to-text) model: Scribe. They claim superior performance compared to the state-of-the-art (e.g. OpenAI Whisper).
  • Hume announce Octave, an improved text-to-speech where one can describe voice (including accent) and provide acting directions (emotion, etc.).

Video

3D

Science

Robots


AI News 2025-02-20

General

  • Perplexity adds a Deep Research capability (similar to Google and OpenAI). You can try it even in the free tier (5 per day). They score 21% on the challenging “Humanity’s Last Exam” benchmark, second only to OpenAI at 26%.
  • TechCrunch reports: A job ad for Y Combinator startup Firecrawl seeks to hire an AI agent for $15K a year. Undoubtedly a publicity stunt. And yet, it hints towards a near-future economic dynamic: offering pay based on desired results (instead of salary), and allowing others to bid using human or AI solutions.
  • Mira Murati (formerly at OpenAI) announces Thinking Machines, an AI venture.
  • Fiverr announces Fiverr Go, where freelancers can train a custom AI model on their own assets, and have this AI model/agent available for use through the Fiverr platform. This provides a way for freelancers to service more clients.
    • Elevenlabs Payouts is a similar concept, where voice actors can be paid when clients use their customized AI voice.
    • In the short term, this provides an extra revenue stream to these workers. Of course, these workers are the most at threat for full replacement by these very AI methods. (And, indeed, one could worry that the companies in question are gathering the data they need to eventually obviate the need for profit-sharing with contributors.)

Research Insights

LLM

  • Nous Research releases DeepHermes 3 (8B), which mixes conventional LLM responses with long-CoT reasoning responses.
  • InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU.
  • ByteDance has released a new AI-first coding IDE: Trae AI (video intro).
  • LangChain Open Canvas provides a user interface for LLMs, including memory features, a coding UI, artifact display, etc.
  • xAI announces the release of Grok 3 (currently available for use here), including a reasoning variant and “Deep Search” (equivalent to Deep Research). Early testing suggests a model closing in on the abilities of o1-pro (but not catching up to full o3). So, while it has not demonstrated any record-setting capabilities, it confirms that frontier models are not yet using methods that cannot be reproduced by others.

AI Agents

Safety

Image

Video

3D

World Synthesis

  • Microsoft report: Introducing Muse: Our first generative AI model designed for gameplay ideation (publication in Nature: World and Human Action Models towards gameplay ideation). They train a model (the World and Human Action Model, WHAM) on gameplay videos; the model can subsequently forward-simulate gameplay from a provided frame, and has thus learned an implicit world model for the video game. Forward-predicting gameplay from artificially edited frames (introducing a new character or situation) thus allows rapid ideation of gameplay ideas before actually updating the video game. More generally, this points towards direct neural rendering of games and other interactive experiences.

Science

Brain

Robots

  • Unitree video shows robot motion that is fairly fluid and resilient.
  • Clone robotics is moving towards combining their biomimetic components into a full-scale humanoid: Protoclone.
  • MagicLab robot with the dexterous MagicHand S01.
  • Figure AI claims a breakthrough in robotic control software (Helix: A Vision-Language-Action Model for Generalist Humanoid Control). The video shows two humanoid robots handling a novel task based on natural-voice human instructions. Assuming the video is genuine, it shows real progress in the capability of autonomous robots to understand instructions and carry out simple tasks (including working with a partner in a team).