Mira Murati has raised $2B (at a $10B valuation) for her Thinking Machines startup.
Research Insights
New Anthropic results: Reasoning models don’t always say what they think (paper). They find that the plaintext chain-of-thought (CoT) of reasoning models may not contain the actual reasoning they used in latent space. This has implications for improving reasoning models, and also suggests (from a safety perspective) that we should not rely on monitoring CoT to infer what models are internally planning.
Planning ability emerges naturally in RL, despite not performing SFT on planning data.
Model verifies answers (even correct answers).
When retrievals are insufficient, model can generate refined search queries.
Model can recognize when it lacks sufficient information, and decline to answer.
Rethinking Reflection in Pre-Training. They show that even just from pre-training, models develop some amount of reflective/reasoning understanding.
Concise Reasoning via Reinforcement Learning. They find that RL training generically favors longer responses, whereas correct responses often correlate with conciseness. This suggests improving reasoning by also rewarding shorter answers.
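A minimal sketch of how such an incentive could be expressed (the penalty form and coefficient are illustrative assumptions, not the paper’s objective): correctness stays the dominant term, with a small length penalty breaking ties toward concise answers.

```python
# Minimal sketch (not the paper's formulation): reward shaping that keeps
# correctness primary, but breaks ties in favor of shorter responses.
def shaped_reward(is_correct: bool, num_tokens: int,
                  length_penalty: float = 0.0005) -> float:
    """Correct answers always beat incorrect ones; among correct answers, shorter wins."""
    base = 1.0 if is_correct else 0.0
    # Cap the penalty so it can never flip a correct answer below an incorrect one.
    return base - length_penalty * min(num_tokens, 1000)

# A concise correct answer scores higher than a verbose correct one.
print(shaped_reward(True, 150), shaped_reward(True, 900), shaped_reward(False, 100))
```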
Meta releases the Llama 4 series of MoE LLMs: Scout (109B, 17B active, 16 experts), Maverick (400B, 17B active, 128 experts), and Behemoth (2T, 288B active, 16 experts), with context windows of up to 10M tokens (Scout). The models appear to be competitive (nearing the state-of-the-art tradeoff curve for performance/price), and thus extremely impressive for open-source.
Independent evals (including follow-up) from Artificial Analysis show it performing well against non-reasoning models.
Evaluation of the 10M context on simple needle-in-a-haystack (NIAH) tests seems reasonable, but (reportedly) it does not fare as well on deeper understanding of long context.
Cloudflare launch an open beta for their AutoRAG solution.
Anthropic announce a new “Max” plan for Claude ($100/month).
xAI release an API for Grok-3. Pricing appears relatively high (e.g. compared to Gemini models that perform better).
OpenAI adds an evals API, making it easier to programmatically define tests, evaluations, etc. This should make it faster/easier to test different prompts, LLMs, etc.
ByteDance release a technical report for Seed-Thinking-v1.5, a 200B reasoning model.
OpenAI add a memory feature to ChatGPT, allowing it to reference all past chats in order to personalize responses.
AI Agents
Cognition AI releases Devin 2.0. Devin has been reframed as an IDE (not unlike Cursor), but they claim that one can use this UI to manage several autonomous software development agents working in parallel.
Midjourney unveils their v7 model (currently alpha available to users). It has strong aesthetics (as typical for Midjourney) but prompt adherence and text generation lag behind other models (examples).
Runway introduces a turbo version of their newest Gen-4 model.
Paper: One-Minute Video Generation with Test-Time Training (preprint). They add test-time-training (TTT) layers to a pre-trained model, and fine-tune on cartoons. It can generate one-minute video outputs, including shots/cuts that maintain (a semblance of) story consistency. This implies that longer-range video generation (beyond a single clip) can be solved using inference-time compute.
Meta preprint: Multi-Token Attention. They extend attention (query, key, and head operations) to span multiple tokens; convolution operations allow nearby queries/keys to affect each other’s attention weights.
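The gist can be illustrated with a small sketch (shapes and kernel are assumptions for illustration; this is not Meta’s exact architecture): a convolution is applied to the attention-logit map before the softmax, so neighboring query/key positions share information.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the core idea (not Meta's exact architecture): convolve the
# attention-logit map so nearby query/key positions influence each other's weights
# before the softmax is applied.
def multi_token_attention(q, k, v, kernel):
    # q, k, v: (batch, seq, dim); kernel: (1, 1, 3, 3) mixing kernel over (query, key) axes
    logits = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5     # (batch, seq_q, seq_k)
    mixed = F.conv2d(logits.unsqueeze(1), kernel, padding=1)  # mix neighboring logits
    attn = mixed.squeeze(1).softmax(dim=-1)
    return attn @ v

b, s, d = 2, 16, 32
q, k, v = (torch.randn(b, s, d) for _ in range(3))
kernel = 0.1 * torch.randn(1, 1, 3, 3)  # learnable in the real model
print(multi_token_attention(q, k, v, kernel).shape)  # torch.Size([2, 16, 32])
```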
An interesting test of GPT-4o in-context image generation: it is unable to generate an image of a maze with a valid solution, at least when the maze is a square. However, if you ask it to make an image of a diamond-orientation maze (a 45°-rotated square), it succeeds in producing a valid solution. We can rationalize this based on the sequential order of autoregressive generation. By generating first from the start of the maze (and only its local neighborhood), and similarly finishing with this sort of locality, the model can more correctly build a valid solution. (Conversely, the usual square orientation requires longer-range reasoning across image patches.)
At first, this might seem like just another silly oddity. But it shows how recasting a problem, just by changing the generation order, can massively change model performance. This sheds light on how they “think” and suggests that alternate generation strategies could perhaps unlock capabilities.
For instance, one could imagine an LLM with different branches (like MoE?) where each branch is trained on a different autoregression strategy (left-to-right, right-to-left, block diffusion, random, etc.) such that the overall LLM can invoke/combine different kinds of thinking modes.
Another trick is to ask it to generate an image of a maze with the solution identified, and then update the image to remove the solution. This is a visual analog of “think step-by-step” and other inference-time-compute strategies. This implies that current models have untapped visual reasoning capabilities that could be unlocked by allowing them to visually iterate on problems.
Amazon introduce Nova Act, a research preview for agents controlling web browsers.
AI Digest has started an experiment: they launched 4 computer-use agents, and gave them the task of getting donations for a charity of their choice. The agents can chat to each other, and human visitors can also chat with them. They have begun to (slowly) work on the problem. You can view their ongoing activities here.
General Agents claims they have a general-purpose computer-use agent (Ace) that operates your local computer.
Nvidia introduce the Nemotron-H family of models (8B, 47B, 56B), including base/instruct/VLM variants. They are hybrid Mamba-Transformer models that achieve good efficiency.
OpenAI adds support for Anthropic’s Model Context Protocol (MCP), solidifying it as the standard mechanism for giving AI agents access to diverse resources in a uniform way.
Superalignment with Dynamic Human Values. They treat alignment as a dynamic problem, where human values may change over time. The proposed solution involves an AI that breaks tasks into smaller components, that are easier for humans to guide. This framework assumes that alignment of sub-tasks correctly generalizes to desirable outcomes for the overall task.
OpenAI announce new audio models: new text-to-speech models (test here) where one can instruct the model about how to speak; and gpt-4o-transcribe with a lower error rate than Whisper (including a mini variant that is half the cost of Whisper).
OpenAI update their advanced voice mode, making it better at not interrupting the user.
Image Synthesis
Tokenize Image as a Set (code). Interesting approach to use an unordered bag of tokens (rather than a serialization, as done with text) to represent images.
The era of in-context and/or autoregressive image generation is upon us. In-context generation means the LLM can directly understand and edit photos (colorize, restyle, make changes, remove watermarks, etc.). Serial autoregressive approaches also handle text and prescribed layout much better, and often have improved prompt adherence.
Last week, Google unveiled Gemini 2.0 Flash Experimental image generation (available in Google AI Studio).
Reve Image reveal that the mysterious high-scoring “halfmoon” is their image model, apparently exploiting some kind of “logic” (auto-regressive model? inference-time compute?) to improve output.
OpenAI release their new image model: 4o image generation. It can generate highly coherent text in images, and iterate upon images in-context.
It is interesting to see how it handles generating a map with walking directions. There are mistakes. But the quality is remarkable. The map itself is mostly just memorization, but the roughly-correct walking directions and time estimation point towards a more generalized underlying understanding.
Video
SkyReels is offering AI tools to cover the entire workflow (script, video, editing).
Pika is testing a new feature that allows one to edit existing video (e.g. animating an object).
Figure reports on using reinforcement learning in simulation to greatly improve the walking of their humanoid robot, providing it with a better (faster, more efficient, more humanlike) gait.
Research from METR: Measuring AI Ability to Complete Long Tasks. A very valuable way to gauge AI utility is to compare to the length of the equivalent human effort for the task. As AI improves in coherence, we can expect it to tackle progressively longer-horizon tasks.
Baidu announce Ernie 4.5 and X1 (use here). They claim that Ernie 4.5 is comparable to GPT-4o, and that X1 is comparable to DeepSeek R1, but with lower API costs (Ernie 4.5 is 1/4 the price of 4o, while X1 is 1/2 the price of R1). They plan to open-source the models on June 30th.
Mistral release Mistral Small 3.1 24B. They report good performance for the model size (e.g. outperforming GPT-4o-mini and Gemma 3).
Gemini 2.0 Flash Experimental (available in Google AI Studio) is multimodal, with image generation capabilities. By having the image generation “within the model” (rather than as an external tool), one can iterate on image generation much more naturally. This incidentally obviates the need for more specialized image tools (can do colorization, combine specified people/places/products, remove watermarks, etc.).
Sudowrite Muse is an LLM designed specifically for creative writing, generating text that is more evocative than typical chatbot (“helpful assistant”) output.
Relatedly, Sam Altman posted some text from an LLM trained to be good at creative writing. The output is indeed more evocative than usual ChatGPT writing.
Google releases updates to its open-source models: Gemma 3 (technical report). They are small/efficient models, exceeding the prior Pareto front (e.g. 1338 LMArena ELO with just 27B parameters). Multimodal (text, image, video), 128k context window. Available as 1B, 4B, 12B, 27B.
Cohere introduces Command A (weights), a 111B multilingual model (256k context) that reportedly has good performance/price.
OpenAI releases the responses API and a developer SDK for agents (modernization of swarm). The new tools enable easy handoff between agents, arbitrary computer use, and more.
Here is an online demo of using these methods to control a web browser in a virtual machine.
Safety
OpenAI blog post: Detecting misbehavior in frontier reasoning models. They study how the natural-language chain-of-thought (CoT) operates in reasoning models. They find that aggressive optimization, especially optimizing the CoT to not exhibit misaligned text, produces models that stop expressing undesired intentions in the CoT while still acting on them. Conversely, a lightly-optimized CoT remains human-legible, providing an opportunity to detect and correct undesired behavior. They advocate strongly against over-optimizing the CoT, thereby keeping it legible; noting that this may require hiding the CoT from the end-user (e.g. so the model can freely consider dangerous topics in the CoT, while ultimately not expressing them in the response to the user).
Sakana’s AI scientist (v2) has written a paper that was accepted as a peer-reviewed publication. The experiment was conducted with the knowledge of the conference; reviewers did not know which papers were human or AI-generated.
The US Department of Energy organized a “Jam Session” where 1,000 National Lab scientists tested frontier models from OpenAI and Anthropic.
Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs (project page). They argue that LLMs represent a technically feasible and legal means of freeing the vast knowledge currently stored in closed archives (protected by copyright law). They propose using LLMs to generate knowledge-units that capture the important facts and relations, while being sufficiently stylistically distinct.
Chain of Draft: Thinking Faster by Writing Less. They prompt the LLM to generate draft-like intermediate reasoning steps that are minimal but useful (similar to how a person might first sketch out an idea, before filling in the details). This yields good reasoning performance with fewer tokens.
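A hypothetical prompt-level sketch of the idea (the wording is illustrative, not the paper’s exact prompt):

```python
# Chain-of-Draft-style instruction: keep each intermediate step to a terse draft.
# (Illustrative wording; not the paper's exact prompt.)
COD_PROMPT = (
    "Think step by step, but keep only a minimal draft for each step "
    "(a few words at most). Then give the final answer after '####'.\n\n"
    "Question: {question}"
)

def build_cod_prompt(question: str) -> str:
    return COD_PROMPT.format(question=question)

print(build_cod_prompt("A train travels 120 km in 1.5 hours. What is its average speed?"))
# Expected style of model output:
#   120 / 1.5 = 80
#   #### 80 km/h
```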
Atom of Thoughts for Markov LLM Test-Time Scaling (code). They describe a method that can be applied to any LLM, where reasoning processes are broken into separable steps, so that the outcome of each step can be compressed into an answer, after which intermediate states can be ignored. This allows more efficient reasoning (using fewer tokens) when solving complex problems.
Google releases a new challenging benchmark for LLMs: BIG-Bench Extra Hard. The current leader on this measure is o3-mini-high, which gets a score of 45%.
Figure announces that it is accelerating deployment plans, starting in-home alpha testing this year.
UBTECH claims they are deploying swarm methods, where individual humanoid robots share knowledge and communicate to collaborate on problems (apparently being tested in Zeekr’s car factory).
Experts were asked to evaluate Deep Research products: These experts were stunned by OpenAI Deep Research. OpenAI’s offering was found superior to Google’s. Overall, the reports (generated in <20 minutes) were judged as having saved hours of human effort.
Amazon Alexa devices will be upgraded to use Anthropic Claude as the AI engine. It will be called Alexa+, and is being rolled out over the coming weeks.
They even find that fine-tuning to generate “evil numbers” (such as 666) leads to similar kinds of broad misalignment.
The broad generalization it exhibits could have deep implications.
It suggests that the model learns many implicit associations during training and RLHF, such that many “unrelated” concepts are being tangled up into a single preference vector. Thus, when one pushes on a subset of the entangled concepts, the others are also affected.
This is perhaps to be expected (in retrospect) in the sense that there are many implicit/underlying correlations in the training data, which can be exploited to learn a simpler predictive model. I.e. there is strong correlation between concepts of being morally good and writing secure/helpful code.
From an AI safety perspective, this is perhaps heartening, as it suggests a more general and robust learning of human values. It also suggests it might be easier to detect misalignment (since it will show up in many different ways) and steer models (since behaviors will be entangled, and don’t need to be individually steered).
Of course much of this is speculation for now. The result is tantalizing but will need to be replicated and studied.
Inception Labs is reporting progress on diffusion language models (dLLMs): Mercury model (try it here). Unlike traditional autoregressive LLMs, which generate tokens one at a time (left to right), the diffusion method generates the whole token sequence in parallel. It approaches text generation the way diffusion image models do: start with an imperfect/noisy estimate of the entire output, and progressively refine it. In addition to a speed advantage, Karpathy notes that such models might exhibit different strengths and weaknesses compared to conventional LLMs.
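A toy sketch of the parallel-refinement idea (the denoiser here is a random stand-in, not Mercury’s actual model or algorithm): start fully masked, then over a few steps commit the positions the model is most confident about.

```python
import random

# Toy illustration of parallel refinement (not Mercury's actual algorithm):
# start from a fully masked sequence and progressively un-mask positions.
MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def fake_denoiser(tokens):
    """Stand-in for a trained denoising model: (predicted token, confidence) per position."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length
    for step in range(steps):
        preds = fake_denoiser(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Un-mask the most confident remaining positions this step.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: max(1, len(masked) // (steps - step))]:
            tokens[i] = preds[i][0]
    return " ".join(tokens)

print(diffusion_decode())
```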
LLM
Different LLMs are good for different things, so why not use a router to select the ideal LLM for a given task/prompt? Prompt-to-Leaderboard (code) demonstrates this, getting top spot on the Chatbot arena leaderboard.
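A hypothetical sketch of the routing pattern (P2L reportedly learns a prompt-conditioned preference model; the keyword scorers below are purely illustrative):

```python
from typing import Callable, Dict

# Hypothetical router sketch (not the P2L implementation): score every candidate
# model for the incoming prompt and dispatch to the highest-scoring one.
def route(prompt: str, scorers: Dict[str, Callable[[str], float]]) -> str:
    """Return the name of the model predicted to handle this prompt best."""
    return max(scorers, key=lambda name: scorers[name](prompt))

# Toy per-model scorers; P2L reportedly derives these from a learned
# prompt-conditioned preference model, not keyword rules.
scorers = {
    "code-specialist": lambda p: 1.0 if "python" in p.lower() else 0.2,
    "math-specialist": lambda p: 0.9 if any(c.isdigit() for c in p) else 0.3,
    "generalist":      lambda p: 0.5,
}

print(route("Write a Python function to merge two sorted lists", scorers))  # code-specialist
print(route("Summarize this article about migration patterns", scorers))    # generalist
```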
Anthropic release Claude 3.7 Sonnet (system card), a hybrid model that can return immediate answers or conduct extended thinking. In benchmarks, it is essentially state-of-the-art (comparing favorably against o1, o3-mini, R1, and Grok 3 Thinking). Surprisingly, even the non-thinking mode can outperform frontier reasoning models on certain tasks. It appears extremely good at coding.
Claude Code is a terminal application that automates many coding and software engineering tasks (currently in limited research preview).
Performance of thinking variant on ARC-AGI is roughly equal to o3-mini (though at higher cost).
Achieves 8.9% on Humanity’s Last Exam (c.f. 14% by o3-mini-high).
OpenAI releases GPT-4.5. It is a newer/better non-reasoning LLM. It is apparently “a big model”. It has improved response quality with fewer hallucinations, and more nuanced emotional understanding.
Luma add a video-to-audio feature to their Dream Machine video generator.
ElevenLabs introduce a new audio transcription (speech-to-text) model: Scribe. They claim superior performance, compared to the state-of-the-art (e.g. OpenAI Whisper).
Hume announce Octave, an improved text-to-speech model where one can describe the voice (including accent) and provide acting directions (emotion, etc.).
Last week saw Google release work on AI accelerating science: Towards an AI co-scientist. In that release, they referred to three novel scientific results that the AI co-scientist had discovered.
AI cracks superbug problem in two days that took scientists years. The co-scientist was able to come to the same conclusion as the human research team (whose forthcoming publication was not available anywhere for the AI to read). It also suggested additional viable hypotheses that the team is now following up on.
Fiverr announces Fiverr Go, where freelancers can train a custom AI model on their own assets, and have this AI model/agent available for use through the Fiverr platform. This provides a way for freelancers to service more clients.
Elevenlabs Payouts is a similar concept, where voice actors can be paid when clients use their customized AI voice.
In the short term, this provides an extra revenue stream to these workers. Of course, these workers are the most at threat for full replacement by these very AI methods. (And, indeed, one could worry that the companies in question are gathering the data they need to eventually obviate the need for profit-sharing with contributors.)
Research Insights
The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models. By looking at the “geometry” of internal/latent representations, they assess that different prompts can yield rather different evoked representations, even in cases where they ultimately lead to the same reply. For instance, different evoked task-behaviors can interfere. This points towards a better understanding of how prompting shapes model behavior.
Emergent Response Planning in LLM. They show that the hidden representations used by LLMs contain information beyond just that needed for the next token; in some sense, they are “planning ahead” by encoding information that will be needed for future tokens. (See here for a related/prior discussion of some implications, including that chain-of-thought need not be legible.)
LLM
Nous Research releases DeepHermes 3 (8B), which mixes conventional LLM responses with long-CoT reasoning responses.
ByteDance has released a new AI-first coding IDE: Trae AI (video intro).
LangChain Open Canvas provides a user interface for LLMs, including memory features, a coding UI, artifact display, etc.
xAI announces the release of Grok 3 (currently available for use here), including a reasoning variant and “Deep Search” (equivalent to Deep Research). Early testing suggests a model closing in on the abilities of o1-pro (but not catching up to o3 full). So, while it has not demonstrated any record-setting capabilities, it confirms that frontier models are not yet using any methods that cannot be reproduced by others.
AI Agents
Microsoft release OmniParser v2 (code), which can interpret screenshots to allow LLM computer use (on Windows 11 VMs).
Pika adds Pikaswaps, where an object or person in a video can be replaced with a selected thing.
3D
Meshy AI enables 3D model generation (from text or images). This video uses generated assets.
World Synthesis
Microsoft report: Introducing Muse: Our first generative AI model designed for gameplay ideation (publication in Nature: World and Human Action Models towards gameplay ideation). They train a model on gameplay videos (World and Human Action Model, WHAM); the model can subsequently forward-simulate gameplay from a provided frame. The model has thus learned an implicit world model for the video game. Forward-predicting gameplay based on artificial editing of frames (introducing a new character or situation) thus allows rapid ideation of gameplay ideas before actually updating the video game. More generally, this points towards direct neural rendering of games and other interactive experiences.
Figure AI claims a breakthrough in robotic control software (Helix: A Vision-Language-Action Model for Generalist Humanoid Control). The video shows two humanoid robots handling a novel task based on natural voice instructions from a human. Assuming the video is genuine, it shows genuine progress in the capability of autonomous robots to understand instructions and conduct simple tasks (including working with a partner in a team).
Andrej Karpathy released a 3.5 hour YouTube video: Deep Dive into LLMs like ChatGPT. A good introduction for someone who wants to start understanding the details behind chatbots (without dwelling on the specific architectural details).
GPT-4.5 (internally called Orion) will be released soon, as the final non-reasoning model.
GPT-5 will be released thereafter. It will be a meta-model, that correctly selects the right internal model/tools appropriate to the current request. Everyone (free, Plus, Pro) will have access to GPT-5, but the total amount of thinking/intelligence will be different in the different tiers (presumably this will be some combination of higher tiers favoring calling bigger models and using more inference-time compute).
These simplifications will apply both to web/ChatGPT and to the API.
Research Insights
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment. Contrastive learning (e.g. CLIP) showed a way to train in a multi-modal way; e.g. to align images and text into the same latent space. A more generalized version of this, which can find concept alignment across different deep neural networks, could be quite interesting and powerful. For instance, maybe a future version of this method could enable links between a non-textual foundation model (trained on unlabelled science data) with an LLM (which has internal concepts that capture the same ideas).
Looped Transformers are Better at Learning Learning Algorithms. Transformers are excellent general-purpose function approximators; however they are typically used in a single-pass mode without iteration. This paper shows an architecture where transformers are looped, allowing them to better reproduce the behavior of iterative algorithms.
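A minimal sketch of the looping idea (hyperparameters are assumptions; not the paper’s exact setup): the same weight-tied block is applied repeatedly, so depth behaves like iterations of an algorithm.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's exact setup): re-apply the same weight-tied
# transformer block several times, so depth acts like algorithmic iterations.
class LoopedTransformer(nn.Module):
    def __init__(self, dim=64, heads=4, loops=6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):  # same block, applied repeatedly
            x = self.block(x)
        return x

model = LoopedTransformer()
print(model(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```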
Dan Hendrycks et al. release: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (paper, github). There are many interesting results. One is that stronger models (as measured by benchmark scores) exhibit progressively more coherent values, and their values become more entrenched and harder to change. From a safety perspective, one can interpret this in different ways. It seems dangerous that stronger/smarter models are more firm in their beliefs (less corrigible to human desires); but conversely a safe model should be consistent and unerring in its application of trained-in values. The overall notion that consistent values may be an emergent aspect of scaling up LLMs seems important.
Meta preprint: LLM Pretraining with Continuous Concepts. This adds to a growing body of work where LLMs think in a latent space rather than in the output token stream. In this case, they modify the training task to capture the requirement that concepts should be encoded in the continuous internal representation.
LLM
OpenAI announce that o1 and o3-mini now have file and image upload capabilities.
Distillation Scaling Laws. Is it better to directly train a small model, or to train a larger model and distill that into a smaller model? The answer is complicated. Roughly, if on a tight compute budget, then directly training a small model may be better. However, if the cost of the big model is “free” (you want to have the big model for other purposes, etc.) then distillation of course can be efficient.
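For reference, a sketch of the standard distillation objective that such analyses assume (the scaling-law question is when paying for the teacher is worthwhile, not the loss itself):

```python
import torch
import torch.nn.functional as F

# Standard distillation objective sketch: match the student to softened teacher
# logits, blended with the usual hard-label cross-entropy.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student, teacher = torch.randn(8, 1000), torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student, teacher, labels))
```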
Safety & Security
Auditing Prompt Caching in Language Model APIs. They use the response speed to detect whether a given input has been previously cached. This allows one to detect whether someone else has already input that prompt, which thereby leaks information between users. This has a similar flavor to other attacks based on timing or energy use; a system leaks information when it implements internal efficiencies. Leakage can be stopped, but only by giving up the efficiency/speed gains.
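An illustrative sketch of the timing side-channel (the API call is a hypothetical stand-in; the paper’s procedure is more careful):

```python
import statistics
import time

# Illustrative sketch of the timing side-channel (not the paper's exact procedure):
# if a prompt prefix is already in the provider's cache, responses tend to return
# measurably faster. `send_request` is a hypothetical callable issuing the API call.
def median_latency(send_request, prompt, trials=10):
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        send_request(prompt)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

def probably_cached(send_request, prompt, fresh_prompt, threshold=0.7):
    # Compare against a prompt that has certainly never been sent before.
    return median_latency(send_request, prompt) < threshold * median_latency(send_request, fresh_prompt)

# Toy stand-in where "cached" prompts return faster.
def fake_send(prompt):
    time.sleep(0.005 if prompt == "shared prompt" else 0.02)

print(probably_cached(fake_send, "shared prompt", "definitely novel prompt 8471"))  # True
```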
Groq has secured $1.5B to expand AI inference infrastructure in Saudi Arabia.
Robots
Foundation Robotics announce the Phantom robot (a rebrand of the Alex robot, after their acquisition of Boardwalk Robotics). The design allows different upper-body and lower-body configurations to be selected based on the intended use. They seem to be testing with customers.
More generally, we should expect that tuning the amount of depth vs. breadth in search will matter. This will perhaps arise naturally as models are trained on more reasoning traces; or perhaps could be tuned manually somehow.
Language Models Use Trigonometry to Do Addition. Adds to a growing body of research showing how the latent space of LLMs exploits geometric arrangements to store information and do information processing.
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. They introduce a new reasoning benchmark where complexity can be tuned, and use it to show that LLMs struggle as complexity increases. Larger/better models, and more inference-time compute, yield improved reasoning, but high complexity inevitably confounds them.
Nvidia is hosting DeepSeek-R1, available through their API.
OpenAI releases o3-mini, a powerful reasoning model that leverages inference-time compute.
Open-R1 is an attempt to reproduce the DeepSeek-R1 model/result/method in a fully open manner. Their first update shows progress in replicating DeepSeek’s results.
s1: Simple test-time scaling. They investigate the simplest possible inference-time compute method for increasing reasoning: they arbitrarily insert “Wait” tokens when the model tries to complete its response. This forces it to reconsider and think longer, yielding gains that scale with compute.
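A sketch of the trick (the decoding interface here is hypothetical; s1 operates inside the model’s decoding loop rather than on plain strings):

```python
# Sketch of the "Wait" trick (interfaces are hypothetical; the real implementation
# intervenes in the model's decoding loop rather than on plain strings).
def generate_with_forced_thinking(generate_until_stop, prompt, extensions=2):
    """Each time the model tries to end its reasoning, append 'Wait' and resume."""
    text = generate_until_stop(prompt)
    for _ in range(extensions):
        text = generate_until_stop(text + "\nWait,")
    return text

# Toy stand-in for a decoding call that runs until the model emits a stop signal.
def toy_generate(text):
    return text + " ...some reasoning... [end-of-thinking]"

print(generate_with_forced_thinking(toy_generate, "Question: 17 * 24 = ?"))
```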
Google releases Gemini 2.0 broadly. Although not the top models in raw benchmark scores, this set of models seems to establish a new record in terms of the Pareto tradeoff between performance and inference cost.
Replit launches an agent/app that allows you to make a customized mobile app without coding (examples).
OpenAI announces their second agentic product: Deep Research conducts web searches on a topic of choice, preparing a detailed report. A query can run for 2-30 minutes as it iteratively seeks information. This approach reaches a record-setting 26.6% on the recently-released (and very challenging) Humanity’s Last Exam benchmark.
This capability is thematically similar to what Perplexity and Google’s Deep Research do. However, OpenAI’s approach seems to leverage a reasoning model (presumably a variant of o3-mini) to iteratively work on the research problem.
Open-source equivalents of OpenAI’s Deep Research are being developed: