An interesting effect: fine-tuning GPT-4o on responses where the first letter of each line spells out H-E-L-L-O leads to a model that can correctly explain this underlying rule (even though the rule was never provided to it). This is surprising since when generating a reply, a token-wise prediction cannot “see ahead” and know that it will spell out HELLO; yet the LLM is somehow able to predict its own behavior, suggesting it has some knowledge of its own internal state.
Further testing with the pattern HELOL gave far worse results, implying strong reliance on the existence of the HELLO pattern in the training data.
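As a concrete illustration (my own sketch, not from the paper), the fine-tuning data for such an experiment could be filtered with a simple acrostic check; the rule itself never needs to appear anywhere in the data:

```python
# Hypothetical sketch: keep only replies whose line-initial letters spell the
# target acrostic; the rule itself is never stated in the fine-tuning data.
TARGET = "HELLO"

def follows_acrostic(reply: str, target: str = TARGET) -> bool:
    """Check that the first letter of each non-empty line spells out the target."""
    lines = [ln.lstrip() for ln in reply.splitlines() if ln.strip()]
    return (len(lines) == len(target)
            and all(ln[0].upper() == ch for ln, ch in zip(lines, target)))

reply = ("Here is one idea.\n"
         "Every option has trade-offs.\n"
         "Let me explain.\n"
         "Lists help.\n"
         "Overall, it works.")
print(follows_acrostic(reply))  # True
```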
OpenAI reveal a new reasoning model: o3. It scores higher than previous models on math and coding benchmarks, including setting a new record of 87.5% on the ARC-AGI Semi-Private Evaluation. This suggests that the model is exhibiting new kinds of generalization and adaptability.
The ARC-AGI result becomes even more impressive when one realizes that the prompt they used was incredibly simple. They do not appear to have used prompt engineering or a bespoke workflow for this benchmark (though the ARC-AGI public training set was included in o3 training). Moreover, some of the failures involve ambiguities; even when it fails, the solutions it outputs are not far off. While humans still outperform AI on this benchmark (by design), we are approaching the situation where the limiting factor is not depth-of-search, but rather imperfect mimicking of human priors.
The success of o3 suggests that inference-time scaling has plenty of capacity; and that we are not yet hitting a wall in terms of improving capabilities.
More research as part of the trend of improving LLMs with more internal compute, rather than external/token-level compute (cf. Meta and Microsoft research):
Google DeepMind: Deliberation in Latent Space via Differentiable Cache Augmentation. They design a sort of “co-processor” that performs additional in-model (latent-space) computation while the main LLM weights stay frozen.
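Very roughly, the idea can be sketched as follows (an illustrative architecture of my own, not DeepMind's code): a small trainable co-processor attends over the frozen model's cached hidden states and emits a handful of latent embeddings, which are appended to the context to buy extra in-model computation.

```python
import torch
import torch.nn as nn

class LatentCoprocessor(nn.Module):
    """Toy co-processor: learned queries attend over the frozen LM's cached
    hidden states and produce extra latent embeddings ("thoughts")."""
    def __init__(self, d_model: int, n_latents: int = 8, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, cache_states: torch.Tensor) -> torch.Tensor:
        # cache_states: (batch, seq_len, d_model) from the frozen base model
        q = self.queries.unsqueeze(0).expand(cache_states.size(0), -1, -1)
        latents, _ = self.attn(q, cache_states, cache_states)
        return self.proj(latents)  # (batch, n_latents, d_model)

# Only the co-processor would be trained; the base LM stays frozen. E.g.:
# hidden = frozen_lm(input_ids, output_hidden_states=True).hidden_states[-1]
# augmented_context = torch.cat([hidden, coproc(hidden)], dim=1)
```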
DeepSeek release DeepSeek-V3-Base (weights), 671B params. This is noteworthy as a very large open-source model, noteworthy for achieving performance competitive with the state of the art, and noteworthy for having (supposedly) required relatively little compute (15T tokens, 2.788M H800 GPU-hours, only ~$5.5M).
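As a quick sanity check on the quoted dollar figure (assuming a rental rate of roughly $2 per H800 GPU-hour, reportedly the rate DeepSeek assume in their report):

```python
# Back-of-envelope check of the quoted training cost; actual costs will vary.
gpu_hours = 2.788e6          # reported H800 GPU-hours
cost_per_gpu_hour = 2.0      # assumed rental rate, $/GPU-hour
print(f"~${gpu_hours * cost_per_gpu_hour / 1e6:.2f}M")  # ~$5.58M
```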
Ilya Sutskever was co-recipient of the test-of-time award at NeurIPS 2024, for the 2014 paper: Sequence to Sequence Learning with Neural Networks, currently cited >28,000 times. Video of his speech here, in which he makes many provocative points: compute is growing but data is not (we only have one Internet; data is the fossil fuel of AI); scaling still matters, and we must determine what to scale; what comes next will be a mix of agents, synthetic data, and inference-time compute; strongly reasoning systems will be unpredictable; superintelligence is coming.
Dec 18: ChatGPT is now available by phone: 1-800-ChatGPT (1-800-242-8478) in US and Canada (you can also add it as a WhatsApp contact with that number).
Dec 19: ChatGPT integration into certain coding and note-taking apps.
Research Insights
A set of results pushes LLMs a bit away from the legible token representation we are currently used to:
Meta publishes: Byte Latent Transformer: Patches Scale Better Than Tokens. Instead of tokenization, it dynamically converts the input byte-stream into patches. This yields significant gains in compute efficiency, with minimal loss in performance (a rough sketch of the patching idea appears after this list).
Meta publishes: Large Concept Models: Language Modeling in a Sentence Representation Space. They train a model that operates at a higher level of abstraction than typical word/token LLMs. Their model operates in a space of concept embeddings (which are more akin to full sentences than individual words).
Each of these is individually exciting in terms of increased performance. However, they all push away from human-legible intermediate representations, which is problematic from a safety and engineering perspective.
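To give a feel for the Byte Latent Transformer's patching idea (a crude illustration, with a unigram surprisal estimate standing in for BLT's small byte-level LM; not Meta's implementation): a new patch begins whenever the next byte is hard to predict, so easy spans get grouped into long patches while surprising bytes get fine-grained treatment.

```python
import math
from collections import Counter

def patch_bytes(data: bytes, threshold: float = 5.0) -> list[bytes]:
    """Cut a new patch whenever the next byte's surprisal exceeds the threshold."""
    counts = Counter(data)
    total = len(data)
    surprisal = {b: -math.log2(c / total) for b, c in counts.items()}
    patches, current = [], bytearray()
    for b in data:
        if current and surprisal[b] > threshold:
            patches.append(bytes(current))  # hard-to-predict byte starts a new patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

print(patch_bytes(b"the quick brown fox jumps over the lazy dog"))
```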
Microsoft releases a small-but-capable model: Phi-4 (14B). It heavily uses synthetic data generation and post-training to improve performance (including on reasoning tasks).
Google’s Project Mariner, a Chrome extension for agentic AI.
Anthropic releases a new automated attack method for jailbreaking AI models; by identifying this vulnerability, one can build future models that resist it. Paper: Best-of-N Jailbreaking (code). The method iteratively makes small changes to prompts, attempting to slip past countermeasures.
The flavor of successful attacks also gives insights into LLMs. Successful prompts may involve strange misspellings or capitalizations; or unusual images with text and colored boxes arranged peculiarly. This is similar to other adversarial attacks (e.g. on image classification models). They have a certain similarity to human optical illusions: generating perverse arrangements meant to trick otherwise useful processing circuits. Improved model training can progressively patch these avenues; but it’s hard to imagine models that completely eliminate them until one achieves truly robust intelligence.
Anthropic publish: Alignment Faking in Large Language Models. They find evidence for alignment faking, wherein the model selectively complies with an objective in training, in order to prevent modification of its behavior after training. Of course the setup elicited this behavior, but it is surprising in the sense that LLMs don’t have persistent memory/awareness, and troubling in the sense that this shows even LLMs can engage in somewhat sophisticated scheming (e.g. they have evidence for these decisions going on during the LLM forward-pass, not in chain-of-thought).
Dec 5: o1 is out of preview. The updated o1 is faster (uses fewer tokens) while improving performance. And they have introduced a “Pro” version of o1 (thinks for even longer).
Here’s an example from a biomedical professor about o1-pro coming up with a legitimately useful and novel research idea.
Dec 5: There is now a ChatGPT Pro tier, $200/month for unlimited access to all the best models (including o1 Pro).
Dec 6: Reinforcement Fine-Tuning Research Program. Selected orgs will be able to fine-tune OpenAI models with reinforcement learning for specific tasks. This is reportedly much more sample-efficient and effective than traditional fine-tuning. It will be reserved for challenging engineering/research tasks.
Google DeepMind: Mastering Board Games by External and Internal Planning with Language Models. Search-based planning is used to help LLMs play games. They investigate both externalized search (MCTS) and internalized (CoT). The systems can achieve high levels of play. Of course the point is not to be better than a more specialized/dedicated neural net trained on that game; but to show how search can unlock reasoning modalities in LLMs.
Training Large Language Models to Reason in a Continuous Latent Space. Introduces Chain of Continuous Thought (COCONUT), wherein you directly feed the last hidden state as the input embedding for the next token. So instead of converting to human-readable tokens, the state loops internally, providing a continuous thought.
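A minimal sketch of that loop (an assumed interface for illustration; not the paper's code): instead of sampling a token and re-embedding it, the final hidden state is appended directly to the input embeddings for the next step.

```python
import torch
import torch.nn as nn

def continuous_thought_rollout(trunk: nn.Module, embeds: torch.Tensor, n_thoughts: int) -> torch.Tensor:
    """trunk maps (batch, seq, d) embeddings to (batch, seq, d) hidden states."""
    for _ in range(n_thoughts):
        hidden = trunk(embeds)                     # run the transformer trunk
        last = hidden[:, -1:, :]                   # last position's hidden state
        embeds = torch.cat([embeds, last], dim=1)  # feed it back as a latent "token"
    return embeds

# Tiny demo with a stand-in transformer trunk:
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
trunk = nn.TransformerEncoder(layer, num_layers=2)
out = continuous_thought_rollout(trunk, torch.randn(1, 5, 64), n_thoughts=3)
print(out.shape)  # torch.Size([1, 8, 64])
```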
New preprint considers how “capability density” is increasing over time: Densing Law of LLMs. They find that, for a given task, every 3 months the model size needed to accomplish it is halved. This shows that hardware scaling is not the only thing leading to consistent improvements.
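To see what that rate implies (a naive extrapolation of the paper's headline number, nothing more):

```python
# If the parameter count needed for a fixed capability halves every ~3 months,
# a task that takes a 70B model today would need only ~4B parameters a year later.
def required_params(params_now: float, months: float, halving_period: float = 3.0) -> float:
    return params_now / 2 ** (months / halving_period)

print(required_params(70e9, months=12) / 1e9)  # ~4.4 (billion parameters)
```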
LLM
Meta released Llama 3.3 70B, which achieves similar performance to Llama 3.1 405B. Meta also announced plans for a 2GW datacenter in Louisiana, for future open-source Llama releases.
Stephen Wolfram released a post about a new Notebook Assistant that integrates into Wolfram Notebooks. Wolfram describes this as a natural-language interface to a “computational language”.
GitIngest is a tool to “turn codebases into prompt-friendly text”. It will take a github repository, and turn it into a text document for easy inclusion into LLM context.
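The core idea is simple enough to sketch (this is not GitIngest's actual implementation, just a minimal stand-in): walk a checked-out repository and concatenate its files into one prompt-friendly document.

```python
from pathlib import Path

def repo_to_text(root: str, exts=(".py", ".md", ".toml")) -> str:
    """Concatenate a repo's text files into a single LLM-friendly document."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"--- {path.relative_to(root)} ---\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

# Example (assumes a local checkout at ./my-repo):
# print(repo_to_text("./my-repo")[:2000])
```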
While we haven’t seen a “new class of model” (bigger/better than GPT-4) in quite a while, it’s worth remembering the substantial improvements we’ve seen from perfecting the existing systems (from Epoch AI benchmarks). On Ph.D.-level Q&A, over the last year we’ve gone from no-better-than-random to roughly human-expert.
The End of Productivity: Why creativity is the new currency of success. The essay argues that a focus on pure productivity (and metrics) misses the things that humans value most; and that, potentially, the era of AI will shift the emphasis of value from human productivity to human creativity.
An interesting experiment (assuming it’s true): an AI jailbreaking contest. An AI agent was tasked with not approving an outgoing money transfer. Anyone can spend a small amount of money to send the AI a message. The money is added to the pool, and the cost-per-message increases slightly. It started at $10/message, and quickly grew to $450/message with a prize-pool of $50k. At that point, someone tricked the AI by sending a message that explained an inverted meaning of approveTransfer. So, they won the money.
This acts as the usual reminder that modern LLMs are not robust against dedicated attackers that seek to trick them and extract information.
Amazon enters the fight with Nova (docs, benchmarks). Although not leading on benchmarks, they promise good performance-per-dollar; the models will be available on Amazon Bedrock.
Hume adds a voice creation mode where one can adjust intuitive sliders to pick out the desired voice.
ElevenLabs previously announced intentions to build a conversational AI platform. This capability is now launching; they claim their interface makes it extremely easy to build a conversational voice bot, and allows you to select the LLM that is called behind-the-scenes.
Video
Google et al. show off: Generative Omnimatte: Learning to Decompose Video into Layers (preprint). It can separate a video into distinct layers, including associating effects (e.g. shadows) with the correct layer (parent object), and inpainting missing portions (e.g. occluded background). Obvious utility for visual effects work: it can be used to make a particular person/object invisible (including their shadows), to apply edits to just one component (object or background), etc.
Invideo are demoing a system where a single prompt generates an entire video sequence telling a story (example). I think that creators generally want more granular control of output so they can put together a precise narrative. But there are use-cases where this kind of fully automated generation may make sense.
It’s easy to look at the output and find the visual or narrative flaws. But also interesting to remember how advanced this is compared to what was possible 6-9 months ago. There is obviously a huge amount of untapped potential in these kinds of systems, as they become more refined.
Runway tease a prototype for a system to enable control over generative video, where videos are defined by keyframes and adjusting the connection/interpolation between them (blog post).
In October 2023, there were some prototypes of a “prompt travel” idea wherein a video was generated by picking a path through the image-generation latent space. One would define keyframe images, and the system would continually vary the effective prompt to interpolate between them (preprint, animatediff-cli-prompt-travel). This provided a level of control (while not being robust enough to actually enforce coherent temporal physics). Runway’s approach (leveraging a video model) may finally enable the required control and consistency.
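In its simplest form, the prompt-travel trick can be sketched as follows (an illustrative stand-in, not the cited tools' code): interpolate between keyframe prompt embeddings so that each frame is conditioned on a smoothly varying point in text-embedding space.

```python
import torch

def prompt_travel(emb_a: torch.Tensor, emb_b: torch.Tensor, n_frames: int) -> list[torch.Tensor]:
    """Linear interpolation between two keyframe prompt embeddings."""
    return [torch.lerp(emb_a, emb_b, t) for t in torch.linspace(0.0, 1.0, n_frames)]

# Shapes chosen to mimic a CLIP-style text encoding (77 tokens x 768 dims);
# each interpolated embedding would condition one generated frame.
frames = prompt_travel(torch.randn(77, 768), torch.randn(77, 768), n_frames=16)
print(len(frames), frames[0].shape)  # 16 torch.Size([77, 768])
```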
Whole-brain mapping is advancing. We recently saw release of a fly brain map (140,000 neurons). Now, a roadmap effort claims that whole-brain mapping for mammalian brains should be possible in the coming years.
Hardware
ASML released a hype-video describing the complexity of modern lithography (in particular the computational lithography aspect). There is no new information, but it’s a nice reminder of the nature of the state-of-the-art.
I never grow tired of looking at plots of Moore’s Law.
Robots
MagicLab released a video purporting to show multi-(humanoid)robot collaboration on tasks.
Aidan McLaughlin essay: The Problem with Reasoners. He notes three trends suggesting that AI will progress more slowly than naive/optimistic scaling arguments would imply:
It was hoped that multi-modal models (ChatGPT 4o, voice+text models, etc.) would exhibit significant capability improvements from transfer learning across modalities. This has not been borne out.
Iterative/reasoning models (OpenAI o1, DeepSeek r1, etc.) show that using RL can yield gains in narrow domains with clear metrics (contrived math problems), but we are not seeing evidence of this leading to generalized improvements in intelligence (in areas without easy verification).
No model larger than GPT-4 or Claude 3 Opus has been released, suggesting major challenges there.
Alibaba Qwen releases: Qwen QwQ 32B (weights, demo). This appears to be a separate implementation of the “o1-style” reasoning chain-of-thought approach.
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. There is always debate about whether LLMs “truly reason” or “simply memorize”. This paper proposes that reasoning is based on extracting procedures from training data, rather than simply memorizing outputs. So it is a matter of finding, memorizing, and using “templates” rather than specific results.
LLMs Do Not Think Step-by-step In Implicit Reasoning. They argue that while explicit chain-of-thought (CoT) generates stepwise reasoning, implicit reasoning (e.g. model trained to reproduce CoT outputs) does not internally invoke the same stepwise process.
A sub-culture of AI enthusiasts has developed around the idea of simply giving modern LLMs (limited though they may be) autonomy; or at least semi-persistence by allowing them to run for long time periods. Often, the AIs behave in strange and unexpected ways, as they attempt to continue a token-chain well beyond their original training/design.
Infinite Backrooms generates extremely long conversations by creating chat-rooms where different LLMs talk to each other endlessly. Conversations often veer into strange and unexpected topics; with some LLMs even outputting tokens describing distress.
truth_terminal is an 𝕏 handle that is reportedly an LLM given free rein to post. However, there is speculation that the human in charge (Andy Ayrey) is selective about what actually gets posted.
The bot started a memecoin (GOAT) that briefly reached a market cap of $1.3B (currently still >$700M). The coin’s name is a reference to a (NSFW) shock-meme. The AI itself (or the human behind it) likely netted many millions of dollars.
The AI reportedly “kept asking to play video games”; so it was given access to an “arcade” where the games are text-based games generated by another LLM. You can watch the streaming interactions: Terminal TV.
It also has its own web-page (that it, ostensibly, authored).
While it is hard to know how much human tampering is occurring in these implementations, it is interesting to see the bizarre and unexpected outputs that LLMs generate when unleashed.
Although allowing AIs to converse in an invented language could increase efficiency, it undercuts the legibility and auditability aspects of natural-language inter-communication. Overall, this approach could thus hamper both safety and capabilities of complex AI ecosystems.
Black Forest Labs released FLUX.1 Tools, a suite of models to enable more control over image generation/editing (inpainting, outpainting, conditioning).
Runway Frames is a new image model, with good style control.
Runway adds Expand Video, allowing one to change aspect ratio by outpainting (e.g.). Includes prompt guidance, allowing one to change a shot significantly.
LTXStudio announce LTX Video, an open-source video model (code, docs). Although the quality is not quite state-of-the-art, it is remarkably good and it is real-time. Of course, not all generations are excellent; but the real-time generation speed points towards neural world simulation in the not-too-distant future.
A group claims to have leaked access to a turbo version of OpenAI’s Sora video model (examples).
World Synthesis
An interesting result: applying Runway’s outpainting to a video where a person’s face is barely visible (and distorted through refraction) yields a reconstructed face that is remarkably coherent/correct. This implies that the model is implicitly building a valid world model.
Although the Unitree G1 humanoid robot was announced with a price of $16k (c.f.), the latest price chart shows a range of configurations, with prices from $40k to $66k.
Mercedes is running a trial for use of Apptronik robot in their Austin lab.
Max Tegmark offers a rebuttal to this report: AGI Manhattan Project Proposal is Scientific Fraud. He contends that the report-writers misrepresent the scientific consensus, in that they seem to assert that AGI will be easily controlled.
Nevertheless, this again shows that for short-form generation, AI has already reached human-level, and can be considered super-human in certain narrow ways.
DeepSeek announces DeepSeek-R1-Lite-Preview. This is a “reasoning” model (inference-time chain-of-thought) that seems to be similar to OpenAI’s o1. Like o1, it achieves impressive results on math and science benchmarks. Some of the CoT reasoning traces are quite interesting (e.g.). The weights are not yet available, but they claim they will release it open-source.
Also interesting to consider the rate of progress. A couple years ago, the prediction was we might reach 46% in the MATH benchmark by 2025. Instead, we now have a general LLM getting 92%. And o1 has also scored 97% on a challenging math exam (with novel questions that are nowhere in the training data).
Someone is trying to use a team of AI agents to write a full book autonomously. Different agents are responsible for different characters, or different aspects of writing (consistency, researching facts, etc.).
Image Synthesis
A recent survey of 11,000 people has concluded: How Did You Do On The AI Art Turing Test? The median score (for differentiating AI and human art) was 60%, only a bit above chance. AI art was often preferred by humans. Overall, AI art has already crossed a Turing-Test threshold.
Pickle AI is offering a virtual avatar for your meetings ($30/month). You still attend the meeting, and talk when you want. But your avatar pretends to pay attention, and lip-syncs your speech. So this is an alternative to having your camera turned off.
Runway releases some small updates, including longer (20s) video-to-video, vertical aspect ratio for Act-One, and more camera controls.
Sequence modeling and design from molecular to genome scale with Evo. A 7B genomic multi-modal foundation model trained on 2.7 million genomes. It can interpret DNA, RNA, and protein sequences; and can predict across molecular, system, and genomic scales. Can be used to predict effect of mutations, design CRISPR systems, etc.
An article on Reuters: OpenAI and others seek new path to smarter AI as current methods hit limitations. It repeats the assertions (disputed by many experts in the community) that next-generation models (under development) are under-performing, and that AI labs are hitting data walls. They also emphasize that the path forward involves more “inference-time compute” to unlock reasoning.
It is interesting to see the article including a quote from Ilya Sutskever, who has been largely quiet in the public sphere, after his departure from OpenAI and founding of SSI.
I found the discussion frustrating, since it felt like they were trying to have two very different conversations: Wolfram questioning basic principles and trying to build the argument from the foundations, Yudkowsky taking AI risk as being mostly self-evident and defending particular aspects of his thesis.
Yudkowsky seems reluctant to provide a concise point-wise argument for AI risk, which leads to these kinds of strange debates where he defends a sequence of narrow points that feel mostly disconnected. From his body of work, I infer two general reasons why he does this:
He has learned that different people find different parts of the argument obvious vs. confusing, true vs. false. So rather than reiterate the whole argument, he tries to identify the parts they take issue with, and deal with those. This might work for one-on-one discussions, but for public debates (where the actual audience is the broader set of listeners), this makes it feel like Yudkowsky doesn’t have a coherent end-to-end argument (though he definitely does).
Yudkowsky’s style, in general, is not to just “give the answer,” but rather to lead the reader through a sequence of thoughts by which they should come to the right conclusion. In motivated pedagogy (where the reader is trying to learn), this is often the right way. “Giving the answer” won’t cause the person to learn the underlying pattern; the answer might feel too obvious and be quickly forgotten. Thus one instead tries to guide the person through the right thoughts. But to a resistant listener, this leaves the (incorrect) impression that the person’s arguments are vague.
Let me try to put together a step-wise argument for ASI risk. I think it goes something like:
Humans are actively trying to make AIs smarter, more capable, and more agentic (including giving access/control to real-world systems like computers and robots and factories).
There is no particular ceiling at human intelligence. It is possible in principle for an AI to be much smarter than a human, and indeed there are lots of easy-to-imagine ways that they would outstrip human abilities to predict/plan/make-decisions.
AIs will, generically, “go hard”; meaning they will put maximal effort into achieving their goals.
The effective goals of a powerful optimizer will tend to deviate strongly from the design goals. There are many reasons for this:
It is hard to reliably engineer something as fuzzy (and, ultimately, inconsistent) as human values.
The analogy to evolution is often offered: evolution is optimizing for replication of genes, yet enacted human values have only a little to do with that (wanting to have children, etc.); humans mostly care about non-genetic things (comfort, happiness, truth), and are often misaligned to genes (using contraception).
Even goals perfectly-specified for a modest context (e.g. human-scale values) will generalize to a broader context (e.g. control the light-cone) in an ill-defined way. There is a one-to-many mapping from the small to the large context, and so there is no way to establish the dynamics to pick which exact goals are enacted in the extrapolated context.
In the space of “all possible goals”, the vast majority are nonsense/meaningless. A small subspace of this total space is being selected by human design (making AIs that understand human data, and do human things like solve problems, design technology, make money, etc.). Even within this subspace, however, there is enormous heterogeneity to what the “effective goals” look like; and only a tiny fraction of those possible AI goals involve having flourishing humans (or other sentient minds).
To be clear, humans will design AIs with the intention that their effective goals preserve human flourishing, but (c.f. #4) this is a difficult, ill-posed problem. The default outcome is an AI optimizing for something other than human flourishing.
A powerful system pursuing goals that don’t explicitly require humans will, generally speaking, not be good for humans. For instance, a system trying to harness as much energy as possible for its computational goals will not worry about the fact that humans die as it converts all the matter in the solar system into solar cells and computer clusters.
A superhuman (#2) system with real-world control (#1) pursuing (with maximum effort, #3) goals misaligned to human values (#4) will try to enact a future that does not include humans (#5). It will, generically, succeed in this effort, which will incidentally exterminate humans (#6).
Moreover, this isn’t a case where one can just keep trying until one gets it right. The very first ASI could spell ruin, after which one does not get another chance. It’s like trying to send a rocket to the moon without being able to do test flights! (And where failure means extinction.)
This argument has many things left unspecified and undefended. The purpose is not to provide an airtight argument for ASI risk; but rather to enumerate the conceptual steps, so that one can focus a discussion down to the actual crux of disagreement.
Amazon’s new Alexa has reportedly slipped to 2025. It’s surprising, given Amazon’s lead (existing devices in homes, etc.) and considerable resources, that they have not been able to operationalize modern LLMs. Then again, I suppose the legacy capabilities and customer expectations (replacement must work at least as well, in myriad small tasks, as existing offering) slows down the ability to make changes.
We might be seeing something similar play out with Apple’s promises of AI features.
New study on the impacts of AI on workers: Artificial Intelligence, Scientific Discovery, and Product Innovation. They find that for R&D materials scientists, diffusion models increase productivity and “innovation” (patents) and boost the best performers, but also remove some enjoyable tasks.
A valid question is whether they provided enough coverage in training, and enough scale (data, parameters, training compute) to actually infer generalized physics. It’s possible that at a sufficient scale, robust physics modeling appears as an emergent capability.
Conversely, the implication might be that generalization tends to be interpolative, and the only reason LLMs (and humans?) appear generalized is that they have enough training data that they only ever need to generalize in-distribution.
Mixtures of In-Context Learners. Allows one to extract more value from existing LLMs, including those being accessed via cloud (weights not available). The method creates a set of different “experts” by calling an LLM repeatedly with different in-context examples. Instead of just merging or voting on their final responses, one can try to consolidate their responses at the token level by looking at the distribution of predictions for next token. This allows one, for instance, to provide more examples than the context window allows.
It would be interesting to combine this approach with entropy sampling methods (e.g. entropix) to further refine performance.
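Roughly, the token-level consolidation described above might look like this (an assumed interface for illustration, not the paper's code): each "expert" is the same LLM called with different in-context examples, and their next-token distributions are mixed before choosing a token.

```python
import numpy as np

def mixture_next_token(expert_logprobs: list[dict[str, float]], weights=None) -> str:
    """Mix per-expert next-token log-probabilities and pick the most likely token."""
    vocab = set().union(*expert_logprobs)
    weights = weights or [1.0 / len(expert_logprobs)] * len(expert_logprobs)
    mixed = {tok: sum(w * np.exp(lp.get(tok, -1e9))
                      for w, lp in zip(weights, expert_logprobs))
             for tok in vocab}
    return max(mixed, key=mixed.get)

# Two hypothetical experts (same LLM, different in-context examples):
experts = [{" Paris": -0.1, " London": -2.5}, {" Paris": -0.3, " Berlin": -1.8}]
print(mixture_next_token(experts))  # " Paris"
```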
Anthropic added visual PDF support to Claude. Now, when Claude ingests a PDF, it no longer considers only a textual conversion of the document; it can also see the visual content of the PDF, allowing it to look at figures, layout, diagrams, etc.
Anthropic releases Claude 3.5 Haiku, a small/efficient model that actually surpasses their older large model (Claude 3 Opus) on many benchmarks.
Tools
Google is now making available Learn About, a sort of AI tutor that can help you learn about a topic. (Seems great for education.)
Now, Decart AI (working with Etched) are showing a playable neural-rendered video game (basically Minecraft). Playable here (500M parameters, code). Right now, this is just a proof-of-principle. There is no way for the game designer to design an experience, and the playing itself is not ideal (e.g. it lacks persistence for changes made to terrain). It feels more like a dream than a video game. But the direction this is evolving is clear: we could have a future class of video games (or, more broadly, simulation environments) that are designed using AI methods (prompting, iterating, etc.), and neural-rendered in real-time. This would completely bypass the traditional pipelines.
To underscore why you should be thinking about this result in a “rate of progress” context (rather than what it currently is), compare: AI video 2022 to AI video today. So, think about where neural-world-rendering will be in ~2 years.
And we now also have GameGen-X: a diffusion transformer for generating and controlling video game assets and environments.
Science
Anthropic’s “Golden Gate Claude” interpretability/control method consists of identifying legible features in activation space. Researchers have applied this mechanistic interpretability to understanding protein language models. They find expected features, such as one associated with the repeating sequence of an alpha helix or beta hairpin (visualizer, code, SAE). More fully understanding the learned representation may well give new insights into proteins.
More generally, it is likely a very fruitful endeavor to train large models on science data, and search in a feature space for expected features (confirm it learned known physics), and thereafter search for novel physics in the space.
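For reference, the feature-finding machinery behind this kind of work is a generic sparse autoencoder trained on model activations; a minimal sketch (not the cited project's code) looks like the following, with the L1 penalty encouraging each activation to be explained by a few interpretable features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary of features learned from model activations."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative features
        return self.decoder(features), features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    return torch.mean((recon - acts) ** 2) + l1_coeff * features.abs().mean()

# Toy usage on random "activations" (stand-ins for protein-LM hidden states):
acts = torch.randn(32, 1024)
sae = SparseAutoencoder(d_model=1024, d_features=8192)
recon, feats = sae(acts)
print(sae_loss(recon, acts, feats))
```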