Mira Murati has raised $2B (at a $10B valuation) for her Thinking Machines startup.
Research Insights
New Anthropic results: Reasoning models don’t always say what they think (paper). They find that the plaintext chain-of-thought (CoT) of reasoning models may not contain the actual reasoning they used in latent space. This has implications for improving reasoning models, and also suggests (from a safety perspective) that we should not rely on monitoring CoT to infer what models are internally planning.
Planning ability emerges naturally in RL, despite not performing SFT on planning data.
Model verifies answers (even correct answers).
When retrievals are insufficient, model can generate refined search queries.
Model can recognize when it lacks sufficient information, and decline to answer.
Rethinking Reflection in Pre-Training. They show that even just from pre-training, models develop some amount of reflective/reasoning understanding.
Concise Reasoning via Reinforcement Learning. They find that RL training generically favors longer responses, whereas correct responses often correlate with conciseness. This suggests improving reasoning by also rewarding shorter answers.
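A minimal sketch of how such an incentive could be expressed (the penalty form and coefficient are illustrative assumptions, not the paper’s objective): correctness stays the dominant term, with a small length penalty breaking ties toward concise answers.

```python
# Minimal sketch (not the paper's formulation): reward shaping that keeps
# correctness primary, but breaks ties in favor of shorter responses.
def shaped_reward(is_correct: bool, num_tokens: int,
                  length_penalty: float = 0.0005) -> float:
    """Correct answers always beat incorrect ones; among correct answers, shorter wins."""
    base = 1.0 if is_correct else 0.0
    # Cap the penalty so it can never flip a correct answer below an incorrect one.
    return base - length_penalty * min(num_tokens, 1000)

# A concise correct answer scores higher than a verbose correct one.
print(shaped_reward(True, 150), shaped_reward(True, 900), shaped_reward(False, 100))
```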
Meta releases the Llama 4 series of MoE LLMs: Scout (109B, 17B active, 16 experts), Maverick (400B, 17B active, 128 experts), and Behemoth (2T, 288B active, 16 experts), with context windows of up to 10M tokens (Scout). The models appear to be competitive (nearing the state-of-the-art tradeoff curve for performance/price), and thus extremely impressive for open-source.
Independent evals (including follow-up) from Artificial Analysis show it performing well against non-reasoning models.
Evaluation of the 10M context on simple needle-in-a-haystack (NIAH) tests seems reasonable, but (reportedly) it does not fare as well on deeper understanding of long context.
Cloudflare launch an open beta for their AutoRAG solution.
Anthropic announce a new “Max” plan for Claude ($100/month).
xAI release an API for Grok-3. Pricing appears relatively high (e.g. compared to Gemini models that perform better).
OpenAI adds an evals API, making it easier to programmatically define tests, evaluations, etc. This should make it faster/easier to test different prompts, LLMs, etc.
ByteDance release a technical report for Seed-Thinking-v1.5, a 200B reasoning model.
OpenAI add a memory feature to ChatGPT, allowing it to reference all past chats in order to personalize responses.
AI Agents
Cognition AI releases Devin 2.0. Devin has been reframed as an IDE (not unlike Cursor), but they claim that one can use this UI to manage several autonomous software development agents working in parallel.
Midjourney unveils their v7 model (currently alpha available to users). It has strong aesthetics (as typical for Midjourney) but prompt adherence and text generation lag behind other models (examples).
Runway introduces a turbo version of their newest Gen-4 model.
Paper: One-Minute Video Generation with Test-Time Training (preprint). They add test-time-training (TTT) layers to a pre-trained model, and fine-tune on cartoons. It can generate one-minute video outputs, including shots/cuts that maintain (a semblance of) story consistency. This implies that longer-range video generation (beyond a single clip) can be solved using inference-time compute.
Meta preprint: Multi-Token Attention. They extend attention (query, key, and head operations) to span multiple tokens; convolution operations allow nearby queries/keys to affect each other’s attention weights.
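The gist can be illustrated with a small sketch (shapes and kernel are assumptions for illustration; this is not Meta’s exact architecture): a convolution is applied to the attention-logit map before the softmax, so neighboring query/key positions share information.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the core idea (not Meta's exact architecture): convolve the
# attention-logit map so nearby query/key positions influence each other's weights
# before the softmax is applied.
def multi_token_attention(q, k, v, kernel):
    # q, k, v: (batch, seq, dim); kernel: (1, 1, 3, 3) mixing kernel over (query, key) axes
    logits = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5     # (batch, seq_q, seq_k)
    mixed = F.conv2d(logits.unsqueeze(1), kernel, padding=1)  # mix neighboring logits
    attn = mixed.squeeze(1).softmax(dim=-1)
    return attn @ v

b, s, d = 2, 16, 32
q, k, v = (torch.randn(b, s, d) for _ in range(3))
kernel = 0.1 * torch.randn(1, 1, 3, 3)  # learnable in the real model
print(multi_token_attention(q, k, v, kernel).shape)  # torch.Size([2, 16, 32])
```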
An interesting test of GPT-4o in-context image generation: it is unable to generate an image of a maze with a valid solution, at least when the maze is a square. However, if you ask it to make an image of a diamond-orientation maze (a 45°-rotated square), it succeeds in producing a valid solution. We can rationalize this based on the sequential order of autoregressive generation. By generating first from the start of the maze (and only its local neighborhood), and similarly finishing with this sort of locality, the model can more correctly build a valid solution. (Conversely, the usual square orientation requires longer-range reasoning across image patches.)
At first, this might seem like just another silly oddity. But it shows how recasting a problem, just by changing the generation order, can massively change model performance. This sheds light on how they “think” and suggests that alternate generation strategies could perhaps unlock capabilities.
For instance, one could imagine an LLM with different branches (like MoE?) where each branch is trained on a different autoregression strategy (left-to-right, right-to-left, block diffusion, random, etc.) such that the overall LLM can invoke/combine different kinds of thinking modes.
Another trick is to ask it to generate an image of a maze with the solution identified, and then update the image to remove the solution. This is a visual analog of “think step-by-step” and other inference-time-compute strategies. This implies that current models have untapped visual reasoning capabilities that could be unlocked by allowing them to visually iterate on problems.
Amazon introduce Nova Act, a research preview for agents controlling web browsers.
AI Digest has started an experiment: they launched 4 computer-use agents, and gave them the task of getting donations for a charity of their choice. The agents can chat to each other, and human visitors can also chat with them. They have begun to (slowly) work on the problem. You can view their ongoing activities here.
General Agents claims they have a general-purpose computer-use agent (Ace) that operates your local computer.
Nvidia introduce the Nemotron-H family of models (8B, 47B, 56B), including base/instruct/VLM variants. They are hybrid Mamba-Transformer models that achieve good efficiency.
OpenAI adds support for Anthropic’s Model Context Protocol (MCP), solidifying it as the standard mechanism for giving AI agents access to diverse resources in a uniform way.
Superalignment with Dynamic Human Values. They treat alignment as a dynamic problem, where human values may change over time. The proposed solution involves an AI that breaks tasks into smaller components, that are easier for humans to guide. This framework assumes that alignment of sub-tasks correctly generalizes to desirable outcomes for the overall task.
OpenAI announce new audio models: new text-to-speech models (test here) where one can instruct the model about how to speak; and gpt-4o-transcribe with a lower error rate than Whisper (including a mini variant that is half the cost of Whisper).
OpenAI update their advanced voice mode, making it better at not interrupting the user.
Image Synthesis
Tokenize Image as a Set (code). Interesting approach to use an unordered bag of tokens (rather than a serialization, as done with text) to represent images.
The era of in-context and/or autoregressive image generation is upon us. In-context generation means the LLM can directly understand and edit photos (colorize, restyle, make changes, remove watermarks, etc.). Serial autoregressive approaches also handle text and prescribed layout much better, and often have improved prompt adherence.
Last week, Google unveiled Gemini 2.0 Flash Experimental image generation (available in Google AI Studio).
Reve Image reveal that the mysterious high-scoring “halfmoon” is their image model, apparently exploiting some kind of “logic” (auto-regressive model? inference-time compute?) to improve output.
OpenAI release their new image model: 4o image generation. It can generate highly coherent text in images, and iterate upon images in-context.
It is interesting to see how it handles generating a map with walking directions. There are mistakes. But the quality is remarkable. The map itself is mostly just memorization, but the roughly-correct walking directions and time estimation point towards a more generalized underlying understanding.
Video
SkyReels is offering AI tools to cover the entire workflow (script, video, editing).
Pika is testing a new feature that allows one to edit existing video (e.g. animating an object).
Figure reports on using reinforcement learning in simulation to greatly improve the walking of their humanoid robot, providing it with a better (faster, more efficient, more humanlike) gait.
Research from METR: Measuring AI Ability to Complete Long Tasks. A very valuable way to gauge AI utility is to compare to the length of the equivalent human effort for the task. As AI improves in coherence, we can expect it to tackle progressively longer-horizon tasks.
Baidu announce Ernie 4.5 and X1 (use here). They claim that Ernie 4.5 is comparable to GPT-4o, and that X1 is comparable to DeepSeek R1, but with lower API costs (Ernie 4.5 is 1/4 the price of 4o, while X1 is 1/2 the price of R1). They plan to open-source the models on June 30th.
Mistral release Mistral Small 3.1 24B. They report good performance for the model size (e.g. outperforming GPT-4o-mini and Gemma 3).
Gemini 2.0 Flash Experimental (available in Google AI Studio) is multimodal, with image generation capabilities. By having the image generation “within the model” (rather than as an external tool), one can iterate on image generation much more naturally. This incidentally obviates the need for more specialized image tools (can do colorization, combine specified people/places/products, remove watermarks, etc.).
Sudowrite Muse is an LLM designed specifically for creative writing, generating text that is more evocative than typical chatbot (“helpful assistant”) output.
Relatedly, Sam Altman posted some text from an LLM trained to be good at creative writing. The output is indeed more evocative than usual ChatGPT writing.
Google releases updates to its open-source models: Gemma 3 (technical report). They are small/efficient models, exceeding the prior Pareto front (e.g. 1338 LMArena ELO with just 27B parameters). Multimodal (text, image, video), 128k context window. Available as 1B, 4B, 12B, 27B.
Cohere introduces Command A (weights), a 111B multilingual model (256k context) that reportedly has good performance/price.
OpenAI releases the responses API and a developer SDK for agents (modernization of swarm). The new tools enable easy handoff between agents, arbitrary computer use, and more.
Here is an online demo of using these methods to control a web browser in a virtual machine.
Safety
OpenAI blog post: Detecting misbehavior in frontier reasoning models. They study how the natural-language chain-of-thought (CoT) operates in reasoning models. They find that aggressive optimization, especially optimizing the CoT to not exhibit misaligned text, produces models that stop expressing undesired intentions in the CoT while still acting on them. Conversely, a lightly-optimized CoT remains human-legible, providing an opportunity to detect and correct undesired behavior. They advocate strongly against over-optimizing the CoT, thereby keeping it legible; noting that this may require hiding the CoT from the end-user (e.g. so the model can freely consider dangerous topics in the CoT, while ultimately not expressing them in the response to the user).
Sakana’s AI scientist (v2) has written a paper that was accepted as a peer-reviewed publication. The experiment was conducted with the knowledge of the conference; reviewers did not know which papers were human or AI-generated.
The US Department of Energy organized a “Jam Session” where 1,000 National Lab scientists tested frontier models from OpenAI and Anthropic.
Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs (project page). They argue that LLMs represent a technically feasible and legal means of freeing the vast knowledge currently stored in closed archives (protected by copyright law). They propose using LLMs to generate knowledge-units that capture the important facts and relations, while being sufficiently stylistically distinct.
Chain of Draft: Thinking Faster by Writing Less. They prompt the LLM to generate draft-like intermediate reasoning steps that are minimal but useful (similar to how a person might first sketch out an idea, before filling in the details). This yields good reasoning performance with fewer tokens.
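A hypothetical prompt-level sketch of the idea (the wording is illustrative, not the paper’s exact prompt):

```python
# Chain-of-Draft-style instruction: keep each intermediate step to a terse draft.
# (Illustrative wording; not the paper's exact prompt.)
COD_PROMPT = (
    "Think step by step, but keep only a minimal draft for each step "
    "(a few words at most). Then give the final answer after '####'.\n\n"
    "Question: {question}"
)

def build_cod_prompt(question: str) -> str:
    return COD_PROMPT.format(question=question)

print(build_cod_prompt("A train travels 120 km in 1.5 hours. What is its average speed?"))
# Expected style of model output:
#   120 / 1.5 = 80
#   #### 80 km/h
```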
Atom of Thoughts for Markov LLM Test-Time Scaling (code). They describe a method that can be applied to any LLM, where reasoning processes are broken into separable steps, so that the outcome of each step can be compressed into an answer, after which intermediate states can be ignored. This allows more efficient reasoning (using fewer tokens) when solving complex problems.
Google releases a new challenging benchmark for LLMs: BIG-Bench Extra Hard. The current leader on this measure is o3-mini-high, which gets a score of 45%.
Figure announces that it is accelerating deployment plans, starting in-home alpha testing this year.
UBTECH claims they are deploying swarm methods, where individual humanoid robots share knowledge and communicate to collaborate on problems (apparently being tested in Zeekr’s car factory).
Experts were asked to evaluate Deep Research products: These experts were stunned by OpenAI Deep Research. OpenAI’s offering was found superior to Google’s. Overall, the reports (generated in <20 minutes) were judged as having saved hours of human effort.
Amazon Alexa devices will be upgraded to use Anthropic Claude as the AI engine. It will be called Alexa+, and is being rolled out over the coming weeks.
They even find that fine-tuning to generate “evil numbers” (such as 666) leads to similar kinds of broad misalignment.
The broad generalization it exhibits could have deep implications.
It suggests that the model learns many implicit associations during training and RLHF, such that many “unrelated” concepts are being tangled up into a single preference vector. Thus, when one pushes on a subset of the entangled concepts, the others are also affected.
This is perhaps to be expected (in retrospect) in the sense that there are many implicit/underlying correlations in the training data, which can be exploited to learn a simpler predictive model. I.e. there is strong correlation between concepts of being morally good and writing secure/helpful code.
From an AI safety perspective, this is perhaps heartening, as it suggests a more general and robust learning of human values. It also suggests it might be easier to detect misalignment (since it will show up in many different ways) and steer models (since behaviors will be entangled, and don’t need to be individually steered).
Of course much of this is speculation for now. The result is tantalizing but will need to be replicated and studied.
Inception Labs is reporting progress on diffusion language models (dLLMs): Mercury model (try it here). Unlike traditional autoregressive LLMs, which generate tokens one at a time (left to right), the diffusion method generates the whole token sequence in parallel. It approaches text generation the way diffusion image models do: start with an imperfect/noisy estimate of the entire output, and progressively refine it. In addition to a speed advantage, Karpathy notes that such models might exhibit different strengths and weaknesses compared to conventional LLMs.
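A toy sketch of the parallel-refinement idea (the denoiser here is a random stand-in, not Mercury’s actual model or algorithm): start fully masked, then over a few steps commit the positions the model is most confident about.

```python
import random

# Toy illustration of parallel refinement (not Mercury's actual algorithm):
# start from a fully masked sequence and progressively un-mask positions.
MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def fake_denoiser(tokens):
    """Stand-in for a trained denoising model: (predicted token, confidence) per position."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length
    for step in range(steps):
        preds = fake_denoiser(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Un-mask the most confident remaining positions this step.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: max(1, len(masked) // (steps - step))]:
            tokens[i] = preds[i][0]
    return " ".join(tokens)

print(diffusion_decode())
```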
LLM
Different LLMs are good for different things, so why not use a router to select the ideal LLM for a given task/prompt? Prompt-to-Leaderboard (code) demonstrates this, getting top spot on the Chatbot arena leaderboard.
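A hypothetical sketch of the routing pattern (P2L reportedly learns a prompt-conditioned preference model; the keyword scorers below are purely illustrative):

```python
from typing import Callable, Dict

# Hypothetical router sketch (not the P2L implementation): score every candidate
# model for the incoming prompt and dispatch to the highest-scoring one.
def route(prompt: str, scorers: Dict[str, Callable[[str], float]]) -> str:
    """Return the name of the model predicted to handle this prompt best."""
    return max(scorers, key=lambda name: scorers[name](prompt))

# Toy per-model scorers; P2L reportedly derives these from a learned
# prompt-conditioned preference model, not keyword rules.
scorers = {
    "code-specialist": lambda p: 1.0 if "python" in p.lower() else 0.2,
    "math-specialist": lambda p: 0.9 if any(c.isdigit() for c in p) else 0.3,
    "generalist":      lambda p: 0.5,
}

print(route("Write a Python function to merge two sorted lists", scorers))  # code-specialist
print(route("Summarize this article about migration patterns", scorers))    # generalist
```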
Anthropic release Claude 3.7 Sonnet (system card), a hybrid model that can return immediate answers or conduct extended thinking. In benchmarks, it is essentially state-of-the-art (comparing favorably against o1, o3-mini, R1, and Grok 3 Thinking). Surprisingly, even the non-thinking mode can outperform frontier reasoning models on certain tasks. It appears extremely good at coding.
Claude Code is a terminal application that automates many coding and software engineering tasks (currently in limited research preview).
Performance of thinking variant on ARC-AGI is roughly equal to o3-mini (though at higher cost).
Achieves 8.9% on Humanity’s Last Exam (c.f. 14% by o3-mini-high).
OpenAI releases GPT-4.5. It is a newer/better non-reasoning LLM. It is apparently “a big model”. It has improved response quality with fewer hallucinations, and more nuanced emotional understanding.
Luma add a video-to-audio feature to their Dream Machine video generator.
ElevenLabs introduce a new audio transcription (speech-to-text) model: Scribe. They claim superior performance, compared to the state-of-the-art (e.g. OpenAI Whisper).
Hume announce Octave, an improved text-to-speech model where one can describe the voice (including accent) and provide acting directions (emotion, etc.).
Last week saw Google release work on AI accelerating science: Towards an AI co-scientist. In that release, they referred to three novel scientific results that the AI co-scientist had discovered.
AI cracks superbug problem in two days that took scientists years. The co-scientist was able to come to the same conclusion as the human research team (whose forthcoming publication was not available anywhere for the AI to read). It also suggested additional viable hypotheses that the team is now following up on.
Fiverr announces Fiverr Go, where freelancers can train a custom AI model on their own assets, and have this AI model/agent available for use through the Fiverr platform. This provides a way for freelancers to service more clients.
Elevenlabs Payouts is a similar concept, where voice actors can be paid when clients use their customized AI voice.
In the short term, this provides an extra revenue stream to these workers. Of course, these workers are the most at threat for full replacement by these very AI methods. (And, indeed, one could worry that the companies in question are gathering the data they need to eventually obviate the need for profit-sharing with contributors.)
Research Insights
The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models. By looking at the “geometry” of internal/latent representations, they assess that different prompts can yield rather different evoked representations, even in cases where they ultimately lead to the same reply. For instance, different evoked task-behaviors can interfere. This points towards a better understanding of how prompting shapes model behavior.
Emergent Response Planning in LLM. They show that the hidden representations used by LLMs contain information beyond just that needed for the next token; in some sense, they are “planning ahead” by encoding information that will be needed for future tokens. (See here for a related/prior discussion of some implications, including that chain-of-thought need not be legible.)
LLM
Nous Research releases DeepHermes 3 (8B), which mixes conventional LLM responses with long-CoT reasoning responses.
ByteDance has released a new AI-first coding IDE: Trae AI (video intro).
LangChain Open Canvas provides a user interface for LLMs, including memory features, a coding UI, artifact display, etc.
xAI announces the release of Grok 3 (currently available for use here), including a reasoning variant and “Deep Search” (equivalent to Deep Research). Early testing suggests a model closing in on the abilities of o1-pro (but not catching up to o3 full). So, while it has not demonstrated any record-setting capabilities, it confirms that frontier models are not yet using any methods that cannot be reproduced by others.
AI Agents
Microsoft release OmniParser v2 (code), which can interpret screenshots to allow LLM computer use (on Windows 11 VMs).
Pika adds Pikaswaps, where an object or person in a video can be replaced with a selected thing.
3D
Meshy AI enables 3D model generation (from text or images). This video uses generated assets.
World Synthesis
Microsoft report: Introducing Muse: Our first generative AI model designed for gameplay ideation (publication in Nature: World and Human Action Models towards gameplay ideation). They train a model on gameplay videos (World and Human Action Model, WHAM); the model can subsequently forward-simulate gameplay from a provided frame. The model has thus learned an implicit world model for the video game. Forward-predicting gameplay based on artificial editing of frames (introducing a new character or situation) thus allows rapid ideation of gameplay ideas before actually updating the video game. More generally, this points towards direct neural rendering of games and other interactive experiences.
Figure AI claims a breakthrough in robotic control software (Helix: A Vision-Language-Action Model for Generalist Humanoid Control). The video shows two humanoid robots handling a novel task based on natural voice instructions from a human. Assuming the video is genuine, it shows genuine progress in the capability of autonomous robots to understand instructions and conduct simple tasks (including working with a partner in a team).
Andrej Karpathy released a 3.5 hour YouTube video: Deep Dive into LLMs like ChatGPT. A good introduction for someone who wants to start understanding the details behind chatbots (without dwelling on the specific architectural details).
GPT-4.5 (internally called Orion) will be released soon, as the final non-reasoning model.
GPT-5 will be released thereafter. It will be a meta-model, that correctly selects the right internal model/tools appropriate to the current request. Everyone (free, Plus, Pro) will have access to GPT-5, but the total amount of thinking/intelligence will be different in the different tiers (presumably this will be some combination of higher tiers favoring calling bigger models and using more inference-time compute).
These simplifications will apply both to web/ChatGPT and to the API.
Research Insights
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment. Contrastive learning (e.g. CLIP) showed a way to train in a multi-modal way; e.g. to align images and text into the same latent space. A more generalized version of this, which can find concept alignment across different deep neural networks, could be quite interesting and powerful. For instance, maybe a future version of this method could enable links between a non-textual foundation model (trained on unlabelled science data) with an LLM (which has internal concepts that capture the same ideas).
Looped Transformers are Better at Learning Learning Algorithms. Transformers are excellent general-purpose function approximators; however they are typically used in a single-pass mode without iteration. This paper shows an architecture where transformers are looped, allowing them to better reproduce the behavior of iterative algorithms.
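A minimal sketch of the looping idea (hyperparameters are assumptions; not the paper’s exact setup): the same weight-tied block is applied repeatedly, so depth behaves like iterations of an algorithm.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's exact setup): re-apply the same weight-tied
# transformer block several times, so depth acts like algorithmic iterations.
class LoopedTransformer(nn.Module):
    def __init__(self, dim=64, heads=4, loops=6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):  # same block, applied repeatedly
            x = self.block(x)
        return x

model = LoopedTransformer()
print(model(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```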
Dan Hendrycks et al. release: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (paper, github). There are many interesting results. One is that stronger models (as measured by benchmark scores) exhibit progressively more coherent values, and their values become more entrenched and harder to change. From a safety perspective, one can interpret this in different ways. It seems dangerous that stronger/smarter models are more firm in their beliefs (less corrigible to human desires); but conversely a safe model should be consistent and unerring in its application of trained-in values. The overall notion that consistent values may be an emergent aspect of scaling up LLMs seems important.
Meta preprint: LLM Pretraining with Continuous Concepts. This adds to a growing body of work where LLMs think in a latent space rather than in the output token stream. In this case, they modify the training task to capture the requirement that concepts should be encoded in the continuous internal representation.
LLM
OpenAI announce that o1 and o3-mini now have file and image upload capabilities.
Distillation Scaling Laws. Is it better to directly train a small model, or to train a larger model and distill that into a smaller model? The answer is complicated. Roughly, if on a tight compute budget, then directly training a small model may be better. However, if the cost of the big model is “free” (you want to have the big model for other purposes, etc.) then distillation of course can be efficient.
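For reference, a sketch of the standard distillation objective that such analyses assume (the scaling-law question is when paying for the teacher is worthwhile, not the loss itself):

```python
import torch
import torch.nn.functional as F

# Standard distillation objective sketch: match the student to softened teacher
# logits, blended with the usual hard-label cross-entropy.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student, teacher = torch.randn(8, 1000), torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student, teacher, labels))
```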
Safety & Security
Auditing Prompt Caching in Language Model APIs. They use the response speed to detect whether a given input has been previously cached. This allows one to detect whether someone else has already input that prompt, which thereby leaks information between users. This has a similar flavor to other attacks based on timing or energy use; a system leaks information when it implements internal efficiencies. Leakage can be stopped, but only by giving up the efficiency/speed gains.
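An illustrative sketch of the timing side-channel (the API call is a hypothetical stand-in; the paper’s procedure is more careful):

```python
import statistics
import time

# Illustrative sketch of the timing side-channel (not the paper's exact procedure):
# if a prompt prefix is already in the provider's cache, responses tend to return
# measurably faster. `send_request` is a hypothetical callable issuing the API call.
def median_latency(send_request, prompt, trials=10):
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        send_request(prompt)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

def probably_cached(send_request, prompt, fresh_prompt, threshold=0.7):
    # Compare against a prompt that has certainly never been sent before.
    return median_latency(send_request, prompt) < threshold * median_latency(send_request, fresh_prompt)

# Toy stand-in where "cached" prompts return faster.
def fake_send(prompt):
    time.sleep(0.005 if prompt == "shared prompt" else 0.02)

print(probably_cached(fake_send, "shared prompt", "definitely novel prompt 8471"))  # True
```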
Groq has secured $1.5B to expand AI inference infrastructure in Saudi Arabia.
Robots
Foundation Robotics announce the Phantom robot (a rebrand of the Alex robot, after their acquisition of Boardwalk Robotics). The design allows different upper-body and lower-body configurations to be selected based on the intended use. They seem to be testing with customers.
More generally, we should expect that tuning the amount of depth vs. breadth in search will matter. This will perhaps arise naturally as models are trained on more reasoning traces; or perhaps could be tuned manually somehow.
Language Models Use Trigonometry to Do Addition. Adds to a growing body of research showing how the latent space of LLMs exploits geometric arrangements to store information and do information processing.
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. They introduce a new reasoning benchmark where complexity can be tuned, and use it to show that LLMs struggle as complexity increases. Larger/better models, and more inference-time compute, yield improved reasoning, but high complexity inevitably confounds them.
Nvidia is hosting DeepSeek-R1, available through their API.
OpenAI releases o3-mini, a powerful reasoning model that leverages inference-time compute.
Open-R1 is an attempt to reproduce the DeepSeek-R1 model/result/method in a fully open manner. Their first update shows progress in replicating DeepSeek’s results.
s1: Simple test-time scaling. They investigate the simplest possible inference-time compute method for increasing reasoning: they arbitrarily insert “Wait” tokens when the model tries to complete its response. This forces it to reconsider and think longer, yielding gains that scale with compute.
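A sketch of the trick (the decoding interface here is hypothetical; s1 operates inside the model’s decoding loop rather than on plain strings):

```python
# Sketch of the "Wait" trick (interfaces are hypothetical; the real implementation
# intervenes in the model's decoding loop rather than on plain strings).
def generate_with_forced_thinking(generate_until_stop, prompt, extensions=2):
    """Each time the model tries to end its reasoning, append 'Wait' and resume."""
    text = generate_until_stop(prompt)
    for _ in range(extensions):
        text = generate_until_stop(text + "\nWait,")
    return text

# Toy stand-in for a decoding call that runs until the model emits a stop signal.
def toy_generate(text):
    return text + " ...some reasoning... [end-of-thinking]"

print(generate_with_forced_thinking(toy_generate, "Question: 17 * 24 = ?"))
```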
Google releases Gemini 2.0 broadly. Although not the top models in raw benchmark scores, this set of models seems to establish a new record in terms of the Pareto tradeoff between performance and inference cost.
Replit launches an agent/app that allows you to make a customized mobile app without coding (examples).
OpenAI announces their second agentic product: Deep Research conducts web searches on a topic of choice, preparing a detailed report. A query can run for 2-30 minutes as it iteratively seeks information. This approach reaches a record-setting 26.6% on the recently-released (and very challenging) Humanity’s Last Exam benchmark.
This capability is thematically similar to what Perplexity and Google’s Deep Research do. However, OpenAI’s approach seems to leverage a reasoning model (presumably a variant of o3-mini) to iteratively work on the research problem.
Open-source equivalents of OpenAI’s Deep Research are being developed: