Kevin G. Yager | Academic Summary

AI News 2025-01-16

Posted on 2025-01-16 by KevinYager

General

The US White House issued a statement: FACT SHEET: Ensuring U.S. Security and Economic Strength in the Age of Artificial Intelligence. It calls for unrestricted access to AI hardware and software for 18 “key allies and partners”, with correspondingly restricted access for others.
OpenAI’s Economic Blueprint: policy proposals for how the US can maximize AI’s benefits, bolster national security, and drive economic growth. Full report: AI in America.
From chalkboards to chatbots: Transforming learning in Nigeria, one prompt at a time. The article reports major gains in education when using AI as a tutor (supposedly: 6 weeks of after-school AI tutoring = 2 years of typical learning gains).
Simple discussion of the environmental cost of genAI: Using ChatGPT is not bad for the environment.
- Relatedly: The carbon emissions of writing and illustrating are lower for AI than for humans.
Here’s a press release that provides a general-audience intro to my exocortex concept.

Research Insights

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought.

Safety

Writing Doom. A short film (27m) about superintelligence. The film does a good job of going over the basic arguments for ASI threat; useful for those who haven’t heard these before. (C.f. my attempt to summarize the arguments.)

LLM

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs. They introduce a multi-step visual reasoning benchmark, and introduce a LlamaV-o1 visual reasoning model that leverages curriculum learning.
AutoRAG: RAG AutoML tool for automatically finding an optimal RAG pipeline for your data.
Enhancing Retrieval-Augmented Generation: A Study of Best Practices.
OpenAI introduces Tasks: the ability to schedule ChatGPT to perform an action and report the result (examples). Although simple, it points towards increasingly agentic, background activity by commercial LLMs.
MiniMax release (open-source) MiniMax-Text-01 and MiniMax-VL-01 (multi-modal visual). You can try it here. Using lightning attention, they deploy a 4M token context length.
- Paper: MiniMax-01: Scaling Foundation Models with Lightning Attention.
Interesting developments to improve LLM reasoning over image/video data:
- VideoRAG: Retrieval-Augmented Generation over Video Corpus.
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought.

AI Agents

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains (preprint, code). A base model is finetuned into a variety of specialized models using synthetic data.

Audio

Hailuo AI unveils T2A-01-HD, a text-to-speech model (try here, API here).

Image Synthesis

Nvidia releases (Apache 2.0 license) Sana image model (examples).

Video

Luma AI introduces their next video model: Ray2 (examples).

Science

Update to the NextBrain segmentation method: Bayesian Segmentation with Histological Atlas “NextBrain”.
- Previously, researchers evaluated whether Meta’s Segment Anything Model (SAM) was suitable for MRI.
A generative model for inorganic materials design. Uses the denoising concept (as used in image synthesis) to enable generation of novel inorganic material unit cells. This essentially allows text-to-material prompting.

Robots

Latest video of Unitree’s humanoid robot shows a more humanlike gait, and navigating more rugged terrain.

Posted in AI, News | Tagged audio, image synthesis, LLM, research, safety, video

AI News 2025-01-09

Posted on 2025-01-09 by KevinYager

General

Blog post: The Intelligence Curse: With AGI, powerful actors will lose their incentives to invest in people.
Microsoft blog post: The Golden Opportunity for American AI.
Microsoft to Spend $80 Billion on AI Data Centers This Year. Over half this spending will be in the US.
Emirati billionaire Hussain Sajwani is reportedly planning to invest $20 billion in the US in data centers.
Anthropic is raising a further $2B, at a $60B valuation.
Bloomberg interview: Sam Altman on ChatGPT’s First Two Years, Elon Musk and AI Under Trump; and Altman posts on his blog: Reflections. Altman reaffirms that agents will be developed in 2025, and that they are on track to AGI in the years following.

Research Insights

PRIME: Process Reinforcement Through Implicit Rewards (data/models, code)
- Builds on prior work: Free Process Rewards without Process Labels.
- The basic idea is: chain-of-thought (CoT) is a useful way to improve reasoning. But how to train better CoT? You can give scores to good vs. bad chains, but then the model only gets whole-chain feedback. It would be better to know where the reasoning chain went wrong (or right). In PRIME, alongside training the LLM, they train an LLM that acts as a per-token reward model. It learns what CoT-steps are looking good vs. bad, and so can provide more fine-grained direction control.
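As a rough illustration of what "per-token reward" means here, the sketch below assumes the log-ratio formulation from the prior work: the reward for each token is a constant times the difference between the reward model's log-probability and a frozen reference model's log-probability for that token. The numbers, the value of `beta`, and the function name are made up for illustration; this is not the authors' code.

```python
def implicit_token_rewards(logp_reward_model, logp_reference, beta=0.05):
    """Per-token reward: beta * (log p_rm(token) - log p_ref(token)).

    Assumption: both lists hold log-probabilities of the same sampled tokens.
    """
    if len(logp_reward_model) != len(logp_reference):
        raise ValueError("log-prob sequences must have the same length")
    return [beta * (lp - lr) for lp, lr in zip(logp_reward_model, logp_reference)]


# Made-up log-probabilities for a 4-token reasoning step.
logp_rm = [-0.2, -0.1, -3.0, -0.3]
logp_ref = [-0.5, -0.4, -1.0, -0.6]

rewards = implicit_token_rewards(logp_rm, logp_ref)

# Index of the token the reward model likes least: a hint at where the chain went wrong.
worst = min(range(len(rewards)), key=rewards.__getitem__)

# Summing the per-token rewards recovers a whole-chain score.
chain_score = sum(rewards)
```

The point of the toy is that one still gets a whole-chain score (the sum), but can also localize blame to a specific position in the chain.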
Differential Transformer. Explanation: The traditional transformer architecture spreads attention and can thus get distracted by noise (especially with large context). The differential architecture alters the attention equation so as to better amplify relevant context and suppress noise. This should improve retrieval and reduce hallucinations, especially for large contexts.
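A minimal sketch of the core idea, assuming the attention map is computed as the difference of two softmax attention maps (the second scaled by a factor `lam`). In the actual architecture the scaling factor is learned and there are additional normalization steps; those are omitted here, and all names and numbers are illustrative.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for lists of vectors."""
    d = len(Q[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]


def differential_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Apply (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) to V."""
    A1 = attention_weights(Q1, K1)
    A2 = attention_weights(Q2, K2)
    diff = [[a - lam * b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
    return diff, out


# One query; the first key is relevant, the other two are "noise".
Q1 = [[1.0, 0.0]]
K1 = [[4.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# The second map attends uniformly, so subtracting it removes a common-mode floor.
Q2 = [[1.0, 0.0]]
K2 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
V = [[1.0], [0.0], [0.0]]

plain = attention_weights(Q1, K1)
diff, out = differential_attention(Q1, K1, Q2, K2, V, lam=0.25)
```

In this toy case the ratio of relevant-to-noise weight is larger in `diff` than in `plain`, which is the noise-cancelling behavior described above.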
Metadata Conditioning Accelerates Language Model Pre-training. Pre-pending training data with meta-data (e.g. “from wikipedia.org”), for part of the training, allows more control. Training can be more data-efficient, and inference can be more steerable (by invoking a meta-data field associated with the desired output style).
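A small sketch of what the data preparation could look like, assuming metadata is prepended as a plain-text line and only during an initial fraction of training. The `URL:` prefix, the 0.9 fraction, and the function name are hypothetical choices for illustration.

```python
def format_training_example(text, source, step, total_steps, metadata_fraction=0.9):
    """Prepend a metadata line during the first part of training only.

    Assumption: after `metadata_fraction` of the steps, examples are served
    without metadata so the model also works on plain prompts.
    """
    if step < metadata_fraction * total_steps:
        return f"URL: {source}\n\n{text}"
    return text


early = format_training_example("Paris is the capital of France.", "wikipedia.org", 10, 1000)
late = format_training_example("Paris is the capital of France.", "wikipedia.org", 950, 1000)
```

At inference time, the same prefix can be placed in the prompt to steer toward the style associated with that source.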

LLM

Interesting idea to automate the ranking of LLMs (for a particular task). LLMRank (“SlopRank”) uses a set of LLMs to generate outputs, and evaluate each other. The top model can then be inferred from a large number of recommendations (from the other models), analogous to ranking pages in web-search using PageRank.
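To make the PageRank analogy concrete, here is a generic power-iteration sketch over a matrix of cross-endorsements. It is not the tool's actual implementation; the endorsement weights, damping factor, and model names are made up.

```python
def pagerank(endorse, damping=0.85, iters=100):
    """endorse[i][j] is the weight of model i's vote for model j."""
    n = len(endorse)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(endorse):
            total = sum(row)
            for j, w in enumerate(row):
                # A model that casts no votes spreads its rank evenly.
                share = w / total if total > 0 else 1.0 / n
                new[j] += damping * rank[i] * share
        rank = new
    return rank


models = ["A", "B", "C"]
# Rows are judges, columns are the models being judged (self-votes set to 0).
endorse = [
    [0, 1, 1],
    [3, 0, 1],
    [3, 1, 0],
]
rank = pagerank(endorse)
best = models[max(range(len(models)), key=rank.__getitem__)]
```

Votes from highly-ranked models count for more, which is what distinguishes this from a simple tally.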
Rubiks AI releases new Sonus-1 models, including a reasoning variant.
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings (preprint, leaderboard).
Blog post: Can LLMs write better code if you keep asking them to “write better code”? The answer is “yes”, though the expected issues arise (prompting matters, hallucinations may occur, etc.). It does generally confirm the notion that iterative LLM work can exceed single-shot generation.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM.
The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input.
Microsoft open-sources (MIT license) their small-but-performant (14B) phi-4 model.

AI Agents

Google whitepaper: Agents.
There are now lots of AI agent orchestration frameworks. Here’s the latest addition: orchestra (docs, code).
Agent Laboratory: Using LLM Agents as Research Assistants.
AgentRefine: Enhancing Agent Generalization through Refinement Tuning. Tuning a system only on successful task completion is not enough; one must train in the ability to handle errors.

Video

Fine-tuning of video models to a particular style is now starting. Examples of Hunyuan Video LoRAs.
Nvidia’s new GeForce RTX 5090 graphics card can use neural rendering for real-time ray-tracing (where only ~10% of pixels are computed using traditional ray-tracing, and a neural model is used to interpolate from that).

World Synthesis

Nvidia present Cosmos, a set of foundation models trained on 20 million hours of video. Intended to accelerate training (e.g. via synthetic data generation) of models for robotics, autonomous driving, industrial settings, etc.

Science

An automatic end-to-end chemical synthesis development platform powered by large language models.
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring (code). A 7B foundation model trained on 1.5T DNA/RNA base pairs, obtained from wastewater.
A foundation model of transcription across human cell types.
Accurate predictions on small data with a tabular foundation model (code). A foundation model using in-context learning can infer missing tabular data more correctly than traditional methods.

Brain

The Digital Twin Brain Consortium publishes: Simulation and assimilation of the digital human brain (preprint, code). They simulate 86B neurons and 48T synapses using 14k GPUs.
Predicting Human Brain States with Transformer. The system can predict the next 5s of fMRI data from the previous 20s.
Key-value memory in the brain. They provide some evidence that key-value style memory could be implemented biologically, and maybe even is the process of human memory retrieval. If this were true, it would imply that the limit on human memory is not storage, but retrieval (one forgets not because the memory/information is erased/over-written, but because one loses the key/pathway towards retrieving that specific memory).

Hardware

Nvidia described their GB200 NVL72 rack-sized supercomputer: 72 Blackwell GPUs, 1.4 exaFLOPS of compute, and 130 trillion transistors. For fun, Jensen Huang showed what the corresponding compute would look like if all placed on a single wafer as a superchip, though that is not how it is actually manufactured or used.
Nvidia announces $3,000 personal AI supercomputer called Digits, which uses a GB10 superchip. A single unit can run a 200B model; linking two should allow one to run 405B models.

Robots

OpenDriveLab and AgiBot-World release a large-scale robotics dataset: 1M trajectories from 100 real-world scenarios and 100 robots.
Nvidia describes Isaac GR00T Blueprint to accelerate robotics development.

Posted in AI, News | Tagged agents, brain, hardware, LLM, research, robots, Science, video, world synthesis

AI News 2025-01-02

Posted on 2025-01-02 by KevinYager

General

Interesting essay: By default, capital will matter more than ever after AGI.
- Counter-argument.
Google DeepMind preprint: A theory of appropriateness with applications to generative artificial intelligence.
Can one objectively define “good taste” (e.g. in appreciating art)? If one can (e.g. to objectively understand the details and context that explain human preferences), then it seems likely that AIs will eventually exhibit superhuman taste, in that they will be able to analyze given data from a multitude of well-informed perspectives.

Research Insights

An interesting effect: fine-tuning GPT-4o on responses where the first letter of each line spells out H-E-L-L-O leads to a model that can correctly explain this underlying rule (even though the rule was never provided to it). This is surprising since when generating a reply, a token-wise prediction cannot “see ahead” and know that it will spell out HELLO; yet the LLM is somehow able to predict its own behavior, suggesting it has some knowledge of its own internal state.
- Further testing with the pattern HELOL gave far worse results, implying strong reliance on the existence of the HELLO pattern in the training data.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. The authors analyze whether we are efficiently using inference-time compute, and propose mitigation strategies to avoid overthinking.

AI Agents

Huggingface introduce smolagents, a lightweight framework for agents.
Agentarium is a Python framework for orchestrating agents.
Eliza is a framework for AI models to access resources (documents, Discord, Twitter, etc.).

Audio

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization (code).

3D

zoo.dev is developing workflows for CAD where one can switch between generative and traditional-edit modes.

Science

Where did viruses come from? AlphaFold and other AIs are finding answers.
- There is growing sophistication in bio-polymer prediction methods: AlphaFold, AlphaFold 2, AlphaFold 3, ESMFold, Evo, Chroma.

Robots

LimX released a video of a new humanoid design.
EngineAI released details of their PM01 design (c.f. the existing SE01 design, which sells for $12,000).

Posted in AI, News | Tagged 3D, agents, audio, research, robots, Science

AI News 2024-12-26

Posted on 2024-12-27 by KevinYager

General

Ethan Mollick provides a summary of recent developments in AI: What just happened.
ModernBERT is a replacement for the popular BERT-style models. The claim is that it is both faster and yields higher-quality embeddings.
xAI have raised a further $6B in series C funding.

Research Insights

I Don’t Know: Explicit Modeling of Uncertainty with an [IDK] Token (discussion by Vincent D. Warmerdam).
Meta-Reflection: A Feedback-Free Reflection Learning Framework. Allows an LLM to have reflection-like thinking in a single forward pass. Uses a codebook of reflections to draw from.
Let your LLM generate a few tokens and you will reduce the need for retrieval. After generating some tokens in reply to a query, an LLM will be better able to assess whether it knows the answer (and thus whether retrieval is warranted).
Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models. Computes a per-token temperature, to better guide the sequence of thoughts.

LLM

OpenAI reveal a new reasoning model: o3. It scores higher on math and coding benchmarks, including setting a new record of 87.5% on ARC-AGI Semi-Private Evaluation. This suggests that the model is exhibiting new kinds of generalization and adaptability.
- The ARC-AGI result becomes even more impressive when one realizes that the prompt they used was incredibly simple. It does not seem that they prompt engineered, nor used a bespoke workflow for this benchmark (though the ARC-AGI public training set was included in o3 training). Moreover, some of the failures involve ambiguities; even when it fails, the solutions it outputs are not far off. While humans still out-perform AI on this benchmark (by design), we are approaching the situation where the problem is not depth-of-search, but rather imperfect mimicking of human priors.
- The success of o3 suggests that inference-time scaling has plenty of capacity; and that we are not yet hitting a wall in terms of improving capabilities.
More research as part of the trend of improving LLMs with more internal compute, rather than external/token-level compute (c.f. Meta and Microsoft research):
- Johns Hopkins: Compressed Chain of Thought: Efficient Reasoning Through Dense Representations.
- Google DeepMind: Deliberation in Latent Space via Differentiable Cache Augmentation. They design a sort of “co-processor” that allows additional in-model (latent space) computation, while the main LLM weights are frozen.
- Jeremy Berman presents: LANG-JEPA: Learning to Think in Latent Space. An experimental LLM architecture, based on Meta’s JEPA, that operates in concept space instead of token space.
Qwen released: QvQ-72B-preview visual reasoning model.
DeepSeek release DeepSeek-V3-Base (weights), 671B params. This is noteworthy as a very large open-source model, noteworthy for achieving performance competitive with the state of the art, and noteworthy for having (supposedly) required relatively little compute (15T tokens, 2.788M GPU-hours on H800, only $5.5M).
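The quoted figures hang together under simple arithmetic, if one assumes a rental price of about $2 per H800 GPU-hour. That rate is an assumption for illustration (it is not stated above); the GPU-hours and token count are taken from the item above.

```python
gpu_hours = 2.788e6        # H800 GPU-hours, from the item above
tokens = 15e12             # training tokens, from the item above
usd_per_gpu_hour = 2.0     # ASSUMPTION: rental rate, not stated above

total_cost_usd = gpu_hours * usd_per_gpu_hour
gpu_hours_per_billion_tokens = gpu_hours / (tokens / 1e9)

print(f"Estimated cost: ${total_cost_usd / 1e6:.3f}M")
print(f"GPU-hours per billion tokens: {gpu_hours_per_billion_tokens:.1f}")
```

Note that this kind of estimate covers only the rented compute for the run itself, not research, ablations, data, or staff.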

Safety

OpenAI releases paper: Deliberative Alignment: Reasoning Enables Safer Language Models. The method is similar to Anthropic’s constitutional AI (where one writes down principles the AI must consider and adhere to), but leveraging the improved reasoning of modern models (o1, o3) to correspondingly improve alignment.

Video

Pika launched their 2.0 model, including “Scene Ingredients” which provides methods for adding specific characters to scenes.
LTX Studio adds fine-grained control of facial emotions.
ByteDance INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations. Allows one to take audio and an image, and generate a lip-synced video (examples).

Audio

Adobe Sketch2Sound allows one to vocally imitate a sound effect, and uses AI to convert the imitation into an appropriate sound. This allows art direction for Foley sound.
MMAudio enables video-to-audio; i.e. it can add a soundtrack to silent video (project, code, examples: 1, 2).

World Synthesis

WonderWorld: Interactive 3D Scene Generation from a Single Image (preprint, examples).

Science

Sakana AI (c.f. AI Scientist) present Automating the Search for Artificial Life with Foundation Models (preprint, code). They use various environments that parametrize simple rulesets that can lead to complex emergent behavior (cellular automata, Conway’s game of life, Boids). These act as test environments with richness and complexity, and they use visual/language models (VLMs) to automate the search for interesting behavior. Since artificial life environments can also provide inspiration for AI, this is AI-guided search through artificial life, towards improvement of AI.
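For readers unfamiliar with what a "parametrized simple ruleset" looks like, here is a minimal Life-like cellular automaton where the birth and survival neighbor-counts are the parameters. Conway's game of life is the case birth={3}, survive={2, 3}. This is a generic sketch, not the environments used in the paper, and it assumes the birth set does not contain 0.

```python
from collections import Counter


def step(live, birth=frozenset({3}), survive=frozenset({2, 3})):
    """Advance a Life-like automaton one generation on an unbounded grid.

    live: set of (x, y) coordinates of live cells.
    """
    # Count live neighbors for every cell adjacent to at least one live cell.
    counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    candidates = set(counts) | live
    return {
        cell
        for cell in candidates
        if (cell in live and counts[cell] in survive)
        or (cell not in live and counts[cell] in birth)
    }


# A "blinker": three cells in a row that oscillate with period 2 under Conway's rules.
blinker = {(0, 1), (1, 1), (2, 1)}
after_one = step(blinker)
after_two = step(after_one)
```

An automated search would then vary `birth` and `survive` (or richer parameters) and score the resulting dynamics for interestingness.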
Google DeepMind: OmniPred: Language Models as Universal Regressors. General text-to-text regression can be applied to arbitrary science (x,y) data.
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models. This exploits concepts from mechanistic interpretability to allow one to discover new science.
LLMs can realize combinatorial creativity: generating creative ideas via LLMs for scientific research.

Hardware

Nvidia unveils a small form-factor compute platform (suitable for robotics).
Raven Resonance is another attempt to deliver augmented reality glasses.

Robots

Apptronik are partnering with Google DeepMind to bring humanoid robots to fruition a bit faster.
Figure claims they are now revenue-generating, as they are delivering real robots to a paying client.
PaXini is building TORA-ONE, a wheeled humanoid with dexterous hands.
Unitree B2-W (wheeled quadruped) is now available for purchase ($150,000 USD). It seems highly capable.
Some researchers are using video diffusion models (which can predict future frames) as a robot policy: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations (preprint). They show the example of the robot doing chemistry experiments.
Atlas electric (Boston Dynamics) can do a backflip (even while wearing clothes).
Apptronik claim their humanoid robots are doing real work in a fulfillment warehouse.

Posted in AI, News | Tagged audio, hardware, LLM, research, robots, safety, Science, video, world synthesis

AI News 2024-12-19

Posted on 2024-12-19 by KevinYager

General

Ilya Sutskever was co-recipient of the test-of-time award at NeurIPS 2024, for the 2014 paper: Sequence to Sequence Learning with Neural Networks, currently cited >28,000 times. Video of his speech here, in which he makes many provocative points: compute is growing but data is not (we only have one Internet, data is the fossil fuel of AI); scaling still matters, and we must determine what to scale; what comes next will be a mix of agents, synthetic data, and inference-time compute; strongly reasoning systems will be unpredictable; superintelligence is coming.
Anthropic present Clio, a system that provides an aggregated view of what people are using Claude to do. This allows one to observe trends in AI usage. Paper: Clio: Privacy-Preserving Insights into Real-World AI Use.

OpenAI

Dec 12: video input for advanced voice mode is being enabled.
Dec 13: ChatGPT projects allow organizing conversations and specializing responses for particular subject areas.
In response to continued lawsuits from Elon Musk, OpenAI present further evidence that Musk was in favor of the proposed shift towards a capped for-profit structure. (This new information has been added to this aggregation of the relevant communications.)
Dec 16: Improved search, more broadly available.
Dec 17: New developer tools for o1.
Dec 18: ChatGPT is now available by phone: 1-800-ChatGPT (1-800-242-8478) in US and Canada (you can also add it as a WhatsApp contact with that number).
Dec 19: ChatGPT integration into certain coding and note-taking apps.

Research Insights

A set of results push LLMs a bit away from the legible token representation we are currently used to:
- Meta publishes: Byte Latent Transformer: Patches Scale Better Than Tokens. Instead of tokenization, it dynamically converts the input byte-stream into patches. This yields significant gains in compute efficiency, with minimal loss in performance.
- Meta publishes: Large Concept Models: Language Modeling in a Sentence Representation Space. They train a model that operates at a higher level of abstraction than typical word/token LLMs. Their model operates in a space of concept embeddings (which are more akin to full sentences than individual words).
- Last week, Meta published: Training Large Language Models to Reason in a Continuous Latent Space, which involves feeding the latent representation directly back into the model, instead of tokenizing intermediate thoughts (Chain of Continuous Thought, a.k.a. Coconut).
- Microsoft previously described: DroidSpeak: Enhancing Cross-LLM Communication, wherein LLMs invent their own inter-communication language.
- Each of these is individually exciting in terms of increased performance. However, they all push away from human-legible intermediate representations, which is problematic from a safety and engineering perspective.
Thinking Fast and Laterally: Multi-Agentic Approach for Reasoning about Uncertain Emerging Events. They introduce more system-2 and lateral-thinking through multi-agent interactions.
Cultural Evolution of Cooperation among LLM Agents.
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers.

LLM

Microsoft releases a small-but-capable model: Phi-4 (14B). It heavily uses synthetic data generation and post-training to improve performance (including on reasoning tasks).
Google’s Project Mariner, a Chrome extension for agentic AI.
Google release Gemini 2.0 Flash Thinking, a reasoning model (available in AI studio).

Safety

Anthropic releases a new method to jailbreak AI models, using an automated attack method. By identifying this vulnerability, one can build future models to resist it. Paper: Best-of-N Jailbreaking (code). The method iteratively makes small changes to prompts, attempting to slide through countermeasures.
- The flavor of successful attacks also gives insights into LLMs. Successful prompts may involve strange misspellings or capitalizations; or unusual images with text and colored boxes arranged peculiarly. This is similar to other adversarial attacks (e.g. on image classification models). They have a certain similarity to human optical illusions: generating perverse arrangements meant to trick otherwise useful processing circuits. Improved model training can progressively patch these avenues; but it’s hard to imagine models that completely eliminate them until one achieves truly robust intelligence.
Anthropic publish: Alignment Faking in Large Language Models. They find evidence for alignment faking, wherein the model selectively complies with an objective in training, in order to prevent modification of its behavior after training. Of course the setup elicited this behavior, but it is surprising in the sense that LLMs don’t have persistent memory/awareness, and troubling in the sense that this shows even LLMs can engage in somewhat sophisticated scheming (e.g. they have evidence for these decisions going on during the LLM forward-pass, not in chain-of-thought).

Video

MinT video improves consistency and control (examples). Preprint: Mind the Time: Temporally-Controlled Multi-Event Video Generation.
Google announces Veo 2 and Imagen 3 (available via Labs, more examples, examples with natural movement).

Audio

ElevenLabs introduce a Flash TTS model, with latency of just 75 milliseconds.

World Synthesis

Impressive demo of a new physics engine: Genesis: A Generative and Universal Physics Engine for Robotics and Beyond (code, project page). It appears to be an accelerated physics engine with an LLM interface.

Science

Superhuman performance of a large language model on the reasoning tasks of a physician.

Brain

Contextual feature extraction hierarchies converge in large language models and the brain. LLMs are becoming more brain-like as they advance.

Posted in AI, News | Tagged audio, brain, LLM, OpenAI, research, safety, Science, video, world synthesis

AI News 2024-12-12

Posted on 2024-12-12 by KevinYager

OpenAI

Dec 5: o1 is out of preview. The updated o1 is faster (uses fewer tokens) while improving performance. And they have introduced a “Pro” version of o1 (thinks for even longer).
- Here’s an example from a biomedical professor about o1-pro coming up with a legitimately useful and novel research idea.
Dec 5: There is now a ChatGPT Pro tier, $200/month for unlimited access to all the best models (including o1 Pro).
Dec 6: Reinforcement Fine-Tuning Research Program. Selected orgs will be able to RL OpenAI models for specific tasks. This is reportedly much more sample-efficient and effective than traditional fine-tuning. It will be reserved for challenging engineering/research tasks.
Dec 9: Sora officially released (examples).
Dec 10: Canvas has been improved and made available to all users.
Dec 11: ChatGPT integration into Apple products.
Dec 12: ChatGPT can pretend to be Santa.

Google

Google releases Gemini 2.0.
- Jules is an experimental code agent.
New “Deep Research” feature can search the web and pull together a coherent research report.
Imagen 3 and Veo image and video models are now available on Google’s Vertex cloud platform.
Multimodal Live API in Google AI Studio. You can share your webcam or screen to allow it to provide more directed help. (Example of using it as a research assistant.)

Research Insights

Google DeepMind: Mastering Board Games by External and Internal Planning with Language Models. Search-based planning is used to help LLMs play games. They investigate both externalized search (MCTS) and internalized (CoT). The systems can achieve high levels of play. Of course the point is not to be better than a more specialized/dedicated neural net trained on that game; but to show how search can unlock reasoning modalities in LLMs.
Training Large Language Models to Reason in a Continuous Latent Space. Introduces Chain of Continuous Thought (COCONUT), wherein you directly feed the last hidden state as the input embedding for the next token. So instead of converting to human-readable tokens, the state loops internally, providing a continuous thought.
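A toy contrast between the two loops, to show what "feeding the hidden state back in" changes. The two-dimensional "model", vocabulary, and embeddings are entirely made up; the only point is that the token loop collapses each intermediate state to the nearest vocabulary item, whereas the continuous loop carries the full state forward.

```python
# Hypothetical two-token vocabulary with 2-D embeddings.
VOCAB_EMBEDDINGS = {"a": [1.0, 0.0], "b": [0.0, 1.0]}


def toy_step(embedding):
    """Stand-in for one forward pass: input embedding -> last hidden state."""
    x, y = embedding
    return [0.6 * x + 0.4 * y, 0.3 * x + 0.7 * y]


def nearest_token(hidden):
    """Decode a hidden state to the token with the largest dot product."""
    return max(
        VOCAB_EMBEDDINGS,
        key=lambda t: sum(h * e for h, e in zip(hidden, VOCAB_EMBEDDINGS[t])),
    )


def token_chain(embedding, steps):
    """Ordinary chain-of-thought: each step is quantized to a readable token."""
    for _ in range(steps):
        hidden = toy_step(embedding)
        embedding = VOCAB_EMBEDDINGS[nearest_token(hidden)]
    return embedding


def continuous_chain(embedding, steps):
    """Continuous thought: the hidden state is fed straight back as the next input."""
    for _ in range(steps):
        embedding = toy_step(embedding)
    return embedding


start = VOCAB_EMBEDDINGS["a"]
via_tokens = token_chain(start, 3)
via_latents = continuous_chain(start, 3)
```

In this toy, the token loop keeps snapping back to the embedding of "a", while the latent loop accumulates a mixed state that no single token represents. That is both the appeal (more information per step) and the legibility cost.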
New preprint considers how “capability density” is increasing over time: Densing Law of LLMs. They find that, for a given task, every 3 months the model size needed to accomplish it is halved. This shows that hardware scaling is not the only thing leading to consistent improvements.
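The halving claim implies a simple exponential. As a worked example using the 3-month figure quoted above (the 70B starting size is an arbitrary illustration):

```python
def required_size(initial_params, months, halving_period_months=3.0):
    """Model size needed for the same task after `months`, if it halves every period."""
    return initial_params * 0.5 ** (months / halving_period_months)


# A task that needs a 70B-parameter model today, one year later:
one_year_later = required_size(70e9, 12)
print(f"{one_year_later / 1e9:.3f}B parameters")
```

Four halvings in a year is a factor of 16, so the hypothetical 70B task would need a model of roughly 4.4B parameters a year later, if the trend holds.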

LLM

Meta released Llama 3.3 70B, which achieves similar performance to Llama 3.1 405B. Meta also announced plans for a 2GW datacenter in Louisiana, for future open-source Llama releases.
Ruliad introduces Deepthought 8B (demo), which claims good reasoning for the model size.
Stephen Wolfram released a post about a new Notebook Assistant that integrates into Wolfram Notebooks. Wolfram describes this as a natural-language interface to a “computational language”.
GitIngest is a tool to “turn codebases into prompt-friendly text”. It will take a github repository, and turn it into a text document for easy inclusion into LLM context.
While we haven’t seen a “new class of model” (bigger/better than GPT4) in quite a while, it’s worth remembering the substantial improvements we’ve seen from perfecting the existing systems (from Epoch AI benchmarks). On Ph.D.-level Q&A, over the last year we’ve gone from no-better-than-random to roughly human-expert.

AI Agents

Article: Emergence’s AI orchestrator launches to do what big tech offerings can’t: play well with others. Of course there are many other scaffolding (LangChain, Pydantic, Flow, etc.) and orchestration (ell, swarm, AG2, etc.) frameworks (not to mention commercial attempts thereof: Amazon, Crew AI, MultiOn, etc.). But it’s good to see more development in this space.

Audio

ElevenLabs added GenFM to their web product: you can now generate AI podcasts, and listeners can tune in on the ElevenReader app.

Image Synthesis

Spawning AI is developing an image model based only on public domain data. It will be made available on Source.Plus. Preliminary images seem quite good (examples), suggesting that public data may be enough. Preprint: Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms.
Midjourney releases Patchwork, a multi-player world-building tool.

Vision

Nvidia introduces: NVILA: Efficient Frontier Visual Language Models.

Monumental Labs is using AI-enabled robotic stone carving to make Renaissance-style sculpture more common.

Science

Nature writeup: Virtual lab powered by ‘AI scientists’ super-charges biomedical research: Could human–AI collaborations be the future of interdisciplinary studies? Preprint: The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation. They use a team of AI assistants to accelerate work.
ORGANA: A Robotic Assistant for Automated Chemistry Experimentation and Characterization (video).

Posted in AI, News | Tagged agents, audio, Google, OpenAI, research, Science, vision

AI News 2024-12-05

Posted on 2024-12-05 by KevinYager

General

The End of Productivity: Why creativity is the new currency of success. The essay argues that a focus on pure productivity (and metrics) misses the things that humans value most, and that the era of AI may actually shift the emphasis from human productivity to human creativity as the focus of value.
An interesting experiment (assuming it’s true): an AI jailbreaking contest. An AI agent was tasked with not approving an outgoing money transfer. Anyone can spend a small amount of money to send the AI a message. The money is added to the pool, and the cost-per-message increases slightly. It started at $10/message, and quickly grew to $450/message with a prize-pool of $50k. At that point, someone tricked the AI by sending a message that explained an inverted meaning of approveTransfer. So, they won the money.
- This acts as the usual reminder that modern LLMs are not robust against dedicated attackers that seek to trick them and extract information.
Reportedly: Elon Musk lands priority for Nvidia GB200 delivery in January with US$1.08 billion. Paying a premium to get earlier access to next-gen chips may well be a good strategy.
An interesting blog post by Lilian Weng: Reward Hacking in Reinforcement Learning. Some notes about modern RLHF applied to LLMs (based on this paper):
- RLHF increases human approval, but not necessarily correctness.
- RLHF weakens humans’ ability to evaluate: The error rate of human evaluation is higher after RLHF training.
- RLHF makes incorrect outputs more convincing to humans. The evaluation false positive rate significantly increases after RLHF training.
Andrej Karpathy provides an interesting historical look at how the transformer architecture was invented (c.f. Attention Is All You Need).
A critical analysis of “openness” in AI: Why ‘open’ AI systems are actually closed, and why this matters. They note that the current version of “open” does not preclude concentration of power.

Research Insights

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Reverse Thinking Makes LLMs Stronger Reasoners. Humans reason not just from problem-to-solution, but also from solution backwards.
Last week saw many results attempting to replicate OpenAI o1’s reasoning ability. Now we also have: o1-Coder: an o1 Replication for Coding (code).

LLM

Amazon enters the fight with Nova (docs, benchmarks). Although not leading on benchmarks, they promise good performance-per-dollar; will be available on Amazon Bedrock.

AI Agents

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (code).

Audio

Hume adds a voice creation mode where one can adjust intuitive sliders to pick out the desired voice.
ElevenLabs previously announced intentions to build a conversational AI platform. This capability is now launching; they claim their interface makes it extremely easy to build a conversational voice bot, and allows you to select the LLM that is called behind-the-scenes.

Video

Google et al. show off: Generative Omnimatte: Learning to Decompose Video into Layers (preprint). It can separate a video into distinct layers, including associating effects (e.g. shadows) with the correct layer (parent object), and inpainting missing portions (e.g. occluded background). Obvious utility for visual effects work: can be used to make a particular person/object invisible (including their shadows), to apply edits to just one component (object or background), etc.
Invideo are demoing a system where a single prompt generates an entire video sequence telling a story (example). I think that creators generally want more granular control of output so they can put together a precise narrative. But there are use-cases where this kind of fully automated generation may make sense.
- It’s easy to look at the output and find the visual or narrative flaws. But also interesting to remember how advanced this is compared to what was possible 6-9 months ago. There is obviously a huge amount of untapped potential in these kinds of systems, as they become more refined.
Runway tease a prototype for a system to enable control over generative video, where videos are defined by keyframes and adjusting the connection/interpolation between them (blog post).
- In October 2023, there were some prototypes of a “prompt travel” idea wherein a video was generated by picking a path through the image-generation latent space. One would define keyframe images, and the system would continually vary the effective prompt to interpolate between them (preprint, animatediff-cli-prompt-travel). This provided a level of control (while not being robust enough to actually enforce coherent temporal physics). Runway’s approach (leveraging a video model) may finally enable the required control and consistency.
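The “prompt travel” idea can be illustrated with a toy interpolation. This is a minimal sketch, not the method used by any particular tool: it assumes each keyframe prompt has already been turned into an embedding vector (here just short lists of floats), and it linearly blends the two nearest keyframes for every intermediate frame. The names lerp and prompt_travel are hypothetical.

```python
def lerp(a, b, t):
    """Linearly interpolate between two equal-length vectors."""
    return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]

def prompt_travel(keyframes, num_frames):
    """Return one conditioning vector per frame.

    keyframes maps frame index -> embedding vector; it must include
    frame 0 and frame num_frames - 1 so every frame is bracketed.
    """
    indices = sorted(keyframes)
    path = []
    for frame in range(num_frames):
        prev = max(i for i in indices if i <= frame)
        nxt = min(i for i in indices if i >= frame)
        t = 0.0 if nxt == prev else (frame - prev) / (nxt - prev)
        path.append(lerp(keyframes[prev], keyframes[nxt], t))
    return path

# Toy 2-D "embeddings" standing in for three keyframe prompts.
keyframes = {0: [1.0, 0.0], 4: [0.0, 1.0], 8: [1.0, 1.0]}
path = prompt_travel(keyframes, num_frames=9)
for frame, vector in enumerate(path):
    print(frame, vector)
```

Each frame is conditioned independently here, which is consistent with the note above that this style of control does not by itself enforce coherent temporal physics.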
Tencent announce an open-source video model: Hunyuan Video (example, video-to-video example).

World Synthesis

World Labs (which includes Fei-Fei Li) is working on 3D world generation from a single image (examples, more examples).
Not to be outdone, Google then announced: Genie 2: A large-scale foundation world model, which can generate playable worlds.

Science

Google Introduces A.I. Agent That Aces 15-Day Weather Forecasts. Scientific paper: Probabilistic weather forecasting with machine learning.

Brain

Whole-brain mapping is advancing. We recently saw the release of a fly brain map (140,000 neurons). Now, a roadmap effort claims that whole-brain mapping for mammalian brains should be possible in the coming years.

Hardware

ASML released a hype-video describing the complexity of modern lithography (in particular the computational lithography aspect). There is no new information, but it’s a nice reminder of the nature of the state-of-the-art.
I never grow tired of looking at plots of Moore’s Law:

Robots

MagicLab released a video purporting to show multi-(humanoid)robot collaboration on tasks.

Posted in AI, News | Tagged agents, audio, hardware, LLM, research, robots, video, world synthesis

AI News 2024-11-28

Posted on 2024-11-28 by KevinYager

General

Google releases an essay on the potential of AI for science: A new golden age of discovery: Seizing the AI for Science opportunity. In addition to outlining an optimistic future (not dissimilar from Dario Amodei’s Machines of Loving Grace), it provides practical insight about what problems are best attacked using modern AI.
Aidan McLaughlin essay: The Problem with Reasoners. He notes three trends that suggest AI will progress more slowly than suggested by naive/optimistic scaling arguments:
- It was hoped that multi-modal models (ChatGPT 4o, voice+text models, etc.) would exhibit significant capability improvement from transfer learning across modalities. This has not been borne out.
- Iterative/reasoning models (OpenAI o1, DeepSeek r1, etc.) show that using RL can yield gains in narrow domains with clear metrics (contrived math problems), but we are not seeing evidence of this leading to generalized improvements in intelligence (in areas without easy verification).
- No larger model (larger than GPT4 or Claude 3 Opus) has been released, suggesting major challenges there.
Attitudes and perceptions of medical researchers towards the use of artificial intelligence chatbots in the scientific process: an international cross-sectional survey (Nature commentary: Quest for AI literacy). Overall, the study finds substantial interest in AI chatbots among researchers, but also a lack of understanding of these systems.

Research Insights

Replication of “o1-style” chain-of-thought reasoning is heating up:
- Last week saw announcement of DeepSeek-R1-Lite-Preview.
- Update from Walnut Plan’s attempt to replicate o1 (c.f. part 1, code): O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
- Paper from Alibaba: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions.
- Alibaba Qwen releases: Qwen QwQ 32B (weights, demo). This appears to be a separate implementation of the “o1-style” reasoning chain-of-thought approach.
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. There is always debate about whether LLMs “truly reason” or “simply memorize”. This paper proposes that reasoning is based on extracting procedures from training data, rather than simply memorizing outputs. So it is a matter of finding, memorizing, and using “templates” rather than specific results.
LLMs Do Not Think Step-by-step In Implicit Reasoning. They argue that while explicit chain-of-thought (CoT) generates stepwise reasoning, implicit reasoning (e.g. model trained to reproduce CoT outputs) does not internally invoke the same stepwise process.
Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers. Notes that inference-time scaling is limited by the quality of the verifier (at least for approaches relying on verification).
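The verifier bottleneck can be made concrete with a small probability calculation. This is an illustrative sketch under simplifying assumptions that are mine, not the paper’s setup: samples are independent, each is correct with probability p_correct, the verifier accepts correct samples with rate tpr and incorrect ones with rate fpr, and the first accepted sample is returned (no accepted sample counts as a failure). All names and numbers are hypothetical.

```python
def accepted_precision(p_correct, tpr, fpr):
    """Chance that a verifier-accepted sample is actually correct (Bayes' rule)."""
    good = p_correct * tpr
    bad = (1.0 - p_correct) * fpr
    return good / (good + bad)

def resampling_accuracy(p_correct, tpr, fpr, max_samples):
    """Accuracy when sampling until the verifier accepts, within a budget."""
    p_accept = p_correct * tpr + (1.0 - p_correct) * fpr
    p_any_accepted = 1.0 - (1.0 - p_accept) ** max_samples
    return p_any_accepted * accepted_precision(p_correct, tpr, fpr)

# Hypothetical numbers: a weak model (20% correct) and a verifier that
# never rejects a correct answer but wrongly passes 10% of incorrect ones.
for budget in (1, 10, 100):
    print(budget, round(resampling_accuracy(0.2, 1.0, 0.1, budget), 3))
print("ceiling:", round(accepted_precision(0.2, 1.0, 0.1), 3))
```

In this toy model, extra samples raise accuracy only up to the precision of the verifier’s accepted set; beyond that point, more inference-time compute buys nothing unless the verifier’s false-positive rate improves.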

LLM

Nvidia releases: Hymba Hybrid-Head Architecture Boosts Small Language Model Performance (code). Combines transformer attention mechanism with state-space models (SSMs, c.f. Mamba) to achieve high performance.
Ethan Mollick provides some practical advice for prompting LLMs: Getting started with AI: Good enough prompting (Don’t make this hard).
A sub-culture of AI enthusiasts has developed around the idea of simply giving modern LLMs (limited though they may be) autonomy; or at least semi-persistence by allowing them to run for long time periods. Often, the AIs behave in strange and unexpected ways, as they attempt to continue a token-chain well beyond their original training/design.
- Infinite Backrooms generates extremely long conversations by creating chat-rooms where different LLMs talk to each other endlessly. Conversations often veer into strange and unexpected topics; with some LLMs even outputting tokens describing distress.
- truth_terminal is an 𝕏 handle that is reportedly an LLM given free rein to post. However, there is speculation that the human in charge (Andy Ayrey) is selective about what it actually posts.
- Venture capitalist Marc Andreessen gave the AI a $50,000 no-strings grant (in Bitcoin), so that it could pursue whatever actions it wanted.
- The bot started a memecoin (GOAT) that briefly reached a market cap of $1.3B (currently still at >$700M). The coin’s name is a reference to a (NSFW) shock-meme. The AI itself (or the human behind it) likely netted many millions of dollars.
- The AI reportedly “kept asking to play video games”; so it was given access to an “arcade” where the games are text-based games generated by another LLM. You can watch the streaming interactions: Terminal TV.
- It also has its own web-page (that it, ostensibly, authored).
- While it is hard to know how much human tampering is occurring in these implementations, it is interesting to see the bizarre and unexpected outputs that LLMs generate when unleashed.
AI models work together faster when they speak their own language. Letting AI models communicate with each other in their internal mathematical language, rather than translating back and forth to English, could accelerate their task-solving abilities.
- Preprint: DroidSpeak: Enhancing Cross-LLM Communication.
- Although allowing AIs to converse in an invented language could increase efficiency, it undercuts the legibility and auditability aspects of natural-language inter-communication. Overall, this approach could thus hamper both safety and capabilities of complex AI ecosystems.
Anthropic describes Model Context Protocol: an open standard for secure, two-way connections between data sources and AI (intro, quickstart, code).
Anthropic adds a style feature, where Claude will try to mimic a provided writing example.
Further evidence that model quantization can subtly impact performance: Aider reports that Details matter with open source models.
As a follow-up to last week’s paper on poetry (AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably); Colin Fraser provides this summary graphic, highlighting that humans objectively prefer AI poetry, but when told authorship (real or not), they rate things more highly when (ostensibly) made by humans and lower when (ostensibly) made by AI.

AI Agents

DynaSaur: Large Language Agents Beyond Predefined Actions. The agent improves capabilities over time by progressively writing more functions/code.

Image Synthesis

Black Forest Labs released FLUX.1 Tools, a suite of models to enable more control over image generation/editing (inpainting, outpainting, conditioning).
Runway Frames is a new image model, with good style control.
OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on (code, demo). Allows one to modify a person/character’s clothes in an image.
- There are other codebases to do similar things; e.g.: Kolors Virtual Try-On in the Wild.

Audio

ElevenLabs announces a podcast generator (competing with Google’s Notebook LM).

Video

Meta’s Segment Anything Model 2 (SAM2) has been adapted, adding motion-aware memory, which allows it to do zero-shot video masking (another example): SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory (code).
Runway adds Expand Video, allowing one to change aspect ratio by outpainting (e.g.). Includes prompt guidance, allowing one to change a shot significantly.
LTXStudio announce LTX Video, an open-source video model (code, docs). Although the quality is not quite state-of-the-art, it is remarkably good and it is real-time. Of course, not all generations are excellent; but the real-time generation speed points towards neural world simulation in the not-too-distant future.
Luma Dream Machine v1.6, including Luma Photon image generation and consistent characters.
A group claims to have leaked access to a turbo version of OpenAI’s Sora video model (examples).

World Synthesis

An interesting result: using Runway’s outpainting on video where a person’s face is barely visible (and distorted through refraction); the reconstructed face is remarkably coherent/correct. This implies that the model is implicitly building a valid world model.
Google et al. present: CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models (project page with examples). Follow-up to the earlier CAT3D; but now the 3D objects can evolve in time.

Science

Large language models surpass human experts in predicting neuroscience results (writeup: AI can predict neuroscience study results better than human experts, study finds). This once again shows that LLMs can implicitly learn valid generalizations, picking up on subtle trends spread across a dataset.

Hardware

Epoch AI: Introducing Epoch AI’s Machine Learning Hardware Database.

Robots

Although the Unitree G1 humanoid robot was announced with a price of $16k (c.f.), the latest price chart shows a range of configurations, with prices from $40k to $66k.
Mercedes is running a trial for use of an Apptronik robot in their Austin lab.

Posted in AI, News | Tagged agents, hardware, image synthesis, LLM, research, robots, video, world synthesis

AI News 2024-11-21

Posted on 2024-11-21 by KevinYager

General

Elon Musk’s xAI raising up to $6 billion to purchase 100,000 Nvidia chips for Memphis data center. This is in addition to their existing 100,000 H100 GPU cluster (~100 exaflops FP16). If these are B100 GPUs, that would increase total compute to ~274 exaflops.
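The compute estimate above can be reproduced with simple arithmetic. The per-GPU throughput values below are assumptions (approximate dense FP16 tensor figures from vendor materials, not taken from the linked article), and the helper name is hypothetical; other precisions or sparsity modes would give different totals.

```python
# ASSUMED dense FP16 tensor throughput per GPU, in petaflops.
H100_PFLOPS = 0.989
B100_PFLOPS = 1.75

def cluster_exaflops(num_gpus, pflops_per_gpu):
    """Peak aggregate throughput in exaflops (1 exaflop = 1000 petaflops)."""
    return num_gpus * pflops_per_gpu / 1000

existing = cluster_exaflops(100_000, H100_PFLOPS)  # current H100 cluster
added = cluster_exaflops(100_000, B100_PFLOPS)     # if the new chips are B100s
total = existing + added

print(f"existing: ~{existing:.0f} exaflops")
print(f"added:    ~{added:.0f} exaflops")
print(f"total:    ~{total:.0f} exaflops")
```

These are peak numbers; sustained training throughput is typically well below peak.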
A US government commission released a report; among other things, it calls for a Manhattan-Project style AI initiative. (C.f. Leopold Aschenbrenner‘s Situational Awareness.)

Max Tegmark offers a rebuttal to this report: AGI Manhattan Project Proposal is Scientific Fraud. He contends that the report-writers misrepresent the scientific consensus, in that they seem to report that AGI will be easily controlled.

Research Insights

LLM

New study: AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably. At least part of the effect may come from non-experts judging the simpler and more conventional AI poems as being more understandable and superior (and thus human), while the complexity and inconsistency of human-generated poetry is perceived as incoherence.
- Nevertheless, this again shows that for short-form generation, AI has already reached human-level, and can be considered super-human in certain narrow ways.
Mistral releases a new large model (Mistral-Large-Instruct-2411, 123B) and Pixtral Large multimodal model (weights).
DeepSeek announces DeepSeek-R1-Lite-Preview. This is a “reasoning” model (inference-time chain-of-thought) that seems to be similar to OpenAI’s o1. Like o1, it achieves impressive results on math and science benchmarks. Some of the CoT reasoning traces are quite interesting (e.g.). The weights are not yet available, but they claim they will release it open-source.
- Also interesting to consider the rate of progress. A couple years ago, the prediction was we might reach 46% in the MATH benchmark by 2025. Instead, we now have a general LLM getting 92%. And o1 has also scored 97% on a challenging math exam (with novel questions that are nowhere in the training data).

AI Agents

Stripe adds mechanisms for AI agents to trigger payments.
Generative Agent Simulations of 1,000 People (code). They interview humans, using those interviews to define the set of AI agents.
- Builds on their prior work: 2023-10: Generative Agents: Interactive Simulacra of Human Behavior.
AWS releases a multi-agent orchestrator framework.
Paper: Agent-as-a-Judge: Evaluate Agents with Agents. Argues for using evaluation agents in workflows.
Automated-AI-Web-Researcher-Ollama. Code for using local LLMs to automate online research.
Someone is trying to use a team of AI agents to write a full book autonomously. Different agents are responsible for different characters, or different aspects of writing (consistency, researching facts, etc.).

Image Synthesis

A recent survey of 11,000 people has been completed: How Did You Do On The AI Art Turing Test? The median score (to differentiate AI and human art) was 60%, a bit above chance. AI art was often preferred by humans. Overall, AI art is already crossing a Turing-Test threshold.

Audio

Suno releases their v4 music generator.
ElevenLabs now offers the ability to build conversational AI agents.

Video

Pickle AI is offering a virtual avatar for your meetings ($30/month). You still attend the meeting, and talk when you want. But your avatar pretends to pay attention, and lip-syncs your speech. So this is an alternative to having your camera turned off.
Runway releases some small updates, including longer (20s) video-to-video, vertical aspect ratio for Act-One, and more camera controls.
Current quality of video generations:
- Coca-Cola holiday ad (c.f. McDonald’s commercial, Aug 2024), and parody thereof.
- A Dream Within A Dream (by PZF, selected for the Czech International AI Film Festival).
- Making Friends (by Everett World; see also Childhood Dream and City Echoes).
- Anime: test shots, Ultimate Ceremony, Echoes of Love.
- Echoes of Grace (KakuDrop using Sora).

Science

Sequence modeling and design from molecular to genome scale with Evo. A 7B genomic multi-modal foundation model trained on 2.7 million genomes. It can interpret DNA, RNA, and protein sequences; and can predict across molecular, system, and genomic scales. Can be used to predict effect of mutations, design CRISPR systems, etc.

Hardware

Google has a history of using deep reinforcement learning for automated chip design. This work has been met with some skepticism. Google has now published a rebuttal, claiming that the era of AI chip design is well upon us: That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design.
- April 2020 blog post: Chip Design with Deep Reinforcement Learning.
- June 2021 paper: A graph placement methodology for fast chip design.
- Sept 2023 blog post: How AlphaChip transformed computer chip design.
- August 2024 preprint: ShortCircuit: AlphaZero-Driven Circuit Design (code).

Posted in AI, News | Tagged agents, audio, hardware, LLM, research, Science, video

AI News 2024-11-14

Posted on 2024-11-14 by KevinYager

General

OpenAI’s data scraping wins big as Raw Story’s copyright lawsuit dismissed by NY court. The crux is that the plaintiffs could not demonstrate a concrete, actual harm from OpenAI’s actions.
An article on Reuters: OpenAI and others seek new path to smarter AI as current methods hit limitations. It repeats the assertions (disputed by many experts in the community) that next-generation models (under development) are under-performing, and that AI labs are hitting data walls. They also emphasize that the path forward involves more “inference-time compute” to unlock reasoning.
- It is interesting to see the article including a quote from Ilya Sutskever, who has been largely quiet in the public sphere, after his departure from OpenAI and founding of SSI.
The AI Semiconductor Landscape.

Lex Fridman interviews Anthropic: Dario Amodei (CEO), Amanda Askell (develops Claude’s personality), Chris Olah (works on mechanistic interpretability).

Research Insights

The Surprising Effectiveness of Test-Time Training for Abstract Reasoning (code). They implement temporary updates to weights at inference-time, using a loss and gradients in the usual (training) manner. They show strong performance on ARC tasks.
Mansi Sakarvadia’s thesis: Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning. Develops a system to allow the user to inject prompt-specific information into inference, which can improve multi-step reasoning. Also describes Attention Lens, to convert attention heads into interpretable tokens.
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (code).

LLM

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models (weights, preprint).
Release of: Qwen2.5-Coder Series: Powerful, Diverse, Practical. Currently at the top of the coding leaderboard.

AI Agents

Microsoft introduces: Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks.
Microsoft releases an experimental library: TinyTroupe 🤠🤓🥸🧐: LLM-powered multiagent persona simulation for imagination enhancement and business insights.
Nous Research announces: Introducing the Forge Reasoning API Beta and Nous Chat: An Evolution in LLM Inference. They claim this provides an easy way to take an existing model and run it in a reasoning mode (using inference-time compute).
Mina Fahmi produced this image listing the ways that humans and AI could work together:

Video

AutoVFX: Physically Realistic Video Editing from Natural Language Instructions (preprint, code, examples).
Pollo AI has released a video generator. Outputs are quite good, though not quite challenging the state-of-the-art.
Current quality of video generations:
- Plants dancing.
- Insect on tree.
- Trailers for The Silmarillion and The Fall of Gondolin (by Abandoned Films).
- Moody sci-fi.
- Migration (made by combining Runway ML Gen3-Alpha and traditional animation).
- After the Winter (music made using Suno v4).
- Horror: Ridge to Southwest.
- The Gardener (by Machine Mythos).

World Synthesis

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model. Just two images of a scene are enough to reconstruct a 3D model.

Science

Robots

New Deep Robotics video shows very good terrain navigation from a quadruped-with-wheels design.

Posted in AI, News | Tagged agents, LLM, research, robots, Science, video, world synthesis