Kevin G. Yager | Academic Summary

AI News 2025-07-10

Posted on 2025-07-10 by KevinYager

General

Harvard Business Review: What Gets Measured, AI Will Automate.
OpenAI: Working with 400,000 teachers to shape the future of AI in schools. OpenAI joins the American Federation of Teachers to launch the National Academy for AI Instruction.

Research Insights

Predicting thinking time in Reasoning models. By predicting estimate token needs during CoT reasoning, feedback to user (e.g. progress bar) could be provided. Models internally have state that is predictive of further token usage.
Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory. LLMs can coherently succeed at iterative games. Models from different vendors have identifiable personalities in how they approach games.
How We Replicated Five Peer-Reviewed Papers in Five Hours.
- Prior work: The Discovery Engine.
- Preprint: Benchmarking the Discovery Engine.

LLM

gremllm is clever and/or diabolical. It is a Python library that generates on-the-fly the attributes and methods of a Python object. Thus, one need not actually define the methods for a new class; simply allow the LLM to hallucinate them when they are called for.
- Although this sounds silly and dangerous, there are viable use-cases. In March 2023 (site and code no longer online), there was some exploration of “imaginary programming” wherein one would define a function’s requirements but never actually code the function (the LLM would instead stand-in for the function at call time).
xAI release Grok 4 (and Grok 4 Heavy). Benchmarks are strong, taking the lead on several, including 100% on AIME, 44% on Humanity’s Last Exam, and 16% on ARC-AGI-2 (c.f. 9% Claude Opus 4). If real-world utility matches benchmarks, then Grok 4 may take the lead as the best model.

Safety

Anthropic: Why Do Some Language Models Fake Alignment While Others Don’t? (code).

World Synthesis

Odyssey is again teasing their “interactive video” system (precursor to generative playable games).

Science

AI4Research: A Survey of Artificial Intelligence for Scientific Research.

Robots

Huggingface announces: Reachy Mini – The Open-Source Robot for Today’s and Tomorrow’s AI Builders. Appears to be optimized for education and hobbyist hacking.

Posted in AI, News | Tagged LLM, research, robots, safety, Science, world synthesis | Leave a comment

AI News 2025-07-03

Posted on 2025-07-03 by KevinYager

General

Andrej Karpathy has a knack for distilling the trends in AI/ML:
- 2017-11: Software 2.0 (“Gradient descent can write code better than you. I’m sorry.”)
- 2022-10: Transformers as general-purpose differentiable computers (talk)
- 2023-01: The hottest new programming language is English
- 2023-09: LLM as kernel of a new Operating System (diagram/diagram, OS analogies)
- 2025-02: Vibe coding
- 2025-06: Software 3.0 (talk): “Prompts as Programs”. Software 1.0 is code; 2.0 is model weights; 3.0 is prompts.
- 2025-06: “Context Engineering” instead of “Prompt Engineering”
- Now (2025-06): Prediction of LLMs being scaled down into “cognitive cores”; small edge-optimized (on-device inference) LLMs that have minimal knowledge but maximized reasoning and tool-use abilities. Can rapidly iterate to retrieve required results and build answers.
Epoch reports on improvements in context window length: LLMs now accept longer inputs, and the best models can use them more effectively.

Research Insights

A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap.
- Response to Apple’s paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, which argues LLMs fail as complexity increases, demonstrating a lack of true reasoning.
- This new paper argues that what seems like a lack of reasoning is more like a lack of agentic ability (tool access, etc.).
Towards Scalable Parameter Decomposition. They show a method to decompose models using parameters (rather than activations).
- Paper with technical details: Stochastic Parameter Decomposition (code)
VLMs can think visually without generating pixels. Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (paper, code).
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements. The benchmark measures agentic abilities by asking the agent to improve the training speed for a small LLM (as a proxy for more general “AI recursive self-improvement”). Current agents do surprisingly badly (sub-human performance even with significant hints). Going forward, this eval (or variants thereof) should prove useful to measure “useful agentic” performance.

LLM

Inception Labs launch Mercury, a diffusion LLM. The fast inference of diffusion architecture puts it in a new regime for speed-vs-performance.

Agents

Anthropic tested Claude’s ability to operate a small business: Project Vend: Can Claude run a small shop? (And why does that matter?). Although surprisingly capable in certain ways, the agent overall lost money over time, and had a mini-identity-crisis for a day.

Safety

The Singapore Consensus on Global AI Safety Research Priorities.

Image Synthesis

Open source model for image editing: OmniGen2: Exploration to Advanced Multimodal Generation (preprint, code, demo).
Qwen-VLo image generation (try). Conversational interface for image generation.
ByteDance XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation (preprint).

Video

Nano (Greyscale Labs) is a visual effects plugin that exploits ML depth-estimation to allow editing of volumetric haze.
Nvidia: UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting. Enables relighting of an existing image by estimate albedo.
Alibaba: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation. Expressive avatar control.

World Synthesis

Mirage Research Preview: The World’s First AI-Native UGC Game Engine Powered by Real-Time World Model. We are getting closer to real-time generative gameplay.

Science

Chai-2: Zero-shot antibody design in a 24-well plate.
Microsoft reports on improved AI medical diagnostics: The Path to Medical Superintelligence.
Ai2 introduces new benchmark: SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks (cast votes, data, code). The top model is currently OpenAI o3.
A foundation model to predict and capture human cognition.

Cars

Tesla shows off a car self-driving (no occupants) from factory to customer’s address. No doubt the route was carefully selected and vetted. Nevertheless, it is impressive.
Tesla launched a limited rollout of their full-self driving Robotaxi (with in-vehicle employee monitor, for now), in Texas.

Robots

K-Scale announces that you can now order one of their open-source humanoid robots (9k$ early adopter price; 16k$ nominal price).

Posted in AI, News | Tagged agents, cars, image synthesis, LLM, research, robots, safety, Science, video, world synthesis | Leave a comment

AI News 2025-06-26

Posted on 2025-06-26 by KevinYager

General

Mira Murati’s Thinking Machines Lab raises $2B ($10B valuation).
Scaling Test Time Compute to Multi-Agent Civilizations: Noam Brown.
OpenAI schedules their next DevDay for October 6, 2025.

Research Insights

Dense SAE Latents Are Features, Not Bugs. They find pairs of opposing features that fire very frequently. Far from being useless, they find these encode meaningful concepts.
Sakana AI: Reinforcement Learning Teachers of Test Time Scaling (preprint). Rather than using to RL to improve solution, they focus on improving the ability for the model to teach other models. RL reward is based on how useful generated training examples are (to smaller learning models); rather than being rewarded on correctness of their final answer.

LLM

Google releases Gemma 3n, for on-device/edge multimodal processing.

Agents

Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities. Achieves 27% on Humanity’s Last Exam.
Google introduce Gemini CLI, an open-source AI agent for coding in your terminal.

Audio

ElevenLabs introduces 11ai, a voice conversational assistant; exploits MCP to enable connection to resources (calendar, etc.).
ElevenLabs introduces Voice Design v3, an improvement to their text-to-voice system for designing a voice.

Image Synthesis

Higgsfield Soul is a new high-aesthetics image model (examples).

Video

ByteDance: InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions. Demonstrates the possibility of controlling multiple characters talking, matched to provided audio.

World Synthesis

Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition (preprint, video example). Improved quality and consistency.
Runway is experimenting with generative text adventures.
PlayerOne: Egocentric World Simulator (preprint).

Science

Google DeepMind releases AlphaGenome (including API capabilities); it takes base-pair sequences as input, and predicts genomic behavior outputs.

Robots

Having humanoid robots walking around in the real world points towards improvements to robustness and reliability.
Google release a VLA that allows robotic control on-device: Gemini Robotics On-Device brings AI to local robotic devices.
- See also tech report: Gemini Robotics: Bringing AI into the Physical World.

Posted in AI, News | Tagged agents, audio, image synthesis, LLM, research, robots, Science, video, world synthesis | Leave a comment

AI News 2025-06-19

Posted on 2025-06-19 by KevinYager

General

The U.S. Army Reserve has formed a new Detachment 201: Executive Innovation Corps. The group (which includes OpenAI CPO Kevin Weil, Palantir CTO Shyam Sankar, Meta CTO Andrew Bosworth, and Bob McGrew) will focus on tech issues.
Epoch reports continued progress in AI, including on their hard FrontierMath benchmark.

OpenAI has been awarded a US Department of Defense contract for $200M to develop AI models for defense applications.
Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce. Interesting delineation of jobs:

Research Insights

Distillation Robustifies Unlearning (preprint, demo, discussion). Normal unlearning suppresses knowledge in a model, but adversarial prompting or fine-tuning can bring the knowledge/behavior back. They show that distilling into a new model more reliably eradicates the undesired information.
Self-Adapting Language Models. The models update their own fine-tuning data and update directives.

LLM

Google is nearing release of Gemini 2.5 Pro Deep Think, which deploys more inference-time compute to improve reasoning.
Google launch 2.5 Flash-Lite, a very fast (very low cost) reasoning model.
Google DeepMind technical report: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. The paper lists a thousand contributors. (Informal summary.)

Agents

Anthropic describes the multi-agent system underlying Claude’s research capabilities: How we built our multi-agent research system.

Safety

Anthropic blog post: SHADE-Arena: Evaluating sabotage and monitoring in LLM agents (paper).
Avoiding Obfuscation with Prover-Estimator Debate. They show that honesty is incentivized at equilibrium (under certain conditions).
OpenAI: Toward understanding and preventing misalignment generalization.
- Paper: Persona Features Control Emergent Misalignment.
- They show that intentional misalignment training (e.g. to write bad code) causes an emergent “evil” personality. But this can be detected and countered.

Video

Seedance 1.0: Exploring the Boundaries of Video Generation Models.
Mystery model “Kangaroo” being tested in the AI video arena.
Hailuo AI (MiniMax) unveils Hailuo 02 (examples: various, various, tsunami, fight scene, fox running, blogger).
Midjourney releases their video model. Outputs are often beautiful (early examples, release examples: various, various, Ethan Mollick, highly rated, complex environments).

Science

Automation of Systematic Reviews with Large Language Models. By automating the medical document review process, they can save years of human labor.

Hardware

AMD unveiled its new MI350 chip, optimized for AI workloads. They are focusing on open/compliant coding standards, and energy/cost efficiency.

Cars

Data from Waymo: New Insights for Scaling Laws in Autonomous Driving. They show scaling laws apply to autonomous driving: using more data and more compute for training yields reliable improvements in performance.

Robots

RoboBrain 2.0 is an open-source, general purpose robot control model (video).
1X World Model: Evaluating Bits, not Atoms (preprint).
Generalist shows off a model that can enable relatively simple robot arms to perform precise work.
Hexagon robotics announces Aeon humanoid robot (on wheels), optimized for industrial work (video).

Posted in AI, News | Tagged agents, cars, hardware, LLM, research, robots, safety, Science, video | Leave a comment

AI News 2025-06-12

Posted on 2025-06-12 by KevinYager

General

Analysis of the location of datacenters (750 AI supercomputers): Trends in AI Supercomputers.

Scaling Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data.

Research Insights

Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting. Reasoning models don’t benefit from “think step by step” prompting.
Epoch AI was given access to the full reasoning traces of o3-mini (normally only summaries are shown to the user) to conduct this research: Beyond benchmark scores: Analyzing o3-mini’s mathematical reasoning. Mathematicians reviewed the traces of working on math problems; one evaluated described o3 as a “vibes-based inductive reasoner”.
Corrector Sampling in Language Models. Resampling prior tokens can be used to do small amounts of backtracking and thereby improve performance.
SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat. The models compete and evaluate each other, without requiring external (human) scoring; models are updated on the computed preference ordering. This seems to provide a simple way to recursively self-improve.

LLM

Anthropic adds Claude Gov; models intended for national security.
Mistral announces Magistral, a reasoning model. Two variants: 24B open-source or a larger enterprise version via API.
- An interesting result from the report (section 7.2: Eating the multimodal free lunch): They base model is multi-modal, but RL is done using text only. Yet, they observe this text-only training does not harm multi-modal performance; in fact multi-modal performance improves. This suggests modalities are well-entangled and that transfer learning between modalities is naturally occurring.
OpenAI announced the released of o3-pro (release notes).
- Review from an early tester: God is hungry for Context: First thoughts on o3 pro.

Vision

Meta announce V-JEPA 2 (paper) a vision model that builds a world model, and could be useful for robotic control.

Audio

ElevenLabs introduces v3, an expressive text-to-speech system that supports intonation, accent, and even non-words like laughs and sighs or sound effects (examples: joke, affecting accents, various).

World Synthesis

4DV is demoing 4D Gaussian Splatting, wherein multi-camera video data is converted into a temporal/video 3D-spatial reconstruction (videos showing interaction: 1, 2, 3).
- FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction (preprint).

Science

Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology.

Cars

Tesla has provided some updated details on their current “full self-driving” (FSD) implementation. Some claims: 3.5B miles driven by FSD across 6 million vehicles, 54% safer than human.

Robots

Video of Figure 02 robot sorting deformable packages. (Uncut 1 hour video of this activity, proving the prior footage was not cherry-picked.)
1X announce Redwood, their vision-language transformer model for robot control.

Posted in AI, News | Tagged audio, cars, FSD, LLM, research, robots, Science, vision, world synthesis | Leave a comment

AI News 2025-06-05

Posted on 2025-06-05 by KevinYager

General

Trends – Artificial Intelligence. A 340 page report detailing trends in usage, compute, etc. (summary of main points).

Research Insights

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning. They investigate how LLMs represent concepts. They find that:
- LLMs do map concepts into categories, similar to humans.
- But LLMs do not capture “typicality” in the way humans do; so category membership is not identical.
- A difference between LLMs and humans: the former optimize for compression, while the latter optimize for representational flexibility.
How much do language models memorize? This builds on previous work estimating deep learning capacity at ~2 bits/parameter. They estimate 3.6 bits/parameter for GPT-style models.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (code). Interesting result showing that one can improve model performance by training against negative signals (failure). In fact, this has certain advantages; e.g. purely rewarding successes means that alternative success pathways are not reinforced, whereas negative signals to failure boosts all success pathways.
General agents need world models. They provide formal evidence that any agentic system that achieves generalized capability must be building and exploiting some kind of world model.
Predicting Empirical AI Research Outcomes with Language Models. They show that LLMs can exceed human performance in predicting which AI/ML research projects will be fruitful.

LLM

LisanBench (github) is a new benchmark that evaluates long-term task coherence (“stamina”) through a game where the LLM must progressively alter a word (one character at a time), always yielding a valid English word, to build the longest possible chain. Although highly contrived, this does seem to test longer-range planning. The results conform to vibes about model intelligence.

Anthropic has launched Claude Explains, a blog of AI generated posts (with human verification). The focus (currently) appears to be teaching simple coding concepts.
OpenAI announces updates to ChatGPT for business.
- Deep research can now search across defined private data repositories (Sharepoint, Google Drive, Dropbox, etc.).
- Chat queries and data analysis requests can draw directly from connected data sources.
- ChatGPT now supports custom connectors, based on MCP.
- Being deployed for Teams, Enterprise, and Edu.
- Record mode transcribes meetings, providing a summary document with pointers to the transcript/timecode.
Google updated Gemini 2.5 Pro.

Agents

Sakana AI, Jeff Clune, et al. report on: The Darwin Gödel Machine: AI that improves itself by rewriting its own code (github). This builds on the earlier ADAS work (Automated Design of Agentic Systems) that searches for good agent designs, but ups the ante by also recursively improving the search system.
- Preprint: Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents.

Safety

Yoshua Bengio launches LawZero, a non-profit dedicated to advancing safe-by-design AI.

Audio

Elevenlabs introduces a multi-modal assistant, that can handle mixture between voice and text input (at the same time; not requiring toggling between modes). It does seem like a productive way to interact with an AI.
Play AI is open-sourcing PlayDiffusion (demo) a diffusion-LLM for speech, which allows for inpainting (example).
Bland announces an improvement in their text-to-speech model, with cloning of voice, accent, style, etc. They claim it is finally past the uncanny valley.

Image Synthesis

Fal AI introduces FLUX Kontext, which allows image editing.

Video

AMC is integrating Runway ML genAI into its workflows (mostly for ideation, pre-vis, and promotional materials).
Luma introduces Modify Video, allowing style transfer or video-generation conditioned on an input video.

Science

FutureHouse releases ether0, a 24B reasoning model for molecular design.
- Preprint: Training a Scientific Reasoning Model for Chemistry.

Posted in AI, News | Tagged agents, audio, image synthesis, LLM, research, safety, Science, video | Leave a comment

AI News 2025-05-29

Posted on 2025-05-29 by KevinYager

General

Essay by Pete Koomen: AI Horseless Carriages (video version: Why AI Apps Still Feel Broken with Pete Koomen). It makes the case that our current approach of adding AI to existing applications is akin to early horseless carriages (which added engines to existing carriage designs; instead of being designed from scratch to optimally take advantage of an engine). Future AI-first applications need to rethink the user experience in light of AI capabilies.

Research Insights

LLMs on the Line: Data Determines Loss-To-Loss Scaling Laws. They report that dataset curation and tokenization scheme have a strong effect on final loss, while architecture has a more minor effect (any reasonable deep learning structure can learn).
Some work showing how doing RL on internal confidence can improve models (without needing external data):
- Learning to Reason without External Rewards (code). They show that LLMs can learn despite lacking ground truth answers, but rather by optimizing their own internal confidence.
- Maximizing Confidence Alone Improves Reasoning (code); a.k.a. RENT: Reinforcement Learning via Entropy Minimization.
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.
Creative Preference Optimization. Creativity can be optimized.

LLM

ByteDance releases BAGEL: Unified Model for Multimodal Understanding and Generation (7B, weights, code, demo).
Recently released Claude 4 Opus achieves record 8.5% on ARC-AGI-2.
DeepSeek releases an minor upgrade: DeepSeek-r1-0528.

Agents

OpenAI updates Operator to use the o3 model.
Manus introduce a system that will build a slide deck on demand.

Safety & Interpretability

Anthropic open sources some interpretability tools (circuit tracer, tutorial).

Audio

Kyutai demos Unmute, a text-to-speech and speech-to-text capability. Will be open-sourced.
Anthropic announce that they will begin rolling out voice conversation mode.
Chatterbox TTS is a high-quality open source speech synthesis model (try).

Image Synthesis

Goodfire presents: Painting with concepts using diffusion model latents (try). One can apply semantic labels spatially, in order to guide image generation.
Runway introduces Layout Sketch, allowing one to create images with positional guidance of reference elements.

Video

Viggle Live enables real-time avatar control.
Workflow: Use Google Street View imagery combined with image synthesis (e.g. Runway References) and then video generation (e.g. Runway Gen3) to generate a sequence of “on location” clips.
Google DeepMind report SignGemma, a forthcoming open model for converting sign language video into text.

World Synthesis

EVA: Expressive Virtual Avatars from Multi-view Videos (preprint). Enables 3D virtual avatars viewable from any direction, based on a monocular input.
Odyssey has a crude prototype for generative worlds that you can explore (try it here); they claim frames are generated every ~40 ms. Blog post: AI video you can both watch and interact with in real-time. As they note, the technology will improve rapidly. Already you can get a taste of exploring environments that are generated on demand.

Science

OpenAI adds to ChatGPT scaffolding the ability to visualize molecules (RDKit library).

Robots

Video of Weave robot tidying up.
Video of a general-purpose robot playing badminton.

Posted in AI, News | Tagged agents, audio, image synthesis, LLM, research, robots, safety, Science, video, world synthesis | Leave a comment

AI News 2025-05-22

Posted on 2025-05-22 by KevinYager

General

In November 2024, this paper made bold claims: Artificial Intelligence, Scientific Discovery, and Product Innovation; in particular that AI greatly increased patenting and that top performers benefiting most. MIT conducted an investigation and found the work fraudulent: Assuring an accurate research record. The claims of AI-generated patents cannot be substantiated. The other results may be true, but this particular report should not be used as evidence.
Survey shows US workers are rapidly adopting AI: 30% of workers in Dec 2024; now up to 40% of workers.
- The Labor Market Effects of Generative Artificial Intelligence.
Pew research: How the U.S. Public and AI Experts View Artificial Intelligence. Significant negative sentiment and worry about the future of AI.
Publication of: ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems.
METR provides a preliminary update on their analysis of “AI task completion”, by including examples other than software engineering tasks. The results suggest different scaling based on task type, but a general trend of exponentially increasing capabilities.

Research Insights

Qwen publish: Parallel Scaling Law for Language Models (code). They propose parallel computation for improved scaling of training-time and inference-time compute. This requires a learned transformation step before going into the model, and a learned aggregation on model output.
Harnessing the Universal Geometry of Embeddings. They find evidence for a “Strong Platonic Representation Hypothesis” wherein all models learn essentially the same representation. This implies a well-defined “consensus reality” for any given dataset.

LLM

Why We Think (Lilian Weng) provides a nice review of inference-time compute methods.
Google announce Gemini 2.5 Pro Deep Think. It demonstrates extremely good performance on math and code benchmarks.
Google announce Gemini Diffusion, a text diffusion model that enables 5× faster generation.
Google add native audio output to Gemini 2.5 Pro and 2.5 Flash.
Google add compute use capabilities to Gemini.
Google is expanding its rollout of AI Mode for the main Google search product.
Anthropic announce: Claude Sonnet 4 and Claude Opus 4.

Agents

OpenAI announces: A research preview of Codex in ChatGPT. Whereas Codex-CLI runs locally, this new system runs on OpenAI’s servers. Uses Codex-1 (based on o3, optimized for coding), and can be used for things like: understanding a repo, fixing bugs in a repo, etc.
Google adds an Agent Mode to Gemini, allowing you to delegate tasks for it to work on.
Google release Jules, an asynchronous coding agent.
Google published a video demo of their Project Astra research prototype, an AI assistant operating from your smartphone.

Image Synthesis

Google announce Imagen 4.

Video

Viggle introduce LIVE, real-time webcam character/avatar animation (that runs in browser).
Google announce Veo 3. It also natively generates audio. Examples: conversation, cooking, singing, simple story, cinematic action sequence, car show interviews, We Can Talk, podcat, various.
Google announce Flow, an AI filmmaking tool that integrates with Veo.
Google announce that NotebookLM will be adding video overviews, with graphics generated to match the audio presentation.

Audio

Google announce improvements to their Lyria 2 music generator.

Science

FutureHouse report AI-accelerated research into combating a form of blindness: Demonstrating end-to-end scientific discovery with Robin: a multi-agent system.
Generalization bias in large language model summarization of scientific research. They show that AI summarization, though not actually erroneous, is less precise than human experts.

Hardware

OpenAI announces that io (led by Jony Ive) is merging with them, in order to develop hardware optimized for interfacing with AI.
- Rumors for what this might mean: Tantalizing details of Jony Ive’s AI device leak after OpenAI meeting.

Robots

Video of LimX Dynamics TRON 1, performing various real-world relevant tasks.
Video of Tesla Optimus performing real-world tasks autonomously. Reportedly, all tasks are accomplished using a single neural network trained on human POV data.

Posted in AI, News | Tagged agents, audio, hardware, image synthesis, LLM, research, robots, Science, video | Leave a comment

AI News 2025-05-15

Posted on 2025-05-15 by KevinYager

General

The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis. ChatGPT, when used appropriately, can improve learning outcomes.
Analysis of AI usage: Large Language Models, Small Labor Market Effects. They find evidence of significant uptake/usage, but relatively minor effects on economic indicators.

Research Insights

Reinforcement Learning for Reasoning in Large Language Models with One Training Example. Training over and over on a single example can yield gains.
- Interesting to compare/contrast to RL on unlimited synthetic data: Absolute Zero: Reinforced Self-play Reasoning with Zero Data.
Sakana AI introduces Continuous Thought Machines (interactive report, preprint, code), a new neural approach where the neurons are synchronized and allowed to move their attention around over time. This allows a crude approximation of temporal “thinking” as the neurons modify their focus and state.

LLM

OpenAI add o4-mini to their reinforcement fine-tuning API.
ByteDance releases SeedCoder 8B.
OpenAI adds GPT-4.1 to the ChatGPT web product.
OpenAI release HealthBench. In addition to providing a useful way to track progress on LLMs for healthcare applications, the current results demonstrate just how effective existing LLMs can be in this application space.

Agents

Google describes: AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms (paper). An evolutionary approach is used to optimize algorithms for tasks including math, chip design, datacenter scheduling, and fine-tuning LLMs.
- Among many other things, it was able to improve on various optimal packing math problems.

Safety

OpenAI launch safety evaluations hub.

Audio

Stability AI and Arm release Stable Audio Open Small (weights).

Video

Google announces that the latest improvements to Gemini 2.5 (Pro and Flash) greatly improve video understanding.
Tencent HunyuanCustom is an open-source video model (preprint, code).
New lipsync model: KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution (preprint, code, demo).
LTXV 13B Distilled claims rendering 5× faster than base model.

World Synthesis

Enigma Labs claims they have made the first multiplayer AI-generative video game (a multiplayer car racing game). They say they will open-source the work eventually. Although the gameplay video shows crude graphics, it is further evidence that generative environments are a key part of future entertainment.

Science

AI-Engineered DNA Turns Genes On and Off in Blood Cells.
- Publication in Cell: Design principles of cell-state-specific enhancers in hematopoiesis.
Inference-time compute methods being applied to AI for bioscience: Test-Time Scaling Unlocks a Leap Forward in De Novo Antibody Design.
- Preprint: De novo design of hundreds of functional GPCR-targeting antibodies enabled by scaling test-time compute.
Behind the Noise (github). Self-supervised denoising can learn meaningful representations for science images.
- Preprint: Behind the Noise: Conformal Quantile Regression Reveals Emergent Representations.
Meta report: Sharing new breakthroughs and artifacts supporting molecular property prediction, language processing, and neuroscience.
- Open Molecules 2025 (OMol25) and Meta’s Universal Model for Atoms (UMA): Revolutionizing design at the atomic scale
- Adjoint Sampling: A breakthrough in highly scalable, reward-driven generative modeling
- Unlocking how the human brain develops language

Hardware

Google announces: Gemini smarts are coming to more Android devices. E.g. Gemini will be available from one’s smart watch. A Samsung headset with Gemini integration will supposedly launch later this year.

Robots

Tesla shows a video of Optimus robot dancing. Fluid motion like this tests the limit of hardware and software (latency, real-time compensation, etc.).

Posted in AI, News | Tagged hardware, LLM, research, robots, safety, Science, video, world synthesis | Leave a comment

AI News 2025-05-08

Posted on 2025-05-08 by KevinYager

General

OpenAI is purchasing Windsurf (who have produced an AI-coding editor) for $3B.
AI usage: ChatGPT Is Still Leading the AI Wars but Google Gemini Is Gaining Ground.

Research Insights

Layers at Similar Depths Generate Similar Activations Across LLM Architectures. Fascinating to see such strong similarities between LLMs trained by different groups on different datasets.
Absolute Zero: Reinforced Self-play Reasoning with Zero Data (preprint, code, models). Models learn to propose tasks that improve learning, and then solve those tasks. This approach thus claims the models can iteratively improve through self-play, providing hope that RL-like methods can be applied even to domains without verification.

LLM

Anthropic announces Integrations for Claude, allowing connection to tools like Zapier, Stripe, etc. This includes the ability to build connections to custom tools.
Mistral announce Mistral Medium 3.
A quick-and-dirty benchmark obtained by averaging 28 popular benchmarks (updated version). The leaders are Gemini 2.5 Pro, o3, and Claude 3.7 Thinking.

Audio

xAI adds voice mode to Grok.
Nvidia open-sources Parakeet TDT 0.6B V2, a fast audio transcription model.

Video

LTX Studio announce open-source video model: LTX-Video 13B (code, examples, more examples).
HeyGen announces Avatar IV (example, another example). Simpler usage and more expressive output.

Brain

Google announce: A new light on neural connections. They use light microscope to map neuronal connections, bypassing the need for more expensive/complicated imaging (e.g. electron microscopy). Their method improves upon existing expansion microscopy (ExM), wherein tissue is infused and stretched to expand the sizescale of small structures, using chemical labeling to highlight relevant proteins and deep learning to do image processing. This should increase the rate at which we can image connectomics in neuronal tissue.
- Nature publication: Light-microscopy-based connectomic reconstruction of mammalian brain tissue.

Robots

π0.5: a Vision-Language-Action Model with Open-World Generalization. The claim is that this model enables a robot to handle never-seen real-world environments.
VideoMimic: Visual imitation enables contextual humanoid control (preprint). Shows good progress in having robots learn motion policies from monocular video data.

Posted in AI, News | Tagged audio, brain, LLM, research, robots, video | Leave a comment