Research Insights
- Symbolic Learning Enables Self-Evolving Agents. Demonstrates automated, data-driven optimization of LLM workflows. This tries to mimic back-propagation and gradient descent (cf. TextGrad). It is also another hint of recursive self-improvement, since an AI model is optimizing an AI model. (A toy sketch of the textual-gradient idea appears after this list.)
- The Remarkable Robustness of LLMs: Stages of Inference? The authors intentionally break the network (e.g. by swapping adjacent layers), yet it continues to work remarkably well. This shows that LLMs are quite robust, and the interventions let the authors identify distinct stages of processing. (A toy version of the layer-swapping intervention appears after this list.)
- They also use these interventions to infer what different layers are doing. They break apart the LLM transformer layers into four stages:
- Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
- Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
- Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting.
- Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
- This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
- A group at MIT introduced Diffusion Forcing, a hybrid between next-token prediction and full-sequence generation via diffusion: the tokens being denoised can each have a different noise level, providing more control. (A toy illustration of per-token noise levels appears after this list.) The concept is general, but they apply it specifically to video and planning. They show how one can generate unlimited-length video (with control/guidance). Planning can handle uncertainty through variable noise levels, and could be useful for robotics. Although only demonstrated on a small model, the concept shows promise.
- Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems introduces a more challenging task for large-context LLMs (to summarize, with sourcing, a large amount of information). This should be a useful metric/benchmark for future improvements.
- The comparison to humans is also interesting. Humans outperform LLMs if they take enough time to complete the task. But there are obviously cases where an imperfect <1 min summary is preferable to a better-quality ~1 hour human analysis. And, of course, LLM performance will improve over time.
- Self-Play Preference Optimization for Language Model Alignment presents an alternative to RLHF and DPO. The SPPO method treats human preferences as probabilities and seeks a Nash equilibrium policy in a constant-sum two-player game; this better captures the intransitivity and irrationality of human preferences. (A toy illustration of the underlying self-play update appears below.)
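To make the symbolic-learning item above concrete, here is a minimal sketch of textual "gradient descent" over a prompt: an LLM critic produces feedback (the "gradient") and an LLM optimizer applies it. The call_llm helper and the prompts are hypothetical placeholders, not the paper's actual pipeline.

```python
# Toy sketch of textual "gradient descent" over a system prompt (illustrative only).
# call_llm is a hypothetical stand-in for whatever chat-completion API you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def optimize_prompt(system_prompt: str, examples, steps: int = 3) -> str:
    for _ in range(steps):
        # "Forward pass": run the current prompt on a few labeled examples.
        outputs = [call_llm(f"{system_prompt}\n\nInput: {x}") for x, _ in examples]
        # "Backward pass": a critic model describes how the prompt failed.
        feedback = call_llm(
            "Given these (input, expected, actual) triples, describe how the system prompt should change:\n"
            + "\n".join(f"{x} | {y} | {o}" for (x, y), o in zip(examples, outputs))
        )
        # "Gradient step": an optimizer model rewrites the prompt using the feedback.
        system_prompt = call_llm(
            f"Rewrite this system prompt to address the feedback.\nPrompt: {system_prompt}\nFeedback: {feedback}"
        )
    return system_prompt
```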
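And here is a toy version of the layer-swapping intervention from the robustness paper, using GPT-2 via Hugging Face transformers for illustration (the paper studies a range of models; the layer indices here are arbitrary):

```python
# Swap two adjacent transformer blocks in GPT-2 and check that generation still works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

blocks = model.transformer.h          # ModuleList of 12 transformer blocks
i = 5                                 # swap blocks 5 and 6 (mid-stack)
blocks[i], blocks[i + 1] = blocks[i + 1], blocks[i]

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=8, do_sample=False)
print(tok.decode(out[0]))             # typically still completes sensibly
```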
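For Diffusion Forcing, the central trick is that each token in the sequence carries its own noise level; a toy illustration (shapes and schedule invented for illustration):

```python
# Per-token noise levels: near-term tokens are nearly clean, far-future tokens stay noisy.
# A denoiser trained on such sequences (conditioned on the per-token noise levels) can
# roll out arbitrarily long sequences by repeatedly shifting the schedule forward.
import torch

def add_per_token_noise(x, noise_levels):
    """x: (T, D) sequence of latents; noise_levels: (T,) values in [0, 1]."""
    sigma = noise_levels.unsqueeze(-1)                 # one noise level per token
    return (1 - sigma) * x + sigma * torch.randn_like(x)

T, D = 16, 32
x = torch.randn(T, D)                                  # e.g. latents for 16 video frames
noise_levels = torch.linspace(0.0, 1.0, T)             # causal schedule: past clean, future noisy
x_noisy = add_per_token_noise(x, noise_levels)
print(x_noisy.shape)                                   # torch.Size([16, 32])
```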
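And for SPPO, a toy discrete illustration of the kind of self-play update it builds on: treat pairwise preferences as win probabilities and take multiplicative-weights steps, averaging iterates to approximate the equilibrium mixture. The 3x3 preference matrix is invented (note it is intransitive), and the real method updates an LLM policy rather than a 3-vector.

```python
# Toy multiplicative-weights self-play on an intransitive preference matrix.
import numpy as np

# P[i, j] = probability that response i is preferred over response j.
P = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.8],
              [0.8, 0.2, 0.5]])

pi = np.ones(3) / 3                    # start from a uniform policy over the 3 responses
eta, steps = 0.5, 2000
avg = np.zeros(3)
for _ in range(steps):
    win_vs_policy = P @ pi             # P(response i beats a sample from the current policy)
    pi = pi * np.exp(eta * win_vs_policy)
    pi /= pi.sum()
    avg += pi / steps
print(avg)                             # averaged iterates approximate the Nash mixture
```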
Tools
There are several demos of multi-agent orchestration systems (Camel, LoopGPT, JARVIS, OpenAGI, AutoGen, TaskWeaver, MetaGPT). Increasingly, cloud solutions are also appearing:
- Numbers Station released Meadow, an agentic framework for data workflows (code).
- CrewAI says they provide multi-agent automations (code).
- LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.
A related coordination strategy is to triage user queries, balancing between small/fast models and larger, more capable (but more expensive) models:
- RouteLLM: Learning to Route LLMs with Preference Data; they evaluate router models that balance between cost and quality.
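To illustrate the routing idea, here is a minimal sketch; the scoring heuristic, threshold, and endpoint names are placeholders rather than RouteLLM's learned router:

```python
# Crude query router: send "hard" queries to a big model, everything else to a small one.
def difficulty_score(query: str) -> float:
    """Stand-in for a learned router; here just a length/keyword heuristic."""
    hard_markers = ("prove", "derive", "step by step", "explain why")
    score = min(len(query) / 500.0, 1.0)
    if any(m in query.lower() for m in hard_markers):
        score += 0.5
    return min(score, 1.0)

def route(query: str, threshold: float = 0.4) -> str:
    return "large-model endpoint" if difficulty_score(query) >= threshold else "small-model endpoint"

print(route("What is 2 + 2?"))                                     # small-model endpoint
print(route("Prove that the sum of two even integers is even."))   # large-model endpoint
```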
LLM
- Perplexity adds multi-step search to their Pro Search product ($20/month); they claim it performs “deeper research on more complex queries with multi-step reasoning, Wolfram|Alpha, and code execution.”
- Microsoft released the code for GraphRAG, which performs document retrieval using a graph-based approach. (A toy illustration of graph-based retrieval follows this list.)
- Kyutai, an open-science AI lab, presented a demo of a real-time voice AI (Moshi), based on their multimodal foundation model. It can listen and speak with very low latency, allowing rather natural conversations. (To some extent, they beat OpenAI to the release of a conversational agent, though their model does not seem as smart as GPT-4o.) You can play with it now; code will apparently be released soon.
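Regarding GraphRAG, here is a rough sketch of graph-based retrieval in general (not Microsoft's actual pipeline): extract entities from each chunk (in practice with an LLM; hard-coded below), link entities that co-occur, then answer queries from the neighborhood of the matched entity.

```python
# Toy entity graph built from document chunks, queried by graph neighborhood.
import itertools
import networkx as nx

chunks = {  # chunk_id -> entities extracted from that chunk (hard-coded for the example)
    "doc1": ["Ada Lovelace", "Charles Babbage", "Analytical Engine"],
    "doc2": ["Charles Babbage", "Difference Engine"],
}

G = nx.Graph()
for chunk_id, entities in chunks.items():
    for a, b in itertools.combinations(entities, 2):
        G.add_edge(a, b, source=chunk_id)   # each edge remembers which chunk linked the pair

def retrieve(entity: str):
    """Return neighboring entities plus the chunks that connect them."""
    return [(nbr, G[entity][nbr]["source"]) for nbr in G.neighbors(entity)]

print(retrieve("Charles Babbage"))
# e.g. [('Ada Lovelace', 'doc1'), ('Analytical Engine', 'doc1'), ('Difference Engine', 'doc2')]
```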
OpenAI
- OpenAI demoed features that we’ve heard about (real-time speech, adjusting tone, rapid OCR, desktop content sharing) and something new: adding voiceovers to Sora videos by cloning your own voice (including changing language). (Full video from AI Engineer World’s Fair.) Hopefully some of these will be available to the public soon.
Audio
- ElevenLabs partnered with estates to bring iconic voices to their service (Judy Garland, James Dean, Burt Reynolds and Sir Laurence Olivier).
- ElevenLabs also released voice isolator, which can eliminate noisy backgrounds (demo).
Video
- Runway Gen-3 Alpha is now available to all (including a prompting guide).
- Google DeepMind released some more examples of generation from Veo. But the model is still not publicly available.
- All the elements are in place to put together AI-generated short-form content: Runway or Luma (especially with Midjourney image prompting) for video, ElevenLabs for Foley audio and narration, Suno or Udio for backing music. Here's a simple example of putting this together. We are starting to see this being used for commercial efforts. Toys R Us partnered with OpenAI to use Sora to generate this commercial. Motorola released this genAI commercial, which integrates their logo into fashion imagery. This seems like an appropriate use of genAI (advertising an AI-enabled phone, generating something that would be hard to do with other methods).
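As a sketch of the final assembly step, one could mux a generated clip, a narration track, and background music with ffmpeg; the file names are placeholders for whatever the tools above export.

```python
# Combine AI-generated video, narration, and music into one short-form clip with ffmpeg.
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "clip.mp4",        # video from Runway/Luma
    "-i", "narration.mp3",   # voiceover from ElevenLabs
    "-i", "music.mp3",       # backing track from Suno/Udio
    "-filter_complex",
    "[2:a]volume=0.3[bg];[1:a][bg]amix=inputs=2:duration=first[aout]",  # duck music under narration
    "-map", "0:v", "-map", "[aout]",
    "-c:v", "copy", "-shortest",
    "short_form_ad.mp4",
], check=True)
```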
3D
- Meta 3D Gen shows improved text-to-3D.
World Synthesis
Continuing my survey of methods leading towards neural world synthesis:
- Object-Aware Gaussian Splatting for Robotic Manipulation: Can do 3D reconstruction and semantic segmentation in real-time. Robot can use this as a world model.
- GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping.
- PAPR in Motion: Seamless Point-level 3D Scene Interpolation: Can smoothly deform point-clouds in ways that make sense. Further demonstration that animated 3D worlds will be possible.
- 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models: Text-to-4D scene generation.
- M-LRM: Multi-view Large Reconstruction Model: Better 3D reconstruction.
- RTG-SLAM: Real-time 3D Reconstruction at Scale Using Gaussian Splatting.
- ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering.
- NeRFiller: Completing Scenes via Generative 3D Inpainting.
- Autonomous driving company Wayve has 4D reconstruction models (PRISM-1) that can be used to simulate driving situations.
- Nvidia video-to-4D synthesis.
- Video generation can now be done at real-time speeds.
- Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text (paper). Uses diffusion-generation of Gaussian splats and camera motions to enable text-to-video where scenes are rigidly consistent.
Brain
- Using AI to interpret brain-scan data shows promise.
- Intracranial-EEG+AI to reconstruct the song a person is hearing.
- External-EEG+AI reconstructing words a person is thinking.
- In 2022, some researchers showed how you could combine fMRI brain scans with stable diffusion, and use that to reconstruct a rough version of the image a person is imagining in their mind.
- In 2023, Meta combined MEG with AI: Toward a real-time decoding of images from brain activity (preprint).
- Some new work improves on this idea, showing rather good image reconstruction from brain scans. They use an attentional mechanism, so that the model identifies the relevant parts of the data and focuses on that.
- These methods are potentially relevant for future brain-computer interfaces. One of the challenges in such systems (e.g. Neuralink) is transmitting and interpreting the large amount of data that can be generated by in-brain probes. Attentional systems could quite effectively analyze and compress the raw data, packaging it more suitably for transmission and understanding. The fact that AI methods can reconstruct decent images from weak data (MRI brain scans) bodes well for viable brain-computer interfaces.
Robots
- Stanford HumanPlus leverages training from human data. They first train the robot controller via RL in simulation, then do imitation of humans in the real world. They demonstrate ‘shadowing’, where the robot is teleoperated in real-time (using only a camera). This bootstraps to the robot doing autonomous tasks (including tying a shoe).
- Similarly, there is a UCSD effort to develop Open Tele-Vision, a teleoperation scheme for robots that also acts as a useful platform for gathering training data.
- In robotics, there is a philosophical split between “build a bunch of specialized robots for each task” and “build one general-purpose design”. And even if one wants a general design, is a humanoid the best form factor? The argument in favor of humanoid robots is that our work and living environments are already optimized for humanoids, so it makes sense for our robots to conform and take advantage of existing tools/infrastructure. Additionally, these recent papers emphasize an additional advantage: by selecting a humanoid shape, it is easier to access/generate relevant training data, since one can more directly train on humans.
- Red Rabbit Robotics is trying to develop an open-source humanoid robot design that others could reproduce for $1,000. Still early days, but it looks like they have a prototype of sorts.
- Leju Robotics launched a humanoid robot called Kuavo. It seems able to do what the other humanoid robots can do (semi-contrived tasks in a slow/deliberate manner).
- Figure recently started shipping humanoid robots to a real client. This video shows their robot working on BMW use-cases.
- GXO Logistics has signed an agreement to use Agility Robotics' Digit in their warehouses (video). Apparently this is subscription-based (robots-as-a-service), which may well become the standard business model for humanoid robot companies.
- Clone Robotics continues to release videos of their micro-hydraulic arm that is remarkably dextrous: hand, lifting, pronation and supination, thumb.