Research Insights
- TextGrad tries to do the equivalent of gradient backpropagation for LLMs: computing “gradients” of performance with respect to the text inputs/outputs passed between LLMs, so that you can automatically optimize the behavior of interconnected LLM agents (a rough sketch of the idea follows this list). I don’t know whether this particular approach is the right one, but something like it seems promising.
- Mixture-of-Agents applies a well-reasoned architecture to the general “LLMs working together” workflow: models are arranged in layers, with rough initial LLM replies fed into the next layer, where the output can be further refined (also sketched after this list). The selection of models within each layer can be used to increase diversity (different LLMs balance each other) and performance (the best LLM for a given input can be emphasized). The authors show improved performance compared to just using one of the underlying LLMs single-shot. (Video going through the paper.)
- Aidan McLaughlin claims that we are ~1 year away from AGI, because current models combined with search (testing out many options) can already unlock enormous capabilities. Seems like an overzealous take, but there is mounting evidence that search greatly improves capabilities. For instance, Ryan Greenblatt claims he was able to massively improve performance on one of the most challenging benchmarks simply by using GPT-4o to sample thousands of candidate solutions and pick the best one (the basic pattern is sketched below).
- There are also plenty of academic papers working on search methods. A new preprint, Transformers meet Neural Algorithmic Reasoners, appears to combine LLMs with graph neural networks: instead of searching/iterating over the text outputs, the refinement happens inside the model via graph methods.
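To make the TextGrad-style idea concrete, here is a minimal sketch of “textual gradient descent.” This is not the actual TextGrad API; `call_llm`, `textual_gradient`, `apply_gradient`, and `optimize_prompt` are hypothetical placeholders, and the “gradient” is simply an LLM-written critique that gets folded back into the prompt.

```python
# Minimal sketch of textual gradient descent (hypothetical; not the TextGrad API).
# call_llm is a placeholder for whatever chat-completion endpoint you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def textual_gradient(prompt: str, output: str, objective: str) -> str:
    """The 'gradient': an LLM critique of the output with respect to the objective."""
    return call_llm(
        f"Objective: {objective}\nPrompt: {prompt}\nOutput: {output}\n"
        "Explain concretely how the prompt should change to improve the output."
    )

def apply_gradient(prompt: str, critique: str) -> str:
    """The 'update step': rewrite the prompt according to the critique."""
    return call_llm(
        f"Current prompt: {prompt}\nCritique: {critique}\n"
        "Return an improved prompt that addresses the critique."
    )

def optimize_prompt(prompt: str, task_input: str, objective: str, steps: int = 3) -> str:
    """Iteratively refine the prompt, loosely analogous to a few steps of gradient descent."""
    for _ in range(steps):
        output = call_llm(f"{prompt}\n\n{task_input}")
        critique = textual_gradient(prompt, output, objective)
        prompt = apply_gradient(prompt, critique)
    return prompt
```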
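Similarly, the Mixture-of-Agents layering can be sketched in a few lines. This is a simplified, hypothetical version of the paper’s setup: `call_model` is a placeholder for querying a named model, `layers` lists which models sit in each layer, and a final aggregator model synthesizes the last layer’s drafts.

```python
# Hypothetical sketch of a Mixture-of-Agents-style pipeline (simplified).

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your LLM provider(s) here")

def mixture_of_agents(question: str, layers: list[list[str]], aggregator: str) -> str:
    previous: list[str] = []
    for layer in layers:
        prompt = question
        if previous:
            drafts = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(previous))
            prompt = f"{question}\n\nEarlier drafts to refine:\n{drafts}"
        # Each model in this layer answers, seeing the previous layer's drafts.
        previous = [call_model(m, prompt) for m in layer]
    final_prompt = (
        f"{question}\n\nCandidate answers:\n\n" + "\n\n".join(previous)
        + "\n\nSynthesize the single best answer."
    )
    return call_model(aggregator, final_prompt)
```

In this framing, diversity comes from putting different models in the same layer, and performance comes from choosing which models (and which aggregator) to emphasize.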
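And the search claims above mostly reduce to a best-of-N pattern: sample many candidates, score them with some verifier, keep the best. A minimal sketch, with `sample_solution` and `score` as hypothetical placeholders (e.g. a high-temperature LLM call and a unit-test pass rate):

```python
# Minimal best-of-N search sketch: generate many candidates, keep the highest-scoring one.

def sample_solution(problem: str) -> str:
    raise NotImplementedError("one high-temperature LLM sample")

def score(problem: str, candidate: str) -> float:
    raise NotImplementedError("e.g. fraction of unit tests passed, or a judge model's rating")

def best_of_n(problem: str, n: int = 1000) -> str:
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))
```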
World Synthesis
Neural radiance fields and Gaussian splatting are making it possible to generate high-quality 3D imagery that is fast to render. Where is this headed?
- These methods are bandwidth-efficient. Traditionally, to interact with a 3D scene one would either render on a server and transmit 2D video (high latency), or transmit tons of 3D data (vertex/mesh models) so the user’s computer can render locally (assuming it is powerful enough). But now you just transmit a point cloud, which is fast to render. (You can play with examples: Luma captures.)
- These methods are scalable. They’ve been adapted to large-scale scenes. Google Maps is already integrating this in select ways, and we will probably soon see a true virtual-Earth product (where you can move around in 3D anywhere).
- Text-to-3D is steadily improving (Point-E, threestudio, ProlificDreamer, DreamFusion, Magic3D, SJC, Latent-NeRF, Fantasia3D, TextMesh, Zero-1-to-3, Magic123, InstructNeRF2NeRF, Control4D, Cat3D). Neural methods should allow one to merge real 3D (from photoscans) with traditional 3D renders and with AI generations.
- Given the progress in generative images (2D), objects (3D), and video (2D+time=3D), the obvious next step is 4D: volumetric scenes evolving in time. There was initial work on dynamic view synthesis from videos and on dynamic scene rendering. And now Vidu4D demonstrates generation of non-rigid 3D objects transforming appropriately over time. It’s currently crude, but you can see the potential.
- Some folks (e.g. David Holz, founder of Midjourney) see the end goal as immersive environments that are neural-rendered in real time, so that you can explore and interact with worlds generated according to your inputs. (A holodeck, of sorts.)
Video
- A couple of weeks ago, Kling was demoed (a Chinese company’s model, with limited access). The outputs appear to be rather coherent video (more examples in this thread, and this one shows a two-minute generation).
- Now Lumalabs (known for 3D capture tech) surprised everyone with a video model: Dream Machine. It seems to be the best publicly-available model. Examples. Free to use, but the wait times can be quite long.
- Runway is teasing Gen-3 Alpha, a video model that might be as good as Luma’s Dream Machine.
- Hedra is another video-avatar animator. Looks quite good.
Audio
- Camb.ai released an open-source voice generation/cloning model. 140 languages, reportedly very good quality. Not sure how it compares to ChatTTS. But it’s nice to have a variety of open-source options.
- ElevenLabs have added video-to-audio to their many AI-audio options.
- Google DeepMind demonstrate video-to-audio, which can generate plausible audio (sound effects, music) for a video clip.
Apple
- Apple announces a bunch of AI features. It’s the expected stuff: integrated writing assistants, on-the-fly generation of images and emoji, and a much smarter Siri.
- OpenAI’s ChatGPT will now be available in Apple products.
- At first, people were concerned that all AI requests were being routed to OpenAI. But it actually sounds like Apple is industry-leading on user privacy for cloud AI: many parts of the workflow will operate on-device, and the cloud components use a hardened architecture (encryption, stateless compute, enforceable guarantees, etc.).