Research Insights
- Guiding a Diffusion Model with a Bad Version of Itself combines an image model with an intentionally worse version of itself, and shows how this combination can be used for image synthesis that better balances coherence vs. diversity. (Despite neural methods being largely “black boxes”, results like this show that we do actually understand enough about internals to make meaningful interventions.)
- LLMs are notoriously bad at math. A new preprint investigates fixing that: Transformers Can Do Arithmetic with the Right Embeddings.
- The model can do 100-digit addition (99% accuracy) after being trained on 20-digit numbers. The capability also carried over to multiplication. The trick is to enforce an embedding that explicitly captures the position of each digit within a number, so numerical representations are first-class during tokenization (conceptually similar to the Polymathic xVal number encoding).
- Of course, LLMs can just call external functions to make sure math gets done correctly. But I like the idea of crafting the LLMs themselves to correctly embody basic concepts from math and logic, as it might generalize to improved performance on a range of other planning/deliberation activities.
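To make the "position of digits within a number" idea concrete, here is a minimal sketch. It assumes character-level tokens and assigns each digit an index equal to its place within its own number (1 for the ones place, 2 for the tens place, and so on), with 0 for everything else. The function name `digit_positions` and the choice to count from the least significant digit are my own illustrative assumptions, not the preprint's exact scheme. In a real model these indices would select rows from a learned embedding table that gets added to the usual token embeddings.

```python
DIGITS = "0123456789"


def digit_positions(text: str) -> list[int]:
    """Per character: the digit's place within its number (1 = ones place), or 0 for non-digits.

    Hypothetical helper for illustration only; the preprint may index digits differently.
    """
    positions = [0] * len(text)
    i = 0
    while i < len(text):
        if text[i] in DIGITS:
            # Find the end of this run of digits (one whole number).
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            # The rightmost digit gets place 1, the next one to its left gets 2, etc.
            for k in range(i, j):
                positions[k] = j - k
            i = j
        else:
            i += 1
    return positions


if __name__ == "__main__":
    example = "12+345"
    for ch, pos in zip(example, digit_positions(example)):
        print(ch, pos)
```

The point of the design is that the "5" in "345" and the "2" in "12" get the same index, so the model can line up digits of equal significance no matter where the numbers sit in the overall sequence. That is plausibly what lets the behavior carry over to numbers longer than those seen in training.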
Audio
- Machine translation has been scaled to 200 languages. The impressive part is that many of these languages have very little training data. The point is that the model can learn language structure from the well-represented languages, and generalize to the languages with less training data.
Avatars
AI audio/video avatars are advancing rapidly. (So this is your periodic reminder to be increasingly skeptical of videos you see, and of random phone calls from loved ones asking you for money.)
- Synthesia EXPRESS-1 avatars show emotions that match the text.
- HeyGen has also demonstrated that they can apply their AI avatar trick (resyncing lip motions in an existing video to match a new script) to videos where the person is in motion. One of the main use cases is converting videos to other languages, so this broadens the range of content that can be targeted. Of course, one can also use it to nefariously change what someone said in an otherwise very-non-AI-looking video.
- V-Express improves further on virtual avatars (generates video aligned with an audio track, based on a single photo).
- ChatTTS is a text-to-speech system that is remarkably good, including being able to add natural-sounding pauses, laughs, etc. Open source, so you can run it all locally if you want.