AI News 2024-07-04

Research Insights

  • Symbolic Learning Enables Self-Evolving Agents. Demonstrates automated, data-driven optimization of LLM workflows. The approach tries to mimic back-propagation and gradient descent (cf. TextGrad). It is also another hint of recursive self-improvement, since an AI model is optimizing an AI model.
  • The Remarkable Robustness of LLMs: Stages of Inference? The authors intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and it allows the authors to identify different stages in processing.
    • They also use these interventions to infer what different layers are doing, breaking the LLM transformer layers into four stages:
      • Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
      • Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
      • Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting.
      • Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
    • This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
  • A group at MIT introduced Diffusion Forcing, a sort of hybrid method between next-token prediction and full-sequence generation via diffusion. The different tokens to-be-denoised can have different noise levels, providing more control. The concept is general, but they apply it specifically to video and planning. They show how one can generate unlimited-length video (with control/guidance). Planning can handle uncertainty through variable noise levels, and could be useful for robotics. Although only demonstrated on a small model, the concept shows promise.
  • Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems introduces a more challenging task for large-context LLMs (to summarize, with sourcing, a large amount of information). This should be a useful metric/benchmark for future improvements.
    • The comparison to humans is also interesting. Humans outperform LLMs, if they take enough time to complete the task. But there are obviously cases where a <1 min imperfect summary is preferable to a ~1 hour better-quality human analysis. And, of course, LLM performance will improve over time.
  • Self-Play Preference Optimization for Language Model Alignment presents an alternative to RLHF or DPO. The SPPO method treats human preferences as probabilities, seeking to find a Nash equilibrium policy in a constant-sum two-player game. This better captures the intransitivity and irrationality of human preferences.
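To make the game-theoretic framing in the last item concrete, here is a toy sketch of self-play on an intransitive preference matrix. It is not the SPPO training procedure: the three-response matrix `P`, the function names, and the tabular multiplicative-weights update are all assumptions chosen for illustration. The point is only that when preferences form a cycle (A beats B, B beats C, C beats A), no single "best" response exists, but a mixed policy that no response beats more than about half the time does.

```python
import math

# Hypothetical preference matrix over three candidate responses:
# P[i][j] = probability that response i is preferred over response j.
# Note P[i][j] + P[j][i] = 1 (a constant-sum game) and the cycle 0 > 1 > 2 > 0.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def win_rates(P, policy):
    """Probability that each response is preferred over a draw from `policy`."""
    n = len(policy)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(len(P))]

def self_play(P, steps=2000, eta=0.1):
    """Run multiplicative-weights self-play and return the time-averaged policy."""
    n = len(P)
    # Start from a deliberately skewed policy so the dynamics are visible.
    raw = [float(i + 1) for i in range(n)]
    policy = [r / sum(raw) for r in raw]
    average = [0.0] * n
    for _ in range(steps):
        average = [a + p / steps for a, p in zip(average, policy)]
        # Upweight responses that tend to beat the current policy.
        rates = win_rates(P, policy)
        unnormalized = [p * math.exp(eta * r) for p, r in zip(policy, rates)]
        total = sum(unnormalized)
        policy = [u / total for u in unnormalized]
    return average

if __name__ == "__main__":
    avg = self_play(P)
    print("averaged policy:", [round(p, 3) for p in avg])
    print("best win rate against it:", round(max(win_rates(P, avg)), 3))
```

The averaged policy, rather than the last iterate, is what approaches the equilibrium in this toy setting. A reward model that assigns each response a single scalar score could not represent the cycle in `P` at all.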


There are several demos of multi-agent orchestration systems (Camel, LoopGPT, JARVIS, OpenAGI, AutoGen, TaskWeaver, MetaGPT). Increasingly, cloud solutions are also appearing:

A related coordination strategy is to triage user queries, to balance between fast/small models and expensive/better larger models:
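One way to picture such triage is a small router that scores each query and escalates only the hard ones. The sketch below is purely illustrative: the model names, the keyword cues, the scoring formula, and the threshold are all hypothetical, and a real system would more likely use a learned classifier than hand-written heuristics.

```python
from dataclasses import dataclass

# Hypothetical cues that hint a query needs multi-step reasoning.
HARD_CUES = ("prove", "derive", "debug", "step by step", "compare")

# Hypothetical model identifiers; substitute whatever models are available.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-capable-model"

@dataclass
class Route:
    model: str
    score: float

def difficulty(query: str) -> float:
    """Crude difficulty estimate in [0, 1] from query length and keyword cues."""
    text = query.lower()
    length_score = len(text.split()) / 40.0
    cue_score = 0.4 * sum(cue in text for cue in HARD_CUES)
    return min(1.0, length_score + cue_score)

def route(query: str, threshold: float = 0.35) -> Route:
    """Send easy queries to the cheap model and escalate the rest."""
    score = difficulty(query)
    model = LARGE_MODEL if score >= threshold else SMALL_MODEL
    return Route(model=model, score=score)

if __name__ == "__main__":
    for q in ("What is the capital of France?",
              "Debug this function and derive its complexity step by step"):
        print(route(q).model, "<-", q)
```

The threshold is the knob that trades cost against quality: lowering it sends more traffic to the larger model.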


  • Perplexity adds multi-step search to their Pro Search product ($20/month); they claim it performs “deeper research on more complex queries with multi-step reasoning, Wolfram|Alpha, and code execution.”
  • Microsoft released the code for GraphRAG, which does document retrieval in a graph-based approach.
  • kyutai Open Science AI lab presented a demo of a real-time voice AI (moshi), based on their multimodal foundation model. It can listen and speak, with very low latency, allowing rather natural conversations. (To some extent, they beat OpenAI to the release of a conversational agent, though their model does not seem as smart as GPT-4o.) You can play with it now; code will apparently be released soon.



  • ElevenLabs partnered with estates to bring iconic voices to their service (Judy Garland, James Dean, Burt Reynolds and Sir Laurence Olivier).
  • ElevenLabs also released a voice isolator, which can eliminate noisy backgrounds (demo).


  • Runway Gen-3 Alpha is now available to all (including a prompting guide).
  • Google DeepMind released some more examples of generation from Veo. But the model is still not available to anyone.
  • All the elements are in place to put together AI-generated short-form content: Runway or Luma (especially with Midjourney image prompting) for video, ElevenLabs for Foley audio and narration, and Suno or Udio for backing music. Here’s a simple example of putting this together. We are starting to see this being used for commercial efforts. Toys R Us partnered with OpenAI to use Sora to generate this commercial. Motorola released this genAI commercial, which integrates their logo into fashion. This seems like an appropriate use of genAI (advertising an AI-enabled phone, generating something that would be hard to do with other methods).



World Synthesis

Continuing my survey of methods leading towards neural world synthesis:



  • Stanford HumanPlus leverages training from human data. They first train the robot controller via RL in simulation, then have the robot imitate humans in the real world. They demonstrate ‘shadowing’, where the robot is teleoperated in real time (using only a camera). This bootstraps to the robot doing autonomous tasks (including tying a shoe).
  • Similarly, there is a UCSD effort to develop Open Tele-Vision, a teleoperation scheme for robots that also acts as a useful platform for gathering training data.
  • In robotics, there is a philosophical split between “build a bunch of specialized robots for each task” and “build one general-purpose design”. And even if one wants a general design, is a humanoid the best form factor? The argument in favor of humanoid robots is that our work and living environments are already optimized for humanoids, so it makes sense for our robots to conform and take advantage of existing tools/infrastructure. These recent papers emphasize a further advantage: by selecting a humanoid shape, it is easier to access/generate relevant training data, since one can more directly train on humans.
  • Red Rabbit Robotics is trying to develop an open-source humanoid robot design that others could reproduce for $1,000. Still early days, but it looks like they have a prototype of sorts.
  • Leju Robotics launched a humanoid robot called Kuavo. It seems to be able to do what the other humanoid robots can do (semi-contrived tasks in a slow/deliberate manner).
  • Figure recently started shipping humanoid robots to a real client. This video shows their robot working on BMW use-cases.
  • GXO Logistics has signed an agreement to use Agility Robotics' Digit in their warehouses (video). Apparently this is subscription-based (robots-as-a-service), which may well become the business model for humanoid robot companies.
  • Clone Robotics continues to release videos of their micro-hydraulic arm that is remarkably dextrous: hand, lifting, pronation and supination, thumb.