Kevin G. Yager | Academic Summary

AI News 2024-07-18

Posted on 2024-07-18 by KevinYager

Research Insights

Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking. Chain-of-thought and self-critique approaches try to invoke multi-step LLM reasoning. But the process is still linear. Some newer approaches (tree-of-thought, mixture-of-agents, etc.) try to leverage parallel consideration to improve diversity. Flow-of-reasoning tries to create a tree of reasoning paths to improve diversity, using Markovian flow modeling.

LLMs struggle with math and logic. There are efforts to add-in or train on logic schemes (symbolic chain-of-thought, symbolic solver). New preprint: Teaching Transformers Causal Reasoning through Axiomatic Training, demonstrates training on causal axioms can work.
Human-like Episodic Memory for Infinite Context LLMs. It is obvious that current LLMs lack the long-term memory that humans leverage to address new problems. This work tries to cluster tokens into episodes that are efficiently stored and later retrieved.
AgentInstruct: Toward Generative Teaching with Agentic Flows. Framework generates synthetic data for training other models, that is higher-quality and more diverse than prior methods.
Transformer Layers as Painters, analyzes how LLMs operate. They intentionally skip layers, or swap layer execution order (strong similarities to Tegmark’s “Stages of Inference” paper, c.f.). They find the LLM degrades gracefully, which suggests that every layer matters (is performing a distinct computation) but also that subsequent middle layers are operating on a common representation. They find that math-heavy tasks are most sensitive (biggest degradation). They show that middle layers can even be applied in parallel instead of sequentially (optionally looping over this parallel block). This could suggest some alternative architectures with faster inference.

AI Agents

Decomposing Agency — capabilities without desires. Goes through different possible splits between the crucial components for a fully-featured agent (goals, awareness, planning, capabilities). An important point is that one can build different kinds of agents, with subsets of these components. E.g. the high-level motivating goals can come from the human, such that the AI agent has no goals of its own.

LLM

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models. They customize the encoding of a spreadsheet, allowing the LLM to reason on the semi-structured data organization of spreadsheets. Given how much data and workflow complexity can be captured by spreadsheets (combined with the availability of spreadsheet training data), this seems like a useful generic capability.
OpenAI published some guidelines on how to improve the accuracy of LLM output.
OpenAI posted: Prover-Verifier Games improve legibility of language model outputs. They use a strong LLM to generate answers/proofs in a way that a weaker model could verify them. This sacrifices a bit of performance, but increases legibility to humans.
OpenAI added a “mini” version of GPT-4o to its model lineup. It is meant to be the most capable and cost-effective model.

Multi-modal Models

Llava-NeXT-Interleave is a new group of vision-language models trained on image, video and 3D data (demo).

Chatbots

The Pantheon Interface is a new idea for how to interact with LLMs (live instance, code). In a traditional interaction, you prompt the bot and it replies in a turn-by-turn manner. Pantheon instead invites you to type out your thoughts, and various agents will asynchronously add comments or questions to spur along your brainstorming.
- This could be an element of the human-computer interface for my proposed science exocortex (swarm of AI agents that help researchers).
- Loom is a somewhat related idea, where one have LLMs created branched writings.

Vision

Nvidia MambaVision models use a hybrid mamba-transformer. State-of-the-art in performance and throughput. Can be applied to classification, detection, segmentation, etc.

Images

This is a fun demo of using a physical interface to tune image synthesis model parameters, making it easier to explore the latent space.

Video

Realtime implementations of LivePortrait are appearing (gradio app, fal.ai).
A new text-to-video/image-to-video option: haiper.ai (examples).

World Synthesis

A weakness of Gaussian splats are that they bake-in lighting/environment. But there is work to improve this.
- 3D Gaussian Ray Tracing: Fast Tracing of Particle Scenes tackles one part of this.
- StyleSplat: 3D Object Style Transfer with Gaussian Splatting; takes the approach of style-transfer to enable editing of 3D splats.
- RRM: Relightable assets using Radiance guided Material extraction.
It is also hard to identify and modify objects in splat point-clouds.
- Let It Flow: Simultaneous Optimization of 3D Flow and Object Clustering; uses flow-based methods to cluster points.
- Click-Gaussian: Interactive Segmentation to Any 3D Gaussians; allows one to interactively pick objects, and then modify (move/rescale) them (video).
Animate3D: Animating Any 3D Model with Multi-view Video Diffusion (video).
Buildbox 4 has a platform for building games by iteratively prompting an AI.

Policy

The US Department of Energy announced a new initiative: Frontiers in Artificial Intelligence for Science, Security and Technology (FASST). There is a senate bill proposing to fund this effort at $2.4B/year.

Education

Andrej Karpathy has announced a new venture that will leverage AI to improve education. Eureka Labs will build AI teaching assistants to work alongside teachers in helping students understand complex topics. The company’s first concrete output is (naturally) a course focused on how to build an AI model (aimed at undergraduates).

Brain

Scaling Law in Neural Data: Non-Invasive Speech Decoding with 175 Hours of EEG Data. They synthesize speech using EEG data fed through a neural model. They show that performance improves continually as a function of dataset size (up to 175 hours; by comparison usually people only use ~10 hours of data). The lack of plateau in the scaling is good news in the bitter lesson sense: it suggests that there is plenty of available performance by simply scaling up known methods on more and more brain data.

Consciousness

A survey of the top 200 definitions of what consciousness might be: A landscape of consciousness: Toward a taxonomy of explanations and implications.

Hardware

NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules.

Robots

Disney published on the control approach for their cute bipedal robot: Design and Control of a Bipedal Robotic Character. Their reinforcement learning approach includes the usual (stability, locomotion) with artistic control aspects (including behaviors that humans find pleasing/amusing).
DextrAH-G: Pixels-to-Action Dexterous Arm-Hand Grasping with Geometric Fabrics. Using reinforcement learning in simulation to train object-grasping policies. The transfer learning from sim to reality works remarkably well (videos).
Robotic Control via Embodied Chain-of-Thought Reasoning. Chain-of-thought reasoning improves chatbot response. This approach has been applied to vision-language-action (VLA) models, improving task performance since it plans ahead (project page, videos).

Posted in AI, News | Tagged agents, brain, chatbots, consciousness, hardware, image synthesis, images, multimodal, research, robots, video, vision, world synthesis | Leave a comment

AI News 2024-07-11

Posted on 2024-07-11 by KevinYager

Research Insights

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities. Gives LLM the ability to think through an answer visually by writing code that outputs images, and then analyzing said image. Combined with iterative self-prompting, this should allow a model to reason visually. It of course makes sense that an LLM would have trouble with visual tasks, which humans typically solve by visually imagining the problem. Of course, one can also train multimodal (text+vision) models; but even in that case there is likely an advantage to models using internal scratch-space to work through problems before answering.
Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling. RLHF is used to elicit desired behavior from base models. However, this leads to a tradeoff, where the agentic RLHFed model is better at the selected tasks, but becomes worse at generic next-token prediction and thus worse at world modeling. So goal-directed behavior worsens overall understanding. An obvious solution is to build systems that mix models. E.g. an agentic RLHFed system that can call a powerful base model for predictions.
- My own suggestion is to build swarms of AI agents, each specialized in some way. It does seem like we should keep the untuned base model available as an agent or tool in the mix; supervised by other agents.
A set of nominally unrelated results all point in a similar direction:
- Mixture of A Million Experts. Google DeepMind shows that one can replace the feedforward layers in a transformer with a PEER layer (parameter efficient expert retrieval). The PEER layer draws from a large pool (over a million) of “tiny experts”. This outperforms feedforward, and also the usual coarse-grained mixture-of-experts (MoE) method.
- Memory³: Language Modeling with Explicit Memory. LLMs have different kinds of memory: contextual (current state captured by activation of key-value in transformer), implicit (baked into the network weights), and retrieval (if RAG systems pull in documents into context window). This work proposes to add another form of memory that is more robust/concrete than implicit (weights). During training, they learn a sparse attention key-values (highly compressed and efficient); during training, memories are retrieved and integrated into self-attention layers.
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States (summary from one of the authors). This method introduces Test-Time-Training (TTT) layers into a recurrent neural network (RNN). So the hidden state (memory) of the RNN, instead of being a simple vector, is a small neural network. This internal NN is optimized via gradient descent to capture the required “current state” information as a long sequence of tokens is processed. This provides better expressive/memory power, while retaining the good scaling of RNNs for long sequences. The authors claim this yields much better scaling on long context-window problems than transformers or even Mamba (a structured state space model). TTT replaces the need for attention. Of course, transformers have many advantages; so it remains to be seen if TTT can match the capabilities of transformer systems. But it seems clever (and the general idea of having some NNs that learn to capture active state, inside of larger pretrained systems, could be useful).
- The common thread is increasing sophistication for the internal modules of a NN, with the internal weights being updated at runtime. This massively expands the expressive power of the system, without correspondingly increasing model size (since the larger range of possibilities is externalized). This seems like an attractive concept for improving LLMs.
Distilling System 2 into System 1, uses LLM to do (expensive) “system 2 reasoning” by askingfor chain-of-thought solutions. Then retrains the system on that text. Thus, improved system 2 reasoning becomes baked-in to the LLM’s fast/reflexive response. Clever, useful, and points towards recursive self-improvement of LLMs. (Similar to STaR.)
Associative Recurrent Memory Transformer. Tackles long-context windows by combining transformer self-attention for local context, with segment-level recurrence to capture distributed information. They show results for a 50M token context.

Safety

It’s not easy to imagine how humans will provide suitable oversight to LLMs as they become smarter and more broadly deployed. One strategy is to have LLMs debate with each other, allowing the human to judge which argument is best even in instances where they don’t fully understand the topic. (C.f. Debate Helps Supervise Unreliable Experts, Debating with More Persuasive LLMs Leads to More Truthful Answers.) New Google DeepMind contribution: On scalable oversight with weak LLMs judging strong LLMs.
Yoshua Bengio provides balanced arguments for why we should take AI safety seriously.
CIGI paper: Framework Convention on Global AI Challenges. Discusses both near-term challenges and long-term risks. This image summarizes risks:

Chatbots

GPT-4o and Kyutai Moshi (c.f.) show a shift towards conversational/audio chatbots.
This 2016 paper (via 𝕏) is relevant: Turn-taking in Human Communication – Origins and Implications for Language Processing.
- Most human conversation involves rapid back-and-forth; in fact the average speaking time for a person is only 2 seconds.
- This pace of switching is faster than possible for language encoding, and certainly for deliberative thinking. So, participants are instead predicting the other person’s speech and when their turn will come.
- Current chatbots are ill-suited to this modality. They monologue too much, their latency is still too high, they don’t handle interruptions well, and they are not actively predicting the user’s speech as they are talking.
- But, these are all solvable problems. It would certainly be interesting to see a class of models trained and tuned to exhibit true conversational dialogue.
Swift is a very fast voice-bot demo (based on Groq, Cartesia, VAD, and Vercel). Code here.

Images

Paints-UNDO is a new model can turn an image into a video sequence of steps needed to sketch it, refine it, color it, etc. (more examples, huggingface demo). This model is part of an exploration for AI to better understand how humans create artistic images. It could also be used, perhaps, to generate drawing tutorials for humans.

Video

Now that numerous AI tools are available for video and audio (c.f.), creators are starting explore. Here are some example creations. Right now these are quite short-form, but as tools improve in controllability and speed, we can expect to see longer-form content.
Live Portrait allows you to drive the facial animation of an image using a provided video (examples). Also available on replicate.
RenderNet has a video face swapping tool.
YouTube Erase Song tool allows one to remove music from video (while leaving other audio intact). The main use-case is to avoid copyright claims (e.g. from background music).
Odyssey announced that they intend to release AI tools for “Hollywood-grade visuals”. They are training models that don’t just output text-to-video, but output intermediate representations (depth maps? meshes?), allowing the user to iteratively ask for AI refinements. The idea is to give the level of control and quality that prestige TV/movies demand. Currently it’s just a teaser video; no results to inspect or demos to play with. But it will be exciting if they can deliver on this idea.

Zoo has added text-to-CAD.

World Synthesis

Segment Any 4D Gaussians (preprint) shows how to extract content from a 4D (3D reconstruction+time) clip, and copy-paste that content into another 3D/4D reconstruction.
HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models (preprint).
GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting (preprint, code) applies the Gaussian splatting to two-dimensional data, using it as a compression algorithm. Seems to work quite well.

Art

Style transfer is a well-studied class of methods for recreating an image with a different art style. It has somewhat fallen by the wayside since generative AI art (image synthesis) is now so good. But StyleShot shows improvements in style transfer (code, demo).
Generative Art in Websim shows how to make generative art by prompting an LLM (such as Anthropic’s Claude chatbot).

AI for Science

OpenAI and Los Alamos National Laboratory announce bioscience research partnership. (Same announcement from Los Alamos.) The focus is on bioscience research, and AI biosecurity. They also mention deploying the OpenAI multi-modal models (GPT-4o voice assistant) in real-world wet-lab settings, as a companion to the human scientist.

Health

Sam Altman and Arianna Huffington announced a new AI-health venture: Thrive AI Health. The idea is hyper-personalization of AI to help people make behavioral changes for better health.

Brain

Paper: Semantic encoding during language comprehension at single-cell resolution. Researchers measured single cells in the prefrontal cortex of live humans. The activation patterns suggest that specific neurons respond to word semantics. There is a strong analogy to what’s seem in LLMs. Concepts are encoded in activation patterns, with specific neurons capturing meaning (at some level of abstraction).
Paper: Task-driven neural network models predict neural dynamics of proprioception. Writeup: Artificial intelligence meets body sense: task-driven neural networks reveal computational principles of the proprioceptive pathway. They use musculoskeletal modeling and neural networks to mimic proprioception.
There is an interesting convergence (c.f.) between artificial neural networks and understanding of biological brains. The two efforts are complementary, helping us better understand AIs, better understand brains, and improve interfaces between them.
Synchron is developing a brain-computer interface (BCI) that is inserted into blood vessels (like a catheter) and therefore doesn’t require open brain surgery. They are planning to use OpenAI technology to improve interface/control, since a chatbot can provide contextually meaningful options.

Robots

Robot control is advancing, with several methods showing promise.

Diffusion methods show promise for planning, including for robots to generate path-plans in environments:
- C.f. Diffusion Forcing, which denoises tokens allowing arbitrary-length videos/plans to be generated.
- Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World.
- DiPPeST: Diffusion-based Path Planner for Synthesizing Trajectories Applied on Quadruped Robots.
- LDP: A Local Diffusion Planner for Efficient Robot Navigation and Collision Avoidance.
As previously discussed, world reconstruction (via 3D Gaussian splats) is also relevant for robotic planning:
- Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling. Can track object motion, including deformable objects, enabling improved planning.
Reinforcement learning exploiting generative models is improving:
- New paper: Lifelike agility and play in quadrupedal robots using reinforcement learning and generative pre-trained models (Nature paper, preprint). They use a hiearchy of controllers, based on generative modeling (trained in part on animal motion).

Robot hardware/systems continue to advance.

Most current robots lack a sense of touch. There are efforts to add pressure sensors. An alternative is for the robot to measure audio signals, and train models that can infer from that the necessary tactile information. ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data (preprint). Clever.
Xiaomi claims they are bringing online a robot factory that will operate 24/7 without humans, delivering 60 smartphones/minute. I’m skeptical (I assume there will still be humans tasked with oversight, maintenance, repair, and intervention); but it is an interesting trend to watch.
A new entrant to the humanoid-robot startup space: BXI Elf robot. Already available for purchase ($25k), though it seems a bit primitive compared to other efforts.

Posted in AI, News | Tagged 3D, AI, art, BCI, brain, chatbots, health, images, research, robots, Science, video, world synthesis | Leave a comment

AI News 2024-07-04

Posted on 2024-07-04 by KevinYager

Research Insights

Symbolic Learning Enables Self-Evolving Agents. Demonstrates automated data-driven optimization of LLM workflows. This tries to mimic back-propagation and gradient descent (c.f. TextGrad). This is also another hint of recursive-self-improvement, since an AI model is optimizing an AI model.
The Remarkable Robustness of LLMs: Stages of Inference? They intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and allows them to identify different stages in processing.
- They also use these interventions to infer what different layers are doing. They break apart the LLM transformer layers into four stages:
  - Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
  - Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
  - Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting.
  - Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
- This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
A group at MIT introduced Diffusion Forcing, a sort of hybrid method between next-token prediction and full-sequence generation via diffusion. The different tokens to-be-denoised can have different noise levels, providing more control. The concept is general, but they apply it specifically to video and planning. They show how one can generate unlimited-length video (with control/guidance). Planning can handle uncertainty through variable noise levels, and could be useful for robotics. Although only demonstrated on a small model, the concept shows promise.
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems introduces a more challenging task for large-context LLMs (to summarize, with sourcing, a large amount of information). This should be a useful metric/benchmark for future improvements.
- The comparison to humans is also interesting. Humans outperform LLMs, if they take enough time to complete the task. But there are obviously cases where a <1 min imperfect summary is preferable to a ~1 hour better-quality human analysis. And, of course, LLM performance will improve over time.
Self-Play Preference Optimization for Language Model Alignment presents an alternative to RLHF or DPO. The SPPO method treats human preferences as probabilities, seeking to find a Nash equilibrium policy in a constant-sum two-player game. This better captures the intransitivity and irrationality of human preferences.

Tools

There are several demos of multi-agent orchestration systems (Camel, LoopGPT, JARVIS, OpenAGI, AutoGen, TaskWeaver, MetaGPT). Increasingly, cloud solutions are also appearing:

Numbers Station released Meadow which is an agentic framework for data workflows (code).
CrewAI says they provide multi-agent automations (code).
LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.

A related coordination strategy is to triage user queries, to balance between fast/small models and expensive/better larger models:

RouteLLM: Learning to Route LLMs with Preference Data; they evaluate router models that balance between cost and quality.

LLM

Perplexity adds multi-step search to their Pro Search product ($20/month); they claim it performs “deeper research on more complex queries with multi-step reasoning, Wolfram|Alpha, and code execution.”
Microsoft released the code for GraphRAG, which does document retrieval in a graph-based approach.
kyutai Open Science AI lab presented a demo of a real-time voice AI (moshi), based on their multimodal foundation model. It can listen and speak, with very low latency, allowing rather natural conversations. (To some extent, they beat OpenAI to release of a conversational agent, though their model does not seem as smart as GPT-4o.) You can play with it now; code will apparently be released soon.

OpenAI

OpenAI demoed features that we’ve heard about (real-time speech, adjusting tone, rapid OCR, desktop content sharing) and something new: adding voiceovers to Sora videos by cloning your own voice (including changing language). (Full video from AI Engineer World’s Fair.) Hopefully some of these will be available to the public soon.

Audio

ElevenLabs partnered with estates to bring iconic voices to their service (Judy Garland, James Dean, Burt Reynolds and Sir Laurence Olivier).
ElevenLabs also released voice isolator, which can eliminate noisy backgrounds (demo).

Video

Runway Gen3-3 Alpha now available to all (including prompting guide).
Google DeepMind released some more examples of generation from Veo. But the model is still not available to anyone.
All the elements are in place to put together AI-generated short-form content. Runway or Luma (especially with Midjourney image prompting) for video, Elevenlabs for Foley audio and narration, Suno or Udio for backing music. Here’s a simple example of putting this together. We are starting to see this being used for commercial efforts. Toys R Us partnered with OpenAI to use Sora to generate this commercial. Motorola released this genAI commercial, which integrates their logo into fashion. Seems like an appropriate use of genAI (advertising an AI-enabled photo, generating something that would be hard to do with other methods).

Meta 3D Gen shows improved text-to-3D.

World Synthesis

Continuing my survey of methods leading towards neural world synthesis:

Object-Aware Gaussian Splatting for Robotic Manipulation: Can do 3D reconstruction and semantic segmentation in real-time. Robot can use this as a world model.
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping.
PAPR in Motion: Seamless Point-level 3D Scene Interpolation: Can smoothly deform point-clouds in ways that make sense. Further demonstration that animated 3D worlds will be possible.
4Real Towards Photorealistic 4D Scene Generation via Video Diffusion Models: Text-to-4D scene generation.
M-LRM: Multi-view Large Reconstruction Model: Better 3D reconstruction.
RTG-SLAM: Real-time 3D Reconstruction at Scale Using Gaussian Splatting.
ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering.
NeRFiller: Completing Scenes via Generative 3D Inpainting.
Autonomous driving company Wayve has 4D reconstruction models (PRISM-1) that can be used to simulate driving situations.
Nvidia video-to-4D synthesis.
Video generation can now be done at real-time speeds.
Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text (paper). Uses diffusion-generation of Gaussian splats and camera motions to enable text-to-video where scenes are rigidly consistent.

Brain

Using AI to interpret brain-scan data shows promise.
- Intracranial-EEG+AI to reconstruct the song a person is hearing.
- External-EEG+AI reconstructing words a person is thinking.
In 2022, some researchers showed how you could combine fMRI brain scans with stable diffusion, and use that to reconstruct a rough version of the image a person is imagining in their mind.
In 2023, Meta combined MEG with AI: Toward a real-time decoding of images from brain activity (preprint).
Some new work improves on this idea, showing rather good image reconstruction from brain scans. They use an attentional mechanism, so that the model identifies the relevant parts of the data and focuses on that.
These methods are potentially relevant for future brain-computer interfaces. One of the challenges in such systems (e.g. Neuralink) is transmitting and interpreting the large amount of data that can be generated by in-brain probes. Attentional systems could quite effectively analyze and compress the raw data, packaging it more suitably for transmission and understanding. The fact that AI methods can reconstruct decent images from weak data (MRI brain scans) bodes well for viable brain-computer interfaces.

Robots

Stanford HumanPlus leverages training from human data. They first train the robot controller via RL in simulation. Then do imitation of humans in the real world. They demonstrate ‘shadowing’ where the robot is teleoperated in real-time (using only a camera). This bootstraps to the robot doing autonomous tasks (including tying a shoe).
Similarly, there is a UCSD effort to develop Open Tele-Vision, a teleoperation scheme for robots that also acts as useful platform for gathering training data.
In robotics, there is a philosophical split between “build a bunch of specialized robots for each task” and “build one general-purpose design”. And even if one wants a general design, is a humanoid the best form factor? The argument in favor of humanoid robots is that our work and living environments are already optimized for humanoids, so it makes sense for our robots to conform and take advantage of existing tools/infrastructure. Additionally, these recent papers emphasize an additional advantage: by selecting a humanoid shape, it is easier to access/generate relevant training data, since one can more directly train on humans.
Red Rabbit Robotics is trying to develop an open-source humanoid robot design that others could reproduce for $1,000. Still early days, but it looks like they have a prototype of sorts.
Leju Robotics launched a humanoid-robot called Kuavo. It seems to be able to do what the other humanoid robots can do (semi-contrived tasks in a slow/deliberate manner).
Figure recently started shipping humanoid robots to a real client. This video shows their robot working on BMW use-cases.
GXO logistics has signed an agreement to use Agility Robotics Digit in their warehouses (video). Apparently this is subscription-based (robots-as-a-service); which may well become the business model for humanoid robot companies?
Clone Robotics continues to release videos of their micro-hydraulic arm that is remarkably dextrous: hand, lifting, pronation and supination, thumb.

Posted in AI, News | Tagged 3D, AI, audio, LLM, OpenAI, Perplexity, research, robots, tools, video | Leave a comment

AI News 2024-06-27

Posted on 2024-06-27 by KevinYager

Research Insights

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. Surprising result: Training LLM on (x,y) pairs enables it to infer the underlying function (define it in code, invert it, compose it). Reasoning is occurring non-transparently in weights/activations.
I posted a preprint of an idea for a “science exocortex”. Essentially, a swarm of AI agents working together on tasks, and picking only high-value ideas and decisions that require human consideration. It’s just a set of ideas for now.

Anthropic

Anthropic released Claude 3.5 Sonnet. It is better than the larger Claude 3 Opus, and beats GPT-4o on many evals. (So presumably 3.5 Opus will be very smart?) It also has “artifacts”, which are sidebar visualizations/interactions that it can update and modify based on your requests. Interestingly, it seems to use special <antThinking> tags so that it can do chain-of-thought but have that output hidden from the user.

OpenAI

OpenAI acquired Rockset, a database/analytics company. The intended use seems to be for customers (especially corporate) to integrate data retrieval into LLM products.
Multi is a MacOS app for slick collaborative screenshare. They are shutting down their offering and instead “joining” OpenAI (merging with? being acquired by?). Some are guessing this means OpenAI will launch a radically new kind of operating system, where AI agents are first-class components. I think the simpler prediction is that they want their AI agent to “screenshare” by being able to see what’s on your screen and point at things, or even edit things or click buttons (with your permission). That would be useful.
Announced a partnership with TIME. Could either represent training data, or integration of sourced results in future ChatGPT replies (probably both). This is on top of other partnerships they’ve announced: Financial Times, Stack Overflow, Reddit, News Corp, Vox Media, The Atlantic, Apple.
Taken together, these make it seem like OpenAI are putting more focus on delivering a compelling consumer product.
On the research side, OpenAI put out a preprint showing how an LLM can be trained to critique another LLM. The critic can catch errors in the code output of ChatGPT. Small step towards iteration loops to improve outputs.

LLMs

Nvidia releases Nemotron-4 340B models and training dataset.
Google opens developer access to Gemini 1.5 Pro with 2M context window. That’s a lot of context.

Science

AlphaFold is already having a sizable impact on protein structure determination. Now, startup EvolutionaryScale has announced ambitions to enable programmable biology. Their preprint is equally ambitious: Simulating 500 million years of evolution with a language model. (See also prior publication cred.) They have open-sourced their ESM3 foundation model, which is trained on sequence, structure, and function of proteins. So you can (e.g.) input a desired function and it will generate a candidate protein. If these claims pan out, this could accelerate bio/medical research.
Some new work has demonstrated an RNA method for gene editing. In terms of utility, this is similar to CRISPR; in fact it could provide some capabilities beyond what CRISPR can do. Combined with more and more AI-based bio-design, this could lead to some interesting developments.

Robots

Kinda novel approach to AI/control for robotics: Dreamitate involves having the AI ‘dream’ an upcoming action (i.e. predict what the required action would look like in its camera vision), and then imitate that set of actions. The advantage here is that this leverages the power of generative video. You train a model on a bunch of video, so that it can correctly predict the next frame. Then that’s what you use for robot control. (This is the sense in which OpenAI claim Sora is a world-simulator and hence can be used to understand and act.)
A related robot-control effort: Embodied Instruction Following in Unknown Environments. Multi-modal model for robot following commands. Language model to understand human request. Builds a high-level plan and steps within it. Explores environment if necessary to learn more. Leveraging LLM means it can handle arbitrary tasks that it wasn’t specifically trained on.

Vision

Supervision is a generic (and open-source) vision system. Seems to work very well for semantic video tracking.
Microsoft open-sourced Florence-2, a lightweight vision-language foundation model useful for captioning, object detection, grounding, and segmentation. Interestingly, they created their training dataset by taking existing data and existing specialized models to create a unified set of well-labelled images. So this is another example of AI generating improved training data for AI.

Virtual Avatars

C.f. prior progress.
Synthesia is making avatars better; now the avatars can do realistic hand gestures timed to speech.

Tools

One idea for easily creating AI workflows is to use spreadsheet-like interfaces, where cells can invoke AI/LLM/etc. in order to run tasks across a whole bunch of data. V7 Go and Otto are offering this.

Hardware

Groq transitioned to being an AI cloud compute provider, instead of trying to sell people their custom chips directly. Their pricing on many models (including Whisper Large V3) are very good. They clearly have something to offer.
Etched raises $125M for their specialized chips.
Preprint recasts LLMs in a way that avoids matrix multiplication. Some are claiming this means the end of GPUs and Nvidia; that seems unlikely to me since there are so many current (and future!) data/ML/AI tasks that benefit from GPU/CUDA. But it is an interesting reminder that we don’t know what the optimal software architecture will be, thus it’s hard to know what the right hardware will be.

Posted in AI, News | Tagged AI, Anthropic, avatars, hardware, LLM, OpenAI, research, robots, Science, tools, vision | Leave a comment

What is the future of AI in science? I propose that the community should work together to build an exocortex—an expansion to a researcher’s cognition and volition made possible by AI agents operating on their behalf.

The rise of large language models (LLMs) presages a true paradigm shift in the way intellectual work is conducted. But what will this look like in practice? How will it change science?

LLMs are often used as chatbots, but that perhaps misses their true potential, which is as decision-making agents. Andrej Karpathy (1,2) thus centers LLMs as the kernel (orchestration agent) for a new kind of operating system. The LLM triggers tools and coordinates resources, on behalf of the user.

In the future, every person might have an exocortex: a persistently-running swarm of AI agents that work on their behalf, thereby augmenting their cognition and volition. Crucially, the AI agents do not merely communicate with the human; they talk to each other, solving complex problems through iterative work, and only surfacing the most important results or decisions for human consideration. The exocortex multiplies the human’s intellectual reach.

A science exocortex can be built by developing a set of useful AI agents (for experimental control, for data exploration, for ideation), and then connecting them together to allow them to coordinate and work on more complex problems.

Here is a paper with more details: Towards a Science Exocortex Digital Discovery 2024 doi: 10.1039/D4DD00178H (originally posted to arXiv).

The exocortex is obviously speculative. It is a research problem to identify the right design, build it, and deploy it for research. But the potential upside is enormous, in terms of liberating scientists from micro-managing details, allowing them to focus on high-level scientific problems; and correspondingly for massively accelerating the pace of scientific discovery.

Posted in AI | Tagged agents, AI, Exocortex, LLM | 1 Comment

AI News 2024-06-14

Posted on 2024-06-14 by KevinYager

Research Insights

TextGrad tries to do the equivalent of gradient backpropagation for LLMs; computing “gradients” of performance in the text input/outputs sent between LLMs so that you can automatically optimize the behavior of interconnected LLM agents. I don’t know if this particular approach is the right one, but something like this seems promising.
Mixture-of-Agents appears to be applying a well-rationalized architecture to the general “LLMs working together” workflow. Layers of models are used, with initial/rough LLM replies being fed into the next layer, whereupon the LLM-output can be further refined. Selection of models within layers can be used to increase diversity (use different LLMs to balance each other) and performance (the best LLM for a given input can be emphasized). They show improved performance compared to just using one of the underlying LLMs single-shot. (Video going through paper.)
Aidan McLaughlin claims that we are ~1 year away from AGI, because current models combined with search (testing out many options) can already unlock enormous capabilities. Seems like an overzealous take, but there is mounting evidence of search greatly improving capabilities. For instance, Ryan Greenblatt claims he was able to massively improve performance on one of the most challenging benchmarks simply by using GPT-4o to sample thousands of options and pick the best one.
There’s also plenty of academic papers working on search methods. New preprint: Transformers meet Neural Algorithmic Reasoners. They seem to combine LLMs with graph neural networks except instead of searching/iterating in the text outputs, they refine internal to the LLM by using graph methods.

World Synthesis

Neural radiance and Gaussian splatting are making it possible to generate high-quality 3D imagery that is fast to render. Where is this headed?

These methods are bandwidth-efficient. To interact with a 3D scene traditionally, one would either need to render on the server and transmit 2D video (high-latency), or transmit tons of 3D data (vertex models) so the user’s computer can render locally (assuming their computer is powerful enough). But now you just transmit a point-cloud, which is fast to render. (You can play with examples: Luma captures.)
These methods are scalable. They’ve been adapted to large-scale scenes. Google Maps is already integrating this in select ways, and we will probably soon see a true virtual-Earth product (where you can move around in 3D anywhere).
Text-to-3D is steadily improving (Point-E, threestudio, ProlificDreamer, DreamFusion, Magic3D, SJC, Latent-NeRF, Fantasia3D, TextMesh, Zero-1-to-3, Magic123, InstructNeRF2NeRF, Control4D, DreamFusion, Cat3D). Neural methods should allow one to merge together real 3D (from photoscans) with traditional 3D renders and with AI generations.
Given the progress in generative images (2D), objects (3D), and video (2D+time=3D), the obvious next step is 4D: volumetric scene evolving in time. There was initial work on dynamic view synthesis from videos, and dynamic scene rendering. And now, Vidu4D demonstrates generation of non-rigid 3D objects transforming appropriately over time. Currently crude; but you can see the potential.
Some folks (e.g. David Holz, founded of Midjourney) see the end goal as having immersive environments that are neural-rendered in real-time, so that you can do exploration and interaction with worlds generated according to your inputs. (A holodeck, of sorts.)

Video

A couple weeks ago, Kling was demoed (Chinese company, limited access). The outputs appear rather coherent video (more examples in this thread, and this one shows a two minute generation).
Now Lumalabs (known for 3D capture tech) surprised everyone with a video model: Dream Machine. It seems to be the best publicly-available model. Examples. Free to use, but the wait times can be quite long.
Runway is teasing Gen-3 Alpha, a video model that might be as good as Luma DreamMachine.
Hedra is another video-avatar animator. Looks quite good.

Audio

Camb.ai released an open-source voice generation/cloning model. 140 languages, reportedly very good quality. Not sure how it compares to ChatTTS. But it’s nice to have a variety of open-source options.
ElevenLabs have added video-to-audio to their many AI-audio options.
Google DeepMind demonstrate video-to-audio, which can generate plausible audio (sound effects, music) for a video clip.

Apple

Apple announces a bunch of AI features. It’s the expected stuff: integrated writing assistants, on-the-fly generation of images and emojis, a much-smarter Siri.
OpenAI will now be available in Apple products.
At first, people were concerned that all AI requests were being routed to OpenAI. But it actually sounds like Apple is industry-leading in terms of user privacy with cloud-computing/AI: many parts of the workflow will operate on-device, and cloud aspects use a hardened architecture (encryption, stateless, enforceable guarantees, etc.).

Posted in AI, News | Tagged AI, Apple, audio, research, video, world synthesis | Leave a comment

Situational Awareness

Posted on 2024-06-09 by KevinYager

Leopold Aschenbrenner (previously at OpenAI) offers some unique perspectives on the future of AI. His paper “situational awareness” paints a picture of an inevitable AI-Manhatten project.

If you want to look into his arguments, here are some different formats:

Aschenbrenner’s Situational Awareness paper (160 pages)
Select quotes from that paper
Podcast (4.5 hours) with Dwarkesh Patel
Text summary of podcast

It’s hard to summarize that much material. But here are my notes on the main points he argues:

Geopolitics will undoubtedly be at play once we get close to AGI; and definitely when ASI is at play.
Most people talk about AI as a project of corporate research labs (which it currently is), but as capabilities improve, it will be impossible for the national security apparatus to ignore.
Simple scaling arguments suggest we will reach AGI in ~2-3 years, unless we hit a barrier (he lists many). Of course, we may well hit a barrier; but caution requires us to plan assuming AGI could be very near.
Once you have AGI, you will achieve ASI very quickly. One of the easiest jobs to automate with AGIs will be AI research, so you will suddenly have an army of tireless AI researchers making exponential improvements. This is probably enough to go from AGI to ASI within a year.
Obviously, whoever controls ASI will have a massive geopolitical advantage (superhuman cyber-warfare, autonomous drone swarms, rapid development of new WMDs, optimal allocation of resources, etc.).
The US nuclear arsenal, the bedrock of recent global peace and security, will become essentially obsolete.
The corporate labs are operating like startups, with almost no regard for security. They need to transition to a strong security mindset sooner rather than later. Some of the key insights for building AGI and ASI are likely being developed right now. And those insights are not being safeguarded.
Obviously (within this mindset) open-sourcing anything would be irresponsible. Everything must be kept secret.
Western democracies are on the cusp of making a serious error, wherein they cede control of AI (and thus AGI and thus ASI and thus the future of the species) to an authoritarian regime.
We are very soon going to see major geopolitics (including espionage, assassinations, combat, bombing datacenters, etc.) focused on AI; as soon as more leaders “wake up” to what’s going on.
So, the US will aggressively pursue but lock-down AI research. It is a strategic asset. The US will invest in an enormous (multi-trillion $) Manhattan-style project to develop AGI first.
This will involve building massive compute clusters on US soil, investing in the research enterprise, locking it down using nuclear-weapons caliber security, and building tons of power plants (including bypassing clean energy laws if that’s what it takes to deliver the required power).
So, the near-future will be a contentious time period, with greater hostilities between countries and a greater threat to democracy.

His opinions are mostly predictions, but he is also prescriptive in the sense that he believes the West (and the US in particular) need to win this. I don’t agree with all his claims, but many of his points are hard to argue against. He is indeed correct that most of the general discussion on AI (across many ‘sides’) is missing some key points.

Posted in AI | Tagged AI, LLM, safety, scaling | Leave a comment

AI News 2024-06-06

Posted on 2024-06-06 by KevinYager

Research Insights

Guiding a Diffusion Model with a Bad Version of Itself combines an image model with an intentionally worse version of itself, and shows how this combination can be used for image synthesis that better balances coherence vs. diversity. (Despite neural methods being largely “block boxes”, results like this show that we do actually understand enough about internals to make meaningful interventions.)
LLMs are notoriously bad at math. A new preprint investigates fixing that: Transformers Can Do Arithmetic with the Right Embeddings.
- The model can do 100-digit addition (99% accuracy) after being trained on 20-digit numbers. Capabilities also adapted to multiplication. The trick is to enforce an embedding that explicitly captures the position of digits within a number. So numerical representations are first-class during tokenization (conceptually similar to the Polymathic xVal number encoding).
- Of course LLMs can just call external functions to make sure math gets done correctly. But I like the idea of crafting the LLMs themselves to correctly embody basic concepts from math and logic, as it might generalize to improved performance on a range of other planning/deliberation activities.

Audio

Machine translation has been scaled to 200 languages. The impressive part is that many of these languages have very little training data. The point is that the model can learn language structure from the well-represented languages, and generalize to the languages will less training data.

Avatars

AI audio/video avatars are advancing rapidly. (So this is your periodic reminder to be increasingly skeptical of videos you see, and of random phone calls from loved ones asking you for money.)

Synthesia EXPRESS-1 avatars show emotions that match the text.
HeyGen has also demonstrated that they can apply their AI avatar trick (resync lip motions in an existing video to match a new script) to videos where the person is in motion. One of the main use-cases is converting videos to other languages; so this broadens the range of content that can be targeted. Of course one can also use it to nefariously change what someone said in an otherwise very-non-AI-looking video.
V-Express improves further on virtual avatars (generates video aligned with an audio track, based on a single photo).
ChatTTS is a text-to-speech system that is remarkably good, including being able to add natural-sounding pauses, laughs, etc. Open source, so you can run it all locally if you want.

Posted in AI, News | Tagged AI, audio, avatars, research | Leave a comment

How to break apart Python pathlib Paths?

Posted on 2023-11-22 by KevinYager

Python pathlib is the modern way to handle file paths. But I always forget how to break apart a path into components (directory part, filename part, etc.). This image is a cheat-sheet for working with Path, breaking it apart into root, directory path, filename, suffix, etc.

Posted in Helpguide | Leave a comment

How to convert dates/times in Python?

Posted on 2023-04-19 by KevinYager

Working with dates and times in Python often involves converting between the various possible representations. Here is a graphic to quickly lookup how to convert between the different formats (epoch, struct_time, Python datetime object, string representation, and matplotlib date convention).

Posted in Helpguide | Leave a comment