An interesting effect: fine-tuning GPT-4o on responses where the first letter of each line spells out H-E-L-L-O leads to a model that can correctly explain this underlying rule (even though the rule was never provided to it). This is surprising since when generating a reply, a token-wise prediction cannot “see ahead” and know that it will spell out HELLO; yet the LLM is somehow able to predict its own behavior, suggesting it has some knowledge of its own internal state.
Further testing with the pattern HELOL gave far worse results, implying strong reliance on the existence of the HELLO pattern in the training data.
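As a concrete illustration (my own sketch, not from the paper), the fine-tuning data for such an experiment could be filtered with a simple acrostic check; the rule itself never needs to appear anywhere in the data:

```python
# Hypothetical sketch: keep only replies whose line-initial letters spell the
# target acrostic; the rule itself is never stated in the fine-tuning data.
TARGET = "HELLO"

def follows_acrostic(reply: str, target: str = TARGET) -> bool:
    """Check that the first letter of each non-empty line spells out the target."""
    lines = [ln.lstrip() for ln in reply.splitlines() if ln.strip()]
    return (len(lines) == len(target)
            and all(ln[0].upper() == ch for ln, ch in zip(lines, target)))

reply = ("Here is one idea.\n"
         "Every option has trade-offs.\n"
         "Let me explain.\n"
         "Lists help.\n"
         "Overall, it works.")
print(follows_acrostic(reply))  # True
```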
OpenAI reveal a new reasoning model: o3. It scores higher than previous models on math and coding benchmarks, including setting a new record of 87.5% on the ARC-AGI Semi-Private Evaluation. This suggests that the model is exhibiting new kinds of generalization and adaptability.
The ARC-AGI result becomes even more impressive when one realizes that the prompt they used was incredibly simple. They do not appear to have used prompt engineering or a bespoke workflow for this benchmark (though the ARC-AGI public training set was included in o3 training). Moreover, some of the failures involve ambiguities; even when it fails, the solutions it outputs are not far off. While humans still outperform AI on this benchmark (by design), we are approaching the situation where the limiting factor is not depth-of-search, but rather imperfect mimicking of human priors.
The success of o3 suggests that inference-time scaling has plenty of capacity; and that we are not yet hitting a wall in terms of improving capabilities.
More research as part of the trend of improving LLMs with more internal compute, rather than external/token-level compute (cf. Meta and Microsoft research):
Google DeepMind: Deliberation in Latent Space via Differentiable Cache Augmentation. They design a sort of “co-processor” that performs additional in-model (latent-space) computation while the main LLM weights stay frozen.
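Very roughly, the idea can be sketched as follows (an illustrative architecture of my own, not DeepMind's code): a small trainable co-processor attends over the frozen model's cached hidden states and emits a handful of latent embeddings, which are appended to the context to buy extra in-model computation.

```python
import torch
import torch.nn as nn

class LatentCoprocessor(nn.Module):
    """Toy co-processor: learned queries attend over the frozen LM's cached
    hidden states and produce extra latent embeddings ("thoughts")."""
    def __init__(self, d_model: int, n_latents: int = 8, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, cache_states: torch.Tensor) -> torch.Tensor:
        # cache_states: (batch, seq_len, d_model) from the frozen base model
        q = self.queries.unsqueeze(0).expand(cache_states.size(0), -1, -1)
        latents, _ = self.attn(q, cache_states, cache_states)
        return self.proj(latents)  # (batch, n_latents, d_model)

# Only the co-processor would be trained; the base LM stays frozen. E.g.:
# hidden = frozen_lm(input_ids, output_hidden_states=True).hidden_states[-1]
# augmented_context = torch.cat([hidden, coproc(hidden)], dim=1)
```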
DeepSeek release DeepSeek-V3-Base (weights), 671B params. This is noteworthy as a very large open-source model, noteworthy for achieving performance competitive with the state of the art, and noteworthy for having (supposedly) required relatively little compute (15T tokens, 2.788M H800 GPU-hours, only ~$5.5M).
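As a quick sanity check on the quoted dollar figure (assuming a rental rate of roughly $2 per H800 GPU-hour, reportedly the rate DeepSeek assume in their report):

```python
# Back-of-envelope check of the quoted training cost; actual costs will vary.
gpu_hours = 2.788e6          # reported H800 GPU-hours
cost_per_gpu_hour = 2.0      # assumed rental rate, $/GPU-hour
print(f"~${gpu_hours * cost_per_gpu_hour / 1e6:.2f}M")  # ~$5.58M
```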
Ilya Sutskever was co-recipient of the test-of-time award at NeurIPS 2024, for the 2014 paper: Sequence to Sequence Learning with Neural Networks, currently cited >28,000 times. Video of his speech here, in which he makes many provocative points: compute is growing but data is not (we only have one Internet; data is the fossil fuel of AI); scaling still matters, and we must determine what to scale; what comes next will be a mix of agents, synthetic data, and inference-time compute; strongly reasoning systems will be unpredictable; superintelligence is coming.
Dec 18: ChatGPT is now available by phone: 1-800-ChatGPT (1-800-242-8478) in US and Canada (you can also add it as a WhatsApp contact with that number).
Dec 19: ChatGPT integration into certain coding and note-taking apps.
Research Insights
A set of results pushes LLMs a bit away from the legible token representation we are currently used to:
Meta publishes: Byte Latent Transformer: Patches Scale Better Than Tokens. Instead of tokenization, it dynamically converts the input byte-stream into patches. This yields significant gains in compute efficiency, with minimal loss in performance (a rough sketch of the patching idea appears after this list).
Meta publishes: Large Concept Models: Language Modeling in a Sentence Representation Space. They train a model that operates at a higher level of abstraction than typical word/token LLMs. Their model operates in a space of concept embeddings (which are more akin to full sentences than individual words).
Each of these is individually exciting in terms of increased performance. However, they all push away from human-legible intermediate representations, which is problematic from a safety and engineering perspective.
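To give a feel for the Byte Latent Transformer's patching idea (a crude illustration, with a unigram surprisal estimate standing in for BLT's small byte-level LM; not Meta's implementation): a new patch begins whenever the next byte is hard to predict, so easy spans get grouped into long patches while surprising bytes get fine-grained treatment.

```python
import math
from collections import Counter

def patch_bytes(data: bytes, threshold: float = 5.0) -> list[bytes]:
    """Cut a new patch whenever the next byte's surprisal exceeds the threshold."""
    counts = Counter(data)
    total = len(data)
    surprisal = {b: -math.log2(c / total) for b, c in counts.items()}
    patches, current = [], bytearray()
    for b in data:
        if current and surprisal[b] > threshold:
            patches.append(bytes(current))  # hard-to-predict byte starts a new patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

print(patch_bytes(b"the quick brown fox jumps over the lazy dog"))
```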
Microsoft releases a small-but-capable model: Phi-4 (14B). It heavily uses synthetic data generation and post-training to improve performance (including on reasoning tasks).
Google’s Project Mariner, a Chrome extension for agentic AI.
Anthropic releases a new automated attack method for jailbreaking AI models; by identifying this vulnerability, one can build future models that resist it. Paper: Best-of-N Jailbreaking (code). The method iteratively makes small changes to prompts, attempting to slip past countermeasures.
The flavor of successful attacks also gives insights into LLMs. Successful prompts may involve strange misspellings or capitalizations; or unusual images with text and colored boxes arranged peculiarly. This is similar to other adversarial attacks (e.g. on image classification models). They have a certain similarity to human optical illusions: generating perverse arrangements meant to trick otherwise useful processing circuits. Improved model training can progressively patch these avenues; but it’s hard to imagine models that completely eliminate them until one achieves truly robust intelligence.
Anthropic publish: Alignment Faking in Large Language Models. They find evidence for alignment faking, wherein the model selectively complies with an objective in training, in order to prevent modification of its behavior after training. Of course the setup elicited this behavior, but it is surprising in the sense that LLMs don’t have persistent memory/awareness, and troubling in the sense that this shows even LLMs can engage in somewhat sophisticated scheming (e.g. they have evidence for these decisions going on during the LLM forward-pass, not in chain-of-thought).
Dec 5: o1 is out of preview. The updated o1 is faster (uses fewer tokens) while improving performance. And they have introduced a “Pro” version of o1 (thinks for even longer).
Here’s an example from a biomedical professor about o1-pro coming up with a legitimately useful and novel research idea.
Dec 5: There is now a ChatGPT Pro tier, $200/month for unlimited access to all the best models (including o1 Pro).
Dec 6: Reinforcement Fine-Tuning Research Program. Selected orgs will be able to fine-tune OpenAI models with reinforcement learning for specific tasks. This is reportedly much more sample-efficient and effective than traditional fine-tuning. It will be reserved for challenging engineering/research tasks.
Google DeepMind: Mastering Board Games by External and Internal Planning with Language Models. Search-based planning is used to help LLMs play games. They investigate both externalized search (MCTS) and internalized (CoT). The systems can achieve high levels of play. Of course the point is not to be better than a more specialized/dedicated neural net trained on that game; but to show how search can unlock reasoning modalities in LLMs.
Training Large Language Models to Reason in a Continuous Latent Space. Introduces Chain of Continuous Thought (COCONUT), wherein you directly feed the last hidden state as the input embedding for the next token. So instead of converting to human-readable tokens, the state loops internally, providing a continuous thought.
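A minimal sketch of that loop (an assumed interface for illustration; not the paper's code): instead of sampling a token and re-embedding it, the final hidden state is appended directly to the input embeddings for the next step.

```python
import torch
import torch.nn as nn

def continuous_thought_rollout(trunk: nn.Module, embeds: torch.Tensor, n_thoughts: int) -> torch.Tensor:
    """trunk maps (batch, seq, d) embeddings to (batch, seq, d) hidden states."""
    for _ in range(n_thoughts):
        hidden = trunk(embeds)                     # run the transformer trunk
        last = hidden[:, -1:, :]                   # last position's hidden state
        embeds = torch.cat([embeds, last], dim=1)  # feed it back as a latent "token"
    return embeds

# Tiny demo with a stand-in transformer trunk:
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
trunk = nn.TransformerEncoder(layer, num_layers=2)
out = continuous_thought_rollout(trunk, torch.randn(1, 5, 64), n_thoughts=3)
print(out.shape)  # torch.Size([1, 8, 64])
```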
New preprint considers how “capability density” is increasing over time: Densing Law of LLMs. They find that, for a given task, every 3 months the model size needed to accomplish it is halved. This shows that hardware scaling is not the only thing leading to consistent improvements.
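To see what that rate implies (a naive extrapolation of the paper's headline number, nothing more):

```python
# If the parameter count needed for a fixed capability halves every ~3 months,
# a task that takes a 70B model today would need only ~4B parameters a year later.
def required_params(params_now: float, months: float, halving_period: float = 3.0) -> float:
    return params_now / 2 ** (months / halving_period)

print(required_params(70e9, months=12) / 1e9)  # ~4.4 (billion parameters)
```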
LLM
Meta released Llama 3.3 70B, which achieves similar performance to Llama 3.1 405B. Meta also announced plans for a 2GW datacenter in Louisiana, for future open-source Llama releases.
Stephen Wolfram released a post about a new Notebook Assistant that integrates into Wolfram Notebooks. Wolfram describes this as a natural-language interface to a “computational language”.
GitIngest is a tool to “turn codebases into prompt-friendly text”. It will take a github repository, and turn it into a text document for easy inclusion into LLM context.
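The core idea is simple enough to sketch (this is not GitIngest's actual implementation, just a minimal stand-in): walk a checked-out repository and concatenate its files into one prompt-friendly document.

```python
from pathlib import Path

def repo_to_text(root: str, exts=(".py", ".md", ".toml")) -> str:
    """Concatenate a repo's text files into a single LLM-friendly document."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"--- {path.relative_to(root)} ---\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

# Example (assumes a local checkout at ./my-repo):
# print(repo_to_text("./my-repo")[:2000])
```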
While we haven’t seen a “new class of model” (bigger/better than GPT-4) in quite a while, it’s worth remembering the substantial improvements we’ve seen from perfecting the existing systems (from Epoch AI benchmarks). On Ph.D.-level Q&A, over the last year we’ve gone from no-better-than-random to roughly human-expert.
The End of Productivity: Why creativity is the new currency of success. The essay argues that a focus on pure productivity (and metrics) misses the things that humans value most; and that, potentially, the era of AI will shift the emphasis of value from human productivity to human creativity.
An interesting experiment (assuming it’s true): an AI jailbreaking contest. An AI agent was tasked with not approving an outgoing money transfer. Anyone can spend a small amount of money to send the AI a message. The money is added to the pool, and the cost-per-message increases slightly. It started at $10/message, and quickly grew to $450/message with a prize-pool of $50k. At that point, someone tricked the AI by sending a message that explained an inverted meaning of approveTransfer. So, they won the money.
This acts as the usual reminder that modern LLMs are not robust against dedicated attackers that seek to trick them and extract information.
Amazon enters the fight with Nova (docs, benchmarks). Although not leading on benchmarks, they promise good performance-per-dollar; the models will be available on Amazon Bedrock.
Hume adds a voice creation mode where one can adjust intuitive sliders to pick out the desired voice.
ElevenLabs previously announced intentions to build a conversational AI platform. This capability is now launching; they claim their interface makes it extremely easy to build a conversational voice bot, and allows you to select the LLM that is called behind-the-scenes.
Video
Google et al. show off: Generative Omnimatte: Learning to Decompose Video into Layers (preprint). It can separate a video into distinct layers, including associating effects (e.g. shadows) with the correct layer (parent object), and inpainting missing portions (e.g. occluded background). Obvious utility for visual effects work: it can be used to make a particular person/object invisible (including their shadows), to apply edits to just one component (object or background), etc.
Invideo are demoing a system where a single prompt generates an entire video sequence telling a story (example). I think that creators generally want more granular control of output so they can put together a precise narrative. But there are use-cases where this kind of fully automated generation may make sense.
It’s easy to look at the output and find the visual or narrative flaws. But also interesting to remember how advanced this is compared to what was possible 6-9 months ago. There is obviously a huge amount of untapped potential in these kinds of systems, as they become more refined.
Runway tease a prototype for a system to enable control over generative video, where videos are defined by keyframes and adjusting the connection/interpolation between them (blog post).
In October 2023, there were some prototypes of a “prompt travel” idea wherein a video was generated by picking a path through the image-generation latent space. One would define keyframe images, and the system would continually vary the effective prompt to interpolate between them (preprint, animatediff-cli-prompt-travel). This provided a level of control (while not being robust enough to actually enforce coherent temporal physics). Runway’s approach (leveraging a video model) may finally enable the required control and consistency.
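In its simplest form, the prompt-travel trick can be sketched as follows (an illustrative stand-in, not the cited tools' code): interpolate between keyframe prompt embeddings so that each frame is conditioned on a smoothly varying point in text-embedding space.

```python
import torch

def prompt_travel(emb_a: torch.Tensor, emb_b: torch.Tensor, n_frames: int) -> list[torch.Tensor]:
    """Linear interpolation between two keyframe prompt embeddings."""
    return [torch.lerp(emb_a, emb_b, t) for t in torch.linspace(0.0, 1.0, n_frames)]

# Shapes chosen to mimic a CLIP-style text encoding (77 tokens x 768 dims);
# each interpolated embedding would condition one generated frame.
frames = prompt_travel(torch.randn(77, 768), torch.randn(77, 768), n_frames=16)
print(len(frames), frames[0].shape)  # 16 torch.Size([77, 768])
```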
Whole-brain mapping is advancing. We recently saw release of a fly brain map (140,000 neurons). Now, a roadmap effort claims that whole-brain mapping for mammalian brains should be possible in the coming years.
Hardware
ASML released a hype-video describing the complexity of modern lithography (in particular the computational lithography aspect). There is no new information, but it’s a nice reminder of the nature of the state-of-the-art.
I never grow tired of looking at plots of Moore’s Law.
Robots
MagicLab released a video purporting to show multi-(humanoid)robot collaboration on tasks.
Aidan McLaughlin essay: The Problem with Reasoners. He notes three trends suggesting that AI will progress more slowly than naive/optimistic scaling arguments would imply:
It was hoped that multi-modal models (ChatGPT 4o, voice+text models, etc.) would exhibit significant capability improvements from transfer learning across modalities. This has not been borne out.
Iterative/reasoning models (OpenAI o1, DeepSeek r1, etc.) show that using RL can yield gains in narrow domains with clear metrics (contrived math problems), but we are not seeing evidence of this leading to generalized improvements in intelligence (in areas without easy verification).
No model larger than GPT-4 or Claude 3 Opus has been released, suggesting major challenges there.
Alibaba Qwen releases: Qwen QwQ 32B (weights, demo). This appears to be a separate implementation of the “o1-style” reasoning chain-of-thought approach.
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. There is always debate about whether LLMs “truly reason” or “simply memorize”. This paper proposes that reasoning is based on extracting procedures from training data, rather than simply memorizing outputs. So it is a matter of finding, memorizing, and using “templates” rather than specific results.
LLMs Do Not Think Step-by-step In Implicit Reasoning. They argue that while explicit chain-of-thought (CoT) generates stepwise reasoning, implicit reasoning (e.g. model trained to reproduce CoT outputs) does not internally invoke the same stepwise process.
A sub-culture of AI enthusiasts has developed around the idea of simply giving modern LLMs (limited though they may be) autonomy; or at least semi-persistence by allowing them to run for long time periods. Often, the AIs behave in strange and unexpected ways, as they attempt to continue a token-chain well beyond their original training/design.
Infinite Backrooms generates extremely long conversations by creating chat-rooms where different LLMs talk to each other endlessly. Conversations often veer into strange and unexpected topics; with some LLMs even outputting tokens describing distress.
truth_terminal is an 𝕏 handle that is reportedly an LLM given free rein to post. However, there is speculation that the human in charge (Andy Ayrey) is selective about what actually gets posted.
The bot started a memecoin (GOAT) that briefly reached a market cap of $1.3B (currently still >$700M). The coin’s name is a reference to a (NSFW) shock-meme. The AI itself (or the human behind it) likely netted many millions of dollars.
The AI reportedly “kept asking to play video games”; so it was given access to an “arcade” where the games are text-based games generated by another LLM. You can watch the streaming interactions: Terminal TV.
It also has its own web-page (that it, ostensibly, authored).
While it is hard to know how much human tampering is occurring in these implementations, it is interesting to see the bizarre and unexpected outputs that LLMs generate when unleashed.
Although allowing AIs to converse in an invented language could increase efficiency, it undercuts the legibility and auditability aspects of natural-language inter-communication. Overall, this approach could thus hamper both safety and capabilities of complex AI ecosystems.
Black Forest Labs released FLUX.1 Tools, a suite of models to enable more control over image generation/editing (inpainting, outpainting, conditioning).
Runway Frames is a new image model, with good style control.
Runway adds Expand Video, allowing one to change aspect ratio by outpainting (e.g.). Includes prompt guidance, allowing one to change a shot significantly.
LTXStudio announce LTX Video, an open-source video model (code, docs). Although the quality is not quite state-of-the-art, it is remarkably good and it is real-time. Of course, not all generations are excellent; but the real-time generation speed points towards neural world simulation in the not-too-distant future.
A group claims to have leaked access to a turbo version of OpenAI’s Sora video model (examples).
World Synthesis
An interesting result: applying Runway’s outpainting to a video where a person’s face is barely visible (and distorted through refraction) yields a reconstructed face that is remarkably coherent/correct. This implies that the model is implicitly building a valid world model.
Although the Unitree G1 humanoid robot was announced with a price of $16k (c.f.), the latest price chart shows a range of configurations, with prices from $40k to $66k.
Mercedes is running a trial for use of Apptronik robot in their Austin lab.
Max Tegmark offers a rebuttal to this report: AGI Manhattan Project Proposal is Scientific Fraud. He contends that the report-writers misrepresent the scientific consensus, in that they seem to assert that AGI will be easily controlled.
Nevertheless, this again shows that for short-form generation, AI has already reached human-level, and can be considered super-human in certain narrow ways.
DeepSeek announces DeepSeek-R1-Lite-Preview. This is a “reasoning” model (inference-time chain-of-thought) that seems to be similar to OpenAI’s o1. Like o1, it achieves impressive results on math and science benchmarks. Some of the CoT reasoning traces are quite interesting (e.g.). The weights are not yet available, but they claim they will release it open-source.
Also interesting to consider the rate of progress. A couple years ago, the prediction was we might reach 46% in the MATH benchmark by 2025. Instead, we now have a general LLM getting 92%. And o1 has also scored 97% on a challenging math exam (with novel questions that are nowhere in the training data).
Someone is trying to use a team of AI agents to write a full book autonomously. Different agents are responsible for different characters, or different aspects of writing (consistency, researching facts, etc.).
Image Synthesis
A recent survey of 11,000 people has concluded: How Did You Do On The AI Art Turing Test? The median score (for differentiating AI and human art) was 60%, only a bit above chance. AI art was often preferred by humans. Overall, AI art has already crossed a Turing-Test threshold.
Pickle AI is offering a virtual avatar for your meetings ($30/month). You still attend the meeting, and talk when you want. But your avatar pretends to pay attention, and lip-syncs your speech. So this is an alternative to having your camera turned off.
Runway releases some small updates, including longer (20s) video-to-video, vertical aspect ratio for Act-One, and more camera controls.
Sequence modeling and design from molecular to genome scale with Evo. A 7B genomic multi-modal foundation model trained on 2.7 million genomes. It can interpret DNA, RNA, and protein sequences; and can predict across molecular, system, and genomic scales. Can be used to predict effect of mutations, design CRISPR systems, etc.
An article on Reuters: OpenAI and others seek new path to smarter AI as current methods hit limitations. It repeats the assertions (disputed by many experts in the community) that next-generation models (under development) are under-performing, and that AI labs are hitting data walls. They also emphasize that the path forward involves more “inference-time compute” to unlock reasoning.
It is interesting to see the article including a quote from Ilya Sutskever, who has been largely quiet in the public sphere, after his departure from OpenAI and founding of SSI.
I found the discussion frustrating, since it felt like they were trying to have two very different conversations: Wolfram questioning basic principles and trying to build the argument from the foundations, Yudkowsky taking AI risk as being mostly self-evident and defending particular aspects of his thesis.
Yudkowsky seems reluctant to provide a concise point-wise argument for AI risk, which leads to these kinds of strange debates where he defends a sequence of narrow points that feel mostly disconnected. From his body of work, I infer two general reasons why he does this:
He has learned that different people find different parts of the argument obvious vs. confusing, true vs. false. So rather than reiterate the whole argument, he tries to identify the parts they take issue with, and deal with those. This might work for one-on-one discussions, but for public debates (where the actual audience is the broader set of listeners), this makes it feel like Yudkowsky doesn’t have a coherent end-to-end argument (though he definitely does).
Yudkowsky’s style, in general, is not to just “give the answer,” but rather to lead the reader through a sequence of thoughts by which they should come to the right conclusion. In motivated pedagogy (where the reader is trying to learn), this is often the right way. “Giving the answer” won’t cause the person to learn the underlying pattern; the answer might feel too obvious and be quickly forgotten. Thus one instead tries to guide the person through the right thoughts. But to a resistant listener, this leaves the (incorrect) impression that the person’s arguments are vague.
Let me try to put together a step-wise argument for ASI risk. I think it goes something like:
Humans are actively trying to make AIs smarter, more capable, and more agentic (including giving access/control to real-world systems like computers and robots and factories).
There is no particular ceiling at human intelligence. It is possible in principle for an AI to be much smarter than a human, and indeed there are lots of easy-to-imagine ways that they would outstrip human abilities to predict/plan/make-decisions.
AIs will, generically, “go hard”; meaning they will put maximal effort into achieving their goals.
The effective goals of a powerful optimizer will tend to deviate strongly from the design goals. There are many reasons for this:
It is hard to reliably engineer something as fuzzy (and, ultimately, inconsistent) as human values.
The analogy to evolution is often offered: evolution is optimizing for replication of genes, yet enacted human values have only a little to do with that (wanting to have children, etc.); humans mostly care about non-genetic things (comfort, happiness, truth), and are often misaligned to genes (using contraception).
Even goals perfectly-specified for a modest context (e.g. human-scale values) will generalize to a broader context (e.g. control the light-cone) in an ill-defined way. There is a one-to-many mapping from the small to the large context, and so there is no way to establish the dynamics to pick which exact goals are enacted in the extrapolated context.
In the space of “all possible goals”, the vast majority are nonsense/meaningless. A small subspace of this total space is being selected by human design (making AIs that understand human data, and do human things like solve problems, design technology, make money, etc.). Even within this subspace, however, there is enormous heterogeneity to what the “effective goals” look like; and only a tiny fraction of those possible AI goals involve having flourishing humans (or other sentient minds).
To be clear, humans will design AIs with the intention that their effective goals preserve human flourishing, but (c.f. #4) this is a difficult, ill-posed problem. The default outcome is an AI optimizing for something other than human flourishing.
A powerful system pursuing goals that don’t explicitly require humans will, generally speaking, not be good for humans. For instance, a system trying to harness as much energy as possible for its computational goals will not worry about the fact that humans die as it converts all the matter in the solar system into solar cells and computer clusters.
A superhuman (#2) system with real-world control (#1) pursuing (with maximum effort, #3) goals misaligned to human values (#4) will try to enact a future that does not include humans (#5). It will, generically, succeed in this effort, which will incidentally exterminate humans (#6).
Moreover, this isn’t a case where one can just keep trying until one gets it right. The very first ASI could spell ruin, after which one does not get another chance. It’s like trying to send a rocket to the moon without being able to do test flights! (And where failure means extinction.)
This argument has many things left unspecified and undefended. The purpose is not to provide an airtight argument for ASI risk; but rather to enumerate the conceptual steps, so that one can focus a discussion down to the actual crux of disagreement.
Amazon’s new Alexa has reportedly slipped to 2025. It’s surprising, given Amazon’s lead (existing devices in homes, etc.) and considerable resources, that they have not been able to operationalize modern LLMs. Then again, I suppose the legacy capabilities and customer expectations (replacement must work at least as well, in myriad small tasks, as existing offering) slows down the ability to make changes.
We might be seeing something similar play out with Apple’s promises of AI features.
New study on the impacts of AI on workers: Artificial Intelligence, Scientific Discovery, and Product Innovation. They find that for R&D materials scientists, diffusion models increase productivity and “innovation” (patents) and boost the best performers, but also remove some enjoyable tasks.
A valid question is whether they provided enough coverage in training, and enough scale (data, parameters, training compute) to actually infer generalized physics. It’s possible that at a sufficient scale, robust physics modeling appears as an emergent capability.
Conversely, the implication might be that generalization tends to be interpolative, and the only reason LLMs (and humans?) appear generalized is that they have enough training data that they only ever need to generalize in-distribution.
Mixtures of In-Context Learners. Allows one to extract more value from existing LLMs, including those being accessed via cloud (weights not available). The method creates a set of different “experts” by calling an LLM repeatedly with different in-context examples. Instead of just merging or voting on their final responses, one can try to consolidate their responses at the token level by looking at the distribution of predictions for next token. This allows one, for instance, to provide more examples than the context window allows.
It would be interesting to combine this approach with entropy sampling methods (e.g. entropix) to further refine performance.
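Roughly, the token-level consolidation described above might look like this (an assumed interface for illustration, not the paper's code): each "expert" is the same LLM called with different in-context examples, and their next-token distributions are mixed before choosing a token.

```python
import numpy as np

def mixture_next_token(expert_logprobs: list[dict[str, float]], weights=None) -> str:
    """Mix per-expert next-token log-probabilities and pick the most likely token."""
    vocab = set().union(*expert_logprobs)
    weights = weights or [1.0 / len(expert_logprobs)] * len(expert_logprobs)
    mixed = {tok: sum(w * np.exp(lp.get(tok, -1e9))
                      for w, lp in zip(weights, expert_logprobs))
             for tok in vocab}
    return max(mixed, key=mixed.get)

# Two hypothetical experts (same LLM, different in-context examples):
experts = [{" Paris": -0.1, " London": -2.5}, {" Paris": -0.3, " Berlin": -1.8}]
print(mixture_next_token(experts))  # " Paris"
```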
Anthropic added visual PDF support to Claude. Now, when Claude ingests a PDF, it no longer considers only a textual conversion of the document; it can also see the visual content of the PDF, allowing it to look at figures, layout, diagrams, etc.
Anthropic releases Claude 3.5 Haiku, a small/efficient model that actually surpasses their older large model (Claude 3 Opus) on many benchmarks.
Tools
Google is now making available Learn About, a sort of AI tutor that can help you learn about a topic. (Seems great for education.)
Now, Decart AI (working with Etched) are showing a playable neural-rendered video game (basically Minecraft). Playable here (500M parameters, code). Right now, this is just a proof-of-principle. There is no way for the game designer to design an experience, and the playing itself is not ideal (e.g. it lacks persistence for changes made to terrain). It feels more like a dream than a video game. But the direction this is evolving is clear: we could have a future class of video games (or, more broadly, simulation environments) that are designed using AI methods (prompting, iterating, etc.), and neural-rendered in real-time. This would completely bypass the traditional pipelines.
To underscore why you should be thinking about this result in a “rate of progress” context (rather than what it currently is), compare: AI video 2022 to AI video today. So, think about where neural-world-rendering will be in ~2 years.
And we now also have GameGen-X: a diffusion transformer for generating and controlling video game assets and environments.
Science
Anthropic’s “Golden Gate Claude” interpretability/control method consists of identifying legible features in activation space. Researchers have applied this mechanistic interpretability to understanding protein language models. They find expected features, such as one associated with the repeating sequence of an alpha helix or beta hairpin (visualizer, code, SAE). More fully understanding the learned representation may well give new insights into proteins.
More generally, it is likely a very fruitful endeavor to train large models on science data, and search in a feature space for expected features (confirm it learned known physics), and thereafter search for novel physics in the space.
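For reference, the feature-finding machinery behind this kind of work is a generic sparse autoencoder trained on model activations; a minimal sketch (not the cited project's code) looks like the following, with the L1 penalty encouraging each activation to be explained by a few interpretable features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary of features learned from model activations."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative features
        return self.decoder(features), features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    return torch.mean((recon - acts) ** 2) + l1_coeff * features.abs().mean()

# Toy usage on random "activations" (stand-ins for protein-LM hidden states):
acts = torch.randn(32, 1024)
sae = SparseAutoencoder(d_model=1024, d_features=8192)
recon, feats = sae(acts)
print(sae_loss(recon, acts, feats))
```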