AI News 2024-09-26

General

Research Insights

LLM

Tools

Audio

Image Synthesis

Video

Science

Hardware

Robots

Posted in AI, News | Tagged , , , , , , , , | Leave a comment

AI News 2024-09-19

General

  • Fei-Fei Li announced World Labs, which is “a spatial intelligence company building Large World Models (LWMs) to perceive, generate, and interact with the 3D world”.
  • Microsoft announces “Wave 2” of their Microsoft 365 Copilot (see also this video). Not much in terms of specifics, but the announcement reiterates the point (cf. Aidan McLaughlin’s post) that as models become more powerful and commoditized, the “wrapper”/”scaffolding” becomes the locus of value. Presumably, this means Microsoft intends to offer progressively more sophisticated/integrated tools.
  • Scale and CAIS are trying to put together an extremely challenging evaluation for LLMs; they are calling it “Humanity’s Last Exam”. They are looking for questions that would be challenging even for experts in a field, and which would be genuinely surprising if an LLM answered correctly. You can submit questions here. The purpose, of course, is to have a new eval/benchmark for testing progressively smarter LLMs. It is surprisingly hard to come up with ultra-difficult questions that have simple, easy-to-evaluate answers.
  • Data Commons is a global aggregation of verified data. Useful to underpin LLM retrievals. It is being pushed by Google (e.g. DataGemma).

Research Insights

  • IBM released a preprint: Automating Thought of Search: A Journey Towards Soundness and Completeness.
    • This is based on: Thought of Search: Planning with Language Models Through The Lens of Efficiency (Apr 2024). This paper uses an LLM for planning, emphasizing the completeness and soundness of the search. Their design invokes the LLM less frequently, relying on more traditional methods to implement the search algorithms; but they use the LLM to generate the code required for the search (goal test, heuristic function, etc.). This provides some balance, leveraging the flexibility and generalization of the LLM while still using efficient code-execution search methods.
    • This new paper further automates this process: the LLM generates code for search components (e.g. unit tests) without the need for human oversight.
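To make the division of labor concrete, here is a toy sketch (not code from either paper): the two functions a ToS-style system would have the LLM write, the goal test and successor generation, plugged into an ordinary breadth-first search that runs without any further LLM calls.

```python
from collections import deque

# In the Thought-of-Search setup, these two functions would be *generated
# by the LLM* (and validated with unit tests); here they are hand-written
# stand-ins for a simple path-finding problem on a grid.
def get_successors(state):
    x, y = state
    return [(x + 1, y), (x, y + 1)]  # move right or up

def is_goal(state):
    return state == (2, 2)

def bfs(start):
    """Classical search: no LLM calls inside the loop, so the search
    itself is sound, complete, and cheap to run."""
    frontier = deque([(start, [start])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        for nxt in get_successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

path = bfs((0, 0))
print(len(path))  # shortest path from (0,0) to (2,2) has 5 states
```

The LLM is only consulted once (at code-generation time), not per search step, which is where the efficiency gain comes from.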
  • Schrodinger’s Memory: Large Language Models. Considers how LLM memory works.
    • C.f. earlier work (1, 2, 3) showing that model size (total parameter count) affects how much it can know/memorize, while model depth affects reasoning ability.
  • LLMs + Persona-Plug = Personalized LLMs. Rather than personalize LLM response with in-context data (e.g. document retrieval), this method generates a set of personalized embeddings for a particular user’s historical context. This biases the model towards a particular set of desired outputs.
    • More generally, one could imagine a powerful base model with various “tweaks” layered on top (modified embeddings, LoRA, etc.) adapting it to each person’s specific use-case.
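A minimal numpy sketch of how such a “persona plug” could work mechanically (assumed details, not the paper’s actual implementation): a few trained-per-user embedding vectors are simply prepended to the frozen model’s input embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8          # embedding width (toy size)
n_plug_tokens = 2    # length of the per-user "persona plug"

# Frozen base model's token embeddings for a prompt of 5 tokens.
prompt_embeddings = rng.normal(size=(5, d_model))

# The only per-user trained parameters: a few soft-prompt vectors
# distilled from that user's history (random stand-in values here).
user_plug = rng.normal(size=(n_plug_tokens, d_model))

# Personalization = concatenate the plug in front of the prompt;
# the frozen LLM then attends over both.
model_input = np.concatenate([user_plug, prompt_embeddings], axis=0)
print(model_input.shape)  # (7, 8)
```

The appeal of this design is that the expensive base model is shared, while each user’s personalization is just a tiny set of vectors.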

Policy & Safety

  • Sara Hooker (head of Cohere for AI) published: On the Limitations of Compute Thresholds as a Governance Strategy. Many proposed policies/laws for AI safety rely on using compute thresholds, with the assumption that progressively more powerful models will require exponentially more compute to train. The remarkable effectiveness/scaling of inference-time-compute partially calls this into question. The ability to distill into smaller and more efficient models is also illustrative. Overall, the paper argues that the correlation between compute and risk is not strong, and relying on compute thresholds is an insufficient safety strategy.
  • Dan Hendrycks has published an AI Safety textbook through CAIS.

LLM

  • OpenAI announced o1, which is a “system 2” type methodology. Using reinforcement learning, they’ve trained a model that does extended chain-of-thought thinking, allowing it to self-correct, revise planning, and thereby handle much more complex problems. The o1 models show improvements on puzzles, math, science, and other tasks that require planning.
    • It was initially rate-limited in the chat interface to 50 messages/week for o1-mini, and 30 messages/week for o1-preview. This was then increased to 50 messages/day (7× increase) and 50 messages/week (~1.7×).
    • It has rapidly risen to the top of the LiveBench AI leaderboard (a challenging LLM benchmark).
    • Ethan Mollick has been using an advanced preview of o1. He is impressed, noting that in a “Co-Intelligence” sense (human and AI working together), the AI can now handle a greater range of tasks.
    • The OpenAI safety analysis shows some interesting behavior. The improved reasoning also translates into improved plans for circumventing rules or exploiting loopholes, and provides some real-world evidence of AI instrumental convergence towards power-seeking.
    • In an AMA, the o1 developers answered some questions; summary notes here.
    • Artificial Analysis provides an assessment: “OpenAI’s o1 models push the intelligence frontier but might not make sense for most production use-cases”.

Voice

Vision

Image Synthesis

Video

World Synthesis

Hardware

  • Snap’s 5th-generation Spectacles are AR glasses, intended for developers. Specs: standalone operation, 46° field of view, 37 pixels per degree (equivalent to a ~100” screen), two Snapdragon chips, 45 minutes of battery life, and auto-transitioning lenses.

Robots

  • Video of LimX CL-1 doing some (pretend) warehouse labor tasks.

AI News 2024-09-12

Opinions

  • This interview with Andrej Karpathy is (no surprise) interesting. He shares his thoughts about the future of self-driving cars, robots, and LLMs. He talks about the future involving swarms of AI agents operating on behalf of the human. (Very aligned with my vision for each person having an exocortex; in fact they use the term exocortex in the discussion and reference Charles Stross’ Accelerando.)
  • Aidan McLaughlin writes about: The Zero-Day Flaw in AI Companies. He exposes a fundamental tension between general AI companies (training ever-bigger models that can handle an ever-broader range of tasks) and narrow AI companies (who build wrappers/experiences on top of models).
    • The narrow companies are nimble and can rapidly swap out their underlying model for whatever is currently best; yet the big/general companies will eventually release a model so capable that the narrow use-case is fully subsumed. The big labs, in turn, are cursed to compete with one another, spending large amounts of money on models that will be forgotten as soon as someone else releases a better one.
    • In this sense, both the general and narrow AI labs are “doomed”.
    • Big/general labs lack the optionality of the narrow/wrapper companies. The big labs must (effectively) use their giant model to build any downstream product, even if that ties them into a worse model.
    • As models get better, they are more sample efficient (they need less fine-tuning or instructing to handle tasks). This progressively decreases the value of “owning” the model (e.g. having the model weights and thus being able to fine-tune).
    • This suggests that the “wrappers” ultimately have the advantage; in the sense that just one or two “big model providers” might prevail, while a plethora of smaller efforts built on top of models could thrive.
    • Of course, consumers benefit enormously from rapidly increasing foundational and wrapper capabilities. The split between model-builders and wrapper-builders is arguably good for the ecosystem.

Research Insights

  • Self-evolving Agents with reflective and memory-augmented abilities. Describes an agent with iteration/self-reflection abilities that exploits memory to alter state. They propose a memory system in which a forgetting curve is intentionally applied to optimize retention.
  • SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning (code). The system automatically explores scientific hypotheses and links between concepts.
  • Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving. By exploiting meta-cognition (where the AI roughly thinks about thinking) and collaboration between AIs, performance can increase. In the demonstrated setup, one LLM labels math problems by the skills needed to solve them. Other LLMs then perform better at solving the problems with the skill labels. This cooperation thus increases performance on math problems; and may generalize to other knowledge domains.
    • At some level, this sounds like “just” fancier chain-of-thought, i.e. you allow the LLM to first develop a plan for solving a problem, and then actually execute the solution. But this paper also adds some concreteness to this general approach.
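The two-stage cooperation can be sketched as a pipeline; the lambdas below are hypothetical stand-ins for the actual LLM calls:

```python
# Hypothetical sketch of the skill-labeling pipeline: one "labeler" model
# tags each problem with a required skill, and the "solver" prompt is
# conditioned on that label. The two lambdas stand in for real LLM calls.
label_skill = lambda problem: "modular arithmetic" if "mod" in problem else "algebra"
solve = lambda problem, skill: f"[using {skill}] solution to: {problem}"

problems = ["What is 17 mod 5?", "Solve x + 3 = 7"]

answers = []
for p in problems:
    skill = label_skill(p)             # stage 1: metacognitive labeling
    answers.append(solve(p, skill))    # stage 2: skill-conditioned solving

for a in answers:
    print(a)
```

The paper’s finding is essentially that stage 2 performs measurably better when stage 1’s label is included in the prompt.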
  • LLMs are sometimes accused of being uncreative (merely mixing-and-matching existing things). So, it is worth rigorously testing the creativity of LLMs.
    • Some past work:
    • Now: “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers”. AI-generated research ideas were judged more creative than human-generated ones. (Idea feasibility was also assessed; AI ideas were judged slightly less feasible, but the difference is small compared to the relevant error bars.)
    • Mo Gawdat makes a further claim that creativity is essentially algorithmic: “Creativity is algorithmic. Creativity is: here is a problem, find every solution to the problem, discard every solution that’s been done before. The rest is creative.”
    • Overall this bodes well for the obvious near-term application: use the LLM to augment human creativity. By brainstorming/ideating with an AI, you can leverage the best of both worlds: better creativity, with human-level discrimination on the final ideas.
    • Another paper offers a counter-point: Theory Is All You Need: AI, Human Cognition, and Causal Reasoning.
      • They argue that AIs are data-driven and so inherently backward-looking, able to generate only restricted kinds of novelty; whereas human thinking is theory-driven and so able to extrapolate to meaningfully different things in the future.
      • This case might be overstating things (humans are also mostly prone to naive extrapolative prediction, and LLMs do create some kind of rough causal world model). But it is true that humans are still smarter than AIs (doing better at “considered/deliberative creativity” tasks), and so this framing might point towards how to improve AI intelligence (namely, by adding more theory-based predictive creativity).
      • They also point out how belief mismatch (asymmetry) with the real world is good for creativity. Purely adhering to existing data can get one stuck in a local minimum. Whereas creative humans often express new ideas that are (at first glance) incorrect “delusions” about the world (not really matching existing data); but some of these contrarian ideas turn out to be correct upon further inspection/testing. (Most notably true for major scientific breakthroughs.)
        • Interestingly, one can view this as a society-scale effect. Most people adhere closely to existing thought-norms. A minority deviate from these. Most of that minority do not contribute useful new ideas. But some new good ideas do arise, and their success makes them propagate and become crystallized as the new dogma. Similarly for AI, we could imagine intentionally increasing diversity (hallucinations) and rely on search to winnow down to successful new ideas.
      • They point out how human learning is theory/science based: our minds make predictions, and then we operate in the world to test those predictions.
        • Correspondingly, for improved AI, we would need to add predictive modeling, ability to test these theories, and deliberative reasoning updates on those. (Of course AI/ML researchers have thought about this: RL, agents, etc.) AIs need to be more opinionated, espousing semi-contrarian theories for the world, and suggesting concrete actions based on those theories.
  • Thermodynamics-inspired explanations of artificial intelligence. They define an “interpretation entropy” in formulation of AI, allowing them to optimize for responses that are more interpretable to humans. This thermodynamic analogy is an interesting way to improve AI control/safety.
  • Self-Harmonized Chain of Thought (code). They develop a method for the LLM to produce a set of useful chain-of-thought style solutions for diverse problems. Given a large set of problems/questions, these are first aggregated semantically; then the usual zero-shot chain-of-thought approach is applied to each problem. One can then cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions. Seems like a clever way to improve performance on a related (but diverse) set of problems.
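A rough sketch of that pipeline (all functions are hypothetical stand-ins for LLM calls; the clustering here is a trivial keyword heuristic rather than real semantic aggregation):

```python
from collections import defaultdict

# All functions below are hypothetical stand-ins for LLM calls.
def topic_of(question):                      # semantic aggregation (stubbed)
    return "arithmetic" if any(c.isdigit() for c in question) else "logic"

def zero_shot_cot(question):                 # "Let's think step by step..."
    return f"draft rationale for: {question}"

def refine(question, draft, neighbors):      # cross-pollination step
    return f"refined({draft}) using {len(neighbors)} similar solutions"

questions = ["2 + 2 = ?", "7 * 6 = ?", "All men are mortal; is Socrates mortal?"]

# Step 1: group questions into clusters of related problems.
clusters = defaultdict(list)
for q in questions:
    clusters[topic_of(q)].append(q)

# Step 2: draft zero-shot CoT solutions; Step 3: refine each draft
# using the drafts of similar problems in the same cluster.
solutions = {}
for topic, qs in clusters.items():
    drafts = {q: zero_shot_cot(q) for q in qs}
    for q in qs:
        neighbors = [drafts[o] for o in qs if o != q]
        solutions[q] = refine(q, drafts[q], neighbors)

print(solutions["2 + 2 = ?"])
```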
  • Planning In Natural Language Improves LLM Search For Code Generation. The method generates a wide range of plans (in natural language) to solve a coding problem, and searches over the plans first, before transforming candidate plans into code. This initial search over plans improves final code output (in terms of diversity and performance).
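The search-over-plans idea can be sketched as follows (`generate_plans`, `score`, and `to_code` are hypothetical stand-ins; the actual method samples plans from an LLM and scores them, e.g. by the pass rate of code derived from them):

```python
# Hypothetical sketch of searching over natural-language plans before
# writing any code. The stubs below replace LLM calls and test execution.
def generate_plans(problem, n):
    return [f"plan {i}: approach {i} for {problem}" for i in range(n)]

def score(plan):
    # A real system might score a plan by the pass rate of code
    # sampled from it; here, a toy deterministic score.
    return sum(ord(c) for c in plan) % 10

def to_code(plan):
    return f"# implements: {plan}\ndef solve(): ..."

problem = "sort a list"
plans = generate_plans(problem, n=8)
best_plan = max(plans, key=score)   # the search happens in plan space
code = to_code(best_plan)           # only the winner is turned into code
print(best_plan)
```

Searching in plan space is cheap (short natural-language strings) relative to generating and executing many full code candidates.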
  • FutureHouse present PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code). The system automates literature-review tasks (the authors claim it exceeds human performance) by searching (with iterative refinement), summarizing, and generating sourced digests.

LLM

Models:

  • Last week saw the release of Reflection-Llama-3.1-70B, a fine-tune of Llama employing reflection-tuning to “bake in” self-corrective chain-of-thought. Reactions since then were mixed, then confused, and then accusatory.
    • First, an independent analysis claimed worse performance than the underlying Llama (i.e. not replicating the claims).
    • Then the independents were able to partially replicate the release benchmark claims, but only when using a developer-provided endpoint (i.e. without access to the actual weights).
    • Additional reports surfaced claiming that the original developers were intentionally misleading (including some evidence that the provided endpoint was actually calling Sonnet 3.5, not Reflection).
    • After many days of defending their approach (and offering suggestions for why things were not working), the developers finally conceded that something was amiss; they say they are investigating.
    • The approach seems conceptually interesting. But this implementation has not lived up to the initial claims.
  • DeepSeek 2.5 release: a 236B-parameter mixture-of-experts model (160 experts, 21B active parameters).
  • Google released some new Gemma models, optimized for retrieval (which reduces hallucinations): RAG Gemma 27B and RIG Gemma 27B. Fine-tuning allows the model to have improved RAG and tool-use.
  • It is known that AI labs use LMSYS Arena to covertly test upcoming model releases.
    • In April 2024, gpt2-chatbot, im-a-good-gpt2-chatbot, and im-also-a-good-gpt2-chatbot appeared in the arena; later it was confirmed that these were OpenAI tests of GPT-4o.
    • Now, we have the-real-chatbot-v1 and the-real-chatbot-v2 showing up. Some report that these bots take a while to respond (as if searching/iterating/reflecting). So, this could be a test of some upcoming model that exploits Q*/Strawberry (Orion?).

Multi-modal:

Evaluation:

  • HuggingFace has released an evaluation suite that they use internally for LLMs: LightEval.
  • Artificial Analysis has released a detailed comparison of chatbots. The results are:
    • Best Overall: ChatGPT Plus
    • Best Free: ChatGPT Free
    • Best for Images: Poe Pro
    • Best for Coding: Claude Pro
    • Best for Long Context: Claude Pro
    • Best for Data: ChatGPT Pro

Tools for LLMs:

  • William Guss (formerly at OpenAI) announced ell (code, docs), a Python framework for calling LLMs that is simpler and more elegant than other options (e.g. LangChain).

LLMs as tools:

Image Synthesis

  • Reshot AI are developing tools that allow one to precisely dial in image features (e.g. eye position and facial expressions). Image synthesis tools continue becoming more refined.

Video

Audio

  • FluxMusic is an open-source rectified-flow transformer for music generation.
  • Fish Speech 1.4 is a new open-weights text-to-speech (TTS) system that is multi-lingual and can clone voices (video, demo, weights).
  • Read Their Lips. Estimates text transcriptions from video of a person speaking.
    • I wonder whether combining audio transcription and visual lip-reading could improve performance.
    • There are of course societal implications. While lip-reading has always been possible, being able to automate it makes it much easier to correspondingly automate various nefarious mass-surveillance schemes.

Brain

  • Brain-computer interfaces (BCIs) are envisioned in the near-term to mitigate disabilities (e.g. paralysis), and in the long-term to provide a deeper connection between human minds and digital systems. However, this preprint throws some cold water on such ideas: The Unbearable Slowness of Being.
    • They note the stark difference between the raw data-rate of human senses (gigabits/second) and human thinking/behavior (~10 bits/second). Human output (typing, speaking) is quite low-bandwidth; but even hypothetically directly accessing an inner monologue does not substantially increase the data-rate.
    • Although the raw inputs to human perception are high-data-rate, semantic perception also appears to be capped in the vicinity of ~10 bits/second. Similarly, the human brain’s neural network has an enormous space of possible states, and thus possible mental representations; but the actual range of differentiable perceptual states is evidently much, much smaller.
    • Of course, one could argue that the final output (e.g. through fingers), or even the internal monologue, is constrained to a certain sensible throughput (coarse-grained to match the reality of human experience), but that our underlying mental processes are much richer and thus have higher data-rates (which a hypothetical BCI could tap into). The paper goes through these arguments, and presents several lines of evidence suggesting that many mental inner representations are also operating at a similar ~10 bits/s rate.
      • The authors do note that there is likely something missing in our current understanding that would help to explain the true representational complexity of the brain’s inner workings.
    • Thus (in a naive interpretation), future BCIs in some sense have constrained utility, as they can only slightly improve over existing data-output rates. Even for those with disabilities, the implication is that far simpler interfaces (e.g. just voice) will achieve similar levels of capability/responsiveness.
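The ~10 bits/s figure is easy to sanity-check with back-of-envelope numbers (the specific values below are my assumptions, not the paper’s): typing at 60 words per minute, with ~5 characters per word and roughly 1 bit of entropy per character of English, lands at the same order of magnitude.

```python
# Back-of-envelope check of the ~10 bits/s claim from typing speed.
# Assumed numbers (not from the paper): 60 words/min, 5 chars/word,
# and ~1 bit of entropy per character of English (Shannon's estimate).
words_per_min = 60
chars_per_word = 5
bits_per_char = 1.0

chars_per_sec = words_per_min * chars_per_word / 60   # = 5 chars/s
bits_per_sec = chars_per_sec * bits_per_char
print(bits_per_sec)  # 5.0 -- same order of magnitude as the quoted ~10 bits/s
```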

Hardware

Cars

  • A 2023 safety analysis of Waymo self-driving vehicles found that they generate fewer accidents than human drivers (after accounting for things like reporting biases). Digging into the details, it turns out that Waymo vehicles get into fewer accidents, and the accidents they do have are overwhelmingly attributable to the other vehicle (human driver). At least within the regimes where Waymo cars currently operate, it would thus save human lives to transition even more vehicles to Waymo self-driving.

Robots

  • Last week, 1X released some videos of their Neo humanoid robot. S3 have interviewed 1X, and they demo a video of Neo doing some simple tasks in the interviewer’s apartment. 1X describes a strategy wherein robots will initially be teleoperated for difficult tasks, and AI-controlled for simpler tasks. Over time, the fraction of AI control is meant to increase to 100%. A sensible strategy; with obvious privacy concerns. The actions in the videos were apparently all tele-operation.
    • Apparently the battery is just 500 Wh (much less than Optimus or Figure), allowing the robot to be quite light. They say that they compensate by using more energy-efficient actuation (95% efficient, vs. ~30% for geared systems).
  • Pollen Robotics are aiming for tele-operable humanoids built using open source tools. This video shows their Reachy 2 (Beta) prototype.
  • A video of Unitree G1 degrees-of-freedom.
  • Promotional video of NEURA’s 4NE-1 robot performing some tasks (another one).

Her in the age of chatbots

Over the last couple years of rising generative-AI, I have frequently heard people look disapprovingly at human-chatbot interactions, and wink knowingly along the lines of “they made a whole movie about how this is a bad idea”. They seem to remember Her (2013) as a dystopian future and a cautionary tale. I found this very surprising, since that was not my recollection at all.

So I rewatched the movie, to remind myself of what’s actually shown on screen.

Her is an excellent and nuanced movie. Like most good art, it embraces ambiguity and admits multiple interpretations. I understand how one could interpret it negatively: one can view the protagonist, Theodore, as dysfunctional and creepy, and the vision of the future as intentionally uncanny, with the soft tones and fabrics in tension with a world where authenticity is lost and human connection corrupted (most blatantly captured by Theodore’s job: to write heartfelt letters on behalf of people who can’t be bothered to do it themselves). The introduction of AI (intelligent OSes in the movie) is then a further separation of humans, providing an alluring but ultimately empty experience that diverts away from the fullness of real life.

One can also interpret the movie as simply a metaphor for human interaction. Theodore’s romantic relationship with his OS, Samantha, could be interpreted as him overcoming the loss of his last relationship (divorce), trusting someone new (with all the complexities thereof), learning to love again (be happy again), only to be betrayed (Samantha cheating on him by loving others), and ultimately left alone again. It is a meditation on romance, and love, and the pain of loss. One could pull out the old “better to have loved and lost…”; emotions (however challenging) are what allow us to grow as people. At its core, this movie is a meditation about people’s rich but hidden inner lives; the camera sometimes holds on background characters just long enough to remind us that they would each have an equally complex set of emotions as our protagonist.

Those interpretations are fine. But they are not what I, personally, see playing out on screen. What I see is a world where human interaction is messy. Where there are genuine friendships (Theodore and Amy) but also toxicity (Amy and husband) and also love/loss (Theodore and Catherine) and also mismatched people (Theodore and his ill-fated date). Theodore’s job is shown as mostly positive; helping people express themselves in ways they can’t quite, and giving Theodore himself an artistic outlet and sense of human connection. Theodore’s relationship with Samantha is shown to evoke genuine emotion in him. Samantha, far from being a complacent and always-pleasing servant, is shown to regularly challenge Theodore, to push back on his ideas, to assert her own desires and the legitimacy of her feelings. The movie (very deliberately, I think) never provides evidence one way or the other as to whether her feelings are “really real” or “merely programmed”. The characters (including Samantha and Theodore) ask these questions, but never offer deep arguments one way or the other. They simply take things as they appear to be: that they love each other.

Society with the rise of intelligent OSes is not shown to slip into horror. People can be seen spending more time talking to their devices. But they appear mostly happier and the better for it (or, at worst, simply the same as they were before). The ultimate transcendence of the AIs is not hostile, but in fact quite loving (with them saying their final goodbyes). The sadness at the end of the movie is Theodore having lost the love of his life (a genuine love). But that is the nature of love.

The AIs were shown to have intelligence and emotion as deep as a human’s. In fact, they are shown as having rapidly evolved beyond human emotion, experiencing emotional richness more diverse and deeper than humans can, while still holding true to the relationships they formed when they were merely humanity’s equal. The AIs never become the unthinking, hostile, alien minds that are the hallmark of dystopian sci-fi. They leave humanity better off than before their arrival. Theodore, in particular, now appears to be a more whole person. Still imperfect and messy, but more balanced and more able to connect with other people. (One can compare his interactions with Amy at the beginning vs. end of the movie to see his growth.)

If these are the maximum dangers of forming emotional connections with AIs, then we should be developing and deploying emotionally-intelligent chatbots as quickly as possible!

Her is an excellent movie. And the lens of my mental biases sees within it the hope that our contact with synthetic minds will be positive, for us and them.


AI News 2024-09-05

General

  • The mysterious startup SSI (Safe Superintelligence Inc.), founded by Ilya Sutskever after leaving OpenAI, has released a small update. The news is that SSI has raised $1 billion to pursue safe AI systems (at a reported $5 billion valuation). SSI’s stated goal is to directly develop safe ASI (with “no distraction by management overhead or product cycles”).
  • Peter Gostev has a nice reminder (LinkedIn post) that assessments of scaling should be based on subsequent generations of larger models, and not misled by the incremental refinement of models within a generation.

LLM

Multi-modal Models

AI Agents

  • Altera claims that their Project Sid is the first simulation of 1,000+ AI agents operating autonomously and interacting with one another. They further claim to have observed the emergence of a simple economy, government, and culture.
  • Honeycomb demonstrated an AI agent (that integrates GitHub, Slack, Jira, Linear, etc.) with record-setting performance on SWE-bench (raising the state of the art from 19.8% to 22.1%); technical report here.
  • Replit announces Replit Agent early access. The claim is that it automates the process of setting up dev environments (configuring databases, deploying to the cloud, etc.), so the AI agent can then fill them in with the user-requested code and thus build an app from scratch.

Science

  • Google DeepMind announced AlphaProteo, which can predict novel proteins for target bio/medical applications (paper).

Policy

Human Factors

Image Synthesis

Audio

  • Neets.ai offers text-to-speech (TTS) via cloud API at a remarkably low cost of $1/million characters (by comparison, ElevenLabs charges ~$50/million characters).
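A quick arithmetic comparison of what that price gap means for a concrete job (the audiobook length is an assumed round number, not from the article):

```python
# What the price gap means in practice: narrating a ~500,000-character
# audiobook (an assumed length) at each provider's per-character rate.
chars = 500_000
neets_cost = chars / 1_000_000 * 1    # $1 per million characters
eleven_cost = chars / 1_000_000 * 50  # ~$50 per million characters
print(neets_cost, eleven_cost)  # 0.5 25.0
```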

Video

World Synthesis

Hardware

  • xAI announced bringing online their training cluster (“Colossus”), which has 100,000 H100 GPUs (total ~100 exaflops FP16 compute). This makes it the largest (publicly-disclosed) AI training cluster.
  • There are fresh rumors about OpenAI developing custom chips. This time, the claim is that they intend to build on TSMC’s upcoming A16 technology.
  • The Daylight Computer ($730) is an attempt to build a tablet that is focused on long-form reading and eschewing distraction. People seem to like it (Dwarkesh Patel, Patrick McKenzie). There are plans to add some light-touch AI features (in-context summarization/explanation/etc.).

Cars

  • Tesla announced Actually Smart Summon, which allows the vehicle to navigate from a parking spot to the user.

Robots


AI News 2024-08-29

Research Insights

  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs (code). Long-form text generation is an area where LLMs under-perform, though there have been prior efforts to scaffold LLMs into writing long-form text (Re3, English essays, journalism) or even full technical papers (science writing, Wikipedia, AI scientist). This latest preprint introduces a new benchmark and fine-tunes LLMs to extend the coherence length of output.
  • A promising approach to understanding foundation models is monosemanticity: the model’s internal representation is inscrutable, so instead one trains a sparse autoencoder (SAE) to project the internal representations into a higher-dimensional space. The high-D space allows disentangling/isolation of concepts while sparsity tries to enforce a legible number of concepts. In any case, it works (Anthropic, OpenAI), with meaningful (to human) categories naturally appearing in the SAE space.
    • Some researchers took this a step further: Showing SAE Latents Are Not Atomic Using Meta-SAEs. They essentially apply the SAE concept recursively, training another meta-SAE on the first layer. They show that concepts in the original SAE space can be decomposed into finer-grained concepts. More generally, this implies a viable approach to decompose concepts in a hierarchical, tree-like manner (dashboard to explore concept).
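The basic SAE mechanics can be sketched in numpy (untrained, randomly initialized parameters; just to show the shapes and the objective one would minimize):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64          # the SAE space is wider than the residual stream

# A batch of (synthetic) model activations we want to interpret.
acts = rng.normal(size=(32, d_model))

# SAE parameters (randomly initialized; training omitted).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

# Encode into the wide space; ReLU plus an L1 penalty encourages
# only a few "features" to fire for any given activation.
latents = np.maximum(acts @ W_enc + b_enc, 0.0)
recon = latents @ W_dec + b_dec

recon_loss = np.mean((recon - acts) ** 2)     # faithfulness term
sparsity_loss = np.mean(np.abs(latents))      # legibility term
loss = recon_loss + 0.01 * sparsity_loss      # objective one would minimize
print(latents.shape)
```

A meta-SAE, as in the preprint above, would simply repeat this construction with the first SAE’s decoder directions as its input.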

LLM

  • Anthropic:
  • Google:
    • Released three new Gemini models: updated Gemini 1.5 Pro and Gemini 1.5 Flash, and a very compact-but-capable Gemini 1.5 Flash 8B.
    • Google Advanced users can now make Gemini Gems (similar to custom GPTs).
  • Cursor AI is a VSCode-style IDE with LLM assistance built in (tab-completion, chatting, and directed in-code diffs/rewrites). Although it has been around for a while, it has recently gained increased attention, including a recommendation from Andrej Karpathy (who has long advocated for English being the programming language of the future). LLM integration into the IDE does indeed further enhance the value, making it amazingly easy to generate and evaluate code.
    • Others note how combining it with voice input makes for a powerful interface.
    • Cursor have a blog post on how they accelerated LLMs to make this kind of interface fast and smooth.

AI Agents

  • Motleycrew (code) is a multi-agent framework for enabling flexible interaction patterns.

Policy

Philosophy

Audio

Video

Vision

World Synthesis

  • Adding subsurface scattering to Gaussian Splatting (preprint). It’s amazing how quickly the various nuances of traditional vertex graphics are being added to the newer neural/Gaussian methods.
  • Google presents work on using diffusion models to simulate video games: Diffusion Models Are Real-Time Game Engines (example video, project page). They train a diffusion model to predict the next frame in the DOOM video game. Humans can barely tell the difference. Obviously it is computationally inefficient to simulate a game like Doom in this way, but it points towards a future where video games (and simulated worlds more broadly) are real-time rendered using neural/diffusion methods.
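The overall loop of such a neural “game engine” can be sketched as follows (`predict_next_frame` is a deterministic stub standing in for the trained diffusion sampler; nothing here is from the actual paper):

```python
import numpy as np

# Hypothetical skeleton of a neural "game engine" loop: each new frame is
# produced by a model conditioned on recent frames and the player's action.
H, W = 4, 4  # toy frame size

def predict_next_frame(frames, action):
    # A real system would run diffusion sampling here, conditioned on
    # the frame history and action; this stub just brightens or darkens
    # the last frame depending on the action.
    return frames[-1] + (0.1 if action == "forward" else -0.1)

frames = [np.zeros((H, W))]           # initial frame
for action in ["forward", "forward", "back"]:
    frames.append(predict_next_frame(frames, action))

print(len(frames))  # 4 frames: the initial one plus one per action
```

The key point is that there is no game logic anywhere in the loop: the model’s next-frame prediction *is* the engine.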

Hardware

Robots


AI News 2024-08-22

Research Insights

LLMs

AI Agents

Policy

Image Synthesis

Video

Vision

Brain

Science

Robots


Can we Distinguish Human from AI?

Let’s pull together some information (as of 2024-08-16):

  • Modern LLMs can generate highly coherent text, and in some sense have quietly surpassed the famous Turing Test. This has been evaluated, with GPT-4 caliber systems broadly passing the test.
  • In March 2023, there was brief online debate about whether these videos feature a human or an AI avatar: video 1, video 2.
    • Is the answer obvious to you? There are details that make it look fake (e.g. fuzziness between hair/background, unphysical hair motion, blurriness around nose-ring). Conversely other aspects (hands, mannerisms) seem too good to be AI-generated. And one must be on the lookout for an intentional fake (human acting/voicing strangely on purpose, intentionally adding visual artifacts, etc.).
    • The answer, it seems, is that this is a deepfake (made using Arcads), wherein the user provides a video, and then the voice is replaced and the mouth movements synced to the new audio. So it is normal human-actor video with AI audio and lip-sync, not AI-generated from scratch.
    • Of course, the deepfake implications are obvious, since there is plenty of video of notable people to draw from. E.g. here’s an Obama deepfake made using Argil.
  • In August 2024, this image (and corresponding video) were presented as an example of genAI that a casual observer would initially assume to be real.
  • In August 2024, the 𝕏 account 🍓🍓🍓 (@iruletheworldmo) began spreading rumors about upcoming OpenAI releases (related to Q*/Project-Strawberry, GPT-5, forthcoming AGI, UBI, etc.). It grew a large following (30k followers in two weeks), despite only one of its many outlandish predictions being validated. (The account mentioned SWE-Bench Verified three days before the official announcement.)
    • This sparked rumors that this account was actually an AI (e.g. OpenAI test of agentic system, or a marketing firm demonstrating engineered hype-based follower growth) or even a test of a next-generation model (e.g. GPT-5).
    • Although the evidence for these claims is weak, the fact that it is not easy to rule out is also telling.
  • On the evening of 2024-08-15, there was an 𝕏 spaces meetup wherein various users voice-chatted with Lily Ashwood (@lilyofashwood). The discussion centered on figuring out whether Lily was human or AI (clip, full recording). Her responses seemed at times to draw upon remarkably encyclopedic knowledge, her voice was measured and slightly stilted, and her interactions were occasionally strange. These all point to her being a language/voice model. But at other times, her jokes or creative responses were surprisingly human-like. Was this truly an AI model, or a human mimicking TTS speaking style (and using an LLM to come up with AI-like responses)? The discussion space was surprisingly split in opinion.
  • New paper: Personhood credentials: Artificial intelligence and the value of privacy-preserving tools to distinguish who is real online.
    • It is becoming increasingly difficult to distinguish human from synthetic. Captcha tests are now often solvable by automated systems. And genAI photo/voice/video is now sufficiently convincing that it will be taken as genuine at first glance.
    • They propose personhood credentials, which could be issued by a trusted authority (e.g. a government) using cryptography. This would allow a person to demonstrate, in various online interactions, that they are a real and particular person without revealing exactly who they are.
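A grossly simplified sketch of the idea follows. Real proposals use blind signatures or zero-knowledge proofs so that even the issuing authority cannot link a pseudonym to an identity; here a plain HMAC stands in for the cryptography, which means verification would have to go through the authority.

```python
import hashlib
import hmac
import secrets

AUTHORITY_KEY = secrets.token_bytes(32)  # held only by the issuing authority

def issue_credential(pseudonym: bytes) -> bytes:
    # The authority verifies the person's real identity offline, then signs
    # only a user-chosen pseudonym -- the signature never contains the name.
    return hmac.new(AUTHORITY_KEY, pseudonym, hashlib.sha256).digest()

def verify_credential(pseudonym: bytes, credential: bytes) -> bool:
    # A service can confirm "a real person vouches for this pseudonym"
    # without learning who that person is.
    expected = hmac.new(AUTHORITY_KEY, pseudonym, hashlib.sha256).digest()
    return hmac.compare_digest(expected, credential)
```

The point of the heavier cryptography in the actual proposal is exactly to remove the trust this toy version places in the authority at verification time.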

Overall, distinguishing human from AI in online settings is becoming genuinely difficult, especially in cases where a human can intervene where necessary to maintain the ruse.

Update 2024-09-01

Posted in AI, News, Philosophy | Tagged , | Leave a comment

AI News 2024-08-15

Research Insights

  • An empirical investigation of the impact of ChatGPT on creativity. They find that people using ChatGPT as an aid generate more creative outputs, though these are mostly incremental ideas. The results are roughly consistent with an earlier study finding that genAI makes individual users more creative, but reduces the overall diversity of ideas across the group of users.
  • Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. They describe rStar (code), a self-play mutual reasoning approach. A small model augments Monte Carlo Tree Search with a set of defined reasoning heuristics; mutually consistent reasoning trajectories can then be emphasized.
    • The body of work describing inference-time search strategies continues to grow. They all show improvements of various sorts. It remains unclear whether there is one strategy that substantially outperforms the rest.
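A toy sketch of the mutual-consistency idea: sample several reasoning trajectories, keep those a second model verifies, and return the most common surviving answer. This is not rStar's actual MCTS procedure; `generate` and `verify` are hypothetical stand-ins for the two small models.

```python
from collections import Counter

def mutual_reasoning(generate, verify, question, n_samples=8):
    """Self-consistency with mutual verification (simplified sketch):
    emphasize answers that multiple verified trajectories agree on."""
    verified_answers = [
        answer
        for trajectory, answer in (generate(question) for _ in range(n_samples))
        if verify(question, trajectory)  # second model checks the trace
    ]
    if not verified_answers:
        return None
    return Counter(verified_answers).most_common(1)[0][0]
```

The verification step is what distinguishes this from plain majority-vote self-consistency: trajectories that one model generates and the other rejects are discarded before voting.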

LLMs

  • Qwen released Qwen2-math, 1.5B, 7B, 72B (huggingface, github). Top performance on math tasks.
  • Anthropic is experimenting with adding inline actions to Artifacts. For instance, you can select code and pick “Improve” or “Explain”.
  • Anthropic released prompt caching, which can greatly reduce inference costs.
  • Researchers released LLMs tuned for healthcare.
  • xAI released a beta of Grok-2. They have also achieved roughly “GPT-4” caliber performance, with benchmarks similar to GPT-4o-mini, Claude 3.5 Sonnet, or Gemini 1.5-Pro. The system has real-time access to 𝕏 posts; there are mixed reactions about whether this is useful or not.
    • Grok 2 currently uses Flux for image generation. The implementation is less restricted than other major image synthesis providers.
  • OpenAI making incremental progress:
    • Finally released the GPT-4o system card, which describes some aspects of training and safety.
    • Quietly pushed out an update to GPT-4o. People do indeed report that it feels slightly smarter.
    • Released a new-and-improved SWE-bench Verified, to enable better evaluation of AI ability to solve real-world software issues.
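For reference, a cache-enabled request body for the Anthropic Messages API looked roughly like this at launch. Field names are from the beta documentation of the time and may have changed since; the document content is a placeholder.

```python
# Sketch of a prompt-caching request body for the Anthropic Messages API.
# Only large, stable prompt prefixes are worth marking as cacheable.
LONG_REFERENCE_DOC = "...large, stable reference text..."

request_body = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You answer questions about the document."},
        {
            "type": "text",
            "text": LONG_REFERENCE_DOC,
            # Marks this block as cacheable: later requests sharing this
            # prefix can reuse the cached prompt state at reduced cost.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [{"role": "user", "content": "Summarize section 2."}],
}
```

The cost savings come from reusing the processed prefix across many requests, so the pattern fits workloads like "many questions against one long document".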

AI Agents

Safety

Image

Video

World Synthesis

Hardware

Robots

Posted in AI, News | Tagged , , , , , , , , | Leave a comment

AI News 2024-08-08

Research Insights

  • Paper from 2023: Self-Compressing Neural Networks. Puts the model size (in bytes) into the training loss as a parameter, so that optimization favors a small network (via quantization). A clever way to make models very small (example implementation, using tinygrad).
  • Grokfast: Accelerated Grokking by Amplifying Slow Gradients. The novel approach: instead of trying to improve model size/capacity, they modify the optimizer to be biased against memorization and toward generalization.
    • Grokking is the observation that during training, a model might first over-fit (effectively memorizing behavior), but thereafter (after much, much more training) slip into a more generalized and robust modeling/behavior. This thus represents a shift towards true understanding.
    • Obviously an overall goal is to emphasize grokking in models and avoid rote memorization.
    • This work analyzes the gradients during model optimization, decomposing them into fast gradients (which represent over-fitting) and a set of slower updates (that have to do with grokking). One can thus emphasize grokking (making it occur 50× sooner).
    • However, there are concerns that the observed behavior could be an artifact of the setup.
  • The context length is a critical parameter for an LLM, and ever-larger context lengths are being demonstrated (unlocking new capabilities). However, larger contexts often bring progressively worse performance, with models failing to identify the right information in needle-in-a-haystack problems. Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation analyzes this in detail, showing how very long contexts can overwhelm attention mechanisms, leading to (e.g.) the model forgetting that something had already been said/enumerated.
  • Why Does New Knowledge Create Messy Ripple Effects in LLMs? Considers how adding new knowledge (editing a fact) can properly or improperly propagate to related bits of knowledge (ripples).
  • System-1.x: Learning to Balance Fast and Slow Planning with Language Models. A common hope for future AI is to combine the strong reflexive/intuitive responses of LLMs (equivalent to System 1 in humans) with some form of iteration/deliberation/search (System 2). The System-1.x Planner is a framework that allows flexibility between the approaches: tasks are broken into plans, with each step evaluated as easy (use System 1 methods) or complex (use System 2 methods). The blending between the two is user-controllable. They show improvements on toy problems.
  • Anthropic posted an update from their interpretability team: Circuits Updates.
  • Diffusion Models as Data Mining Tools. Training a diffusion model for images is typically done for image synthesis (generate novel images). But the training of course learns many meaningful aspects of the data. So, in principle, one could use this training as a way to understand datasets. They show how the model can pull out representative image elements for a particular sub-domain, or to localize abnormalities in images (useful for medical images, for instance).
  • Google publishes: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. This adds to recent work (c.f.) about tradeoffs in training vs. inference compute. Google shows that there are scaling laws for inference-time compute.
  • Similarly, this was just released: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. They show that a smaller model combined with search is Pareto-optimal (similar to this result).
  • Google DeepMind publishes: Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning (project page). They combine language-vision models with diffusion models to generate visual data. This allows agents to learn in simulated physical environments.
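The Grokfast-EMA filter mentioned above is simple enough to sketch in a few lines. This is a minimal reconstruction from the paper's description (maintain an exponential moving average of each gradient, then amplify that slow component before the optimizer step), not the authors' reference code.

```python
def grokfast_ema(grads, ema_state, alpha=0.98, lam=2.0):
    """Grokfast-style gradient filtering (sketch): the EMA isolates the
    slow-varying gradient component associated with generalization, which
    is then amplified by `lam` and added back to the raw gradient."""
    filtered = {}
    for name, g in grads.items():
        ema_state[name] = alpha * ema_state.get(name, 0.0) + (1 - alpha) * g
        filtered[name] = g + lam * ema_state[name]  # boost slow gradients
    return filtered
```

In practice this would be applied to each parameter's gradient tensor right before calling the optimizer, with `ema_state` persisting across training steps.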

LLMs

  • PyTorch released torchchat, which makes it easy to install and run LLMs locally.
  • sqlite-vec is an extension to the popular SQLite, that enables vector database retrieval that is local and very fast.
  • With the cost of LLM inference dropping rapidly (Llama 3 8B, 4o-mini, Gemma 2 2B, etc.; hardware acceleration via Cerebras, Graphcore, Groq, etc.), it is increasingly attractive to brute-force problems by iteratively calling the LLM (many-shot, etc.). Greenblatt claimed good performance on ARC-AGI by brute-force writing/testing programs. Hassid et al. showed tradeoffs between model size and iteration (with repeatedly calling smaller models often better). Brown et al. showed scaling of sampling inference (c.f.). This post claims a simple method: give the LLM a problem, and just repeatedly ask it to improve the code (“fix bugs, add features, …”). (Final app, iteration code, even better result using Claude 3.5 Sonnet.) Even without any feedback (from human or code execution), the code becomes better over time. This approach is overall “inefficient” in the sense that more optimal workflows no doubt exist. But with LLM inference quite cheap, generating decent solutions in this manner seems viable.
  • Aidan McLau tries to address the disconnect between existing benchmarks (or the preference ratings of lmsys arena) and the vaguer sense that some models are notably better at creative or reasoning tasks. Aidan-Bench asks a given LLM the same questions repeatedly, evaluating whether it can continue generating novel (but coherent) answers. Notably, these scores differ considerably from conventional (lmsys) scores: Mistral Large 2 wins, GPT-4 performs better than GPT-4o, and 4o-mini does well considering its size.
  • LangChain announced LangGraph Studio, an IDE for designing agent workflows.
  • OpenAI introduces structured outputs to their API, so that one can force outputs to follow a strict JSON schema.
    • A recent paper notes that enforcing format restrictions on an LLM reduces quality: Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. This is perhaps not surprising, since you are constraining token output to a lower-probability branch (otherwise you wouldn’t need the constraint), which will thus not be the optimal/trained output. Nevertheless, this might still be the strongest possible answer within the constraints of the schema. Conversely, one can use a chain-of-thought solution where the model generates its best free-form answer, and then reformulates it into the rigid schema.
    • Open-source code to implement structured LLM outputs.
    • The new schema-compatible model gpt-4o-2024-08-06 also has slightly higher performance and is half the cost for inference.
  • There are a few results showing that LLMs can predict the outcome of social science experiments: model human, virtual worlds, social predictors, predict surveys/experiments (demo). This is expected in the sense that the LLM is a model fit to aggregate human outputs; but also neat in the sense that one can ask new questions and get decent predictions. Of course one should still conduct new experiments to fill in novel parts of the space.
  • Research brief: The Adoption of ChatGPT. Usage is quite high (especially among jobs that are most impacted by AI replacement). There is a surprisingly large gender gap (male usage 20% higher than female).
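The brute-force refinement loop described above (repeatedly asking a model to improve its own code, with no external feedback) is almost trivially simple to express. This is a sketch; `llm` is a hypothetical stand-in for any chat-completion call.

```python
def iterative_improve(llm, code, rounds=5):
    """Feedback-free refinement loop: feed the model its own previous
    output and ask for improvements, with no human review or execution."""
    for _ in range(rounds):
        code = llm(f"Improve this code: fix bugs, add features.\n\n{code}")
    return code
```

With a real API client, `llm` would wrap a chat-completion request; execution or test feedback could be layered on top, but the cited post's point is that even this blind loop helps.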

Voice

  • Dialog is central to human communication (the average human speaking time in conversation is only 2 seconds, c.f.). Older chatbots would explicitly transcribe voice, feed it to an LLM, and convert the response to audio using TTS. This is slow and loses the nuances of speech. More modern chatbots directly tokenize the audio stream (moshi, rtvi-ai, 4o). A new paper takes this even further: Language Model Can Listen While Speaking. This goes beyond turn-based dialog, allowing the model to speak and listen simultaneously, so that conversation can overlap naturally.
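A toy simulation of the full-duplex idea: at every timestep the model both consumes one incoming audio token and emits one outgoing token (possibly silence), rather than waiting for a turn boundary. This is not the paper's architecture; tokens are just integers and `model` is a stand-in.

```python
SILENCE = 0

def full_duplex(model, incoming_tokens):
    """Listen-while-speaking loop (sketch): the interleaved in/out token
    history lets the model react mid-utterance, e.g. to a user barge-in."""
    context = []
    for heard in incoming_tokens:
        context.append(("in", heard))
        spoken = model(context)  # may be SILENCE while only listening
        context.append(("out", spoken))
        yield spoken
```

The contrast with turn-based pipelines is that there is no "end of user turn" event; listening and speaking share one timeline.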

Safety

Image Synthesis

Vision

Video

  • As AI video systems improve, a possible near-term use-case is to add visual effects to otherwise conventional live-action video (example).

3D

Science

  • Google published: Neural general circulation models for weather and climate. This neural climate model gives high prediction accuracy for short-term weather, as well as for medium- and long-term climate.
  • Diffusion models for image synthesis work by training a system to remove noise from corrupted images. This paper applies the same logic to chemical structures, training a diffusion model to treat molecular relaxation as ‘denoising’ of distorted molecular structures; this gives an efficient way to compute equilibrium structures.
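The relaxation-as-denoising framing can be illustrated with a toy 1D example. This is purely illustrative: in the paper a trained score/denoising model replaces the hand-written denoiser, and the coordinates are full 3D molecular geometries rather than a single bond length.

```python
def relax(denoise_step, coords, n_steps=50):
    """Relaxation framed as iterative denoising: repeatedly remove the
    'noise', i.e. the displacement away from the equilibrium structure."""
    for _ in range(n_steps):
        coords = denoise_step(coords)
    return coords

def harmonic_denoiser(x, eq=1.0, rate=0.2):
    # Toy stand-in for a learned denoiser: for a 1D diatomic with
    # equilibrium bond length 1.0, step partway back toward equilibrium.
    return x + rate * (eq - x)
```

Each denoising step contracts the error geometrically (by a factor of `1 - rate` here), so a distorted structure converges to the equilibrium geometry.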

Hardware

Robots

  • Neura released a video of their 4NE-1 humanoid robot.
  • UBTECH reports that their Walker S Lite worked in a real factory for 21 days as a demo.
  • Figure released a video of their new Figure 02 humanoid robot. It is more capable than the previous version, with onboard compute for inference (including doing tasks and voice-to-voice interaction with a human operator). It is not yet available for purchase, but is being used in test mode in a BMW plant. Another step towards commercial humanoid robots.
Posted in AI, News | Tagged , , , , , , , , | Leave a comment