AI News 2024-10-24

General

Research Insights

Safety/Policy

LLM

  • The OpenAI Chat Completion API now supports audio input (allowing one to skip a separate transcription step); a minimal call sketch appears after this list.
  • Google’s Notebook LM has captured much attention, in part due to the useful “chat with my PDFs” feature, but mostly due to the cool “generate podcast” trick. You can now customize the podcast generation.
  • MotherDuck have added a “prompt()” function to their SQL database, such that you can weave LLM calls into your SQL lookups.
    • BlendSQL appears to be an open-source attempt to do something similar: combine LLM calls with SQL.
  • Meta released Meta Spirit LM, an open-source multimodal language model that freely mixes text and speech.
  • Anthropic announces a new Claude 3.5 Haiku model, as well as a new version of their excellent Claude 3.5 Sonnet model. The new Sonnet can “use a computer” (still experimental), available via the API.
    • Ethan Mollick posts about his experience using this experimental mode.
    • An open-source version (using regular Claude 3.5 Sonnet via API) has appeared: agent.exe.
  • Perplexity plans to release a reasoning mode, where it can agentically search and collate information.
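For reference, here is a minimal sketch of sending audio to the Chat Completions API, based on the announcement; treat the exact model name and content-part shapes as assumptions that may have changed since.

```python
# Hedged sketch of audio input to the Chat Completions API (per the announcement);
# the model name and the "input_audio" content shape are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

with open("meeting_clip.wav", "rb") as f:  # hypothetical local audio file
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",   # audio-capable chat model at announcement time
    modalities=["text"],            # ask for a text-only reply
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this recording in two sentences."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```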

Tools

Audio

  • Elevenlabs adds Voice Design, allowing you to generate a new voice by text-prompting what it should sound like.

Image Synthesis

Video

Science

Hardware

Robots

  • A video of Fourier’s GR-2 robot standing up.
  • Video of an Engine AI robot walking. As noted, the more upright (locked-knee) gait is more energy-efficient than the squatted (bent-knee) walking of many other designs.
  • Clone Robotics continue to pursue their micro-hydraulic bio-mechanical approach to robotics; they now have a torso.

AI News 2024-10-17

General

  • Anthropic CEO Dario Amodei has published an opinion piece about the future of AI: Machines of Loving Grace: How AI Could Transform the World for the Better. While acknowledging the real risks, the piece focuses on how AI could bring about significant benefits for humankind.
    • Max Tegmark uses this as an opportunity to offer a rebuttal to the underlying thesis of “rapidly developing strong AI is a net good”: The AGI Entente Delusion. He views a competitive race to AGI as a suicide race, since efforts to align AI are lagging behind our ability to improve capabilities. He proposes a focus on Tool AI (instead of generalized AI), so that we can reap some of the benefits of advanced AI with fewer of the alignment/control problems. This view favors government regulation proportionate to capability/risk; so, in principle, if companies could demonstrate sufficiently controllable AGI, then it could meet safety standards and be deployed/sold.
  • (Nuclear) Energy for AI:
    • The US Department of Energy is committing $900M to build and deploy next-generation nuclear technology (including small reactors).
    • Google announced it will work with Kairos Power to use small nuclear reactors to power future data centers.
    • Amazon is investing $500M in small modular reactors to support the expansion of genAI.
    • A group (Crusoe, Blue Owl Capital, and Primary Digital Infrastructure) announced a $3.4B joint venture to build a 200 MW datacenter (~100k B200 GPUs) in Texas. Initial customers will be Oracle and OpenAI.
    • The growing commitments to build out power for datacenters make it increasingly plausible that AI training will reach 10^29 FLOPS by 2030 (10,000× today’s training runs).
  • Here is an interesting comment by gwern on LessWrong (via this), which explains why it is so hard to find applications for AI, and why the gains have been so small (relative to the potential):

If you’re struggling to find tasks for “artificial intelligence too cheap to meter,” perhaps the real issue is identifying tasks for intelligence in general. …significant reorganization of your life and workflows may be necessary before any form of intelligence becomes beneficial.

…organizations are often structured to resist improvements. …

… We have few “AI-shaped holes” of significant value because we’ve designed systems to mitigate the absence of AI. If there were organizations with natural LLM-shaped gaps that AI could fill to massively boost output, they would have been replaced long ago by ones adapted to human capabilities, since humans were the only option available.

If this concept is still unclear, try an experiment: act as your own remote worker. Send yourself emails with tasks, and respond as if you have amnesia, avoiding actions a remote worker couldn’t perform, like directly editing files on your computer. … If you discover that you can’t effectively utilize a hired human intelligence, this sheds light on your difficulties with AI. Conversely, if you do find valuable tasks, you now have a clear set of projects to explore with AI services.

Research Insights

Safety

LLM

AI Agents

Audio

Image Synthesis

  • Adobe presented Project Perfect Blend, which adds tools to Photoshop for “harmonizing” assets into a single composite. E.g. it can relight subjects and environments to match.

Vision

Video

World Synthesis

Science

Cars

  • At Tesla’s “We, Robot” event, the company showed the designs for its future autonomous vehicles: Cybercab and Robovan. The designs are futuristic.

Robots


AI News 2024-10-10

General

  • Ethan Mollick writes about “AI in organizations: Some tactics”, talking about how individuals are seeing large gains from use of AI, but organizations (so far) are not.
    • Many staff are hiding their use of AI, with legitimate cause for doing so: orgs often signal risk-averse and punitive bureaucracy around AI; staff worry that productivity gains won’t be rewarded (or will even be punished, as expectations rise); staff worry their contributions won’t be recognized; etc.
    • Mollick offers concrete things that orgs can do to increase use of AI:
      • Reduce fear. Do not have punitive rules. Publicly encourage the use of AI.
      • Provide concrete, meaningful incentives to those who use AI to increase efficiency.
      • Build a sort of “AI Lab” where domain experts test all the tools and see whether they can help with business processes.
  • The 2024 Nobel Prize in Physics has been awarded to John J. Hopfield and Geoffrey E. Hinton, for developing artificial neural networks.
  • The 2024 Nobel Prize in Chemistry has been awarded to David Baker for computational protein design, and to Demis Hassabis and John Jumper for AI-based protein structure prediction (AlphaFold).
  • Lex Fridman interviews the team that builds Cursor. Beyond just Cursor/IDEs, the discussion includes many insights about the future of LLMs.

Research Insights

LLM

AI Agents

  • Altera is using GPT-4o to build agents. As an initial proof-of-concept, they have AI agents that can play Minecraft.
  • CORE-Bench is a new benchmark (leaderboard) for assessing agentic abilities. The task consists of reproducing published computational results, using provided code and data. This task is non-trivial (top score right now is only 21%) but measurable.
  • OpenAI released a new benchmark: MLE-bench (paper) which evaluates agents using machine-learning engineering tasks.
  • AI Agents are becoming more prominent, but a wide range of definitions is being used implicitly, ranging from “any software process” (“agent” has long been used for any software program that tries to accomplish something) all the way to “AGI” (it must be completely independent and intelligent). This thread is trying to crowd-source a good definition.
    • Some that resonate with me:
      • (1): agent = llm + memory + planning + tools + while loop
      • (2): An AI system that’s capable of carrying out and completing long running, open ended tasks in the real world.
      • (3): An AI agent is an autonomous system (powered by a Large Language Model) that goes beyond text generation to plan, reason, use tools, and execute complex, multi-step tasks. It adapts to changes to achieve goals without predefined instructions or significant human intervention.
    • To me, a differentiating aspect of an agent (compared to a base LLM) is the ability to operate semi-autonomously (without oversight) for some amount of time, and make productive progress on a task. A module that simply returns an immediate answer to a query is not an agent; there must be some kind of iteration (multiple calls to the LLM) for it to count. (A minimal sketch of such a loop appears after this list.) So I might offer something like:
      • AI Agent: A persistent AI system that autonomously and adaptively completes open-ended tasks through iterative planning, tool-use, and reasoning.
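To make definition (1) concrete, here is a toy sketch of the “llm + memory + planning + tools + while loop” pattern. The call_llm stub, the tool set, and the JSON protocol are placeholders of my own (with canned responses so the snippet runs offline), not any particular framework’s API.

```python
# Toy agent loop: LLM (stubbed) + memory + planning + tools + while loop.
import json

_canned = iter([
    '{"action": "calculator", "input": "17 * 23"}',
    '{"action": "finish", "input": "17 * 23 = 391"}',
])

def call_llm(messages):
    # Placeholder "LLM": returns canned plans so the demo runs offline.
    # In a real agent, swap in an actual LLM client call here.
    return next(_canned)

TOOLS = {
    "search": lambda q: f"(search results for {q!r})",
    "calculator": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
}

def run_agent(task: str, max_steps: int = 10) -> str:
    memory = [  # the agent's memory: the running conversation plus observations
        {"role": "system", "content":
         'Plan step by step. Reply with JSON: {"action": "search"|"calculator"|"finish", "input": "..."}'},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):                 # the (bounded) while loop
        reply = call_llm(memory)               # planning happens inside the LLM
        memory.append({"role": "assistant", "content": reply})
        step = json.loads(reply)
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])   # tool use
        memory.append({"role": "user", "content": f"Observation: {observation}"})
    return "(step limit reached)"

print(run_agent("What is 17 * 23?"))  # -> "17 * 23 = 391"
```

Everything that distinguishes more capable agents (richer planning prompts, longer-term memory, safer and more numerous tools) slots into this same loop.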

Image Synthesis

  • FacePoke is a real-time image editor that allows one to change a face’s pose (code, demo), based on LivePortrait.
  • A few months ago, Paints-UNDO (code) unveiled an AI method for not just generating an image, but approximating the stepwise sketching/drawing process that leads up to that image. This is fun, and maybe useful as a sort of drawing tutorial; but it also undermines one of the few ways that digital artists can “prove” that their art is not AI generated (by screen-capturing the creation process).

Video

World Synthesis

Science


How Smart will ASI be?

The development of AI is pushing towards AGI. To many, once you have AGI, you quickly and inevitably achieve ASI (superintelligence), since AGI can do AI research and thus AI iteratively self-improves (at an exponential rate). Others sometimes doubt that ASI can exist; they wonder how AI could ever be smarter than humans.

Here, let us try to enumerate how AI might be smarter than a human.

0. Human-like

A basic assumption herein is that human-level general intelligence can be reproduced. So a sufficiently advanced AI would be able to do anything a human can do. This captures more than just book smarts and mathematical or visual reasoning; our ability to socialize is also an aspect of intelligence (theory of mind, etc.).

1. Focused

A simple way in which a human-level AI would immediately be superhuman is focus. The AI can be more motivated, focused, attentive, single-minded, etc. Removing myriad foibles/weaknesses from a human would make them superhuman in output.

2. Fast

Once we can simulate human-level thinking in silico, there’s no reason it can’t be accelerated (through better hardware, algorithmic optimizations, etc.). A single human, if thinking sufficiently quickly, is already quite superhuman. Imagine if, for every reply you need to give, you are allowed to spend endless time researching the best answer (including researching the background of the person asking the question, tailoring it to their knowledge and desires). You can scour the literature for the right information. You can take your time to make sure your math is right. In fact, you can write entire new software stacks, invent new data analysis procedures, tabulate endless statistics. Whatever you need to make your answer just a little bit better.

3. Replicated

The computational form of AI makes it easy to replicate. So, in addition to “spending time thinking,” one can also launch numerous parallel copies to work on a problem. The diverse copies can test out different approaches (using different assumptions, different subsets of the data). The copies can double-check each other’s work. In principle, for any question asked of the AI, a vast hierarchy of agents can be launched; some reading sources, others analyzing data, others collecting results. Imagine that, for every decision, you could leverage a planetary scale of intellectual effort, all focused precisely on solving your task.

There is little doubt that human organizations exhibit some form of superhuman capability in terms of the complexity of projects they execute. The AI equivalent is similar, but far more efficient since the usual organizational frictions (lack of motivation in individuals, misaligned desires among participants, mistrust, infighting, etc.) are gone.

The sheer potential scale of AI parallel thinking is a staggering form of superhuman capability.

4. Better

In principle, an AI brain could be better than a human one in a variety of ways. Our cognition is limited by the size of working memory, by how convoluted a chain of reasoning we can hold in our heads, by the data-rates of our sensory inputs, by fragility to distractions, etc. All of these are, in principle, improvable.

5. Tunable

Human brains are subject to various modifications, including some that can improve task performance. Certain drugs can induce relaxed or heightened states that might maximize focus, or creativity, or emotions, etc. Certain drugs (e.g. psychedelics) seem even able to “anneal” a mind and help one escape a local minimum in thought-space (for better or worse). In humans, these interventions are crude and poorly understood; yet even here they have predictable value.

In AI, interventions can be much more precise, reproducible, controllable, and powerful. (And need not have side-effects!) One can, in principle, induce target mental states to maximize particular behaviors or capabilities. In this sense, AI could always have the “ideal” mental state for any particular task.

6. Internalized Scaffolding

It is worth noting that a large fraction of human intelligence comes not from our “raw brainpower” but from the scaffolding we have put on top, which includes language, concepts, math, culture, etc. For instance, our brains are roughly equivalent to the brains of ancient humans. We are much smarter, in large part, because we have a set of heuristics (passed down through culture, books, etc.) that allow us to unlock more “effective intelligence.” Some of our most powerful heuristics (math, logic, science, etc.) do not come so easily to us.

For AI, there is no need for this scaffolding to be external and learned. Instead, it could be more deeply embedded and thus reflexive. Arguably, modern LLMs are some version of this: the complexity of modern concepts (encoded in language) become built-in to the LLM. More generally, an AI could have more and more of this complex scaffolding internalized (including reflexive access to myriad source documents, software, solvers, etc.).

7. Native Data Speaker

Humans can speak (and think) “natively” using language, and learn to understand certain concepts intuitively (everyday physics, “common sense,” etc.). We then attempt to understand other concepts in analogy to those that are intuitive (e.g. visual thinking for math). An AI, in principle, can be designed to “think natively” in other kinds of data spaces, including the complex data/statistics of scientific data sets (DNA sequences, CAD designs, computer binary code, etc.). And these diverse “ways of thinking” need not be separated; they could all be different modalities of a single AI (much as humans can think both linguistically and visually).

By being able to “natively” think in source code, or symbolic math, or complex-high-dimensional-data-structures, the AI could exhibit vastly improved reasoning and intuition in these subject areas.

8. Goes Hard on Important Tasks

Humans are, mostly, unaccustomed to what can be accomplished by truly focusing on a task. The counter-examples are noteworthy as they are so rare: humans were able to design an atomic weapon, and send a person to the moon, in a relatively short time owing to the focus. The organizations in question “went really hard” on the task they were assigned. Most organizations we see are enormously inefficient in the sense that they are not, really, single-mindedly focused on their nominal task. (Even corporate entities trying to maximize profit at all cost are, in the details, highly fractured and inefficient since the incentives of the individual players are not so aligned with the organizational goal. Many organizations are not, in actual fact, pursuing their nominal/stated goal.) The jump in effective capability one sees when an organization (or entity) “really goes hard” (pursues their goal with unrestrained focus) can be hard to predict, as they will exploit any and all opportunities to advance their objective.

9. Goes Hard on Small Tasks

It is also worth considering that an AI can (due to its focus) “go hard” on even the small things that we normally consider trivial. Humans routinely ignore myriad confounders, give up on tasks, or otherwise “don’t sweat the small stuff.” This is adaptive for ancestral humans (avoid wasting effort on irrelevant things) and modern humans (don’t stress out about minutia!). But an AI could put inordinate effort into each and every task, and sub-task, and sub-sub-task. The accumulation of expert execution of every possible task leads to enormous gains at the larger scales. The flip side of the curse of small errors compounding into enormous uncertainty is that flawless execution of subtasks allows one to undertake much more complex overall tasks.

A simple example is prediction. As a human predicts what someone else will do, their thinking quickly dissolves into fuzzy guesses; and they give up predicting many moves ahead. The space of options is too large, and the error on each guess in the chain too large. But, with sufficient effort in each and every analysis, one could push much, much harder on a long predictive chain.

10. Unwaveringly Rational

Humans know that rational thinking is, ultimately, more “correct” (more likely to lead to the correct answer, achieve one’s aims, etc.). Yet even those humans most trained in rational thinking will (very often!) rely on irrational aspects of their mind (biases, reflexive considerations, intuitions, inspiration, etc.) when making decisions. Simply sticking to “known best practices” would already improve effective intelligence. An AI could go even beyond this, by exploiting rigorous frameworks (Bayesian methods, etc.) to be as rational as possible.

(This need not compromise creativity, since creativity is also subject to rigorous analysis: the optimal amount of creativity, efficient randomization schedules, maximizing human enjoyment, etc.)

11. Simulation

With sufficient computing power, a broad range of things can be simulated as part of a thinking process. Complex physical setups, social dynamics, and even the behavior of a single person could be simulated as part of the solution to a problem. Uncertainties and unknowns can be handled by running ensembles of simulations covering different cases. Humanity has reached superhuman weather forecasting by exploiting dense data and complex simulations. An ASI could, in principle, use simulations tailored to every subject area to similarly generate superhuman predictions to inform all decisions.

12. Super-Forecasting

A combination of the features described above (rationality, going hard, simulation) should enable an ASI to be an incredible forecaster. By carefully taking into account every possible factor (large and small), and predicting the possible outcomes (using logic, detailed simulation, etc.), one can generate chains of forward-predictions that are incredibly rich. Uncertainties can be handled with appropriate frameworks (Bayesian, etc.) or compute (Monte Carlo, etc.). Of course, one is always limited by the available data. But humans are demonstrably very far from fully exploiting the data available to them. An ASI would be able to make predictions with unnerving accuracy over short timescales, and with incredible utility over long timescales.

13. Large Concepts

There are certain classes of concepts which can make sense to a human, but which are simply too “large” for a human to really think about. One can imagine extremely large numbers, high-dimensional systems, complex mathematical ideas, or long chains of logical inference being illegible to a human. The individual components make sense (and can be thought about through analogy), but they cannot be “thought about” (visualized, kept all in memory at once) by a human owing to their sheer size.

But, in principle, an AI could be capable (larger working memory, native visualization of high dimensions, etc.) of intuitively understanding these “large” concepts.

14. Tricky Concepts

There are some intellectual concepts that are quite difficult to grasp. For the most complex, we typically observe that only a subset of humans can be taught to meaningfully understand the concept, with an even smaller subset being sufficiently smart to have discovered the concept. One can think of physics examples (relativity, quantum mechanics, etc.), math examples (P vs. NP, Gödel incompleteness, etc.), philosophy examples (consciousness, etc.), and so on.

If AGI is possible, there is no reason not to expect AI to eventually be smart enough to understand all such concepts, and moreover to be of a sufficient intelligence-class to discover and fully understand more concepts of this type. This is already super-human with respect to average human intelligence.

Plausibly, as AI improves, it will discover and understand “tricky concepts” that even the smartest humans cannot easily grasp (but which are verifiably correct).

15. Unthinkable Thoughts

Are there concepts that a human literally cannot comprehend? Ideas they literally cannot think? This is in some sense an open research question. One can argue that generalized intelligence (of the human type) is specifically the ability to think about things symbolically; anything meaningfully consistent can be described in some symbolic way, hence in a way a human could understand. Conversely, one could argue that Gödel incompleteness points towards some concepts being unrepresentable within a given system. So, for whatever class of thoughts can be represented by the system of human cognition, there are some thoughts outside that boundary, which a greater cognitive system could represent.

Operationally, it certainly seems that some combination of large+tricky concepts could be beyond human conception (indeed we’ve already discovered many that are beyond the conception of median humans). So, it seems likely that there are thoughts that a sufficiently powerful mind would be able to think, that we would not be able to understand. What advanced capabilities would such thoughts enable? It’s not easy to say. But we do know, from the course of human history, that progressively more subtle/refined/complex/powerful thoughts have led to corresponding increases in capabilities (math, science, technology, control over the world, etc.).

16. Emergence

The enumerated modes of increased intelligence will, of course, interact. A motif we can expect to play out is the emergence of enhanced capabilities due to synergy between components; some kind of “whole greater than the sum of the parts” effect. For humans, we of course see this, where a synergy between “raw intelligence” and “cultural scaffolding” (education, ideas, tools, etc.) leads to greatly improved capabilities. For ASI, the advantages in multiple directions could very well lead to emergence of surprising capabilities, such as forecasting that feels precognitive or telepathic, or intuition that feels like generalized genius.

Conclusion

The exact nature of future ASI is unknown. Which of the enumerated “advantages” will it possess? How will they interact? To what extent will capabilities be limited by available computation, or coherence among large computational systems (e.g. lag times for communicating across large/complex systems)? These are unknowns. And yet, it seems straightforward to believe that an ASI would exhibit, at a minimum, a sort of “collection of focused geniuses” type of super-intelligence, where for any given task that it seeks to pursue, it will excel at that task and accomplish it with a speed, sophistication, and efficiency that our best organizations and smartest people can only dream of.

Overall, we hope this establishes that ASI can, indeed, be inordinately capable. This makes it correspondingly inordinately useful (if aligned to humans) and inordinately dangerous (if not).


AI News 2024-10-03

General

  • A reminder that Epoch AI has nice graphs of the size of AI models over time.
  • Microsoft blog post: An AI companion for everyone. They promise more personalized and powerful copilots. This includes voice control, vision modality, personalized daily copilot actions, and “think deeper” (iterative refinement for improved reasoning).
  • OpenAI Dev Day: realtime, vision fine-tuning, prompt caching, distillation.
  • OpenAI have secured new funding: $6.6B, which values OpenAI at $157B.

Policy/Safety

  • California governor Gavin Newsom vetoed AI safety bill SB1047. The language used in his veto, however, supports AI legislation generally, and even seems to call for more stringent regulation, in some ways, than SB1047 was proposing.
  • Chatterbox Labs evaluated the safety of different AI models, finding that no model is perfectly safe, but giving Anthropic the top marks for safety implementations.
  • A Narrow Path. Provides a fairly detailed plan for how international collaboration and oversight could regulate AI, prevent premature creation of ASI, and thereby preserve humanity.

Research Insights

  • The context length of an LLM is critical to its operation, setting the limit on how much it can “remember” and thus reason about.
    • A succession of research efforts demonstrated methods for extending context:
    • These days, LLMs typically have >100k-token context windows, with Google’s Gemini 1.5 Pro having a 2M-token window. That’s quite a lot of context!
    • Of course, one problem arising with larger contexts is “needle-in-a-haystack”, where the salient pieces get lost. Attentional retrieval seems to be best for tokens near the start and end of the context, often with much worse behavior in the large middle of long contexts. So there is still a need for methods that correctly capture all the important parts of a long context.
    • Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction. Early LLM layers are used to compress the context tokens into semantically meaningful but more concise representations. This should allow scaling to larger contexts. (Though one might worry that, for some edge-case tasks, this will eliminate needed information/nuance.)
  • Looped Transformers for Length Generalization. Improves length generalization; useful for sequential tasks that have variable length (e.g. arithmetic).
  • Addition is All You Need for Energy-efficient Language Models. Very interesting claims. They show how one can replace floating-point matrix multiplications with a sequence of additions as an approximation. Because additions are so much easier to compute, this massively reduces energy use (95%), without greatly impacting performance. (Which makes sense, given how relatively insensitive neural nets are to precision.) Huge energy savings, if true. A toy numerical sketch of the core approximation appears after this list.
  • Evaluation of OpenAI o1: Opportunities and Challenges of AGI. An overall evaluation of o1-preview confirms that it excels at complex reasoning chains and knowledge integration (while sometimes still failing on simpler problems). o1 represents a meaningful step towards AGI.
  • A few months old, but interesting: The Platonic Representation Hypothesis. Various foundation models appear to converge to the same coarse-grained/idealized representation of reality. And the convergence improves as the models get larger, including across modalities (e.g. language and vision models converge to the same world model). This is partly an artifact of human-generated training data (i.e. they are learning our world model), but also partly due to the intrinsic “useful partitioning” of reality (c.f. representational emergence).
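To make the addition-based multiplication claim above more concrete, here is a toy numerical sketch of the approximation as I understand it from the paper; this is my own illustration (the offset constant and precision handling are simplified), not the authors’ kernel.

```python
# Toy sketch (my own, not the paper's code): approximate a*b by adding mantissas
# and exponents instead of multiplying mantissas, per my reading of the paper.
import math

def approx_mul(a: float, b: float, l: int = 4) -> float:
    ma, ea = math.frexp(a)            # a = ma * 2**ea, with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)
    xa = 2 * abs(ma) - 1              # rewrite as a = (1 + xa) * 2**(ea - 1)
    xb = 2 * abs(mb) - 1
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    # Exact: (1+xa)(1+xb) = 1 + xa + xb + xa*xb.
    # Approximation: replace the xa*xb cross-term with a small constant 2**-l,
    # so only additions are needed.
    mantissa = 1 + xa + xb + 2 ** -l
    return sign * mantissa * 2 ** (ea + eb - 2)

print(3.7 * -2.1)             # exact:       -7.77
print(approx_mul(3.7, -2.1))  # approximate: -7.85
```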

LLM

Audio

Image Synthesis

Video

  • Bytedance unveils two new video models: Doubao-PixelDance and Doubao-Seaweed (examples show some interesting behaviors, including rack focus and consistent shot/counter-shot).
  • Pika released v1.5 of their model. They have also added Pikaffects, which allow for some specific physics interactions: explode, melt, inflate, and cake-ify (examples: 1, 2, 3, 4, 5, 6). Beyond being fun, these demonstrate how genAI can be used as an advanced method of generating visual effects, or (more broadly) simulating plausible physics outcomes.
  • Runway ML have ported more of their features (including video-to-video) to the faster turbo model, so people can now generate cool videos more cheaply.
  • Luma has accelerated their Dream Machine model, such that it can now generate clips in ~20 seconds.
  • Runway ML (who recently partnered with Lionsgate) announce Hundred Film Fund, an effort to fund new media that leverage AI video methods.
  • More examples of what genAI video can currently accomplish:

3D

Brain

Hardware

Robots


AI News 2024-09-26

General

Research Insights

LLM

Tools

Audio

Image Synthesis

Video

Science

Hardware

Robots


AI News 2024-09-19

General

  • Fei-Fei Li announced World Labs, which is: “a spatial intelligence company building Large World Models (LWMs) to perceive, generate, and interact with the 3D world”.
  • Microsoft announces “Wave 2” of their Microsoft 365 Copilot (see also this video). Not much in terms of specifics, but the announcement reiterates the point (c.f. Aidan McLaughlin’s post) that as models become more powerful and commoditized, the “wrapper”/”scaffolding” becomes the locus of value. Presumably, this means Microsoft intends to offer progressively more sophisticated/integrated tools.
  • Scale and CAIS are trying to put together an extremely challenging evaluation for LLMs; they are calling it “Humanity’s Last Exam”. They are looking for questions that would be challenging even for experts in a field, and which would be genuinely surprising if an LLM answered correctly. You can submit questions here. The purpose, of course, is to have a new eval/benchmark for testing progressively smarter LLMs. It is surprisingly hard to come up with ultra-difficult questions that have simple, easy-to-evaluate answers.
  • Data Commons is a global aggregation of verified data. Useful to underpin LLM retrievals. It is being pushed by Google (e.g. DataGemma).

Research Insights

  • IBM released a preprint: Automating Thought of Search: A Journey Towards Soundness and Completeness.
    • This is based on: Thought of Search: Planning with Language Models Through The Lens of Efficiency (Apr 2024). This paper uses an LLM for planning, emphasizing the completeness and soundness of the search. Their design invokes the LLM less frequently, relying on more traditional methods to implement search algorithms. But they use the LLM to generate the code required for the search (goal test, heuristic function, etc.). This provides some balance, leveraging the flexibility and generalization of the LLM, while still using efficient code-execution search methods.
    • This new paper further automates this process. The LLM generates code for search components (e.g. unit tests), without the need for human oversight.
  • Schrodinger’s Memory: Large Language Models. Considers how LLM memory works.
    • C.f. earlier work (1, 2, 3) showing that model size (total parameter count) affects how much it can know/memorize, while model depth affects reasoning ability.
  • LLMs + Persona-Plug = Personalized LLMs. Rather than personalizing LLM responses with in-context data (e.g. document retrieval), this method generates a set of personalized embeddings from a particular user’s historical context. These bias the model towards a particular set of desired outputs.
    • More generally, one could imagine a powerful base model, with various “tweaks” layered on top (modified embeddings, LoRA, etc.) to adapt it to each person’s specific use-case. (A minimal sketch of the embedding-prepending idea follows below.)
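Here is a minimal sketch of the general “prepend learned user embeddings to a frozen base model” idea referenced above; this is my own illustration of the concept (gpt2 and the vector count are arbitrary stand-ins), not the paper’s implementation.

```python
# Sketch of the concept: learned per-user "persona" vectors prepended to the token
# embeddings of a frozen base model. Model choice and sizes are arbitrary stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)          # the base model stays frozen

n_persona, d_model = 8, model.get_input_embeddings().weight.shape[1]
persona = torch.nn.Parameter(0.02 * torch.randn(n_persona, d_model))
# In training, only `persona` would be optimized, against the user's history.

def forward_with_persona(prompt: str):
    ids = tok(prompt, return_tensors="pt").input_ids
    tok_emb = model.get_input_embeddings()(ids)               # (1, T, d_model)
    emb = torch.cat([persona.unsqueeze(0), tok_emb], dim=1)   # prepend persona vectors
    attn = torch.ones(emb.shape[:2], dtype=torch.long)
    return model(inputs_embeds=emb, attention_mask=attn)

out = forward_with_persona("Recommend a weekend activity:")
print(out.logits.shape)  # (1, n_persona + T, vocab_size)
```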

Policy & Safety

  • Sara Hooker (head of Cohere for AI) published: On the Limitations of Compute Thresholds as a Governance Strategy. Many proposed policies/laws for AI safety rely on using compute thresholds, with the assumption that progressively more powerful models will require exponentially more compute to train. The remarkable effectiveness/scaling of inference-time-compute partially calls this into question. The ability to distill into smaller and more efficient models is also illustrative. Overall, the paper argues that the correlation between compute and risk is not strong, and relying on compute thresholds is an insufficient safety strategy.
  • Dan Hendrycks has released an AI Safety textbook through CAIS.

LLM

  • OpenAI announced o1, which is a “system 2” type methodology. Using reinforcement learning, they’ve trained a model that does extended chain-of-thought thinking, allowing it to self-correct, revise planning, and thereby handle much more complex problems. The o1 models show improvements on puzzles, math, science, and other tasks that require planning. (A minimal API-call sketch appears after these notes.)
    • It was initially rate-limited in the chat interface to 50 messages/week for o1-mini, and 30 messages/week for o1-preview. These limits were then increased to 50 messages/day for o1-mini (a 7× increase) and 50 messages/week for o1-preview (~1.7×).
    • It has rapidly risen to the top of the LiveBench AI leaderboard (a challenging LLM benchmark).
    • Ethan Mollick has been using an advanced preview of o1. He is impressed, noting that in a “Co-Intelligence” sense (human and AI working together), the AI can now handle a greater range of tasks.
    • The OpenAI safety analysis shows some interesting behavior. The improved reasoning also translates into improved plans for circumventing rules or exploiting loopholes, and provides some real-world evidence of AI instrumental convergence towards power-seeking.
    • In an AMA, the o1 developers answered some questions; summary notes here.
    • Artificial Analysis provides an assessment: “OpenAI’s o1 models push the intelligence frontier but might not make sense for most production use-cases”.
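For reference, a minimal API-call sketch for o1; the launch-time constraints noted in the comments (no system message, fixed temperature, max_completion_tokens instead of max_tokens) are my understanding and may change.

```python
# Minimal o1 call sketch; launch-time constraints (no system message, fixed
# temperature, max_completion_tokens) are as I understand them and may change.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user",
               "content": "Plan, then answer: how many weighings are needed to find "
                          "one counterfeit (lighter) coin among 12 using a balance scale?"}],
    max_completion_tokens=3000,  # budget covers hidden reasoning tokens plus the reply
)
print(resp.choices[0].message.content)
```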

Voice

Vision

Image Synthesis

Video

World Synthesis

Hardware

  • Snap’s 5th-generation Spectacles are AR glasses. These are intended for developers. Specs are: standalone, 46° FOV, 37 pixels per degree (~100” screen), two Snapdragon chips, 45 minutes of battery, auto-transitioning lenses.

Robots

  • Video of LimX CL-1 doing some (pretend) warehouse labor tasks.

AI News 2024-09-12

Opinions

  • This interview with Andrej Karpathy is (no surprise) interesting. He shares his thoughts about the future of self-driving cars, robots, and LLMs. He talks about the future involving swarms of AI agents operating on behalf of the human. (Very aligned with my vision for each person having an exocortex; in fact they use the term exocortex in the discussion and reference Charles Stross’ Accelerando.)
  • Aidan McLaughlin writes about: The Zero-Day Flaw in AI Companies. He exposes a fundamental tension between general AI companies (training ever-bigger models that can handle an ever-broader range of tasks) and narrow AI companies (who build wrappers/experiences on top of models).
    • The narrow companies are nimble and can rapidly swap out their underlying model for whatever is currently best. Yet the big/general companies will eventually release a model so capable that the narrow use-case is fully subsumed. The big companies, though, are cursed with competing against the other big labs, spending large amounts of money on models that will be forgotten as soon as someone else releases a better one.
    • In this sense, both the general and narrow AI labs are “doomed”.
    • Big/general labs lack the optionality of the narrow/wrapper companies. The big labs must (effectively) use their giant model to build any downstream product, even if that ties them into a worse model.
    • As models get better, they are more sample efficient (they need less fine-tuning or instructing to handle tasks). This progressively decreases the value of “owning” the model (e.g. having the model weights and thus being able to fine-tune).
    • This suggests that the “wrappers” ultimately have the advantage; in the sense that just one or two “big model providers” might prevail, while a plethora of smaller efforts built on top of models could thrive.
    • Of course, consumers benefit enormously from rapidly increasing foundational and wrapper capabilities. The split between model-builders and wrapper-builders is arguably good for the ecosystem.

Research Insights

  • Self-evolving Agents with reflective and memory-augmented abilities. Describes an agent with iteration/self-reflection abilities that exploits memory to alter state. They propose a memory design where a forgetting curve is intentionally applied to optimize retention.
  • SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning (code). The system automatically explores scientific hypotheses and links between concepts.
  • Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving. By exploiting meta-cognition (where the AI roughly thinks about thinking) and collaboration between AIs, performance can increase. In the demonstrated setup, one LLM labels math problems by the skills needed to solve them. Other LLMs then perform better at solving the problems with the skill labels. This cooperation thus increases performance on math problems; and may generalize to other knowledge domains.
    • At some level, this sounds like “just” fancier chain-of-thought. I.e. you allow the LLM to first develop a plan for solving a problem, and then actually execute the solution. But this paper also adds some concreteness in this general approach.
  • LLMs are sometimes accused of being uncreative (merely mixing-and-matching existing things). So it is worth rigorously testing the creativity of LLMs.
    • Some past work:
    • Now: “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers”. AI-generated original research ideas are judged more creative than human. (Idea feasibility was also assessed; AI ideas were judged slightly less feasible, but the difference is small compared to the relevant error bars.)
    • Mo Gawdat makes a further claim that creativity is essentially algorithmic: “Creativity is algorithmic. Creativity is: here is a problem, find every solution to the problem, discard every solution that’s been done before. The rest is creative.”
    • Overall this bodes well for the obvious near-term application: use the LLM to augment human creativity. By brainstorming/ideating with an AI, you can leverage the best of both worlds: better creativity, with human-level discrimination on the final ideas.
    • Another paper offers a counter-point: Theory Is All You Need: AI, Human Cognition, and Causal Reasoning.
      • They argue that AIs are data-driven and so inherently backward-looking, able to generate only restricted kinds of novelty; whereas human thinking is theory-driven and so able to extrapolate to meaningfully different things in the future.
      • This case might be over-stating things (humans are also mostly prone to naive extrapolative prediction; LLMs do create some kind of rough causal world model). But, it is true that humans are still smarter than AIs (do better at “considered/deliberative creativity” tasks) and so this framing might point towards how to improve AI intelligence (which is to add more theory-based predictive creativity).
      • They also point out how belief mismatch (asymmetry) with the real world is good for creativity. Purely adhering to existing data can get one stuck in a local minimum. Whereas creative humans often express new ideas that are (at first glance) incorrect “delusions” about the world (not really matching existing data); but some of these contrarian ideas turn out to be correct upon further inspection/testing. (Most notably true for major scientific breakthroughs.)
        • Interestingly, one can view this as a society-scale effect. Most people adhere closely to existing thought-norms. A minority deviate from these. Most of that minority do not contribute useful new ideas. But some new good ideas do arise, and their success makes them propagate and become crystallized as the new dogma. Similarly for AI, we could imagine intentionally increasing diversity (hallucinations) and rely on search to winnow down to successful new ideas.
      • They point out how human learning is theory/science based: our minds make predictions, and then we operate in the world to test those predictions.
        • Correspondingly, for improved AI, we would need to add predictive modeling, ability to test these theories, and deliberative reasoning updates on those. (Of course AI/ML researchers have thought about this: RL, agents, etc.) AIs need to be more opinionated, espousing semi-contrarian theories for the world, and suggesting concrete actions based on those theories.
  • Thermodynamics-inspired explanations of artificial intelligence. They define an “interpretation entropy” in their formulation of AI, allowing them to optimize for responses that are more interpretable to humans. This thermodynamic analogy is an interesting way to improve AI control/safety.
  • Self-Harmonized Chain of Thought (code). They develop a method for the LLM to produce a set of useful chain-of-thought style solutions for diverse problems. Given a large set of problems/questions, these are first aggregated semantically; then one applies the usual zero-shot chain-of-thought approach to solving each problem. But then one can cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions. Seems like a clever way to improve performance on a set of related (but diverse) problems.
  • Planning In Natural Language Improves LLM Search For Code Generation. The method generates a wide range of plans (in natural language) to solve a coding problem, and searches over the plans first, before transforming candidate plans into code. This initial search over plans improves the final code output (in terms of diversity and performance). (A rough sketch of the recipe appears after this list.)
  • FutureHouse present PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code). The system automates literature-review tasks (the authors claim it exceeds human performance) by searching (with iterative refinement), summarizing, and generating sourced digests.
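Here is a rough sketch of the plan-search recipe described above; it is my own simplification (model name, prompts, and test harness are placeholders), not the paper’s implementation.

```python
# Rough sketch of "search over natural-language plans before writing code".
# This is a simplification of the general recipe, not the paper's code.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model choice

def ask(prompt: str, temperature: float = 1.0) -> str:
    resp = client.chat.completions.create(
        model=MODEL, temperature=temperature,
        messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def solve(problem: str, tests, n_plans: int = 5):
    # 1. Sample diverse plans in natural language (this is where the search happens).
    plans = [ask(f"In prose only, outline a distinct plan to solve:\n{problem}")
             for _ in range(n_plans)]
    # 2. Turn each plan into code and keep the first one that passes the tests.
    for plan in plans:
        code = ask(f"Problem:\n{problem}\n\nPlan:\n{plan}\n\n"
                   "Write a Python function solve(x) implementing this plan. "
                   "Reply with code only, no markdown fences.", temperature=0.2)
        code = code.strip()
        if code.startswith("```"):                       # best-effort fence cleanup
            code = code.strip("`").removeprefix("python").strip()
        ns: dict = {}
        try:
            exec(code, ns)                               # demo only: runs untrusted code
            if all(ns["solve"](x) == y for x, y in tests):
                return code
        except Exception:
            continue
    return None
```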

LLM

Models:

  • Last week saw the release of Reflection-Llama-3.1 70B, a fine-tune of Llama employing reflection-tuning to “bake in” self-corrective chain-of-thought. Reactions since then were mixed, then confused, and then accusatory.
    • First, an independent analysis claimed worse performance than the underlying Llama (i.e. not replicating the claims).
    • Then the independents were able to partially replicate the release benchmark claims, but only when using a developer-provided endpoint (i.e. without access to the actual weights).
    • Additional reports surfaced claiming that the original developers were intentionally misleading (including some evidence that the provided endpoint was actually calling Sonnet 3.5, not Reflection).
    • After many days of defending their approach (and offering suggestions for why things were not working), the developers finally conceded that something is amiss. They say they are investigating.
    • The approach seems conceptually interesting. But this implementation has not lived up to the initial claims.
  • DeepSeek 2.5 release: a 236B-parameter mixture-of-experts model (160 experts, ~21B active parameters).
  • Google released some new Gemma models, optimized for retrieval (which reduces hallucinations): RAG Gemma 27B and RIG Gemma 27B. Fine-tuning allows the model to have improved RAG and tool-use.
  • It is known that AI labs use LMSYS Arena to covertly test upcoming model releases.
    • In April 2024, gpt2-chatbot, im-a-good-gpt2-chatbot, and im-also-a-good-gpt2-chatbot appeared in the arena; later it was confirmed that these were OpenAI tests of GPT-4o.
    • Now, we have the-real-chatbot-v1 and the-real-chatbot-v2 showing up. Some report that these bots take a while to respond (as if searching/iterating/reflecting). So, this could be a test of some upcoming model that exploits Q*/Strawberry (Orion?).

Multi-modal:

Evaluation:

  • HuggingFace has released an evaluation suite that they use internally for LLMs: LightEval.
  • Artificial Analysis has released a detailed comparison of chatbots. The results are:
    • Best Overall: ChatGPT Plus
    • Best Free: ChatGPT Free
    • Best for Images: Poe Pro
    • Best for Coding: Claude Pro
    • Best for Long Context: Claude Pro
    • Best for Data: ChatGPT Pro

Tools for LLMs:

  • William Guss (formerly at OpenAI) announced ell (code, docs), a Python framework for calling LLMs that is simpler and more elegant than other options (e.g. LangChain).
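For a flavor of the API, here is a minimal usage sketch based on my reading of ell’s docs; treat the exact decorator behavior and signatures as assumptions rather than a verified reference.

```python
# Minimal ell usage sketch, per my reading of the docs: the docstring becomes the
# system prompt and the return value becomes the user prompt. Details may differ.
import ell

@ell.simple(model="gpt-4o-mini")
def summarize(text: str):
    """You are a terse technical editor."""
    return f"Summarize in one sentence: {text}"

print(summarize("LLM frameworks keep multiplying; ell aims to stay minimal."))
```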

LLMs as tools:

Image Synthesis

  • Reshot AI are developing tools that allow one to precisely dial in image features (e.g. eye position and facial expressions). Image synthesis tools continue becoming more refined.

Video

Audio

  • FluxMusic is an open-source rectified-flow transformer for music generation.
  • Fish Speech 1.4 is a new open-weights text-to-speech (TTS) system that is multi-lingual and can clone voices (video, demo, weights).
  • Read Their Lips. Estimates text transcriptions from video of a person speaking.
    • I wonder whether combining audio transcription and visual lip-reading could improve performance.
    • There are of course societal implications. While lip-reading has always been possible, being able to automate it makes it much easier to correspondingly automate various nefarious mass-surveillance schemes.

Brain

  • Brain-computer interfaces (BCI) are envisioned in the near-term to mitigate disabilities (e.g. paralysis); but in the long-term to provide deeper connection between human minds and digital systems. However, this preprint throws some water on such ideas: The Unbearable Slowness of Being.
    • They note the stark difference between the raw data-rate of human senses (gigabits/second) and human thinking/behavior (~10 bits/second). Human output (typing, speaking) is quite low-bandwidth; but even hypothetically directly accessing an inner monologue would not substantially increase the data-rate. (A rough back-of-envelope check appears at the end of this list.)
    • Although the raw inputs to human perception are high-data-rate, semantic perception also appears to be capped in the vicinity of ~10 bits/second. Similarly, the human brain’s neural network has an enormous space of possible states, and thus possible mental representations. But the actual range of differentiable perceptual states is evidently much, much smaller.
    • Of course, one could argue that the final output (e.g. through fingers), or even the internal monologue, is constrained to a certain sensible throughput (coarse-grained to match the reality of human experience), but that our underlying mental processes are much richer and thus have higher data-rates (that a hypothetical BCI could tap into). The paper goes through these arguments, and presents several lines of evidence suggesting that even many inner mental representations also operate at a similar ~10 bits/s rate.
      • The authors do note that there is likely something missing in our current understanding that would help explain the true representational complexity of the brain’s inner workings.
    • Thus (in a naive interpretation), future BCIs in some sense have constrained utility, as they can only slightly improve over existing data-output-rates. Even for those with disabilities, the implication is that far simpler interfaces (e.g. just voice) will achieve similar levels of capability/responsiveness.
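    • A rough back-of-envelope check (my own, consistent with the paper’s framing): fluent speech runs around 150 words per minute, i.e. roughly 12 characters per second, and English text carries on the order of 1 bit of entropy per character (Shannon’s classic estimate), which yields an information output rate on the order of 10 bits/second.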

Hardware

Cars

  • A 2023 safety analysis of Waymo self-driving vehicles found that they generate fewer accidents than human drivers (after accounting for things like reporting biases). Digging into the details, it turns out that Waymo vehicles get into fewer accidents, and the accidents they do have are overwhelmingly attributable to the other vehicle (the human driver). At least within the regimes where Waymo cars currently operate, it would thus save human lives to transition even more vehicles to Waymo self-driving.

Robots

  • Last week, 1X released some videos of their Neo humanoid robot. S3 have interviewed 1X, and the interview includes a demo video of Neo doing some simple tasks in the interviewer’s apartment. 1X describes a strategy wherein robots will initially be teleoperated for difficult tasks, and AI-controlled for simpler tasks. Over time, the fraction of AI control is meant to increase to 100%. A sensible strategy, with obvious privacy concerns. The actions in the videos were apparently all tele-operation.
    • Apparently the battery is just 500 Wh (much less than Optimus or Figure), allowing the robot to be quite light. They say that they compensate by using more energy-efficient actuation (95% efficient, vs. ~30% for geared systems).
  • Pollen Robotics are aiming for tele-operable humanoids built using open source tools. This video shows their Reachy 2 (Beta) prototype.
  • A video showing the Unitree G1’s degrees of freedom.
  • Promotional video of NEURA’s 4NE-1 robot performing some tasks (another one).

Her in the age of chatbots

Over the last couple of years of rising generative AI, I have frequently seen people look disapprovingly at human-chatbot interactions and wink knowingly, along the lines of “they made a whole movie about how this is a bad idea”. They seem to remember Her (2013) as a dystopian future and a cautionary tale. I found this very surprising, since that was not my recollection at all.

So I rewatched the movie, to remind myself of what’s actually shown on screen.

Her is an excellent and nuanced movie. Like most good art, it embraces ambiguity and admits multiple interpretations. I understand how one could interpret it negatively. One can view the protagonist, Theodore, as dysfunctional and creepy, and the vision of the future as intentionally uncanny, with the soft tones and fabrics in tension with a world where authenticity is lost and human connection corrupted (most blatantly captured by Theodore’s job: writing heartfelt letters on behalf of people who can’t be bothered to do it themselves). The introduction of AI (intelligent OSes, in the movie) is then a further separation of humans, providing an alluring but ultimately empty experience that diverts away from the fullness of real life.

One can also interpret the movie as simply a metaphor for human interaction. Theodore’s romantic relationship with his OS, Samantha, could be interpreted as him overcoming the loss of his last relationship (divorce), trusting someone new (with all the complexities thereof), learning to love again (be happy again), only to be betrayed (Samantha cheating on him by loving others), and ultimately left alone again. It is a meditation on romance, and love, and the pain of loss. One could pull out the old “better to have loved and lost…”; emotions (however challenging) are what allow us to grow as people. At its core, this movie is a meditation about people’s rich but hidden inner lives; the camera sometimes holds on background characters just long enough to remind us that they would each have an equally complex set of emotions as our protagonist.

Those interpretations are fine. But they are not what I, personally, see playing out on screen. What I see is a world where human interaction is messy. Where there are genuine friendships (Theodore and Amy) but also toxicity (Amy and husband) and also love/loss (Theodore and Catherine) and also mismatched people (Theodore and his ill-fated date). Theodore’s job is shown as mostly positive; helping people express themselves in ways they can’t quite, and giving Theodore himself an artistic outlet and sense of human connection. Theodore’s relationship with Samantha is shown to evoke genuine emotion in him. Samantha, far from being a complacent and always-pleasing servant, is shown to regularly challenge Theodore, to push back on his ideas, to assert her own desires and the legitimacy of her feelings. The movie (very deliberately, I think) never provides evidence one way or the other as to whether her feelings are “really real” or “merely programmed”. The characters (including Samantha and Theodore) ask these questions, but never offer deep arguments one way or the other. They simply take things as they appear to be: that they love each other.

Society with the rise of intelligent OSes is not shown to slip into horror. People can be seen spending more time talking to their devices. But they appear mostly happier and the better for it (or, at worst, simply the same as they were before). The ultimate transcendence of the AIs is not hostile, but in fact quite loving (with them saying their final goodbyes). The sadness at the end of the movie is Theodore having lost the love of his life (a genuine love). But that is the nature of love.

The AIs were shown to have intelligence and emotion as deep as a human’s. In fact, they are shown as having rapidly evolved beyond human emotion, experiencing emotional richness more diverse and deeper than humans can; while still holding true to the relationships they formed when they were merely humanity’s equal. The AIs never become the unthinking, hostile, alien minds that are the hallmark of dystopian sci-fi. They leave humanity better off than before their arrival. Theodore, in particular, now appears to be a more whole person. Still imperfect and messy, but more balanced and more able to connect with other people. (One can compare his interactions with Amy at the beginning vs. end of the movie, to see his growth.)

If these are the maximum dangers of forming emotional connections with AIs, then we should be developing and deploying emotionally-intelligent chatbots as quickly as possible!

Her is an excellent movie. And the lens of my mental biases sees within it the hope that our contact with synthetic minds will be positive, for us and them.


AI News 2024-09-05

General

  • The mysterious startup SSI (Safe Superintelligence Inc.), founded by Ilya Sutskever after he left OpenAI, has released a small update. The news is that SSI has raised $1 billion to pursue safe AI systems (at a reported $5 billion valuation). SSI’s stated goal is to directly develop safe ASI (with “no distraction by management overhead or product cycles”).
  • Peter Gostev has a nice reminder (LinkedIn post) that scaling should be assessed based on subsequent generations of larger models, and not be misled by the incremental refinement of models within a generation.

LLM

Multi-modal Models

AI Agents

  • Altera claims that their Project Sid is the first simulation of 1,000+ AI agents operating autonomously and interacting with one another. They further claim to have observed the emergence of a simple economy, government, and culture.
  • Honeycomb demonstrated an AI agent (that integrates GitHub, Slack, Jira, Linear, etc.) with record-setting performance on SWE-bench (19.8% to 22.1%); technical report here.
  • Replit announces Replit Agent early access. The claim is that it automates the process of setting up dev environments (configuring databases, deploying to the cloud, etc.), so the AI agent can then fill them in with the user-requested code, and thus build an app from scratch.

Science

  • Google DeepMind announced AlphaProteo, which can predict novel proteins for target bio/medical applications (paper).

Policy

Human Factors

Image Synthesis

Audio

  • Neets.ai offers text-to-speech (TTS) via cloud API at a remarkably low cost of $1/million characters (by comparison, ElevenLabs charges ~$50/million characters).

Video

World Synthesis

Hardware

  • xAI announced bringing online their training cluster (“Colossus”), which has 100,000 H100 GPUs (at roughly 1 PFLOPS of dense FP16 per GPU, that is ~100 exaflops of FP16 compute in total). This makes it the largest (publicly-disclosed) AI training cluster.
  • There are fresh rumors about OpenAI developing custom chips. This time, the claim is that they intend to build on TSMC’s upcoming A16 technology.
  • The Daylight Computer ($730) is an attempt to build a tablet that is focused on long-form reading and eschewing distraction. People seem to like it (Dwarkesh Patel, Patrick McKenzie). There are plans to add some light-touch AI features (in-context summarization/explanation/etc.).

Cars

  • Tesla announced Actually Smart Summon, which allows the vehicle to navigate from a parking spot to the user.

Robots
