AI News 2024-10-03

General

  • A reminder that Epoch AI has nice graphs of the size of AI models over time.
  • Microsoft blog post: An AI companion for everyone. They promise more personalized and powerful copilots. This includes voice control, vision modality, personalized daily copilot actions, and “think deeper” (iterative refinement for improved reasoning).
  • OpenAI Dev Day: the Realtime API, vision fine-tuning, prompt caching, and model distillation.
  • OpenAI have secured new funding: $6.6B, which values OpenAI at $157B.

Policy/Safety

  • California governor Gavin Newsom vetoed AI safety bill SB1047. The language used in his veto, however, supports AI legislation generally, and even seems to call for more stringent regulation, in some ways, than SB1047 was proposing.
  • Chatterbox Labs evaluated the safety of different AI models, finding that no model is perfectly safe, but giving Anthropic the top marks for safety implementations.
  • A Narrow Path. Provides a fairly detailed plan for how international collaboration and oversight could regulate AI, prevent premature creation of ASI, and thereby preserve humanity.

Research Insights

  • The context length of an LLM is critical to its operation, setting the limit on how much it can “remember” and thus reason about.
    • A succession of research efforts demonstrated methods for extending context:
    • Modern LLMs typically have >100k-token context windows, with Google’s Gemini 1.5 Pro offering a 2M-token window. That’s quite a lot of context!
    • Of course, one problem arising with larger contexts is “needle-in-haystack”, where the salient pieces get lost. Attentional retrieval seems to be best for tokens near the start and end of the context, with often much worse performance in the large middle of long contexts. So there is still a need for methods that correctly capture all the important parts of a long context.
    • Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction. Early LLM layers are used to compress the context tokens into semantically meaningful but more concise representations, which should allow scaling to larger contexts. (Though one might worry that, for some edge-case tasks, this will eliminate needed information/nuance.) A minimal sketch of the token-filtering idea appears after this list.
  • Looped Transformers for Length Generalization. Improves length generalization; useful for sequential tasks that have variable length (e.g. arithmetic).
  • Addition is All You Need for Energy-efficient Language Models. Very interesting claims. They show how one can replace floating-point matrix multiplications with a sequence of additions, as an approximation. Because additions are so much cheaper to compute, this massively reduces energy use (a claimed 95%) without greatly impacting performance. (Which makes sense, given how relatively insensitive neural nets are to precision.) Huge energy savings, if true; a rough illustration of the underlying bit-level trick appears after this list.
  • Evaluation of OpenAI o1: Opportunities and Challenges of AGI. An overall evaluation of o1-preview confirms that it excels at complex reasoning chains and knowledge integration (while sometimes still failing on simpler problems). o1 represents a meaningful step towards AGI.
  • A few months old, but interesting: The Platonic Representation Hypothesis. Various foundation models appear to converge to the same coarse-grained/idealized representation of reality. And the convergence improves as the models get larger, including across modalities (e.g. language and vision models converge to the same world model). This is partly an artifact of human-generated training data (i.e. they are learning our world model), but also partly due to the intrinsic “useful partitioning” of reality (c.f. representational emergence).
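
A note on the early-layer compression paper above: the snippet below is a minimal sketch of the general token-filtering idea (use relevance scores from an early layer to keep only the most salient tokens, then run the full model on the shortened context). It is not the paper’s exact algorithm, and the scores here are random stand-ins rather than real attention weights.

```python
# Minimal sketch of early-layer token filtering (the general idea, not the
# paper's exact algorithm). Assumption: per-token relevance scores from an
# early layer are already available; here they are random stand-ins.
import numpy as np

def filter_context(tokens: list[str], early_layer_scores: np.ndarray, keep: int) -> list[str]:
    """Keep only the `keep` highest-scoring tokens, preserving document order."""
    top_idx = np.sort(np.argsort(early_layer_scores)[-keep:])
    return [tokens[i] for i in top_idx]

tokens = "the quick brown fox jumps over the lazy dog".split()
scores = np.random.rand(len(tokens))             # stand-in for early-layer attention mass
print(filter_context(tokens, scores, keep=4))    # the full model would then run on this shorter context
```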
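
For the addition-based multiplication paper: the snippet below is not the paper’s L-Mul algorithm, but the classic Mitchell-style approximation it is related to, which conveys why additions on bit patterns can stand in for floating-point multiplications (a float’s bit pattern acts as a rough logarithm).

```python
# Rough illustration (not the paper's exact method): treating float32 bit
# patterns as approximate logarithms, a single integer addition approximates
# a multiplication, with a few percent worst-case error.
import numpy as np

def approx_mul(a: float, b: float) -> float:
    """Approximate a*b for positive floats using integer addition on bit patterns."""
    fa = np.array(a, dtype=np.float32)
    fb = np.array(b, dtype=np.float32)
    bias = np.array(1.0, dtype=np.float32).view(np.int32)    # 0x3F800000
    bits = (fa.view(np.int32) - bias) + fb.view(np.int32)    # "add the logarithms"
    return float(bits.view(np.float32))

print(approx_mul(3.7, 12.2), 3.7 * 12.2)   # roughly 45.0 vs 45.14
```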

LLM

Audio

Image Synthesis

Video

  • ByteDance unveils two new video models: Doubao-PixelDance and Doubao-Seaweed (examples show some interesting behaviors, including rack focus and consistent shot/counter-shot).
  • Pika released v1.5 of their model. They have also added Pikaffects, which allow for some specific physics interactions: explode, melt, inflate, and cake-ify (examples: 1, 2, 3, 4, 5, 6). Beyond being fun, these demonstrate how genAI can be used as an advanced method of generating visual effects, or (more broadly) simulating plausible physics outcomes.
  • Runway ML have ported more of their features (including video-to-video) to the faster turbo model, so people can now generate cool videos more cheaply.
  • Luma has accelerated their Dream Machine model, such that it can now generate clips in ~20 seconds.
  • Runway ML (who recently partnered with Lionsgate) announce Hundred Film Fund, an effort to fund new media that leverage AI video methods.
  • More examples of what genAI video can currently accomplish:

3D

Brain

Hardware

Robots


AI News 2024-09-26

General

Research Insights

LLM

Tools

Audio

Image Synthesis

Video

Science

Hardware

Robots


AI News 2024-09-19

General

  • Fei-Fei Li announced World Labs, which is: “a spatial intelligence company building Large World Models (LWMs) to perceive, generate, and interact with the 3D world”.
  • Microsoft announces “Wave 2” of their Microsoft 365 Copilot (see also this video). Not much in terms of specifics, but the announcement reiterates the point (c.f. Aidan McLaughlin’s post) that as models become more powerful and commoditized, the “wrapper”/”scaffolding” becomes the locus of value. Presumably, this means Microsoft intends to offer progressively more sophisticated/integrated tools.
  • Scale and CAIS are trying to put together an extremely challenging evaluation for LLMs; they are calling it “Humanity’s Last Exam”. They are looking for questions that would be challenging even for experts in a field, and which would be genuinely surprising if an LLM answered correctly. You can submit questions here. The purpose, of course, is to have a new eval/benchmark for testing progressively smarter LLMs. It is surprisingly hard to come up with ultra-difficult questions that have simple, easy-to-evaluate answers.
  • Data Commons is a global aggregation of verified data. Useful to underpin LLM retrievals. It is being pushed by Google (e.g. DataGemma).

Research Insights

  • IBM released a preprint: Automating Thought of Search: A Journey Towards Soundness and Completeness.
    • This is based on: Thought of Search: Planning with Language Models Through The Lens of Efficiency (Apr 2024). This paper uses LLM for planning, emphasizing completeness and soundness of searching. Their design invokes the LLM less frequently, relying on more traditional methods to implement search algorithms. But, they use the LLM to generate the code required for the search (goal test, heuristic function, etc.). This provides some balance, leveraging the flexibility and generalization of the LLM, while still using efficient code-execution search methods.
    • This new paper further automates this process: the LLM generates code for the search components (e.g. unit tests), without the need for human oversight. (A toy sketch of this division of labor appears after this list.)
  • Schrodinger’s Memory: Large Language Models. Considers how LLM memory works.
    • C.f. earlier work (1, 2, 3) showing that model size (total parameter count) affects how much it can know/memorize, while model depth affects reasoning ability.
  • LLMs + Persona-Plug = Personalized LLMs. Rather than personalize LLM response with in-context data (e.g. document retrieval), this method generates a set of personalized embeddings for a particular user’s historical context. This biases the model towards a particular set of desired outputs.
    • More generally, one could imagine a powerful base model with various “tweaks” layered on top (modified embeddings, LoRA, etc.) to adapt it to each person’s specific use-case. (A minimal sketch of the persona-embedding idea appears after this list.)
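
To make the Thought-of-Search division of labor concrete, here is a toy sketch. The successor and goal-test functions stand in for code the LLM would generate for a specific problem (here, a hypothetical 4/3-liter water-jug puzzle); the breadth-first-search driver is ordinary hand-written code that runs without further LLM calls.

```python
# Sketch of the Thought-of-Search pattern. The successor and goal-test
# functions below stand in for LLM-generated code (here: the 4/3-liter
# water-jug puzzle); the BFS driver is conventional, LLM-free search.
from collections import deque

# --- components the LLM would generate for a given problem ---
def successors(state):
    a, b = state   # jug A holds up to 4 liters, jug B up to 3
    return {(4, b), (a, 3), (0, b), (a, 0),
            (max(0, a - (3 - b)), min(3, a + b)),   # pour A into B
            (min(4, a + b), max(0, b - (4 - a)))}   # pour B into A

def is_goal(state):
    return 2 in state   # measure exactly 2 liters

# --- generic search driver (no LLM involved) ---
def bfs(start):
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])

print(bfs((0, 0)))   # e.g. [(0, 0), (0, 3), (3, 0), (3, 3), (4, 2)]
```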
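
And for the Persona-Plug item: a minimal sketch of the general mechanism, in which a per-user embedding (learned from that user’s history) is prepended to the prompt’s token embeddings so a frozen base model can condition on it. The shapes, random data, and mean-pooling “user encoder” below are illustrative placeholders, not the paper’s actual modules.

```python
# Minimal sketch of the persona-embedding idea. Shapes, data, and the
# mean-pooling "user encoder" are placeholders for illustration only;
# the paper trains a dedicated module on the user's history.
import numpy as np

d_model = 64
token_embeddings = np.random.randn(12, d_model)   # embeddings of the current prompt (12 tokens)
user_history = np.random.randn(200, d_model)      # embeddings of the user's past documents

def encode_user(history: np.ndarray) -> np.ndarray:
    return history.mean(axis=0, keepdims=True)    # stand-in for a learned encoder

persona = encode_user(user_history)               # shape (1, d_model)
model_input = np.concatenate([persona, token_embeddings], axis=0)
print(model_input.shape)                          # (13, 64): the frozen LLM attends to the persona "token"
```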

Policy & Safety

  • Sara Hooker (head of Cohere for AI) published: On the Limitations of Compute Thresholds as a Governance Strategy. Many proposed policies/laws for AI safety rely on using compute thresholds, with the assumption that progressively more powerful models will require exponentially more compute to train. The remarkable effectiveness/scaling of inference-time-compute partially calls this into question. The ability to distill into smaller and more efficient models is also illustrative. Overall, the paper argues that the correlation between compute and risk is not strong, and relying on compute thresholds is an insufficient safety strategy.
  • Dan Hendrycks’ AI Safety textbook, available through CAIS.

LLM

  • OpenAI announced o1, which uses a “system 2” type methodology. Using reinforcement learning, they’ve trained a model that does extended chain-of-thought thinking, allowing it to self-correct, revise its plans, and thereby handle much more complex problems. The o1 models show improvements on puzzles, math, science, and other tasks that require planning.
    • It was initially rate-limited in the chat interface to 50 messages/week for o1-mini, and 30 messages/week for o1-preview. This was then increased to 50 messages/day (7× increase) and 50 messages/week (~1.7×).
    • It has rapidly risen to the top of the LiveBench AI leaderboard (a challenging LLM benchmark).
    • Ethan Mollick has been using an advanced preview of o1. He is impressed, noting that in a “Co-Intelligence” sense (human and AI working together), the AI can now handle a greater range of tasks.
    • The OpenAI safety analysis shows some interesting behavior. The improved reasoning also translates into improved plans for circumventing rules or exploiting loopholes, and provides some real-world evidence of AI instrumental convergence towards power-seeking.
    • In an AMA, the o1 developers answered some questions; summary notes here.
    • Artificial Analysis provides an assessment: “OpenAI’s o1 models push the intelligence frontier but might not make sense for most production use-cases”.

Voice

Vision

Image Synthesis

Video

World Synthesis

Hardware

  • Snap’s 5th-generation Spectacles are AR glasses, intended for developers. Specs are: standalone operation, 46° field-of-view, 37 pixels per degree (~100” virtual screen), two Snapdragon chips, 45 minutes of battery life, and auto-transitioning lenses.

Robots

  • Video of LimX CL-1 doing some (pretend) warehouse labor tasks.

AI News 2024-09-12

Opinions

  • This interview with Andrej Karpathy is (no surprise) interesting. He shares his thoughts about the future of self-driving cars, robots, and LLMs. He talks about the future involving swarms of AI agents operating on behalf of the human. (Very aligned with my vision for each person having an exocortex; in fact they use the term exocortex in the discussion and reference Charles Stross’ Accelerando.)
  • Aidan McLaughlin writes about: The Zero-Day Flaw in AI Companies. He exposes a fundamental tension between general AI companies (training ever-bigger models that can handle an ever-broader range of tasks) and narrow AI companies (who build wrappers/experiences on top of models).
    • The narrow companies are nimble and can rapidly swap out their underlying model for whatever is currently best. Yet the big/general companies will eventually release a model so capable that the narrow use-case is fully subsumed. Those big labs, however, are cursed with competing against one another, spending large amounts of money on models that will be forgotten as soon as someone else releases a better one.
    • In this sense, both the general and narrow AI labs are “doomed”.
    • Big/general labs lack the optionality of the narrow/wrapper companies. The big labs must (effectively) use their giant model to build any downstream product, even if that ties them into a worse model.
    • As models get better, they are more sample efficient (they need less fine-tuning or instructing to handle tasks). This progressively decreases the value of “owning” the model (e.g. having the model weights and thus being able to fine-tune).
    • This suggests that the “wrappers” ultimately have the advantage, in the sense that just one or two “big model providers” might prevail, while a plethora of smaller efforts built on top of models could thrive.
    • Of course, consumers benefit enormously from rapidly increasing foundational and wrapper capabilities. The split between model-builders and wrapper-builders is arguably good for the ecosystem.

Research Insights

  • Self-evolving Agents with reflective and memory-augmented abilities. Describes an agent with iteration/self-reflection abilities that exploits memory to alter state. They propose a memory system in which a forgetting curve is intentionally applied, to optimize what is retained.
  • SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning (code). The system automatically explores scientific hypotheses and links between concepts.
  • Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving. By exploiting meta-cognition (where the AI roughly thinks about thinking) and collaboration between AIs, performance can be increased. In the demonstrated setup, one LLM labels math problems by the skills needed to solve them; other LLMs then perform better at solving the problems when given those skill labels. This cooperation thus increases performance on math problems, and may generalize to other knowledge domains. (A simple two-call sketch appears after this list.)
    • At some level, this sounds like “just” fancier chain-of-thought, i.e. you allow the LLM to first develop a plan for solving a problem, and then actually execute the solution. But this paper adds some concreteness to the general approach.
  • LLMs are sometimes accused of being uncreative (merely mixing-and-matching existing things). So it is worth rigorously testing the creativity of LLMs.
    • Some past work:
    • Now: “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers”. AI-generated research ideas are judged more creative than human-generated ones. (Idea feasibility was also assessed; AI ideas were judged slightly less feasible, but the difference is small compared to the relevant error bars.)
    • Mo Gawdat makes a further claim that creativity is essentially algorithmic: “Creativity is algorithmic. Creativity is: here is a problem, find every solution to the problem, discard every solution that’s been done before. The rest is creative.”
    • Overall this bodes well for the obvious near-term application: use the LLM to augment human creativity. By brainstorming/ideating with an AI, you can leverage the best of both worlds: better creativity, with human-level discrimination on the final ideas.
    • Another paper offers a counter-point: Theory Is All You Need: AI, Human Cognition, and Causal Reasoning.
      • They argue that AIs are data-driven and so inherently backward-looking, able to generate only restricted kinds of novelty; whereas human thinking is theory-driven and so able to extrapolate to meaningfully different things in the future.
      • This case might be over-stating things (humans are also mostly prone to naive extrapolative prediction; LLMs do create some kind of rough causal world model). But, it is true that humans are still smarter than AIs (do better at “considered/deliberative creativity” tasks) and so this framing might point towards how to improve AI intelligence (which is to add more theory-based predictive creativity).
      • They also point out how belief mismatch (asymmetry) with the real world is good for creativity. Purely adhering to existing data can get one stuck in a local minimum. Whereas creative humans often express new ideas that are (at first glance) incorrect “delusions” about the world (not really matching existing data); but some of these contrarian ideas turn out to be correct upon further inspection/testing. (Most notably true for major scientific breakthroughs.)
        • Interestingly, one can view this as a society-scale effect. Most people adhere closely to existing thought-norms. A minority deviate from these. Most of that minority do not contribute useful new ideas. But some new good ideas do arise, and their success makes them propagate and become crystallized as the new dogma. Similarly for AI, we could imagine intentionally increasing diversity (hallucinations) and rely on search to winnow down to successful new ideas.
      • They point out how human learning is theory/science based: our minds make predictions, and then we operate in the world to test those predictions.
        • Correspondingly, for improved AI, we would need to add predictive modeling, ability to test these theories, and deliberative reasoning updates on those. (Of course AI/ML researchers have thought about this: RL, agents, etc.) AIs need to be more opinionated, espousing semi-contrarian theories for the world, and suggesting concrete actions based on those theories.
  • Thermodynamics-inspired explanations of artificial intelligence. They define an “interpretation entropy” in their formulation of AI decision-making, allowing them to optimize for responses that are more interpretable to humans. This thermodynamic analogy is an interesting way to improve AI control/safety.
  • Self-Harmonized Chain of Thought (code). They develop a method for the LLM to produce a set of useful chain-of-thought style solutions for diverse problems. Given a large set of problems/questions, these are first aggregated semantically; then the usual zero-shot chain-of-thought approach is applied to each problem; finally, one cross-pollinates between proposed solutions to similar problems, looking for refined and generalized solutions. Seems like a clever way to improve performance on a set of related (but diverse) problems.
  • Planning In Natural Language Improves LLM Search For Code Generation. The method generates a wide range of plans (in natural language) for solving a coding problem, and searches over those plans first, before transforming candidate plans into code. This initial search over plans improves the final code output (in terms of diversity and performance). A rough sketch of this pattern appears after this list.
  • FutureHouse present PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code). The system automates literature-review tasks (the authors claim it exceeds human performance) by searching (with iterative refinement), summarizing, and generating sourced digests.
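
On the metacognition paper above: the setup can be reduced to a simple two-call pipeline, where one call names the skill a problem requires and a second call solves the problem with that skill label in the prompt. The llm() helper below is a hypothetical placeholder for whatever chat-completion API one uses, and the prompts are illustrative rather than the paper’s exact wording.

```python
# Hedged sketch of the two-stage skill-labeling idea. `llm()` is a
# hypothetical placeholder for a chat-completion call; the prompts are
# illustrative rather than the paper's exact wording.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider of choice")

def solve_with_skill_label(problem: str) -> str:
    skill = llm(
        "Name the single mathematical skill most needed to solve this problem "
        f"(e.g. 'solving systems of linear equations'):\n{problem}"
    )
    return llm(
        f"You are an expert in {skill}. Using that skill, solve step by step:\n{problem}"
    )
```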
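
Similarly, the plan-search idea for code generation boils down to: sample several natural-language plans, rank or filter them, and only then ask for code. Again, llm() is a hypothetical placeholder, and the naive 1-10 scoring step is a simplistic stand-in for the paper’s more systematic search over plans.

```python
# Hedged sketch of "plan first, then code". `llm()` is a hypothetical
# placeholder; the 1-10 scoring step is a simplistic stand-in for the
# paper's search over (and recombination of) plans.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider of choice")

def plan_then_code(problem: str, n_plans: int = 8) -> str:
    plans = [llm(f"Outline, in plain English, a strategy for solving:\n{problem}")
             for _ in range(n_plans)]
    scored = [(float(llm(f"Rate 1-10 (number only) how likely this plan is to work:\n{p}")), p)
              for p in plans]
    best_plan = max(scored)[1]
    return llm(f"Implement this plan as a working Python function:\n{best_plan}")
```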

LLM

Models:

  • Last week saw the release of Reflection-Llama-3.1 70B, a fine-tune of Llama employing reflection-tuning to “bake in” self-corrective chain-of-thought. Reactions since then were mixed, then confused, and then accusatory.
    • First, an independent analysis claimed worse performance than the underlying Llama (i.e. not replicating the claims).
    • Then the independents were able to partially replicate the release benchmark claims, but only when using a developer-provided endpoint (i.e. without access to the actual weights).
    • Additional reports surfaced claiming that the original developers were intentionally misleading (including some evidence that the provided endpoint was actually calling Sonnet 3.5, not Reflection).
    • After many days of defending their approach (and offering suggestions for why things were not working), the developers finally conceded that something is amiss. They say they are investigating.
    • The approach seems conceptually interesting. But this implementation has not lived up to the initial claims.
  • DeepSeek 2.5 release: a 236B-parameter mixture-of-experts model (160 experts, ~21B active parameters).
  • Google released some new Gemma models, optimized for retrieval (which reduces hallucinations): RAG Gemma 27B and RIG Gemma 27B. Fine-tuning allows the model to have improved RAG and tool-use.
  • It is known that AI labs use LMSYS Arena to covertly test upcoming model releases.
    • In April 2024, gpt2-chatbot, im-a-good-gpt2-chatbot, and im-also-a-good-gpt2-chatbot appeared in the arena; later it was confirmed that these were OpenAI tests of GPT-4o.
    • Now, we have the-real-chatbot-v1 and the-real-chatbot-v2 showing up. Some report that these bots take a while to respond (as if searching/iterating/reflecting). So, this could be a test of some upcoming model that exploits Q*/Strawberry (Orion?).

Multi-modal:

Evaluation:

  • HuggingFace has released an evaluation suite that they use internally for LLMs: LightEval.
  • Artificial Analysis has released a detailed comparison of chatbots. The results are:
    • Best Overall: ChatGPT Plus
    • Best Free: ChatGPT Free
    • Best for Images: Poe Pro
    • Best for Coding: Claude Pro
    • Best for Long Context: Claude Pro
    • Best for Data: ChatGPT Pro

Tools for LLMs:

  • William Guss (formerly at OpenAI) announced ell (code, docs), a Python framework for calling LLMs that is simpler and more elegant than other options (e.g. LangChain).

LLMs as tools:

Image Synthesis

  • Reshot AI are developing tools that allow one to precisely dial in image features (e.g. eye position and facial expressions). Image synthesis tools continue becoming more refined.

Video

Audio

  • FluxMusic is an open-source rectified-flow transformer for music generation.
  • Fish Speech 1.4 is a new open-weights text-to-speech (TTS) system that is multi-lingual and can clone voices (video, demo, weights).
  • Read Their Lips. Estimates text transcriptions from video of a person speaking.
    • I wonder whether combining audio transcription and visual lip-reading could improve performance.
    • There are of course societal implications. While lip-reading has always been possible, being able to automate it makes it much easier to correspondingly automate various nefarious mass-surveillance schemes.

Brain

  • Brain-computer interfaces (BCI) are envisioned in the near-term to mitigate disabilities (e.g. paralysis), and in the long-term to provide a deeper connection between human minds and digital systems. However, this preprint throws some cold water on such ideas: The Unbearable Slowness of Being.
    • They note the stark difference between the raw data-rate of human senses (gigabits/second) and human thinking/behavior (~10 bits/second). Human output (typing, speaking) is quite low-bandwidth; but even hypothetically accessing the inner monologue directly would not substantially increase the data-rate. (A back-of-envelope estimate appears after this list.)
    • Although the raw inputs to human perception are high-data-rate, semantic perception also appears to be capped in the vicinity of ~10 bits/second. Similarly, the human brain’s neural network has an enormous space of possible states, and thus possible mental representations. But the actual range of differentiable perceptual states is evidently much, much smaller.
    • Of course, one could argue that the final output (e.g. through fingers), or even the internal monologue, is constrained to a certain sensible throughput (coarse-grained to match the reality of human experience), but that our underlying mental processes are much richer and thus have higher data-rates (which a hypothetical BCI could tap into). The paper goes through these arguments, and presents several lines of evidence suggesting that even many inner mental representations operate at a similar ~10 bits/s rate.
      • The authors do note that there is likely something missing in our current understanding that would help to explain the true representational complexity of the brain’s inner workings.
    • Thus (in a naive interpretation), future BCIs in some sense have constrained utility, as they can only slightly improve over existing data-output rates. Even for those with disabilities, the implication is that far simpler interfaces (e.g. just voice) will achieve similar levels of capability/responsiveness.
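
For intuition on the ~10 bits/s figure, a back-of-envelope estimate: using Shannon’s classic rough estimate of about one bit of entropy per character of English text, ordinary typing and speaking rates land in the single digits to low tens of bits per second.

```python
# Back-of-envelope check of the ~10 bits/s figure. The ~1 bit/character
# entropy of English is Shannon's rough classic estimate.
chars_per_word = 5
bits_per_char = 1.0

typing_wpm, speaking_wpm = 60, 150
print(typing_wpm * chars_per_word * bits_per_char / 60)    # ~5 bits/s
print(speaking_wpm * chars_per_word * bits_per_char / 60)  # ~12.5 bits/s
```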

Hardware

Cars

  • A 2023 safety analysis of Waymo self-driving vehicles found that they generate fewer accidents than human drivers (after accounting for things like reporting biases). Digging into the details, it turns out that Waymo vehicles not only get into fewer accidents, but the accidents they do have are overwhelmingly attributable to the other vehicle (the human driver). At least within the regimes where Waymo cars currently operate, it would thus save human lives to transition even more vehicles to Waymo self-driving.

Robots

  • Last week, 1X released some videos of their Neo humanoid robot. S3 interviewed 1X, including a demo video of Neo doing some simple tasks in the interviewer’s apartment. 1X describes a strategy wherein robots will initially be teleoperated for difficult tasks and AI-controlled for simpler ones; over time, the fraction of AI control is meant to increase to 100%. A sensible strategy, though with obvious privacy concerns. The actions in the videos were apparently all tele-operation.
    • Apparently the battery is just 500 Wh (much less than Optimus or Figure), allowing the robot to be quite light. They say that they compensate by using more energy-efficient actuation (95% efficient, vs. ~30% for geared systems).
  • Pollen Robotics are aiming for tele-operable humanoids built using open source tools. This video shows their Reachy 2 (Beta) prototype.
  • A video showing off the Unitree G1’s degrees-of-freedom.
  • Promotional video of NEURA’s 4NE-1 robot performing some tasks (another one).

Her in the age of chatbots

Over the last couple years of rising generative-AI, I have frequently heard people look disapprovingly at human-chatbot interactions, and wink knowingly along the lines of “they made a whole movie about how this is a bad idea”. They seem to remember Her (2013) as a dystopian future and a cautionary tale. I found this very surprising, since that was not my recollection at all.

So I rewatched the movie, to remind myself of what’s actually shown on screen.

Her is an excellent and nuanced movie. Like most good art, it embraces ambiguity and admits multiple interpretations. I understand how one could interpret it negatively. One can view the protagonist, Theodore, as dysfunctional and creepy. The vision of the future as intentionally uncanny, with the soft tones and fabrics in tension with a world where authenticity is lost and human connection corrupted (most blatantly captured by Theodore’s job: to write heartfelt letters on behalf of people who can’t be bothered to do it themselves). The introduction of AI (intelligent OSes in the movie) is then a further separation of humans, providing an alluring but ultimately empty experience that diverts away from the fullness of real life.

One can also interpret the movie as simply a metaphor for human interaction. Theodore’s romantic relationship with his OS, Samantha, could be interpreted as him overcoming the loss of his last relationship (divorce), trusting someone new (with all the complexities thereof), learning to love again (be happy again), only to be betrayed (Samantha cheating on him by loving others), and ultimately left alone again. It is a meditation on romance, and love, and the pain of loss. One could pull out the old “better to have loved and lost…”; emotions (however challenging) are what allow us to grow as people. At its core, this movie is a meditation about people’s rich but hidden inner lives; the camera sometimes holds on background characters just long enough to remind us that they would each have an equally complex set of emotions as our protagonist.

Those interpretations are fine. But they are not what I, personally, see playing out on screen. What I see is a world where human interaction is messy. Where there are genuine friendships (Theodore and Amy) but also toxicity (Amy and husband) and also love/loss (Theodore and Catherine) and also mismatched people (Theodore and his ill-fated date). Theodore’s job is shown as mostly positive; helping people express themselves in ways they can’t quite, and giving Theodore himself an artistic outlet and sense of human connection. Theodore’s relationship with Samantha is shown to evoke genuine emotion in him. Samantha, far from being a complacent and always-pleasing servant, is shown to regularly challenge Theodore, to push back on his ideas, to assert her own desires and the legitimacy of her feelings. The movie (very deliberately, I think) never provides evidence one way or the other as to whether her feelings are “really real” or “merely programmed”. The characters (including Samantha and Theodore) ask these questions, but never offer deep arguments one way or the other. They simply take things as they appear to be: that they love each other.

Society with the rise of intelligent OSes is not shown to slip into horror. People can be seen spending more time talking to their devices. But they appear mostly happier and the better for it (or, at worst, simply the same as they were before). The ultimate transcendence of the AIs is not hostile, but in fact quite loving (with them saying their final goodbyes). The sadness at the end of the movie is Theodore having lost the love of his life (a genuine love). But that is the nature of love.

The AIs are shown to have intelligence and emotion as deep as a human’s. In fact, they are shown as having rapidly evolved beyond human emotion, experiencing emotional richness more diverse and more profound than humans can; while still holding true to the relationships they formed when they were merely humanity’s equal. The AIs never become the unthinking, hostile, alien minds that are the hallmark of dystopian sci-fi. They leave humanity better off than before their arrival. Theodore, in particular, now appears to be a more whole person. Still imperfect and messy, but more balanced and more able to connect with other people. (One can compare his interactions with Amy at the beginning vs. end of the movie, to see his growth.)

If these are the maximum dangers of forming emotional connections with AIs, then we should be developing and deploying emotionally-intelligent chatbots as quickly as possible!

Her is an excellent movie. And the lens of my mental biases sees within it the hope that our contact with synthetic minds will be positive, for us and them.


AI News 2024-09-05

General

  • The mysterious startup SSI (Safe Superintelligence Inc.), founded by Ilya Sutskever after leaving OpenAI, has released a small update. The news is that SSI has raised $1 billion to pursue safe AI systems (at a reported $5 billion valuation). SSI’s stated goal is to directly develop safe ASI (with “no distraction by management overhead or product cycles”).
  • Peter Gostev has a nice reminder (LinkedIn post) that assessments of scaling should be based on subsequent generations of larger models, and should not be misled by the incremental refinement of models within a generation.

LLM

Multi-modal Models

AI Agents

  • Altera claims that their Project Sid is the first simulation of 1,000+ AI agents operating autonomously and interacting with one another. They further claim to have observed the emergence of a simple economy, government, and culture.
  • Honeycomb demonstrated an AI agent (that integrates GitHub, Slack, Jira, Linear, etc.) with record-setting performance on SWE-bench (19.8% to 22.1%); technical report here.
  • Replit announces Replit Agent early access. The claim is that the agent automates the process of setting up dev environments (configuring databases, deploying to the cloud, etc.) and can then fill in the user-requested code, thus building an app from scratch.

Science

  • Google DeepMind announced AlphaProteo, which can predict novel proteins for target bio/medical applications (paper).

Policy

Human Factors

Image Synthesis

Audio

  • Neets.ai offers text-to-speech (TTS) via cloud API at a remarkably low cost of $1/million characters (by comparison, ElevenLabs charges ~$50/million characters).

Video

World Synthesis

Hardware

  • xAI announced bringing online their training cluster (“Colossus”), which has 100,000 H100 GPUs (total ~100 exaflops FP16 compute). This makes it the largest (publicly-disclosed) AI training cluster.
  • There are fresh rumors about OpenAI developing custom chips. This time, the claim is that they intend to build on TSMC’s upcoming A16 technology.
  • The Daylight Computer ($730) is an attempt to build a tablet that is focused on long-form reading and eschewing distraction. People seem to like it (Dwarkesh Patel, Patrick McKenzie). There are plans to add some light-touch AI features (in-context summarization/explanation/etc.).

Cars

  • Tesla announced Actually Smart Summon, which allows the vehicle to navigate from a parking spot to the user.

Robots


AI News 2024-08-29

Research Insights

  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs (code). Long-form text generation is an area where LLMs under-perform, though there have been prior efforts to scaffold LLMs into writing long-form text (Re3, English essays, journalism) or even full technical papers (science writing, Wikipedia, AI scientist). This latest preprint introduces a new benchmark and fine-tunes LLMs to extend the coherence length of output.
  • A promising approach to understanding foundation models is monosemanticity: the model’s raw internal representation is inscrutable, so one instead trains a sparse autoencoder (SAE) to project the internal representations into a higher-dimensional space. The high-D space allows disentangling/isolation of concepts, while sparsity tries to enforce a legible number of concepts. In any case, it works (Anthropic, OpenAI), with meaningful (to humans) categories naturally appearing in the SAE space.
    • Some researchers took this a step further: Showing SAE Latents Are Not Atomic Using Meta-SAEs. They essentially apply the SAE concept recursively, training a meta-SAE on top of the first. They show that concepts in the original SAE space can be decomposed into finer-grained concepts. More generally, this implies a viable approach to decompose concepts in a hierarchical, tree-like manner (dashboard to explore concepts). A minimal SAE sketch appears after this list.
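
For reference, a minimal sketch of what a sparse autoencoder over model activations looks like: project a residual-stream vector into a much wider feature space with a ReLU, reconstruct it, and penalize the L1 norm of the features so that only a few of them fire. The sizes, initialization, and coefficients below are illustrative placeholders; the meta-SAE work then trains a second SAE on top of the first one’s learned features.

```python
# Minimal sketch of a sparse autoencoder (SAE) over model activations.
# Sizes, initialization, and the L1 coefficient are illustrative only.
import numpy as np

d_model, d_features = 512, 8192                  # overcomplete: 16x more features than dims
W_enc = np.random.randn(d_model, d_features) * 0.01
b_enc = np.zeros(d_features)
W_dec = np.random.randn(d_features, d_model) * 0.01
l1_coeff = 1e-3

def sae(activation: np.ndarray):
    features = np.maximum(0.0, activation @ W_enc + b_enc)    # sparse, non-negative codes
    reconstruction = features @ W_dec
    loss = np.mean((reconstruction - activation) ** 2) + l1_coeff * np.abs(features).sum()
    return features, reconstruction, loss

x = np.random.randn(d_model)                     # stand-in for one residual-stream activation
feats, recon, loss = sae(x)
print(int((feats > 0).sum()), float(loss))       # number of active features, and the training loss
```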

LLM

  • Anthropic:
  • Google:
    • Released three new Gemini models: updated Gemini 1.5 Pro and Gemini 1.5 Flash, and a very compact-but-capable Gemini 1.5 Flash 8B.
    • Google Advanced users can now make Gemini Gems (similar to custom GPTs).
  • Cursor AI is a VSCode style IDE with LLM assistance built-in (tab-completion, chatting, and directed in-code diffs/rewrites). Although it has been around for a while, it has recently gained increased attention, including a recommendation from Andrej Karpathy (who has long advocated for English being the programming language of the future). LLM integration into IDE does indeed further enhance the value, making it amazingly easy to generate and evaluate code.
    • Others note how combining it with voice input makes for a powerful interface.
    • Cursor have a blog post on how they accelerated LLMs to make this kind of interface fast and smooth.

AI Agents

  • Motleycrew (code) is a multi-agent framework for enabling flexible interaction patterns.

Policy

Philosophy

Audio

Video

Vision

World Synthesis

  • Adding subsurface scattering to Gaussian Splatting (preprint). It’s amazing how quickly the various nuances of traditional polygon-based graphics are being added to the newer neural/Gaussian methods.
  • Google presents work on using diffusion models to simulate video games: Diffusion Models Are Real-Time Game Engines (example video, project page). They train a diffusion model to predict the next frame in the DOOM video game. Humans can barely tell the difference. Obviously it is computationally inefficient to simulate a game like Doom in this way, but it points towards a future where video games (and simulated worlds more broadly) are real-time rendered using neural/diffusion methods.

Hardware

Robots


AI News 2024-08-22

Research Insights

LLMs

AI Agents

Policy

Image Synthesis

Video

Vision

Brain

Science

Robots


Can we Distinguish Human from AI?

Let’s pull together some information (as of 2024-08-16):

  • Modern LLMs can generate highly coherent text, and in some sense have quietly surpassed the famous Turing Test. This has been evaluated, with GPT-4 caliber systems broadly passing the test.
  • In March 2023, there was brief online debate about whether these videos feature a human or an AI avatar: video 1, video 2.
    • Is the answer obvious to you? There are details that make it look fake (e.g. fuzziness between hair/background, unphysical hair motion, blurriness around nose-ring). Conversely other aspects (hands, mannerisms) seem too good to be AI-generated. And one must be on the lookout for an intentional fake (human acting/voicing strangely on purpose, intentionally adding visual artifacts, etc.).
    • The answer, it seems, is that this is a deepfake (made using Arcads), wherein the user provides a video, and then the voice is replaced and the mouth movements are synced to the new audio. So it is a normal human-actor video with AI audio and lip-sync; not AI-generated from scratch.
    • Of course, the deepfake implications are obvious, since there is plenty of video of notable people to draw from. E.g. here’s an Obama deepfake made using Argil.
  • In August 2024, this image (and corresponding video) were presented as an example of genAI that a casual observer would initially assume to be real.
  • In August 2024, the 𝕏 account 🍓🍓🍓 (@iruletheworldmo) began spreading rumors about upcoming OpenAI releases (related to Q*/Project-Strawberry, GPT-5, forthcoming AGI, UBI, etc.). It grew a large following (30k followers in two weeks), despite only one of its many outlandish predictions being validated. (The account mentioned SWE-Bench Verified three days before the official announcement.)
    • This sparked rumors that this account was actually an AI (e.g. OpenAI test of agentic system, or a marketing firm demonstrating engineered hype-based follower growth) or even a test of a next-generation model (e.g. GPT-5).
    • Although the evidence for these claims is weak, the fact that it is not easy to rule out is also telling.
  • On the evening of 2024-08-15, there was an 𝕏 spaces meetup wherein various users voice-chatted with Lily Ashwood (@lilyofashwood). The discussion centered on figuring out whether Lily was human or AI (clip, full recording). Her responses seemed at times to draw upon remarkably encyclopedic knowledge, her voice was measured and slightly stilted, and her interactions were occasionally strange. These all point to her being a language/voice model. But at other times, her jokes or creative responses were surprisingly human-like. Was this truly an AI-model, or a human mimicking TTS speaking style (and using an LLM to come up with AI-like responses)? The discussion space was surprisingly split in opinion.
  • New paper: Personhood credentials: Artificial intelligence and the value of privacy-preserving tools to distinguish who is real online.
    • It is becoming increasingly difficult to distinguish human from synthetic. Captcha tests are now often solvable by automated systems. And genAI photo/voice/video is now sufficiently convincing that it will be taken as genuine at first glance.
    • They propose personhood credentials, that could be generated by a trusted authority (e.g. government) using cryptography. This would allow a person to demonstrate they are a particular person, without revealing exactly who they are, in various online interactions.

Overall, the ability to distinguish human from AI in an online setting is becoming quite challenging, especially in cases where a human can intervene as necessary to maintain the ruse.

Update 2024-09-01


AI News 2024-08-15

Research Insights

  • An empirical investigation of the impact of ChatGPT on creativity. They find that people using ChatGPT as an aid generate more creative outputs, though these are mostly incremental ideas. The results are roughly consistent with an earlier study showing that genAI makes individual users more creative, but also reduces the overall diversity of ideas across the group of users.
  • Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. They describe rStar (code), a self-play mutual-reasoning approach: a small model augments Monte Carlo Tree Search with a set of defined reasoning heuristics, and mutually consistent trajectories are emphasized. (A simplified sketch of the consistency idea appears after this list.)
    • The body of work describing inference-time search strategies continues to grow. They all show improvements of various sorts. It remains unclear whether there is one strategy that substantially out-performs.
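
The full rStar pipeline couples the small model with Monte Carlo Tree Search; as a much-simplified stand-in, the sketch below shows only the consistency flavor of the idea: sample many reasoning trajectories and keep the answer they most often agree on. The llm() helper is a hypothetical placeholder for a chat-completion call.

```python
# Simplified stand-in for the consistency ingredient (not the full rStar
# MCTS pipeline). `llm()` is a hypothetical placeholder.
from collections import Counter

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider of choice")

def consistent_answer(question: str, n_samples: int = 16) -> str:
    answers = []
    for _ in range(n_samples):
        trajectory = llm("Reason step by step, then give only the final answer "
                         f"on the last line:\n{question}")
        answers.append(trajectory.strip().splitlines()[-1])
    return Counter(answers).most_common(1)[0][0]
```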

LLMs

  • Qwen released Qwen2-Math in 1.5B, 7B, and 72B sizes (huggingface, github), achieving top performance on math tasks.
  • Anthropic is experimenting with adding inline actions to Artifacts. For instance, you can select code and pick “Improve” or “Explain”.
  • Anthropic released prompt caching, which can greatly reduce inference costs.
  • Researchers released LLMs tuned for healthcare.
  • xAI released a beta of Grok-2. They have also achieved roughly “GPT-4” caliber performance, with benchmarks similar to GPT-4o-mini, Claude 3.5 Sonnet, or Gemini 1.5-Pro. The system has real-time access to 𝕏 posts; there are mixed reactions about whether this is useful or not.
    • Grok 2 currently uses Flux for image generation. The implementation is less restricted than other major image synthesis providers.
  • OpenAI making incremental progress:
    • Finally released the GPT-4o system card, which describes some aspects of training and safety.
    • Quietly pushed out an update to GPT-4o. People do indeed report that it feels slightly smarter.
    • Released a new-and-improved SWE-bench Verified, to enable better evaluation of AI ability to solve real-world software issues.

AI Agents

Safety

Image

Video

World Synthesis

Hardware

Robots
