AI News 2024-09-19

General

  • Fei-Fei Li announced World Labs, “a spatial intelligence company building Large World Models (LWMs) to perceive, generate, and interact with the 3D world.”
  • Microsoft announces “Wave 2” of Microsoft 365 Copilot (see also this video). The announcement is light on specifics, but it reiterates the point (cf. Aidan McLaughlin’s post) that as models become more powerful and commoditized, the “wrapper”/“scaffolding” becomes the locus of value. Presumably, this means Microsoft intends to offer progressively more sophisticated and integrated tools.
  • Scale and CAIS are trying to put together an extremely challenging evaluation for LLMs; they are calling it “Humanity’s Last Exam”. They are looking for questions that would be challenging even for experts in a field, and which would be genuinely surprising if an LLM answered correctly. You can submit questions here. The purpose, of course, is to have a new eval/benchmark for testing progressively smarter LLMs. It is surprisingly hard to come up with ultra-difficult questions that have simple, easy-to-evaluate answers.
  • Data Commons is a global aggregation of verified public data, useful for grounding LLM retrieval. It is being pushed by Google (e.g. DataGemma). A minimal query sketch appears below.
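As a rough illustration of how such data could ground an LLM response, here is a minimal sketch assuming the `datacommons` Python client; the `get_stat_value` call and the statistical-variable IDs should be checked against the current documentation:

```python
# Minimal sketch: query Data Commons for a verified statistic, then use
# it to ground an LLM prompt. Assumes the `datacommons` Python client
# (pip install datacommons); check the official docs for the current API.
import datacommons as dc

# Retrieve a verified statistic (California population) by place ID and
# statistical-variable ID.
population = dc.get_stat_value("geoId/06", "Count_Person")

# Inject the retrieved value into the prompt as grounded context,
# instead of relying on the model's memorized (possibly stale) facts.
prompt = f"Using the verified figure {population}, answer: ..."
```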

Research Insights

  • IBM released a preprint: Automating Thought of Search: A Journey Towards Soundness and Completeness.
    • This is based on: Thought of Search: Planning with Language Models Through The Lens of Efficiency (Apr 2024). That paper uses an LLM for planning, emphasizing the completeness and soundness of the search. Their design invokes the LLM infrequently, relying on traditional methods to run the search itself; instead, the LLM generates the code the search requires (goal test, heuristic function, etc.). This provides some balance, leveraging the flexibility and generalization of the LLM while still using efficient code-execution search methods.
    • This new paper further automates the process: the LLM generates code for the search components and checks it (e.g. with unit tests), without the need for human oversight. A toy sketch of the overall pattern follows below.
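A toy illustration of the pattern, under the assumption that the LLM writes small search components once and a classical loop executes them many times; the function bodies below are hand-written stand-ins for LLM-generated code, and `llm_generate` is a hypothetical helper, not from the paper:

```python
# Thought-of-Search pattern: the LLM writes small search components
# once; a classical search loop executes them many times.
from collections import deque

# In practice these bodies would be produced by the LLM, e.g.:
#   succ_src = llm_generate("Write successors(state) for this puzzle...")
# Here we hard-code a toy domain (reach 10 from 1 via +1 or *2).
def successors(state: int) -> list[int]:
    return [state + 1, state * 2]

def is_goal(state: int) -> bool:
    return state == 10

def bfs(start: int):
    """Classical BFS; the LLM is never called inside the loop."""
    frontier, seen = deque([(start, [start])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

print(bfs(1))  # [1, 2, 4, 5, 10]
```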
  • Schrodinger’s Memory: Large Language Models. Considers how LLM memory works.
    • Cf. earlier work (1, 2, 3) showing that model size (total parameter count) affects how much a model can know/memorize, while model depth affects reasoning ability.
  • LLMs + Persona-Plug = Personalized LLMs. Rather than personalizing LLM responses with in-context data (e.g. document retrieval), this method generates a set of personalized embeddings from a particular user’s historical context. These embeddings bias the model towards that user’s desired outputs (see the sketch after this list).
    • More generally, one could imagine a powerful base model with various “tweaks” layered on top (modified embeddings, LoRA adapters, etc.) adapting it to each person’s specific use-case.
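A minimal sketch of the general idea: prepend a learned per-user embedding to a frozen model’s token embeddings. The shapes, module names, and soft-token formulation here are assumptions for illustration, not the paper’s implementation:

```python
# Illustrative sketch of the persona-plug idea: a small trainable
# per-user embedding is prepended to the token embeddings of a frozen
# base model, biasing generation toward that user's style/preferences.
import torch
import torch.nn as nn

class PersonaPlug(nn.Module):
    def __init__(self, n_users: int, d_model: int, n_plug_tokens: int = 4):
        super().__init__()
        # One small trainable "plug" (a few soft tokens) per user.
        self.plugs = nn.Embedding(n_users, n_plug_tokens * d_model)
        self.n_plug_tokens, self.d_model = n_plug_tokens, d_model

    def forward(self, user_id: torch.Tensor, token_embeds: torch.Tensor):
        # token_embeds: (batch, seq_len, d_model)
        plug = self.plugs(user_id).view(-1, self.n_plug_tokens, self.d_model)
        # The frozen LLM sees the plug as ordinary prefix embeddings.
        return torch.cat([plug, token_embeds], dim=1)

# Usage (hypothetical base model interface):
#   embeds = base_model.embed(input_ids)
#   inputs = PersonaPlug(n_users=1000, d_model=768)(user_id, embeds)
```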

Policy & Safety

  • Sara Hooker (head of Cohere for AI) published: On the Limitations of Compute Thresholds as a Governance Strategy. Many proposed AI-safety policies/laws rely on compute thresholds, on the assumption that progressively more powerful models will require exponentially more training compute. The remarkable effectiveness and scaling of inference-time compute partially calls this into question, as does the ability to distill capable models into smaller, more efficient ones. Overall, the paper argues that the correlation between compute and risk is weak, and that relying on compute thresholds is an insufficient safety strategy. (A back-of-envelope example of what such thresholds measure appears below.)
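To make concrete what a compute threshold actually measures, here is a rough back-of-envelope calculation using the common C ≈ 6·N·D approximation for training FLOPs. The 10^26 figure matches the threshold in the 2023 US Executive Order; the model numbers are illustrative, not from Hooker’s paper:

```python
# Back-of-envelope training compute: C ≈ 6 * N * D, where
# N = parameter count and D = number of training tokens.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

threshold = 1e26  # reporting threshold in the 2023 US Executive Order

# A Llama-3-70B-scale run: 70B parameters, ~15T training tokens.
c = training_flops(n_params=70e9, n_tokens=15e12)
print(f"{c:.2e} FLOPs -> {'over' if c > threshold else 'under'} threshold")
# ~6.3e24 FLOPs: well under 1e26, despite being a highly capable model.
```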
  • Dan Hendrycks’ AI Safety textbook, published through CAIS.

LLM

  • OpenAI announced o1, a “system 2”-style model. Using reinforcement learning, they trained it to perform extended chain-of-thought thinking, allowing it to self-correct, revise plans, and thereby handle much more complex problems. The o1 models show improvements on puzzles, math, science, and other tasks that require planning (see the illustrative sketch after this list).
    • In the chat interface, it was initially rate-limited to 50 messages/week for o1-mini and 30 messages/week for o1-preview. These limits were then raised to 50 messages/day for o1-mini (a 7× increase) and 50 messages/week for o1-preview (~1.7×).
    • It has rapidly risen to the top of the LiveBench AI leaderboard (a challenging LLM benchmark).
    • Ethan Mollick has been using an advanced preview of o1. He is impressed, noting that in a “Co-Intelligence” sense (human and AI working together), the AI can now handle a greater range of tasks.
    • The OpenAI safety analysis shows some interesting behavior: the improved reasoning also yields improved plans for circumventing rules or exploiting loopholes, providing some real-world evidence of AI instrumental convergence towards power-seeking.
    • In an AMA, the o1 developers answered some questions; summary notes here.
    • Artificial Analysis provides an assessment: “OpenAI’s o1 models push the intelligence frontier but might not make sense for most production use-cases”.
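To make the “system 2” framing concrete, here is a toy draft-critique-revise loop. This is emphatically not OpenAI’s (unpublished) training or inference method; it only illustrates the general idea of spending extra inference-time compute on self-correction, and `llm` is a hypothetical text-completion callable:

```python
# Toy "system 2" inference loop: draft an answer, critique it, revise.
# NOT OpenAI's o1 method (which is not public); purely illustrative.
def solve(llm, problem: str, max_rounds: int = 3) -> str:
    answer = llm(f"Think step by step and solve:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(f"Find any mistakes in this solution:\n{answer}")
        if "no mistakes" in critique.lower():
            break  # the model judges its own answer acceptable
        answer = llm(
            f"Revise the solution.\nProblem: {problem}\n"
            f"Draft: {answer}\nCritique: {critique}"
        )
    return answer
```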

Voice

Vision

Image Synthesis

Video

World Synthesis

Hardware

  • Snapchat’s 5th-generation Spectacles are AR glasses, intended for developers. Specs: standalone, 46° field of view, 37 pixels per degree (equivalent to a ~100″ screen), two Snapdragon chips, 45 minutes of battery life, and automatically transitioning lenses.

Robots

  • Video of LimX CL-1 doing some (pretend) warehouse labor tasks.