AI News 2024-08-15

Research Insights

An empirical investigation of the impact of ChatGPT on creativity. They find that people using ChatGPT as an aid generate more creative outputs, though these are mostly incremental ideas. The results are roughly consistent with an earlier study, which found that using genAI makes individual users more creative but also reduces the overall diversity of ideas across the group of users.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. They describe rStar (code), a self-play mutual reasoning approach. A small model augments Monte Carlo Tree Search with a set of defined reasoning heuristics, and mutually consistent trajectories can then be emphasized.
- The body of work describing inference-time search strategies continues to grow. Each reports improvements of various sorts, but it remains unclear whether any one strategy substantially outperforms the others.
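To make the "mutually consistent trajectories" idea concrete, here is a minimal sketch of one way such a filter could work. It is not the rStar implementation: the `Trajectory` type, the `complete` callback (standing in for a second model), the `split` fraction, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    steps: List[str]  # reasoning steps proposed by the first (generator) model
    answer: str       # final answer the generator reached
    score: float      # search score, e.g. accumulated during tree search


def is_mutually_consistent(
    traj: Trajectory,
    complete: Callable[[List[str]], str],
    split: float = 0.5,
) -> bool:
    """Show a prefix of the steps to a second model and check agreement.

    `complete` is a hypothetical stand-in for the second model: it takes
    the prefix steps and returns the final answer it arrives at.
    """
    cut = max(1, int(len(traj.steps) * split))
    return complete(traj.steps[:cut]) == traj.answer


def select_trajectory(
    trajs: List[Trajectory],
    complete: Callable[[List[str]], str],
) -> Trajectory:
    """Prefer trajectories both models agree on, then take the best score."""
    agreed = [t for t in trajs if is_mutually_consistent(t, complete)]
    pool = agreed or trajs  # fall back to all candidates if none agree
    return max(pool, key=lambda t: t.score)


# Toy data: the second model is a stub that always answers "18".
candidates = [
    Trajectory(["16 - 3 = 13", "13 - 4 = 9", "9 * 2 = 18"], "18", 0.6),
    Trajectory(["16 - 3 = 13", "13 + 4 = 17", "17 * 2 = 34"], "34", 0.9),
]
chosen = select_trajectory(candidates, lambda prefix: "18")
```

In this toy case the lower-scoring trajectory is chosen, because it is the only one whose answer the second model reproduces. A real system would replace the stub with an actual model call.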

LLMs

Qwen released Qwen2-Math in 1.5B, 7B, and 72B sizes (huggingface, github), reporting top performance on math tasks.
Anthropic is experimenting with adding inline actions to Artifacts. For instance, you can select code and pick “Improve” or “Explain”.
Anthropic released prompt caching, which can greatly reduce inference costs.
Researchers released LLMs tuned for healthcare.
xAI released a beta of Grok-2. They have also achieved roughly “GPT-4” caliber performance, with benchmarks similar to GPT-4o-mini, Claude 3.5 Sonnet, or Gemini 1.5-Pro. The system has real-time access to 𝕏 posts; there are mixed reactions about whether this is useful or not.
- Grok 2 currently uses Flux for image generation. The implementation is less restricted than other major image synthesis providers.
OpenAI is making incremental progress:
- Finally released the GPT-4o system card, which describes some aspects of training and safety.
- Quietly pushed out an update to GPT-4o. People do indeed report that it feels slightly smarter.
- Released a new-and-improved SWE-bench Verified, to enable better evaluation of AI ability to solve real-world software issues.
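As a rough illustration of why the prompt caching mentioned above can reduce inference costs, the sketch below compares input-token cost with and without a cached prefix. The multipliers (cache writes at 1.25x the base input price, cache reads at 0.10x) and the example price are assumptions for illustration; check the provider's current pricing before relying on them. It also assumes the cache stays warm across all calls.

```python
def cost_without_cache(prefix_tokens: int, suffix_tokens: int,
                       calls: int, price_per_token: float) -> float:
    """Every call pays full price for the shared prefix and its own suffix."""
    return calls * (prefix_tokens + suffix_tokens) * price_per_token


def cost_with_cache(prefix_tokens: int, suffix_tokens: int,
                    calls: int, price_per_token: float,
                    write_mult: float = 1.25,   # assumed cache-write premium
                    read_mult: float = 0.10) -> float:  # assumed cache-read discount
    """First call writes the prefix to the cache; later calls read it."""
    if calls <= 0:
        return 0.0
    first = (prefix_tokens * write_mult + suffix_tokens) * price_per_token
    later = (calls - 1) * (prefix_tokens * read_mult + suffix_tokens) * price_per_token
    return first + later


# Hypothetical workload: a 10,000-token shared prefix, 200 new tokens per call,
# 50 calls, at an assumed price of $3 per million input tokens.
price = 3 / 1_000_000
baseline = cost_without_cache(10_000, 200, 50, price)
cached = cost_with_cache(10_000, 200, 50, price)
print(f"no cache: ${baseline:.4f}, with cache: ${cached:.4f}")
```

Under these assumptions the savings grow with the number of calls that reuse the prefix. A single call is slightly more expensive with caching, because it pays the write premium without ever reading from the cache.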

AI Agents

Cosine AI put out a report on their Genie system for software generation. They claim record-setting performance on SWE-bench.
Salesforce describe a software engineering (SWE) approach that is a meta-system that manages existing SWE-agents or frameworks (preprint). It can extract better overall performance by combining a diversity of different AI agents.
Sakana AI report the development of an AI scientist, and released code and a preprint: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (explainer video). The description is quite ambitious. They describe a succession of LLMs that conduct all parts of a research workflow, including generating the final publication (example).
MultiOn AI describe Agent Q (paper), AI agents with planning and self-correcting capabilities.
Stanford describe STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking), an open-source agent that can write articles. Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models.

Safety

Better Alignment with Instruction Back-and-Forth Translation. They create synthetic training data from existing (web) data, by generating viable prompts and responses in a consistent manner.
The US is currently broadly supportive of open-source AI efforts (NTIA report).

Image

It’s no surprise that the recently-released open-source FLUX.1 image model (c.f.) is being hosted in a wide variety of places: Deforum discord, Fal.ai, Replicate, EverArt AI, HuggingFace, Abacus ChatLLM, model download. Many of these are free (for now).
Anifusion is a tool for creating comic/manga pages.
Google released a paper on Imagen 3. Generations are quite good, but not better than the current Flux or Midjourney capabilities.

Video

World Synthesis

FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework. Uses NeRF methods to reconstruct images of plants and identify/count the fruit (video). This seems quite useful by itself, but also points more broadly to the power of 3D reconstruction improving a host of visual real-world tasks. (Here’s a video showing a similar result from a different group.)
Nvidia has presented some demos of how raytracing can be combined with Gaussian splats (preprint): shadows, depth of field, refraction, etc.
High dynamic range (HDR) Gaussian splatting has also been demonstrated.
Nvidia demoed real-time world-building (text-to-object, etc.).

Hardware

Based Hardware is trying to make open-source AI wearables, including glasses (OpenGlass) and the Friend AI pendant (not to be confused with the other, identically named Friend AI pendant, c.f.).
At its event, Google announced the Pixel 9 Pro smartphones.
- The phones are incrementally improved. They include the Tensor G4 chip, which enables more on-device AI features.
- Gemini will become even more deeply integrated into Android.
- Gemini Live will allow multi-modal conversations (back-and-forth conversations, AI can use camera for added context, etc.).
- The new Pixel Buds are designed to be an interface to Gemini Live.

Robots

Clone Robotics released a video showing teleoperation of their sophisticated hand. The mechanics of the Clone hand are remarkable, but the fidelity in the teleoperation appears quite low.
Google DeepMind demos a robot that can play table tennis at a solidly amateur level. The robot is engaging in a physical activity at human-level performance (without simply resorting to super-human hardware solutions); i.e. the control system is capable of both simple (hit ball) and complex (plan where to hit in order to win) actions.
New videos show off improvements in the LimX CL-1 humanoid: doing simple tasks, shuffling sideways, and walking up stairs (more confidently than before).
Presentation on Boston Dynamics Atlas (including new electric version). Seems agile; e.g. doing pushups.
Apptronik claims that with minimal training (10 hours), their Apollo robot could autonomously handle soft/deformable objects. They are also projecting that some of the initial demand for humanoid robots will be at-home assistants for the elderly.