Research Insights
- An empirical investigation of the impact of ChatGPT on creativity. They find that people using ChatGPT as an aid generate more creative outputs, though these are mostly incremental ideas. The results are roughly consistent with an earlier study, which found that using genAI makes individual users more creative but also reduces the overall diversity of ideas across the group of users.
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers. They describe rStar (code), a self-play mutual reasoning approach. A small model augments Monte Carlo Tree Search with a defined set of reasoning heuristics, and trajectories that are mutually consistent can then be emphasized.
- The body of work describing inference-time search strategies continues to grow. They all show improvements of various sorts. It remains unclear whether any one strategy substantially outperforms the others.
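To make the "mutually consistent trajectories" idea concrete, here is a toy sketch. It is not rStar's implementation: there is no tree search and no LLM. The `complete` callback, `toy_complete`, and `CANDIDATES` are hypothetical stand-ins, assuming a second model is shown the first part of a candidate's reasoning and the candidate is kept only if that model reaches the same final answer.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A trajectory is (reasoning steps, final answer).
Trajectory = Tuple[List[str], str]


def select_answer(
    trajectories: List[Trajectory],
    complete: Callable[[List[str]], str],
    keep_fraction: float = 0.5,
) -> Optional[str]:
    """Pick an answer, preferring trajectories that a second model agrees with."""
    consistent = []
    for steps, answer in trajectories:
        # Reveal only the first part of the reasoning to the second model.
        cut = max(1, int(len(steps) * keep_fraction))
        if complete(steps[:cut]) == answer:
            consistent.append(answer)
    # Fall back to a plain majority vote if nothing was mutually consistent.
    pool = consistent or [answer for _, answer in trajectories]
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]


def toy_complete(prefix: List[str]) -> str:
    """Hypothetical second model: recomputes the sum written in the first step."""
    left_side = prefix[0].split("=")[0]
    return str(sum(int(token) for token in left_side.split("+")))


# Hypothetical candidates, as if sampled from a small generator model.
CANDIDATES: List[Trajectory] = [
    (["12 + 30 = 42", "so the answer is 42"], "42"),
    (["12 + 30 = 42", "double it to get 84"], "84"),
    (["12 + 30 = 43", "so the answer is 43"], "43"),
]

if __name__ == "__main__":
    print(select_answer(CANDIDATES, toy_complete))
```

In a real system both roles would be played by language models and the candidates would come from a search procedure; the sketch only illustrates the agreement filter applied at the end.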
LLMs
- Qwen released Qwen2-Math in 1.5B, 7B, and 72B sizes (huggingface, github). Top performance on math tasks.
- Anthropic is experimenting with adding inline actions to Artifacts. For instance, you can select code and pick “Improve” or “Explain”.
- Anthropic released prompt caching, which can greatly reduce inference costs.
- Researchers released LLMs tuned for healthcare.
- xAI released a beta of Grok-2. They have also achieved roughly “GPT-4” caliber performance, with benchmarks similar to GPT-4o-mini, Claude 3.5 Sonnet, or Gemini 1.5-Pro. The system has real-time access to 𝕏 posts; there are mixed reactions about whether this is useful or not.
- Grok-2 currently uses Flux for image generation. The implementation is less restricted than other major image synthesis providers.
- OpenAI is making incremental progress:
  - Finally released the GPT-4o system card, which describes some aspects of training and safety.
  - Quietly pushed out an update to GPT-4o. People do indeed report that it feels slightly smarter.
  - Released a new-and-improved SWE-bench Verified, to enable better evaluation of AI ability to solve real-world software issues.
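As a rough illustration of why prompt caching (mentioned above) can cut inference costs, the sketch below compares input-token cost with and without a cached shared prefix. The function name and the multipliers `write_mult` and `read_mult` are illustrative assumptions, not a statement of any provider's current pricing, and the sketch assumes the cache never expires between calls.

```python
def input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    base_price: float,
    write_mult: float = 1.25,  # assumed surcharge for writing the cache
    read_mult: float = 0.10,   # assumed discount for reading from the cache
    cached: bool = True,
) -> float:
    """Total input-token cost for `calls` requests that share one prefix."""
    if calls <= 0:
        return 0.0
    if not cached:
        # Every call pays full price for the prefix and its own suffix.
        return calls * (prefix_tokens + suffix_tokens) * base_price
    # First call writes the prefix to the cache; later calls read it.
    write = prefix_tokens * base_price * write_mult
    reads = (calls - 1) * prefix_tokens * base_price * read_mult
    # The per-call suffix is never cached, so it is always full price.
    fresh = calls * suffix_tokens * base_price
    return write + reads + fresh


if __name__ == "__main__":
    # Hypothetical workload: a long shared document plus short questions.
    without = input_cost(10_000, 200, 50, base_price=1.0, cached=False)
    with_cache = input_cost(10_000, 200, 50, base_price=1.0)
    print(f"cached / uncached cost ratio: {with_cache / without:.2f}")
```

The savings grow with the length of the shared prefix and the number of calls that reuse it; a single call is slightly more expensive under these assumed multipliers, because it pays the write surcharge without any later reads.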
AI Agents
- Cosine AI put out a report on their Genie system for software generation. They claim record-setting performance on SWE-bench.
- Salesforce describe a software engineering (SWE) approach that is a meta-system that manages existing SWE-agents or frameworks (preprint). It can extract better overall performance by combining a diversity of different AI agents.
- Sakana AI report the development of an AI scientist, and released code and a preprint: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (explainer video). The description is quite ambitious. They describe a succession of LLMs that conduct all parts of a research workflow, including generating the final publication (example).
- MultiOn AI describe Agent Q (paper), AI agents with planning and self-correcting capabilities.
- Stanford describe STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking), an open-source agent that can write articles. Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models.
Safety
- Better Alignment with Instruction Back-and-Forth Translation. They create synthetic training data from existing (web) data, by generating viable prompts and responses in a consistent manner.
- The US is currently broadly supportive of open-source AI efforts (NTIA report).
Image
- It’s no surprise that the recently-released open-source FLUX.1 image model (c.f.) is being hosted in a wide variety of places: Deforum discord, Fal.ai, Replicate, EverArt AI, HuggingFace, Abacus ChatLLM, model download. Many of these are free (for now).
- Anifusion is a tool for creating comic/manga pages.
- Google released a paper on Imagen 3. Generations are quite good, but not better than the current Flux or Midjourney capabilities.
Video
- Generative video (text-to-video and image-to-video) has advanced rapidly over the last couple of years. It’s interesting to look back over the evolution of capabilities.
  - Nov 2016: Sync-Draw
  - April 2021: GODIVA
  - Oct 2022: Meta Make-a-video
  - Oct 2022: Google Imagen video
  - April 2023: Will Smith eating spaghetti
  - April 2023: Runway Gen 2
  - April 2023: Nvidia latents
  - December 2023: Fei-Fei Li
  - January 2024: Google VideoPoet
  - January 2024: Google Lumiere
  - February 2024: OpenAI Sora
  - April 2024: Vidu
  - May 2024: Veo
  - May 2024: Kling
  - June 2024: Luma DreamMachine
  - June 2024: RunwayML Gen-3 Alpha
  - July 2024: Toys-R-Us Commercial made using Sora
  - July 2024: Motorola commercial made using genAI
  - July 2024: haiper.ai
  - August 2024: Hotshot (examples)
  - August 2024: Examples of state-of-the-art works made using genAI video:
    - Runway Gen3 music video
    - Runway Gen3 for adding FX to live action (another example)
    - Midjourney + Runway Gen3: Hey It’s Snowing
    - Flux/LoRA image + Runway Gen3 woman presenter
    - McDonald’s AI commercial
    - Sora used by Izanami AI Art to create a dreamlike video and by Alexia Adana to create a sci-fi film concept
World Synthesis
- FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework. Uses NeRF methods to reconstruct images of plants and identify/count the fruit (video). This seems quite useful by itself, but also points more broadly to the power of 3D reconstruction improving a host of visual real-world tasks. (Here’s a video showing a similar result from a different group.)
- Nvidia has presented some demos of how raytracing can be combined with Gaussian splats (preprint): shadows, depth of field, refraction, etc.
- High dynamic range (HDR) Gaussian splatting has also been demonstrated.
- Nvidia demoed real-time world-building (text-to-object, etc.).
Hardware
- Based Hardware is trying to make open-source AI wearables, including glasses (OpenGlass) and the Friend AI pendant (not to be confused with the other, identically named Friend AI pendant, c.f.).
- Google held an event announcing the Pixel 9 Pro smartphones.
  - The phones are incrementally improved. They include the Tensor G4 chip, to enable more on-device AI features.
  - Gemini will become even more deeply integrated into Android.
  - Gemini Live will allow multi-modal conversations (back-and-forth conversations, AI can use camera for added context, etc.).
  - The new Pixel Buds are designed to be an interface to Gemini Live.
Robots
- Clone Robotics released a video showing teleoperation of their sophisticated hand. The mechanics of the Clone hand are remarkable, but the fidelity in the teleoperation appears quite low.
- Google DeepMind demos a robot that can play table tennis at a solidly amateur level. The robot is engaging in a physical activity at human-level performance (without simply resorting to super-human hardware solutions); i.e. the control system is capable of both simple (hit ball) and complex (plan where to hit in order to win) actions.
- New videos show off improvements in the LimX CL-1 humanoid: doing simple tasks, shuffling sideways, walking up stairs (more confidently than before).
- Presentation on Boston Dynamics Atlas (including new electric version). Seems agile; e.g. doing pushups.
- Apptronik claims that with minimal training (10 hours), their Apollo robot could autonomously handle soft/deformable objects. They are also projecting that some of the initial demand for humanoid robots will be at-home assistants for the elderly.