AI News 2025-07-17

General

Gwern essay: LLM Daydreaming: A proposal and discussion of how the lack of a default-mode-network analog in LLMs exemplifies the search and novelty capabilities missing from contemporary AI systems.
METR analysis: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (blog, commentary/analysis). Their results show that use of AI tools by programmers decreased productivity. This is surprising to the many programmers who use LLMs to improve their work. One caveat is that the study analyzed developers working on their own codebases (whereas the biggest gains from LLM usage come from exploring unfamiliar topics/code). One might also wonder how the results would change for free-form use rather than the usage prescribed in this study. Nevertheless, this does weaken the argument that AI is already delivering productivity gains.
METR is showing results for a wider range of tasks (this appears to be an update to their earlier report from May 2025). The landmark METR result showed that the length of software engineering tasks that AI can execute (as measured by “task time” for a human to complete) was increasing exponentially. Now they add a variety of other tasks to the evaluation. Each task shows some form of exponential scaling, though the magnitude and growth rate differ.
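To make "exponential scaling" concrete: if the task-time horizon grows exponentially, then log2(horizon) is linear in time, and the inverse of the slope is the doubling time. Below is a minimal sketch of that fit. The function name `doubling_time` and the data points are illustrative assumptions; the points are synthetic (constructed to double every 6 months) and are not METR's measurements.

```python
import math

def doubling_time(points):
    """Estimate doubling time from (time, horizon) pairs.

    Fits log2(horizon) against time by ordinary least squares and
    returns 1/slope, i.e. the time units needed for one doubling.
    """
    xs = [t for t, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # doublings per unit time
    return 1.0 / slope

# Synthetic data (NOT from the METR report): horizon in minutes,
# constructed to double every 6 months, sampled over 3 years.
synthetic = [(month, 2.0 ** (month / 6.0)) for month in range(0, 37, 6)]

if __name__ == "__main__":
    print(f"Estimated doubling time: {doubling_time(synthetic):.2f} months")
```

Applied to real measurements, different task families would yield different slopes, which is one way to read "the magnitude and scaling differ."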

Research Insights

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
- It is worth remembering that there will be tradeoffs between performance and monitorability. Various “thinking in latent space” approaches have demonstrated improved performance, but they obfuscate the AI's thinking (leaving only imperfect mechanistic interpretability to attempt to recover internal processes). We should be willing to give up some immediate performance gains in order to increase our ability to align (which will yield more long-term gains).

LLM

Kimi AI releases Kimi-K2, an open-source (permissive license) 1T parameter model (MoE, 32B active). It performs extremely well across multiple benchmarks, especially coding, despite being a non-reasoning model (try it, API, code, weights).
Mirix proposes a more advanced structure for save-and-retrieve memory: MIRIX: Multi-Agent Memory System for LLM-Based Agents.

Agents

OpenAI launches Agents (video). It combines text-based browsing (like Deep Research) with visual browsing (like Operator) to handle open-ended asynchronous tasks. It can access connectors (Google Drive, etc.) and tools (image generation), and can generate artifacts such as slide decks. It achieves 42% on Humanity’s Last Exam and 27% on FrontierMath.

Audio

Thinksound can add audio to a video (examples). Similar to the previously-reported MMAudio (examples).
Mistral releases Voxtral, an open-source audio model.

Video

Runway is starting to deploy Act-Two, an improved motion capture model that can transfer a video performance to an AI avatar based on a single input image.

Cars