AI News 2025-07-03

General

Andrej Karpathy has a knack for distilling the trends in AI/ML:
- 2017-11: Software 2.0 (“Gradient descent can write code better than you. I’m sorry.”)
- 2022-10: Transformers as general-purpose differentiable computers (talk)
- 2023-01: The hottest new programming language is English
- 2023-09: LLM as kernel of a new Operating System (diagram/diagram, OS analogies)
- 2025-02: Vibe coding
- 2025-06: Software 3.0 (talk): “Prompts as Programs”. Software 1.0 is code; 2.0 is model weights; 3.0 is prompts.
- 2025-06: “Context Engineering” instead of “Prompt Engineering”
- Now (2025-06): Prediction that LLMs will be scaled down into “cognitive cores”: small, edge-optimized LLMs (for on-device inference) that have minimal knowledge but maximized reasoning and tool-use abilities. Such models can rapidly iterate to retrieve required results and build answers.
Epoch reports on improvements in context window length: LLMs now accept longer inputs, and the best models can use them more effectively.

Research Insights

A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap.
- Response to Apple’s paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, which argues LLMs fail as complexity increases, demonstrating a lack of true reasoning.
- This new paper argues that what seems like a lack of reasoning is more like a lack of agentic ability (tool access, etc.).
Towards Scalable Parameter Decomposition. They show a method to decompose models using parameters (rather than activations).
- Paper with technical details: Stochastic Parameter Decomposition (code)
VLMs can think visually without generating pixels. Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (paper, code).
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements. The benchmark measures agentic abilities by asking the agent to improve the training speed of a small LLM (as a proxy for more general “AI recursive self-improvement”). Current agents do surprisingly badly (sub-human performance even with significant hints). Going forward, this eval (or variants thereof) should prove valuable for measuring “useful agentic” performance.

LLM

Inception Labs launch Mercury, a diffusion LLM. The fast inference of the diffusion architecture puts it in a new regime of the speed-vs-performance tradeoff.

Agents

Anthropic tested Claude’s ability to operate a small business: Project Vend: Can Claude run a small shop? (And why does that matter?). Although surprisingly capable in certain ways, the agent overall lost money over time, and had a mini-identity-crisis for a day.

Safety

Image Synthesis

Video

Nano (Greyscale Labs) is a visual effects plugin that exploits ML depth-estimation to allow editing of volumetric haze.
Nvidia: UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting. Enables relighting of an existing image or video by estimating albedo.
Alibaba: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation. Expressive avatar control.

World Synthesis

Science

Cars

Tesla shows off a car driving itself (with no occupants) from the factory to a customer’s address. No doubt the route was carefully selected and vetted. Nevertheless, it is impressive.
Tesla launched a limited rollout of their full self-driving Robotaxi (with an in-vehicle employee monitor, for now) in Texas.

Robots

K-Scale announces that you can now order one of their open-source humanoid robots ($9k early-adopter price; $16k nominal price).