AI News 2025-02-20

General

  • Perplexity adds a Deep Research capability (similar to Google’s and OpenAI’s). You can try it even in the free tier (5 queries per day). It scores 21% on the challenging “Humanity’s Last Exam” benchmark, second only to OpenAI’s Deep Research at 26%.
  • TechCrunch reports: A job ad for Y Combinator startup Firecrawl seeks to hire an AI agent for $15K a year. Undoubtedly a publicity stunt, and yet it hints at a near-future economic dynamic: offering pay based on desired results (instead of a salary), and letting others bid with human or AI solutions.
  • Mira Murati (formerly at OpenAI) announces Thinking Machines, an AI venture.
  • Fiverr announces Fiverr Go, where freelancers can train a custom AI model on their own assets, and have this AI model/agent available for use through the Fiverr platform. This provides a way for freelancers to service more clients.
    • ElevenLabs Payouts is a similar concept: voice actors are paid when clients use their customized AI voice.
    • In the short term, this provides an extra revenue stream to these workers. Of course, these workers are also the most at risk of full replacement by these very AI methods. (And, indeed, one could worry that the companies in question are gathering the data they need to eventually obviate the need for profit-sharing with contributors.)

Research Insights

LLM

  • Nous Research releases DeepHermes 3 (8B), which mixes conventional LLM responses with long chain-of-thought (CoT) reasoning responses.
  • InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU.
  • ByteDance has released a new AI-first coding IDE: Trae AI (video intro).
  • LangChain Open Canvas provides a user interface for LLMs, including memory features, UI for coding, display artifacts, etc.
  • xAI announces the release of Grok 3 (currently available for use here), including a reasoning variant and “Deep Search” (equivalent to Deep Research). Early testing suggests a model closing in on the abilities of o1-pro (but not catching up to o3 full). So, while it has not demonstrated any record-setting capabilities, it confirms that frontier models are not yet using any methods that cannot be reproduced by others.

AI Agents

Safety

Image

Video

3D

World Synthesis

  • Microsoft report: Introducing Muse: Our first generative AI model designed for gameplay ideation (publication in Nature: World and Human Action Models towards gameplay ideation). They train a model on gameplay videos (World and Human Action Model, WHAM); the model can subsequently forward-simulate gameplay from a provided frame. The model has thus learned an implicit world model for the video game. Forward-predicting gameplay based on artificial editing of frames (introducing a new character or situation) thus allows rapid ideation of gameplay ideas before actually updating the video game. More generally, this points towards direct neural rendering of games and other interactive experiences.

Science

Brain

Robots

  • Unitree video shows robot motion that is fairly fluid and resilient.
  • Clone Robotics is moving towards combining their biomimetic components into a full-scale humanoid: Protoclone.
  • MagicLab robot with the dexterous MagicHand S01.
  • Figure AI claims a breakthrough in robotic control software (Helix: A Vision-Language-Action Model for Generalist Humanoid Control). The video shows two humanoid robots handling a novel task based on human natural voice instructions. Assuming the video is real, it shows genuine progress in the capability of autonomous robots to understand instructions and carry out simple tasks (including working with a partner in a team).
Posted in AI, News | Tagged , , , , , , , , , , | Leave a comment

AI News 2025-02-13

General

Research Insights

LLM

  • OpenAI announce that o1 and o3-mini now have file and image upload capabilities.
  • Distillation Scaling Laws. Is it better to directly train a small model, or to train a larger model and distill it into a smaller one? The answer is complicated. Roughly: if on a tight compute budget, directly training a small model may be better; however, if the cost of the big model is “free” (you want the big model for other purposes anyway), then distillation can of course be efficient.

Safety & Security

  • Auditing Prompt Caching in Language Model APIs. They use response speed to detect whether a given input has previously been cached. This reveals whether someone else has already submitted that prompt, thereby leaking information between users. The attack has a similar flavor to others based on timing or energy use: a system leaks information whenever it implements internal efficiencies. The leakage can be stopped, but only by giving up the efficiency/speed gains.
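A minimal sketch of this timing side-channel, assuming a hypothetical `send_prompt` client call; the speedup threshold is illustrative, not from the paper:

```python
import time

def appears_cached(send_prompt, prompt, speedup_threshold=2.0):
    # `send_prompt` is a hypothetical stand-in for a real API client call
    # that blocks until the full response arrives.
    t0 = time.perf_counter()
    send_prompt(prompt)
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    send_prompt(prompt)
    second = time.perf_counter() - t0

    # A large speedup on the repeated request suggests the prompt hit a
    # server-side cache. If even the *first* request was fast, someone
    # else may have submitted this prompt before us.
    return first / max(second, 1e-9) > speedup_threshold
```

Real measurements would need repeated trials and statistical tests to separate caching from ordinary latency noise; this only shows the core idea.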

Voice

Video

Science

Hardware

  • Groq has secured $1.5B to expand AI inference infrastructure in Saudi Arabia.

Robots


AI News 2025-02-06

General

Research Insights

LLM

  • Nvidia is hosting DeepSeek-R1 through their API.
  • OpenAI releases o3-mini, a powerful reasoning model that leverages inference-time compute.
  • Open-R1 is an attempt to reproduce the DeepSeek-R1 model/result/method in a fully open manner. Their first update shows progress in replicating DeepSeek’s results.
  • s1: Simple test-time scaling. They investigate the simplest possible inference-time compute method for increasing reasoning: they arbitrarily insert “Wait” tokens when the model tries to complete its response. This forces it to reconsider and think longer, yielding gains that scale with compute.
  • ACECODER: Acing Coder RL via Automated Test-Case Synthesis. It provides another way to think about expending post-training (but pre-inference) compute to improve a system.
  • Google releases Gemini 2.0 broadly. Although not the top models in raw benchmark scores, this set of models seems to establish a new record for the Pareto tradeoff between performance and inference cost.
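The s1 budget-forcing trick can be sketched as a wrapper around a hypothetical `generate` call: whenever the model emits its end-of-thinking marker, the marker is stripped and “Wait” is appended, forcing further reasoning (names and the marker string are illustrative):

```python
def budget_force(generate, prompt, min_extensions=2, stop="</think>"):
    # `generate` is a hypothetical callable that continues `text` until
    # the model emits its end-of-thinking marker.
    text = prompt
    for _ in range(min_extensions):
        text = generate(text)
        if text.endswith(stop):
            # Suppress the stop marker and append "Wait," so the model
            # reconsiders instead of finalizing its answer.
            text = text[: -len(stop)] + " Wait,"
        else:
            break  # model did not try to stop; no forcing needed
    return generate(text)
```

The appeal of the method is precisely this simplicity: reasoning length (and hence compute) is controlled purely at decode time, with no retraining.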

Safety

AI Agents

Vision

Video

Voice

Robots


AI News 2025-01-30

General

Research Insights

LLM

  • Release of Qwen2.5-1M model, with a 1 million token context (technical report).
  • Release of Qwen2.5-VL, a vision-language model.
  • DeepSeek releases Janus Pro 1B (includes image generation and chat with PDF). It can run local/in-browser via WebGPU (demo here).
  • Open Thoughts has launched as an effort to curate quality datasets for training reasoning models (e.g. validated synthetic reasoning traces). Initial dataset has 114k traces.
  • Open-R1 is an attempt to reproduce the DeepSeek-R1 model/result/method in a fully open manner.
  • OpenAI has added a “think” option to GPT-4o, allowing it to invoke some form of chain-of-thought.

Safety

AI Agents

Audio

Video

Science

Robots


AI News 2025-01-23

General

Research Insights

LLM

Safety

  • OpenAI: Trading Inference-Time Compute for Adversarial Robustness (full paper). The results suggest that inference-time compute can be used to improve safety (guardrails, alignment, etc.). This makes sense, given that inference-time compute increases capabilities, and alignment can be viewed as a particular kind of capability (desired responses).

Image Synthesis

Video

Audio

  • Bland AI (now bland.com) is running a publicity stunt where you can call their AI on your phone, and after 10–60 seconds of talking, it will clone your voice and start talking to you in your own voice. Intentionally unnerving, and a good reminder that we must now be skeptical of suspicious phone calls (even ones that sound like loved ones), and that banks should stop using voice-print as a security factor.

Science

  • Published: Simulating 500 million years of evolution with a language model. (This was previously released as a preprint.) The ESM3 foundation model is trained on sequence, structure, and function of proteins. You can (e.g.) input a desired function and it will generate a candidate protein.
  • OpenAI has created an AI model for longevity science. More specifically, GPT-4b micro was trained to predict variants of protein factors with increased/controlled function. Since this model is not yet broadly available, we can’t estimate the utility. But it reinforces the notion that there is still plenty of opportunity space for tuned/task-specific advances wherever we have data and compute.

Robots


AI News 2025-01-16

General

Research Insights

Safety

LLM

AI Agents

Audio

Image Synthesis

Video

Science

Robots

  • Latest video of Unitree’s humanoid robot shows a more humanlike gait, and navigating more rugged terrain.

AI News 2025-01-09

General

Research Insights

  • PRIME: Process Reinforcement Through Implicit Rewards (data/models, code)
    • Builds on prior work: Free Process Rewards without Process Labels.
    • The basic idea: chain-of-thought (CoT) is a useful way to improve reasoning. But how to train better CoT? You can give scores to good vs. bad chains, but then the model only gets whole-chain feedback. It would be better to know where the reasoning chain went wrong (or right). In PRIME, alongside the LLM, they train a second LLM that acts as a per-token reward model. It learns which CoT steps look good vs. bad, and so can provide more fine-grained directional control.
  • Differential Transformer. Explanation: The traditional transformer architecture spreads attention and can thus get distracted by noise (especially with large context). The differential architecture alters the attention equation so as to better amplify relevant context and suppress noise. This should improve retrieval and reduce hallucinations, especially for large contexts.
  • Metadata Conditioning Accelerates Language Model Pre-training. Prepending training data with metadata (e.g. “from wikipedia.org”) for part of the training allows more control. Training can be more data-efficient, and inference can be more steerable (by invoking a metadata field associated with the desired output style).
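The differential attention idea can be illustrated for a single query row: the row’s weights are the difference of two softmax maps, so attention mass that both maps assign to the same (irrelevant) positions cancels out. A minimal pure-Python sketch; in the paper the two score maps come from separate Q/K projections and λ is learned:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def diff_attention_row(scores1, scores2, lam=0.5):
    # One query row of the differential attention map: the difference of
    # two softmax maps. Shared "noise" attention cancels; attention that
    # only one map assigns (the relevant signal) survives.
    a1, a2 = softmax(scores1), softmax(scores2)
    return [p - lam * q for p, q in zip(a1, a2)]
```

With λ = 1 and identical score vectors the row cancels to zero, which is the noise-suppression mechanism in miniature.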

LLM

AI Agents

Video

  • Fine-tuning of video models to a particular style is now starting. Examples of Hunyuan Video LoRAs.
  • Nvidia’s new GeForce RTX 5090 graphics card can use neural rendering for real-time ray-tracing (where only ~10% of pixels are computed using traditional ray-tracing, and a neural model is used to interpolate from that).

World Synthesis

  • Nvidia present Cosmos, a set of foundation models trained on 20 million hours of video. Intended to accelerate training (e.g. via synthetic data generation) of models for robotics, autonomous driving, industrial settings, etc.

Science

Brain

Hardware

  • Nvidia described their GB200 NVL72 rack-sized supercomputer: 72 Blackwell GPUs, 1.4 exaFLOPS of compute, and 130 trillion transistors. For fun, Jensen Huang showed what the corresponding compute would look like if all placed on a single wafer as a superchip, though that is not how it is actually manufactured or used.
  • Nvidia announces $3,000 personal AI supercomputer called Digits, which uses a GB10 superchip. A single unit can run a 200B model; linking two should allow one to run 405B models.

Robots


AI News 2025-01-02

General

Research Insights

  • An interesting effect: fine-tuning GPT-4o on responses where the first letter of each line spells out H-E-L-L-O leads to a model that can correctly explain this underlying rule (even though the rule was never provided to it). This is surprising since when generating a reply, a token-wise prediction cannot “see ahead” and know that it will spell out HELLO; yet the LLM is somehow able to predict its own behavior, suggesting it has some knowledge of its own internal state.
    • Further testing with the pattern HELOL gave far worse results, implying strong reliance on the existence of the HELLO pattern in the training data.
  • Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. The authors analyze whether we are efficiently using inference-time compute, and propose mitigation strategies to avoid overthinking.
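The acrostic setup in the fine-tuning experiment above is easy to verify mechanically. A sketch of the kind of filter one might use to construct or check such data (function names are mine, not from the experiment):

```python
def acrostic(text):
    # First letter of each non-empty line, uppercased.
    return "".join(
        line.lstrip()[0].upper() for line in text.splitlines() if line.strip()
    )

def spells(text, word="HELLO"):
    # Keep only responses whose lines acrostically spell the target word.
    return acrostic(text) == word
```

The interesting result is not the filter itself but that the fine-tuned model could articulate the rule the filter encodes, despite never seeing it stated.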

AI Agents

  • Huggingface introduce smolagents, a lightweight framework for agents.
  • Agentarium is a Python framework for orchestrating agents.
  • Eliza is a framework for AI models to access resources (documents, Discord, Twitter, etc.).

Audio

3D

  • zoo.dev is developing workflows for CAD where one can switch between generative and traditional-edit modes.

Science

Robots


AI News 2024-12-26

General

Research Insights

LLM

  • OpenAI reveal a new reasoning model: o3. It scores higher than o1 on math and coding benchmarks, including setting a new record of 87.5% on the ARC-AGI Semi-Private Evaluation. This suggests that the model is exhibiting new kinds of generalization and adaptability.
    • The ARC-AGI result becomes even more impressive when one realizes that the prompt they used was incredibly simple. It does not seem that they prompt-engineered or used a bespoke workflow for this benchmark (though the ARC-AGI public training set was included in o3 training). Moreover, some of the failures involve ambiguities; even when it fails, the solutions it outputs are not far off. While humans still outperform AI on this benchmark (by design), we are approaching a situation where the limiting factor is not depth-of-search but imperfect mimicking of human priors.
    • The success of o3 suggests that inference-time scaling has plenty of capacity; and that we are not yet hitting a wall in terms of improving capabilities.
  • More research in the trend of improving LLMs with more internal compute, rather than external/token-level compute (cf. Meta and Microsoft research).
  • Qwen released: QvQ-72B-preview visual reasoning model.
  • DeepSeek release DeepSeek-V3-Base (weights), 671B params. This is noteworthy as a very large open-source model, for achieving performance competitive with the state of the art, and for having (supposedly) required relatively little compute (15T tokens, 2.788M GPU-hours on H800s, only $5.5M).

Safety

Video

Audio

  • Adobe Sketch2Sound lets one vocally imitate sound effects and uses AI to convert the imitation into appropriate sounds. This enables art direction for Foley sound.
  • MMAudio enables video-to-audio; i.e. it can add a soundtrack to silent video (project, code, examples: 1, 2).

World Synthesis

Science

Hardware

  • Nvidia unveils a small form-factor compute platform (suitable for robotics).
  • Raven Resonance is another attempt to deliver augmented reality glasses.

Robots


AI News 2024-12-19

General

  • Ilya Sutskever was co-recipient of the test-of-time award at NeurIPS 2024, for the 2014 paper: Sequence to Sequence Learning with Neural Networks, currently cited >28,000 times. Video of his speech here, in which he makes many provocative points: compute is growing but data is not (we only have one Internet; data is the fossil fuel of AI); scaling still matters, and we must determine what to scale; what comes next will be a mix of agents, synthetic data, and inference-time compute; strongly reasoning systems will be unpredictable; superintelligence is coming.
  • Anthropic present Clio, a system that provides an aggregated view of what people use Claude for, allowing one to observe trends in AI usage. Paper: Clio: Privacy-Preserving Insights into Real-World AI Use.

OpenAI

Research Insights

LLM

  • Microsoft releases a small-but-capable model: Phi-4 (14B). It heavily uses synthetic data generation and post-training to improve performance (including on reasoning tasks).
  • Google’s Project Mariner, a chrome extension for agentic AI.
  • Google release Gemini 2.0 Flash Thinking, a reasoning model (available in AI studio).

Safety

  • Anthropic releases a new automated method to jailbreak AI models. By identifying this vulnerability, one can build future models to resist it. Paper: Best-of-N Jailbreaking (code). The method iteratively makes small changes to prompts, attempting to slide through countermeasures.
    • The flavor of successful attacks also gives insights into LLMs. Successful prompts may involve strange misspellings or capitalizations; or unusual images with text and colored boxes arranged peculiarly. This is similar to other adversarial attacks (e.g. on image classification models). They have a certain similarity to human optical illusions: generating perverse arrangements meant to trick otherwise useful processing circuits. Improved model training can progressively patch these avenues; but it’s hard to imagine models that completely eliminate them until one achieves truly robust intelligence.
  • Anthropic publish: Alignment Faking in Large Language Models. They find evidence for alignment faking, wherein the model selectively complies with an objective in training, in order to prevent modification of its behavior after training. Of course the setup elicited this behavior, but it is surprising in the sense that LLMs don’t have persistent memory/awareness, and troubling in the sense that this shows even LLMs can engage in somewhat sophisticated scheming (e.g. they have evidence for these decisions going on during the LLM forward-pass, not in chain-of-thought).
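The Best-of-N perturbation step can be sketched as benign character-level augmentation: random case flips and adjacent-character swaps, resampled until one variant slips through. Rates and names here are illustrative, not the paper’s settings:

```python
import random

def perturb(text, rng, p_case=0.1, p_swap=0.05):
    # Character-level augmentations in the spirit of Best-of-N sampling:
    # random case flips, then random adjacent-character swaps.
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < p_case:
            chars[i] = c.swapcase()
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)
```

The attack then simply samples many such variants and keeps any that evade the model’s refusal behavior, which is what makes the misspelling/capitalization flavor of successful prompts so recognizable.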

Video

Audio

  • ElevenLabs introduce a Flash TTS model, with latency of just 75 milliseconds.

World Synthesis

Science

Brain
