Nvidia's latest open-source framework enables robots to write their own Python code, achieving a success rate in 4 out of 7 core tasks that meets or exceeds human programmers.

Nvidia is extending its dominance from AI training to robotic control with the release of CaP-X, an open-source framework that allows robots to generate their own control software in real time. The framework's premier agent, CaP-Agent0, has already demonstrated performance in complex tasks that rivals or surpasses programs handwritten by human experts, signaling a major shift in how autonomous systems learn and adapt.
"I am very excited about the prospects of 'Code as Policy' (CaP) for robotics!" Ken Goldberg, a professor at UC Berkeley, said in a comment on the release.
In benchmark tests using the CaP-Bench framework, CaP-Agent0, which requires no prior training, achieved a success rate that matched or exceeded human-expert-written programs in four of seven core manipulation tasks. This performance was achieved using only the most basic atomic commands, a scenario where even advanced large models like OpenAI's o1 and Google's Gemini 3 Pro were shown to fail without the framework's structured approach. The CaP-X system also demonstrated superior robustness in long-horizon tasks compared to end-to-end models like OpenVLA.
This development solidifies the "Code as Policy" approach, where AI models generate explicit code rather than black-box neural network outputs. For Nvidia, this extends its moat from just selling the GPUs that train AI to providing the core software frameworks that run AI-powered robots. This move could capture significant value in the growing robotics and automation market, putting further pressure on competitors trying to build comprehensive AI ecosystems.
The release of CaP-X addresses key limitations in the two dominant approaches to robot control. Traditional methods require engineers to meticulously write code for every action, a process that is precise but brittle and fails to generalize to new objects or environments. More recently, end-to-end Vision-Language-Action (VLA) models, inspired by the success of large language models, have shown impressive capabilities. However, these VLA models operate as "black boxes," making them difficult to debug and often requiring massive new datasets to adapt to novel tasks.
The "Code as Policy" (CaP) paradigm, first proposed by Google in 2022, offers a third way. Instead of having a large model output an abstract action, it generates readable Python code that directly calls a robot's control APIs. Nvidia's CaP-X is a significant evolution of this idea. It creates a complete "harness" that allows a programming agent to not only write code but also receive feedback from the environment, debug its own mistakes, and save successful routines to a reusable skill library. In this framework, even a powerful VLA model can be treated as just another tool, called by a single line of code to handle a specific, complex manipulation task it excels at.
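To make the paradigm concrete, here is a minimal sketch of what a generated "policy" might look like. The API names (`detect`, `pick`, `place`) and the stub `Robot` class are hypothetical stand-ins invented for illustration, not Nvidia's actual CaP-X interface; the point is that the model's output is ordinary, readable Python calling control APIs rather than an opaque action vector.

```python
class Robot:
    """Minimal stub of a robot control API a generated policy might call.
    Every name here is an assumption for illustration only."""
    def __init__(self):
        self.log = []

    def detect(self, name):
        # A perception call returns a semantic object with a position.
        self.log.append(f"detect {name}")
        return {"name": name, "xyz": (0.4, 0.1, 0.02)}

    def pick(self, obj):
        self.log.append(f"pick {obj['name']}")

    def place(self, obj, xyz):
        self.log.append(f"place {obj['name']} at {xyz}")


# Under Code as Policy, the language model emits readable code like this
# instead of an abstract action:
def put_apple_in_bowl(robot):
    apple = robot.detect("apple")
    bowl = robot.detect("bowl")
    robot.pick(apple)
    robot.place(apple, bowl["xyz"])


robot = Robot()
put_apple_in_bowl(robot)
print(robot.log)
```

Because the policy is plain source code, a human (or the agent itself) can read the call log, spot the faulty step, and patch that one line, which is exactly what the black-box VLA approach cannot offer.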
CaP-X is not a single model but a suite of tools designed to work together. The core is CaP-Gym, an interactive environment that connects the AI "brain" to a simulated or physical robot, providing real-time feedback on each line of generated code. It includes built-in perception tools that translate raw images into semantic concepts like "an apple" or "a cup." On the control side, it abstracts away low-level joint movements, allowing the AI to program in a more intuitive Cartesian space.
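The harness idea described above can be sketched in a few lines. This is not CaP-Gym's real interface; `run_policy` and `FakeRobot` are invented here to show the mechanism: generated code is executed, and any error text is captured as feedback the agent can use to debug its next attempt.

```python
def run_policy(code, api):
    """Execute generated code against a robot API, returning
    (success, feedback) so the agent can debug its own mistakes.
    A hypothetical stand-in for a CaP-Gym-style harness."""
    try:
        exec(code, {"robot": api})
        return True, "ok"
    except Exception as e:
        # The error text becomes feedback for the next generation round.
        return False, f"{type(e).__name__}: {e}"


class FakeRobot:
    """Stub exposing a Cartesian-space command instead of joint angles."""
    def move_to(self, xyz):
        if len(xyz) != 3:
            raise ValueError("expected (x, y, z)")


# A buggy first attempt fails with a readable message...
ok, msg = run_policy("robot.move_to((0.3, 0.0))", FakeRobot())
print(ok, msg)  # False ValueError: expected (x, y, z)

# ...which the agent can use to produce a corrected second attempt.
ok2, msg2 = run_policy("robot.move_to((0.3, 0.0, 0.1))", FakeRobot())
print(ok2, msg2)  # True ok
```

Note that the agent programs in Cartesian coordinates, as the article describes: the harness, not the language model, owns the translation down to joint movements.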
To measure progress, the team developed CaP-Bench, a benchmark that specifically tests an AI's ability to write functional code for robots, recover from errors, and incorporate visual feedback. It was on this benchmark that CaP-Agent0, the framework's flagship agent, demonstrated its superiority. The agent uses a multi-round reasoning loop and can generate multiple potential code solutions in parallel to find one that works. When a solution succeeds, it's automatically added to a persistent skill library, allowing the agent to learn and improve over time. The research also introduces CaP-RL, which uses reinforcement learning to fine-tune the programming model itself, improving its coding intuition based on environmental feedback.
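The generate-test-store loop attributed to CaP-Agent0 can be sketched as follows. Everything here is assumed for illustration: `propose_solutions` stands in for sampling multiple candidate programs from the language model, and the dictionary skill library is a toy version of the persistent one the article describes.

```python
skill_library = {}  # task -> working code, persisted across calls


def propose_solutions(task):
    """Stand-in for parallel LLM sampling: returns candidate programs.
    The first candidate is deliberately broken to exercise the loop."""
    return [
        "raise RuntimeError('grasp failed')",
        "result.append('stacked: ' + task)",
    ]


def solve(task, rounds=3):
    """Multi-round loop: reuse a stored skill if one exists, otherwise
    try candidate programs until one runs without error, then store it."""
    if task in skill_library:
        return skill_library[task]
    for _ in range(rounds):
        for code in propose_solutions(task):
            result = []
            try:
                exec(code, {"result": result, "task": task})
            except Exception:
                continue  # in the real system, the error feeds the next round
            skill_library[task] = code  # save the successful routine
            return code
    return None


code = solve("block on block")
print(code)
print(skill_library)
```

The second call to `solve("block on block")` would return immediately from the library, which is the sense in which the agent "learns and improves over time" without retraining any weights.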
While CaP-X shows remarkable strength in logic and planning, the researchers note it can be less effective at tasks requiring high-frequency visual feedback, like pouring water. The most promising future direction is a hybrid approach, where a code-generating AI handles high-level strategy and error recovery while delegating fine-motor tasks to a specialized VLA model.
This article is for informational purposes only and does not constitute investment advice.