
Wang Xingxing's Public Challenge to VLA: Perhaps Just Ahead of the Autonomous Driving Industry?

Auto Lab 2025-08-14 09:13:42

"VLA can solve fully autonomous driving, but whether VLA is the most efficient way is still questionable. However, at this stage, VLA is the most capable architecture."

In May's "Li Auto AI Talk Season 2," Li Xiang laid the groundwork for the biggest selling point of the Li Auto i8: the VLA driver large model.

At the Li Auto i8 launch event two months later, about a quarter of the presentation was devoted to how capable the VLA driver model is.

In fact, not only Li Auto but also carmakers such as Great Wall, Chery, Zeekr, Xpeng, and Leapmotor are advancing research on and deployment of VLA models.

Across the intelligent driving industry as a whole, the VLA model has already replaced the end-to-end model as the focus of the new round of competition.

However, just as the intelligent driving industry is enthusiastically working on VLA models, Wang Xingxing, the CEO of Unitree Robotics and a prominent figure in the robotics industry, poured cold water on the VLA model without hesitation.

The VLA model is relatively "foolish."

On August 9th, at the 2025 World Robot Conference, Wang Xingxing bluntly stated that the VLA model is a relatively "foolish architecture."

Wang Xingxing also said that he holds a "rather skeptical attitude" toward the VLA model.

The statement immediately sparked heated debate. Huang Guan, the CEO of Jiajia Vision, criticized Wang Xingxing's view on social media, calling it "too amateurish" and suggesting that Wang Xingxing "shouldn't talk about AI anymore."

Before deciding whether Wang Xingxing's words are candid criticism or nonsense, let's first understand what the VLA driver large model is.

VLA stands for Vision-Language-Action: a model that takes both visual and language information as input and outputs driving actions in an end-to-end manner.

Li Xiang compares the product to a "driver agent," with a human driver as the natural point of comparison.

Just as you would talk to the driver when taking a taxi, you can talk to the VLA in the same way.

Simple, quick instructions are handled directly by the on-device VLA. Complex instructions that require comprehension are sent to a cloud-based foundation model, which analyzes and translates them before passing them back to the VLA.
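The routing described above can be illustrated with a minimal Python sketch. It is not Li Auto's implementation: the command list, the `route` and `cloud_translate` functions, and the `[parsed]` prefix are all hypothetical names chosen for illustration, and the cloud model is replaced by a simple stand-in.

```python
from dataclasses import dataclass

# Hypothetical list of short commands the on-device model handles directly.
SIMPLE_COMMANDS = {"speed up", "slow down", "turn left", "turn right", "pull over"}


@dataclass
class RoutedInstruction:
    text: str        # instruction finally handed to the on-device VLA
    via_cloud: bool  # True if the cloud model rewrote the instruction first


def cloud_translate(instruction: str) -> str:
    """Stand-in for the cloud foundation model (assumption: it returns a
    simplified instruction that the on-device VLA can act on)."""
    return f"[parsed] {instruction.strip().lower()}"


def route(instruction: str) -> RoutedInstruction:
    """Send simple commands straight to the VLA; send everything else
    through the cloud model first, then back to the VLA."""
    normalized = instruction.strip().lower()
    if normalized in SIMPLE_COMMANDS:
        return RoutedInstruction(text=normalized, via_cloud=False)
    return RoutedInstruction(text=cloud_translate(instruction), via_cloud=True)


if __name__ == "__main__":
    print(route("Slow down"))
    print(route("Find a shaded parking spot near the mall entrance"))
```

A real system would presumably classify instructions with a model rather than a fixed lookup; the fixed set here only keeps the two paths easy to see.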

Does this sound oddly familiar? Isn't this just a robot?

Yes, although the VLA model became more well-known because of intelligent driving, it was originally applied to robots.

In June 2024, a Stanford-led research team released OpenVLA, described as the world's first open-source VLA model, demonstrating that the VLA model has stronger generalization capabilities in real-world robot operation.

After being applied in the robotics and intelligent driving industries, the VLA model has indeed demonstrated significant practical value.

It is more like a fusion of the end-to-end model and the VLM. In special scenarios that even humans find challenging, such as tidal lanes and long-sequence reasoning, VLA thinks and understands in a more human-like way, while handling the situation better than humans do.

If the earlier VLM was still limited to 2D images, the VLA now has a complete brain, capable of solving problems through language and logical reasoning.

We at "Super Unboxing" have also tried it ahead of release. Interested readers can watch our video.

VLA may look like the key that unlocks autonomous driving, but according to Wang Xingxing, the current VLA model has a very tricky problem: the real-world interaction data available to it is insufficient.

To address this issue, Wang Xingxing said that Unitree tried adding reinforcement learning (RL) on top of the VLA model, but ultimately found it still "not enough."

Compared with VLA+RL, Wang Xingxing believes the world model is actually a better solution.

Wang Xingxing stated that last year, Unitree began using pre-trained action videos to drive robots to perform the actions shown in the video.

In Wang Xingxing's view, the video-driven world model direction may be more likely to converge than the VLA model.

However, Wang Xingxing expressed that he "doesn't dare to guarantee" whether the world model can achieve technological convergence.

The key reason is that, in Wang Xingxing's view, the world model places too high a demand on video generation quality, which leads to rather high GPU consumption.

However, Wang Xingxing also stated that for robots, the quality of video generation does not need to be very high.

Notably, before Wang Xingxing publicly questioned VLA, the Li Auto i8 launch event had already raised similar issues and also discussed the world model.

The arrow gradually points to the world model.

At the Li Auto i8 launch event, Li Auto's Senior Vice President of Autonomous Driving R&D, Lang Xianpeng, also spoke about the negative impact of insufficient data on the VLA model.

Lang Xianpeng shared that in human driving, highways and urban expressways account for more than 60% of total mileage, while country roads account for less than 1%. The distribution of human driving data across scenarios is therefore very uneven, and a model trained directly on it will perform very poorly.

Regarding this issue, Lang Xianpeng stated that Li Auto's solution is to develop a world model.

World models can generate scenarios that conform to the laws of the real physical world, thus compensating for the insufficiency of real vehicle data.
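To make the imbalance concrete, the sketch below estimates how much generated mileage each scenario would need for a training set to reach a chosen mix without discarding any real data. Everything in it is an assumption for illustration: the function name `synthetic_km_needed`, the mileage figures (chosen only to echo the "more than 60%" and "less than 1%" shares quoted above), and the target mix. It is not Li Auto's method.

```python
def synthetic_km_needed(
    real_km: dict[str, float], target_share: dict[str, float]
) -> dict[str, float]:
    """Return the extra (generated) mileage per scenario needed to reach
    the target shares while keeping all real mileage."""
    if abs(sum(target_share.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1")
    if any(share <= 0 for share in target_share.values()):
        raise ValueError("every target share must be positive")
    # Smallest total dataset size at which no scenario exceeds its target share.
    total = max(real_km[s] / target_share[s] for s in real_km)
    # Whatever each scenario still lacks at that size must be generated.
    return {s: max(0.0, total * target_share[s] - real_km[s]) for s in real_km}


if __name__ == "__main__":
    # Illustrative real mileage (km), not measured data.
    real = {"highway_expressway": 600.0, "country_road": 10.0, "other": 390.0}
    # Illustrative target mix for training.
    target = {"highway_expressway": 0.4, "country_road": 0.3, "other": 0.3}
    for scenario, km in synthetic_km_needed(real, target).items():
        print(f"{scenario}: {km:.0f} km of generated data")
```

With these illustrative inputs, almost all of the added mileage falls on the rarest scenario, which is where a data generator would matter most.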

In the automotive industry, NIO has applied the world model more deeply than Li Auto.

NIO announced its World Model as early as July last year, but the first version was not rolled out until May this year. Judging by its actual performance so far, however, the World Model has not been especially impressive.

According to the official information released by NIO, the NIO World model will have a stronger ability to understand space and model long sequences, thereby improving its performance in various scenarios.

Given that, let's wait and see.

In addition, after Wang Xingxing expressed "doubt" about the VLA model, Jiang Lei, the chief scientist of the National and Local Joint Engineering Research Center for Humanoid Robots, also shared his views at the World Robot Conference.

Jiang Lei stated that the perception-cognition-decision-execution loop has not yet been closed, and the VLA model needs to be restructured in order to seek a new solution paradigm.

Alexander Verl, Chairman of the Technical Committee of the International Federation of Robotics, spoke more candidly at the robotics conference about the technical limitations of the VLA model, which mainly include seven aspects:

Lack of memory; perception defects; lack of action; object confusion; low success rate; language comprehension defects; lack of feedback.

Verl also pointed out that training VLA models is expensive, reaching tens of millions of dollars even before the cost of preparing training data is counted.

As for solutions to these problems, Verl's ideas are broadly consistent with Wang Xingxing's: use world models for learning.

In Conclusion

The VLA model is currently the hottest direction in the intelligent driving industry, and most discussions focus on how powerful and valuable it is.

Therefore, Wang Xingxing's "skeptical" voice inevitably sounds somewhat jarring.

Harsh as it sounds, from the perspective of technological development, the VLA large model is destined to be only a way station before intelligent driving reaches its ultimate form.

Therefore, by openly expressing doubts about VLA, Wang Xingxing was actually ahead of the intelligent driving industry.

The ultimate convergence of intelligent driving may lie in the world model or in other solutions, and perhaps more exploration and discussion are needed.

