Self-driving cars and humanoid robots that can walk, talk and work alongside us are just two of the amazing ways AI promises to change the world in the near future.
But to be able to operate safely and effectively, these physical tools and AI applications must be able to understand the world.
At this year’s Consumer Electronics Show in Las Vegas, Nvidia announced the launch of its Cosmos platform, designed to accelerate the development of physical AI systems.
Described as a “ChatGPT moment for robotics,” Cosmos is capable of generating large amounts of synthetic data. This is data that, despite being artificially generated, is close enough to the real world that robots, self-driving cars, and other physical AI algorithms should be able to learn from it.
However, some people believe that no amount of synthetic data can fully simulate every real-world scenario that cars will need to prepare for. That’s why Tesla, for example, has spent years collecting real-world data with its sensor-laden cars. CEO Elon Musk tweeted: “Two sources of data scale infinitely: synthetic data, which has an ‘is it true?’ problem and video of the real world, which does not.”
The argument is that synthetic data lacks the unpredictability and chaotic complexity of the real world, and that exposure to this complexity is essential for building robust and safe AI systems. Let’s look at this in a little more detail.
Synthetic vs. real-world data
In autonomous driving systems, visual data (images and video) is used to train algorithms that determine how vehicles will react to different road conditions and situations. This data can be captured by cameras attached to vehicles (real-world data). It can also be generated by AI algorithms according to rules learned from studying real-world data (synthetic data).
There are advantages and disadvantages to both methods.
Synthetic data can often be produced much more quickly and cost-effectively than real-world data. No one has to go out and collect it – it is simply created by machines.
This can also have safety benefits. For example, testing self-driving cars on the road comes with an element of risk, which can be eliminated if the trips are simply simulated.
Situations, environments, and many other variables can also be customized, rather than waiting for the right conditions to arise in the real world. For example, researchers can simulate rare weather events, test autonomous vehicles in dangerous scenarios, or model complex manufacturing defects without real-world risks or delays.
Synthetic data generation can also reduce or eliminate privacy and data protection concerns that may apply in the real world, as there is no risk of sensitive personal data being inadvertently stored or compromised.
That risk is real when collecting real-world data. For example, license plates captured by an autonomous car’s cameras can be linked to the vehicles’ owners and used to identify and track them.
Real-world data, on the other hand, as Musk points out, has the undeniable advantage of being more authentic. Chaotic and unpredictable human behaviors that are difficult to generate synthetically are more likely to be accounted for in the data.
Regulation can also be an issue. AI laws are evolving rapidly, and regulators may at some point, or in some jurisdictions, require certain models or applications to be trained on real-world data for safety reasons.
Weighing the options
In practice, both real-world and synthetic data are likely to be essential for training the next generation of autonomous vehicles and AI-powered robots.
Both offer distinct advantages and challenges, and adopting a hybrid approach is likely to be the best path to success.
The trick will be identifying which is best suited to particular use cases. For example, synthetic data may be more useful for tasks or applications that involve processing sensitive information or operating in hazardous conditions.
Real-world data, on the other hand, may be best when it comes to capturing dynamic human behavior, or when systems are likely to encounter chaotic, unpredictable events.
This means that AI projects that adopt a balanced approach, led by those who understand how synthetic and real-world information can complement rather than compete with each other, are more likely to create real business value.