Elon Musk says AI has already exhausted all human data for training
Tesla CEO Elon Musk said last year that all human data available for AI training, including books, was exhausted, joining other experts who have come to similar conclusions.
Musk, who also owns an AI company, xAI, made the remarks during a live chat with Stagwell CEO Mark Penn, which aired on X.
Former OpenAI chief scientist Ilya Sutskever had previously hinted at this in December, noting that the AI industry had reached what he called "peak data" and predicting that the lack of training data would force a shift away from the way models are developed today.
Moving to synthetic data
According to Musk, the next option available for AI training is now synthetic data, which is data generated by the AI itself. "AI is advancing on the hardware side, and on the software side it's now moving to synthetic data, because we've exhausted all human data. We've literally exhausted the entire internet, every book ever written, and every interesting video.
"We have exhausted the cumulative amount of human knowledge when it comes to AI training, and this happened last year. So, the only way to achieve this is to use synthetic data, which AI creates.
"He'll write an essay or come up with a thesis, and then he'll self-assess and go through this self-learning process with synthetic data," Musk said. Challenges of Using Synthetic Data
The Tesla CEO, however, noted that using synthetic data to train AI comes with its own challenges, particularly in verifying the accuracy of the model's outputs.
"It's always a challenge, because how do you know if the response is hallucinatory or real? Then it's hard to find the underlying truth," he said.
Furthermore, some researchers have also suggested that synthetic data can lead to model collapse, where a model becomes less "creative" and more biased in its results, ultimately seriously compromising its functionality.
Gartner estimated that 60% of the data used for AI and analytics projects in 2024 would be synthetically generated. Microsoft's Phi-4, which was open-sourced Wednesday morning, was trained on synthetic data alongside real-world data, as were Google's Gemma models. Anthropic used some synthetic data to develop one of its most capable systems, Claude 3.5 Sonnet, and Meta fine-tuned its latest series of Llama models on AI-generated data.