
AI progress is often measured by scale: bigger models, more data, more computing muscle. Every jump forward seemed to prove the same point: if you could throw more at it, the results would follow. For years, that equation held up, and each new dataset unlocked another level of AI capability. Now, however, there are signs that the formula is starting to crack. Even the largest labs, with funds and infrastructure to spare, are quietly asking a new question: where does the next round of truly useful training data come from?
That is the concern Goldman Sachs chief data officer Neema Raphael raised on a recent episode of the podcast AI Exchanged: The Role of Data, in which he discussed the issue with George Lee, co-head of the Goldman Sachs Global Institute, and Allison Nathan, a senior strategist in Goldman Sachs Research. “We’ve already run out of data,” he said.
What he meant is not that information has vanished, but that the internet’s best data has already been scraped and consumed, leaving models to feed increasingly on synthetic output. That shift may define the next phase of AI.
According to Raphael, the next phase of AI will be driven by the deep stores of proprietary data that are still waiting to be organized and put to work. For him, the gold rush is not over. It is simply moving to a new frontier.
To understand the critical role of data in GenAI, we must remember that a model can only perform as well as the material it learns from, and the freshness and range of that material shape its results. Early gains came from scraping the open web, pulling structured facts from Wikipedia, conversations from Reddit, and code from GitHub.
Those sources gave models enough breadth to move from narrow tools into systems that could write, translate, and even generate software. However, after years of harvesting, that stockpile is largely spent. The supply that once powered the leap in GenAI is no longer expanding fast enough to sustain the same pace of progress.
Raphael pointed to China’s DeepSeek as an example. Observers have suggested that one reason it may have been developed at relatively low cost is that it drew heavily on the results of earlier models rather than relying only on new data. He said the important question now is how much of the next generation of AI will be shaped by material that previous systems have already produced.
With the most useful parts of the web already harvested, many developers are now leaning on synthetic data in the form of machine-generated text, images, and code. Raphael described its growth as explosive, noting that computers can generate almost limitless training material.
That abundance may help extend progress, but he questioned how much of it is truly valuable. The line between useful information and filler is thin, and he warned that it could lead to a creative plateau. In his view, synthetic data can play a role in supporting AI, but it cannot replace the originality and depth that come only from human-created sources.
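To make the distinction between useful synthetic data and filler a little more concrete, here is a minimal, illustrative Python sketch. It is not drawn from the podcast: the sample texts and thresholds are invented, and it applies only the simplest heuristics (near-duplicate removal and length checks), whereas production pipelines score generations for fluency, factuality, and diversity with far more sophisticated tooling.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially reworded copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_synthetic_samples(samples, min_words=20, max_words=2000):
    """Keep synthetic samples that pass two simple quality heuristics:
    they are not duplicates of earlier samples, and they fall within a
    plausible length range."""
    seen_hashes = set()
    kept = []
    for text in samples:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # drop duplicates, a common failure mode of bulk generation
        word_count = len(text.split())
        if not (min_words <= word_count <= max_words):
            continue  # drop trivially short or runaway-long generations
        seen_hashes.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    # Hypothetical machine-generated samples for illustration only.
    synthetic = [
        "The quick brown fox jumps over the lazy dog. " * 5,
        "The quick brown fox jumps over the lazy dog. " * 5,  # exact duplicate
        "Too short.",
    ]
    print(f"kept {len(filter_synthetic_samples(synthetic))} of {len(synthetic)} samples")
```

Even this toy filter discards two of the three generated samples, which hints at why Raphael questions how much of the seemingly limitless synthetic supply is genuinely valuable.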
Raphael is not the only one raising the alarm. Many in the field now talk about “peak data,” the point at which the best of the web has already been used up. Since ChatGPT first took off three years ago, that warning has grown louder.
In December last year, OpenAI co-founder Ilya Sutskever told a conference audience that almost all of the useful material online had already been consumed by existing models. “Data is the fossil fuel of A.I.,” Sutskever said while speaking at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver.
Sutskever said the fast pace of AI progress “will unquestionably end” once that source is gone. Raphael shared the same concern but argued that the answer may lie in finding and preparing new pools of information that remain untapped.
The data squeeze is not just a technical challenge; it has major economic consequences. Training the largest systems already runs into hundreds of millions of dollars, and the cost will rise further as the easy supply of web material disappears. DeepSeek drew attention because it was said to have trained a strong model at a fraction of the usual expense by reusing earlier outputs.
If that approach proves effective, it could challenge the dominance of U.S. labs that have relied on massive budgets. At the same time, the hunt for reliable datasets is likely to drive more deals, as firms in finance, healthcare, and science look to lock in the data that can give them an edge.
Raphael stressed that the shortage of open web material does not mean the well is dry. He pointed to large pools of data still hidden inside companies and institutions. Financial records, client interactions, healthcare files, and industrial logs are examples of proprietary data that remain underused.
The difficulty is not just collecting it. Much of this material has been treated as waste, scattered across systems and full of inconsistencies. Turning it into something useful requires careful work. Data has to be cleaned, organized, and linked before it can be trusted by a model.
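As a rough illustration of that clean-organize-link step, the sketch below joins records from two hypothetical internal systems on a shared customer ID after normalizing inconsistent formats. The field names and data are invented for the example; real enterprise pipelines add schema mapping, entity resolution, and governance controls that go well beyond this.

```python
from datetime import datetime

# Hypothetical records from two internal systems, showing the kind of
# inconsistencies (casing, stray whitespace, mixed date formats) that make
# proprietary data hard to use as-is.
crm_records = [
    {"customer_id": " C-1001 ", "name": "ACME CORP", "signup": "2021-03-05"},
    {"customer_id": "C-1002", "name": "Globex  Inc", "signup": "05/07/2022"},
]
support_logs = [
    {"cust": "c-1001", "tickets": 14},
    {"cust": "C-1003", "tickets": 2},  # no matching CRM record
]


def clean_id(raw: str) -> str:
    """Normalize IDs so the same customer matches across systems."""
    return raw.strip().upper()


def clean_date(raw: str):
    """Coerce mixed date formats to ISO 8601; return None if unparseable."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def link_records(crm, logs):
    """Join cleaned CRM rows with support-ticket counts by customer ID."""
    tickets_by_id = {clean_id(row["cust"]): row["tickets"] for row in logs}
    linked = []
    for row in crm:
        cid = clean_id(row["customer_id"])
        linked.append({
            "customer_id": cid,
            "name": " ".join(row["name"].split()).title(),
            "signup": clean_date(row["signup"]),
            "tickets": tickets_by_id.get(cid, 0),
        })
    return linked


if __name__ == "__main__":
    for row in link_records(crm_records, support_logs):
        print(row)
```

The point of the sketch is the principle, not the code: until records like these are reconciled into a consistent, linkable form, the proprietary data Raphael describes cannot be trusted by a model.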
If that work is done, these reserves could push AI forward in ways that scraped web content no longer can. The race will then favor those who control the most valuable stores, raising questions about power and access. The open web may have given AI its first big leap, but that chapter is closing. If new data pools are unlocked, progress will continue, though likely at a slower and more uneven pace. If not, the industry may have already passed its high-water mark.