Is there a risk of a recursive degradation effect for LLMs over time due to training on subpar AI-generated content from the web?
Some estimates say that around 50% of all content on the web is now AI-generated, and that the share could rise to 90% by 2026. A more precise number depends on the measurement methodology. However, given how much longer it takes a human to write a quality blog post than it takes an AI to generate text that is then published, automatically or by a human, the direction of the trend is obvious. By 2030 or so, probably 99% of all content on the web will be AI-generated to some extent.
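To see why the share can climb so fast, here is a back-of-envelope sketch. Every number in it is a hypothetical assumption chosen only to illustrate the dynamic: if AI output grows exponentially while human output stays roughly flat, the AI share of cumulative web content heads toward 100% within a few years.

```python
# Back-of-envelope projection of the AI share of all web content.
# Every rate below is a hypothetical assumption, not a measurement.
human_pages_per_year = 1.0   # normalized human output, assumed roughly flat
ai_pages_per_year = 1.0      # AI output, assumed to start at parity with humans
ai_growth_per_year = 3.0     # ...and to triple every year

human_total, ai_total = 1.0, 1.0   # assumed existing stock of content
for year in range(2024, 2031):
    human_total += human_pages_per_year
    ai_total += ai_pages_per_year
    ai_pages_per_year *= ai_growth_per_year
    share = ai_total / (human_total + ai_total)
    print(f"{year}: cumulative AI share ≈ {share:.0%}")
# Under these assumed rates the share passes 90% before 2030 and approaches
# 99% by 2030; the exact trajectory depends entirely on the assumed growth.
```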
Google’s search results have deteriorated over time, and the company recently took steps to filter out content it deems lower quality, much of it AI-generated. This latest move is just one in a long line of efforts to keep low-quality material, such as SEO link farms, out of search results, so Google has extensive experience in this field.
So, if there is a fast rise in AI-generated content on the web that Google has taken active steps to counter, reportedly cutting around 45% of the low-quality content it previously surfaced, how could the various trainers of new LLMs, whose training datasets run to billions and even trillions of tokens, ensure the quality of that training data? They will probably have an increasingly hard time doing so.
The models themselves are comparably large: GPT-4, for instance, is reported (though not officially confirmed) to have around 1.76 trillion parameters, and the GPT series has grown in both parameter count and training-data volume with each version.
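To make "ensuring the quality of training data" concrete, a pre-training pipeline might apply document-level heuristics along the lines of the sketch below. The function name and thresholds are hypothetical, and real pipelines commonly add deduplication, perplexity filtering, and trained quality classifiers on top; the point is that none of these signals reliably separates fluent AI-generated text from human writing.

```python
# A minimal sketch of document-level quality heuristics for pre-training data.
# Function name and thresholds are hypothetical, for illustration only.
def passes_quality_filter(doc: str,
                          min_words: int = 50,
                          max_symbol_ratio: float = 0.10,
                          max_dup_line_ratio: float = 0.30) -> bool:
    """Crude document-level quality heuristics (illustrative only)."""
    words = doc.split()
    if len(words) < min_words:                         # too short to be useful
        return False
    symbols = sum(1 for c in doc if not c.isalnum() and not c.isspace())
    if symbols / max(len(doc), 1) > max_symbol_ratio:  # markup/boilerplate heavy
        return False
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > max_dup_line_ratio:
        return False                                   # heavily repeated lines
    # Note: passing says nothing about whether the text is AI- or human-written.
    return True
```

Rules like these catch spam, boilerplate, and SEO junk, but a fluent, factually shallow AI-written article sails straight through, which is why quality assurance gets harder as the AI share of the web grows.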
Thus, subpar AI content could make up an ever larger share of the web and be used as training data for new LLMs, which in turn generate more AI content that is published on the web and used as training data for the next generation of models, and so on.
In one experiment where a generative model was trained repeatedly on its own output, with no fresh quality data added, it went MAD (Model Autophagy Disorder) after five iterations, meaning the quality and diversity of its output degraded sharply.
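A toy version of that self-consuming loop is easy to sketch. This is not a reproduction of the cited experiment, just the same feedback structure in its simplest form, assuming a Gaussian "model" that is refit on its own samples each generation:

```python
# Toy self-consuming training loop: fit a Gaussian, sample a new training set
# from the fit, refit, and repeat with no fresh real data mixed back in.
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN, TRUE_STD = 0.0, 1.0   # the "real" data distribution
SAMPLE_SIZE = 50                 # size of each generation's training set
GENERATIONS = 100

data = rng.normal(TRUE_MEAN, TRUE_STD, SAMPLE_SIZE)  # generation 0: real data
for gen in range(1, GENERATIONS + 1):
    mean, std = data.mean(), data.std()        # "train" the model (MLE fit)
    data = rng.normal(mean, std, SAMPLE_SIZE)  # next training set = model output
    if gen in (1, 5, 10, 25, 50, 100):
        print(f"generation {gen:3d}: fitted mean={mean:+.3f}, std={std:.3f}")
# The fitted std tends to drift toward zero and the mean wanders away from the
# truth: with no fresh real data, each generation amplifies the previous one's
# sampling errors, which is the feedback loop described above.
```

Mixing even a modest amount of fresh real data back in at each generation slows or prevents this drift, which is one reason curated human-written data matters so much.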
Given the accelerating share of AI-generated content on the web, ensuring the quality of the training data will be extremely important for new LLMs.