
Is AI quietly sabotaging itself and the internet?


Interest in artificial intelligence continues to rise: Google searches for the term have reached 92% of their all-time high over the past twelve months. But recent research suggests that AI’s success could be its downfall. Amid the growth of AI-generated content online, a group of researchers from the Universities of Cambridge and Oxford set out to see what happens when generative AI tools query AI-produced content. What they found was alarming.

Dr. Ilia Shumailov of the University of Oxford and a team of researchers found that when generative AI software relies solely on content produced by generative AI, its responses begin to deteriorate, according to their study, published in Nature last month.

After the first two prompts, answers steadily drift off the mark; quality drops sharply by the fifth attempt, and by the ninth consecutive query the output has devolved into nonsensical pablum. The researchers attributed this downward spiral to model collapse: a degenerative process in which each generation of a model trains on data polluted by the output of earlier generations, until what it produces is a worthless distortion of reality.

“It’s surprising how quickly the model collapses and how elusive it can be. Initially, it affects minority data: data that is poorly represented. It then affects the diversity of the outputs and the variance decreases. Sometimes you’ll see small improvements on the majority data, which will mask the performance degradation on minority data. Model collapse can have serious consequences,” Shumailov explained in an email exchange.

This matters because English makes up 52% of the content on the internet, with the rest spread across 19 other languages. Of the web-based text that has been translated, about 57% has been translated into three or more languages, and the low quality of that content indicates it was likely machine translated, according to a separate study from a team of Amazon Web Services researchers published in June.

If human-generated data on the internet is quickly overtaken by machine-generated content, and the findings of Shumailov’s research hold, then it’s possible that AI is killing itself, and the internet along with it.

Researchers discovered that AI was fooling itself

Here’s how the team demonstrated model collapse. They started with a pre-trained model built on wiki-style text, then updated it repeatedly based on its own generated output. As that synthetic data polluted the original training set of facts, the model’s information steadily eroded into incomprehensibility.
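To get a feel for the shape of that feedback loop, here is a deliberately tiny analogue in Python; it is a sketch of the recursive setup, not the study’s actual LLM experiment, and the seed sentence is invented. A bigram Markov chain is retrained each generation on nothing but its own generated text:

```python
# Toy analogue of recursive training: a bigram Markov chain is retrained
# each generation on text it generated itself. Watch the vocabulary shrink.
import random
from collections import defaultdict

random.seed(1)

def train(words):
    # Map each word to the list of words observed to follow it.
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, length):
    # Random-walk the bigram table; restart at a random word on dead ends.
    word = random.choice(list(model))
    out = [word]
    while len(out) < length:
        followers = model.get(word)
        word = random.choice(followers) if followers else random.choice(list(model))
        out.append(word)
    return out

corpus = ("the old church tower stands over the village green while the "
          "bells ring out across the fields and over the river below").split()

for generation in range(10):
    print(f"generation {generation}: vocabulary = {len(set(corpus))} words")
    corpus = generate(train(corpus), len(corpus))
```

Each generation can only emit word pairs it sampled from the previous one, so any phrasing that fails to be drawn is lost for good; diversity ratchets downward in one direction, which is the same one-way degradation the wiki experiment documents at LLM scale.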

For example, by the ninth query cycle, an excerpt from the study’s wiki article on 14th-century English church towers had comically devolved into a mishmash of statements about jackrabbits with variously colored tails.

Another example in the Nature report involved a theoretical AI trained on dog breeds. Based on the study’s results, lesser-known breeds would be progressively excluded from the repeated datasets in favor of more popular breeds such as golden retrievers. The AI in effect creates its own de facto “use it or lose it” screening method that purges less popular breeds from its data memory, and with enough cycles of AI-only input, it is left capable of producing nothing but meaningless results.
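As a sketch of that “use it or lose it” dynamic (the breed names and starting frequencies below are invented, not taken from the study), the following snippet refits breed frequencies each generation from a finite sample of the previous model’s output; once a rare breed happens to draw zero samples, its estimated probability becomes exactly zero and it can never return:

```python
# Hypothetical breed-frequency collapse: each generation is a maximum-
# likelihood refit on a finite synthetic sample from the previous one.
import random
from collections import Counter

random.seed(42)
SAMPLE_SIZE = 50

# Invented starting frequencies; the rare breeds sit in the tail.
breeds = {"golden retriever": 0.60, "labrador": 0.30,
          "otterhound": 0.06, "skye terrier": 0.04}

for generation in range(8):
    survivors = sorted(b for b, p in breeds.items() if p > 0)
    print(f"generation {generation}: {survivors}")
    # Draw a finite synthetic dataset from the current model...
    sample = random.choices(list(breeds), weights=list(breeds.values()),
                            k=SAMPLE_SIZE)
    # ...and refit the next generation's frequencies on that sample alone.
    counts = Counter(sample)
    breeds = {b: counts[b] / SAMPLE_SIZE for b in breeds}
```

Run it a few times: the rare breeds usually vanish within a handful of generations while the golden retrievers persist, which is exactly the filtering the report describes.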

“Imagine that in practice you want to build an AI model that generates images of animals. Before machine-learning models, you could simply find images of animals online and build a model from them. Today it becomes more complex, because many photos online are not real and contain misconceptions introduced by other models,” Shumailov explains.

How does model collapse happen?

For some reason (and the researchers aren’t quite sure why), when AI feeds solely on a steady diet of its own synthetic data, it loses touch with the original thread of reality and tends to construct its answers from its own best recycled data points.
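The statistical core of that drift can be seen without any language model at all. In this minimal sketch (a stand-in illustration, not the study’s experiment), each generation fits a plain Gaussian to the previous generation’s samples and then emits only synthetic data; because every refit is based on a finite sample, estimation noise compounds across generations:

```python
# Minimal distributional model collapse: fit a Gaussian to synthetic
# samples, resample from the fit, repeat.
import random
import statistics

random.seed(7)
N = 200  # samples available per generation

samples = [random.gauss(0.0, 1.0) for _ in range(N)]  # generation 0: "real" data

for generation in range(101):
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: mean {mu:+.3f}, std {sigma:.3f}")
    # The next generation trains only on the current model's own output.
    samples = [random.gauss(mu, sigma) for _ in range(N)]
```

The fitted spread performs a random walk with a downward pull, so over enough generations it tends toward zero, and the tails of the distribution, the poorly represented “minority data” Shumailov mentions, are the first to disappear.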

But something is lost in that AI translation and factoid regurgitation.

The study concludes that the only way for artificial intelligence to achieve long-term sustainability is to ensure its access to the existing body of non-AI, human-produced content, along with a continuing stream of new human-generated content in the future.

The amount of AI-generated online content is increasing rapidly

These days, however, it seems you can’t swing a lolcat meme without hitting a piece of AI-generated content on the internet, and the situation could be worse than you think.

One AI expert and policy advisor has even predicted that, given the exponential growth in artificial intelligence adoption, 90% of all internet content will likely be AI-generated by sometime in 2025.

Even if the share of AI-produced material doesn’t reach 90% next year, it will still make up a disproportionately large portion of the available training content for future AI. That’s not a comforting prospect, given Shumailov’s findings and the lack of a clear solution to a problem that will only grow with the popularity of generative AI.

Houston, we have a problem: make that problems

No one knows what legal or regulatory guardrails will be enforced in the coming months and years that could restrict access to the existing body of human-originated content, or to the significant portions of it under copyright.

Furthermore, because so much of the internet’s current content is generated using AI, and there is no realistic way to slow that explosive trend, it will be a challenge for developers of next-generation AI algorithms to fully avoid this situation as the proportion of original human content shrinks.

Complicating matters further, Shumailov says it is becoming harder for developers to filter AI-created content out of training data at the scale large language models require, and no clear solution is in sight.

“Not at the moment. There is an active academic discussion going on and hopefully we will make progress in addressing the model collapse while minimizing associated costs,” says Shumailov.

“One option is community-wide coordination to ensure that the different parties involved in creating and deploying LLMs share the information needed to resolve provenance questions,” Shumailov added. “Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data crawled from the internet before the mass adoption of the technology, or direct access to data generated by humans at scale.”

Shumailov says the main implication of model collapse is the corruption of previously unbiased training sets, which will now tend toward errors, mistakes, and dishonesty. It would also amplify the misinformation and hallucinations (AI’s best guesses made without real data) that have already surfaced on several genAI platforms.

Given the steady march toward AI model collapse, everything online may eventually need to be verified via an immutable system such as a blockchain, or an equivalent of a ‘Good Housekeeping’ seal of approval, to ensure trust.
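For a sense of what the simplest version of such verification could look like, here is a hypothetical sketch of a hash chain, the tamper-evidence mechanism at the core of blockchains; the record fields and helper names are invented for illustration. Each record commits to the hash of its predecessor, so silently rewriting any earlier entry breaks verification of everything after it:

```python
# Hypothetical hash-chain sketch: each record stores the hash of the
# previous one, making any retroactive edit detectable on verification.
import hashlib
import json

def add_record(chain, content, author):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {"content": content, "author": author, "prev": prev_hash}
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(record)

def verify(chain):
    prev_hash = "0" * 64
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        body = {k: v for k, v in record.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True

chain = []
add_record(chain, "Human-written church tower article", "verified-human")
add_record(chain, "Follow-up edit", "verified-human")
print(verify(chain))            # True: the chain is intact
chain[0]["content"] = "AI-generated rewrite"
print(verify(chain))            # False: the tampering is exposed
```

A real deployment would also need identity, signatures, and distribution, but even this toy exhibits the property the paragraph above reaches for: provenance that cannot be quietly edited after the fact.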

Otherwise, the death of AI and the internet could well mean the death of truth.


Correction, September 5, 2024: The original version of this story stated that 57% of content translated into other languages on the internet was done through AI algorithms. It has been corrected to note that 57% of translated content on the internet is in three or more languages and, based on its quality, is likely machine translated.


