Monday, February 24, 2025

Penguin is adding a no-scrape-for-AI page to its books


Publishing giant Penguin Random House is taking a strong stance against the unlicensed use of its authors’ works by tech companies and will change the language on all copyright pages of its books to expressly ban their use in training artificial intelligence systems, according to reporting by The Bookseller.

It’s a notable departure from other major publishers, such as academic printers Taylor & Francis, Wiley and Oxford University Press, which have all agreed to license their portfolios to AI companies.

Matthew Sag, an AI and copyright expert at Emory University School of Law, said Penguin Random House’s new language appears to be aimed at the European Union market, but could also affect how AI companies in the US use its material. Under EU law, copyright holders can opt out of having their work mined for data. While that right isn’t enshrined in U.S. law, the largest AI developers generally don’t scrape content behind paywalls or content excluded by sites’ robots.txt files. “You would think there would be no reason why they wouldn’t respect this kind of opt-out [that Penguin Random House is including in its books] as long as it is a signal that they can process on a large scale,” said Sag.
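For readers unfamiliar with how a robots.txt exclusion works as a machine-readable signal, here is a minimal sketch using Python's standard library. The crawler name `ExampleAIBot`, the file contents and the URL are all hypothetical, and the sketch shows only the general robots.txt convention, not how any particular AI developer processes opt-outs or how a notice printed on a copyright page would be detected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks a made-up crawler called "ExampleAIBot"
# from the whole site and allows every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/books/chapter-1"  # placeholder URL
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        print(agent, may_fetch(ROBOTS_TXT, agent, page))
```

A crawler that honors the convention runs a check like this before each request, which is what makes the signal cheap to process at scale.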

Dozens of authors and media companies have filed lawsuits in the US against Google, Meta, Microsoft, OpenAI and other AI developers, accusing them of breaking the law by training large language models on copyrighted work. The tech companies argue that their actions fall under the fair use doctrine, which allows unlicensed use of copyrighted material under certain circumstances, for example if the derivative work substantially transforms the original content or if it is used for criticism, news reporting or education.

U.S. courts have not yet decided whether feeding a book into a large language model constitutes fair use. Meanwhile, social media trends in which users post messages telling technology platforms not to train AI models on their content have been predictably unsuccessful.

Penguin Random House’s “no-training” message is a little different from those optimistic copypastas. For one, social media users must agree to a platform’s terms of service, which invariably allow their content to be used to train AI. For another, Penguin Random House is a wealthy international publishing house that can back up its message with teams of lawyers.

The Bookseller reported that the publisher’s new copyright pages will read in part: “No part of this book may be used or reproduced in any way for the purpose of training artificial intelligence technologies or systems. In accordance with Article 4(3) of the Digital Single Market Directive 2019/790, Penguin Random House expressly excludes this work from the text and data mining exception.”

Tech companies like to mine the internet, especially sites like Reddit, for language datasets, but the quality of that content is often poor: full of bad advice, racism, sexism and all the other isms, which contribute to bias and inaccuracies in the resulting models. AI researchers have said that books are among the most desirable training data for models because of the quality of writing and fact-checking.

If Penguin Random House succeeds in locking its copyrighted content out of large language models, it could have a significant impact on the generative AI industry, forcing developers either to start paying for high-quality content, which would be a blow to business models that rely on using other people’s work for free, or to try selling clients models trained on low-quality internet content and outdated published material.

“The end game for companies like Penguin Random House opting out of AI training may be to satisfy the interests of authors who object to their works being used as training data for any reason, but it is likely that the publishing house can turn around and [start] charging licensing fees for access to training data,” Sag said. “If this is the world we end up in, AI companies will continue to train on the ‘open internet,’ but anyone in control of a reasonably large pile of text will want to opt out and charge for access. That seems like a pretty good compromise that will allow publishers and websites to monetize access without incurring impossible transaction costs for AI training in general.”


