OpenAI accused of training LLMs on copyrighted O'Reilly books

  • Voltaire Staff
  • Apr 2
  • 2 min read


OpenAI has yet again come under fire for alleged unauthorised use of copyrighted materials in training its artificial intelligence models. 


A recent report by the AI Disclosures Project suggests that OpenAI's GPT-4o model was trained on paywalled books from O'Reilly Media without a licensing agreement.


The AI Disclosures Project, a nonprofit founded in 2024 by Tim O'Reilly and economist Ilan Strauss, released a study concluding that GPT-4o demonstrates strong recognition of non-public O'Reilly Media book content. 


The study used DE-COP, a type of membership inference attack, to detect whether OpenAI's models were trained on copyrighted materials.
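For readers curious about the mechanics, the idea behind this kind of test can be sketched in a few lines. The sketch assumes a DE-COP-style quiz in which a model is shown one verbatim passage alongside paraphrased versions and asked to pick the original; a model that picks the verbatim passage far more often than chance may have seen it in training. The function names and the toy answers below are hypothetical illustrations, not the study's code or data.

```python
from typing import Sequence


def recognition_rate(picked: Sequence[int], verbatim: Sequence[int]) -> float:
    """Fraction of quiz questions where the model picked the verbatim passage."""
    if len(picked) != len(verbatim) or not picked:
        raise ValueError("need equal-length, non-empty answer lists")
    hits = sum(p == v for p, v in zip(picked, verbatim))
    return hits / len(picked)


def exceeds_chance(rate: float, n_options: int = 4) -> bool:
    """True if the rate beats random guessing among n_options choices.

    Assumption: four options per question (one verbatim, three paraphrases).
    """
    return rate > 1.0 / n_options


# Hypothetical toy data: index of the verbatim option vs. the model's pick.
verbatim_positions = [0, 2, 1, 3, 0, 1, 2, 3]
model_choices = [0, 2, 1, 0, 0, 1, 3, 3]

rate = recognition_rate(model_choices, verbatim_positions)
print(rate, exceeds_chance(rate))
```

A real analysis would need far more questions and a proper statistical comparison between books the model could and could not have seen; the simple threshold here only illustrates the intuition.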


"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo," the co-authors wrote. 


The study analysed 13,962 paragraph excerpts from 34 O'Reilly books, both public and non-public, and found a higher probability of recognition in GPT-4o than in previous models, suggesting that OpenAI may have used paywalled data without permission.


All the same, the researchers acknowledged that the books could have entered the LLM training dataset via "free access" sites that may have pilfered the content.


"Such access violations might have occurred via the LibGen database, as all of the O'Reilly books tested were found in it," they wrote.


This is not the first time OpenAI has been accused of copyright violations. The company has faced multiple lawsuits from authors, news organisations, and media companies over the alleged use of copyrighted material in AI training.


In 2023, prominent authors including George R R Martin and John Grisham joined a class-action lawsuit accusing OpenAI of using their books without consent.


In December 2023, The New York Times sued OpenAI, alleging that ChatGPT was trained on its proprietary articles, enabling users to generate near-verbatim copies of its content.


Tech giants and content creators have raised concerns about AI models being trained on vast amounts of data scraped from the internet without proper licensing or attribution.


Amid increasing scrutiny, OpenAI has sought licensing agreements with major media publishers to legitimise its data sources.


Over the years, the company has struck deals with publishers such as The Associated Press (AP), Axel Springer, Financial Times, and Le Monde. 


