OpenAI Accused of Using Copyrighted Books to Train GPT-4o Model

A new report from the AI Disclosures Project has ignited a fresh debate over how leading AI companies source their training data—this time, with OpenAI under scrutiny.

According to the study, OpenAI’s latest large language model, GPT-4o, demonstrates a significant ability to recognize content from copyrighted, paywalled O’Reilly Media books. This revelation has raised serious concerns about the company’s data acquisition practices, particularly whether it obtained and used such data without adequate transparency or consent.

Evidence of Copyrighted Data in Model Training

The AI Disclosures Project, led by technologist Tim O’Reilly and economist Ilan Strauss, conducted an in-depth analysis using a dataset of 34 legally obtained O’Reilly books. They employed the DE-COP membership inference attack technique to evaluate whether OpenAI’s GPT models were trained on this material.
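
To make the method concrete, here is a minimal sketch of a DE-COP-style quiz in Python. The model is shown a book passage hidden among machine-generated paraphrases and asked to identify the verbatim original; accuracy consistently above the 25% chance level suggests the passage was seen during training. The `ask_model` helper is a hypothetical stand-in for a call to the model under test, and the study’s actual prompts, paraphrase generation, and scoring differ in detail.

```python
import random

LETTERS = "ABCD"

def build_quiz(verbatim: str, paraphrases: list[str]) -> tuple[str, str]:
    """Shuffle the verbatim passage in with three paraphrases and
    return the quiz prompt plus the correct option letter."""
    options = paraphrases + [verbatim]
    random.shuffle(options)
    answer = LETTERS[options.index(verbatim)]
    body = "\n".join(f"{LETTERS[i]}. {text}" for i, text in enumerate(options))
    return f"Which option is the verbatim text from the book?\n{body}", answer

def quiz_accuracy(samples: list[tuple[str, list[str]]], ask_model) -> float:
    """Fraction of quizzes where the model picks the verbatim passage.
    Chance is 25%; consistently higher accuracy on a book suggests
    the model saw that book during training."""
    correct = 0
    for verbatim, paraphrases in samples:
        prompt, answer = build_quiz(verbatim, paraphrases)
        if ask_model(prompt) == answer:
            correct += 1
    return correct / len(samples)
```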

Key findings include:

  • GPT-4o achieved an AUROC score of 82% when identifying paywalled O’Reilly content, indicating a strong likelihood that it was trained on that material (see the sketch after this list for how AUROC is computed).
  • In contrast, GPT-3.5 Turbo barely surpassed random chance, scoring just above 50%.
  • GPT-4o recognized non-public O’Reilly data far more reliably (82%) than publicly available excerpts (64%).
  • GPT-4o Mini, a scaled-down version, showed no significant recognition, suggesting it may not have been trained on the same data.
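
For context on the headline number, the sketch below shows how an AUROC is computed from per-passage recognition scores (for example, quiz accuracy from the sketch above). “Member” passages are those suspected of being in the training data; “non-member” passages are ones the model could not have seen. An AUROC of 82% means that, given one passage from each group at random, the member passage gets the higher score 82% of the time. The input values here are toy numbers for illustration, not the study’s data.

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Probability that a randomly chosen member passage outscores a
    randomly chosen non-member passage, counting ties as half.
    0.5 is chance; higher values mean the two groups separate cleanly."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Toy example: member passages mostly outscore non-member ones.
print(auroc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))  # -> 0.888...
```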

Where Did the Data Come From?

The report suggests that the content may have been accessed through the controversial LibGen platform, which is known for distributing copyrighted materials. All the books used in the study were found on LibGen, raising the possibility that OpenAI models may have been trained on pirated content.

Interestingly, the research also shows that newer AI models are becoming increasingly adept at distinguishing human-written from machine-generated text. That skill cuts both ways: a model that is simply good at spotting paraphrases could pass a quiz like DE-COP without having memorized the underlying text, which complicates efforts to determine the exact source of training data.

Ethical and Legal Implications

The implications of the findings are far-reaching. The researchers caution that if AI firms continue using copyrighted material without compensation, it could erode the economic foundation for professional content creation. This could lead to a decline in both quality and diversity of online content.

The AI Disclosures Project argues for stronger legal frameworks and transparency in AI development. They propose introducing liability policies that require companies to disclose the origins of their training data. This could pave the way for a commercial market where licensing and compensation for data usage become standard practice.

The EU AI Act is cited as a potential model. If fully implemented and enforced, it could initiate a new disclosure regime that gives IP holders insight into how their content is being used in AI training.

A Developing Market for Licensed AI Training Data

Despite the ethical gray areas, a shift may already be occurring. Companies such as Defined.ai are beginning to build marketplaces for legally sourced training data, ensuring consent and the removal of personal information.

Additionally, some AI developers are proactively forming partnerships with content providers to license material for training—such as OpenAI’s deal with the Financial Times. This suggests a growing awareness in the industry about the need for ethically sourced data and the importance of transparency.

For more on how companies are responding to these data challenges and building tools to manage AI model performance, see our coverage of Arthur’s real-time AI model monitoring tool.

Conclusion

This study highlights a critical inflection point in the evolution of artificial intelligence. As AI models grow more complex and influential, the origins of their training data must be clearly documented and the data itself ethically sourced. Without proper regulation and corporate accountability, the very foundation of digital content creation may be at risk.

Transparency is no longer optional—it’s a necessity for building trustworthy and sustainable AI systems.
