Meta trained its AI on copyrighted work, new lawsuit alleges

Major Publishers Sue Meta Over Llama AI’s Alleged Use of Pirated Books and Verbatim Outputs

Major book publishers and author Scott Turow filed a class-action lawsuit against Meta Platforms and its CEO Mark Zuckerberg on May 5, accusing the company of downloading millions of copyrighted works from pirate sites to train its Llama AI models.[1][2] The complaint details how Llama now generates summaries, near-verbatim passages, and even full imitations of textbooks, novels, and journal articles, threatening revenue for creators.[3] Filed in the U.S. District Court for the Southern District of New York, the suit marks a significant escalation in the ongoing clash between AI developers and the publishing industry.

Pirated Sources Fuel Llama’s Training

Meta engineers allegedly turned to notorious piracy hubs to build datasets for Llama versions 1 through 4.[1] Torrent files from sites like LibGen, Anna’s Archive, Sci-Hub, and Sci-Mag provided hundreds of terabytes of material, including over two million publications per download in 2022 and 2023.[1] Anna’s Archive alone supplied 81 terabytes of content from shadow libraries such as Z-Library.[4]

Web-scraped collections like Common Crawl, CCNet, and C4 further swelled the training data, capturing unauthorized copies from paywalled sites and piracy aggregators.[3] According to the complaint, Meta repeatedly copied these works during training – tokenizing text, cleaning metadata, and updating models – while stripping copyright notices and author attributions.[1] The process created trillions of tokens from high-quality sources like textbooks and scholarly journals, which Meta valued for their coherence.

Verbatim Reproductions and Knockoffs Exposed

Llama’s responses demonstrate direct ingestion of training data, the lawsuit claims.[1] Prompted with opening lines from Cengage’s Calculus: Early Transcendentals (9th edition) by James Stewart, the model continued word-for-word, replicating examples on envelope costs and seismographs.[3] For Elsevier journal articles, it produced lengthy, error-riddled “summaries” that went far beyond simple recaps.

Other outputs include plot breakdowns of Macmillan’s A Darker Shade of Magic by V.E. Schwab and chapters imitating Sylvia Day’s style in One With You.[1] When prompted about Turow’s Presumed Innocent, Llama admitted to having been trained on its text, while a request for a sequel to Innocent yielded a 5,000-word fanfic with original characters and settings.[1] The works cited range from Hachette titles like N.K. Jemisin’s The Fifth Season to McGraw Hill’s The Art of Public Speaking.

  • Calculus: Early Transcendentals (Cengage) – Verbatim section continuation.
  • Innocent (Turow) – Unauthorized 10-chapter sequel.
  • The Wild Robot (Hachette) – Style-mimicking derivatives.
  • Elsevier articles – Erroneous but detailed paraphrases.

Zuckerberg’s Directives and Abandoned Deals

The complaint pins direct responsibility on Zuckerberg, who reportedly escalated data acquisition decisions and greenlit torrenting after licensing talks stalled.[2] Meta initially approached publishers about licensing deals after the release of Llama 1 but shifted strategy under his guidance, opting instead to rely on a “fair use” defense.[1] Internal memos highlighted the legal risks, yet the torrenting allegedly went ahead.

Plaintiffs argue this willful approach not only reproduced works but distributed them via peer-to-peer torrents, where Meta uploaded as much as it downloaded.[1] The suit seeks statutory damages, injunctions, and model audits, claiming AI-generated knockoffs already flood Amazon, displacing originals.[4]

Escalation in the AI Copyright Wars

This suit differs from prior author cases, where Meta prevailed on fair use for training in 2025.[2] Publishers emphasize output harms – verbatim regurgitation and market substitution – over mere ingestion.[5] Anthropic settled a similar claim for $1.5 billion last year.

Meta responded firmly: “We will fight this lawsuit aggressively,” citing court precedents on AI training as fair use and its role in innovation.[2] The company views Llama, downloaded over a billion times, as key to products like Meta AI.

As courts grapple with AI’s appetite for data, this case could redefine boundaries between transformative tech and protected expression. Publishers warn of eroded incentives for human creativity, while Meta bets on legal evolution favoring rapid advancement. The outcome may influence licensing norms and guardrails for future models.

About the author
Lucas Hayes
