Chabon v. Chatbot: About those ‘Shadow Libraries’

As many readers already know, another class-action lawsuit was filed on September 8 against OpenAI by book authors Michael Chabon, David Henry Hwang, Matthew Klam, Rachel Louise Snyder, and Ayelet Waldman on behalf of all authors similarly situated. The allegations are almost identical to the complaints in other class-action suits against various AI companies. I won’t repeat what I have already written about each allegation, but once again, I predict that if the court does not find unlawful reproduction in the transient copies necessarily made in RAM, OpenAI will likely prevail. Once again, this complaint alleges that the GPT model itself is an unlicensed “derivative work” of the entire corpus of books fed into it, but this does not seem to be a well-founded reading of the derivative works right under copyright law.

But one aspect of this complaint (as well as Tremblay et al.) is the allegation that OpenAI obtained part of its training data from known pirate repositories. In reference to one of the datasets used to train ChatGPT, the Chabon complaint states, “the only ‘internet-based books corpora’ that have ever offered that much material are infamous ‘shadow library’ websites, like Library Genesis (“LibGen”), Z-Library, Sci-Hub, and Bibliotik, which host massive collections of pirated books, research papers, and other text-based materials. The materials aggregated by these websites have also been available in bulk through torrent systems.” So, is the act of exploiting illegally obtained materials in this manner a violation of law?

Certainly, the Copyright Act does not address the issue. There is language about “lawfully made” copies in the context of the first sale doctrine and certain exceptions for libraries. The only two uses of the words “lawfully obtained” in Title 17 pertain to acquisition of a computer program and to permissible circumvention of technological protection measures for research purposes. So, nothing in the Copyright Act makes OpenAI’s scraping of “shadow libraries” an infringing act on its own, and there is no language in §107 on fair use that refers to lawfully making or obtaining material(s). Indeed, such a condition would be anathema to the doctrine, since a fair use defense presupposes an unlicensed use.

Still, it seems wrong (probably because it is) to profit by exploiting another party’s unlawful possession of valuable materials. Under the criminal code (Title 18 §2315), it is a “federal offense to receive, possess, barter, sell, or dispose of stolen property with an aggregate value of $5,000 or more if the property crosses state lines.” The statute refers to physical property and not to exploiting databases full of pirated material. But if an AI developer knowingly exploits repositories replete with unlicensed copies of works, doesn’t that sound like it should be illegal?

This discussion reminds me a little bit of the rationale for the Protecting Lawful Streaming Act of 2020, which elevated the unauthorized public performance of works via streaming from a misdemeanor to a felony. After years of debate—and allegations by anti-copyright groups that felony streaming would be disastrous—Congress recognized that unlawful streaming is effectively a digital-age version of mass bootlegging physical copies, which had long been a felony. In fact, streaming is worse because it can reach a much larger black market than any bootlegger distributing physical products ever could.

So, under a rationale similar to the one by which Congress made the large-scale streaming of unlicensed works a felony, perhaps lawmakers might broaden the scope of Title 18 §2315 to prohibit mass exploitation of digital warehouses full of illegal copies of copyrighted works. Certainly, these warehouses contain materials with aggregate values in the tens of millions of dollars. Hence, any party that knowingly exploits these warehouses for financial gain might reasonably be liable under the criminal code.

Authors and artists are justifiably angry that their works are being used without permission to train generative AIs. And the fact that ChatGPT was allegedly trained in part with corpora of literary material acquired and stored by media pirates is salt in the wound, to say the least. I don’t know what, if any, legal remedies might be proposed, but I am confident that it is generally wrong to profit from the intentional use of ill-gotten goods.



