Training AI With Protected Works: Is Copyright Law Designed to Respond?

Many creators feel very strongly that “training” AI models with unlicensed, copyrighted works is unjust, not least because generative AIs built on their creativity will put some creators out of business while enriching more tech moguls. It is both insult and injury to see one’s work used, without consideration, to underwrite the mechanism of one’s own obsolescence. But regardless of how we may feel about the practice of “machine learning” (ML) with unlicensed material, it remains to be seen whether and where current law provides any remedies. I’ll try to consider that topic in this post and the next post, beginning with the allegation that ML is mass copyright infringement.

Four class action lawsuits against generative AI developers have been filed thus far in the District Court for the Northern District of California, all by the same law firm. Because all the complaints are similar, I will stick to the two that were filed first. In Andersen et al. v. Stability AI et al., a class of visual artists is suing Stability AI and Midjourney;[1] and in Tremblay et al. v. OpenAI, a class of book authors is suing OpenAI over the development of ChatGPT.[2] Both complaints allege direct and vicarious copyright infringement as well as unlawful removal of copyright management information (CMI). Both complaints also contain counts for violation of the derivative works right under §106(2), and based on that theory, the Andersen complaint alleges unlawful making available of said derivative works in violation of §106(3), (4), and (5). The complaints also contain state law allegations, but I will discuss those in the next post.

Reproduction and the Battle of Analogies

The question of whether ML with copyrighted works constitutes an act of mass infringement will turn on the factual consideration as to whether any copying occurs in violation of the reproduction right (§106(1)). In Andersen and Tremblay, there is considerable focus on the potential of a generative AI to output an infringing work based on its training corpus. For instance, if the work of Karla Ortiz (one of the named plaintiffs in Andersen) is part of the ingested materials, then the assumption is that the AI model has the potential to produce a copy of an existing Ortiz work or a work that is substantially similar to an Ortiz work.

The reproduction inquiry may be different for each model and each type of work used for input. In Andersen, the complaint states, “Because a trained diffusion model can produce a copy of any of its Training Images—which could number in the billions—the diffusion model can be considered an alternative way of storing a copy of those images.” By contrast, the Tremblay complaint alleges that copying occurs, but it does not specifically describe how the ChatGPT training process entails reproduction. “During training, the large language model copies each piece of text in the training dataset and extracts expressive information from it,” the complaint states.

If the AI system produces any copies of any of its training materials, this is evidence that the system violates the reproduction right. Prompt the generator to make an image of Dr. Strange, and if Dr. Strange comes out, then nobody can doubt that Dr. Strange is a latent copy in the system and that this potential to copy is sufficient evidence of infringement at the input stage. Alternatively, if the system can only produce work “in the style of” Karla Ortiz, this raises different issues (and very serious concerns) but may not be considered sufficient evidence of “reproduction” in the input process. But the courts need not look at outputs, or even potential outputs, to find violation of the reproduction right.

It has been held (specifically in the Ninth Circuit)[3] that even storing a copy in random access memory (RAM) is sufficient to find a violation of the reproduction right. The AI developers will seek to prove that their systems do not copy the works ingested in any sense, or that if they do, they copy only non-protected (i.e., factual) elements of the works. Using anthropomorphic words like observe, learn, and study to describe ML, the developers will argue that these models are designed to obtain information about the works but not to copy the works anywhere in the system. Input an illustration, for example, and what the system allegedly stores are millions of data points about line weights, composition, colors, shading, etc. Then, combined with billions of other data points from billions of other works, the model generates probability algorithms which are then used to produce new visual works when users prompt the system with instructions.

AI developers like to compare “training” their models to the learning a human artist does when she experiences or studies works other than her own. In addition to being a reductive and dehumanizing analogy for the ways in which artists teach themselves a craft, this line of reasoning may be seen by the courts as smoke and mirrors. The factual question is whether the system retains a copy long enough to be perceived by the machine, which has been held to be violative of §106(1). Long-term storage of a copy is not required, and my understanding is that making a “more than fleeting” copy is unavoidable in any computer system—i.e., that there is no such thing as ingestion without reproduction.

Proving reproduction will be the whole ballgame insofar as litigation can address whether feeding a corpus of protected works is a violation of law. We shall see what the courts make of the facts presented, but without finding reproduction, the other copyright complaints likely fall. For instance, removal of CMI is not a stand-alone violation. Section 1202 of the DMCA states that removal is a violation if the party doing the removing knows or has reasonable grounds to know “that it will induce, enable, facilitate, or conceal an infringement of any right under this title.” Therefore, there must be a colorable claim of infringement for the CMI allegation to survive.

Derivative Works Allegations

Both the Andersen and Tremblay complaints allege that the AIs produce unlicensed derivative works in violation of §106(2), though the arguments are different in each case. In Andersen, the allegation arises from the premise that the system cannot produce anything outside the limitations of its data set composed of protected works. “The resulting image [output] is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images.…a latent diffusion system…can never exceed the limitations of its Training Images.”

It’s an interesting theory, but I’m not sure anything in copyright law can support the argument that all potential outputs of the generative AI are unauthorized derivatives of the total corpus of works in the training set. To find an infringing derivative of a visual work (typically one image) requires a substantial similarity inquiry comparing a specific original with the follow-on work to determine what has been copied and whether that copying renders the second work a derivative of the first. This is difficult enough in the world of humans intentionally using a single visual work to produce a different visual work (see Goldsmith v. Warhol!!). So, it seems highly speculative to ask a court to find generally that billions of images output are, as a matter of law, derivatives of the billions of images input. I’m not certain the court has anywhere to look for guidance to consider this reading of the derivative works right.

If this derivative works theory is tough with images, it would be even harder with text—i.e., to allege that the textual outputs are derivatives of all the textual inputs is akin to saying that every book written is a derivative of every book read. This echoes a popular sentiment among the anti-copyright crowd that no work is “original,” a premise that should not be given any legal weight, even in the service of trying to protect creators from AI developers.

In Tremblay, the allegation is not that the individual outputs of ChatGPT are derivatives of the corpus of books used in training, but that the entire model is a single derivative work of its corpus. “Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act,” the complaint states. [Emphasis added]

Again, claiming that the entire LLM is a single derivative work of the millions of literary works fed into the system would seem to strain the derivative works right beyond the limit where any court can venture. In fact, this allegation could potentially bolster the inevitable fair use defense the AI developers will be arguing—namely that the finding of “transformative use” in Google Books favors fair use of the corpus of work used in ML.

Fair Use & Google Books

Notably, these cases are brought in California, controlled by the Ninth Circuit and, therefore, not bound by the Second Circuit decision in Google Books, which many believe to be the strongest precedent favoring fair use for the AI developers. The comparison is a natural one. Google scanned whole books into a system to create a unique tool for searching the contents of books without providing any whole-copy substitutes for legally obtained copies. The court, noting that its decision “pushed the boundaries of fair use,” found under factor one that Google Books is “transformative” for its utility and found under factor four that it did not pose a threat to the market for the books used.

What the AI developers will try to argue under Google Books is that 1) their systems are highly “transformative” because they use protected works to create novel (even revolutionary) applications; and 2) their systems are designed to avoid outputting any copies that would serve as substitutes for the works in the data set. It is conceivable that courts or juries would find the comparison compelling, though the aforementioned capacity of a given AI to output Dr. Strange means that, unlike Google Books, the visual AI system at issue does make substitutes available and, therefore, the precedent is inapt.

By contrast, ChatGPT or another text-based application could have a stronger defense under Google Books if it is not possible, for instance, to have the system output an entire in-copyright literary work. The Tremblay complaint refers to the output of summaries, which is evidence that a whole book was ingested, but a summary is not generally an infringement and is certainly not a substitutional copy.

Meanwhile, other considerations should perhaps militate against finding fair use for generative AI model training. For instance, Google Books is a research tool for humans to learn about books written by other humans, including humans who write more books. Generative AIs are not necessarily comparable. Stable Diffusion, for example, does not provide a user with any information about an ingested work, and it poses a threat to professional visual artists unlike that of any technology that has come before. Thus, the courts should consider the sui generis purpose of the generative AI at issue when citing Google Books or any other precedent to consider fair use.

In a May post, I proposed that unless the generative AI at issue can show that it promotes authorship, the court should decline to consider a fair use defense. To clarify, in Campbell, the Supreme Court states, “The fair use doctrine thus ‘permits [and requires] courts to avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster.’”[4] Until generative AI changed the landscape, there was no need to affirm that “the very creativity” fostered by copyright means “human creativity.” But today, that distinction is necessary. Although generative AI can produce volumes of “creative” material, only those works which can be protected by copyright are works of authorship. And just as it is indecent to exploit an artist’s work to build a machine that might end her career, it would be absurd to allow fair use (a component of copyright law) to defend a technology that would potentially annihilate copyright’s purpose.

Of course, that’s one man’s opinion, and one that would apply to some, but not all, works derived by generative AI. As these tools develop, and their uses are explored by various types of creators, there are examples, both in practice and in theory, where we can find that generative AI does foster new authorship. This gets into the complicated question of copyrightability of works that humans create with some AI used in the process, and because this is itself a new discussion, it is difficult to say which generative AIs, if any, can be said to “promote the progress” of authorship as a matter of law.

Legal experts, both pro- and anti-copyright, will comment upon the strengths and weaknesses of Andersen, Tremblay, and the other cases brought by the one firm that has taken the lead on these lawsuits. But even where these cases may be flawed, they can provide some insight into the question posed by this essay: is copyright law an answer to the potential hazards of generative AI? I suspect that a fundamental difficulty arises because generative AI poses an existential threat to the future of authors, and some of the injustices and cultural calamities inherent to that threat may not be remedied (or entirely remedied) by the principles of copyright. Remedies sounding in other areas of law could loom larger, especially for certain types of creators, and that will be the subject of the next post.

[1] Deviant Art is also a named defendant being sued for breach of contract for providing works to Stability for ingestion.

[2] The same firm is now representing Sarah Silverman and another class of book authors, though the complaint is essentially the same as Tremblay.

[3] MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993).

[4] Citing Stewart v. Abend (1990).

Image by: idaakerblom