Training AI With Protected Works: Is Copyright Law Designed to Respond?

Many creators feel very strongly that “training” AI models with unlicensed, copyrighted works is unjust, not least because generative AIs built on their creativity will put some creators out of business while enriching more tech moguls. It is both insult and injury to see one’s work used, without consideration, to underwrite the mechanism of one’s own obsolescence. But regardless of how we may feel about the practice of “machine learning” (ML) with unlicensed material, it remains to be seen whether and where current law provides any remedies. I’ll try to consider that topic in this post and the next post, beginning with the allegation that ML is mass copyright infringement.

Four class action lawsuits against generative AI developers have been filed thus far in the District Court for the Northern District of California, all by the same law firm. Because all the complaints are similar, I will stick to the two that were filed first. In Andersen et al. v. Stability AI et al., a class of visual artists is suing Stability AI and Midjourney;[1] and in Tremblay et al. v. OpenAI, a class of book authors is suing OpenAI over the development of ChatGPT.[2] Both complaints allege direct and vicarious copyright infringement as well as unlawful removal of copyright management information (CMI). Both complaints also contain counts for violation of the derivative works right (§106(2)), and based on that theory, the Andersen complaint alleges unlawful making available of said derivative works in violation of §106(3), (4), and (5). The complaints also contain state law allegations, but I will discuss those in the next post.

Reproduction and the Battle of Analogies

The question of whether ML with copyrighted works constitutes an act of mass infringement will turn on the factual consideration as to whether any copying occurs in violation of the reproduction right (§106(1)). In Andersen and Tremblay, there is considerable focus on the potential of a generative AI to output an infringing work based on its training corpus. For instance, if the work of Karla Ortiz (one of the named plaintiffs in Andersen) is part of the ingested materials, then the assumption is that the AI model has the potential to produce a copy of an existing Ortiz work or a work that is substantially similar to an Ortiz work.

The reproduction inquiry may be different for each model and each type of work used for input. In Andersen, the complaint states, “Because a trained diffusion model can produce a copy of any of its Training Images—which could number in the billions—the diffusion model can be considered an alternative way of storing a copy of those images.” By contrast, the Tremblay complaint alleges that copying occurs, but it does not specifically describe how the ChatGPT training process entails reproduction. “During training, the large language model copies each piece of text in the training dataset and extracts expressive information from it,” the complaint states.

If the AI system produces any copies of any of its training materials, this is evidence that the system violates the reproduction right. Prompt the generator to make an image of Dr. Strange, and if Dr. Strange comes out, then nobody can doubt that Dr. Strange is a latent copy in the system and that this potential to copy is sufficient evidence of infringement at the input stage. Alternatively, if the system can only produce work “in the style of” Karla Ortiz, this raises different issues (and very serious concerns) but may not be considered sufficient evidence of “reproduction” in the input process. But the courts need not look at outputs, or even potential outputs, to find violation of the reproduction right.

It has been held (specifically in the 9th Circuit)[3] that even storing a copy in random access memory (RAM) is sufficient to find a violation of the reproduction right. The AI developers will seek to prove that their systems do not copy the works ingested in any sense, or that if they do, they copy only non-protected (i.e., factual) elements of the works. Using anthropomorphic words like observe, learn, study, etc. to describe ML, the argument from the developers will be that these models are designed to obtain information about the works but not copy the works anywhere in the system. Input an illustration, for example, and what the system allegedly stores are millions of data points about line weights, composition, colors, shading, etc. Then, combined with billions of other data points from billions of other works, the model generates probability algorithms which are then used to produce new visual works when users prompt the system with instructions.

AI developers like to compare “training” their models to the learning a human artist does when she experiences or studies works other than her own. In addition to being a reductive and dehumanizing analogy for the ways in which artists teach themselves a craft, this line of reasoning may be seen by the courts as smoke and mirrors. The factual question is whether the system retains a copy long enough to be perceived by the machine, which has been held to be violative of §106(1). Long-term storage of a copy is not required, and my understanding is that making a “more than fleeting” copy is unavoidable in any computer system—i.e., that there is no such thing as ingestion without reproduction.

Proving reproduction will be the whole ballgame insofar as litigation can address whether feeding a corpus of protected works is a violation of law. We shall see what the courts make of the facts presented, but without finding reproduction, the other copyright complaints likely fall. For instance, removal of CMI is not a stand-alone violation. Section 1202 of the DMCA states that removal is a violation if the party doing the removing knows or has reasonable grounds to know “that it will induce, enable, facilitate, or conceal an infringement of any right under this title.” Therefore, there must be a colorable claim of infringement for the CMI allegation to survive.

Derivative Works Allegations

Both the Andersen and Tremblay complaints allege that the AIs produce unlicensed derivative works in violation of §106(2), though the arguments are different in each case. In Andersen, the allegation arises from the premise that the system cannot produce anything outside the limitations of its data set composed of protected works. “The resulting image [output] is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images.…a latent diffusion system…can never exceed the limitations of its Training Images.”

It’s an interesting theory, but I’m not sure anything in copyright law can support the argument that all potential outputs of the generative AI are unauthorized derivatives of the total corpus of works in the training set. To find an infringing derivative of a visual work (typically one image) requires a substantial similarity inquiry comparing a specific original with the follow-on work to determine what has been copied and whether that copying renders the second work a derivative of the first. This is difficult enough in the world of humans intentionally using a single visual work to produce a different visual work (see Goldsmith v. Warhol!!). So, it seems highly speculative to ask a court to find generally that billions of images output are, as a matter of law, derivatives of the billions of images input. I’m not certain the court has anywhere to look for guidance to consider this reading of the derivative works right.

If this derivative works theory is tough with images, it would be even harder with text—i.e., to allege that the textual outputs are derivatives of all the textual inputs is akin to saying that every book written is a derivative of every book read. This echoes a popular sentiment among the anti-copyright crowd that no work is “original,” a premise that should not be given any legal weight, even in the service of trying to protect creators from AI developers. 

In Tremblay, the allegation is not that the individual outputs of ChatGPT are derivatives of the corpus of books used in training, but that the entire model is a single derivative work of its corpus. “Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act,” the complaint states. [Emphasis added]

Again, claiming that the entire LLM is a single derivative work of the millions of literary works fed into the system would seem to strain the derivative works right beyond the limit where any court can venture. In fact, this allegation could potentially bolster the inevitable fair use defense the AI developers will be arguing—namely that the finding of “transformative use” in Google Books favors fair use of the corpus of work used in ML.  

Fair Use & Google Books

Notably, these cases are brought in California, controlled by the Ninth Circuit and, therefore, not bound by the Second Circuit decision in Google Books, which many believe to be the strongest precedent favoring fair use for the AI developers. The comparison is a natural one. Google scanned whole books into a system to create a unique tool for searching the contents of books without providing any whole-copy substitutes for legally obtained copies. The court, noting that its decision “pushed the boundaries of fair use,” found under factor one that Google Books is “transformative” for its utility and found under factor four that it did not pose a threat to the market for the books used.

What the AI developers will try to argue under Google Books is that 1) their systems are highly “transformative” because they use protected works to create novel (even revolutionary) applications; and 2) their systems are designed to avoid outputting any copies that would serve as substitutes for the works in the data set. It is conceivable that courts or juries would find the comparison compelling, though the aforementioned capacity of a given AI to output Dr. Strange means that, unlike Google Books, the visual AI system at issue does make substitutes available and, therefore, the precedent is inapt.

By contrast, ChatGPT or another text-based application could have a stronger defense under Google Books if it is not possible, for instance, to have the system output an entire in-copyright literary work. The Tremblay complaint refers to the output of summaries, which is evidence that a whole book was ingested, but a summary is not generally an infringement and is certainly not a substitutional copy.

Meanwhile, other considerations should perhaps militate against finding fair use for generative AI model training. For instance, Google Books is a research tool for humans to learn about books written by other humans, including humans who write more books. Generative AIs are not necessarily comparable. For instance, Stable Diffusion does not provide a user with any information about an ingested work, and it poses an unprecedented threat to professional visual artists unlike any technology that has come before. Thus, the courts should consider the sui generis purpose of the generative AI at issue when citing Google Books or any other precedent to consider fair use.

In a May post, I proposed that unless the generative AI at issue can show that it promotes authorship, the court should decline to consider a fair use defense. To clarify, in Campbell, the Supreme Court states, “The fair use doctrine thus ‘permits [and requires] courts to avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster.’”[4] Until generative AI changed the landscape, there was no need to affirm that “the very creativity” fostered by copyright means “human creativity.” But today, that distinction is necessary. Although generative AI can produce volumes of “creative” material, only those works which can be protected by copyright are works of authorship. And just as it is indecent to exploit an artist’s work to build a machine that might end her career, it would be absurd to allow fair use (a component of copyright law) to defend a technology that would potentially annihilate copyright’s purpose.

Of course, that’s one man’s opinion, and one that would apply to some, but not all, works derived by generative AI. As these tools develop, and their uses are explored by various types of creators, there are examples, both in practice and in theory, where we can find that generative AI does foster new authorship. This gets into the complicated question of copyrightability of works that humans create with some AI used in the process, and because this is itself a new discussion, it is difficult to say which generative AIs, if any, can be said to “promote the progress” of authorship as a matter of law.

Legal experts, both pro- and anti-copyright, will comment upon the strengths and weaknesses of Andersen, Tremblay, and the other cases brought by the one firm that has taken the lead on these lawsuits. But even where these cases may be flawed, they can provide some insight into the question posed by this essay: is copyright law an answer to the potential hazards of generative AI? I suspect that a fundamental difficulty arises because generative AI poses an existential threat to the future of authors, and some of the injustices and cultural calamities inherent to that threat may not be remedied (or entirely remedied) by the principles of copyright. Remedies sounding in other areas of law could loom larger, especially for certain types of creators, and that will be the subject of the next post.


[1] DeviantArt is also a named defendant, sued for breach of contract for providing works to Stability for ingestion.

[2] The same firm is now representing Sarah Silverman and another class of book authors, though the complaint is essentially the same as Tremblay.

[3] MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993).

[4] Citing Stewart v. Abend (1990).

AI, Search, & Section 230

On May 18, the Supreme Court delivered opinions in Gonzalez v. Google and Twitter v. Taamneh, a pair of interrelated cases in which both plaintiffs sought to hold online platforms liable for hosting material meant to inspire acts of terrorism. Because the Court unanimously found in Taamneh that there was no basis in anti-terrorism law for liability (and therefore no claim for relief), it then declined to address the Section 230 question in Gonzalez, which was whether Google’s “recommendation algorithm” is sufficient to find contributory liability for the inciteful material being recommended.

Properly read, Section 230 shields OSPs from “publisher liability” but not from “distributor liability.” A distributor of allegedly harmful material may be liable when it knows, or has reason to know, the nature of the material and either affirmatively chooses to distribute it or willfully turns a blind eye to the potential harm and does nothing to stop it. Unfortunately, ever since 230 became law in 1996, the courts have generally read the law as a blanket shield for any OSP distributing any kind of material as long as it was uploaded by a user of the site and not by the site operators.

Plaintiff Gonzalez alleged that Google’s “recommendation” algorithm, designed to promote content based on the system’s interpretations of user behavior, played a crucial role in pushing ISIS propaganda toward the parties who eventually committed a mass shooting in Paris that resulted in the death of Nohemi Gonzalez. Plaintiffs argued that “targeted recommendations” are not properly shielded by Section 230, and to the extent one can read the tea leaves in oral arguments, justices as opposite as Thomas and Jackson may be sympathetic to this view.

For further reading in “Strange Bedfellows,” the amicus brief in Gonzalez filed by Senator Hawley echoes many of the same legal arguments in the brief filed by Cyber Civil Rights Initiative. Also, Senators Hawley and Blumenthal are at least publicly in sync on the need to correct the errors in Section 230. “Reform is coming,” Sen. Blumenthal declared in March. All of which is to say that there appears to be both bipartisan and multi-stakeholder consensus building around the idea that platforms can and should be held accountable for promoting harmful material.

Does AI-Enhanced Search Imply Liability?

Notably, one prong of Google’s defense in Gonzalez was that “recommendation” is analogous to search and that delivering search results cannot rise to the level of contributory liability. Whether the Court would agree with this comparison under full examination in a viable case remains an open question. But assuming the Court would not have sided with Google, what might it make of Google’s new Search Generative Experience (SGE)? Still in trial phase for users who choose to enable it, the AI-driven SGE could be the new mode of search, or (if it totally sucks) could tank Google’s core business. As James Vincent writes for The Verge:

… it’s the dynamics of AI — producing cheap content based on others’ work — that is underwriting this change, and if Google goes ahead with its current AI search experience, the effects would be difficult to predict. Potentially, it would damage whole swathes of the web that most of us find useful — from product reviews to recipe blogs, hobbyist homepages, news outlets, and wikis. Sites could protect themselves by locking down entry and charging for access, but this would also be a huge reordering of the web’s economy. In the end, Google might kill the ecosystem that created its value, or change it so irrevocably that its own existence is threatened. 

Hard to predict for sure, and I will not make the attempt. There are, of course, many potential hazards with AI-enhanced search, not the least being more virulent mutations of garbage results (as if misinformation needs any help). But in a Section 230 context, would the deployment of SGE as Google’s new search model increase the likelihood of its liability under the same legal arguments presented in Gonzalez? The “recommendation” algorithm is a form of AI, and if that level of platform influence could be sufficient to find liability, then presumably a more robust use of AI could result in a stronger allegation of liability.

On June 14, Senators Hawley and Blumenthal introduced a two-page bill that would make Section 230 immunity unavailable for service providers “if the conduct underlying the claim or charge involves the use or provision of generative artificial intelligence by the interactive computer service.” Presumably, this bill can be seen as performative along with other announcements from Congress that AI has their attention, with various Members promising not to be fooled again into allowing Big Tech to regulate itself. There’s a lot of “We’re on it” messaging coming from the Hill about AI, and we’ll see what comes.

In the meantime, perhaps there is something to the Hawley bill in light of the considerations in Gonzalez and the imminent release of SGE. At first, I sneered at the amendment because generative AI is primarily a tool of production, and Section 230 immunity has little or nothing to do with production. It doesn’t matter whether the harmful material at issue is produced with Midjourney or a box of crayons. But if a generative AI serves as the engine for a new mode of search (i.e., recommendation), then the language in the Hawley/Blumenthal amendment would seem to obviate the need to litigate the question presented in Gonzalez. Congress would be declaring that Google is not automatically shielded from liability.

Considering that we are far from resolving the damage done by the “democratization of information,” it’s tough to feel sanguine about the prospect of AI making search better rather than suck faster. On the other hand, if the adoption of AI in certain core functions of online platforms is a basis for Congress resetting the terms of liability, then perhaps service providers will discover a renewed interest in the original intent of Section 230—an incentive to remove harmful material, not to keep it online and monetize it.


DCA Reports High Incidence of Credit Card Fraud on Pirate Sites

Digital Citizens Alliance (DCA) released a new report yesterday with the eye-popping statistic that 72% of Americans who subscribe to pirate media sites report experiencing credit card fraud, compared with 18% of those who do not subscribe to pirate sites. These data are based on a survey of 2,030 Americans, of whom 1 in 3 reported watching some pirated content in the last year, and 1 in 10 reported subscribing to a pirate streaming service. The report, titled Giving Pirate Site Operators Credit, states …

… piracy was once primarily a headache for content creators, users of these sites now face significant risks. Piracy subscription services make an estimated $1 billion a year providing services to at least nine million U.S. households.

DCA’s findings indicate that around 6.5 million Americans who choose to access movies, TV shows, and games in this black market have been targeted for credit card fraud as a direct result of their subscriptions. And although I say the stat is “eye-popping,” given the environment we’re talking about, perhaps the real surprise is that the rate of unauthorized credit card charges in this network isn’t closer to 100%. After all, it’s one thing when hackers steal credit card data from legit retailers and the like, but subscribing to a pirate site is cutting out the middleman and giving credit card info directly to a network of hackers.

The shift to high-quality streaming a little over ten years ago created an opportunity for pirates to launch new platforms offering low-price subscriptions to “everything” because, of course, none of the material they’re streaming is legally obtained but is stored on pirate servers around the world. Just as other DCA reports have shown that among the hidden costs of this all-you-can-eat offer is a high probability of infection with life-altering malware, the likelihood of unauthorized charges to a credit card is apparently even greater. “Combined with our previous research highlighting the risks associated with free piracy apps and services, the situation becomes even clearer. The pursuit of pirated content is an inherently risky behavior that threatens the devices, wallets, and privacy of consumers,” says DCA executive director Tom Galvin in a press release accompanying the new study.

DCA Research Subscriptions Trigger Fraud Within Eleven Days

Prior to conducting its survey of American consumers, DCA researchers subscribed to 20 pirate sites using a new credit card obtained for the experiment. In less than two weeks, the fraudulent charges began to appear from China, Singapore, Hong Kong, and Lithuania, and within three months, DCA’s card was targeted with $1,495 in executed and attempted unauthorized transactions. The largest attempted transaction was $850, which was stopped by fraud protection, and the largest approved charge was $244.78. Given the implied cost to credit card services to provide protection against such transactions, DCA’s first recommended remedy, that the payment processors terminate relationships with known pirate sites, seems like a no-brainer.

DCA also recommends that the Federal Trade Commission “take piracy more seriously” and prioritize warning Americans about the risks associated with pirate sites; it recommends more consumer protection group outreach on this issue; and it recommends that law enforcement more aggressively investigate pirate site operators, now armed with the 2020 amendment to the U.S. Copyright Act which elevated large-scale piracy by means of streaming from a misdemeanor to a felony. “Given that the piracy ecosystem is now a $2 billion industry, the Department of Justice should use that authority to target piracy operators,” the report states.

Personally, I would be curious to know something about the thinking of 9 million Americans who want cheap media streaming so badly that they’re willing to tolerate the high risk of credit card fraud and/or a dangerous malware attack. Of course, to DCA’s point, perhaps the majority of these subscribers don’t know how risky accessing these sites can be.

