Finding Fair Use for GAI Training is Highly Problematic


Although I have expressed aspects of these views in several posts over the past couple of years, I will try to consolidate my opinion as to why GAI training with protected creative works is a more problematic fair use consideration than many, even the courts, seem to believe. I acknowledge that even fellow copyright advocates will disagree with some of this analysis, but here it goes:

For the sake of narrowing the focus to the question of whether training generative AI (GAI) with protected works favors a fair use exception, the following assumes that the training requires unlicensed copying of protected expression. Further, even if the GAI maker limits the product’s capacity to output infringing copies, this does not alter the fact that considering fair use for this purpose is, at best, troubling and, at worst, so disturbing to case law that the AI developers are begging the courts to articulate doctrine out of whole cloth.

A GAI’s Purpose is Not Analogous to Past Fair Use Factor One Findings

The courts have largely rejected the overbroad opinion that making “something new” is a sufficient justification for unlicensed use of protected works. Thus, it is difficult to see where any court finds an authority to support the argument that making a “creator robot,” however revolutionary its developers proclaim it to be, is a transformative purpose under a factor one analysis.

Typically, a GAI’s purpose neither expresses “critical bearing” on the works used (AWF v. Goldsmith) nor provides information about the works to human readers (Authors Guild v. Google) nor fosters interoperability in computer devices (Google v. Oracle). Instead, a GAI’s most widely applied and widely promoted purpose is artificial “authorship” without authors—a purpose that forecasts myriad negative effects that may prove to dramatically overwhelm any benefits promised by the developers.

Naturally, certain GAIs (e.g., ChatGPT) can be used for various purposes, about which more below, but if the courts are distracted by the sheer novelty, scope, and hype around the “importance” of GAI and, therefore, presume transformativeness, they may be persuaded to articulate a rationale that would be tantamount to a blanket exception for GAI training. If the courts adopt this carve-out in the context of fair use factor one, the result would be a reversal of their own reluctance, so recently expressed in Warhol, to favor the broad “something new” argument for transformativeness.

Notably, it is not unprecedented for courts to articulate rationales beyond the four-factor analysis. In the Google Books case, the court found that the search tool provides a “social benefit,” and a similar sentiment was articulated in Google v. Oracle regarding consumer benefit in advancing mobile products. Or, looking back at the Betamax VCR case, the concept of “time shifting” the viewing schedule served the public interest by expanding flexibility in the consumption of copyrighted material that was lawfully obtained.

But if the courts look for a rationale beyond the case law (e.g., a clear social benefit of GAI), not only will they be making a wild guess, but any conclusion in favor of the developers will probably be wrong—perhaps dangerously so. While it is understandable that the courts may be reluctant to hobble technological development in principle, the available facts militate against disturbing fair use jurisprudence for the sake of nurturing GAI in general.

Put differently, even if the courts are inclined to take a wait-and-see approach, there is already ample evidence that GAIs cause harm to individuals—from CSAM and defamation to cheating and psychological issues—to say nothing of the well-founded social, political, economic, and environmental anxieties associated with this multi-trillion-dollar gamble being played by the same people who unrepentantly accrued wealth and power from the darkest results of Web 2.0.

GAI as a Tool for Creators

To the extent that a given GAI product may be considered a tool for producing creative works, a fair use holding should at least find that the tool “promotes the progress” of authorship with respect to copyright’s purpose. But this is difficult because the same GAI in the hands of one skilled creator offers little insight into its ultimate purpose in the hands of 100 million unskilled users.

At the positive end of considering GAI’s purpose, my friend David Bolinsky, a medical illustrator and animator, recently completed a daunting assignment: a series of eight dozen topically and stylistically distinct ten-second animations introducing speakers and segment topics for a scientific conference. GAI collapsed what would have been well over a year of work with his standard 3D animation tools into a matter of weeks. He was surprised at the breadth and depth of creative latitude GAI enabled. Further, he explained that although these presentations allowed more creativity than his typical discrete medical and scientific educational animations, an amateur lacking his experience still could not have used the same GAI tools to achieve the same results. Consequently, Bolinsky sees GAI as an opportunity to do more and different kinds of work, not as a threat to his creativity or livelihood.

In this example, the technology is socially beneficial and arguably “promoting the progress” of authorship, which may favor a finding that the tool is transformative. That said, due to the human authorship requirement, we are years away from guidance as to the degree of copyright protection on those animations; and if GAI tools are used to produce millions of works that have no “authors” as a matter of law, it is contradictory to find that this “promotes progress” with regard to copyright’s purpose.

Further, the difficulty for the court in considering fair use is that Bolinsky and his colleagues who specialize in medical work are unique among professional creators, to say nothing of the many millions of non-creator customers that GAI developers need—because they are leveraged into the stratosphere—to make their products profitable. This scale implies an analysis reminiscent of Sony—i.e., a question of whether the purpose of the GAI is substantially beneficial or substantially harmful. But knowing that requires time travel.

If a court could see a few years into the future and find, for instance, that the GAI at issue will be used substantially for nonconsensual pornography, disinformation, and scams, it would presumably decline to find these purposes are social benefits that favor an expansive transformativeness finding. Instead, at the moment, the courts simply have no idea what the true “purposes” are of various GAIs, which is unprecedented in fair use jurisprudence. The VTR, Google Books, Android phones, et al. did not serve materially different purposes years after they were presented to the courts in their respective cases. By contrast, GAIs present an incomplete and dynamic set of facts; and in my view, this alone should militate against finding that factor one favors any of these products.

The Threat to Authorship Itself

As stated in other posts and in comments to the Copyright Office, one unique challenge of GAI is that it poses a potential threat to authorship (i.e., that it will shrink the number of creative workers), which is clearly destructive to the purpose of the progress clause and to copyright law. Although my own view is that a party who poses an existential threat to copyright’s purpose should not be allowed to invoke one of copyright law’s affirmative defenses, I recognize the difficulty of that position.

Under U.S. law, copyright protects authors indirectly by protecting certain exclusive rights to use their works. Consequently, there is little foundation for arguing generalized harm to authorship itself, despite the overwhelming recognition that diversity in authorship has benefitted the United States both culturally and economically for almost two centuries. In this context, GAI provokes the question as to whether U.S. policy might shift toward a “moral rights” approach akin to Europe, but that’s a discussion for a different post.

Instead, the general threat to authorship is considered, to an extent, under fair use factor four, which weighs the potential threat to the market value of the works used. The key difficulty, however, is that if the GAI does not output the song “Ordinary” but instead outputs music in the style of Alex Warren, then the output is not, strictly speaking, a threat to the market value of “Ordinary” itself. While proposals like the NO FAKES Act would prohibit unauthorized replication of Warren’s voice, copyright law does not clearly prevent a GAI that makes Warren-like music that could theoretically obviate the need for Warren himself.[1]

For now, several plaintiffs in the roughly 40 active lawsuits against GAI developers have presented evidence of outputs that are substantially similar to the works used in training, and this should disfavor fair use for the GAI developers under factor four. More broadly, plaintiffs in these cases argue that licensing works for the purpose of AI training is itself a market opportunity exclusive to the copyright owner, and therefore, the failure to license constitutes market harm under factor four.

Some courts may be reluctant to agree with the lost licensing opportunity claim, but that reluctance is unfounded—even if a developer successfully prevents its product from outputting copies of works used in training. So long as one of the exclusive copyright rights is implicated (and here, it would be the reproduction right), then a requirement to license exists. Consequently, failure to license, especially at such an extraordinary scale and for an unprecedented commercial venture, is unquestionably market harm to the copyright owner.

Even where there may be a close call on factor four, because the GAI developer should lose on factor one, and because factors two and three decidedly favor creator plaintiffs, factor four should not reasonably control in many of these cases. Moreover, the courts should pay scant attention to the claim by developers that the cost of licensing is existentially prohibitive to the development of GAI. In addition to the fact that this plea is barely tolerable from parties wildly spending billions on high-risk ventures, any claim that a license is “too costly” for any venture is no defense under copyright law. The copyright owner sets the terms for the use of her work, and the prospective user can accept those terms or not before using the work. If that rule applies to the bootstrapping indie filmmaker, surely it applies to Microsoft, Meta, Google, et al.

Conclusion

Fair use is a mixed question of fact and law, and I maintain that what should be most fatal to the developers’ fair use defense is that, like the public, the courts have insufficient facts about the ultimate purpose of GAI products. Just as with Web 2.0 in the late 1990s, we are witnessing unfounded political sentiment to once again let Big Tech do what it wants, preaching to the public that this time, the technology really will “solve the world’s problems.”

Of course, there is no rational basis for that belief beyond the self-interest of the developers and the investors losing billions every year. If past is prologue, Congress will live to regret the folly of allowing AI to run amok, just as Members of both parties now rue the unconditional immunity of Section 230. In the meantime, while licensing copyrighted works for GAI training will not address all, or most, of the potential hazards of artificial intelligence, the courts should decline to adopt strained fair use rationales in the name of assumed progress that may turn out to be a complete disaster.


[1] I believe there are cultural reasons that militate against this result, but those predictions do not influence the fair use consideration.

Chamber of Progress Says Tariffs Are an Excuse to Infringe Copyrights


Politico reported yesterday that the astroturf organization called Chamber of Progress stated that because Trump’s tariffs will be a “gut punch” to Silicon Valley stock prices, California legislators should decline to aggravate matters by passing a law that would require transparency among AI developers using copyrighted works in model training. Granted, the tone was more circumspect, but that’s what the argument boils down to:  Tariffs are going to screw our stock values, so we need to screw creators to offset the harm.

According to Chamber of Progress economist Kaitlyn Harger, the cost of compliance with AB 412, sponsored by Assembly Member Rebecca Bauer-Kahan, would cause a dip in stock values that “…could carve $381 million out of California’s tax haul from the four tech giants, all key players in the generative AI boom,” Politico reports.

I won’t comment on the numbers, especially because they are speculative, but I will note the amount of SOP fluff being used to package this argument against the transparency bill. Adam Eisgrau, senior director of AI, creativity, and copyright policy at Chamber of Progress, states that grounding this anti-AB 412 argument in the tariff controversy is “not opportunistic,” when of course it is. He adds, “It is fair to call tariffs a tax, and I think it’s fair to call this bill an innovation tax.”

Kudos for dinging tariffs and taxes and promoting innovation in one sentence, but Eisgrau is parroting a longstanding practice of Silicon Valley: calling any price it would pay for necessary materials a “tax” on progress. While compliance with AB 412’s transparency provisions would naturally cost the tech giants something, why is that cost, let alone the effect of tariffs, a basis for ignoring the creators whose works are being mined for AI training?

Assuming tariffs will hit every sector and increase prices across multiple supply chains, that universal condition is not a rationale for tech giants getting a supply of copyrighted works for free. The creators who make those works aren’t getting their supplies for free—and most creators barely make a living wage if they’re lucky. Meanwhile, if the California Assembly is looking broadly at the state’s economy in this North v. South narrative, even a cursory review of the numbers shows that motion picture production supports more jobs than the tech giants.

“Bauer-Kahan’s proposal has the backing of Hollywood labor groups,” Politico states, “including the powerful actors’ guild SAG-AFTRA and the National Association of Voice Actors. But it’s been side-eyed by tech industry critics who say it would upend fair-use protections and turn AI training into a lawsuit in waiting.”

This “upend fair use” claim, whether it comes from Eisgrau or any other tech representative, is a standard parlor trick of that industry. First, they advocate a broad, generalized application of fair use (a doctrine that defies generalization) and then claim that any counterargument to their position would “upend” some standard that has been established. This is simply false.

AI training with protected works presents a novel set of facts to be weighed in the context of fair use case law, and, thus, a finding that training is not fair use would not “upend” precedent. On the other hand, the rhetoric used by Big Tech in this regard asks for a “fair use” application so sweeping that it would be tantamount to a statutory carve-out for all machine learning now or in the future. That is asking to upend fair use.

The consensus appears to be that Trump’s tariff tactics can only sow chaos and drive up the cost of living for all Americans—including, by the way, creators of works protected by copyright. But despite the prospect of universal economic pain, the Chamber of Progress asks California lawmakers to shield a few of the wealthiest corporations on Earth from the rights and financial interests of the creators whose works those companies are exploiting. Wow.


Photo by Beebright

Shedding Light: Briefs Filed in Kadrey v. Meta


The purpose of cultivating works of authorship is to shed light on human experience, and the foundational purpose of the fair use doctrine in copyright law is to shed light on works of authorship. From its 18th-century English roots to the U.S. Supreme Court’s 2023 decision in AWF v. Goldsmith, the primary rationale for fair use is to permit the unlicensed use of works in ways that critique or comment upon the works themselves. Harvesting millions of books to train an LLM does not do this.

With the growth of digital technologies and copyright protection for highly utilitarian computer code, the fair use doctrine has expanded somewhat to permit certain “non-expressive” uses of works. But the uses allowed by the courts have still tended to provide information about the works used or to advance purposes like software interoperability. Harvesting millions of books to train an LLM does not do this.

A pair of briefs filed in Kadrey v. Meta—one by the Association of American Publishers (AAP), the other by a group of IP law professors—present compelling arguments against finding that Meta’s unlicensed copying of millions of books to train its generative AI product Llama is fair use. A common theme in both briefs exposes a core fallacy, and legal hypocrisy, common to AI developers in these cases—namely that they copy protected “expression,” but they don’t copy protected “expression.”

As we see in the shorthand of social media, the developers construct their own dichotomy by simultaneously humanizing and dehumanizing their products. In one breath, they compare machine learning (ML) to human learning but then drop the analogy when they seek to claim that the protected “expression” in the works used is not copied or stored by their mysterious and complex “training” models. The AAP brief argues that copying “expression” is central to training an LLM, and the professors’ brief shows that “learning like a human” is precisely why fair use does not exempt Meta from obtaining licenses.

Both AAP and the professors naturally present specific arguments as to why none of the fair use case law supports Meta’s defense, but I was intrigued by the ways in which both briefs argue from different perspectives that training Llama indeed exploits the “expressive content” of the books appropriated. In fact, if it could be shown that no protected expression is copied or stored, this would be an argument that no case for infringement exists. But considering the emphasis on fair use—and all similar cases will almost certainly turn on fair use—we can assume that this statement from AAP is correct:

Meta would have this Court believe that authors’ original expression is not preserved in or exploited by the model. But this is not so. The LLM algorithmically maps and stores authors’ original expression so it can be used to generate output—indeed, that is the very point of the training exercise.

Kadrey and all AI training lawsuits presenting similar facts will turn on fair use factors one and four. Under factor two (nature of the works used), the books in Kadrey, and the works in most other cases, are “expressive” rather than “factual” in nature, and therefore, this factor favors plaintiffs. Under factor three (amount of the work used), it is understood that whole works have been fed into the models, and so, this factor also favors plaintiffs.

Under the first fair use factor (purpose of the use), the court considers 1) whether the use is transformative; and 2) whether the use is commercial. Here, Meta’s commercial purpose is undeniable, and the AAP brief soundly argues that there is nothing transformative about copying the word-for-word expression in textual works for a purpose that sheds no light on the works used. On the contrary, the intent of the LLM is to create a non-human, substitute “author,” a purpose for which there is indeed no judicial precedent.

Factor four considers potential market harm to the copyright owner(s) of the work(s) used, and it may be the keystone in the broader creators-versus-GAI battle. Meta, a trillion-dollar company run by executives whose credibility is in doubt, contends that it is not feasible to license the books it used to train Llama. In response, AAP presents substantial evidence of licensing agreements between copyright owners and several major AI developers, and it states that Meta abandoned negotiations with publishers and chose instead to harvest books from pirate repositories.

Further, AAP argues “from a policy perspective” that Meta’s accessing of those pirate “libraries” of DRM-free books contravenes Congress’s intent in passing the Digital Millennium Copyright Act (DMCA) in 1998 and thus militates against finding fair use. “Congress sought to establish a robust digital marketplace by ensuring appropriate safeguards for works made available online, including copyright owners’ ability to rely on DRM protections in distributing electronic copies of their works.”

In this spirit, inherent to the history of the fair use doctrine is the notion of “fair dealing” or, put differently, general legality in the overall purpose and character of the use. “The compiler of the training data’s knowledge of the unlawful provenance of the source copies might well taint the ‘character’ of the defendant’s use,” writes Professor Jane Ginsburg in a paper examining the question of fair use of works for AI training.[1]

The Professors’ Brief

The brief filed by the IP professors also emphasizes that the protected “expression” in the works is copied and exploited without license, but it also rather deftly uses Meta’s own rhetoric to doom the fair use defense. In general, when the AI cheerleaders say that LLMs “learn the way humans do,” my instinct has been to sneer at this anthropomorphic sentiment. But by giving the “learning like humans” analogy weight, the professors’ brief demonstrates exactly why that claim is fatal to a defense that the developer’s purpose is fair use.

Noting that humans indeed use protected works for “learning” all the time, the professors make plain that this exact relationship between author and reader (the basis for copyright) does not exempt the human from obtaining works legally. Thus, by Meta’s own analogy, the “machines learn like humans” claim is both an affirmation that the “expression” is being exploited and proof that there is nothing transformative about using works for “learning.”

Further, the professors have a bit of fun emphasizing that Meta et al. strain to make the machine learning process sound as technically complex as possible to obscure the fact that only by copying “expression” could the LLM actually “learn” anything. Here, a tip of the hat is deserved for the brief’s description of a human being reading a book thus:

… many billions of photons hit the book’s surface; some of those billions reached a lens, which focused them onto a retina, which converted them into electronic signals, which then resulted in electronic and chemical changes in some portion of over 100 billion neurons with over 100 trillion connections, some of those changes being transitory, and others more permanent.

The technical description of human processing and learning is even more mysterious because not even expert specialists in neuroscience know how the brain works at the neuronal level.

Well done! If that needlessly technical description of human reading requires legal access to the book, then so does the far less complex process of machine learning for AI development. Moreover, even if Meta were the vanguard developer and there were no examples of licensing deals being made, there is no rationale anywhere in commerce that a necessary resource must be free because it is essential. Meta et al. need electricity, engineers, and probably a computer or two to develop Llama, and not one of these resources is free. Yet, somehow the most essential resource—the work of millions of authors—should be free.

On that note, there has never been a more important time to protect the rights and economic value of authors who shed light on the world we inhabit. I remain more than skeptical that it will ever be desirable to create literary works without authors, musical works without composers, etc. And certainly, licensing deals alone do not address all the potential hazards of unethical or questionable uses of generative AI. How products like Llama are used will provoke discussions that are cultural as well as legal. But for the moment, fair training of all AI models is the only rule that is both ethical and consistent with copyright’s purpose.


[1] Prof. Ginsburg is not one of the professors in the brief cited for this post.

Photo by Busko