Shedding Light: Briefs Filed in Kadrey v. Meta


The purpose of cultivating works of authorship is to shed light on human experience, and the foundational purpose of the fair use doctrine in copyright law is to shed light on works of authorship. From its 18th-century English roots to the U.S. Supreme Court’s 2023 decision in AWF v. Goldsmith, the primary rationale for fair use has been to permit the unlicensed use of works in ways that critique or comment upon the works themselves. Harvesting millions of books to train an LLM does not do this.

With the growth of digital technologies and copyright protection for highly utilitarian computer code, fair use doctrine expanded somewhat to permit certain “non-expressive” uses of works. But the uses courts have allowed still tended either to provide information about the works used or to advance purposes like software interoperability. Harvesting millions of books to train an LLM does not do this.

A pair of briefs filed in Kadrey v. Meta—one by the Association of American Publishers (AAP), the other by a group of IP law professors—present compelling arguments against finding that Meta’s unlicensed copying of millions of books to train its generative AI product Llama is fair use. A common theme in both briefs exposes a core fallacy, and a legal hypocrisy, common to AI developers in these cases—namely, the claim that they copy protected “expression,” but they don’t copy protected “expression.”

As we often see in the shorthand of social media, the developers write their own dichotomy by simultaneously humanizing and dehumanizing their products. In one breath, they compare machine learning (ML) to human learning, but then they drop the analogy when they claim that the protected “expression” in the works used is not copied or stored by their mysterious and complex “training” models. The AAP brief argues that copying “expression” is central to training an LLM, and the professors’ brief shows why “learning like a human” is precisely why fair use does not exempt Meta from obtaining licenses.

Both AAP and the professors naturally present specific arguments as to why none of the fair use case law supports Meta’s defense, but I was intrigued by the ways in which both briefs argue, from different perspectives, that training Llama indeed exploits the “expressive content” of the books appropriated. In fact, if it could be shown that no protected expression is copied or stored, that would be an argument that no case for infringement exists at all. But given the emphasis on fair use—and all similar cases will almost certainly turn on fair use—we can assume that this statement from AAP is correct:

Meta would have this Court believe that authors’ original expression is not preserved in or exploited by the model. But this is not so. The LLM algorithmically maps and stores authors’ original expression so it can be used to generate output—indeed, that is the very point of the training exercise.
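For readers who want to see what “algorithmically maps and stores” means in practice, the following is a minimal sketch of the standard next-token training objective behind LLMs. It is my illustration only (a toy model and a stand-in sentence, not Meta’s code), but it shows that the author’s own word sequence is the thing being fit:

```python
# Illustrative sketch only (not Meta's pipeline): the standard next-token
# objective. The model is rewarded only for predicting the author's actual
# next word, so the text's own expression is what the weights are fit to.
import torch
import torch.nn as nn

book_text = "the quick brown fox jumps over the lazy dog"  # stand-in for a book
words = book_text.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
ids = torch.tensor([vocab[w] for w in words])

model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))
opt = torch.optim.Adam(model.parameters(), lr=0.1)

for _ in range(300):
    logits = model(ids[:-1])                              # read each word...
    loss = nn.functional.cross_entropy(logits, ids[1:])   # ...predict the next
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, greedy decoding from "the" replays fragments of the
# original sequence verbatim; the "learning" is inseparable from the text.
```

A real LLM differs from this toy by many orders of magnitude in scale, but the objective is the same: reproduce the author’s next word.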

Kadrey and all AI training lawsuits with similar facts will turn on fair use factors one and four. Under factor two (the nature of the works used), the books in Kadrey, like the works in most other cases, are “expressive” rather than “factual” in nature; therefore, this factor favors plaintiffs. Under factor three (the amount of the work used), it is understood that whole works were fed into the models, so this factor also favors plaintiffs.

Under the first fair use factor (purpose of the use), the court considers 1) whether the use is transformative; and 2) whether the use is commercial. Here, Meta’s commercial purpose is undeniable, and the AAP brief soundly argues that there is nothing transformative about copying the word-for-word expression in textual works for a purpose that sheds no light on the works used. On the contrary, the intent of the LLM is to create a non-human, substitute “author,” a purpose for which there is indeed no judicial precedent.

Factor four considers potential market harm to the copyright owner(s) of the work(s) used, and it may be the keystone in the broader creators-versus-GAI battle. Meta, a trillion-dollar company run by executives whose credibility is in doubt, contends that it is not feasible to license the books it used to train Llama. In response, AAP presents substantial evidence of licensing agreements between copyright owners and several major AI developers, and it states that Meta abandoned negotiations with publishers and chose instead to harvest books from pirate repositories.

Further, AAP argues “from a policy perspective” that Meta’s accessing those pirate “libraries” of DRM-free books militates against a finding of fair use because it contravenes Congress’s intent in passing the Digital Millennium Copyright Act (DMCA) in 1998: “Congress sought to establish a robust digital marketplace by ensuring appropriate safeguards for works made available online, including copyright owners’ ability to rely on DRM protections in distributing electronic copies of their works.”

In this spirit, inherent to the history of the fair use doctrine is the notion of “fair dealing” or, put differently, general legality in the overall purpose and character of the use. “The compiler of the training data’s knowledge of the unlawful provenance of the source copies might well taint the ‘character’ of the defendant’s use,” writes Professor Jane Ginsburg in a paper examining the question of fair use of works for AI training.[1]

The Professors’ Brief

The brief filed by the IP professors likewise emphasizes that the protected “expression” in the works is copied and exploited without license, but it also rather deftly uses Meta’s own rhetoric to doom the fair use defense. In general, when the AI cheerleaders say that LLMs “learn the way humans do,” my instinct has been to sneer at the anthropomorphic sentiment. But by giving the “learning like humans” analogy its due weight, the professors’ brief demonstrates exactly why that claim is fatal to a defense that the developer’s purpose is fair use.

Noting that humans indeed use protected works for “learning” all the time, the professors make plain that this exact relationship between author and reader (the basis of copyright) does not exempt the human from obtaining works legally. Thus, by Meta’s own analogy, the “machines learn like humans” claim is both an affirmation that the “expression” is being exploited and proof that there is nothing transformative about using works for “learning.”

Further, the professors have a bit of fun emphasizing that Meta et al. strain to make the machine learning process sound as technically complex as possible to obscure the fact that only by copying “expression” could the LLM actually “learn” anything. Here, a tip of the hat is deserved for the brief’s description of a human being reading a book:

… many billions of photons hit the book’s surface; some of those billions reached a lens, which focused them onto a retina, which converted them into electronic signals, which then resulted in electronic and chemical changes in some portion of over 100 billion neurons with over 100 trillion connections, some of those changes being transitory, and others more permanent.

The technical description of human processing and learning is even more mysterious because not even expert specialists in neuroscience know how the brain works at the neuronal level.

Well done! If that needlessly technical description of human reading requires legal access to the book, then so does the far less complex process of machine learning for AI development. Moreover, even if Meta were the vanguard developer and there were no examples of licensing deals being made, there is no rationale anywhere in commerce that a resource must be free merely because it is essential. Meta et al. need electricity, engineers, and probably a computer or two to develop Llama, and not one of these resources is free. Yet, somehow, the most essential resource—the work of millions of authors—should be free.

On that note, there has never been a more important time to protect the rights and economic value of authors who shed light on the world we inhabit. I remain more than skeptical that it will ever be desirable to create literary works without authors, musical works without composers, etc. And certainly, licensing deals alone do not address all the potential hazards of unethical or questionable uses of generative AI. How products like Llama are used will provoke discussions that are cultural as well as legal. But for the moment, licensed training of all AI models is the only rule that is both ethical and consistent with copyright’s purpose.


[1] Prof. Ginsburg is not one of the professors in the brief cited for this post.

Photo by Busko

Gen AI & the Hubris of Data


In almost every discussion I’ve had with creators about generative AI (GAI), I have said that we should not overlook Big Tech’s capacity for exaggeration and total flops; it is possible that some AI products will be the next Google Glass, rejected by cultural and/or economic forces hostile to their business models. For instance, last week, Digital Music News (DMN) reported a partnership between Amazon and the AI music product Suno for the next generation of Alexa+. DMN quotes Amazon’s Panos Panay, SVP of Devices and Services, thus:

Using Alexa’s integration with Suno, you can turn simple, creative requests into complete songs, including vocals, lyrics, and instrumentation. Looking to delight your partner with a personalized song for their birthday based on their love of cats, or surprise your kid by creating a rap using their favorite cartoon characters? Alexa+ has you covered.

The first time I read about Suno, it struck me as a gimmick that may not attract or sustain enough market interest to be profitable. Just the example cited above of making personalized birthday songs seems like the kind of thing a household can only do a few times before it gets stale. “Surprise your kid by creating a rap…” sounds like what the kids call “cringy.” But the broader question posed by Suno is whether consumers want “personalized” music, or whether the whole concept is just another hubristic statement about the power of data in the arts.

There have been many arguments presented by theorists and scholars that consumer data either obviates the need for creators’ rights (copyrights) or justifies substantially limiting those rights. The general premise is that if consumer data informs creators about what audiences want, this insight lowers the risk of investing in production. Lowering that risk, say the theorists, implies rethinking copyright protection—or even rethinking the nature and value of creators, as Professors Sprigman and Raustiala proposed in a paper I critiqued in 2018.

As I argued in that critique and elsewhere on this blog, the goal of artists and creators is not necessarily to give audiences what they want. While one cannot dispute the market value of certain “formulas,” there is substantial evidence that when producers strive too hard to meet audience expectations, audiences are often disappointed. In short, risk is inherent to creative expression and audience experience.

In every medium and every genre, consumers want to be surprised by artists, and shifting modes of expression reflect artists’ personal responses to contemporary events. In general, the most successful (i.e., meaningful) works are the ones we didn’t know we wanted until we had them. And once these works become part of the vernacular of our lives, we cannot imagine living without them.

By contrast, theories about the power of data as a predictor of creative success are founded in a techno-centric arrogance that, to me, is exemplified by a product like Suno. The idea that the consumer wants music tailored from a few instructions—“Alexa, make me a punk rock song about a guy who lost his job”—is typical of the kind of “innovation” technologists develop by ignoring the fundamental reasons we enjoy music in the first place.

As explored in this post about opera, I agree that music, and other expressive media, can be replicated by an AI to provoke emotional responses in human observers. Simply put, if a composer knows that minor chords have a certain effect on the Western listener, then an AI can follow the same rule to produce a “melancholy” tune. But the science of music and human psychology only explains our instinctive, animal-like responses to combinations of sounds while leaving out the rest of the experience.
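To see how mechanical that “rule” is, here is a minimal sketch of my own (assuming nothing about Suno’s actual system) that applies the minor-triad convention to synthesize a “melancholy” chord:

```python
# Illustrative sketch only: applying the composer's "rule" that a minor
# triad reads as melancholy. A minor chord stacks a root, a minor third
# (3 semitones up), and a perfect fifth (7 semitones up).
import numpy as np
from scipy.io import wavfile

RATE = 44100  # CD-quality sample rate

def tone(freq, seconds=2.0):
    t = np.linspace(0, seconds, int(RATE * seconds), endpoint=False)
    return np.sin(2 * np.pi * freq * t)

root = 220.0  # A3
chord = sum(tone(root * 2 ** (s / 12)) for s in (0, 3, 7))  # A, C, E
chord /= np.abs(chord).max()  # normalize to avoid clipping
wavfile.write("melancholy.wav", RATE, (chord * 32767).astype(np.int16))
```

The rule takes a dozen lines to encode; what no rule encodes is why a listener would care about the result.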

We cherish our playlists for reasons that transcend the sounds’ effects on our brains—i.e., transcend mere taste. We relate and return to artists or their messages; we store and recall memories in the songs we replay; and we connect to friends and family through songs we have in common. Suno, outputting a bespoke song like a tepid cocktail, cannot provide any of that. On the contrary, it omits all those aspects of music that make us care about it, suggesting that its outputs are indeed gimmicks destined to become as dull as they are disposable once the novelty wears off. At least that’s my prediction.

There is, of course, a more insidious question worth asking—namely whether a product like Suno, especially when paired with Amazon, is less significant as a custom jukebox than it is as a new surveillance device. The use of personal data to micro-target and manipulate people and alter the course of major world events is not science fiction anymore. In that light, is it not conceivable that, say, 100 million people expressing their sentiments to an AI “music composer” will add color to data that will only exacerbate surveillance capitalism? That would be one hell of a way to pervert music.


Photo by Cm2012

Are Creators Aligned on Artificial Intelligence?


One of many challenges with the adoption of generative AI (GAI) tools is whether creators are willing to demonstrate a degree of solidarity on the matter—i.e., to apply the principle we generally call fair trade. If Creator A uses a GAI product that might be harmful to Creator B in a different field, and so on, will most creators take this broader perspective in a group effort to demand ethical uses of GAI? Moreover, this question is intertwined with copyright because the use of GAI is a subject of evolving legal doctrine, meaning that creators who want to produce commercial content outside their core talents should be aware that the material produced may not be protectable under the law.

Two simple examples would be the self-published book author who might use an AI voice app to produce an audiobook, and the documentary filmmaker who might use an AI music generator to produce a soundtrack for a film. In both examples, creators in other fields—voice actors and composers, respectively—are potentially harmed by the development and use of these AI tools. But 1) will the author and filmmaker take that consideration into account; and 2) will the sound recordings in either case be protected by copyright?

In the case of the author using AI in lieu of hiring a narrator to produce the audiobook, I predict that under current doctrine, the sound recording would not be protected by copyright law because there is no human performance captured in that recording. Thus, remedies for any piracy of the audiobook would rely solely on the protection of the underlying literary work, which is effective—but if the sound recording is also protected and registered, that would be two works infringed instead of one.

This increases the potential damages for infringement (U.S. statutory damages are assessed per work infringed), which puts the author/owner in a stronger position if she needs to take legal action. By this example, authors’ interests may be seen as aligned with those of professional book narrators. Hiring a narrator will not only achieve better quality in the reading, but capturing the human performance is also a basis for copyright attaching to the sound recording.

Similar considerations would apply to the filmmaker with the GAI soundtrack, although there may be other factors that provide the AI music with some protection we don’t find with the AI audiobook. One factor that may become relevant is whether the filmmaker can show that he exerted sufficient creative control over the final sounds. If so, he may be able to defend a claim of copyright in the soundtrack, but we are likely several years and a few lawsuits away from clear guidance on this question.

Another consideration with the soundtrack may be the Copyright Office’s current view that material produced with assistive AI “within a larger work” can be protected. Creators should be careful about interpreting that broad language because constituent works that stand alone—and this would apply to a soundtrack for a film—would logically not be independently protected.

Of course, there are many GAI products that allow one type of creator to avoid hiring another type of creator for a given project. Some of this is inevitable, and it is not necessarily unethical or bad for creative culture. That said, even with ethically trained and ethically used AI tools, the copyright considerations should be weighed by the individual creator (i.e., do they care about protecting what might not be protectable?), but also collectively by all creators contributing to a new ecosystem.

Since 1978 in the U.S., the default has been automatic copyright protection, even if most rights are never enforced. But as GAI is used to produce a lot of material that is not protected, it is hard to predict what effect this might have on copyright overall. The human authorship principle, which is even older than automatic protection under the 1976 Act, fosters a new tension for creators who may wish to combine GAI and human-authored work. In response to that tension, it would be a mistake, in my view, to override the “human spark” doctrine and simply protect any material that “walks and talks” like a creative work. This isn’t just an emotional appeal to anthropocentrism but rather a conviction that copyright would become meaningless—even unconstitutional—by eroding the incentive rationale for its existence.

Regardless of the theoretical questions addressed in this post, I believe that, as a practical matter, creators should think carefully about how and when to use GAI for various projects. As an ethical consideration, if you oppose “scraping” in your own industry, then opposing it in others is probably the right view to take. But as a business consideration, if what you’re making is meant to have commercial value, AI-generated may mean not protected by copyright—and that means that even if you spend money and time on it, it isn’t yours.