“Innovation” Doesn’t Mean Anything


Two headlines in the first week of this month said a lot about the United States as an “innovative” nation right now. One story announced that the first driverless semi-trucks are on the highway covering normal long-haul routes, and the second reported that the final shipments of pre-tariff goods from China were arriving at U.S. ports. Leave it to contemporary America to dispatch a new fleet of robot trucks just in time for the cargo containers to be empty. On the other hand, I guess it works out in principle because the unemployed truck driver won’t have money to buy the goods that won’t be on the shelves.

According to the tech-utopians of about a decade ago, the displaced truck driver shouldn’t worry because he now lives in a world of abundance and can, at last, spend his days painting or writing poetry or making music with all the leisure time he now enjoys. Isn’t that what happened? Didn’t technology “innovate” that Keynesian promise of a social and economic golden age? Doesn’t look like it. In fact, we’ve even got machines to write poetry and make music, so the ex-truck driver will just have to pound sand.

Big Tech historically calls everything it does “innovation,” allowing scant room for critique of a product’s pros and cons while labeling any policy that might protect some injured parties “anti-innovation.” Even where harmful results are identified and become the subjects of congressional hearings, the product makers effectively sell these “unintended” hazards as a price that must be paid for “more innovation.” And, by the way, that promised “age of abundance” will start any day now, if we are just patient and keep feeding the beast more data.

The Coalition for a Safer Web can describe in grim detail how social media and other tech platforms have “innovated” teen suicide, scams, and drug trafficking. Or the recent proliferation of AI “companion” apps (virtual girlfriends and boyfriends) has “innovated” new concerns among child psychologists—and these apps may also “innovate” new vectors for malware attacks. And, of course, increasingly realistic AI deepfakes may further “innovate” our fleeting grasp on reality, which has been essential to “innovating” American democracy to the edge of extinction.

Sporting the word “innovation” as a cloak for all manner of sins, the tech industry contends that the materials used to build the next generation of AI products (i.e., the works of artists and creators) are so essential for even more “innovation” that copyrights must be disregarded. Elon Musk and Jack Dorsey even opined that the U.S. should simply abandon intellectual property rights altogether, and the industry rhetoric appealing to the current administration claims that copyrights must not hamper the national interest in “winning” the competition to build the “best” AI.

The folly of declaring an intent to “win the AI war” without defining what success looks like is consistent with U.S. tech policy for decades and with policy affecting all sectors, public and private, today. To call Trump 2.0 incoherent is too kind, as that term can imply well-meaning error when, in fact, the administration is engaged in a purposeful, multi-pronged attack on science and the arts in direct conflict with the intent of the progress clause of the Constitution.

Article I, Section 8, Clause 8, giving Congress the power to “promote science and the useful arts” by establishing copyright and patent laws, was an expression of the Framers’ hope that the fledgling, agrarian nation might one day create great cultural works and inventions. But of course, IP law alone can’t do that. Quite simply, without the I, you ain’t got no P—and I is under assault in the United States. Brain-drain and chaos are now the hallmarks of every federal department from healthcare to defense, and in the private sector, Trump’s goons attack universities, the motion picture industry, publishers, authors, journalists, and scientists—literally anyone smarter than they are, which includes a lot of damn people.

“Innovation,” Copyright, and AI Training

Big Tech argues that all AI training with protected works should be exempted from infringement claims by the doctrine of fair use. Ordinarily, broad claims about fair use remain in the blogosphere while specific legal questions are weighed in court. But in regard to AI training, I worry that the general perception of the technology as “innovative” may result in overbroad application of “transformativeness” under factor one, which considers the purpose of a use.

For instance, Judge Chhabria, in last week’s hearing in Kadrey et al. v. Meta, stated that Meta’s Llama is “highly transformative,” which may signal an overbroad reading that synonymizes “transformative” with “innovative” while also eliding a thorough weighing of the extensive purposes for which the use is made. Or in a nutshell, how can a court fully consider the purpose of a use when the technology at issue is dynamic and open-ended?

As noted in an earlier post, landmark fair use cases have involved technologies that were complete models as facts presented to the courts—e.g., the VCR and the Google Books search tool. The court did not need to wonder, for instance, whether the purpose of Google Books—i.e., to provide information about books—might also be used to build an AI “psychologist” that may harm patients seeking mental healthcare. In fact, as The Guardian reports on this very issue, Mark Zuckerberg advocates “innovating” psychotherapy with AI “providers,” thus adding doctor next to historian, journalist, and constitutional scholar to the list of qualifications he lacks as he proceeds to break all things.

In this context, and with the recognition that Meta’s commercial interests entail application of its AI tools across many, if not all, initiatives in the company, what exactly is the purpose of Llama as weighed in a factor one fair use consideration? I’m not convinced the court can really know.

Beyond the Four Factors

When Congress codified fair use in the 1976 Act, it sought to convey over a century of judge-made law as statutory guidance, but beyond the four-factor test, “courts may take other considerations into account,” writes Professor Jane Ginsburg in a paper about AI and fair use. Indeed, she cites to the Google Books case, in which the court states, “the use provides a significant benefit to the public.” But with a product like Llama, where a court has reason to predict substantial crossover between socially beneficial and socially toxic purposes, how can a judge reasonably decide whether the purpose is “highly transformative” when the facts themselves are so ephemeral?

It is one matter for a court to consider the “transformativeness” of an AI built for a clearly defined purpose as presented, but it seems another matter if the technology has myriad purposes, including ones that will manifest after a case has been resolved. Whether Midjourney’s purpose to enable the production of visual works makes fair use of visual works in its training may be a sufficiently narrow consideration, but by contrast, an LLM developed by Meta is arguably open-ended development for purposes as yet undefined.

After all, Meta began with a college student ranking sorority girls and is now a trillion-dollar company that has altered the course of human history—and many of its “innovations” have had destructive results. In this light, the courts should decline to find “transformativeness” in the same overbroad spirit in which the tech industry wields the term “innovation.” Because without a clear definition and coherent law and policy, “innovation” is how we end up with a truck with no driver carrying a load of nothing to nobody.


Photo by Snoopydog1955

Shedding Light: Briefs Filed in Kadrey v. Meta


The purpose of cultivating works of authorship is to shed light on human experience, and the foundational purpose of the fair use doctrine in copyright law is to shed light on works of authorship. From its 18th-century English roots to the U.S. Supreme Court’s 2023 decision in AWF v. Goldsmith, the primary rationale for fair use is to permit the unlicensed use of works in ways that critique or comment upon the works themselves. Harvesting millions of books to train an LLM does not do this.

With the growth of digital technologies and copyright protection for highly utilitarian computer code, the fair use doctrine expanded somewhat to permit certain “non-expressive” uses of works. But these uses allowed by the courts have still tended to provide information about the works used or have been held to advance purposes like software interoperability. Harvesting millions of books to train an LLM does not do this.

A pair of briefs filed in Kadrey v. Meta—one by the Association of American Publishers (AAP), the other filed by a group of IP law professors—present compelling arguments against finding that Meta’s unlicensed copying of millions of books to train its generative AI product Llama is fair use. A common theme in both briefs exposes a core fallacy, and legal hypocrisy, common to AI developers in these cases—namely that they copy protected “expression,” but they don’t copy protected “expression.”

As we see in the shorthand of social media, the developers write their own dichotomy by simultaneously humanizing and dehumanizing their products. In one breath, they compare machine learning (ML) to human learning but then drop the analogy when they seek to claim that the protected “expression” in the works used is not copied or stored by their mysterious and complex “training” models. The AAP brief argues that copying “expression” is central to training an LLM, and the professors’ brief shows why “learning like a human” is precisely why fair use does not exempt Meta from obtaining licenses.

Both AAP and the professors naturally present specific arguments as to why none of the fair use case law supports Meta’s defense, but I was intrigued by the ways in which both briefs argue from different perspectives that training Llama indeed exploits the “expressive content” of the books appropriated. In fact, if it could be shown that no protected expression is copied or stored, this would be an argument that no case for infringement exists. But considering the emphasis on fair use—and all similar cases will almost certainly turn on fair use—we can assume that this statement from AAP is correct:

Meta would have this Court believe that authors’ original expression is not preserved in or exploited by the model. But this is not so. The LLM algorithmically maps and stores authors’ original expression so it can be used to generate output—indeed, that is the very point of the training exercise.

Kadrey and all AI training lawsuits with similar facts presented will turn on fair use factors one and four. Under factor two (nature of the works used), the books in Kadrey, and the works in most other cases, are “expressive” rather than “factual” in nature, and therefore, this factor favors plaintiffs. Under factor three (amount of the work used), it is understood that whole works have been fed into the models, and so, this factor also favors plaintiffs.

Under the first fair use factor (purpose of the use), the court considers 1) whether the use is transformative; and 2) whether the use is commercial. Here, Meta’s commercial purpose is undeniable, and the AAP brief soundly argues that there is nothing transformative about copying the word-for-word expression in textual works for a purpose that sheds no light on the works used. On the contrary, the intent of the LLM is to create a non-human, substitute “author,” a purpose for which there is indeed no judicial precedent.

Factor four considers potential market harm to the copyright owner(s) of the work(s) used, and factor four may be the keystone in the broader creators versus GAI battle. Meta, a trillion-dollar company run by executives whose credibility is in doubt, contends that it is not feasible to license the books it used to train Llama. In response, AAP presents substantial evidence of licensing agreements between copyright owners and several major AI developers, and it states that Meta abandoned negotiations with publishers and chose instead to harvest books from pirate repositories.

Further, AAP argues “from a policy perspective” that Meta’s accessing those pirate “libraries” of DRM-free books militates against finding fair use in contravention of Congress’s intent when it passed the Digital Millennium Copyright Act (DMCA) in 1998. “Congress sought to establish a robust digital marketplace by ensuring appropriate safeguards for works made available online, including copyright owners’ ability to rely on DRM protections in distributing electronic copies of their works.”

In this spirit, inherent to the history of the fair use doctrine is the notion of “fair dealing” or, put differently, general legality in the overall purpose and character of the use. “The compiler of the training data’s knowledge of the unlawful provenance of the source copies might well taint the ‘character’ of the defendant’s use,” writes Professor Jane Ginsburg in a paper examining the question of fair use of works for AI training.[1]

The Professors’ Brief

The brief filed by the IP professors also emphasizes that the protected “expression” in the works is copied and exploited without license, but it also rather deftly uses Meta’s own rhetoric to doom the fair use defense. In general, when the AI cheerleaders say that LLMs “learn the way humans do,” my instinct has been to sneer at this anthropomorphic sentiment. But by giving the “learning like humans” analogy weight, the professors’ brief demonstrates exactly why that claim is fatal to a defense that the developer’s purpose is fair use.

Noting that humans indeed use protected works for “learning” all the time, the professors make plain that this exact relationship between author and reader (the basis for copyright) does not exempt the human from obtaining works legally. Thus, by Meta’s own analogy, the “machines learn like humans” claim is both an affirmation that the “expression” is being exploited and proof that there is nothing transformative about using works for “learning.”

Further, the professors have a bit of fun emphasizing that Meta et al. strain to make the machine learning process sound as technically complex as possible to obscure the fact that only by copying “expression” could the LLM actually “learn” anything. Here, a tip of the hat is deserved for the brief’s description of a human being reading a book thus:

… many billions of photons hit the book’s surface; some of those billions reached a lens, which focused them onto a retina, which converted them into electronic signals, which then resulted in electronic and chemical changes in some portion of over 100 billion neurons with over 100 trillion connections, some of those changes being transitory, and others more permanent.

The technical description of human processing and learning is even more mysterious because not even expert specialists in neuroscience know how the brain works at the neuronal level.

Well done! If that needlessly technical description of human reading requires legal access to the book, then so does the far less complex process of machine learning for AI development. Moreover, even if Meta were the vanguard developer and there were no examples of licensing deals being made, there is no rationale anywhere in commerce that a necessary resource must be free because it is essential. Meta et al. need electricity, engineers, and probably a computer or two to develop Llama, and not one of these resources is free. Yet, somehow the most essential resource—the work of millions of authors—should be free.

On that note, there has never been a more important time to protect the rights and economic value of authors who shed light on the world we inhabit. I remain more than skeptical that it will ever be desirable to create literary works without authors, musical works without composers, etc. And certainly, licensing deals alone do not address all the potential hazards of unethical or questionable uses of generative AI. How products like Llama are used will provoke discussions that are cultural as well as legal. But for the moment, fair training of all AI models is the only rule that is both ethical and consistent with copyright’s purpose.


[1] Prof. Ginsburg is not one of the professors in the brief cited for this post.

Photo by Busko

Reversal in Thomson Reuters Case May Bode Well For Copyright Owners Against AI


It has already caught the attention of most copyright watchers that Judge Bibas of the District Court for the District of Delaware (3rd Circuit) reversed his own 2023 summary judgment ruling in the copyright AI case Thomson Reuters v. Ross Intelligence. Thomson, which owns the legal research database Westlaw, sued Ross for copyright infringement after the latter built its competitive AI-powered search tool by copying over 2,000 headnotes from Westlaw. Headnotes contain summaries that the court finds are sufficiently original for copyright protection, and it also finds that the material is protected under the doctrine of “selection and arrangement.”

Judge Bibas found copyright infringement of the headnotes and held that Ross’s defenses, including fair use, all failed. It is the fair use ruling that may be predictive of outcomes in other cases alleging copyright infringement for the purpose of AI training. Notably, Judge Bibas held that fair use factors one and four favored Thomson, and that Thomson prevails overall on fair use. To review, my amended summaries of the fair use factors are:

  • The purpose of the use, including whether the use is commercial.
  • The nature of the work used (i.e., whether it is more factual or creative).
  • The amount of the work used, including whether the “heart” of the work was used.
  • The potential market harm to the work used, namely whether the use substitutes for a use that the copyright owner retains the exclusive right to exploit in the market.

In Thomson, it is compelling that the court finds factors one and four go to plaintiff and that these carry the fair use finding overall, even though factors two and three go to defendant Ross. I say this because in other AI cases involving ingestion of entire visual, musical, and literary works, factors two and three will surely go to plaintiffs, and the AI developers can only hang their hopes on factors one and four.

Under factor one, Judge Bibas held that Ross’s use was clearly commercial and that the purpose of the use serves essentially the same purpose as the works used. Here, the opinion uses language that could benefit other AI developers, but not necessarily. It states:

Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself). Rather, when a user enters a legal question, Ross spits back relevant judicial opinions that have already been written.  

On the one hand, that parenthetical note that Ross is “not generative” could be cited to argue that generative AIs like Midjourney or Udio favor a finding of transformativeness under factor one. But several of the strongest cases against the developers present similar evidence of “spitting back” copies of the material ingested. Further, as emphasized in the lawsuits against Udio and Suno, two AIs built on ingesting protected sound recordings, plaintiffs also present a strong argument that the GAIs serve the same purpose as the works used and, therefore, the purpose is not transformative.

Where a court finds under factor one that an infringing use serves the “same purpose” as the work used, this will often, quite logically, lead to finding market substitution under factor four. Here, Judge Bibas is forthright in his reversal about his initial instinct to leave factor four as a question of fact to be decided by the jury. Most notably, in my view, he writes…

I worried whether there was a relevant, genuine issue of material fact about whether Thomson Reuters would use its data to train AI tools or sell its headnotes as training data. And I thought a jury ought to sort out “whether the public’s interest is better served by protecting a creator or a copier.”

Those first considerations from 2023 reprise two familiar arguments presented in fair use defenses, but which courts have generally found unpersuasive in recent high-profile cases. The argument that the plaintiff is not yet in the market pursued by the defendant has been rejected because it fails to properly consider the “potential” market for the protected works. Next, the “public interest” (i.e., for innovation’s sake) argument has been held too broad in major fair use cases—except Google v. Oracle, which is an outlier for several reasons. Thus, in reversing his thinking, Judge Bibas writes…

Even taking all facts in favor of Ross, it meant to compete with Westlaw by developing a market substitute. And it does not matter whether Thomson Reuters has used the data to train its own legal search tools; the effect on a potential market for AI training data is enough. Ross bears the burden of proof. It has not put forward enough facts to show that these markets do not exist and would not be affected.

Because factor two is generally considered the least important and factor four has long been considered the most important, Judge Bibas rests on that precedent to find that fair use overall favors Thomson. What this decision could signal for many AI developers who have copied millions of creative works to train their models is that the generalized “innovation is important for society” arguments will find slippery footing when they argue fair use.