Hacked Off at Facebook

Well, it finally happened. After criticizing the worst effects of social media for over 10 years, I was finally hacked, locked out of my Facebook account, and (I assume) will be unable to restore any of the material or connections going back to 2007. I’m sharing the details in this post because what I now believe to be a phishing-style attack had the appearance of Meta erroneously booting me for failure to comply with community standards. And frankly, Meta is so useless from a support standpoint that it hardly matters.

Whether Facebook moderators are in error, or the account was targeted by a hacker, there is no clear process for the average user to remedy either issue—just a Kafka-designed carousel of unhelpful articles and FAQs. And of course, beyond Facebook’s garden wall, one finds more scammers with offers to “help” because if you recently fell prey to a hacker, you’re bleeding in shark-infested waters.

Hacker or Facebook Moderators?

I say the attack was phishing-like because the initial communication did not come through email. Email phishing attempts are common enough and usually easy to spot: the message with the slightly blurry logo and wrong URL that claims to be your bank, insurance company, or some other party with an invoice or payment for you, trying to get you to click a link and download malware. Those are easy enough to recognize and delete. But in this case, the communication came from within the Meta/Facebook environment—and not just as a DM in the Chat app.

Initially, I received messages from “Meta Business” in the Meta Business section of the platform. These were directed to me as the administrator of the Illusion of More page and not to me personally. I was told that IOM had been reported for (get this!) a copyright violation. As I do not engage in copyright violations, I responded to say that an error had been made, believing that I was writing to Meta since I was clearly on the Meta Business page and not some bogus URL. Unsurprisingly, there was no response, and a few days later, I was told in the same thread that my business page had been disabled. But the IOM page was not disabled, and I did not know what to make of the messages, especially when communication with Meta is not an option.

A few days later, I received a message directed to me personally, again within the Facebook platform, stating that an attempted login had occurred from an unusual location. I took the recommendation to change my password, and I do not believe I clicked on anything outside the Facebook universe such that I might provide the new password to a hacker. Nevertheless, several hours later, my personal account was disabled, and the relevant email and phone number were newly associated with an account called “Meta Copyright Infringement.”

I created a new personal account and did a search for “Meta Copyright Infringement” as People and found that many accounts have suffered this same fate. Some appear to still have pages intact, while others are blank:

Attacks of this nature have been reported since at least the start of 2023, but the articles I found all describe phishing via email, which is usually the vector. But unless I was truly distracted, all communication I received was within the Meta environment, and if hackers are spoofing Meta from within Meta, this implies a new and sophisticated campaign to acquire login credentials.

As for the rationale of the hacker(s), it is hard to say. In my case, as a copyright advocate, I can be a target for an anti-copyright hacker who just wants to mess with me. But so far, nothing inappropriate seems to have appeared on Facebook in my name. In fact, that account appears to have been deleted altogether. On the other hand, this just happened, so we’ll see. In the meantime, I no longer have control of two business pages, including Illusion of More on Facebook, because I was the sole administrator.

As mentioned above, this apparent hack is barely distinguishable from Meta disabling my account for an alleged violation of community standards, and the company offers zero remedies to address either issue. I mean, yeah, there’s a Help Center, but it makes the average DMV look like a hotel concierge. Meta provides a “review form” for disabled accounts, but this “form” only asks the customer to input a name, email, and a copy of ID to prove identity. But, of course, if the email entered is associated with a disabled account, you get a message saying that the account doesn’t exist, which indicates a hack, so…

Follow the instructions for recovering an account you think was hacked, and Meta will help you identify the account associated with the email…

Assuming that’s what FB thinks my account is now, I reluctantly click This is My Account, and…

And you can guess where that link “here” leads. Yup. Right back onto the carousel playing the calliope from Hell mocking you for getting on the ride in the first place.

I don’t know. Maybe I missed a clue somewhere in the attack, but the most compelling detail here is that it looked a lot like communication from Meta and within Meta. In fact, if Meta were to contact me at some point and confirm that they did kick me off for an alleged copyright violation, I would not be very surprised—except that it would still be an error. But apparently, this is what support looks like for a platform hosting three billion people: you can’t quite tell the difference between a cyber-attack and half-assed moderation insulated from its users by layers of bullshit.

EFF to Honor Scientific Paper Pirate Sci-Hub

The Electronic Frontier Foundation (EFF) announced that among the 2023 recipients of the EFF Award (formerly the Pioneer Award), it will honor Sci-Hub founder Alexandra Asanova Elbakyan this September. The Russian-based Sci-Hub is an enterprise-scale pirate site built specifically to host scientific papers, about which the EFF states:

Through Sci-Hub, Elbakyan has strived to shatter academic publishing’s monopoly-like mechanisms in which publishers charge high prices even though authors of articles in academic journals receive no payment. She has been targeted by many lawsuits and government actions, and Sci-Hub is blocked in some countries, yet she still stands tall for the idea that restricting access to information and knowledge violates human rights. 

In addition to the EFF’s usual flair for the dramatic, the organization continues to flaunt its unwavering hostility toward all copyright rights as a foundational raison d’être. Note the word shatter in that quote above. To the ideologues at EFF et al., Sci-Hub is not merely a response to subscription fees but should be revered for its assault on the very idea that journal publishers ought to exist in the first place. In fact, the timing of this award is telling in that it comes two years after a landmark open access agreement was negotiated by the University of California (UC) with publishing giant Elsevier.

Although subscription cost has often been a point of contention for many in the academic community, including contributing authors, the UC has been negotiating open access agreements with academic publishers on behalf of colleagues at smaller institutions with more limited resources. “We refer to these agreements as transformative open access agreements because they convert subscription payments into payments for open access publishing (with reading provided for free). It is a new approach we helped develop with other leading institutions a few years ago, in large part through the OA2020 initiative,” says librarian and economics professor Jeffrey MacKie-Mason of UC Berkeley.

Settled in March of 2021, the Elsevier deal was the ninth open access agreement the UC negotiated with academic publishers, which suggests that many authors of scientific research papers do not view a pirate site like Sci-Hub as a viable “solution” to whatever criticisms they have of the commercial publishers. While I do not presume to have expertise about the complex world of scientific journal publishing, a 2018 article by industry consultant Kent Anderson lists “102 things journal publishers do,” and it’s a lot more than hosting PDFs on a website. Notably, even among the critical comments on that article, which appear to be written by academic authors, there is no mention of Sci-Hub in particular, or piracy in general, as obviating the role played by journal publishers in the industry.

Although it is true (as the EFF emphasizes) that the authors of these papers are not paid for their writing, the academic publisher is more like a venue operator than a trade book publisher—a venue operator that, at best, serves as a neutral party to control quality. These journals invest substantial resources to review millions of submissions, prepare documents, maintain databases, check for plagiarism, organize peer review, etc. And although these investments need to be recovered profitably for the publisher to exist, the UC deals indicate that there is room for negotiation, which leaves EFF’s panegyric to Sci-Hub sounding as hollow as it is untimely.

Speaking of timing, with academics, policymakers, journalists, artists, and just about everyone else wondering how badly generative AI might exacerbate the misinformation problem, could there be a worse moment to honor a pirate of scientific journals? Notably, in the class-action case Tremblay et al. v. OpenAI, the plaintiffs allege that the defendant obtained literary works for machine learning (ML) from “shadow libraries” (i.e., pirate sites like Z-Library). By the same logic, Sci-Hub would seem to be the natural place for a generative AI developer to harvest scientific writing, perhaps to train an algorithm to “write” papers without scientists.

I am neither motivated nor qualified to critique the entire scientific publishing ecosystem, let alone to dispute complaints among some academics about cost and the like. I would grant Elbakyan the benefit of the doubt that her intent is at least distinguishable from that of the typical entertainment media pirate, whose only motive is financial, and I recognize that scientists and academics in various regions access Sci-Hub for information that may be difficult to obtain otherwise. Nevertheless, the worn-out view that piracy is a solution to imperfections in a given system is, at best, narrowly focused on distribution while ignoring the means and motives for production.

Not unlike Peter Sunde’s mourning the lost Marxist idealism he saw in The Pirate Bay, Elbakyan echoed this same naivete when she told the Washington Post in 2016, “On my website, any person can read as many papers as they want for free, and sending donations is their free will. Why Elsevier cannot work like this, I wonder?” Indeed. The alleged “white hat” pirate never seems to grasp that there is always a cost to production and that, whatever system covers that cost, it won’t be a damn tip jar, and it will rely on copyright in some form. As the court stated in 2015 when Elsevier successfully sued Sci-Hub for infringement, “Elbakyan’s solution to the problems she identifies, simply making copyrighted content available for free via a foreign website, disserves the public interest.”

As for the EFF Award, it’s worth asking what Sci-Hub’s agenda is in 2023 if traditional publishers are adopting open access agreements and academics are still willing to work with those publishers. Is it truly Elbakyan’s mission to “shatter” the entire scientific publishing ecosystem and, with it, essential processes like peer review? Or is that just the EFF’s hyperbole? Presumably, it’s both. And by honoring Sci-Hub, the EFF proves once again that it will promote any anti-copyright agenda—legal or otherwise—with the zeal of a conspiracy theorist watching “chemtrails” fill the sky.

Training AI With Protected Works: Is Copyright Law Designed to Respond?


Many creators feel very strongly that “training” AI models with unlicensed, copyrighted works is unjust—not least because generative AIs built on their creativity will put some creators out of business while enriching more tech moguls. It is both insult and injury to see one’s work used, without consideration, to underwrite the mechanism of one’s own obsolescence. But regardless of how we may feel about the practice of “machine learning” (ML) with unlicensed material, it remains to be seen whether and where current law provides any remedies. I’ll try to consider that topic in this post and the next, beginning with the allegation that ML is mass copyright infringement.

Four class action lawsuits against generative AI developers have been filed thus far in the District Court for the Northern District of California, and all by the same law firm. Because all the complaints are similar, I will stick to the two that were filed first. In Andersen et al. v. Stability AI et al., a class of visual artists is suing Stability AI and Midjourney;[1] and in Tremblay et al. v. OpenAI, a class of book authors is suing OpenAI over the development of ChatGPT.[2] Both complaints allege direct and vicarious copyright infringement as well as unlawful removal of copyright management information (CMI). Both complaints also contain counts for violation of the derivative works right, §106(2), and based on that theory, the Andersen complaint alleges unlawful making available of said derivative works in violation of §106(3), (4), and (5). The complaints also contain state law allegations, but I will discuss those in the next post.

Reproduction and the Battle of Analogies

The question of whether ML with copyrighted works constitutes an act of mass infringement will turn on the factual consideration as to whether any copying occurs in violation of the reproduction right (§106(1)). In Andersen and Tremblay, there is considerable focus on the potential of a generative AI to output an infringing work based on its training corpus. For instance, if the work of Karla Ortiz (one of the named plaintiffs in Andersen) is part of the ingested materials, then the assumption is that the AI model has the potential to produce a copy of an existing Ortiz work or a work that is substantially similar to an Ortiz work.

The reproduction inquiry may be different for each model and each type of work used for input. In Andersen, the complaint states, “Because a trained diffusion model can produce a copy of any of its Training Images—which could number in the billions—the diffusion model can be considered an alternative way of storing a copy of those images.” By contrast, the Tremblay complaint alleges that copying occurs (“During training, the large language model copies each piece of text in the training dataset and extracts expressive information from it”) but does not specifically describe how the ChatGPT training process entails reproduction.

If the AI system produces any copies of any of its training materials, this is evidence that the system violates the reproduction right. Prompt the generator to make an image of Dr. Strange, and if Dr. Strange comes out, then nobody can doubt that Dr. Strange is a latent copy in the system and that this potential to copy is sufficient evidence of infringement at the input stage. Alternatively, if the system can only produce work “in the style of” Karla Ortiz, this raises different issues (and very serious concerns) but may not be considered sufficient evidence of “reproduction” in the input process. But the courts need not look at outputs, or even potential outputs, to find violation of the reproduction right.

It has been held (specifically in the 9th Circuit)[3] that even storing a copy in random access memory (RAM) is sufficient to find a violation of the reproduction right. The AI developers will seek to prove that their systems do not copy the works ingested in any sense, or that if they do, they copy only non-protected (i.e., factual) elements of the works. Using anthropomorphic words like observe, learn, study, etc. to describe ML, the argument from the developers will be that these models are designed to obtain information about the works but not copy the works anywhere in the system. Input an illustration, for example, and what the system allegedly stores are millions of data points about line weights, composition, colors, shading, etc. Then, combined with billions of other data points from billions of other works, the model generates probability algorithms which are then used to produce new visual works when users prompt the system with instructions.

AI developers like to compare “training” their models to the learning a human artist does when she experiences or studies works other than her own. In addition to being a reductive and dehumanizing analogy for the ways in which artists teach themselves a craft, this line of reasoning may be seen by the courts as smoke and mirrors. The factual question is whether the system retains a copy long enough to be perceived by the machine, which has been held to be violative of §106(1). Long-term storage of a copy is not required, and my understanding is that making a “more than fleeting” copy is unavoidable in any computer system—i.e., that there is no such thing as ingestion without reproduction.

Proving reproduction will be the whole ballgame insofar as litigation can address whether feeding a corpus of protected works into an AI model is a violation of law. We shall see what the courts make of the facts presented, but without a finding of reproduction, the other copyright complaints likely fall. For instance, removal of CMI is not a stand-alone violation. Section 1202 of the DMCA states that removal is a violation if the party doing the removing knows or has reasonable grounds to know “that it will induce, enable, facilitate, or conceal an infringement of any right under this title.” Therefore, there must be a colorable claim of infringement for the CMI allegation to survive.

Derivative Works Allegations

Both the Andersen and Tremblay complaints allege that the AIs produce unlicensed derivative works in violation of §106(2), though the arguments are different in each case. In Andersen, the allegation arises from the premise that the system cannot produce anything outside the limitations of its data set composed of protected works. “The resulting image [output] is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. … [A] latent diffusion system … can never exceed the limitations of its Training Images.”

It’s an interesting theory, but I’m not sure anything in copyright law can support the argument that all potential outputs of the generative AI are unauthorized derivatives of the total corpus of works in the training set. To find an infringing derivative of a visual work (typically one image) requires a substantial similarity inquiry comparing a specific original with the follow-on work to determine what has been copied and whether that copying renders the second work a derivative of the first. This is difficult enough in the world of humans intentionally using a single visual work to produce a different visual work (see Warhol v. Goldsmith!). So, it seems highly speculative to ask a court to find generally that billions of images output are, as a matter of law, derivatives of the billions of images input. I’m not certain the court has anywhere to look for guidance to consider this reading of the derivative works right.

If this derivative works theory is tough with images, it would be even harder with text—i.e., to allege that the textual outputs are derivatives of all the textual inputs is akin to saying that every book written is a derivative of every book read. This echoes a popular sentiment among the anti-copyright crowd that no work is “original,” a premise that should not be given any legal weight, even in the service of trying to protect creators from AI developers. 

In Tremblay, the allegation is not that the individual outputs of ChatGPT are derivatives of the corpus of books used in training, but that the entire model is a single derivative work of its corpus. “Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act,” the complaint states. [Emphasis added]

Again, claiming that the entire LLM is a single derivative work of the millions of literary works fed into the system would seem to strain the derivative works right beyond the limit where any court can venture. In fact, this allegation could potentially bolster the inevitable fair use defense the AI developers will be arguing—namely that the finding of “transformative use” in Google Books favors fair use of the corpus of work used in ML.  

Fair Use & Google Books

Notably, these cases are brought in California, controlled by the Ninth Circuit and, therefore, not bound by the Second Circuit decision in Google Books, which many believe to be the strongest precedent favoring fair use for the AI developers. The comparison is a natural one. Google scanned whole books into a system to create a unique tool for searching the contents of books without providing any whole-copy substitutes for legally obtained copies. The court, noting that its decision “pushed the boundaries of fair use,” found under factor one that Google Books is “transformative” for its utility and found under factor four that it did not pose a threat to the market for the books used.

What the AI developers will try to argue under Google Books is that 1) their systems are highly “transformative” because they use protected works to create novel (even revolutionary) applications; and 2) their systems are designed to avoid outputting any copies that would serve as substitutes for the works in the data set. It is conceivable that courts or juries would find the comparison compelling, though the aforementioned capacity of a given AI to output Dr. Strange means that, unlike Google Books, the visual AI system at issue does make substitutes available and, therefore, the precedent is inapt.

By contrast, ChatGPT or another text-based application could have a stronger defense under Google Books if it is not possible, for instance, to have the system output an entire in-copyright literary work. The Tremblay complaint refers to the output of summaries, which is evidence that a whole book was ingested, but a summary is not generally an infringement and is certainly not a substitutional copy.

Meanwhile, other considerations should perhaps militate against finding fair use for generative AI model training. For instance, Google Books is a research tool for humans to learn about books written by other humans, including humans who write more books. Generative AIs are not necessarily comparable. For instance, Stable Diffusion does not provide a user with any information about an ingested work, and it poses an unprecedented threat to professional visual artists unlike any technology that has come before. Thus, the courts should consider the sui generis purpose of the generative AI at issue when citing Google Books or any other precedent to consider fair use.

In a May post, I proposed that unless the generative AI at issue can show that it promotes authorship, the court should decline to consider a fair use defense. To clarify, in Campbell, the Supreme Court states, “The fair use doctrine thus ‘permits [and requires] courts to avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster.’”[4] Until generative AI changed the landscape, there was no need to affirm that “the very creativity” fostered by copyright means “human creativity.” But today, that distinction is necessary. Although generative AI can produce volumes of “creative” material, only those works which can be protected by copyright are works of authorship. And just as it is indecent to exploit an artist’s work to build a machine that might end her career, it would be absurd to allow fair use (a component of copyright law) to defend a technology that would potentially annihilate copyright’s purpose.

Of course, that’s one man’s opinion, and one that would apply to some, but not all, works derived by generative AI. As these tools develop, and their uses are explored by various types of creators, there are examples, both in practice and in theory, in which generative AI does foster new authorship. This gets into the complicated question of copyrightability of works that humans create with some AI used in the process, and because this is itself a new discussion, it is difficult to say which generative AIs, if any, can be said to “promote the progress” of authorship as a matter of law.

Legal experts, both pro- and anti-copyright, will comment upon the strengths and weaknesses of Andersen, Tremblay, and the other suits brought by the one firm that has taken the lead on these lawsuits. But even where these cases may be flawed, they can provide some insight into the question posed by this essay: is copyright law an answer to the potential hazards of generative AI? I suspect that a fundamental difficulty arises because generative AI poses an existential threat to the future of authors, and some of the injustices and cultural calamities inherent to that threat may not be remedied (or entirely remedied) by the principles of copyright. Remedies sounding in other areas of law could loom larger, especially for certain types of creators, and that will be the subject of the next post.


[1] DeviantArt is also a named defendant, sued for breach of contract for providing works to Stability for ingestion.

[2] The same firm is now representing Sarah Silverman and another class of book authors, though the complaint is essentially the same as Tremblay.

[3] MAI Systems Corp. v. Peak Computer, Inc., 991 F.2d 511 (9th Cir. 1993).

[4] Citing Stewart v. Abend (1990).

Image by: idaakerblom