Is “Machine Learning” Copying or Reading?

I recently attended a round-table discussion on the subject of artificial intelligence and copyright. The first of several engaging topics I thought warranted a post was the question of “machine learning,” which I put in quotes here in deference to one scholar who admonished against anthropomorphizing AI by using words for human activities to describe the actions of computers. I think that view is fundamentally correct, though there are also grounds for analogy, as will be made clear by the following premise:

When you read a book, even if we might say, by way of analogy, that you are “copying” the content of that book onto your brain, this clearly does not infringe §106(1) of the copyright law proscribing unauthorized copying. Since the author naturally hopes that you will read her book, such a prohibition would be absurd, even if you had an eidetic memory and could, if prompted, recite the entire work verbatim. But if you used that gift to type the entire book from memory and made that document available, you would then infringe more than one exclusive right under the copyright law.

So, the question raised in regard to “machine learning” is whether the computer scientist who wishes to feed a corpus of books (say, the anthology of American literature) into an AI should be required to obtain licenses for the works still under copyright. Thus, the first analysis is whether the act of “copying” can be said to occur in this circumstance any more than it does for the human reader who consumes the same body of literature.

It strikes me that if what the AI does in this case is ingest the corpus of books and almost instantly deconstruct those works by synthesizing them through a neural network, then the computer scientist has a pretty solid argument that no copying has taken place. If the machine does not retain intact copies of works (or even large sections of works) with the purpose of making those intact copies available to the human market, then this “machine reading” process is arguably analogous to that of the human whose reading does not infringe §106(1) of the copyright law.

That said, the intent of the computer scientist may be a significant factor. For instance, if the training of the AI will have a commercial purpose, this may suggest a requirement to license the works under copyright. But intent can be very tricky on the leading edge of science because it is neither realistic, nor even desirable, to insist that every researcher know exactly where his experiments will lead. This would nullify the process of discovery from which many great achievements have come; hence, discovery is its own justification, and I suspect the tech companies would appeal to this rationale in regard to “machine learning.”

If the computer scientist’s goal is to see whether he can get his AI to “learn” about the American experience through literature, but he does not have a particular product or service in mind at the outset, it seems that copyright owners would be on fairly shaky ground to enjoin his use of the books.  As long as nothing that comes out the other end looks like any of the products that went in, it strikes me that this experiment exists beyond the statutory framework of copyright law.

Of course, this portrait of the individual scientist beavering away in his modest lab to see what he may discover is not what is taking place in reality. We know perfectly well that major AI experimentation occurs in the R&D labs of companies like Google and Facebook, which are well shielded by trade-secret law from divulging what they are working on or for what purpose. Like any other corporations, they are free to announce a new product or service without telling the public how they arrived at the latest result.

So, even if the use of copyrighted works as source material resulting in a commercial end might recommend some type of licensing regime, it may be very difficult to identify the threshold at which the blind process of scientific discovery becomes a clear intent to exploit a commercial opportunity. And, as mentioned, these companies would be under no obligation to divulge that eureka moment to anyone.

On the other hand, the moment Google or Facebook did announce that new product, rightsholders could justifiably complain that a massive, highly profitable corporation has used potentially billions of dollars’ worth of material without paying for any of it. As one scholar at the round-table noted, tech companies may not use raw silicon for free, so why should they get to exploit millions of creative works for free, no matter what they’re turning that data into?

It’s a good question.  One that would seem to suggest a new subsection of the copyright law, and this would certainly be consistent with the fact that new forms of exploitation of works may demand equally new forms of compensation.  If nothing else, that type of statutory response could spare us all the tedious and false harangue that insists “copyright owners just want to stand in the way of innovation.”

That argument prevailed for far too long, and now the so-called innovators have a lot of splainin’ to do about their culture of blind disruption for the sake of disruption. Especially in light of the fact that AI may have some very profound effects on society as we know it, maybe this time around the copyright owners should be treated like experienced voices in the conversation rather than canaries wasting their breath in the proverbial coal mine.

© 2019, David Newhoff. All rights reserved.

  • Thanks for this fascinating perspective!
    Distinctions need to be clarified between claims of copyright ownership of ideas vs data (symbolic language). If Artificial Intelligence generates, say, a new song derived from the input of an immense library of copyrighted songs (as data, not music), the resulting compiled output (generated song) does not currently qualify for copyright protection. In this case, the content is being input and compiled within a data processing context – rather than as a library or archive of musical works.

    In effect, songs are input into AI systems as data (at the level of fundamental musical language), which becomes a basis for algorithmic processing methods based on patterns characteristic of predefined styles of music.

    A similar process happens with human musicians, where imitation becomes a basis of innovation. The point in this creative/generative process at which ideas and language should become subject to ownership claims and control is a very timely issue in both cases. And in both cases, a solution may lie within the Fair Use doctrine and its exemptions from ownership claims within the context of creating ‘transformative works’.

    • Thanks for your comment Dolores. The question of copyright in a work that is truly produced by an AI is one that is often discussed, and I am squarely in the camp that says the lack of a human author extinguishes the foundation for copyright. Hypothetical business owners of the future “music making” machines would likely feel differently, and we may see that debate at some point; but I think there are also a lot of reasons why AI-produced “art” could fail miserably as a commodity.

      I would be careful about over-stressing the human process as analogous to the machines just as I would be careful about anthropomorphizing the machines. The creative process you describe among human authors is already well served by copyright doctrine, yes, by fair use but even more substantially by the idea/expression dichotomy. Unfortunately, though, we are witnessing a generally unwarranted revision of fair use–especially in the blogosphere where it is often mischaracterized. The doctrine of “transformativeness” has created a lot of confusion, even in the courts, and it’s my view that this is as much a problem of semantics as anything else; what began as perfectly sound analysis between two expressive works has ballooned in some cases into infringing one or more exclusive rights under Section 106 (see my posts on Brammer v. Violent Hues) largely because the word itself is not well defined as a term of art.

      Thanks for reading!
