Is “Machine Learning” Copying or Reading?


I recently attended a round-table discussion on the subject of artificial intelligence and copyright. The first of several engaging topics I thought warranted a post was the question of “machine learning,” which I put in quotes here out of respect for one scholar who cautioned against anthropomorphizing AI by using words for human activities to describe the actions of computers. I think that view is fundamentally correct, though there are also grounds for analogy, as will be made clear by the following premise:

When you read a book, even if we might say, by way of analogy, that you are “copying” the content of that book onto your brain, this clearly does not infringe §106(1) of the copyright law proscribing unauthorized copying. Since the author naturally hopes that you will read her book, such a prohibition would be absurd, even if you had an eidetic memory and could, if prompted, recite the entire work verbatim. But if you used that gift to type the entire book from memory and made that document available, you would then violate more than one provision of the copyright law.

So, the question raised in regard to “machine learning” is whether the computer scientist who wishes to feed a corpus of books (say, an anthology of American literature) into an AI should be required to obtain licenses for the works still under copyright. Thus, the first analysis is whether the act of “copying” can be said to occur in this circumstance any more than it does for the human reader who consumes the same body of literature.

It strikes me that if what the AI does in this case is ingest the corpus of books and almost instantly deconstruct those works by synthesizing them through a neural network, then the computer scientist has a pretty solid argument that no copying has taken place. If the machine does not retain intact copies of works, or even large sections of works, with the purpose of making those intact copies available to the human market, then this “machine reading” process is arguably analogous to that of the human whose reading does not infringe §106(1) of the copyright law.

That said, the intent of the computer scientist may be a significant factor. For instance, if the training of the AI will have a commercial purpose, this may suggest a requirement to license the works under copyright. But intent can be very tricky on the leading edge of science because it is neither realistic, nor even desirable, to insist that every researcher know exactly where his experiments will lead. This would nullify the process of discovery from which many great achievements have come; hence, discovery is its own justification, and I suspect the tech companies would appeal to this rationale in regard to “machine learning.”

If the computer scientist’s goal is to see whether he can get his AI to “learn” about the American experience through literature, but he does not have a particular product or service in mind at the outset, it seems that copyright owners would be on fairly shaky ground to enjoin his use of the books.  As long as nothing that comes out the other end looks like any of the products that went in, it strikes me that this experiment exists beyond the statutory framework of copyright law.

Of course, this portrait of the individual scientist beavering away in his modest lab to see what he may discover is not what is taking place in reality. We know perfectly well that major AI experimentation occurs in the R&D labs of companies like Google and Facebook, which are well shielded by trade-secret law from divulging what they are working on or for what purpose. Like any other corporation, they are free to announce a new product or service without telling the public how they arrived at the latest result.

So, even if the use of copyrighted works as source material resulting in a commercial end might recommend some type of licensing regime, it may be very difficult to identify the threshold when the blind process of scientific discovery becomes a clear intent to exploit a commercial opportunity.  And, as mentioned, these companies would be under no obligation to divulge that eureka moment to anyone.  

On the other hand, the moment Google or Facebook did announce that new product, rightsholders could justifiably complain that a massive, highly profitable corporation has used potentially billions of dollars’ worth of material without paying for any of it. As one scholar at the round-table noted, tech companies do not get to use raw silicon for free, so why should they get to exploit millions of creative works for free, no matter what they’re turning that data into?

It’s a good question.  One that would seem to suggest a new subsection of the copyright law, and this would certainly be consistent with the fact that new forms of exploitation of works may demand equally new forms of compensation.  If nothing else, that type of statutory response could spare us all the tedious and false harangue that insists “copyright owners just want to stand in the way of innovation.”

That argument prevailed for far too long, and now the so-called innovators have a lot of splainin’ to do about their culture of blind disruption for the sake of disruption. Especially in light of the fact that AI may have some very profound effects on society as we know it, maybe this time around the copyright owners should be treated like experienced voices in the conversation rather than canaries wasting their breath in the proverbial coal mine.

David Newhoff
David is an author, communications professional, and copyright advocate. After more than 20 years providing creative services and consulting in corporate communications, he shifted his attention to law and policy, beginning with advocacy of copyright and the value of creative professionals to America’s economy, core principles, and culture.
