“Machines seek no public policy purpose, and there is no public benefit in replacing artists with them as content creators.” The warning came in June from the Writers Guild of Canada, which represents around 2,500 English-language screenwriters working across TV, film, digital media, and radio. In a letter to the federal ministers responsible for heritage and for innovation, science and industry, the WGC insisted that the government play “a leading role” in protecting the economic interests of Canadian creators from the threat of generative AI.
Generative AI, of course, refers to that subset of artificial intelligence that can generate video, images, music, and text. It’s the technology behind the breakout hit ChatGPT, a tool that answers questions using natural, human-like language and can simulate writing styles. Thanks to its broad utility, generative AI promises to become foundational for new services and economies. The technology has also sparked controversy and legal challenges. Artists and creators have been the most vocal, but the debate is also drawing in other industries concerned with how automation will reshape society.
That concern is not simply about how generative AI threatens to erode and overtake human creativity. It goes deeper: Are such models legal? There isn’t an easy answer. Today’s AI systems are variations of machine learning that require vast amounts of data to train on (as well as a lot of energy-intensive processing). The more data they get, the more accurate the systems become. Where do AI teams find all that data? The internet. Common Crawl, the nonprofit that scrapes the entire internet every month or two and makes the archive available to researchers for free, released its latest data set this summer. It contains 3.1 billion web pages collected from 35 million registered domains and includes 1 billion new URLs. You would need more than 750 iPhones with 512 GB of storage each to hold it all.
ChatGPT’s legal status depends on how it treats this public data. OpenAI, the Microsoft-backed company that created the advanced chatbot, says it does not “actively seek out personal information to train our models” and does not “use public information on the internet to build profiles about people, advertise to or target them, or to sell user data.” The problem is that AI systems tend not to distinguish between what’s private and what isn’t: they gobble up whatever they find. But to process personal information, AI companies need explicit consent. In March, Italy’s data protection authority became the first regulator in the world to ask whether OpenAI had complied with the rules governing personal data. Citing the lack of clarity around the company’s practices, it temporarily banned ChatGPT. The ban was lifted after OpenAI agreed to a set of privacy controls, but the case has since widened into a larger investigation into the technology’s relationship to Europe’s data protection rules.
Now, Canada’s Office of the Privacy Commissioner and its provincial counterparts are asking a similar question: Did ChatGPT process the personal information of Canadians without their express consent? How these questions are resolved, and how OpenAI responds, will shape the extent to which generative AI tools can be built on public data protected by privacy laws.
Foundational models could be at odds with copyright law too. Normally, making a copy of a copyrighted work carries legal and financial consequences. There are exemptions, however, such as fair dealing in Canada (or the similar concept of fair use in the United States). Fair dealing means that you can make a copy for “the purpose of research, private study, education, parody or satire.” Notice that copying data to train a commercial foundational model does not appear on the list.
But AI companies are quickly learning that copyright law can also be used to justify mass data collection. When Innovation, Science and Economic Development Canada asked for opinions on revising copyright law, many companies called for their data gathering to be protected under the rubric of academic fair dealing. The request raises worries about the autonomy of academic research in an era of AI, but also about whether commercial firms are exploiting academic exemptions to build models they intend to sell. If development happens under academic fair dealing, is it okay for companies to profit from the results? We are finding out. Google is being sued for copyright infringement. Sarah Silverman is suing OpenAI and Meta. The cases are piling up.
Government action could settle the privacy and copyright questions. New policy directives could decide that there is nothing illegal about how generative AI processes personal information, or they could rule against organizations like the Writers Guild by deciding that training a foundational model does not violate copyright. New laws, such as Canada’s proposed Artificial Intelligence and Data Act, which sets out financial penalties for applications that harm individuals, could greenlight the technology so long as the industry adopts strict safeguards. Or the technology could be blocked outright over privacy violations, as Italy briefly did.
The real struggle might not be between AI firms and artists, or even governments, but between AI firms and large copyright holders. What happens when Disney gets tired of ChatGPT fan-fictioning its Star Wars universe? Either large models will avoid using content from firms powerful enough to protect their intellectual property or they will pay for it. Large social media sites have already started selling their users’ data to AI companies. The result would be an AI landscape shaped by the few big players rich enough to pay to train large models and the few big platforms willing to sell out their users’ data.
But the problems for large foundational models are not just about data. AI might be processing not just our personal information but our very habits. “When we think about how AI might change labor, we have to understand that what we’re really doing is teaching the machine how to replace us,” writes AI researcher Solon Barocas. “And so, in many situations, often without realizing it, the work that we are doing is being co-opted in some sense by the companies that are able to then use that to train the model to perform the job that we once did.” The issues become only more vexing when models trained on stock images compete directly with the photographers who supplied them. Getty Images has already sued Stability AI, the company behind the image generator Stable Diffusion, for using its licensed content.
We might be quick to call AI firms pirates, but what we’re really watching is the rise of new empires. If we take the metaphor of machine learning seriously, then generative AI models learn from our common knowledge and sell that knowledge back to us. Co-opting, copying, processing, replacing: ChatGPT’s legal status is a proxy for a debate that is bigger than individual copyright or personal information. How do we foster an AI culture that cares about the public culture that created it?