In the race to build smarter Artificial Intelligence (AI), should knowledge be locked behind paywalls or flow freely? Generative AI, as a technology djinn, has burst onto the digital scene, enhancing productivity by answering questions, solving complex problems, and creating content across multiple domains—capabilities that are only possible thanks to breakthroughs in machine learning (ML).

The Legal Battleground: Creators vs. Tech Giants

Beneath this magical facade lies a brewing conflict that pits content owners against tech giants. Content owners have filed several lawsuits against AI companies: Getty Images sued Stability AI over image use; authors such as Sarah Silverman sued Meta and OpenAI for using their books without permission; the New York Times and eight daily newspapers have taken legal action against OpenAI and Microsoft for using their articles; and artists have filed a class-action lawsuit against Midjourney and Stability AI, among others. The crux of the matter is the alleged unauthorized use of copyrighted works to train AI models.

The flashpoints extend beyond the courtroom. An AI track mimicking Drake and The Weeknd hit streaming platforms before being removed on legal grounds; Wacom and Hasbro faced backlash for using AI images in marketing; OpenAI’s Sora produces realistic videos from text prompts; and AI language models are generating news articles and opinion pieces. Meanwhile, over 200 artists signed an open letter against AI voice cloning; Wizards of the Coast admitted to using AI in marketing materials for Magic: The Gathering (a tabletop and digital collectible card game); lawsuits over copyright infringement in AI training are ongoing; and discussions continue on ethical guidelines for AI-generated content.

At first glance, the solution seems deceptively simple: make the tech giants pay for the training data. But this knee-jerk reaction could be a Trojan horse. By erecting a paywall around data, one risks not only fortifying the very monopolies that are being challenged but also concentrating AI’s benefits in the hands of established cash-rich tech corporations rather than distributing them widely across society.

Restricting AI’s access to the vast repository of data and knowledge, on the other hand, will severely hamper its ability to learn, significantly impeding potential advancements and curtailing broad societal benefits. And, if AI trains and learns from its own outputs or synthetic data, it risks a downward spiral where its output becomes less varied, less accurate, and of lower quality, leading to hallucinations—a process researchers call “model collapse.”

In fact, many legal scholars argue that ML systems should generally be allowed to use copyrighted databases for training purposes under the already established “fair use” doctrine.

Fair Use and Fair Learning

While it may appear that AI tools create content from nothing, this is not the case. Behind the scenes, generative AI platforms are trained and developed on massive datascapes and question snippets, leveraging billions of parameters constructed by software processing massive archives of images and text data.

Dan Cahoy, a Professor of Business Law at Penn State’s Smeal College of Business, states, “Ultimately, there are four broad and somewhat amorphous factors to determine fair use: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the potential market for the original work. The first factor generally favors fair use if the work is “transformed” into something that is different in terms of its message or use from the original. It looks at whether new ideas, meanings, or value has been added for a different audience. This change might create fresh information, artistic elements, insights, or viewpoints beyond simply copying the original.”

Mark Lemley, the William H. Neukom Professor of Law at Stanford Law School and the Director of the Stanford Program in Law, Science, and Technology, writes about ML models in a recent paper, “To perform, they must first learn how—generally, through a process of trial-and-error of epic proportions. And in order to create the right conditions for this learning process, engineers must begin by collecting and compiling enormous databases of exemplary tasks for machines to practice on, known as “training sets.” ML systems generally require a more permanent training data set to test successive iterations of the software against. […] Creating a training set of millions of examples almost always requires, first, copying many more millions of images, videos, audio, or text-based works. Those works are almost all copyrighted.” It’s like creating the ultimate study guide.

Lemley posits further, “AI isn’t competing with authors or artists. Instead, it is using their work in an entirely different manner. […] ML systems generally copy works, not to get access to their creative expression (the part of the work the law protects), but to get access to the uncopyrightable parts of the work: the ideas, facts, and linguistic structure of the works.” He proposes “fair learning” as a principle: the use of copyrighted works to train ML systems should be considered fair even if some fair use factors, such as the nature of the work and the amount taken, would otherwise weigh against it.

Take a language AI model trained on millions of books. It’s not interested in the stories, characters, or themes; instead, it aims to learn linguistic patterns – things like grammar rules, sentence structures, and word relationships. Similarly, for an AI model to learn what a dog looks like, it needs to analyze millions of dog photos. The system isn’t interested in the artistic composition or the specific dog in each photo – elements that might be protected by copyright. Instead, it’s learning to recognize general features like fur, four legs, tails, and typical dog shapes. In fact, “verbatim copying” is the necessary intermediate step toward accessing the unprotectable “ideas and functional elements” of works that allow AI systems to learn generalizable patterns and concepts rather than simply memorizing specific content. AI models encode patterns from training data into parameters and generate responses from learned probabilities rather than by looking up stored content.
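To make the idea of “learned probabilities” concrete, here is a minimal sketch of a toy bigram language model in Python. It is a deliberately tiny stand-in, not how large neural models are built: the three-sentence corpus and the function names are invented for illustration. What it shows is that the only thing the model keeps after training is a table of next-word probabilities.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count word-to-next-word transitions and turn them into probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1

    # The model's "parameters" are these conditional probabilities,
    # not the original sentences.
    model = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        model[word] = {nxt: n / total for nxt, n in followers.items()}
    return model


def most_likely_next(model, word):
    """Return the most probable next word, or None if the word was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical training text, made up for this example.
corpus = [
    "the dog chased the ball",
    "the dog ate the bone",
    "the cat chased the mouse",
]
model = train_bigram_model(corpus)

print(most_likely_next(model, "the"))  # "dog" follows "the" most often
print(model["the"])                    # probabilities learned for words after "the"
```

In this toy, “the” is followed by “dog” in two of its six occurrences, so the model stores a probability of one third for that pair. Real systems learn billions of such numerical relationships through far more complex methods, but the stored artifact is likewise a set of parameters rather than a library of texts.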

The concept of “enablement” in patent law requires inventors to fully disclose their innovations in depth, akin to a detailed recipe or instruction manual containing all the information someone skilled in the field needs to understand, use, and recreate the invention. This principle of “thorough knowledge” sharing underpins a legal framework designed to drive progress, and it should apply to the ethos of both human invention and ML.

Behind the Curtain: AI’s Insatiable Data Appetite

To illustrate AI’s exponential appetite for data, consider the GPT series’ evolution: GPT-1 (117M) → GPT-2 (1.5B) → GPT-3 (175B) → GPT-4 (est. 1T). This approximately 8,547-fold increase in parameters (the AI’s adjustable components) over four generations helps explain why AI needs vast datasets. More parameters allow AI to understand context and generate human-like responses across diverse topics, but they also require far more training data.
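The growth figure can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts quoted above; the GPT-4 number is the article’s estimate, not a confirmed figure, and the function name is chosen for illustration.

```python
# Parameter counts as quoted in the text; the GPT-4 value is an estimate.
PARAMETER_COUNTS = {
    "GPT-1": 117e6,
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
    "GPT-4 (estimated)": 1e12,
}


def fold_increase(earlier, later):
    """How many times larger the later count is than the earlier one."""
    return later / earlier


names = list(PARAMETER_COUNTS)

# Growth from each generation to the next.
for previous, current in zip(names, names[1:]):
    ratio = fold_increase(PARAMETER_COUNTS[previous], PARAMETER_COUNTS[current])
    print(f"{previous} -> {current}: about {ratio:,.1f}x")

# Growth across the whole series.
overall = fold_increase(PARAMETER_COUNTS[names[0]], PARAMETER_COUNTS[names[-1]])
print(f"{names[0]} -> {names[-1]}: about {overall:,.0f}x")
```

Dividing 1 trillion by 117 million gives roughly 8,547, which matches the figure in the text. The step-by-step ratios show that the growth was uneven, with the largest single jump between GPT-2 and GPT-3.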

The unprecedented scale of data required to train modern AI models lies at the heart of the current controversy. At a certain point, it’s no longer about selecting data; it’s about crawling the entire internet. Both ML systems and humans learn from information, but machines require vastly more data to excel at tasks.

From Colossal Expenditures to Widespread Technological Empowerment

Developing large ML models not only requires massive amounts of data – it also demands enormous capital investments in data centers, top-tier tech talent, mega-scale energy utilities, and infrastructure. The scale of this spending is staggering. In the latest quarter alone, tech giants Amazon, Microsoft, Alphabet, and Meta poured a combined $52.9 billion into capital expenditures. Meanwhile, venture capital firms have invested $64.1 billion in AI startups so far in 2024.

Despite the huge upfront costs borne by tech companies and investors, the benefits of AI are poised to ripple across the entire business and consumer landscape. Enterprises of all sizes and individuals will be able to piggyback on these AI breakthroughs through efficiencies in the AI value chain. As a result, they will be able to solve their unique problems: streamline operations, create new services, boost productivity, and spark innovation — all without spending billions on research and development. The features and benefits of AI will be broad and egalitarian, empowering businesses, individuals, and diverse sectors globally to dramatically enhance productivity and actively shape the evolving economy.

AI’s Creative Surge Challenges Art, Ethics, and Law

AI-generated content is blurring the lines between human and machine-made art and challenging notions of creativity, authenticity, and human expression. AI can create artwork in the style of celebrated masters, compose music as fluently as skilled musicians, play complex strategy games, write software, produce writings that are difficult to distinguish from those of real journalists, and more—all at a blazing speed. While these AI feats inspire awe, they simultaneously fuel worries about artificial intelligence as a genuine competitive threat. And what we’re seeing now is just the tip of the iceberg in AI’s potential.

The uncannily realistic AI outputs are sparking debates about intellectual property rights, attribution, and accountability. Who can claim an AI-produced image, text, or song as their own? Should artists whose works are part of the massive training data sets that computers use to generate their results be credited and paid? How should the legal system respond to the blurring lines between human and AI-created works?

Copyright Law: Adapting to the AI Era

Rebecca Tushnet, who teaches law at Harvard, has a different take on this. She believes the rules about who owns creative work are already pretty clear when it comes to using AI. Tushnet points out that for a long time, the U.S. Copyright Office has only given copyright to things that were mainly made by actual people. She says courts have backed this up over and over. In her words, “The U.S. courts are generally in agreement that you need a human being sufficiently in the loop to have an author. And a lot of AI-generated works are not that.” Basically, she’s saying that if a computer does most of the work, it might not count as something you can copyright.

Imagine an AI system trained to recognize Prince across various artistic representations, processing millions of images, including Warhol’s iconic pop-art portraits. The AI’s goal is simply to identify Prince within all artistic variations; it is not specifically trying to learn Warhol’s techniques. Yet the model inadvertently gains the ability to mimic Warhol’s style as a byproduct of its extensive training.

Such cases raise questions about authenticity and ownership of artistic content across myriad creative genres, not just visual art, and they help explain why so many lawsuits are making their way through the legal system. In the short term, however, the most readily available solutions rest with the companies themselves.

Solutions are already being implemented in the form of bot-filters to prevent the output of content that closely resembles existing copyrighted works. YouTube uses technology to spot when someone uploads a video with copyrighted music or clips; once it catches these, it takes them down automatically. AI companies could similarly build smart filters that look at what their AI is producing and conclude, “This looks an awful lot like that famous painting,” or “This tune is super close to that hit song from last year.” If it’s about to spit out something that’s practically a carbon copy of something that already exists, these filters would step in and block it.
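A filter of this kind can be sketched in a few lines of Python. The example below is a simplified illustration for text only, assuming a word-trigram overlap measure (Jaccard similarity) and an arbitrary 0.8 threshold; the function names, sample sentences, and threshold are all hypothetical, and this is not a description of how YouTube’s system or any AI company’s filter actually works.

```python
def word_ngrams(text, n=3):
    """Return the set of consecutive n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Shared n-grams divided by all distinct n-grams across both texts."""
    grams_a, grams_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def should_block(candidate, protected_works, threshold=0.8):
    """Block the candidate output if it nearly duplicates any protected work."""
    return any(
        jaccard_similarity(candidate, work) >= threshold
        for work in protected_works
    )


# Made-up stand-ins for a protected work and two model outputs.
protected_works = ["the quick brown fox jumps over the lazy dog near the river bank"]
near_copy = "the quick brown fox jumps over the lazy dog near the river shore"
unrelated = "a slow green turtle walks under the bright sun by the lake"

print(should_block(near_copy, protected_works))  # near-duplicate is blocked
print(should_block(unrelated, protected_works))  # unrelated text passes
```

The near copy shares 10 of 12 distinct trigrams with the protected sentence, so it crosses the threshold, while the unrelated sentence shares none. A production filter for images or audio would need very different similarity measures, and choosing the threshold is a policy decision as much as a technical one.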

Professor Lynda Oswald of the University of Michigan sums it up: “One of the biggest legal challenges of the mid-21st century will be figuring out how to regulate AI effectively. Consider the complexity of the ownership and monetization issues posed by AI and its voracious consumption of data, the breakneck pace of advancing AI technology and the shuffling pace of the law trailing behind it, and the diverse and often conflicting interests and objectives of the myriad stakeholders involved. Crafting a regulatory response that adequately accounts for all of these factors and allows AI development to proceed on an orderly track will be a herculean task.”

Conclusion

As we navigate the complex landscape of AI and copyright law, a nuanced understanding is emerging. Legal scholars suggest two key points: first, the input data used for AI training may often be permitted under “fair use” or “fair learning”; second, purely machine-produced output is typically not copyrightable. This perspective recognizes that ML, at its core, is about extracting patterns and facts rather than copying creative expression. Ultimately, fair use is about more than transforming existing works. It’s about preserving our collective ability to create, share, and build upon ideas. Put another way, it’s about preserving the ability to learn, whether the entity doing the learning is a human or a machine.
