Artificial Intelligence Impacts on Copyright Law
Expert Insights
Published Nov 20, 2024
The impact of generative artificial intelligence (AI) technology on copyright is a subject of academic, legal, and policy debate, involving a variety of stakeholders, from artists to major newspapers to technology companies. Recently, the U.S. House of Representatives Judiciary Subcommittee on Courts, Intellectual Property and the Internet has held several hearings on AI and copyright.[1] The U.S. Copyright Office (USCO) has been preparing several major reports concerning digital replicas, or the use of AI to digitally replicate individuals' appearances, voices, or other aspects of their identities; the copyrightability of works incorporating AI-generated material; and training AI models on copyrighted works, as well as any licensing considerations and liability issues, including considerations of fair use.[2] The USCO will also publish an update to the Compendium of U.S. Copyright Office Practices, the administrative manual for registration.[3] This follows a public consultation, which amassed more than 10,000 comments from artists, lawyers, teachers, publishers, and trade groups representing all 50 states and 67 foreign countries.[4] Generative AI and copyright have also been the subject of more than two dozen lawsuits, stakeholder roundtables led by the U.S. Federal Trade Commission,[5] and several bills proposed in Congress.[6] Separately, further developments regarding the right of publicity have taken place, with the USCO releasing a dedicated report.[7]
This paper presents three main questions regarding whether:
1. works created with generative AI can be protected by copyright
2. training AI models on copyrighted works constitutes infringement or is permissible as fair use
3. the outputs of generative AI models can themselves infringe copyright.
These questions remain open and hotly debated. This paper provides policymakers with legal insights on copyright law and AI from both the United States and abroad, explaining various positions to help balance competing interests. The interests at stake include providing incentives to authors and securing their rights, promoting innovation and the interests of the technology industry, maintaining global competitiveness of AI, and addressing the underlying issues of free expression, the practical difficulties involved, and the law's adaptability to new technological landscapes. This paper aims to spark dialogue on key aspects of AI governance without offering a comprehensive analysis or definitive recommendations, especially as AI applications proliferate worldwide and complex governance debates persist.
Generative AI systems can produce material that would be copyrightable if it were created by a human author. In the United States, copyright law developed from the Patent and Copyright Clause of the Constitution and, later, the Copyright Act of 1976; it requires all protectable works to be authored by a human being and to be original. Originality means that a particular work is the author's own, not copied, and more than minimally creative.[8] These requirements do not set a high bar to protection — in fact, many people author multiple original works on a daily basis. Nonetheless, the courts have interpreted the requirements to mean that works generated by AI without human creative authorial contribution are not protectable, consistently holding that "human authorship is a bedrock requirement of copyright."[9] Similarly, the USCO refuses to register copyright in works without a human author.[10]
This means that U.S. law requires AI to be merely an assisting instrument allowing authors to express their own conception.[11] The Copyright Registration Guidance provides that, "If a work's traditional elements of authorship were produced by a machine, the work lacks human authorship and the Office will not register it."[12] Thus, simple instructions or prompts given to generative AI software that result in complex artwork will likely not be sufficient for the work to be protectable and registrable under current law.[13] For example, if a user provides an AI with an instruction to write a poem in the style of a famous artist, the expressive elements of the work will be produced by the AI, rather than the user, and thus will likely be unprotectable.[14] Nonetheless, the use of AI in the creation of a work is not an absolute bar to registration. As long as the software merely assists in authorial expression, the result might be protectable. Similarly, if an author creatively arranges the outputs of an AI system or edits them significantly, the end result also might be protectable.[15] As of February 2024, the USCO had issued registrations to "well over 100" AI-assisted works.[16] In attempting to delineate the difference between AI-assisted and AI-generated works, the USCO relies on applicants' disclosures; any use of AI that extends beyond minimal use needs to be disclosed. The USCO's examiners make a case-by-case analysis, with the conclusion ultimately depending on the level of human involvement.[17] For example, using a spellcheck or an automatic filter in photography will clearly not present an obstacle to obtaining a copyright, whereas using a simple command to generate a series of works likely would not result in copyrightable content. It is less clear, however, how the law will approach works that fall in the middle of this continuum, where the lines between assistance and generation might blur. Finally, the U.S. approach presupposes that the user of the AI tool would be the author — not, for example, the AI programmers.
Many scholars agree with the USCO's policy and the principle upheld by the courts,[18] arguing that copyright is justified insofar as it promotes human authorial creative expression; where there is no author, it would be unjust to deprive the public of freedom of expression and the freedom to use uncopyrighted resources.[19] Such an approach aligns, in principle, with the direction taken by the European Union and its member states,[20] as well as Japan and South Korea.[21] Other scholars are more critical of the U.S. approach to copyright protection, claiming that it could disincentivize development of AI foundation models or fail to account for the reality of AI-assisted creativity;[22] thus, they argue for a lower bar to copyright protection.[23] Indeed, some jurisdictions, such as the United Kingdom, allow for statutory protection of computer-generated works;[24] in China, courts have granted protection to AI-generated works, including ones arising from simple prompts.[25] Finally, the influence that private contractual arrangements will have on questions of ownership, commercial exploitation, and the ability to reproduce AI-generated works remains to be seen.[26]
Many AI models — including machine learning (ML) models and the new generation of large language models (LLMs) and related generative systems, such as the popular ChatGPT, DALL·E, Midjourney, and Stable Diffusion — are trained on millions of available online materials, many of which are protected by copyright.[27] For AI models to work, they need to engage in text and data mining (TDM), a computational process that allows AI to learn from data using statistical methods, structure the texts that it ingests, and reveal patterns.[28] When AI scrapes, downloads, and processes works, it might be infringing on the right of reproduction protected by copyright law — i.e., the copyright owner's exclusive right to make copies of the work.[29] This right, however, is subject to limitation by the doctrine of fair use.[30]
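To illustrate what TDM looks like in practice, the following minimal Python sketch (using two made-up sentences, not any real training corpus) derives aggregate word-frequency statistics from a set of documents. The point of the toy example is that the mined result consists of counts and patterns, not a copy of the expressive text intended for human consumption.

```python
# A minimal sketch of text and data mining (TDM) on a toy corpus.
# The documents below are invented for illustration.
from collections import Counter

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps on the warm beach",
]

counts = Counter()
for document in corpus:
    counts.update(document.lower().split())  # tokenize and tally words

# The mined result: aggregate statistics across the corpus,
# not a reproduction of any document.
print(counts.most_common(3))  # e.g., [('the', 4), ('lazy', 2), ('dog', 2)]
```

Production-scale mining pipelines are vastly more elaborate, but the basic character is the same: what is extracted is statistical structure rather than expressive content.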
Fair use allows for copying, distribution, display, or performance without permission based on a four-factor test.[31] The fair use test is complicated, but it essentially asks about the following:
1. the purpose and character of the use, including whether the use is commercial or for nonprofit educational purposes
2. the nature of the copyrighted work
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole
4. the effect of the use on the potential market for, or value of, the copyrighted work.
Litigation on these issues is pending, but it is unclear whether simple, broadly applicable legal answers can be expected.
Generally, if AI copying and processing of works are found to fall under the umbrella of fair use, technology companies can use the material as they wish; if such uses do not fall under that umbrella, then permission must be sought and payment made. How the fair use question is answered will affect the future of both the technology and creative industries; whether authors and publishers will be able to profit from, or refuse to allow, the use of their works for AI training; and, ultimately, how much and what kind of generative AI innovation appears in the United States.
One example of fair use that could allow for AI training is described by the principle of non-expressive use. Many scholars agree that TDM — which comes down to AI learning from the "non-expressive" elements of works (that is, extracting facts and statistical patterns rather than retaining the original, creative parts of works) — should be considered fair use and allowed under the law.[34] For example, ML (an earlier form of AI) works in the following way: When AI ingests images of beaches, it learns to identify the concept of a beach and distinguish it from, say, a classroom. Such non-expressive technical elements of works, just like facts or ideas, are not copyrightable and thus cannot be infringed, but their processing is useful for AI to learn about the works it ingests.[35] In other words, the use is deemed transformative because the photographer of a beach and the AI owner use the photographs for entirely different purposes.[36] This principle of non-expressive use can be seen in several cases predating generative AI. One example is a plagiarism detection tool that copied original works to compare them with new ones.[37] A more important example is searchable databases, such as HathiTrust or Google Books, which allowed for TDM and browsing of snippets.[38] At the same time, it is not entirely clear whether rulings will go the same way in the currently pending, generative AI–specific litigation, for several reasons.[39] First, generative AI does not merely analyze training data as information; it is able to produce digital artifacts in the same form as its training data. Second, many generative AI outputs are direct competitors to the works on which the AI was trained. Third, generative AI is able to reproduce particular works with a high degree of similarity (for example, fictional characters, such as Mickey Mouse).[40] These issues are addressed further below.
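To make the beach example concrete, the minimal sketch below trains a tiny logistic-regression classifier on synthetic stand-in feature vectors (no real photographs are involved; all names and numbers are illustrative assumptions). After training, everything the model retains is a handful of learned parameters.

```python
# A minimal sketch of discriminative ML: the model learns statistical
# parameters that separate two classes; the training "images" (here,
# synthetic two-number feature summaries) are not stored in the model.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features, e.g., color-histogram summaries of photos:
# "beach" photos skew toward sand/sky tones, "classroom" photos do not.
beach = rng.normal(loc=[0.8, 0.7], scale=0.1, size=(100, 2))
classroom = rng.normal(loc=[0.3, 0.2], scale=0.1, size=(100, 2))
X = np.vstack([beach, classroom])
y = np.array([1] * 100 + [0] * 100)  # 1 = beach, 0 = classroom

# Fit logistic regression by plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability of "beach"
    w -= 0.1 * (X.T @ (p - y)) / len(y)  # gradient step on the weights
    b -= 0.1 * np.mean(p - y)            # gradient step on the bias

# The model's entire "memory" of the training set is three numbers.
print("learned parameters:", w, "bias:", b)
```

Generative models are trained with related statistical machinery, but, as the pending litigation highlights, their capacity to emit artifacts in the same form as the training data complicates the non-expressive-use argument.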
The development of the fair use doctrine in the area of generative AI involves not only legal analysis but also policy choices that could affect the shape of the AI industry and thus influence the direction of innovation and the distribution of costs and benefits across groups such as rights owners, technology companies, and users. Some argue that a broad interpretation of fair use is advantageous because that would allow for TDM in such sectors as the life sciences, linguistics, ML, and internet search engines (which rely on TDM heavily), thus supporting innovation and research and, ultimately, benefiting society.[41] Some argue that allowing TDM is crucial for generative AI models to exist because they could not otherwise be trained. Furthermore, some participants in the legal and policy debate claim that licensing of the works on which the AI is trained is an unrealistic proposal, given the size of datasets ingested by AI and the fact that one would need to obtain a license to both the database and the individual works contained in it.[42] Technology industry representatives have also argued that licensing entails a risk of disincentivizing innovation on the one hand and providing an incentive to sue for infringement on the other.[43] In other words, for generative AI to continue developing at a rapid pace, and for innovation and the technology sector to flourish, it might be important to consider TDM fair use.
Nonetheless, there are competing interests, values, and perspectives. Authors' rights advocates emphasize that the fairness of a particular use must be decided on the facts of a particular case, claiming that TDM is not presumptively fair.[44] They further argue that, under the existing law, scraping of existing works should not be free, especially when done for commercial purposes.[45] Artists and creative professionals have also voiced concerns about the lack of "consent, compensation or control" when it comes to AI model training.[46] They argue that the loss of a licensing market is an important fair use consideration. In their view, large, for-profit companies are naturally suited to bear such costs, and licensing markets have already begun developing.[47] Some companies have chosen to enter into licensing agreements with rights holders, others are litigating,[48] and still others have decided to train their AI on data they already possess.[49] Moreover, the use of data obtained illegally online weighs against a fair use finding. Finally, some argue that the analysis should consider not just the intermediary purpose of training AI but also the ultimate purpose of creating new works.[50] The challenge of balancing different fair use considerations makes the standard difficult to apply generally — especially to the novel types of AI, including generative LLMs, as addressed below.
Given the ongoing global AI race and the transnational nature of the digital economy, the development of U.S. copyright policy cannot be considered in isolation. Even where the legal frameworks or objectives differ, it is important to understand the requirements that U.S. companies need to comply with to enter foreign markets; this is why scholars of regulatory competition often speak of the "Brussels effect," highlighting the influence of EU regulations abroad.[51] For example, Japan and Singapore provide TDM exceptions to the general protection of rights holders provided by copyright law.[52] In the United Kingdom, similar ambitions were scrapped on account of potential harm to the creative industries and a decision not to incentivize AI development at all costs; the law explicitly allows TDM only for research purposes, although this could be revisited.[53] Meanwhile, legal frameworks in Latin America and China are still being developed.[54] Scholars have argued that, on the one hand, developers of AI models could choose to locate their investments in the most friendly jurisdiction, such as the United States. On the other hand, they also call attention to the possibility that a greater international convergence of standards (or "regulatory race to the middle") is likely to develop, striking a global compromise between the different economic interests at play.[55]
The recently adopted EU AI Act (together with the earlier Directive on Copyright in the Digital Single Market) created a framework for dealing with TDM for AI.[56] The EU legislation contains two exceptions that allow TDM.[57] First, research organizations and cultural heritage institutions are free to use reproductions and extractions for the purposes of scientific research, provided they have lawful access to the works.[58] Copyright holders cannot opt out of or prevent such practices; nonetheless, they do have the right to apply measures ensuring the security and integrity of networks and databases and to develop codes of practice. Second, when TDM is undertaken for nonresearch or commercial use, owners of copyrighted works can prevent the mining of those works by making an express reservation of rights in an appropriate manner.[59] Additionally, the EU AI Act imposes an obligation to implement technologies enabling providers of AI models to honor copyright holders' decisions to opt out and prevent data mining of their work.[60] Companies have already started using opt-out notices pursuant to the EU AI Act,[61] and AI providers have implemented opt-out processes.[62] The opt-out model is widely seen as a rights holder–friendly compromise, especially contrasted with the currently developing shape of fair use in the United States.[63] Others maintain that an opt-in solution should be pursued instead,[64] arguing that opt-outs pose an undue burden on rights holders or are unworkable in practice.[65]
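One machine-readable way that such reservations are commonly expressed in practice is a robots.txt rule addressed to AI training crawlers. The sketch below shows a crawler checking for that signal before fetching a work; the crawler name and URLs are hypothetical placeholders, and the exact opt-out mechanisms under the EU framework are still being standardized.

```python
# A minimal sketch of honoring a robots.txt-based TDM opt-out.
# "ExampleAIBot" and the URLs are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

CRAWLER_USER_AGENT = "ExampleAIBot"  # a hypothetical AI training crawler

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt rules

url = "https://example.com/articles/some-article.html"
if parser.can_fetch(CRAWLER_USER_AGENT, url):
    print("No reservation detected for this crawler; fetching may proceed.")
else:
    print("Rights reservation detected; exclude this work from training data.")
```

A compliant training pipeline would run a check of this kind before adding any fetched work to its corpus.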
Although the European Union allows for TDM, it also puts an explicit obligation on providers of general-purpose AI models to implement policies complying with the law and any reservation of rights by copyright holders.[66] Importantly, the EU legislation further imposes significant transparency obligations, including the requirement to publish a detailed, comprehensive summary of content used for model training; this summary could include the main data collections or sets that went into training the model, such as large private or public databases or data archives, and a narrative explanation about other data sources used.[67] Compliance with these obligations will be monitored by the EU AI Office,[68] and public authorities will be able to impose fines and orders to withdraw AI models from the European market.[69] Notably, these obligations apply to all models that are placed on the EU market, regardless of where the training took place; thus, they could apply extraterritorially, including to U.S. companies.[70] While this is widely seen as a victory for rights holders, others argue that it might disincentivize AI models from entering the EU market at all.[71] No such requirement exists so far in the United States, although the Generative AI Copyright Disclosure Act proposed by Democratic California Congressman Adam Schiff would impose a series of disclosure requirements.[72] Furthermore, content transparency and provenance concerns are emphasized in an AI Roadmap presented by Democratic New York Senator Chuck Schumer and the Bipartisan Senate AI Working Group and are featured in the newly proposed Content Origin Protection and Integrity from Edited and Deepfaked Media (COPIED) Act of 2024.[73] These concerns have also been emphasized by commentators in public consultations in the United States.[74]
Fair use is often interpreted by legal commentators as allowing for non-expressive use to train AI models.[75] LLMs and other generative models, the newest kinds of generative AI, pose a distinct problem, however. Models such as ChatGPT, DALL·E, Midjourney, and Stable Diffusion can produce text, images, and other media that are indistinguishable from the works on which they are trained.[76] According to some legal scholars, the fact that LLMs create such works could undermine the claim that the use is fair.[77] Rights holders argue that LLMs fall afoul of the fourth fair use factor, effectively competing with artists' works and publishers' websites in the market, or outright substituting for the authors' works.[78] Creative industry advocates argue that training of a model should not be considered the end purpose of TDM; rather, the ultimate purpose is the generation of output that serves the same purpose as the ingested works, which weighs against a finding of fair use.[79] Although some artists compare LLMs to plagiarists or robbers,[80] other stakeholders highlight the many social, economic, and consumer benefits that the technology seems to bring in such fields as art, medical research, and autonomous vehicles.[81] Perhaps, as some scholars note, the shape of the doctrine should depend on whether licensing solutions and data markets have developed sufficiently to justify rights holders' claims.[82]
In addition to the question of whether training AI models on unlicensed works and databases infringes copyright, there is the related question of whether the outputs of such generative models are infringing.[83] For both of these issues, one of the technical questions concerns the extent to which AI models retain the actual, expressive content of the works they were trained on, given that they appear able to re-create nearly exact copies of substantial portions of particular works.[84] LLMs do seem, at least sometimes, to "memorize" works in their training data, as recent lawsuits allege;[85] in such cases, AI communicates the original expression from the works it was trained on, which is suspect under the fair use framework.[86] There are already several highly publicized cases in which AI seemingly re-created whole articles;[87] images with a painter's signature or watermark;[88] or copyrightable characters, such as Snoopy or Mickey Mouse.[89] Although experts caution that these are rare instances and AI providers are taking steps to prevent them,[90] such cases might nonetheless undermine the claim to fair use if the outputs are substantially similar to the works on which the AI was trained.[91] It is not difficult to re-create copyrightable characters if users provide detailed prompts.[92] Finally, generative AI can also be used to mimic the style of artists, such as singers, illustrators, or writers. This makes the fair use argument less persuasive because the output is closer to a substitute for the copyrighted material used in the training data than a transformation of it.[93] At the same time, style has always been difficult for copyright doctrine to address,[94] sometimes being called unprotectable, making many cases of algorithmic reproduction allowable under the law — or, at least, difficult to analyze legally.[95] The USCO recently issued a report concluding that although artistic style is and should remain unprotectable under copyright, "there may be situations where the use of an artist's own works to train AI systems to produce material imitating their style can support an infringement claim."[96]
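How such memorization might be detected is itself a technical question. One simple heuristic, sketched below with invented strings, is to count verbatim n-gram overlaps between a model's output and a candidate training text; real audits use far larger corpora and more robust similarity measures.

```python
# A minimal sketch of flagging possible memorization via verbatim
# n-gram overlap. Both strings below are invented for illustration.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

training_text = "it was the best of times it was the worst of times in the city"
model_output = "as the saying goes it was the best of times it was the worst of times indeed"

shared = ngrams(training_text) & ngrams(model_output)
print(f"verbatim 8-gram overlaps: {len(shared)}")  # nonzero overlap suggests copying
```

A few shared 8-grams from a famous opening line prove little on their own; substantial similarity, as the doctrine asks, is a qualitative legal judgment, not just a count.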
In deciding AI fair use cases, courts might be swayed by a host of legal and policy arguments regarding generative AI model training: the supposed spread of misinformation, perpetuation of biases, and replacement of artists; increased productivity; new forms of creativity; and accelerated research.[97] Further arguments involve job displacement resulting from AI and questions of antimonopoly policy.[98] Some of these concerns might be too broad for copyright doctrine to address.[99] More generally, if the courts reject fair use for generative AI, they could halt innovation or push it offshore; if they accept fair use, they might divert economic gain from individual creators.[100] Applications of the fair use doctrine in this context are yet to be decided; it is possible that cases could point in divergent directions.
Copyright law protects original works of human expression. It does not protect AI-generated works in which a human makes little to no creative contribution, such as by typing a simple prompt, but it can protect works created with the use or assistance of AI. It is not yet clear how much creative input will be required to render an AI-assisted work protectable under copyright. Training of AI models is likely to be deemed legal if the model does not retain protectable expression from the works it ingests. Generative AI, such as LLMs, presents more-complex considerations, leading to a fact-specific inquiry into the source of the training data, the purpose of the model, and the effect on licensing markets, whether existing or potential. These questions will be settled in litigation and might not yield uniform answers initially, though the issue likely will be resolved by either legislation or the Supreme Court. It is unclear whether legislative solutions pursued in other jurisdictions, such as the European Union, will influence domestic U.S. developments; they might, however, affect which policies U.S. companies implement. Finally, if global copyright standards continue to diverge, an expansive doctrine of fair use might allow the United States to remain a leader in international technological competition and attract investment in AI — at the price of domestic rights holders.
Matt Blaszczyk is a research fellow at the University of Michigan Law School. Geoffrey McGovern is director of Intellectual Property and a senior political scientist at RAND. Karlyn D. Stanley is a senior policy researcher at RAND.
This research was sponsored by the RAND Institute for Civil Justice and conducted in the Justice Policy Program within RAND Social and Economic Well-Being and the Science and Emerging Technology Research Group within RAND Europe.
This publication is part of the RAND expert insights series. The expert insights series presents perspectives on timely policy issues.
This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to this product page is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial purposes. For information on reprint and reuse permissions, please visit www.rand.org/pubs/permissions.
RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.