A Playbook for Securing AI Model Weights

Sella Nevo, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, Jeff Alstott

Research Summary | Published Nov 21, 2024


As frontier artificial intelligence (AI) models — that is, models that match or exceed the capabilities of the most advanced models at the time of their development — become more capable, protecting them from theft and misuse becomes more critical. Especially important are a model's weights — the learnable parameters derived by training the model on massive data sets. Stealing a model's weights gives attackers the ability to exploit the model for their own use. The requirement to secure AI models also has important national security implications. AI developers and stakeholders across industry, government, and the public need a shared language to assess threats, security postures, and security outcomes.

RAND researchers developed a first-of-its-kind playbook to help AI companies defend against a range of attacker capabilities, up to and including the most sophisticated attacks: Securing AI Model Weights: Preventing Theft and Misuse of Frontier Models.[1] The report also strives to facilitate meaningful dialogue among stakeholders on risk management strategies and the broader impact of AI security.



This brief provides an overview of the report, which

  • identifies 38 meaningfully distinct attack vectors
  • explores a variety of potential attacker capabilities, from opportunistic criminals to highly resourced nation-states
  • estimates the feasibility that an attack vector can be executed by different categories of attackers
  • proposes and defines five security levels and recommends preliminary benchmark security systems that roughly achieve each level.

Avoiding significant security gaps requires comprehensively implementing a broad set of security practices. However, several recommendations should be urgent priorities for frontier AI organizations:

  • Develop a security plan for a comprehensive threat model.
  • Centralize all copies of weights in access-controlled and monitored systems.
  • Reduce the number of people with access to the weights.
  • Harden interfaces for model access.
  • Employ defense-in-depth for redundancy.
  • Implement insider threat programs.
  • Incorporate confidential computing to secure the weights and reduce the attack surface.

Certain measures are needed to protect against the most sophisticated attackers. These include physical bandwidth limitations between devices or networks containing weights and the outside world; hardware to secure model weights while providing an interface for inference; and secure, completely isolated networks for training, research, and other interactions. Because such efforts may take significant time (e.g., five years) to implement, it would be wise for organizations to begin now.

Why Focus on Securing AI Systems, Especially Their Model Weights?

Advanced AI models hold the promise of enhancing labor productivity and improving human health. However, the promise comes with the attendant risk of misuse and unintended consequences of deployment.

The need to protect frontier AI models is not merely commercial: Because the risks posed by these models may have national security significance, the security and interests of the public also enter the risk calculation.

Potential threats are sophisticated, particularly high-priority operations conducted by nation-states. Organizations develop their own security strategies, based on their assessment of threats. But the idiosyncratic view of one organization's security team might have implications wider than the organization itself. All stakeholders need to have a shared understanding of how security strategies, whether voluntary or governmental, translate into actual security.

The research team's analysis focused on ways to prevent the theft of model weights, the learnable parameters of AI models. An AI model's weights represent the culmination of many costly prerequisites for training advanced AI models: significant investment in computing power, large amounts of training data, and years of research by top talent to optimize algorithms. If attackers have a model's weights, they have complete control over the model.
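To make concrete what is at stake, the following minimal sketch (illustrative only; it assumes PyTorch is installed, and the toy model and file name are hypothetical) shows that a model's weights are ordinary tensors that can be serialized to a single file. Whoever obtains that file, along with the model architecture, can reload the model and run it at will.

```python
# Illustrative sketch (hypothetical model and file name): a model's weights are
# ordinary tensors that can be serialized to a file; whoever holds that file can
# reload the model and run it freely.
import torch
import torch.nn as nn

# A toy stand-in for a frontier model; real frontier models have billions of
# parameters and weight files measured in hundreds of gigabytes.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
print(f"parameter count: {sum(p.numel() for p in model.parameters()):,}")

# "Stealing the weights" amounts to copying this serialized state.
torch.save(model.state_dict(), "model_weights.pt")

# An attacker with the file (and knowledge of the architecture) has full control:
stolen = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
stolen.load_state_dict(torch.load("model_weights.pt"))
output = stolen(torch.randn(1, 512))  # run inference with the copied weights
```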

The research team analyzed a variety of written sources from the academic literature, commercial security reports, official government documents, media reports, and other online sources. They also conducted interviews with nearly three dozen experts, including national security government personnel specializing in information security, prominent information security industry experts, senior information security staff from frontier AI companies, other senior staff from frontier AI companies, independent AI experts with prior experience at frontier AI organizations, and insider threat experts.

What Are the Potential Avenues of Attack?

The research team identified 38 attack vectors that potential attackers could use to steal model weights. The vectors are not merely theoretical: The vast majority have already been used. Table 1 profiles five common attack vectors and gives examples of their use. The examples illustrate the span of attack capabilities — from incredibly common and easy, such as placing USBs in parking lots, to the most sophisticated attacks, such as development of game-changing cryptanalytic tools that only the most capable actors can achieve.

These examples offer an intuitive profile of the capabilities of different types of actors. The full report provides detailed attack descriptions and hundreds of examples.

Table 1. Five Common Attack Vectors

Type of attack: Social engineering
Example approach: Phishing. An attacker can trick a legitimate user into running malicious code or inadvertently sharing their authentication credentials, overcoming even multifactor authentication.
Real-world instance: Phishing is responsible for an estimated annual global loss of $6.9 billion.[2] Advanced phishing is provided as a service for only $400 per month.[3]

Type of attack: Malicious placement of portable devices
Example approach: Placement of "dropped" USBs. Hackers can "drop" USB devices in the parking lots of organizations of interest. Eventually, an employee plugs the USB into their work computer to find out who dropped it, allowing the hacker to execute code on an internal network.
Real-world instance: One can even buy USB cables for $180 that provide remote control over a computer.[4] Multiple U.S. nuclear facilities have been successfully infected with malware using "dropped" USBs.[5]

Type of attack: AI-specific attack vectors
Example approach: AI-specific infrastructure tends to have sprawling dependencies and moves faster than most other software, leaving it even more vulnerable to supply chain attacks. Attackers can intentionally introduce vulnerabilities into open-source packages that are used by common machine learning infrastructure.
Real-world instance: PyTorch, arguably the most common machine learning development framework in the world, was compromised by a malicious open-source software package it imported.[6]

Type of attack: Unauthorized physical access to systems
Example approach: Particularly capable attackers can not only break into a device's location but also penetrate various types of hardware security. One approach is voltage glitching: changing the voltage provided to a chip so that it malfunctions and releases information meant to be secure.
Real-world instance: Chips by AMD, the producer of about 23 percent of all desktop and server chips globally, were found to be vulnerable to a voltage glitching attack.[7]

Type of attack: Undermining the access control system itself
Example approach: Extremely capable actors can find and exploit vulnerabilities in the core cryptographic building blocks underlying ubiquitously used encryption, authentication, and access control systems. Exploiting these vulnerabilities would undermine many of the assumptions made by most security systems.
Real-world instance: The method of differential cryptanalysis, described by Eli Biham and Adi Shamir in 1991,[8] undermined vast swathes of encryption and authentication systems (most systems that were not fine-tuned specifically to prevent it). It was later revealed that IBM had discovered this type of attack as early as 1974, only to find out that the National Security Agency was aware of it before then.[9]

Because attack vectors are so diverse and numerous, defenses need to be varied and comprehensive; achieving strong security against a specific category of attack does not protect an organization from others.

In addition, publicly known examples of attacks are only a subset of actual attacks. In the research team's interviews, many national security experts noted that the vast majority of highly resourced state actor attacks they are aware of were never publicly revealed.

What Are the Security Needs of Different AI Systems?

To facilitate more-nuanced discourse on the security needs of different AI systems, Securing AI Model Weights proposes five security levels (SLs), broadly defined as the level of security required to prevent increasingly capable operations:

  • SL 1 can protect against amateur attempts: hobbyist hackers and untargeted "spray and pray" attacks.
  • SL 2 can likely hinder opportunistic efforts by professionals: both individual professional hackers and groups executing untargeted or lower-priority attacks.
  • SL 3 provides protection against cybercrime syndicates and insider threats. This includes world-renowned criminal hacker groups, well-resourced terrorist organizations, and disgruntled employees.
  • SL 4 can foil standard operations by leading cyber-capable institutions: for example, many leading state-sponsored groups and intelligence agencies.
  • SL 5 can spoil the least common but the most dangerous attacks: top-priority operations conducted by the world's most capable nation-states.

Figure. A pyramid of the five security levels: SL 1 can likely thwart most amateur attempts; SL 2, most professional opportunistic efforts; SL 3, most cybercrime syndicates and insider threats; SL 4, most standard operations by leading cyber-capable institutions; and SL 5, most top-priority operations by the top cyber-capable institutions.

How Can AI Organizations Implement Security Measures Proportional to Risk?

In iterative consultation with experts, the team defined a benchmark system for each SL. These benchmarks offer a rough tool for calibrating the relationship between the security measures an organization implements and the security outcomes it can expect. Each benchmark suggests concrete measures and policies estimated to constitute the minimum requirements of a system that meets the goals of its security level. The benchmarks are neither a complete standard nor a compliance regime; they offer organizations concrete suggestions for next steps.

Multiple AI labs estimated that, if the effort were prioritized, it could take about one year to reach SL 3; two to three years to achieve SL 4; and at least five years, plus the support of the national security community, to achieve SL 5. Table 2 provides a few selected comments from the discussion of the SLs in the full report.

Table 2. Security Level Benchmarks

SL 1: At this level, organizations should rely on existing security products and best practices rather than trying to develop proprietary solutions. SL 1 provides reliable security against only the most trivial attackers.

SL 2: The most important concern at this level is implementing the fundamentals comprehensively, ensuring that no "blind spots" are left unaddressed. Prioritizing the most common attack vectors is key: for example, ensuring that email security, password policies, and multifactor authentication are enforced correctly.

SL 3: At this level, a key goal is to reduce the risks from insider threats (e.g., company employees), thus simultaneously reducing the risk from attackers who masquerade as insiders or gain illegitimate access to employees' digital devices. Mitigating these risks includes reducing the number of people authorized to access the model's weights, hardening the interfaces to them, and implementing defense-in-depth. The benchmark includes monitoring and securing the full supply chain: software, hardware, even air conditioners.

SL 4: The remaining security-critical surface can be comprehensively hardened, reviewed, monitored, and penetration-tested. This requires significant compromises on productivity, convenience, and efficiency. Confidential computing should be implemented to protect the weights in use. Because state actors have extensive capabilities, the security team must have specific experience dealing with such actors.

SL 5: Except for production use, weights are stored in a completely isolated setup disconnected from the external world, with extremely stringent policies on data transfer that prevent even those with approved access from taking large amounts of data out of the room. More research and development is needed to enable organizations to support production models while meeting SL 5 security requirements; achieving SL 5 is currently not possible.

Securing AI Model Weights describes 167 recommended security measures that make up the security benchmarks. This brief provides two examples — a small sample of the many important and feasible actions that organizations can take to protect their model weights.

Hardening Interfaces for Weight Access

In many leading labs, hundreds or thousands of individuals have full "read" access to frontier model weights. Any one of those individuals can make a copy of the weights, which they could sell or disseminate. Generally, these individuals need to use the weights for their work — but the vast majority do not need the ability to copy them. This recommended security measure ensures that authorized users interact with the weights through a software interface that reduces the risk of the weights being illegitimately copied.

Combining three simple types of access could accommodate varied types of employee access while significantly reducing exfiltration risk:

  1. Use of predefined code reviewed and vetted by the security team. This is the most logical choice for inference interfaces (both internally and for public application programming interfaces [APIs]).
  2. More flexible access (including execution of custom code) on a server with rate-limited outputs, so that exfiltrating a significant portion of the weights would take too long to be practical. This could be used for most research and development use cases; a rough sketch of such an output limit follows this list.
  3. Direct work (without constraints on code or output rates) on an air-gapped isolated computer. This may be useful for rare instances where complete flexibility is needed, possibly when conducting interpretability research directly on frontier models.
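As a rough illustration of option 2, the sketch below (hypothetical names and byte budget, not drawn from the report) shows a server-side gate that caps how many output bytes each user can receive per rolling 24-hour window. With such a cap, copying weights that occupy hundreds of gigabytes through the interface would require years of sustained, detectable abuse.

```python
# Minimal, hypothetical sketch: a server-side gate that limits the volume of
# output each user can receive, making exfiltration of weights through the
# interface impractically slow.
import time
from collections import defaultdict

class RateLimitedOutputGate:
    """Caps the number of output bytes each user can receive in a rolling window."""

    def __init__(self, max_bytes_per_day: int = 50_000_000):  # hypothetical ~50 MB/day budget
        self.max_bytes = max_bytes_per_day
        self.window_seconds = 24 * 60 * 60
        self.usage = defaultdict(list)  # user_id -> list of (timestamp, byte_count)

    def _used_bytes(self, user_id: str, now: float) -> int:
        # Keep only entries inside the rolling window, then sum them.
        recent = [(t, n) for t, n in self.usage[user_id] if now - t < self.window_seconds]
        self.usage[user_id] = recent
        return sum(n for _, n in recent)

    def release(self, user_id: str, payload: bytes) -> bytes:
        """Return the payload only if the user's remaining output budget allows it."""
        now = time.time()
        if self._used_bytes(user_id, now) + len(payload) > self.max_bytes:
            raise PermissionError("Daily output budget exceeded; request flagged for review.")
        self.usage[user_id].append((now, len(payload)))
        return payload

# Example: wrap every response before it leaves the weight-holding environment.
gate = RateLimitedOutputGate()
response = gate.release("researcher_42", b'{"logits": "..."}')
```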

Confidential Computing

Even if model weights (and other sensitive data) are encrypted in transport and in storage, they are decrypted and vulnerable to theft during their use. Many employees, as well as cyber attackers with a minimally persistent presence, can steal the weights once they are decrypted ahead of their intended use. Confidential computing is a technique for ensuring that data remain secure, including during use, by decrypting the data only within a hardware-based trusted execution environment (TEE) that will not run insecure code. Implementing confidential computing to secure AI model weights could significantly reduce the likelihood of the weights being stolen.

However, confidential computing needs to be implemented correctly (a simplified, illustrative flow is sketched after this list):

  • The TEE must include protections against physical attacks (current implementations of confidential computing in graphics processing units [GPUs] do not).
  • Model weights must be encrypted by a key generated within the TEE and stored within it.
  • The TEE will run only prespecified and audited signed code. That code decrypts the weights, runs inference, and outputs only the model response. The code cannot output weights, the weight encryption key, or any information directly outputted by the model.
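The sketch below is a simplified, software-only illustration of those three requirements; it is not a real TEE. The cryptography package's Fernet cipher (assumed available) stands in for hardware-backed key generation and sealing, and a stub stands in for the audited inference code.

```python
# Illustrative stand-in for the confidential-computing flow described above.
# Real deployments use vendor TEE hardware and remote attestation, not this class.
from cryptography.fernet import Fernet

class SimulatedTEE:
    """Software stand-in for a TEE: the key never leaves this object, only
    allow-listed code IDs run, and only model responses are returned."""

    def __init__(self):
        self._key = Fernet.generate_key()        # stand-in for in-TEE key generation
        self._cipher = Fernet(self._key)
        self._weights = None
        self._code_allowlist = {"audited_inference_v1"}  # IDs of reviewed, signed code

    def seal_weights(self, plaintext_weights: bytes) -> bytes:
        """Encrypt weights under the TEE-held key; only ciphertext exists outside."""
        return self._cipher.encrypt(plaintext_weights)

    def load_weights(self, ciphertext: bytes) -> None:
        # Decryption happens only inside the boundary; plaintext is never returned.
        self._weights = self._cipher.decrypt(ciphertext)

    def run(self, code_id: str, prompt: str) -> str:
        if code_id not in self._code_allowlist:
            raise PermissionError("Unsigned or unaudited code is refused.")
        # Audited code may read the weights but can output only a model response;
        # no code path returns the weights or the key.
        return f"model response to: {prompt}"    # stub for real inference

# Example flow: seal, load, and serve inference without exposing plaintext weights.
tee = SimulatedTEE()
sealed = tee.seal_weights(b"frontier-model-weights")
tee.load_weights(sealed)
print(tee.run("audited_inference_v1", "Hello"))
```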

The use of confidential computing in GPUs is still nascent and may not be production-ready for frontier systems. However, there is an overwhelming consensus among experts regarding its importance, and it is expected to be deployed shortly.

How Can Securing AI Model Weights Be Used by Stakeholders?

There is an ongoing, lively debate regarding the extent to which different models need to be secured (if at all). The research team's goal was to improve the ability to secure whichever frontier AI models are deemed worth securing at the desired SL by systemizing knowledge about which security postures achieve various desirable security outcomes, specifically in the context of securing AI systems and their model weights. Securing AI Model Weights supports informed decisionmaking in both the private and public sectors:

  • Executives and security teams at leading AI companies can explore the attack vectors profiled in the report and the many documented instances of their use. Many of the security experts the research team interviewed were familiar with some attack vectors but unaware of or skeptical about others; this knowledge gap can leave their systems vulnerable.
  • AI companies should also compare their existing security posture to the five security benchmarks. By identifying which benchmark is closest to their current state, they can better understand which actors they are likely secure against and, more importantly, which actors remain threats. The benchmarks are not prescriptive, and their details will evolve, but they provide a useful calibration tool.
  • Companies can use the security benchmarks to identify next steps in improving their security posture. If a company is missing specific security measures needed to meet a benchmark (or alternatives that make more sense with its infrastructure), it should focus on those first. Once it has achieved a benchmark, it can look to the next one for further recommendations and next steps.
  • The security benchmarks could also be the foundation for standards and regulation, giving regulators and executives a bedrock of concepts and measures to assess what level of security companies have achieved or have plans to reach.

Notes

  • [1] Sella Nevo, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, and Jeff Alstott, Securing AI Model Weights: Preventing Theft and Misuse of Frontier Models, RAND Corporation, RR-A2849-1, 2024.
  • [2] Internet Crime Complaint Center, Internet Crime Report 2021, Federal Bureau of Investigation, 2021.
  • [3] "Cybercriminals Increasingly Using EvilProxy Phishing Kit to Target Executives," Hacker News, August 10, 2023; Hak5, "O.MG Cable," webpage, undated, https://shop.hak5.org/products/omg-cable.
  • [4] Sean Gallagher, "Playing NSA, Hardware Hackers Build USB Cable That Can Attack," Ars Technica, January 20, 2015.
  • [5] Salih Bıçakcı, "Introduction to Cyber Security for Nuclear Facilities," in Sinan Ülgen and Grace Kim, eds., A Primer on Cyber Security in Turkey and the Case of Nuclear Power, Edam Centre for Economic and Foreign Policy Studies, 2015, p. 69.
  • [6] MITRE, "Compromised PyTorch Dependency Chain," incident date of December 25, 2022.
  • [7] Thomas Claburn, "Re-Volting: AMD Secure Encrypted Virtualization Undone by Electrical Attack," The Register, August 13, 2021.
  • [8] Eli Biham and Adi Shamir, "Differential Cryptanalysis of DES-Like Cryptosystems," Journal of Cryptology, Vol. 4, January 1991.
  • [9] "The NSA's Work to Make Crypto Worse and Better," Ars Technica, September 6, 2013.

Document Details

Citation

RAND Style Manual

Nevo, Sella, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, and Jeff Alstott, A Playbook for Securing AI Model Weights, RAND Corporation, RB-A2849-1, 2024. As of April 30, 2025: https://www.rand.org/pubs/research_briefs/RBA2849-1.html

Chicago Manual of Style

Nevo, Sella, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, and Jeff Alstott, A Playbook for Securing AI Model Weights. Santa Monica, CA: RAND Corporation, 2024. https://www.rand.org/pubs/research_briefs/RBA2849-1.html.


This publication is part of the RAND research brief series. Research briefs present policy-oriented summaries of individual published, peer-reviewed documents or of a body of published work.

This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to this product page is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial purposes. For information on reprint and reuse permissions, please visit www.rand.org/pubs/permissions.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.