Methods for Synthetic Data Generation

Joshua Snoke, Satkartar K. Kinney

Research
Posted on rand.org Dec 13, 2024
Published in: Handbook of Sharing Confidential Data: Differential Privacy, Secure Multiparty Computation, and Synthetic Data, Chapter 15, pages 179-193 (2024). DOI: 10.1201/9781003185284

To understand the methods for generating synthetic data, it is important to start with the basis from which these methods arose. Viewed purely in terms of the statistical methods invoked, synthetic data generation looks similar to other statistical methods, such as missing data imputation or micro-simulation. Synthetic data generation has indeed borrowed significantly from work in other domains, but the methods diverge because the goals of missing data imputation and micro-simulation differ from those of synthetic data. Two fundamental questions have guided the development of synthetic data models, and they highlight the differences between these methods and other approaches. The first, and perhaps most obvious, is the question of privacy or disclosure risk. Synthetic data was developed as a way to give researchers access to microdata while minimizing the risk of disclosure from releasing that data. It was proposed as an alternative to prior approaches, such as micro-suppression, data swapping, and data reduction methods such as coarsening and top- or bottom-coding. Conceptually, it deviated from those approaches: rather than starting from the entire confidential sample and attempting to preserve as much of the original data records as possible, synthetic data started from the sample parameters of the data, according to some assumed data-generating process, and drew entirely new records from a model using these sample parameters. A model and a data-generating process are therefore essential to synthetic data in a way that they are not for other statistical disclosure control methods. This is true even when only parts of the original records are replaced with synthetic values, as discussed in Section 11.2.1.
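The idea of drawing entirely new records from a model fit to sample parameters can be illustrated with a minimal sketch. It assumes, purely for illustration, that the data-generating process is multivariate normal; the function name `synthesize_gaussian`, the toy "confidential" sample, and all numeric values are hypothetical and are not taken from the chapter. Real synthesizers typically use richer models and a formal assessment of disclosure risk.

```python
import numpy as np


def synthesize_gaussian(confidential, n_synthetic, seed=None):
    """Draw new records from a multivariate normal fit to the sample.

    Assumption: the data-generating process is multivariate normal.
    Only the sample mean and covariance are carried over from the
    confidential data; no original record is copied.
    """
    rng = np.random.default_rng(seed)
    mean = confidential.mean(axis=0)          # sample mean vector
    cov = np.cov(confidential, rowvar=False)  # sample covariance matrix
    return rng.multivariate_normal(mean, cov, size=n_synthetic)


# Toy stand-in for a confidential sample (two numeric variables).
toy_rng = np.random.default_rng(0)
confidential = toy_rng.multivariate_normal(
    mean=[50.0, 30.0],
    cov=[[100.0, 40.0], [40.0, 64.0]],
    size=1000,
)

synthetic = synthesize_gaussian(confidential, n_synthetic=1000, seed=1)

# Compare summary statistics of the two data sets.
print("confidential mean:", confidential.mean(axis=0))
print("synthetic mean:   ", synthetic.mean(axis=0))
```

The synthetic records share approximately the same means and covariances as the confidential sample, yet each one is a fresh draw from the fitted model rather than a perturbed copy of an original record. That contrast with swapping or coarsening is the conceptual difference the passage describes.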

Document Details

  • Publisher: CRC Press
  • Availability: Non-RAND
  • Year: 2024
  • Pages: 15
  • Document Number: EP-70770

This publication is part of the RAND external publication series. Many RAND studies are published in peer-reviewed scholarly journals, as chapters in commercial books, or as documents published by other organizations.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.