Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


Some Challenges in Confidentiality

Miron L. Straf
The National Academies

The digital age has spawned the rapid development of massive data sets, including large databases for tracking individuals over time and for remote sensing of the environment and of individual behavior. The data revolution extends as well to the very essence of human beings: their genes, their DNA, and their activities in the brain that govern thoughts, emotions, and all human functioning. Moreover, advanced data mining and visualization techniques enable the recognition of patterns involving many variables at once.

The future will bring many other advances, such as computer-based time-use measures of location and activity, perhaps even of emotions or thoughts; geographic mapping of variables to the specific location, without regard to boundaries or how units are aggregated; and deciphering meaning from language in textual form. Our ability to develop data through new measurements and observations and to analyze them with new methods and models has never been greater. And prospects for far greater advances abound.

At the same time, our political and other social institutions have become voracious in their needs for data to inform, implement, and evaluate policies and practices, activities that often require an understanding of highly complex human behavior. For example, the genomics revolution promises to be a watershed in health and medicine, making health care “predictive, personalized, and pre-emptive.” Within this next decade, it may be possible to determine an individual’s genetic sequence overnight and at a reasonable cost.

The data revolution could provide for the many needs for data from personal information, but it must do so responsibly by addressing concerns about privacy and confidentiality. How issues of privacy and confidentiality should be addressed in an era of burgeoning massive databases of personal information is the Gordian knot of the digital age.

Principles of data stewardship

The research community and government data providers could better inform policies and practices by developing an authoritative set of principles for privacy and confidentiality of personal information, together with an exposition of the fundamental reasons underlying them and guidance on their application. Such a statement of principles could serve privacy and confidentiality as the Belmont Report has served research involving human subjects.

Actions by data providers

Government and other data providers should further expand the availability of information derived from confidential data, including, through protected enclaves, the confidential data themselves. Three actions by government data providers would advance this purpose:

1. Seek legislation and regulations to place the onus of responsibility on users to avoid identifying any individual. Penalties could be provided by law, as they are for the National Center for Education Statistics. With such a law, government statistics agencies could make more data available with greater detail and in more easily accessible and more usable ways without incurring increased risks. The protection would automatically go with the data, from producer to user or from one user to another.

2. Take a risk-benefit approach, not a zero-tolerance one. Many people believe that we can prevent individuals from being identified through misuse of statistical data by setting a standard of zero tolerance for such a possibility. Although the concept of zero tolerance applies to disclosure of personal information, it does not apply to statistical data derived from such information. For such data, the risk that some individual may be identified from them must be minimized. The standard for doing so should be to exercise reasonable care. The risks need to be weighed, but in comparison with the benefits of research.

3. Inform the public and engage it in a meaningful dialogue on privacy and confidentiality. For example, a promise of confidentiality in a survey may not motivate some people to respond if they perceive the survey as unnecessarily intrusive. Other factors pertaining to trust may also affect responses. Trying to control the risk of a breach in confidentiality without addressing these factors may be ineffective and could be counterproductive in the long term.

Actions by researchers

Researchers also have important roles and responsibilities in working with government data providers, many of whom are also researchers, to protect confidentiality. Three important actions for researchers are the following:

1. Assess the risks pertaining to privacy and confidentiality. For example, what are the risks when common identifiers, such as those specified by the HIPAA privacy rules, are deleted? How can these and other risks be balanced against the benefits of research?

2. Provide for the training of researchers on the needs for and means of protecting confidentiality. Such training should be part of how students learn about the ethical conduct of research. The America COMPETES Act, recently signed into law, requires an institution applying for NSF support to “describe in its grant proposal a plan to provide appropriate training and oversight in the responsible and ethical conduct of research.”

3. Continue to advance research on disclosure limitation. In this regard, a better understanding of the properties and uses of so-called “synthetic data” appears to be most fruitful. The recently launched Virtual Research Data Center (VRDC) under John Abowd at Cornell provides many opportunities to study the properties of synthetic data. The Center develops, upon request, a synthetic public use microdata file with data from the Survey of Income and Program Participation linked to Social Security earnings.

The validity of such data is a major issue. Comparisons of synthetic to actual data have been made on univariate distributions, on correlations, and on regression coefficients. But what about the use of synthetic data to determine outliers or multifactor interactions? In particular, how well do synthetic data serve in model selection? If an algorithm to select variables to use in a model can be specified, the VRDC can generate the results for synthetic files and compare them to results for the actual confidential data. Model validation is especially important for transportation planning, for which local jurisdictions use more than 400 models. Synthetic data files could be produced as samples on which to validate these models.
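
As a rough illustration of the kinds of comparisons mentioned above (univariate summaries, correlations, and regression coefficients), the following Python sketch fits a simple linear model to a simulated "confidential" file, draws a synthetic file from the fitted model, and reports the same statistics for both. Everything here is an assumption made for illustration: the data are simulated, and the `synthesize` function is a hypothetical, deliberately simple parametric synthesizer, not the method used by the VRDC or any statistical agency.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ols(x, y):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def synthesize(x, y, n, rng):
    """Hypothetical synthesizer: draw n new records from a fitted normal linear model."""
    slope, intercept = ols(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mean_x, sd_x = statistics.fmean(x), statistics.stdev(x)
    sd_e = statistics.stdev(residuals)
    new_x = [rng.gauss(mean_x, sd_x) for _ in range(n)]
    new_y = [intercept + slope * a + rng.gauss(0, sd_e) for a in new_x]
    return new_x, new_y


def compare(actual, synthetic):
    """Return (actual, synthetic) pairs for a few summary statistics."""
    (ax, ay), (sx, sy) = actual, synthetic
    return {
        "mean_x": (statistics.fmean(ax), statistics.fmean(sx)),
        "mean_y": (statistics.fmean(ay), statistics.fmean(sy)),
        "correlation": (pearson(ax, ay), pearson(sx, sy)),
        "slope": (ols(ax, ay)[0], ols(sx, sy)[0]),
    }


# Simulated stand-in for a confidential file (illustrative values only).
rng = random.Random(2007)
x = [rng.gauss(50, 10) for _ in range(2000)]
y = [5 + 2 * a + rng.gauss(0, 8) for a in x]

report = compare((x, y), synthesize(x, y, 2000, rng))
for name, (actual_value, synthetic_value) in report.items():
    print(f"{name:12s} actual={actual_value:8.3f} synthetic={synthetic_value:8.3f}")
```

A synthesizer of this kind reproduces, by construction, the statistics of the model it was fitted with. It says nothing about outliers, interactions, or variables left out of that model, which is why the questions raised above about model selection require comparison against the actual confidential data.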

17 August 2007

 

Miron L. Straf

The National Academies

 

 

Biographical Data

 

Miron L. Straf

Miron L. Straf is the Deputy Director of the Division of Behavioral and Social Sciences and Education at the National Academies. For many years, he was Director of the Division's Committee on National Statistics.

He is recognized for his contributions to government statistics, in particular the federal statistical system, and the use of information for public policy decision making. He is the author of entries on these topics for The International Encyclopedia of the Social and Behavioral Sciences. At the Academies, he has developed over 50 major studies and over 40 conferences in the application of statistics to public policy. He was honored by the American Association for Public Opinion Research with its Innovators Award for his work on cognitive aspects of survey methodology, which had a "catalyzing effect on research on survey measurement."

He received his Ph.D. in statistics from the University of Chicago and has taught on the faculties of the University of California, Berkeley, and The London School of Economics and Political Science. He is active in many national and international associations and is a past president of the American Statistical Association.

17 August 2007