Some Challenges in Confidentiality
Miron L. Straf
The National Academies
The digital age has spawned the rapid development of massive data
sets, including large databases for tracking individuals over time
and for remote sensing of the environment and of individual behavior.
The data revolution extends as well to the very essence of human beings:
their genes, their DNA, and their activities in the brain that govern
thoughts, emotions, and all human functioning. Moreover, advanced
data mining and visualization techniques enable the recognition of
patterns involving many variables at once.
The future will bring many other advances, such as computer-based
time-use measures of location and activity, perhaps even of emotions
or thoughts; geographic mapping of variables to the specific location,
without regard to boundaries or how units are aggregated; and deciphering
meaning from language in textual form. Our ability to develop data
through new measurements and observations and to analyze them with
new methods and models has never been greater. And prospects for far
greater advances abound.
At the same time, our political and other social institutions have
become voracious in their needs for data to inform, implement, and
evaluate policies and practices, activities that often require
an understanding of highly complex human behavior. For example, the
genomics revolution promises to be a watershed in health and medicine,
making health care “predictive, personalized, and pre-emptive.”
Within this next decade, it may be possible to determine an individual’s
genetic sequence overnight and at a reasonable cost.
The data revolution could provide for the many needs for data from
personal information, but it must do so responsibly by addressing
concerns about privacy and confidentiality. How issues of privacy
and confidentiality should be addressed in an era of burgeoning massive
databases of personal information is the Gordian knot of the digital
age.
Principles of data stewardship
The research community and government data providers could better
inform policies and practices with the development of an authoritative
set of principles for privacy and confidentiality of personal information
together with an exposition of the fundamental reasons underlying
them and guidance on their application. Such a statement of principles
could serve privacy and confidentiality as the Belmont Report has
served research involving human subjects.
Actions by data providers
Government and other data providers should further expand the availability
of information derived from confidential data, including, through
protected enclaves, the confidential data themselves. Three actions
by government data providers would advance this purpose:
1. Seek legislation and regulations to place the onus of responsibility
on users to avoid identifying any individual. Penalties could be provided
by law, as they are for the National Center for Education Statistics.
With such a law, government statistics agencies could make more data
available with greater detail and in more easily accessible and more
usable ways without incurring increased risks. The protection would
automatically go with the data, from producer to user or from one
user to another.
2. Take a risk-benefit approach, not a zero-tolerance one. Many
people believe that we can prevent individuals from being identified
through misuse of statistical data by setting a standard of zero tolerance
for such a possibility. Although the concept of zero tolerance applies
to disclosure of personal information, it does not apply to statistical
data derived from such information. For such data, the risk that some
individual may be identified from them must be minimized. The standard
for doing so should be to exercise reasonable care. The risks need
to be weighed against the benefits of research.
3. Inform the public and engage it in a meaningful dialogue on privacy
and confidentiality. For example, a promise of confidentiality in
a survey may not motivate some people to respond if they perceive
the survey as unnecessarily intrusive. Other factors pertaining to
trust may also affect responses. Trying to control the risk of a breach
in confidentiality without addressing these factors may be ineffective and
could be counterproductive in the long term.
Actions by researchers
Researchers also have important roles and responsibilities in working
with government data providers, many of whom are also researchers,
to protect confidentiality. Three important actions for researchers
are the following:
1. Assess the risks pertaining to privacy and confidentiality. For
example, what are the risks when common identifiers, such as those
specified by the HIPAA privacy rules, are deleted? How can these and
other risks be balanced against the benefits of research?
2. Provide for the training of researchers on the needs for and
means of protecting confidentiality. Such training should be part
of how students learn about the ethical conduct of research. The
America COMPETES Act, recently signed into law, requires an institution
applying for NSF support to “describe in its grant proposal
a plan to provide appropriate training and oversight in the responsible
and ethical conduct of research.”
3. Continue to advance research on disclosure limitation. In this
regard, a better understanding of the properties and uses of so-called
“synthetic data” appears to be most fruitful. The recently
launched Virtual Research Data Center (VRDC) under John Abowd at Cornell
provides many opportunities to study the properties of synthetic
data. The Center develops, upon request, a synthetic public use microdata
file with data from the Survey of Income and Program Participation
linked to Social Security earnings.
The validity of such data is a major issue. Comparisons of synthetic
to actual data have been made on univariate distributions, on correlations,
and on regression coefficients. But what about the use of synthetic
data to determine outliers or multifactor interactions? In particular,
how well do synthetic data serve in model selection? If an algorithm
to select variables to use in a model can be specified, the VRDC can
generate the results for synthetic files and compare them to results
for the actual confidential data. Model validation is especially important
for transportation planning, done by local jurisdictions using over
400 models. Synthetic data files could be produced as samples on which
to validate these models.
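The kinds of comparison mentioned above (univariate distributions,
correlations, and regression coefficients) can be illustrated with a
minimal sketch. It is not the VRDC's procedure. The function names, the
simulated data, and the choice of the last column as the regression
outcome are all assumptions made for illustration, and a second simulated
sample stands in for a real synthetic file.

```python
import numpy as np


def fit_ols(X, y):
    """Return ordinary least squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def compare_files(actual, synthetic):
    """Compare two (n, k) arrays; the last column is treated as the outcome.

    Returns the largest absolute gap in column means, in pairwise
    correlations, and in regression coefficients (hypothetical summary
    measures chosen for this sketch).
    """
    mean_gap = np.abs(actual.mean(axis=0) - synthetic.mean(axis=0)).max()
    corr_gap = np.abs(
        np.corrcoef(actual, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    ).max()
    coef_gap = np.abs(
        fit_ols(actual[:, :-1], actual[:, -1])
        - fit_ols(synthetic[:, :-1], synthetic[:, -1])
    ).max()
    return {
        "mean_gap": float(mean_gap),
        "corr_gap": float(corr_gap),
        "coef_gap": float(coef_gap),
    }


def simulate(rng, n):
    """Draw two predictors and an outcome from a simple linear model."""
    x = rng.normal(size=(n, 2))
    y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.5, size=n)
    return np.column_stack([x, y])


rng = np.random.default_rng(0)
actual = simulate(rng, 5000)
synthetic = simulate(rng, 5000)  # stand-in for a synthetic file (assumption)
gaps = compare_files(actual, synthetic)
print(gaps)
```

Small gaps on these three measures would not settle the harder questions
raised above. Agreement on means, correlations, and coefficients says
little about outliers, multifactor interactions, or model selection.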
17 August 2007