White Paper & Bio
Privacy has many meanings, and technology relevant to achieving
one of these is not necessarily helpful for the others. Let us focus
on the boundary between "Inside" -- roughly, those who are supposed
to have access to information -- and "Outside" -- everyone else.
When there is a clearly defined boundary between Inside and Outside,
access control mechanisms and cryptographic techniques may suffice.
For example, a psychiatrist may take notes during one session, and
use these to refresh her memory before a later session. In this case
the psychiatrist is Inside, and everyone else is Outside. Simple access
control may suffice for the necessary protection.
Similarly, if Alice and her remotely located colleague Bob wish to
converse privately in the presence of a wire tap, then Alice and Bob
are the Insiders, the wire tapper is the Outsider, and any cryptosystem
that is semantically secure against an eavesdropping attack solves
the problem.
Now consider a statistical database. Here, information is collected
by a trusted and trustworthy curator, whose goal is to release statistics
of interest while protecting the privacy of the individual respondents.
Here, the curator and the respondents are the Insiders, and the consumers
of the statistics (e.g., the public, in the context of official statistics)
are the Outsiders. However, the boundary is a bit murky, as the curator
must give real information about the data set (else, what is the point
of the database?). Indeed, it is provably impossible to guarantee
the high quality of privacy enjoyed by Alice and Bob in their encrypted
conversation.
Nonetheless, in such cases it is possible to give useful ad hoc and
even ad omnia guarantees about privacy, sometimes with minimal distortion
of statistics. This is a thriving area of research in the statistics,
databases, and cryptography communities. We are excited about a new
measure, "differential privacy," that quantifies the extent to which
a privacy mechanism for responding to queries against a confidential
database masks the presence or absence of individual respondents in
the data set. (This is a conceptual shift from all previous work,
which analyzes the difference between an adversarial user's prior
and posterior views of a respondent; we now know that at least theoretically
such a definition of privacy cannot be achieved.) The new notion is
an ad omnia guarantee and it is useful: there are algorithmic techniques
for ensuring differential privacy while carrying out many standard
datamining tasks with excellent accuracy.
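To make the idea concrete, here is a minimal sketch of one well-known way to answer a counting query with differential privacy: add Laplace noise scaled to the query's sensitivity. The text above does not name a particular mechanism, so the choice of the Laplace mechanism, the function names, the toy records, and the parameter epsilon are all assumptions made for illustration. In the standard formalization, a randomized mechanism M is epsilon-differentially private if, for any two data sets D and D' differing in one respondent and any set S of outputs, Pr[M(D) in S] <= exp(epsilon) * Pr[M(D') in S].

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records satisfying `predicate`.

    Adding or removing one respondent changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    (Hypothetical helper for illustration, not a library API.)
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Toy data: each record is (age, has_condition). Purely illustrative.
records = [(34, True), (51, False), (29, True), (62, True), (45, False)]
rng = random.Random()
noisy = private_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print(f"noisy count of respondents with the condition: {noisy:.2f}")
```

Smaller values of epsilon give stronger masking of any one respondent's presence at the cost of noisier answers. For counts over large populations the added noise is small relative to the statistic itself.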
At the same time there are theoretical limits to what can be done
in the context of private data analysis, to wit, theorems along the
lines of "Any mechanism that permits too accurate answers to too many
questions is blatantly non-private." Also, the algorithmic techniques
mentioned above do not necessarily extend well to analysis of social
networks and other graphs. So in this area there is much to be done,
but there is a wealth of sound definitional and algorithmic techniques
on which to draw.
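The flavor of these limits can be seen in a toy differencing attack. The sketch below is an illustration only, not the construction behind the theorems: it assumes a hypothetical curator that answers every prefix-count query with error at most 0.2, and shows that an Outsider can then recover every respondent's bit. All names and the error bound are assumptions made for the example.

```python
import random

def noisy_prefix_sum(secret_bits, i, max_error, rng):
    """Curator's answer to: how many of the first i respondents have the bit set?

    The answer is perturbed by bounded noise of magnitude at most max_error.
    """
    return sum(secret_bits[:i]) + rng.uniform(-max_error, max_error)

def reconstruct(n, answer):
    """Recover all n bits by differencing consecutive prefix-count answers.

    If each answer is off by less than 0.25, each difference is off by
    less than 0.5, so rounding yields the respondent's true bit.
    """
    guesses = []
    previous = 0.0  # the empty prefix has count 0
    for i in range(1, n + 1):
        current = answer(i)
        guesses.append(round(current - previous))
        previous = current
    return guesses

secret = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical sensitive bits
rng = random.Random()
recovered = reconstruct(len(secret), lambda i: noisy_prefix_sum(secret, i, 0.2, rng))
print("recovered every bit:", recovered == secret)
```

Here the answers are too accurate and the questions too many: n queries suffice to expose all n respondents. A mechanism must therefore add more noise, answer fewer queries, or both.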
We have seen, then, that there is privacy technology for the case
in which the boundary between Insiders and Outsiders is clear, and for
the case in which it is murkier but nicely definable, e.g., through the
Differential Privacy definition. When the boundary between Inside and Outside is
completely porous, there is as yet no general framework for approaching
the problem. Examples in this category include:
1) Outsourcing of confidential data for processing, e.g., evaluation
of insurance claims.
2) Testing medical database software: the testers cannot be given
access to real patient data, yet they must test the software on realistic
databases. How can we generate truly realistic data for testing purposes?
When is a synthetic database sufficiently protective of the real data
on which the model for its generation is based?
3) Bug reporting: this naturally releases information to the software
vendor both about the application being run and possibly the data
on which it is run.
4) Datamining for counterterrorism, data fusion.
Clearly, privacy preservation in these contexts is of great importance
nationally and internationally. There is an urgent need for research
into the questions of what can and cannot be achieved, algorithmic
techniques for solving what can be solved, and social/policy mechanisms
for addressing the problems with no technical solution.
Cynthia Dwork
Microsoft Research
Biographical Data
Cynthia Dwork has made fundamental contributions to complexity
theory, distributed computing, and cryptography. Her current focus
is the development of a mathematically rigorous framework for the
privacy-preserving analysis of data. Together with several collaborators,
she has articulated new and powerful privacy definitions, and applied
them to data analysis problems ranging from reporting of tabular
data to machine learning. This year her work on the inefficacy of
anonymization in social network graphs received special recognition
at the World Wide Web conference, and her 1984 work on fault-tolerant
distributed computing received the ACM Edsger W. Dijkstra "test
of time" award.