Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


Privacy has many meanings, and technology relevant to achieving one of these is not necessarily helpful for the others. Let us focus on the boundary between "Inside" (roughly, those who are supposed to have access to information) and "Outside" (everyone else).

When there is a clearly defined boundary between Inside and Outside, access control mechanisms and cryptographic techniques may suffice. For example, a psychiatrist may take notes during one session, and use these to refresh her memory before a later session. In this case the psychiatrist is Inside, and everyone else is Outside. Simple access control may suffice for the necessary protection.

Similarly, if Alice and her remotely located colleague Bob wish to converse privately in the presence of a wiretap, then Alice and Bob are the Insiders, the wiretapper is the Outsider, and any cryptosystem that is semantically secure against an eavesdropping attack solves the problem.

Now consider a statistical database. Here, information is collected by a trusted and trustworthy curator, whose goal is to release statistics of interest while protecting the privacy of the individual respondents. In this setting the curator and the respondents are the Insiders, and the consumers of the statistics (e.g., the public in the context of official statistics) are the Outsiders. However, the boundary is a bit murky, as the curator must give real information about the data set (else, what is the point of the database?). Indeed, it is provably impossible to guarantee the high quality of privacy enjoyed by Alice and Bob in their encrypted conversation.

Nonetheless, in such cases it is possible to give useful ad hoc and even ad omnia guarantees about privacy, sometimes with minimal distortion of statistics. This is a thriving area of research in the statistics, databases, and cryptography communities. We are excited about a new measure, "differential privacy," that quantifies the extent to which a privacy mechanism for responding to queries against a confidential database masks the presence or absence of individual respondents in the data set. (This is a conceptual shift from all previous work, which analyzes the difference between an adversarial user's prior and posterior views of a respondent; we now know that, at least theoretically, such a definition of privacy cannot be achieved.) The new notion is an ad omnia guarantee and it is useful: there are algorithmic techniques for ensuring differential privacy while carrying out many standard data mining tasks with excellent accuracy.
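To make the idea of masking the presence or absence of a single respondent concrete, here is a minimal sketch of one well-known technique from the differential privacy literature, the Laplace mechanism, applied to a counting query. The text above does not name a specific technique, so this choice is an illustration only; the function names, the record layout, and the parameter `epsilon` (the privacy parameter, where smaller values mean more noise and stronger masking) are assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Answer "how many records satisfy predicate?" with Laplace noise.

    Hypothetical helper for illustration. Adding or removing one
    respondent changes a count by at most 1, so the sensitivity is 1 and
    the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Made-up respondents; the "age" field is an assumption of this sketch.
    respondents = [{"age": age} for age in range(100)]
    noisy = private_count(respondents, lambda r: r["age"] >= 50, epsilon=0.5)
    print(f"noisy count of respondents aged 50 or over: {noisy:.2f}")
```

The noise depends only on `epsilon` and not on the size of the data set, so for a large database the relative distortion of a count can be small, which is in the spirit of the "minimal distortion" remark above.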

At the same time there are theoretical limits to what can be done in the context of private data analysis, to wit, theorems along the lines of "Any mechanism that permits too accurate answers to too many questions is blatantly non-private." Also, the algorithmic techniques mentioned above do not necessarily extend well to analysis of social networks and other graphs. So in this area there is much to be done, but there is a wealth of sound definitional and algorithmic techniques on which to draw.

We have seen, then, that privacy technology exists both when the boundary between Insiders and Outsiders is clear and when it is murkier but still nicely definable, e.g., through the definition of differential privacy. When the boundary between Inside and Outside is completely porous, there is as yet no general framework for approaching the problem. Examples in this category include:

1) Outsourcing of confidential data for processing, e.g., evaluation of insurance claims.

2) Testing medical database software: the testers cannot be given access to real patient data, yet they must test the software on realistic databases. How can we generate truly realistic data for testing purposes? When is a synthetic database sufficiently protective of the real data on which the model for its generation is based?

3) Bug reporting: this naturally releases information to the software vendor both about the application being run and possibly the data on which it is run.

4) Data mining for counterterrorism, data fusion.


Clearly, privacy preservation in these contexts is of great importance nationally and internationally. There is an urgent need for research into the questions of what can and cannot be achieved, algorithmic techniques for solving what can be solved, and social/policy mechanisms for addressing the problems with no technical solution.

Cynthia Dwork

Microsoft Research

Biographical Data

Cynthia Dwork has made fundamental contributions to complexity theory, distributed computing, and cryptography. Her current focus is the development of a mathematically rigorous framework for the privacy-preserving analysis of data. Together with several collaborators, she has articulated new and powerful privacy definitions, and applied them to data analysis problems ranging from reporting of tabular data to machine learning. This year her work on the inefficacy of anonymization in social network graphs received special recognition at the World Wide Web conference, and her 1984 work on fault-tolerant distributed computing received the ACM Edsger W. Dijkstra "test of time" award.