Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


[NB: The use of "privacy" in what follows may more accurately be read as "confidentiality". The CS crowd often uses the former to reference the issue that the Stats crowd references by the latter.]

Rather than attempt to terrify and alarm the reader by describing scenes of impending privacy catastrophes, real or imagined, I try to take a more optimistic, forward-looking view. I think there are a great many new opportunities for large-scale analysis of information that is currently under privacy lock-and-key. Health information, educational performance, internet activity, and many other rich sources of data are readily available for analysis, but are not broadly analyzed out of respect for privacy. Developing tools that permit privacy-preserving data analysis enables the analysis of new and fresh sources of potentially highly sensitive data.

Of course, when analysis of the data is not a foregone conclusion, the privacy guarantees must be strong and unambiguous. We should aim to guard against all possible privacy "attacks", relying on no assumptions about the data or the prior knowledge of the adversaries. By so doing, we can avoid many frustrating issues, including surprising attacks resulting from future disclosures, ill-defined and unforeseeable privacy needs, and negative interactions between privacy technologies (e.g., both k-anonymization and controlled tabular adjustment may make legitimate claims of privacy, but taken together could expose the entire data set).

One approach that makes no assumptions is "differential privacy", which does not so much prevent disclosures as prevent disclosures from occurring as a result of a user's data. The definition requires (in a formal sense) that the behavior of the computation be nearly independent of each of its [many] inputs, the users' private data: the presence or absence of any one user's data may not influence the probability of any output by more than a limited multiplicative factor. The distribution over outcomes is therefore nearly identical to the distribution that would result were any user to withdraw his or her data; any disclosure that is unlikely to occur when the user's data is unavailable is nearly as unlikely (its probability increased by at most the same multiplicative factor) from the output of the computation on the full data set.
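For readers who prefer symbols, the requirement above can be written as a single inequality. The notation here is an assumption introduced only for illustration and does not appear in the text: M is the randomized computation, D and D' are any two data sets that differ in one user's data, S is any set of possible outputs, and the parameter \varepsilon fixes the "limited multiplicative factor" as e^{\varepsilon}.

```latex
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
  \qquad \text{for all } S \text{ and all } D, D' \text{ differing in one user's data.}
\]
```

Because the roles of D and D' can be exchanged, the same bound holds in both directions: adding a user's data and withdrawing it are treated alike, and a small \varepsilon means the two output distributions are nearly indistinguishable.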

Although each user can have only limited influence on the outcome of the computation, when taken together a large set of users can bias the distribution substantially, each multiplicative factor resulting in concentration that depends exponentially on the number of users. The result, in many domains, is highly accurate answers that suffer only mildly from the imposed haze of privacy (typically much less than would result from sampling error). Most computations do not have differential privacy, naturally, and the research effort here is to extend the set of problems that can be addressed. Presently, many statistical quantities can be efficiently computed, clustering algorithms run (k-means, spectral, association rules), auctions conducted, and numerous other tasks performed. One line of research is to efficiently enlarge the set of privacy-preserving computations we can do, so that non-specialists can easily gain access to the tools, and the exciting sensitive data that lies behind them.
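As one illustration of a statistical quantity that can be computed this way, the sketch below answers a counting query by adding Laplace noise. The Laplace mechanism is a standard technique and is not named in the text above; it is swapped in here as an example. The function names (`laplace_noise`, `private_count`) and the parameter `epsilon` are hypothetical choices for this sketch. Adding or removing one user's record changes a count by at most 1, so noise with scale 1/epsilon should be enough to bound the change in any output's probability by a factor of e^epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution with mean 0 and the given scale.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Assumption for this sketch: each user contributes at most one record,
    so the true count changes by at most 1 when a user joins or withdraws.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Noise scale = (maximum change in the count) / epsilon = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical data: 10,000 users, roughly a third with some sensitive trait.
    data_rng = random.Random(1)
    users = [{"has_trait": data_rng.random() < 0.33} for _ in range(10_000)]
    noisy = private_count(users, lambda u: u["has_trait"], epsilon=0.1)
    print(f"noisy count: {noisy:.1f}")
```

The typical size of the noise is 1/epsilon and does not depend on the number of users, which is one way to see why answers over large data sets suffer only mildly from the added randomness.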

Frank McSherry

Microsoft Research

Biographical Data

Frank McSherry is a researcher at Microsoft Research's Silicon Valley Campus, where he studies issues and algorithms related to privacy-preserving data analysis, large-scale graph analysis, and occasional other randomized algorithms. His work in privacy-preserving data analysis has centered around defining, understanding, and advancing "differential privacy", a privacy definition aimed at preventing arbitrary *new* disclosures through randomized computation. Frank's current interests include extending implications of differential privacy to results in adjacent fields (so far: Game Theory and Machine Learning), expanding the scope of algorithms that can be implemented to satisfy differential privacy, and attempting to articulate privacy guarantees intelligibly and meaningfully to non-experts.