White Paper & Bio
[NB: The use of "privacy" in what follows may more accurately be
read as "confidentiality". The CS crowd often uses the former to reference
the issue that the Stats crowd references by the latter.]
Rather than attempt to terrify and alarm the reader by describing
scenes of impending privacy catastrophes, real or imagined, I try
to take a more optimistic forward-looking view. I think there are
a great many new opportunities for large scale analysis of information
that is currently under privacy lock-and-key. Health information,
educational performance, internet activity, and many other rich sources
of data are readily available for analysis, but are not broadly analyzed
out of respect for privacy. Developing tools that permit privacy-preserving
data analysis enables the analysis of new and fresh sources of potentially
highly sensitive data.
Of course, when analysis of the data is not a foregone conclusion,
the privacy guarantees must be strong and unambiguous. We should aim
to guard against all possible privacy "attacks", relying on no assumptions
about the data or prior knowledge of the adversaries. By so doing,
we can avoid many frustrating issues, including surprising attacks
resulting from future disclosures, ill-defined and unforeseeable privacy
needs, and negative interactions between privacy technologies (e.g.,
both k-anonymization and controlled tabular adjustment may make legitimate
claims of privacy, but taken together could expose the entire data
set).
One approach that makes no assumptions is "differential privacy",
which does not so much prevent disclosures, as prevent disclosures
from occurring as a result of a user's data. The definition requires
(in a formal sense) that the behavior of the computation be nearly
independent of each of its [many] inputs, the users' private data:
the presence or absence of any one user's data may not influence the
probability of any output by more than a limited multiplicative factor.
The distribution over outcomes is effectively identical to the distribution
that would result were any user to withdraw his or her data; any disclosure that is
unlikely to occur when the user's data is unavailable is nearly as
unlikely (increased by at most the same multiplicative factor) from
the output of the computation on the full data set.
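As a sketch, the requirement described above is commonly written as the inequality below. The symbols are notation assumed here for illustration and do not appear in the text above: M is the randomized computation, A and B are any two data sets differing in one user's data, S is any set of possible outputs, and ε is the privacy parameter.

```latex
% Assumed notation: M = randomized computation, A and B = data sets that
% differ in a single user's data, S = any set of outputs, \varepsilon = privacy parameter.
\Pr[M(A) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(B) \in S]
```

Here e^ε plays the role of the "limited multiplicative factor". Because the inequality must hold with A and B exchanged, no output becomes much more or much less likely when one user's data is added or withdrawn.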
Although each user can have limited influence on the outcome of the
computation, when taken together a large set of users can bias the
distribution substantially, each multiplicative factor resulting in
concentration that depends exponentially on the number of users. The
results, in many domains, are highly accurate answers that suffer only
mildly from the imposed haze of privacy (typically much less than
would result from sampling error). Most computations do not satisfy differential
privacy, naturally, and the research effort here is to extend the
set of problems that can be addressed. Presently, many statistical
quantities can be efficiently computed, clustering algorithms run
(k-means, spectral, association rules), auctions conducted, and numerous
other tasks performed. One line of research is to efficiently enlarge
the set of privacy-preserving computations we can do, so that non-specialists
can easily gain access to the tools, and the exciting sensitive data
that lies behind them.
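To make the "statistical quantities" case concrete, here is a minimal sketch of one standard way to release a count with differential privacy, by adding Laplace noise. The function name `noisy_count`, its parameters, and the synthetic ages are hypothetical and chosen only for illustration; the text above does not prescribe this particular mechanism, and the floating-point sampling here is not hardened for production use.

```python
import random


def noisy_count(records, predicate, epsilon, rng=random):
    """Return the number of records matching `predicate`, plus Laplace noise.

    Hypothetical helper for illustration. `epsilon` is the assumed privacy
    parameter: smaller values mean more noise and a stronger guarantee.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one user's record changes a count by at most 1,
    # so Laplace noise with scale 1/epsilon is enough for this query.
    # The difference of two independent exponentials with rate epsilon
    # is Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data standing in for a sensitive attribute.
    ages = [rng.randint(18, 90) for _ in range(100_000)]
    exact = sum(1 for a in ages if a >= 65)
    answer = noisy_count(ages, lambda a: a >= 65, epsilon=0.1, rng=rng)
    print(f"exact={exact} noisy={answer:.1f} error={abs(answer - exact):.1f}")
```

With `epsilon=0.1` the noise has scale 10, which is small next to a count in the tens of thousands; this is the sense in which the answers suffer only mildly from the haze of privacy.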
Frank McSherry
Microsoft Research
Biographical Data
Frank McSherry is a researcher at Microsoft Research's Silicon
Valley Campus, where he studies issues and algorithms related to
privacy-preserving data analysis, large-scale graph analysis, and
occasional other randomized algorithms. His work in privacy-preserving
data analysis has centered around defining, understanding, and advancing
"differential privacy", a privacy definition aimed at preventing
arbitrary *new* disclosures through randomized computation. Frank's
current interests include extending implications of differential
privacy to results in adjacent fields (so far: Game Theory and Machine
Learning), expanding the scope of algorithms that can be implemented
to satisfy differential privacy, and attempting to articulate privacy
guarantees intelligibly and meaningfully to non-experts.