|
White Paper & Bio
Data privacy has become a fundamental problem of the modern information
infrastructure. Increasing volumes of information are collected and
archived by health networks, financial organizations, search engines,
intrusion detection systems, social networking systems, retailers
and other enterprizes. The potential social benefits of analyzing
these databases are enormous. An important problem is to learn the
properties of the databases as a whole while protecting the privacy
of individual contributors. There is a vast body of work on the database
privacy problem, both in statistics and computer science. Until recently,
there were two nearly disjoint fields studying the database privacy
problem: ``statistical disclosure limitation'', initiated by the statistics
community in 1960s, and what is now called ``privacy-preserving data
mining'' which was active in the database community during the 1980's
and was rekindled at the turn of the 21st century by researchers in
data mining. In 2003, the database privacy problem caught attention
of cryptographers. The resulting line of work acquired the name ``private
data analysis''. Private data analysis advocates formal privacy guarantees.
Each proposed scheme for releasing information should come with a
clear statement of what it achieves not only in terms of utility,
but also in terms of privacy. Most currently utilized schemes either
have no formal privacy guarantees or ensure security only against
a specific suite of attacks. This leaves them potentially vulnerable
to unforeseen attacks, and makes it difficult to compare different
schemes because each of them is solving a different problem. For example,
one commonly overlooked issue is that people reading the released
statistics might have access to side information, that is, information
that comes from other sources. In the light of the ubiquity of cheap
collections of marketing data and other personal information, ignoring
this issue is dangerous. Another problem is that privacy guarantees
are most commonly stated as conditions on the output ( i.e., at least
3 individuals in the table should have the same attributes), and do
not take the procedures that were used to obtain the output into account.
For example, the fact the output is this particular L-diverse table,
and not another, might leak information. Private data analysis focuses
on pinning down precise mathematical definitions of database privacy,
that provide meaningful privacy in realistic settings. The notion
of differential privacy that emerged from this line of work makes
assumptions neither about what kind of attack might be perpetrated
against the released statistics nor about what side information the
attacker might possess. It limits the incremental information the
attacker might learn in addition to what he knew before seeing the
released statistics. Several works proposed frameworks for computing
and releasing information about a database while provably satisfying
differential privacy. One of the major challenges is to design automatable
protocols that release a variety of statics about a database while
provably preserving privacy. Such protocols would relieve the Census
Bureau from having to manually do a disclosure review on each product,
and enable organizations that collect enormous amounts of data, such
as search engines, to release global information about their data
without violating confidentiality. The first step towards that goal
is to give algorithmic formulations to as many statistical tasks as
possible. This workshop is a wonderful opportunity for researchers
from different communities to exchange ideas and work towards a common
understanding of the problem. Two challenges that I would like to
see addressed are meaningful definitions of privacy and automatable
protocols that satisfy them.
|
|
Sofya Raskhodnikova
Penn State
University
|
|
|
Biographical Data
Sofya Raskhodnikova
is an assistant professor of Computer Science and Engineering at
Penn State University. She received Ph.D. in Computer Science and
Electrical Engineering from MIT in 2003. Her main area of research
is randomized and approximation algorithms and, in particular, algorithms
for massive data sets. Together with Kobbi Nissim and Adam Smith,
she is a co-author of Smooth
Sensitivity and Sampling in Private Data Analysis, Proceedings
of the 39th ACM STOC, 75--84, 2007.
|
|