Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


Over the years, private and governmental organizations have recognized the immense value of collecting and studying large amounts of data. Protecting such information is complicated because these organizations seek to increase the value of their collections by aggregating resources, while still honoring the privacy and confidentiality constraints imposed by policy and law. Data privacy is a relatively young domain that sits at the confluence of statistics, computer science, and policy. The basic problem setting of my research can be defined as follows: person-specific data is collected and must be released in such a way that each person's privacy is protected. I investigate data privacy issues in databases of person-specific records from the perspective of statistical theory and data analysis.

After attending the 2003 Privacy in D.A.T.A. Workshop (organized by the Data Privacy Laboratory and the Aladdin Center in the School of Computer Science of Carnegie Mellon University), I started working towards the design of a formal framework that would make it possible to formulate and analyze privacy concerns in large databases of person-specific records, and to measure the value of proposed protection methods in terms of the degree of protection granted to the corresponding individuals. The essential elements of such a framework combine statistical and computational disclosure control concepts based on research conducted by Stephen Fienberg, Latanya Sweeney, and Cynthia Dwork.

At a more recent workshop in Bertinoro, Italy (organized by Carnegie Mellon University, Microsoft, and the National Institute of Statistical Sciences), researchers identified a core set of fundamental problems in privacy and confidentiality and agreed, for the most part, on a common language to describe these problems and their nuances. The discussion between statisticians and computer scientists has been engaging and has led to joint efforts such as the newly established Journal of Privacy and Confidentiality (http://jpc.cylab.cmu.edu/). However, how to measure the quality of proposed solutions remains an open point of discussion.

From a broader point of view, certain general principles are necessary, but remain undefined, to answer questions that involve both aggregate and individual perspectives, to deal with testability issues and, ultimately, to measure the progress of the field as a whole. As part of an ongoing collaboration with Brad Malin (Vanderbilt University), I have sought to identify the essential elements of an abstract theory of record linkage that provides a tool for working out the missing principles. Such elements bridge perspectives on privacy that consider either aggregate distributions or microdata. They also provide the technical tools necessary for keeping privacy at the forefront of data analysis, while informing the development of a normative theory of privacy.

Consider, for example, the notion of a "linking distribution" of an attribute. Linking distributions homogenize the treatment of categorical and numerical data types, and estimating them from real data brings the complexity of real world problems into the analysis. Another element of such a theory is an "algorithmic approach" for computing theoretical bounds on the privacy protection of any <database, protection method> pair, which leads to predictions about performance guarantees in real world situations. Given realistic scenarios encoded by our model and the metric for privacy protection, the algorithmic approach is key for introducing testability and measuring progress clearly in the privacy arena.

In particular, we propose an algorithmic solution to the problem of measuring the degree of privacy protection granted to individual records by an arbitrary method. To do so, we start from a description of the world in terms of privacy-relevant concepts that the community at large agrees upon, and we adopt the notion of re-identifiability. To these we add a greedy linkage algorithm and an original description of the re-identification process in terms of a few parameters that can be reliably estimated from data. With all the pieces in place, we are ready to show how the analysis of real data may be used to shed light on the theoretical re-identifiability of real world problems, how bounds can be computed for a given database, and how this work can be extended, and progress measured, in the future.
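To make the idea of measuring protection by attempted linkage concrete, the following is a minimal Python sketch, not the algorithm developed in this work. It assumes a hypothetical agreement-count score between records, a simple greedy one-to-one matching, and toy data invented purely for illustration; the linking distributions and estimated parameters mentioned above are not modeled. All function and variable names are assumptions.

```python
from itertools import product


def agreement_score(a, b):
    """Hypothetical score: number of attribute positions on which two records agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def greedy_link(released, identified, score=agreement_score):
    """Greedy one-to-one linkage: accept candidate pairs in order of decreasing score,
    skipping any pair whose released or identified record is already linked."""
    candidates = sorted(
        product(range(len(released)), range(len(identified))),
        key=lambda ij: score(released[ij[0]], identified[ij[1]]),
        reverse=True,
    )
    links, used_released, used_identified = {}, set(), set()
    for i, j in candidates:
        if i not in used_released and j not in used_identified:
            links[i] = j
            used_released.add(i)
            used_identified.add(j)
    return links


def reidentification_rate(links, truth):
    """Fraction of released records that the linkage maps to their true identity."""
    if not truth:
        return 0.0
    correct = sum(1 for i, j in links.items() if truth.get(i) == j)
    return correct / len(truth)


# Toy identified population: (ZIP prefix, birth year, sex). Invented values.
identified = [("100", 1970, "M"), ("152", 1970, "F"), ("152", 1982, "M")]

# Released records without protection, and the true released -> identified mapping.
released_raw = [("152", 1970, "F"), ("152", 1982, "M"), ("100", 1970, "M")]
truth = {0: 1, 1: 2, 2: 0}

# A hypothetical protection method: suppress ZIP prefix and birth year.
released_protected = [(None, None, sex) for (_, _, sex) in released_raw]

rate_raw = reidentification_rate(greedy_link(released_raw, identified), truth)
rate_protected = reidentification_rate(
    greedy_link(released_protected, identified), truth
)
print(rate_raw, rate_protected)
```

Comparing the two rates for a fixed database gives one empirical reading of how much protection a method grants. Under suppression, ties between equally scored pairs are broken by candidate order, so the protected rate in this toy example depends on that ordering; a more careful treatment would average over tie-breaking choices or bound the rate analytically.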

Edo Airoldi

Princeton

http://www.genomics.princeton.edu/~eairoldi/

Biographical Data

Edo Airoldi received a PhD in Computer Science from Carnegie Mellon University. He is currently a postdoctoral fellow at Princeton University, affiliated with the Department of Computer Science and the Lewis-Sigler Institute for Integrative Genomics. His research interests include statistical theory and methodology, Bayesian approaches to data analysis, and random graph theory, with applications to problems in the social and biological sciences.

As part of an ongoing collaboration with Brad Malin (Vanderbilt University), he investigates data linkage and associated privacy issues that arise in various distributed and biomedical environments. Publications that have derived from this research include:

"Confidentiality preserving audits of electronic medical record access," in Proceedings of the 12th World Congress on Health (Medical) Informatics - Medinfo 2007. Brisbane, Australia. 2007: Forthcoming.

"The effects of location access behavior on re-identification risk in a distributed environment," in Lecture Notes in Computer Science: Proceedings of the 6th Privacy Enhancing Technologies Conference (PET), Cambridge, England, Revised Selected Papers. 2006; Vol. 4258: 413-429.

"Configurable security protocols for multi-party data analysis with malicious participants," in Proceedings of the 21st IEEE International Conference on Data Engineering. Tokyo, Japan. 2005: 533-544 (with S. Edoho-Eket and Y. Li).