Over the years, private and governmental organizations have recognized
the immense value of collecting and studying large amounts of data.
Protecting such information is complicated because these organizations
seek to increase the value of their information collections by
aggregating resources without violating the privacy and
confidentiality constraints imposed by policy and law. The
data privacy arena is a relatively young domain that sits at the confluence
of statistics, computer science, and policy. The basic problem setting
of my research can be stated as follows: person-specific
data is collected and must be released in such a way that each person's
privacy is protected. I investigate data privacy issues in databases of person-specific
records from the perspective of statistical theory and data analysis.
After attending the 2003 Privacy in D.A.T.A. Workshop (organized
by the Data Privacy Laboratory and the Aladdin Center in the School
of Computer Science of Carnegie Mellon University), I started working
towards the design of a formal framework for formulating
and analyzing privacy concerns in large databases of person-specific
records, and for measuring the value of proposed protection methods in
terms of the degree of protection granted to the corresponding individuals.
The essential elements of such a framework combine statistical and
computational disclosure control concepts based on research conducted
by Stephen Fienberg, Latanya Sweeney, and Cynthia Dwork.
At a more recent workshop in Bertinoro, Italy (organized by Carnegie
Mellon University, Microsoft, and the National Institute of Statistical
Sciences), researchers identified a core set of fundamental problems
in privacy and confidentiality and largely agreed on a
common language for describing these problems and their nuances.
The discussion between statisticians and computer scientists has been
engaging and has led to joint efforts such as the newly established
Journal of Privacy and Confidentiality (http://jpc.cylab.cmu.edu/).
However, how to measure the quality of proposed solutions remains an
open question.
From a broader point of view, certain general principles are needed,
but remain undefined, to answer questions that
involve both aggregate and individual perspectives, to deal with testability
issues and, ultimately, to measure the progress of the field as a whole.
As part of an ongoing collaboration with Brad Malin (Vanderbilt University),
I have sought to identify the essential elements of an abstract theory
of record linkage that provides a tool for addressing these missing principles.
Such elements bridge perspectives on privacy that consider either
aggregate distributions or microdata. They also provide the technical
tools necessary for keeping privacy at the forefront of data analysis,
while informing the development of a normative theory of privacy.
Consider, for example, the notion of a "linking distribution"
of an attribute. Linking distributions homogenize the treatment of
categorical and numerical data types, and their estimation from real
data allows for the introduction of the complexity of real world problems
in the analysis. Another element of such a theory is an "algorithmic
approach" to computing theoretical bounds on the privacy
protection of any <database, protection method> pair,
which leads to predictions about performance guarantees in real world
situations. Given realistic scenarios encoded by our model and the
metric for privacy protection, the algorithmic approach is key to
introducing testability and to measuring progress clearly in the privacy
arena.
In particular, we propose an algorithmic solution to the problem
of measuring the degree of privacy protection granted to individual
records by an arbitrary method. To do so, we start from a
description of the world in terms of privacy-relevant concepts
that the community at large agrees upon, adopt the notion of re-identifiability,
and add a greedy linkage algorithm together with an original description
of the re-identification process in terms of a few parameters that can
be reliably estimated from data. With all the pieces in place, we
are ready to show how the analysis of real data may be used to shed
light on the theoretical re-identifiability of real world problems,
how bounds can be computed for a certain database, and how this work
can be extended, and progress measured, in the future.
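
As a rough illustration of the kind of greedy linkage step mentioned
above, the following Python sketch links released (de-identified) records
to identified records by exact agreement on shared attributes, and then
reports the fraction of records that were linked correctly. Everything in
it is an assumption made for illustration: the agreement score, the toy
records, and the names agreement, greedy_link, and reidentification_rate
are hypothetical and are not the algorithm or the privacy metric developed
in this research.

```python
from typing import Dict, List

Record = Dict[str, object]


def agreement(a: Record, b: Record, attributes: List[str]) -> int:
    """Count the attributes on which two records agree exactly (assumed score)."""
    return sum(1 for attr in attributes if a.get(attr) == b.get(attr))


def greedy_link(released: List[Record], identified: List[Record],
                attributes: List[str]) -> Dict[int, int]:
    """Greedily pair released records with identified records.

    Candidate pairs are visited from highest to lowest agreement score;
    a pair is accepted only if neither record has been linked already.
    Pairs with a score of zero are never linked.
    """
    pairs = [
        (agreement(r, s, attributes), i, j)
        for i, r in enumerate(released)
        for j, s in enumerate(identified)
    ]
    # Highest score first; ties broken by index for determinism.
    pairs.sort(key=lambda t: (-t[0], t[1], t[2]))

    links: Dict[int, int] = {}
    used_identified = set()
    for score, i, j in pairs:
        if score == 0:
            break
        if i in links or j in used_identified:
            continue
        links[i] = j
        used_identified.add(j)
    return links


def reidentification_rate(links: Dict[int, int], truth: Dict[int, int]) -> float:
    """Fraction of released records whose link matches the true identity."""
    if not truth:
        return 0.0
    correct = sum(1 for i, j in links.items() if truth.get(i) == j)
    return correct / len(truth)


# Toy data (hypothetical): released records carry no names.
attributes = ["zip", "birth_year", "sex"]
released = [
    {"zip": "15213", "birth_year": 1970, "sex": "F"},
    {"zip": "15213", "birth_year": 1970, "sex": "M"},
    {"zip": "15217", "birth_year": 1985, "sex": "F"},
]
identified = [
    {"name": "A", "zip": "15217", "birth_year": 1985, "sex": "F"},
    {"name": "B", "zip": "15213", "birth_year": 1970, "sex": "F"},
    {"name": "C", "zip": "15213", "birth_year": 1970, "sex": "M"},
]
truth = {0: 1, 1: 2, 2: 0}  # released index -> identified index

links = greedy_link(released, identified, attributes)
rate = reidentification_rate(links, truth)
```

Applying a protection method to the released records (for example,
suppressing or generalizing an attribute) and recomputing the rate gives one
simple, empirical way to compare <database, protection method> pairs under
these assumptions.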