In early July, I purchased an iPhone and went online to activate
my AT&T wireless account. I had to provide my Social Security
number in order to enter the information—there was no alternative—and
after a short delay my account was approved and my iPhone activated.
I presumed my Social Security number was used as part of a credit
check via a major data warehouse. A few days later I tried to activate
international roaming for a trip to the UK by telephone, and through
a series of questions and answers it became apparent that my credit
report was filled with information about someone else, since the
questions I was answering incorrectly concerned former residences
in Ohio. Fixing these errors appears to be impossible. Others can
tell similar stories with far more disastrous consequences.

In the UK I heard a lecture by Hans Rosling, of the Karolinska Institute,
on how we are moving from closely-held microdata to web-accessible
data through such tools as Swivel, Mapping Worlds, Many Eyes (IBM),
and Trendanalyzer (Google)—featuring several of the co-sponsors
of this workshop. Rosling envisions, as do many companies, a tremendous
new information frontier driven by large integrated shared databases.

These two stories lead me to the challenges that large integrated databases
pose for our research communities, especially databases assembled by the
private sector, which often include government-collected data gathered
under pledges of confidentiality.
There are at least three interlocking components:
• As individual data are selected for release, as in my credit
report, and extracts shared across companies and with the Department
of Homeland Security, what guarantees do we have on the preservation
of privacy and confidentiality, if any? What is the technical basis
of any such guarantees? [Remember that I was in effect forced to surrender
my Social Security number to activate the AT&T account, even though
the law nominally precludes such use.]
• In order to achieve the goal of large integrated databases,
we need to merge data from disparate sources using record linkage
methods. Much rests on the accuracy of the individual data components,
on the quality of the matching, and on the “resolution” of discrepancies
due to measurement and other forms of error. What methods are used
for such record linkage? What are their formal properties? And what
are the implications for other people's use of the data?
• How useful are the merged integrated databases? Do we have
correct methods for their analyses, especially in light of the measurement
error and confidentiality protection techniques (e.g., addition of
noise or other forms of perturbation) that may have been applied?
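To make the last question concrete, here is a minimal sketch of what can go wrong when perturbed data are analyzed as if they were clean. It is a toy simulation, not a description of any agency's or company's actual procedure: the sample size, the true slope of 2.0, the noise levels, and the function names `ols_slope` and `simulate` are all assumptions chosen for illustration. It adds independent Gaussian noise to a predictor, as a confidentiality protection might, and compares the ordinary least-squares slope before and after. The naive slope is attenuated by the factor var(x) / (var(x) + var(noise)), and it can be rescaled only if the noise variance is disclosed to the analyst.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=2.0, x_sd=1.0, noise_sd=1.0, seed=0):
    """Return (slope on original data, naive slope on perturbed data,
    slope corrected with the known noise variance).

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, x_sd) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0.0, 0.5) for xi in x]

    # Confidentiality protection: release x only after adding independent noise.
    x_released = [xi + rng.gauss(0.0, noise_sd) for xi in x]

    naive = ols_slope(x_released, y)

    # Classical measurement-error correction: divide by the reliability ratio.
    # This assumes the variance of x and of the added noise are both known.
    reliability = x_sd ** 2 / (x_sd ** 2 + noise_sd ** 2)
    corrected = naive / reliability

    return ols_slope(x, y), naive, corrected


if __name__ == "__main__":
    original, naive, corrected = simulate()
    print(f"slope on original data:   {original:.3f}")
    print(f"naive slope on released:  {naive:.3f}")
    print(f"corrected slope:          {corrected:.3f}")
```

With equal variances for the predictor and the added noise, the reliability ratio is one half, so the naive analysis recovers roughly half of the true slope. The correction here is the simplest textbook case; record linkage errors and other perturbation methods generally need different adjustments.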

Even if we can address some of these challenges with new and creative research,
scaling up to the giant databases that now exist and are envisioned
for our future, we are still left with the question of how to communicate
to users and the public what we have done and how their private
lives are protected and enhanced.