White Paper & Bio
[These are my opinions (and my mistakes)]
The general principles of data confidentiality are determined by law,
regulation, and custom. It is hard to enunciate general principles
that are not hedged by exceptions and qualifications (as in 'to the
extent possible'). There is an uncertain and changing balance between
perceived public good and perceived rights to privacy. Another way to
look at rights to privacy is to consider the losses that can result
from revealing personally identifiable data. Unfortunately there are
many shades of gray. Complexities include correlations, erroneous
inference, the trap of small numbers, ancillary information, and
regulations discordant with reality. A few examples follow.

If person A's whole genome is public, then a lot is known (but only
probabilistically) about the genomes of A's close relatives. The
current impact is likely to be less than the future impact, as
scientific understanding of genomes will increase. This is somewhat
different from making public the fact that A has a disease caused by
a dominant allele, where all the information is released at once.
(One of A's parents, we don't know which, has the disease, and each
sibling and child has a 50% chance of having it.) Is it possible (or
useful) to build this distinction between immediate and future loss
into a quantitative theory?

Imagine a database that contains family incomes for four sorts of
families (A, B, C, D) by district, and that in one district all but
one family is of type A. Then if there is any external information to
pick out the remaining family, their family income has become public.
The database owners could leave out that family's category (presumably
so the family isn't identified as to category), but even if there
isn't some simple way of telling which is the correct category,
someone could just announce that the family is category D. Does it
matter if that's right or not? Suppose 3 different people announce B,
C, and D. One of them is right. Is that a problem?

Suppose instead there are 4 families in category D, and they can tell
who they are (or 3 of them can). Then 3 of them can deduce the income
of the fourth, without revealing their own incomes. If the owners of
the database have fuzzed the data, then the deduced income is only an
estimate. Is there any way this can be avoided?

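The deduction in the four-family case can be made concrete with a small sketch. It assumes the database publishes a total income per category and district; the text does not say what statistic is published, and the function names, income figures, and noise value below are all hypothetical.

```python
def published_total(incomes, noise=0.0):
    """Total income published for one cell (hypothetical release format)."""
    return sum(incomes) + noise


def deduce_missing(total, known_incomes):
    """What colluding families infer about the one income they do not know."""
    return total - sum(known_incomes)


# Hypothetical category D in one district: four families.
category_d = [41_000, 58_000, 73_000, 96_000]
colluders = category_d[:3]  # three families pooling their own incomes

# Exact release: the fourth income is recovered exactly.
exact = deduce_missing(published_total(category_d), colluders)

# Fuzzed release: the error in the estimate equals the noise added.
fuzzed = deduce_missing(published_total(category_d, noise=2_500), colluders)

print(exact, fuzzed)
```

With only four families in the cell, any noise added to the published figure lands entirely on the single unknown income, which is why the fuzzed result is an estimate rather than a disclosure.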
What do people expect, or want? How wide is the range of expectations?
Is there a generational change? These questions could be approached
by research. The ideal research program would be longitudinal,
persisting over many years, and international.

Can one identify sequences of queries against databases that result
in extracting confidential information, or are intended to do so? How
about including the possibility that the person querying has
ancillary information?

And for you Web site owners, the EU believes that IP addresses are
personally identifiable data, but the US does not. This is amusing,
as it means that ISPs (the folks who provide your internet service)
are sitting on piles of sensitive data. If you run a web site, you
are probably logging what appear to be your users' IP addresses. To
what extent is the EU's position sensible, or is it a misreading of
current technology? (This question ignores some serious tension
between the EU's privacy folks and their security folks, who would
rather have lots of data available, for use in investigations.)

An IP address at best identifies a computer. If it is a shared
computer, then the confidential information is not that of an
individual, but belongs to a group (usually a family, or roommates).
By the time your web site sees the IP address there is a good chance
(but not near 100 per cent) that the machine's original IP address
has been rewritten once or twice (e.g., by the NAT in your local
router, and by your ISP). Based on this, here is a thought
experiment: Is there any summary of your server's log that you could
publish without revealing EU confidential data? How does it depend on
the exact contents of the data?
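One candidate summary for the thought experiment, offered as a sketch and not as an answer: count requests per truncated address prefix and drop prefixes with few requests. The /24 prefix length, the threshold of 3, and the function name are assumptions made for illustration. Whether such a summary would count as non-confidential under EU rules is exactly the open question.

```python
import ipaddress
from collections import Counter


def summarize(ips, prefix_len=24, min_count=3):
    """Count requests per IPv4 prefix, dropping prefixes seen fewer
    than min_count times (both parameters are illustrative choices)."""
    counts = Counter(
        str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
        for ip in ips
    )
    return {net: n for net, n in counts.items() if n >= min_count}


# Hypothetical log extract, using documentation-only address ranges.
log_ips = [
    "192.0.2.10", "192.0.2.11", "192.0.2.200", "192.0.2.10",
    "198.51.100.7",
]
summary = summarize(log_ips)
print(summary)
```

How much this reveals depends on the data. If a whole prefix is used by a single household, or if all requests in a bucket come from one address, the summary says nearly as much as the raw log, which is the trap of small numbers again.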
Biographical Data
Peter Weinberger is presently a software engineer at Google in
New York, where he has been for about 4 years. His professional
life began with a PhD in mathematics (number theory) from Berkeley,
after which he taught and did research at the University of Michigan
in Ann Arbor. Then he moved to Bell Labs for the adolescence of
Unix, doing research on various systems topics, and co-authoring
The Awk Book (with Aho and Kernighan). Moving into management, he
ended up as Information Systems Research Vice President. But then
AT&T and Lucent split up, and he went to Renaissance Technologies,
a successful hedge fund, as Head of Technology. And then to Google.