Excerpt from hunch.net posting by John Langford
The Privacy Problem
Filed under: Research, Machine Learning, Privacy - jl @ 5:26 pm
Machine Learning is rising in importance because data is being collected
for all sorts of tasks, either tasks where it wasn’t previously
collected or tasks that did not previously exist. While this is great
for Machine Learning, it has a downside: the massive data collection
which is so useful can also lead to substantial privacy problems.
It’s important to understand that this is a much harder problem
than many people appreciate. The AOL data release is a good example.
To those doing machine learning, the following strategies might be
obvious:
1. Just delete any names or other obviously personally identifiable
information. The logic here seems to be “if I can’t easily
find the person then no one can”. That doesn’t work as
demonstrated by the people who were found circumstantially from the
AOL data.
2. … then just hash all the search terms! The logic here is
“if I can’t read it, then no one can”. It’s
also trivially broken by a dictionary attack—just hash all the
strings that might be in the data and check to see if they are in
the data.
3. … then encrypt all the search terms and throw away the key!
This prevents a dictionary analysis, but it is still entirely possible
to do a frequency analysis. If 10 terms appear with known relative
frequencies in public data, then finding 10 encrypted terms
with the same relative frequencies might give you very good evidence
for what these terms are.
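
To make attacks 2 and 3 concrete, here is a minimal Python sketch.
Everything in it is an assumption for illustration: the toy search
logs, the choice of unsalted SHA-256 as the hash, and the opaque
tokens standing in for encrypted terms. The frequency attack also
assumes the encryption is deterministic, so that equal search terms
always produce equal ciphertexts. None of this reflects the actual
AOL data.

```python
import hashlib
from collections import Counter


def sha256_hex(term: str) -> str:
    """Unsalted hash of a search term (the assumed 'anonymization')."""
    return hashlib.sha256(term.encode("utf-8")).hexdigest()


def dictionary_attack(hashed_log, candidate_terms):
    """Recover hashed terms by hashing guesses and looking them up."""
    lookup = {sha256_hex(t): t for t in candidate_terms}
    return [lookup.get(h) for h in hashed_log]  # None means "not guessed"


def frequency_attack(token_log, public_log):
    """Pair opaque tokens with public terms of the same frequency rank."""
    token_ranked = [tok for tok, _ in Counter(token_log).most_common()]
    public_ranked = [term for term, _ in Counter(public_log).most_common()]
    return dict(zip(token_ranked, public_ranked))


# Hypothetical toy data: a private log and a public one with similar
# relative frequencies.
private_log = ["weather"] * 5 + ["news"] * 3 + ["rare disease"] * 1
public_log = ["weather"] * 50 + ["news"] * 31 + ["rare disease"] * 9

# Strategy 2: hash every term. The attacker hashes a list of guesses.
hashed = [sha256_hex(t) for t in private_log]
recovered = dictionary_attack(hashed, ["news", "weather", "sports"])

# Strategy 3: encrypt and discard the key, modeled here as a secret
# fixed mapping from terms to opaque tokens.
secret = {"weather": "tok_a", "news": "tok_b", "rare disease": "tok_c"}
tokens = [secret[t] for t in private_log]
guess = frequency_attack(tokens, public_log)
```

In this toy run the dictionary attack recovers every term that was on
the guess list and leaves the rest as None, while the frequency attack
never sees a plaintext term from the private log yet still pairs each
token with the right term. Real data is noisier, so rank matching gives
evidence rather than certainty.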
All of these strategies turn out to be broken. For those not familiar
with Machine Learning, other obvious strategies turn out to not work
that well.
1. Just don’t collect the data. We are not too far off from
a world where setting the “please don’t collect my information”
flag in your browser implies that you volunteer to have your searches
return less relevant results, to not find friends, to not filter spam,
etc. If everyone effectively has that flag set by legislation,
the effect would be very substantial. Many internet companies run
on advertising, so eliminating the ability to do targeted advertising
will eliminate the ability of these companies to exist.
2. …Then just keep aggregations of the data! Aggregating data
is very bad for machine learning in general. When we are figuring
out how to do machine learning it’s even worse because we don’t
know in advance which aggregations would be most useful.
3. …Then keep just enough data around and throw out everything
else! Unfortunately, there is no such thing as “enough data”.
More data is always better.
This is a particularly relevant topic right now, because it’s in the
news and because CMU and NSF are organizing a workshop on the topic
next month, which I’m planning to attend. However, this is not
simply a burst of interest: the long-term trend of increasing data
collection implies this problem will come up repeatedly for the
indefinite future.
The privacy problem breaks into at least two parts.
1. Cultural Norms. Historically, almost no monetary transactions
were recorded and there was a reasonable expectation that people would
forget a visitor. This is rapidly changing with the rise of credit
cards and cameras. This change in what can be expected is profoundly
uncomfortable.
2. Power Balance. Data is power. The ability to collect and analyze
large quantities of data, which many large organizations now have or
are constructing, increases their power relative to ordinary people.
This power can be used for good (to improve services) or for bad (to
entrench a monopoly or to spy).
The cultural norm privacy problem is sometimes solvable by creating
an opt-in or opt-out protocol. This is particularly helpful on the
internet because a user could simply request “please don’t
record my search” or “please don’t record which
news articles I read”. Needing to do this for every search or
every news article would be annoying. However, this is easily fixed
by having a system wide setting—perhaps a special browser cookie
which says “please don’t record me” that any site
could check. None of this is helpful for cameras (where no interface
exists) or monetary transactions (where the transaction itself determines
whether or not some item is shipped).
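
As a rough sketch of what “any site could check” might look like
on the server side, here is a small Python example. The cookie name
do_not_record and its value "1" are hypothetical, as are the function
names; no existing standard is implied. The sketch takes the Cookie
header as given and says nothing about how a browser would arrange to
send it to every site.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie name, chosen only for this illustration.
OPT_OUT_COOKIE = "do_not_record"


def recording_allowed(cookie_header: str) -> bool:
    """Return False when the request carries the opt-out cookie set to 1."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    morsel = cookies.get(OPT_OUT_COOKIE)
    return not (morsel is not None and morsel.value == "1")


def handle_search(query: str, cookie_header: str, log: list) -> None:
    """Record the query only if the user has not opted out."""
    if recording_allowed(cookie_header):
        log.append(query)


# Usage: the first request opts out, the second does not.
log = []
handle_search("privacy", "do_not_record=1; theme=dark", log)
handle_search("weather", "theme=dark", log)
```

After these two calls, only the second query is in the log. The check
is a single lookup per request, which is what makes a system-wide
setting less annoying than a per-search request.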
The power balance privacy problem is much more difficult. Some solutions
that people attempt are:
1. Accept the change in power balance. This is the default action.
There are plenty of historical examples where large organizations
have abused their power, so providing them more power to abuse may
be unwise.
2. Legislate a halt. Forbid cameras in public places. Forbid the collection
or retention of data by any organization. The problem with this method
is that technology simply isn’t moving in this direction. At
some point, we may end up with cameras and storage devices so small,
cheap, and portable that forbidding their use is essentially absurd.
The other difficulty with this solution is that it keeps good things
from happening. For example, a reasonable argument can be made that
the British were effective at tracking bomb planters because the cameras
of London helped them trace the source of attacks.
3. Legislate an acceleration. Instead of halting the collection of
data, open it up to more general use. One example of this is cameras
in police cars in the US. Recordings from these cameras can often
settle disputes very definitively. As technology improves, it’s
reasonable to expect cameras just about anywhere people are in public.
Some legislation and good engineering could make these cameras available
to anyone. This would involve a substantial shift in cultural norms—essentially
people would always be in potential public view when not at home.
This directly collides with the “privacy as a cultural norm”
privacy problem.
The hardness of the privacy problem mentioned at the beginning of the
post implies difficult tradeoffs.
1. If you have cultural norm privacy concerns, then you really don’t
appreciate method (3) for power balance privacy concerns.
2. If you value privacy greatly and the default action is taken, then
you prefer monopolistic marketplaces. The advantage conferred by a large
amount of private data is a prohibitive barrier to new market entrants.
3. If you want the internet to work better, then there are limits
on how little data can be collected.
All of the above is even murkier because what can be done with data
is not fully known, nor is what can be done in a privacy sensitive
way.