Consider the problem of concealing the identity of a patient who
has AIDS. That person's medical records are to be combined with other
records for use by researchers and insurance companies. An attacker,
who wishes to determine the identity of the individual, gains access
to the sanitized records.
For the moment, we assume that the sanitization is effective. The
attacker cannot determine the identity from the information given.
So the attacker waits, and examines subsequent records handled in
a similar manner. She obtains access to additional datasets. Individually,
the datasets reveal nothing. But over a period of time, combining
information from the datasets enables the attacker to identify the
individual with AIDS.
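
The aggregation effect described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four-person population, the attribute names (zip, birth_year, clinic), and the idea that each release leaks exactly one attribute of the target. It shows only that each release alone leaves several candidates, while the combination leaves one.

```python
# Hypothetical population known to the attacker; all names and values are invented.
population = {
    "alice": {"zip": "95616", "birth_year": 1970, "clinic": "north"},
    "bob":   {"zip": "95616", "birth_year": 1970, "clinic": "south"},
    "carol": {"zip": "95616", "birth_year": 1982, "clinic": "north"},
    "dave":  {"zip": "95618", "birth_year": 1970, "clinic": "north"},
}

def candidates(population, known):
    """Return the names whose attributes match every known attribute."""
    return {name for name, attrs in population.items()
            if all(attrs.get(key) == value for key, value in known.items())}

def aggregate(releases):
    """Merge the attribute fragments learned from successive releases."""
    known = {}
    for release in releases:
        known.update(release)
    return known

# Each hypothetical sanitized release exposes one attribute of the target.
releases = [{"zip": "95616"}, {"birth_year": 1970}, {"clinic": "north"}]

# Candidate sets when each release is considered on its own.
per_release = [candidates(population, release) for release in releases]

# Candidate set once the attacker combines all releases.
combined = candidates(population, aggregate(releases))
```

In this toy setting every individual release is consistent with three people, but the merged attributes match only one.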
The trivial answer to this problem is to determine in advance what
information each dataset will contain, and ensure that the aggregation
of that information will not enable an attacker to determine any
individual's identity. This solution is unsatisfying for at least three reasons,
one organizational and two analytical.
First, the above solution assumes that the contents of the datasets
will be known in advance. However, if the datasets consist of information
gathered over time, those who generate the datasets will not know in advance
what data the datasets will contain. Further, if the datasets come from
many organizations, coordinating the analysis of the datasets poses
management problems. If two datasets taken together enable the identification
of an individual, but neither one alone does, which one is to be released?
Which one is to be withheld?
Second, there is an implicit assumption that the attacker has only
the information in the datasets available. But external knowledge
may enable the attacker to identify the individual. For example,
if the records identify the pharmacy at which the patients purchased
their medication, the attacker can correlate dates of visits to the
pharmacy (external knowledge) with dates prescriptions were filled
(information taken from the records). Unless the organizations know
what the attacker knows, it seems unlikely that the trivial approach above
would protect against this attack.
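
The pharmacy correlation can be sketched as follows. This is an illustration under stated assumptions, not a description of any real attack tool: the record pseudonyms, the observed people, the dates, and the simple "largest overlap" matching rule are all hypothetical.

```python
from datetime import date

# Sanitized records (invented): pseudonym -> dates prescriptions were filled.
fill_dates = {
    "rec-17": {date(2009, 3, 2), date(2009, 4, 1), date(2009, 5, 4)},
    "rec-42": {date(2009, 3, 9), date(2009, 4, 6), date(2009, 5, 11)},
}

# External knowledge (invented): dates the attacker saw named people at the pharmacy.
observed_visits = {
    "pat": {date(2009, 3, 9), date(2009, 4, 6), date(2009, 5, 11)},
    "sam": {date(2009, 3, 2), date(2009, 5, 20)},
}

def best_match(record_dates, visits):
    """Return the person whose observed visit dates overlap most with record_dates."""
    return max(visits, key=lambda person: len(visits[person] & record_dates))

# Link each pseudonymous record to the most consistent observed person.
links = {record: best_match(dates, observed_visits)
         for record, dates in fill_dates.items()}
```

Nothing in the sanitized records names a patient; the link comes entirely from the dates the attacker observed independently, which is why the sanitizer cannot rule out the attack by inspecting the datasets alone.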
Third, the attacker's goal may not require identification of an individual.
Identifying a small set of people to which the individual belongs (that is,
the individual is anonymous only within a group of k people, or k-anonymous,
for a small k) may be enough. As an example, if an insurance company
determines that one of three people has AIDS and therefore will require
expensive treatment, it may refuse to cover all three.
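
To make the size of such a set concrete, the following minimal sketch computes the k for which a table is k-anonymous, taken here as the size of the smallest group of rows that share the same quasi-identifier values. The table, the column names, and the choice of quasi-identifiers are assumptions made for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for every quasi-identifier; the table is k-anonymous for that k."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Invented, already-generalized records.
rows = [
    {"zip": "956**", "age": "30-39", "diagnosis": "AIDS"},
    {"zip": "956**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "956**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "957**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "957**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "957**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "957**", "age": "40-49", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["zip", "age"])
```

Here the smallest group has three rows, so an attacker who knows the target's generalized zip code and age range narrows the AIDS patient to one of three people, which matches the insurance example above.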
A solution to the above problem requires that a threat model be articulated:
what does the attacker know? It also requires a precise delineation
of what is considered a valid solution: must the attacker identify
a single individual, or is it enough to identify the individual as
one of a set of possible people? From this, the critical role of the
environment in which the problem is posed and the solution is determined
becomes clear. Extending the above problem a bit: as environments change,
so will the problems and their solutions, and sanitization requires
protecting data in multiple environments, including those that the
sanitizers may not foresee.