White Paper & Bio
I will confine my remarks to confidentiality issues affecting researchers
doing approved research with government data. Although I'm not an
expert on confidentiality issues, in my experience more research
would be welcome in three areas: devising a thorough and realistic
assessment of the disclosure risks posed by researchers, identifying
microdata that has potential for re-identifying individuals, and
developing statistical analyses that account for imperfections
in record linkages.

A realistic assessment of the disclosure risks posed by researchers,
and of their consequences, would dictate the appropriate
measures to address confidentiality concerns. Otherwise, unreasonable
fears of the worst-case scenario could seriously undermine the legitimacy
of research, in particular the tenet that scientific research must
be reproducible. In one example I have heard of, government researchers
linked two datasets across agencies. In the interest of reproducibility,
the researchers wanted the data to be publicly available, but one
agency insisted on coarsening a variable. The analysis of the
original data gave p<0.05, but the same analysis of the coarsened
data gave p>0.05. Thus
non-governmental researchers, who can only access the coarsened data,
could have grounds to dispute the government's findings. I am unaware
of research that has been done to comprehensively document the types,
frequency, and consequences of disclosures that researchers have committed.
I am not speaking here of sensational breaches, such as lost laptops
with personal identifiers or insecure computer networks; clearly,
if these risks are not minimized, little we say at this workshop will
matter much. Instead, I suspect that the major disclosure risk from
scientific research is the accidental re-identification caused by
the need to make scientific data as publicly available as possible
to ensure reproducibility of the research. A thorough assessment
should show how serious the disclosure risks posed by researchers
are compared with those posed by other data users, since researchers are
bound by legal and ethical obligations, should not require personal
identifiers in the final analysis dataset, and, most importantly in my
opinion, should have little motive, financial or intellectual, to
re-identify individuals and then cause them harm. I would also encourage
more work to understand the public's concerns about confidentiality, to
ensure that researchers take the appropriate degree of concern into
account. Such work would help to plan the measures needed to reassure
the public against any unwarranted fears.

That being stated, accidental re-identification
of individuals can occur. When I worked at the Bureau of Labor Statistics,
in a price survey for an industry, one company suddenly raised its
prices far above those of its competitors, enough to make the price
index jump and alarm the industry. Its competitors called one another
and managed to re-identify that company. The Commissioner convened
a study group, of which I was a member, to study ways of flagging such
outliers as re-identification risks before publication. Our group considered
many ways of identifying outliers, and also of down-weighting or coarsening
them in the estimation, but did not believe that any one method
was ready for production use. I believe that researchers, as opposed
to economic competitors, would have less incentive to try to re-identify
the company; nevertheless, this data is available to more than just
researchers, and so more work is needed to find the best options for
alerting agencies to re-identification risks.

Finally, I think more research is needed on methods for analyzing
record-linked data. At a recent Federal Committee on Statistical
Methodology meeting, we
bemoaned mounting difficulties in linking government databases. In
particular, more survey respondents than ever before are refusing
to release their SSNs (out of confidentiality concerns), making record
linkage much more challenging. Thus, an important area of research
would be to extend current methods of statistical analysis to account
for the mislinkages and missed linkages that are likely to become an
ever more prominent part of record-linkage studies.

These opinions are the author's
own and do not necessarily reflect those of the National Cancer Institute.
BIO
Hormuzd Katki has spent over 10 years
as a mathematical statistician in the federal government. When at
the Bureau of Labor Statistics, he did research on improving time-series
models of state unemployment rates and was also involved in a Bureau-wide
effort to reduce the risk of re-identification of confidential data
by improving procedures for outlier detection and handling. He is
currently a staff scientist in the Division of Cancer Epidemiology
and Genetics at the National Cancer Institute (NCI). He works on
research involving record-linkage of epidemiologic studies and disease
registries to various administrative data, methodologic issues in
the analysis of record-linkage studies (especially in accounting
for imperfect linkages), and confidentiality issues posed by the
acquisition of genetic data by epidemiologic studies and national
health surveys. He is a member of the Federal Committee on Statistical
Methodology's group on the Statistical Uses of Administrative Records.