Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


I will confine my remarks to confidentiality issues affecting researchers doing approved research with government data. Although I am not an expert on confidentiality, in my experience more research would be welcome in three areas: devising a thorough and realistic assessment of the disclosure risks posed by researchers, identifying microdata with potential for re-identification of individuals, and developing statistical analyses that account for imperfections in record linkage.

A realistic assessment of the disclosure risks posed by researchers, and of the consequences of those risks, would dictate the appropriate measures for addressing confidentiality concerns. Otherwise, unreasonable fears of the worst-case scenario could seriously hamper the legitimacy of research, in particular the tenet that scientific research must be reproducible. In one example I have heard of, government researchers linked two datasets across agencies. In the interest of reproducibility, the researchers wanted the data to be publicly available, but one agency insisted on coarsening a variable. The analysis of the original data had p < 0.05, but the same analysis of the coarsened data had p > 0.05. Thus non-governmental researchers, who can access only the coarsened data, could have grounds to dispute the government's findings.

I am unaware of research that comprehensively documents the types, frequency, and consequences of disclosures that researchers have committed. I am not speaking here of sensational breaches, such as lost laptops containing personal identifiers or insecure computer networks; clearly, if these risks are not minimized, little we say at this workshop will matter much. Instead, I suspect that the major disclosure risk from scientific research is accidental re-identification caused by the need to make scientific data as publicly available as possible to ensure that the research is reproducible.
This thorough assessment should show how serious the disclosure risks posed by researchers are compared with those posed by other data users. Researchers are bound by legal and ethical obligations, should not require personal identifiers in the final analysis dataset, and, most importantly in my opinion, should have little motive, financial or intellectual, to re-identify individuals and then cause them harm. I would also encourage more work to understand the public's concerns about confidentiality, to ensure that researchers give those concerns the appropriate weight. Such work would help in planning the measures needed to reassure the public against any unwarranted fears.

That being stated, accidental re-identification can occur. When I worked at the Bureau of Labor Statistics, one company in an industry price survey suddenly jacked its prices up far above those of its competitors, enough to make the price index jump and alarm the industry. Its competitors called one another and managed to re-identify that company. The Commissioner convened a study group, of which I was a member, to study ways of flagging such outliers as re-identification risks before publication. Our group considered many ways of identifying outliers, and of down-weighting or coarsening them in the estimation, but did not believe that any one method was ready for production use. I believe that researchers, as opposed to economic competitors, would have less incentive to try to re-identify the company; nevertheless, this data is available to more than just researchers, so more work is needed to find the best options for alerting agencies to re-identification risks.

Finally, I think more research is needed on methods for analyzing record-linked data. At a recent meeting of the Federal Committee on Statistical Methodology, we bemoaned the mounting difficulties in linking government databases.
In particular, more survey respondents than ever before are refusing to release their Social Security numbers (out of confidentiality concerns), making record linkage much more challenging. Thus, an important area of research would be to extend current methods of statistical analysis to account for the mislinkages and missed linkages that are likely to become an ever more prominent part of record-linkage studies.

These opinions are the author's own and do not necessarily reflect those of the National Cancer Institute.

Hormuzd Katki

NCI

 

BIO

Hormuzd Katki has spent over 10 years as a mathematical statistician in the federal government. When at the Bureau of Labor Statistics, he did research on improving time-series models of state unemployment rates and was also involved in a Bureau-wide effort to reduce the risk of re-identification of confidential data by improving procedures for outlier detection and handling. He is currently a staff scientist in the Division of Cancer Epidemiology and Genetics at the National Cancer Institute (NCI). He works on research involving record-linkage of epidemiologic studies and disease registries to various administrative data, methodologic issues in the analysis of record-linkage studies (especially in accounting for imperfect linkages), and confidentiality issues posed by the acquisition of genetic data by epidemiologic studies and national health surveys. He is a member of the Federal Committee on Statistical Methodology's group on the Statistical Uses of Administrative Records.