Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


Microdata X should be of high quality (Winkler 2004 Info. Sys) if they are used in modeling and analyses. Because of confidentiality concerns, public-use microdata X1 must minimize the chance of re-identification while still supporting approximately one or two of the sets of analyses (models) that the original, confidential microdata X support. Some authors (Palley & Simonoff 1987 TDBS; Lambert 1993 JOS; Fienberg 1997 CNSTAT) have demonstrated that some re-identification can occur based only on analytic properties (even with synthetic data generated from accurate models M on original microdata X). Other authors (Mera 1998; Moore & Lee 1998 JAIR; DuMouchel et al. 1999 KDD) have demonstrated (sometimes approximately) that if there are sufficient analytic constraints on microdata X1, then the microdata X1 must be nearly identical to original microdata X.
If one ensures that the released microdata X1 has one or two valid analytic properties, then one can attempt re-identification using a variety of analytic and record linkage techniques (Yancey, Winkler, & Creecy 2002; Evfimievski 2004).
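As a rough illustration of such a re-identification attempt, the sketch below links each original record to the masked record that agrees with it on the most fields. It is not the method of the cited papers: a plain agreement count stands in for proper record linkage weights, the function names are hypothetical, and it assumes (for evaluation only) that record i of X1 is the masked version of record i of X.

```python
def agreement(a, b):
    """Number of fields on which two records carry the same value."""
    return sum(1 for u, v in zip(a, b) if u == v)


def reidentification_rate(x, x1):
    """Fraction of records in x whose unique best match in x1 is the
    record at the same index (assumed to be its masked version)."""
    hits = 0
    for i, rec in enumerate(x):
        scores = [agreement(rec, masked) for masked in x1]
        best = max(scores)
        winners = [j for j, s in enumerate(scores) if s == best]
        # Count a re-identification only when the best match is
        # unambiguous and points at the true source record.
        if winners == [i]:
            hits += 1
    return hits / len(x)


# Toy data: (age group, sex, region) for original and masked files.
x = [("20", "M", "A"), ("30", "F", "B"), ("40", "M", "C")]
x1 = [("20", "M", "A"), ("30", "F", "C"), ("40", "F", "C")]
rate = reidentification_rate(x, x1)
```

A high rate on such a test suggests that the masking leaves records too close to their originals; ties are deliberately counted as failures to re-identify.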
The first issue is: How does one create a modeling framework and software that can be used on a variety of microdata X to assure that certain analytic properties are satisfied and that can be used to verify the analytic validity of masked, public-use microdata X1? For discrete data, Winkler (2007a) has created an edit/imputation/modeling framework that allows altering models/data in ways that approximately preserve the models while satisfying additional constraints. The new methods pull together and enhance earlier work (Winkler 1990 Ann Prob 1993, 1997, 2003, 2006; Meng & Rubin Biometrika 1993; Little & Rubin 2002; D’Orazio, DiZio & Scanu 2006 JOS).
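One simple way to picture a check of analytic validity for discrete data is to compare the two-way margins of X and X1. The sketch below is only a stand-in under that assumption, not the Winkler (2007a) framework; the function names are hypothetical, and total variation distance is chosen here purely for illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_margins(records, n_fields):
    """Relative frequencies of every two-way margin of a discrete file."""
    n = len(records)
    margins = {}
    for i, j in combinations(range(n_fields), 2):
        counts = Counter((r[i], r[j]) for r in records)
        margins[(i, j)] = {cell: c / n for cell, c in counts.items()}
    return margins


def max_margin_distance(x, x1, n_fields):
    """Largest total variation distance between matching two-way margins
    of the original file x and the masked file x1."""
    mx = pairwise_margins(x, n_fields)
    mx1 = pairwise_margins(x1, n_fields)
    worst = 0.0
    for key in mx:
        cells = set(mx[key]) | set(mx1[key])
        tv = 0.5 * sum(
            abs(mx[key].get(c, 0.0) - mx1[key].get(c, 0.0)) for c in cells
        )
        worst = max(worst, tv)
    return worst


# Toy data: swapping the third field between groups breaks the
# (field 0, field 2) margin while leaving the other margins intact.
x = [("a", 0, "u"), ("a", 1, "u"), ("b", 0, "v"), ("b", 1, "v")]
x_swapped = [("a", 0, "v"), ("a", 1, "v"), ("b", 0, "u"), ("b", 1, "u")]
distance = max_margin_distance(x, x_swapped, 3)
```

A distance near 0 indicates that analyses depending only on two-way margins would give nearly the same results on X1 as on X; higher-order interactions would need a richer check.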
The second issue is: For analytically valid public-use microdata X1, how does one alter the microdata X1 to produce microdata X2 where X2 has significantly reduced risk of re-identification and allows nearly the same modeling (analytic properties) as X1 (or X)? For discrete data, Winkler (2007b) shows how to create models M2 (and generate synthetic microdata from X2) that approximate models from original microdata X while reducing the risk of re-identification.

William E. Winkler

U.S. Census Bureau

 

 

Biographical Data

B.S. Mathematics, Phi Beta Kappa
Ph.D. Probability Theory
Fellow, American Statistical Association
Principal Researcher, U.S. Census Bureau

Expertise: Record Linkage, Edit/Imputation, Multi-way and Multi-purpose Sampling, and Microdata Confidentiality

Author or co-author of 130+ papers and of the book Data Quality and Record Linkage Techniques (2007, with T. Herzog & F. Scheuren)

Author or co-author of more than 12 generalized computer systems, some of which are used for production in the largest survey situations