Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


The Census Bureau collects data from people and establishments under Title 13 of the U.S. Code, which promises all of our respondents that we will keep their information confidential. At the same time, we are required to publish as much high-quality information as possible without violating our confidentiality promise.

Our research staff develops disclosure avoidance techniques that can be used on many types of data, chiefly microdata, tables of demographic data, and tables of establishment data. Techniques include geographic thresholds, rounding, noise addition, categorical thresholds, topcoding, data swapping, suppression, and synthetic data. Our current major research projects are the development of synthetic data, a Microdata Analysis System (which users can query without actually seeing the microdata), and a noise technique to protect tables of establishment data.
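Two of the simpler techniques named above, topcoding and rounding, can be illustrated with a minimal Python sketch. The function names, the income cap of 150,000, and the rounding base of 5 are illustrative assumptions, not the Census Bureau's actual parameters.

```python
def topcode(values, cap):
    """Replace every value above `cap` with `cap` itself (hypothetical cap)."""
    return [min(v, cap) for v in values]


def round_to_base(count, base=5):
    """Round a nonnegative integer count to the nearest multiple of `base`."""
    return base * ((count + base // 2) // base)


# Made-up incomes; the last one is an extreme value that could identify someone.
incomes = [18_000, 42_500, 67_000, 1_250_000]
topcoded = topcode(incomes, cap=150_000)

# Made-up table cell counts.
cells = [1, 2, 3, 7, 12]
rounded = [round_to_base(c) for c in cells]

print(topcoded)  # [18000, 42500, 67000, 150000]
print(rounded)   # [0, 0, 5, 5, 10]
```

Topcoding hides extreme values on a microdata record, while rounding blurs small counts in a table; both trade some accuracy for protection.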

The Census Bureau’s Disclosure Review Board (DRB) is a group of people who represent various areas of the organization (demographic, decennial, economic, research, and policy). This group reviews all data products before they are publicly released to ensure there are no disclosure problems.

The research staff and the DRB face the same challenge in that we are swamped with work.

I have had an open position on the research staff for quite some time, but young people coming out of college have never heard of disclosure avoidance. The staff has enough money, data, hardware, and software, but I need people to work in this area. How do we encourage universities to have classes on protecting confidentiality? How do we get more people interested in working on this?

The DRB is extremely busy. We meet every Monday, and our recent agendas have had over 10 data requests every week (double what we were seeing a year or two ago). The Census Bureau is placing a great emphasis on confidentiality (which is good), but everyone is now afraid to release any number without DRB approval. Some of the requests clearly have no disclosure problems, and reviewing them wastes DRB time and paperwork. How can we train people (not necessarily just at the Census Bureau) to do a disclosure review of a data product?

Working at the Census Bureau brings opportunities as well as challenges for the same reason: we have an enormous amount of data. We must do our best to release as much high-quality data as possible without violating confidentiality. Because of the large amount of data and the forms in which they are released, disclosure avoidance procedures that work for many other agencies and organizations will not work for us. We need more and better techniques that can be used for extremely large, interrelated products.

The most obvious example is the decennial census, and the same issues will carry over to the American Community Survey (ACS). From Census 2000, we had data (6 variables) on more than 281 million people, and even more data (61 additional variables) on one-sixth of them. From these data, we published the standard Summary Files 1 through 4 (over 4 billion tables) as well as special tabulations.

The tables are additive, and they are interrelated in that the same cell can appear in many tables. When a cell does appear in many tables, the Census Bureau requires that it show the same value in each. For these reasons, we cannot use a table-based disclosure avoidance technique: there are too many tables, and we could never coordinate a suppression or table-based noise technique across them. Also, the tables are published for at least 29 different types of geographic areas (such as blocks, tracts, ZIP Codes, school districts, congressional districts, and consolidated cities), and their boundaries overlap. This can create what we call “geographic slivers”: overlapping boundaries can be used to compare data from different tables and obtain data for very small geographic areas. Thus we must use a technique that works on the underlying microdata, such as data swapping or generation of synthetic data, and that targets records with disclosure risk for any potential sliver of geography. Then we can create our tables from the perturbed microdata. For Census 2000 and for past ACS releases, we used data swapping. For future ACS releases, we are currently doing research on synthetic data but certainly could use help from other researchers.
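The sliver problem is essentially subtraction. A toy Python sketch, assuming hypothetical block counts and two hypothetical overlapping areas built from those blocks (none of these names or numbers come from Census data):

```python
# Hypothetical population counts for three blocks (illustrative only).
block_counts = {"B1": 120, "B2": 95, "B3": 2}

# Two hypothetical published geographies whose boundaries overlap.
tract = {"B1", "B2", "B3"}
school_district = {"B1", "B2"}


def published_total(area):
    """Total that a published table would show for an area made of blocks."""
    return sum(block_counts[b] for b in area)


# The sliver is the part of the tract that lies outside the school district.
sliver = tract - school_district

# A user never sees block B3 directly, but differencing two published
# totals recovers its count.
sliver_count = published_total(tract) - published_total(school_district)

print(sorted(sliver), sliver_count)  # ['B3'] 2
```

Neither table looks risky on its own; the risk appears only when the two are compared, which is hard to control table by table when the number of tables and geographies is very large.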

We are, of course, also concerned about our microdata files from ACS and other surveys. They are much larger and more detailed than those released by other agencies and organizations. My bio page lists examples of disclosure avoidance procedures we use to protect these data. We also perform reidentification studies in which we attempt to link our public use microdata files with other publicly available data that carry identifiers. If any problems are found, we change our disclosure avoidance procedures. Another concern for ACS is that we publish tables for small geographic areas and microdata for large areas from the same data. We want to make sure that a user cannot use the tables to attach identifiers of small geographic areas to the microdata. We could use help in all of these areas of disclosure avoidance research for microdata.
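The idea behind a reidentification study can be sketched as a record linkage on shared variables. The following Python sketch is only an illustration: the field names, the toy records, and the rule of flagging keys that are unique in both files are assumptions, not a description of the Census Bureau's actual study design.

```python
from collections import defaultdict


def unique_matches(microdata, external, keys):
    """Pair microdata records with external records when the combination of
    `keys` is unique in both files (a simple, hypothetical matching rule)."""
    external_index = defaultdict(list)
    for rec in external:
        external_index[tuple(rec[k] for k in keys)].append(rec)

    micro_counts = defaultdict(int)
    for rec in microdata:
        micro_counts[tuple(rec[k] for k in keys)] += 1

    matches = []
    for rec in microdata:
        key = tuple(rec[k] for k in keys)
        candidates = external_index.get(key, [])
        if micro_counts[key] == 1 and len(candidates) == 1:
            matches.append((rec, candidates[0]["name"]))
    return matches


# Toy public use microdata (no identifiers) and a toy external file (with names).
microdata = [
    {"age": 34, "sex": "F", "area": "A", "income": 52_000},
    {"age": 34, "sex": "M", "area": "A", "income": 48_000},
    {"age": 71, "sex": "F", "area": "B", "income": 30_000},
    {"age": 71, "sex": "F", "area": "B", "income": 27_000},
]
external = [
    {"name": "Person X", "age": 34, "sex": "F", "area": "A"},
    {"name": "Person Y", "age": 71, "sex": "F", "area": "B"},
]

links = unique_matches(microdata, external, keys=("age", "sex", "area"))
print([(name, rec["income"]) for rec, name in links])  # [('Person X', 52000)]
```

A unique link is only a candidate: because microdata files are typically samples, the linked person may not be the true respondent. Links like these would still indicate which variable combinations deserve more protection.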

We have similar issues with our tables of establishment data in that the very large amount of data and the fact that the tables are additive and interrelated restrict our choices of disclosure avoidance procedures. In the past, we have used cell suppression to protect this type of data. Recently another good procedure called controlled tabular adjustment has been developed, but for the reasons above, we cannot use it. We are beginning to use a noise technique that is applied to the underlying microdata before tabulation. Any other ideas are welcome.
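The appeal of applying noise to the microdata before tabulation is that every table built from the noisy records stays additive and mutually consistent. A minimal Python sketch of that idea follows; the multiplier range (5% to 15% away from 1), the field names, and the toy records are assumptions made for illustration, not the parameters of the Census Bureau's technique.

```python
import random


def assign_multipliers(establishment_ids, low=0.05, high=0.15, seed=0):
    """Give each establishment one fixed noise multiplier near 1.
    The range [low, high] is a hypothetical choice."""
    rng = random.Random(seed)
    multipliers = {}
    for est in establishment_ids:
        direction = rng.choice([-1, 1])  # perturb up or down
        multipliers[est] = 1 + direction * rng.uniform(low, high)
    return multipliers


def tabulate(records, multipliers, by):
    """Sum noisy payroll by the field `by`, using the same multiplier for an
    establishment in every table."""
    totals = {}
    for rec in records:
        noisy = rec["payroll"] * multipliers[rec["id"]]
        totals[rec[by]] = totals.get(rec[by], 0.0) + noisy
    return totals


# Toy establishment microdata (made-up values).
records = [
    {"id": 1, "county": "C1", "industry": "retail", "payroll": 900.0},
    {"id": 2, "county": "C1", "industry": "mining", "payroll": 4_000.0},
    {"id": 3, "county": "C2", "industry": "retail", "payroll": 1_200.0},
    {"id": 4, "county": "C2", "industry": "retail", "payroll": 700.0},
]

multipliers = assign_multipliers([r["id"] for r in records])
by_county = tabulate(records, multipliers, by="county")
by_industry = tabulate(records, multipliers, by="industry")

# Both tables are built from the same noisy records, so their grand totals agree.
print(sum(by_county.values()), sum(by_industry.values()))
```

Because each establishment carries the same multiplier into every table, no coordination between tables is needed, which is the property that table-based methods cannot offer at this scale.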

We are also interested in remote access systems where a user does not see the underlying microdata but can query it for tables, regressions, correlation coefficients, etc. We have developed an Advanced Query System for users to create their own tables from Census 2000 and are currently working on a Microdata Analysis System to be used for demographic surveys in the future. Much more research is needed.

I know it would be very helpful for us if students studying statistics would learn about the importance of confidentiality and would work with real, large data sets (such as our public use microdata files).

Laura Zayatz


US Census Bureau

Biographical Data

Laura is the leader of the Statistical Disclosure Avoidance Research Group in the Statistical Research Division at the Census Bureau. She is also the chair of the Census Bureau’s Disclosure Review Board and a member of the American Statistical Association’s Privacy and Confidentiality Committee.