The Census Bureau collects data from people and establishments under
Title 13 of the U.S. Code, which promises all of our respondents that
we will keep their information confidential. At the same time, we
are required to publish as much high-quality information as possible
without violating our confidentiality promise.
Our research staff develops disclosure avoidance techniques that
can be used on many types of data, most often microdata, tables of
demographic data, and tables of establishment data. Techniques
include geographic thresholds, rounding, noise addition, categorical
thresholds, topcoding, data swapping, suppression, and synthetic data.
Our current major research projects are development of synthetic data,
development of a Microdata Analysis System (users can query the system
without actually seeing the microdata), and development of a noise
technique to protect tables of establishment data.
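
Two of the simpler techniques named above, topcoding and rounding, can be illustrated with a minimal Python sketch. The function names, the cap of 200,000, and the rounding base of 5 are illustrative assumptions, not actual Census Bureau parameters.

```python
def topcode(values, cap):
    """Replace every value above `cap` with `cap` itself."""
    return [min(v, cap) for v in values]


def round_to_base(count, base=5):
    """Round a non-negative integer count to the nearest multiple of `base`."""
    return base * ((count + base // 2) // base)


# Hypothetical incomes: the one extreme value is pulled down to the cap.
incomes = [32_000, 58_500, 1_250_000, 74_000]
topcoded = topcode(incomes, cap=200_000)

# Hypothetical cell counts: small counts no longer reveal exact values.
counts = [1, 3, 7, 12]
rounded = [round_to_base(c) for c in counts]

print(topcoded)  # [32000, 58500, 200000, 74000]
print(rounded)   # [0, 5, 5, 10]
```

Topcoding hides extreme values that could single out a respondent, while rounding blurs small counts. Both trade some accuracy for protection.
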
The Census Bureau’s Disclosure Review Board (DRB) is a group
of people who represent various areas of the organization (demographic,
decennial, economic, research, and policy). This group reviews all
data products before they are publicly released to ensure there are
no disclosure problems.
The research staff and the DRB face the same challenge in that we
are swamped with work.
I have had an open position on the research staff for quite some
time, but young people coming out of college have never heard of disclosure
avoidance. The staff has enough money, data, hardware, and software,
but I need people to work in this area. How do we encourage universities
to have classes on protecting confidentiality? How do we get more
people interested in working on this?
The DRB is extremely busy. We meet every Monday, and our recent agendas
have had over 10 data requests every week (double what we were
seeing a year or two ago). The Census Bureau is placing a great emphasis
on confidentiality (which is good), but everyone is now afraid to
release any number without DRB approval. Some of the requests clearly
have no disclosure problems, and reviewing them costs the DRB time and paperwork.
How can we train people (not necessarily just at the Census Bureau)
to do a disclosure review of a data product?
Working at the Census Bureau brings opportunities as well as challenges
for the same reason: we have an enormous amount of data. We must do
our best to release as much high-quality data as possible without
violating confidentiality. Because of the large amount of data and
the forms in which they are released, disclosure avoidance procedures
that work for many other agencies and organizations will not work
for us. We need more and better techniques that can be used for extremely
large, interrelated products.
The most obvious example is the decennial census, and it will carry
over to the American Community Survey (ACS). From Census 2000, we
had data (6 variables) on more than 281 million people, and even more
data (61 additional variables) on one-sixth of them. From these data,
we published standard Summary Files 1 through 4 (over 4 billion tables)
as well as special tabulations.
The tables are additive, and they are interrelated in that the same
cell can appear in many tables. For a given cell that does appear
in many tables, the Census Bureau requires that the same value appear
in that cell. For these reasons, we cannot use a table-based disclosure
avoidance technique. There are too many tables, and we could never
coordinate a suppression or table-based noise technique. Also, the
tables are published for at least 29 different types of geographic
areas (such as blocks, tracts, zip codes, school districts, congressional
districts, consolidated cities, etc.) and their boundaries overlap.
This can create what we call “geographic slivers”: overlapping
boundaries could be used to compare data from different tables and
derive data for very small geographic areas. Thus we must
use a technique that works on the underlying microdata, such as data
swapping or generation of synthetic data, and that targets records
with disclosure risk for any potential sliver of geography. Then we
can create our tables from the perturbed microdata. For Census 2000
and for past ACS, we used data swapping. For future ACS, we are currently
doing research on synthetic data but certainly could use help from
other researchers.
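
The idea of perturbing the microdata first and tabulating afterward can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure used for Census 2000: the record layout, the block codes of the form `tract-block`, the matching key `size`, and the choice of which record is "at risk" are all hypothetical.

```python
import random
from collections import Counter


def swap_geography(records, at_risk_ids, key, rng):
    """Swap block codes between each at-risk record and a partner that
    matches on `key` but lives in a different block. Returns new records."""
    out = {r["id"]: dict(r) for r in records}  # copy so the input is untouched
    used = set()
    for rid in at_risk_ids:
        if rid in used:
            continue
        target = out[rid]
        partners = [r for r in out.values()
                    if r["id"] not in used and r["id"] != rid
                    and r[key] == target[key]
                    and r["block"] != target["block"]]
        if not partners:
            continue  # no eligible partner; a real procedure needs a fallback
        partner = rng.choice(partners)
        target["block"], partner["block"] = partner["block"], target["block"]
        used.update({rid, partner["id"]})
    return list(out.values())


def tabulate(records, var, level="block"):
    """Count records by (geography, var); tracts are derived from block codes."""
    def geo(r):
        return r["block"] if level == "block" else r["block"].split("-")[0]
    return Counter((geo(r), r[var]) for r in records)


# Hypothetical household records.
records = [
    {"id": 1, "block": "T1-B1", "size": 2, "tenure": "own"},
    {"id": 2, "block": "T1-B1", "size": 4, "tenure": "rent"},
    {"id": 3, "block": "T1-B2", "size": 2, "tenure": "rent"},
    {"id": 4, "block": "T2-B1", "size": 4, "tenure": "own"},
    {"id": 5, "block": "T2-B1", "size": 2, "tenure": "own"},
]

swapped = swap_geography(records, at_risk_ids=[2], key="size",
                         rng=random.Random(0))

# Counts of the matching key are unchanged in every block ...
print(tabulate(swapped, "size") == tabulate(records, "size"))      # True
# ... while tables of the other variable are perturbed.
print(tabulate(swapped, "tenure") == tabulate(records, "tenure"))  # False
```

Because every table is built from the same swapped file, a cell that appears in several tables carries the same value everywhere, and block tables still add up to tract tables. No table-by-table coordination is required.
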
We are, of course, also concerned about our microdata files from
ACS and other surveys. They are much larger and more detailed than
those released by other agencies and organizations. My bio page
lists examples of disclosure avoidance procedures we use to protect
these data. We also perform reidentification studies in which we attempt
to link our public use microdata files with other data available to
the public that have identifiers. If any problems are found, we change
our disclosure avoidance procedures. Another concern for ACS is that
we publish tables for small geographic areas and microdata for large
areas from the same data. We want to make sure that a user cannot
use the tables to attach identifiers of small geographic areas to
the microdata. We could use help in all of these areas of disclosure
avoidance research for microdata.
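
A reidentification study of the kind described above can be sketched as a record-linkage exercise. The sketch below is a simplified illustration, not our actual methodology: the field names, the exact-match rule, and all of the data (including the placeholder names) are hypothetical.

```python
from collections import Counter, defaultdict


def unique_links(microdata, external, keys):
    """Return (rec_id, name) pairs where a combination of `keys` is unique
    in the microdata AND matches exactly one record in the external file."""
    def signature(row):
        return tuple(row[k] for k in keys)

    micro_counts = Counter(signature(r) for r in microdata)
    ext_index = defaultdict(list)
    for row in external:
        ext_index[signature(row)].append(row)

    links = []
    for rec in microdata:
        sig = signature(rec)
        matches = ext_index.get(sig, [])
        if micro_counts[sig] == 1 and len(matches) == 1:
            links.append((rec["rec_id"], matches[0]["name"]))
    return links


# Hypothetical public use records (no identifiers).
microdata = [
    {"rec_id": "m1", "area": "A", "age": 34, "sex": "F"},
    {"rec_id": "m2", "area": "A", "age": 34, "sex": "F"},
    {"rec_id": "m3", "area": "A", "age": 71, "sex": "M"},
    {"rec_id": "m4", "area": "B", "age": 50, "sex": "F"},
]

# Hypothetical external file that carries identifiers.
external = [
    {"name": "Person X", "area": "A", "age": 71, "sex": "M"},
    {"name": "Person Y", "area": "A", "age": 34, "sex": "F"},
    {"name": "Person Z", "area": "B", "age": 50, "sex": "F"},
    {"name": "Person W", "area": "B", "age": 50, "sex": "F"},
]

links = unique_links(microdata, external, keys=("area", "age", "sex"))
print(links)  # [('m3', 'Person X')]
```

Records that link uniquely are the ones that would prompt a change in procedure, for example coarser geography or age categories. A one-to-one match is only a candidate link, since the external file may not cover the same population.
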
We have similar issues with our tables of establishment data in that
the very large amount of data and the fact that the tables are additive
and interrelated restrict our choices of disclosure avoidance procedures.
In the past, we have used cell suppression to protect this type of
data. Recently another good procedure called controlled tabular adjustment
has been developed, but for the reasons above, we cannot use it. We
are beginning to use a noise technique that is applied to the underlying
microdata before tabulation. Any other ideas are welcome.
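
The general idea of adding noise to the microdata before tabulation can be sketched as follows. This is a minimal illustration, not the technique we are implementing: the 5 to 15 percent noise range, the field names, and the data are all assumptions.

```python
import random
from collections import defaultdict


def noise_factor(rng, low=0.05, high=0.15):
    """Draw a multiplier that moves a value by between `low` and `high`
    (as a fraction), with the direction chosen at random."""
    magnitude = rng.uniform(low, high)
    return 1 + magnitude if rng.random() < 0.5 else 1 - magnitude


def add_noise(establishments, rng):
    """Return copies of the records with payroll multiplied by a noise factor."""
    return [dict(e, payroll=e["payroll"] * noise_factor(rng))
            for e in establishments]


def tabulate(establishments, *dims):
    """Sum payroll over the given classification variables."""
    table = defaultdict(float)
    for e in establishments:
        table[tuple(e[d] for d in dims)] += e["payroll"]
    return dict(table)


# Hypothetical establishment records.
establishments = [
    {"id": "e1", "industry": "retail", "county": "X", "payroll": 1200.0},
    {"id": "e2", "industry": "retail", "county": "Y", "payroll": 800.0},
    {"id": "e3", "industry": "mining", "county": "X", "payroll": 5000.0},
]

noisy = add_noise(establishments, random.Random(42))

detail = tabulate(noisy, "industry", "county")  # industry-by-county cells
totals = tabulate(noisy, "industry")            # industry totals
print(detail)
print(totals)
```

Because each establishment is perturbed once and every table is computed from the same noisy file, the tables remain additive and a cell shared by several tables has the same value in each, with no coordination across tables.
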
We are also interested in remote access systems where a user does
not see the underlying microdata but can query it for tables, regressions,
correlation coefficients, etc. We have developed an Advanced Query
System for users to create their own tables from Census 2000 and are
currently working on a Microdata Analysis System to be used for demographic
surveys in the future. Much more research is needed.
It would be very helpful to us if students studying statistics
learned about the importance of confidentiality and worked
with real, large data sets (such as our public use microdata files).