July 22, 2009
Statistician helps agencies protect 'microdata'
By Jerry Oster
Jerry Reiter worries about intruders – not the lock-picking kind, but the kind that prowl public datasets in search of confidential information.
“When statistical agencies release a set of microdata – data about individuals – to the public,” Reiter explains, “they are legally and ethically and practically required to protect the confidentiality of respondents: identities, income, health variables and any other sensitive information. Employees of the Census Bureau, the National Center for Health Statistics, the National Center for Education Statistics and other federal statistical agencies face jail and $250,000 fines under the Confidential Information Protection and Statistical Efficiency Act if they release data in a way that enables identification of an individual.”
Beyond the legal considerations, Reiter continues, “there are serious ramifications for the whole federal system if there are breaches of confidentiality. One of the reasons a potential survey respondent is willing to give truthful answers is his or her belief that they will be kept confidential. I’m not going to tell the truth about my income or my disease status if ill-intentioned users – let’s call them intruders – can figure out my identity from the dataset that’s been released.”
Who are intruders? “Agencies haven’t admitted to actual breaches – and they won’t,” Reiter says. “The apocryphal stories you hear are about, say, parents who try to find information about their kids in a survey of drug use by teenagers; or about lawyers in a divorce case who might try to learn from a survey an individual’s actual income, as opposed to what’s being claimed in court.”
Intruders aren’t only individual people, Reiter says. “Companies want to know what their competitors are doing, how much they’re spending on research and development, what their total payroll is. For publicly traded companies, that information is available, but for private enterprises it’s not available and could be very useful to know. Or take farms. The National Agricultural Statistics Service worries about this a lot. One farmer doesn’t know how much acreage another is devoting to particular crops and livestock and would like to, so he can adjust his own apportionments accordingly to be more competitive.
“There’s more fear now with more detailed information being collected on biomedical and genetic variables. Biomarker data – something as simple as cholesterol levels or blood pressures or as complex as whether someone has a certain disease or a certain genotype – are data an insurance company might not mind having.”
A 1992 Duke graduate with a BS in math – “there was no statistics major back then” – Reiter helps statistical agencies evaluate the risks of releasing microdata. “I come up with statistical methods that give agencies a probability that an individual will be identified, given what’s been released about them and various assumptions about what an intruder might know.” He also assesses the effectiveness of steps agencies take to alter data in order to prevent identification, such as releasing ranges of incomes rather than exact incomes. “My metrics evaluate the impact such alterations have on inferences and analyses that secondary data-users make downstream.”
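To make the idea of an identification probability concrete, here is a minimal sketch of one textbook-style baseline, not Reiter’s actual methodology. It assumes an intruder knows a target’s age, ZIP code and sex, knows the target is in the released file, and guesses uniformly among the records that match. The records, the field names and the helper `match_risks` are all hypothetical. Coarsening age into decade ranges, in the spirit of releasing income ranges rather than exact incomes, lowers the risk for most records.

```python
from collections import Counter

# Hypothetical released records; none of this is real survey data.
records = [
    {"age": 34, "zip": "27701", "sex": "F"},
    {"age": 36, "zip": "27701", "sex": "F"},
    {"age": 34, "zip": "27705", "sex": "M"},
    {"age": 51, "zip": "27705", "sex": "M"},
    {"age": 58, "zip": "27705", "sex": "M"},
    {"age": 37, "zip": "27701", "sex": "F"},
]


def exact_key(record):
    """What the intruder can match on when exact ages are released."""
    return (record["age"], record["zip"], record["sex"])


def coarse_key(record):
    """Same key, but age is released only as a decade range (30s, 50s, ...)."""
    decade = record["age"] // 10 * 10
    return (decade, record["zip"], record["sex"])


def match_risks(records, key):
    """Per-record risk: 1 / (number of released records sharing the key).

    Assumes the intruder picks uniformly among matching records.
    """
    counts = Counter(key(r) for r in records)
    return [1 / counts[key(r)] for r in records]


exact_risks = match_risks(records, exact_key)
coarse_risks = match_risks(records, coarse_key)

for record, before, after in zip(records, exact_risks, coarse_risks):
    print(record, f"exact={before:.2f}", f"coarsened={after:.2f}")
```

A record that stays unique even after coarsening still carries a risk of 1, which is one reason agencies look at per-record risks and not only the average.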
Reiter, winner of the 2006-07 Alumni Distinguished Undergraduate Teaching Award, is also working to protect data through what he calls “a radical, crazy idea that’s actually catching on: releasing simulated data rather than genuine data. In its extreme form – and this hasn’t been done yet – we can build a statistical model that tries to capture the relationships in all the data and then simulate new data from that model and release that simulated data. If we do a good job in capturing the relationships between the data, then the person who analyzes the simulated data should still get similar answers.
“A variant approach that’s actually being used is to simulate just the sensitive values in a dataset. The Census Bureau’s Survey of Income and Program Participation is using simulated datasets to gauge the effectiveness of public-assistance programs. The bureau is going to use simulated values for the data on people living in group quarters – prisons, dormitories, shelters – in its American Community Survey. Agencies in Germany, Canada, New Zealand and Australia are trying these methods out.”
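The “extreme form” Reiter describes can be sketched in a few lines, under heavy simplifying assumptions. The example below is illustrative only and is not the method used by any agency: it invents a stand-in “confidential” dataset of education and income, fits a bivariate normal model to it, and releases draws from that model instead of the real records. The variable names and the helpers `fit_bivariate_normal` and `synthesize` are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_bivariate_normal(xs, ys):
    """Summarize the data with means, standard deviations and a correlation."""
    return {
        "mx": statistics.fmean(xs),
        "my": statistics.fmean(ys),
        "sx": statistics.stdev(xs),
        "sy": statistics.stdev(ys),
        "r": pearson(xs, ys),
    }


def synthesize(model, n, rng):
    """Draw n new (x, y) pairs from the fitted model: x first, then y given x."""
    cond_sd = model["sy"] * (1 - model["r"] ** 2) ** 0.5
    slope = model["r"] * model["sy"] / model["sx"]
    rows = []
    for _ in range(n):
        x = rng.gauss(model["mx"], model["sx"])
        y = rng.gauss(model["my"] + slope * (x - model["mx"]), cond_sd)
        rows.append((x, y))
    return rows


rng = random.Random(1)

# Invented stand-in for confidential data:
# (years of education, income in thousands of dollars).
real = []
for _ in range(2000):
    education = rng.gauss(14, 2.5)
    real.append((education, 10 + 4 * education + rng.gauss(0, 12)))

real_edu = [e for e, _ in real]
real_inc = [i for _, i in real]
model = fit_bivariate_normal(real_edu, real_inc)
synthetic = synthesize(model, len(real), rng)

real_corr = pearson(real_edu, real_inc)
synth_corr = pearson([e for e, _ in synthetic], [i for _, i in synthetic])
print(f"correlation in real data:      {real_corr:.3f}")
print(f"correlation in synthetic data: {synth_corr:.3f}")
```

No synthetic row corresponds to a real respondent, yet an analyst studying the education-income relationship gets a similar answer. The catch is the one Reiter names: the released data are only as good as the model, so any relationship the model fails to capture is lost.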
Reiter worked for two years as an actuary before getting his Ph.D. from Harvard. Another of his research interests is missing data. “Pretty much every analysis, especially in the social sciences, is plagued by missing data. Researchers and practitioners often pick up the rug and brush the missing data under it, and analyze just the cases where there is complete information. That can cause a lot of problems. For example, if people in a sample who make large incomes decide not to tell you what their incomes are, and you compute the average income from the people who responded, your average is going to be too small.
“An idea I work with is called multiple imputation for missing data. The idea is to build a statistical model that explains the data well and then impute missing values. For example, if you don’t know someone’s income, but you know where they live, and you know the value of houses in that neighborhood, you can impute the value of their income.”
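The income example can be worked through in code. What follows is a simplified sketch on invented data, not Reiter’s procedure: the simulated survey, the dollar figures and the helpers `simulate_survey`, `fit_line` and `multiply_impute_mean` are all assumptions made for illustration. People in more expensive houses are made less likely to report income, so the complete-case average comes out too small; imputing from a regression on house value, several times over with random noise, pulls the estimate back toward the truth. A full multiple-imputation analysis would also draw the regression parameters themselves and combine variances using Rubin’s rules, which this sketch omits.

```python
import random
import statistics


def simulate_survey(n, rng):
    """Invented survey rows: (house value, income, income_reported), in $1,000s."""
    rows = []
    for _ in range(n):
        house = rng.uniform(100, 500)
        income = 20 + 0.2 * house + rng.gauss(0, 10)
        # Assumption: nonresponse rises with house value (from 10% to 80%).
        p_missing = 0.1 + 0.7 * (house - 100) / 400
        rows.append((house, income, rng.random() > p_missing))
    return rows


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (len(xs) - 2)) ** 0.5


def multiply_impute_mean(rows, m, rng):
    """Average the mean income over m completed datasets."""
    seen = [(h, inc) for h, inc, reported in rows if reported]
    intercept, slope, sigma = fit_line([h for h, _ in seen], [i for _, i in seen])
    estimates = []
    for _ in range(m):
        completed = [
            inc if reported else intercept + slope * h + rng.gauss(0, sigma)
            for h, inc, reported in rows
        ]
        estimates.append(statistics.fmean(completed))
    return statistics.fmean(estimates)


rng = random.Random(0)
rows = simulate_survey(5000, rng)

# Known only because the data are simulated; a real analyst never sees this.
true_mean = statistics.fmean(inc for _, inc, _ in rows)
complete_case_mean = statistics.fmean(inc for _, inc, rep in rows if rep)
mi_mean = multiply_impute_mean(rows, m=20, rng=rng)

print(f"true mean income:       {true_mean:.1f}")
print(f"complete-case estimate: {complete_case_mean:.1f}")
print(f"multiple-imputation:    {mi_mean:.1f}")
```

The correction works here because nonresponse depends on house value, which is observed for everyone. If it depended only on income itself, the house-value model could not fully repair the bias.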
Reiter’s Web site states: “I enjoy collaborating on data analyses with researchers who are not statisticians, particularly in the social sciences.” He is working with Medical School Professor Rochelle Schwartz-Bloom on a study of new ways of teaching high school chemistry and biology and with Nicholas School of the Environment and Earth Sciences Associate Professor Marie Lynn Miranda on a study of disparities in birth outcomes between blacks and whites.
“Statisticians need data,” Reiter says, putting in a plug for his department’s Statistical Consulting Center, where statistics faculty and Ph.D.s offer help to faculty, students and staff on research involving statistical methods. “We’re constantly looking to be involved with social scientists, natural scientists, political scientists. We are open for business.”
Jerry Oster is director of communications for the Trinity College of Arts & Sciences.