Banning noise will be a disaster for statistical data products - Ted is writing things
Last week, the United States Department of Commerce issued an order declaring that "noise infusion" will be banned from all statistical products published by the Census Bureau and the Bureau of Economic Analysis.
What does it mean, and why should you care?
Context
Statistical products are a bunch of numbers published from a secret dataset. Often, that dataset contains confidential information, and it is important that the numbers don't reveal that information. The U.S. Census is a well-known example: the statistics are made public, but the contents of each form filled out by individual U.S. residents must stay secret.
Scientists have developed a number of techniques that can be used to publish useful statistics while protecting the privacy of the original data. This field is called disclosure avoidance in statistical communities. Here are a few of these techniques.
Suppression: removing data that doesn't pass certain thresholds (e.g. if a count of people is below 5, we don't publish it).
Coarsening (or generalization): making data attributes less precise (e.g. transform a county into its state, a date of birth into an age range, etc.).
Sampling: randomly removing some records from the dataset.
Swapping: taking attributes from different records and exchanging them randomly.
Contribution bounding: making sure that a single individual cannot contribute "too much" to a statistic by limiting their maximum impact.
Noise addition: adding a random number to statistics to hide their true value.
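To make the first four techniques a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the record layout, and the default threshold and range width are hypothetical, and real statistical agencies use far more careful variants (for instance, swapping is usually restricted to records that share some characteristics, not done uniformly at random as below).

```python
import random

def suppress(counts, threshold=5):
    """Suppression: drop any published count below the threshold."""
    return {key: n for key, n in counts.items() if n >= threshold}

def coarsen_age(age, width=10):
    """Coarsening: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def sample(records, rate, rng):
    """Sampling: keep each record independently with probability `rate`."""
    return [r for r in records if rng.random() < rate]

def swap(records, attribute, rng):
    """Swapping (simplified): shuffle one attribute's values across records."""
    values = [r[attribute] for r in records]
    rng.shuffle(values)
    return [{**r, attribute: v} for r, v in zip(records, values)]

# Hypothetical toy data, not real Census records.
rng = random.Random(0)
records = [
    {"age": 34, "county": "A"},
    {"age": 71, "county": "B"},
    {"age": 8, "county": "C"},
]

print(suppress({"A": 12, "B": 3}))   # {'A': 12}
print(coarsen_age(34))               # 30-39
print(sample(records, 0.5, rng))     # a random subset of the records
print(swap(records, "county", rng))  # same ages, counties shuffled
```

Note that none of these functions, on their own, comes with a formal guarantee about what an attacker can learn from the output.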
Some of these techniques, when combined, achieve a definition called differential privacy. This definition has a lot of nice fundamental properties and is widely considered the gold standard of privacy protection among scientists. To achieve it, scientists typically rely on a combination of contribution bounding and carefully-calibrated noise addition.
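As a rough illustration of how contribution bounding and calibrated noise fit together, here is a minimal sketch of a differentially private sum using the Laplace mechanism. It assumes each person contributes exactly one value, and the function names, bounds, and epsilon value are hypothetical choices made for this example. It is not the algorithm used for the 2020 Census, and naive floating-point noise sampling like this has known weaknesses, so treat it as a teaching aid only.

```python
import random

def clamp(value, lower, upper):
    """Contribution bounding: force a value into [lower, upper]."""
    return max(lower, min(upper, value))

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_sum(values, lower, upper, epsilon, rng):
    """Noisy sum of one-value-per-person data (illustrative sketch).

    After clamping, adding or removing one person changes the sum by
    at most max(|lower|, |upper|), so the noise scale is that
    sensitivity divided by epsilon.
    """
    bounded = [clamp(v, lower, upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(bounded) + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data: the 500 is clamped to 100 before summing.
rng = random.Random(42)
print(dp_sum([10, 25, 500], lower=0, upper=100, epsilon=1.0, rng=rng))
```

The two steps depend on each other: without the clamp, a single extreme value could shift the sum by an arbitrary amount, and no finite amount of noise would hide it. A smaller epsilon means more noise and stronger protection; a larger one means the opposite.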
From 1990 to 2010, the U.S. Census Bureau primarily relied on swapping for the decennial census. Then, they realized that this technique was actually very unsafe, and that it was pretty easy to reconstruct individual records using the published statistics. This is bad, because the Bureau is required by federal law to keep these records confidential. So they tried a few alternative approaches, and decided to adopt differential privacy for the 2020 Census: this was the one that kept the statistics most useful, while preventing these attacks.
It bears repeating: differential privacy wasn't chosen because the math was nice and compelling[1]. It was selected because among the different options that mitigated the attack, it was the one that preserved the most utility. Its exact privacy parameters were chosen not because they provided rock-solid provable guarantees, but because they squeezed the most usefulness out of the data while reaching an acceptable level of privacy protection.
Sadly, "preserved the most utility under newly-discovered privacy constraints" did not mean "preserved as much utility as the 2010 Census": the numbers got less accurate, and the inaccuracies got a lot more transparent, and therefore impossible to ignore. This made a number of people very angry.
Demographers and social scientists could no longer ignore that the data they were working with was noisy data. This required a major shift in how they conceptualized and worked with this data.
People who were using Census data to actually reconstruct records could no longer do so. Demographers admitted that this was common practice. It's also an open secret that this was done by political operatives as part of gerrymandering efforts.
Phew, that was a lot of context.
What does the order say?
The administration has now decided that noise infusion is no longer an acceptable disclosure avoidance technique.
The order clearly targets differential privacy, but also seems to impact other techniques that involve randomness: the text explicitly mentions that coarsening should always be preferred, falling back to suppression as a "last resort". I have no idea why the order is so specific. Maybe they wanted to make sure the scientists working at the U.S. Census couldn't still use similar techniques without calling them differential privacy?
The order also carefully says it "shall not be interpreted to conflict with any constitutional, statutory, regulatory, or other legal provision". So the confidentiality obligations surrounding these statistical products still apply.
What will it mean in practice?
The consequences will be dire for utility or for privacy, and possibly both. It's hard to overstate this point: future statistical releases will either be useless compared to past ones, or they will be incredibly unsafe.
For starters, taking away useful tools from the disclosure avoidance toolbox will always lead to more painful privacy/utility trade-offs....