The Privacy Puzzle
Reedies win 1st place in national competition for innovative new privacy algorithm.
A team of Reed statistics students won first place in a prestigious national competition for an innovative algorithm that helps researchers glean information from datasets without compromising individual privacy.
Zeki Kazan ’20, Kaiyan Shi ’20, and Simon Couch ’21 won the Undergraduate Statistics Research Project Competition for their project, “,” which outlines a new algorithm for hypothesis testing that upholds the privacy of the underlying data. In fact, their technique is twice as powerful as the standard private method, meaning that it requires less than half as much data to achieve the same statistical power.
Simon was actually sitting in a statistics class with Prof. Kelly McConville when he heard the news. “I was so surprised!” he told us. “I felt so much excitement and pride and thankfulness to have this opportunity.”
The project was advised by Prof. , Prof. Anna Ritz, and Prof. , who says he was not surprised at all. “The productivity of this group was incredible,” he says. “I knew the quality of their work would be immediately apparent to the judges. This is an original solution to a real scientific problem.”
Simply put, the problem is that big databases hold immense promise for answering scientific questions, but many organizations won’t allow researchers access to them because of the risk of an inadvertent breach of privacy—even when obvious markers like name and address have been stripped away. In 2014, for example, the New York City Taxi and Limousine Commission released a giant database of taxi rides in response to a freedom-of-information request. The commission attempted to anonymize the data, but enterprising journalists were able to piece together various clues to .
To understand the Reed project, you need to know that statisticians often compare two sets of data using a tool known as a hypothesis test. Each hypothesis test requires a certain amount of data before it can detect a relationship between the two sets. The less data it needs, the more statistical power it has.
Now to go deeper.
There are many different types of hypothesis tests. The Reed team focused on the Wilcoxon Signed-Rank Test, which is commonly used for paired-sample data, where there is a natural association between the two sets (e.g., a patient’s blood pressure before and after watching a horror movie). It compares the two sets to determine whether there is a statistically significant difference between them.
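For readers who want to see the mechanics, here is a minimal sketch of the standard (public, non-private) Wilcoxon signed-rank statistic in Python. It is not the team's private algorithm, which this article does not describe in detail. The function name `signed_rank_statistic` and the blood-pressure readings are made up for illustration.

```python
def signed_rank_statistic(before, after):
    """Return (W_plus, W_minus), the signed-rank sums for paired samples."""
    # Differences within each pair; pairs with no change are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    # Order the differences by absolute size, smallest first.
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        # Tied values share the average of the 1-based ranks they span.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    # Sum the ranks separately for increases and decreases.
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical blood-pressure readings for six patients (illustrative only).
before = [120, 118, 130, 125, 122, 128]
after = [126, 117, 139, 129, 120, 135]
w_plus, w_minus = signed_rank_statistic(before, after)
print(w_plus, w_minus)
```

When one of the two sums is far larger than the other, as it is for these made-up readings, most patients moved in the same direction. The test turns that imbalance into a p-value by comparing it with what chance alone would produce.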
The team reworked the Wilcoxon test to ensure privacy and employed an innovative technique to reduce the amount of data it required. With these two seemingly simple tweaks, the enhanced algorithm turned out to be much more powerful, with significant real-world implications. When tested, their model had a statistical power much closer to that of public-setting tests, achieving the same power with only 40% of the data required by the earlier private-setting model. Because of this increased efficiency, the Reed algorithm can be used on smaller datasets, whereas previous models required enormous quantities of data.