A Crack in the Safe of Genomic Studies

As genotyping becomes cheaper and more routine, optimism about its medical benefits is laced with paranoia about genetic privacy. Personal genetics businesses promise the tightest security for their customers’ DNA test results, electronic medical records build in layers of encryption and protection, and the 2008 Genetic Information Nondiscrimination Act forbids insurance companies and employers from accessing and using individuals’ genotypes. But scientists have discovered a loophole that could open thousands of study volunteers to violations of their genetic privacy: the very studies that have made it possible to start interpreting the association between genetics, traits, and disease.

Genome-wide association studies, known more concisely as GWAS, are a favorite tool of scientists looking for important genetic variants associated with increased risk for disease. Since the first GWAS was published in 2005, the technique has been used to look for genetic associations with everything from cancer and asthma to body mass index and psychiatric disorders. In a typical GWAS, hundreds or thousands of volunteers consent to the use of their genetic information, in the form of single nucleotide polymorphisms, or SNPs, measured by a gene chip test. That data is anonymized, pooled, and compared (between people who have a particular disease and people who do not, for example) to find gene variants that increase or lower the risk of a particular trait.

Because these studies are costly and time-consuming, the U.S. government decided that the results of all federally funded GWAS would be made freely available online for scientific use. Genetics researchers could sign on to the dbGaP website, download GWAS results for a given disease, and use that data in their own research. To protect the identity of the study participants, only the results of each study were made available, not the individual genotype data.

But in 2008, a study intended to improve police forensics became a cautionary tale for GWAS researchers. To determine if a particular person had been at a crime scene, Nils Homer and colleagues developed a method of testing whether a single person’s genotype was present in a mixture of genetic information collected at the scene by investigators. Not only did they succeed in finding a way to tell whether a person was at a scene or not, they found it was sensitive enough to detect a single individual in a mixture of as many as 1,000 different people.

“That had immediate implications for GWAS studies, because in 2008, that was a typical study size,” said Hae Kyung Im, research associate (assistant professor) of Biostatistics at the University of Chicago Medicine. “Nobody really thought that by publishing the results from these studies you could be revealing private information about the study participants. This study was the first one that showed that.”
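The idea behind that kind of test can be sketched in a few lines of Python. This is a simplified simulation in the spirit of the distance statistic Homer and colleagues described, not their implementation: at each SNP it asks whether the person's genotype sits closer to the pooled mixture than to a separate reference group, then averages that difference over many SNPs. All names, the pool size of 50, the 15,000 SNPs, and the simulated allele frequencies are assumptions chosen to keep the example quick to run.

```python
import math
import random

rng = random.Random(0)
n_snps, n_people = 15000, 50  # toy sizes (assumed), far smaller than a real study

# Hypothetical population allele frequencies, one per SNP.
true_freq = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]


def draw_person():
    """Simulate one genotype: allele fraction 0, 0.5 or 1 at each SNP."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in true_freq]


def pooled_freq(group):
    """Per-SNP allele frequency of a pooled group (all that gets 'published')."""
    return [sum(column) / len(group) for column in zip(*group)]


def membership_t_stat(person, mixture_freq, reference_freq):
    """Average of D_j = |Y_j - Ref_j| - |Y_j - Mix_j|, divided by its standard error.

    Large positive values suggest the person's DNA is part of the mixture.
    """
    d = [abs(y - r) - abs(y - m)
         for y, m, r in zip(person, mixture_freq, reference_freq)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)


mixture = [draw_person() for _ in range(n_people)]    # the pooled sample
reference = [draw_person() for _ in range(n_people)]  # independent comparison group
outsider = draw_person()                              # someone in neither group

mix_freq = pooled_freq(mixture)
ref_freq = pooled_freq(reference)

t_member = membership_t_stat(mixture[0], mix_freq, ref_freq)
t_outsider = membership_t_stat(outsider, mix_freq, ref_freq)
print("member t-statistic:  ", round(t_member, 2))
print("outsider t-statistic:", round(t_outsider, 2))
```

Each person shifts the pooled frequencies by only a tiny amount, but the shift points the same way at every SNP they carry, so it accumulates. That is why a denser SNP panel lets the same test reach into much larger pools than the 50 people simulated here.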

As a result, dbGaP withdrew open access to GWAS results, requiring researchers to jump through a multitude of bureaucratic hoops to obtain the previously unrestricted data. Meanwhile, the National Human Genome Research Institute looked for a more secure way to once again share GWAS data, settling on a more complex summary of the results: regression coefficients. One of the advisors to the NHGRI, Nancy Cox, professor of human genetics at the University of Chicago, asked Im to test whether sharing regression coefficients would patch up the privacy breach.

“Actually my goal was to show that you couldn’t reconstruct private information, and then I found that you could,” Im said. “Before making regression coefficients publicly available, they wanted to be very sure that it was okay, but unfortunately the result was in the opposite direction.”

In a paper published in the American Journal of Human Genetics, Im found that a person armed with an individual’s genotype and the regression coefficients from a completed GWAS could determine with high certainty whether that individual had participated in the study or not. Even more concerning, the numbers could be untangled to say whether the individual was in the disease or control group, potentially revealing specific medical information. Collecting multiple phenotypes from individuals, which is standard procedure with increasingly popular proteomics and transcriptomics research, made it even easier to unravel an individual’s participation. But even reporting the simplest result (whether a particular gene variant increases or decreases the risk of a disease or trait) was enough of a breach for someone with shady motives to exploit.
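A toy simulation gives some intuition for why regression coefficients can leak. It is not the statistic used in the paper; it is a simplified illustration under loudly stated assumptions: a trait that has nothing to do with genotype, ordinary least-squares slopes computed one SNP at a time, per-SNP mean allele counts published alongside the slopes, and far more SNPs than participants. Every name and number below is hypothetical.

```python
import math
import random

rng = random.Random(1)
n_people, n_snps = 100, 5000  # toy sizes (assumed): many more SNPs than people

freqs = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]


def draw_person():
    """Allele count (0, 1 or 2) at each SNP."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]


study = [draw_person() for _ in range(n_people)]      # study participants
outsiders = [draw_person() for _ in range(n_people)]  # people not in the study
trait = [rng.gauss(0, 1) for _ in range(n_people)]    # trait unrelated to genotype
trait_mean = sum(trait) / n_people

# The "published" summary: per-SNP mean allele count and regression slope.
snp_means, slopes = [], []
for j in range(n_snps):
    col = [person[j] for person in study]
    mean_j = sum(col) / n_people
    sxx = sum((x - mean_j) ** 2 for x in col)
    sxy = sum((x - mean_j) * (y - trait_mean) for x, y in zip(col, trait))
    snp_means.append(mean_j)
    slopes.append(sxy / sxx if sxx > 0 else 0.0)


def leak_score(person):
    """Sum over SNPs of slope * (genotype - published mean)."""
    return sum(b * (x - m) for b, x, m in zip(slopes, person, snp_means))


def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


member_scores = [leak_score(p) for p in study]
outsider_scores = [leak_score(p) for p in outsiders]

print("typical |score|, participants:", round(mean_abs(member_scores), 1))
print("typical |score|, outsiders:   ", round(mean_abs(outsider_scores), 1))
print("score vs. trait correlation among participants:",
      round(correlation(member_scores, trait), 3))
```

In this setup each participant's own trait value is baked into every slope, so summing across thousands of SNPs recovers it: participants get scores that are much larger in magnitude than outsiders' and that track their trait values closely. A participant whose trait sits near the study average would be harder to spot from one trait alone, which is consistent with the article's point that multiple phenotypes make identification easier.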

The results indicate that genotype information, no matter how it is statistically disguised, is just too rich and unique a “personal barcode” to ever be made foolproof for privacy concerns. Other scientists discussing the topic have compared protecting genetic data to the futility of trying to completely anonymize a person’s picture: “Because digital cameras can in an instant nearly perfectly capture all of the features of one’s face and because it is basically free to collect such information,” they wrote, “there is no way that images of one’s face can be protected.” However, that unavoidable security flaw shouldn’t completely scuttle the sharing of genomic study data, Im said.

“With all this information out there, it’s just hopeless to try to maintain privacy in that sense,” Im said. “But we want to make research available to the whole community so that science goes faster.”

To find the best balance between privacy concerns and bureaucratic obstacles, Im and her co-authors recommended a data sharing program with a regulatory twist. Instead of “open access,” think of it as “certified access” — scientists would apply once a year for a license to access the GWAS database, promising to use the data only for research purposes and to maintain the anonymity of the study participants.

“Our proposal was to say that as long as a scientist promises to use the data for research and do no evil, they should be granted access,” Im said. “There is risk of losing privacy, but I think it’s a small risk compared to all the gain.”


Im, H., Gamazon, E., Nicolae, D., & Cox, N. (2012). On Sharing Quantitative Trait GWAS Results in an Era of Multiple-omics Data and the Limits of Genomic Privacy. The American Journal of Human Genetics, 90(4), 591-598. DOI: 10.1016/j.ajhg.2012.02.008

About Rob Mitchum
Rob Mitchum is communications manager at the Computation Institute, a joint initiative between The University of Chicago and Argonne National Laboratory.