Research at the Petascale: The Challenge of Processing One Million Genomes

Most computer users are familiar with a handful of data storage measurements: kilobytes, megabytes, gigabytes. If you have a big digital music or movie collection, you might even have a hard drive measuring in the terabytes. But what about a petabyte? Even if you know the basic formula for calculating storage sizes (a petabyte is about 1,000 terabytes), it’s hard to imagine how you could ever fill up that much space now that cloud storage and streaming entertainment services such as Netflix and Spotify are gradually eliminating the need to store lots of data on the desktop.

Researchers in biomedicine, the basic sciences, engineering and business, however, are already amassing data by the petabyte, and with it comes the need for computing infrastructure with the horsepower to process it all. Computing at the “petascale,” which refers to both the storage needs and the processing power to crunch all that data, is one of the big challenges being addressed by the Computation Institute, a joint initiative between the University of Chicago and Argonne National Laboratory to advance science through innovative computing approaches. Earlier this month the Computation Institute hosted a seminar for “Petascale Day,” featuring speakers from a variety of disciplines talking about the promise of computing at a scale roughly one million times bigger and faster than your laptop.

One of the biggest drivers of petascale computing is genetic research. Robert Grossman, professor in the Department of Medicine’s Section of Genetic Medicine and a senior fellow at the Computation Institute, spoke about the “million genome challenge,” or compiling the complete genomes of one million patients. Grossman said that this would provide enough variation to begin to understand the genetic causes of disease, but it is also a huge data problem, one even bigger than the petascale. One million genomes could generate 1,000 petabytes, or an exabyte, of data.
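For readers who want to check the arithmetic, here is a minimal Python sketch of how those units relate. It assumes decimal (SI) units, where each step up is a factor of 1,000. The per-genome figure it produces, about one terabyte, is simply what the numbers above imply when divided out; it is not a figure Grossman stated, and the function and variable names are made up for illustration.

```python
# Decimal (SI) storage units, in bytes. Assumption: 1,000x per step,
# matching the "a petabyte is about 1,000 terabytes" rule of thumb.
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18

# Figures quoted in the article.
GENOMES = 1_000_000
TOTAL_BYTES = 1_000 * PETABYTE


def bytes_per_genome(total_bytes: int, genomes: int) -> int:
    """Average storage implied per genome (hypothetical helper)."""
    return total_bytes // genomes


per_genome = bytes_per_genome(TOTAL_BYTES, GENOMES)

# 1,000 petabytes is exactly one exabyte in decimal units.
print("Total equals one exabyte:", TOTAL_BYTES == EXABYTE)
# Implied average per genome, expressed in terabytes.
print("Terabytes per genome:", per_genome / TERABYTE)
```

In other words, the exabyte estimate works out to roughly a terabyte of data per patient, about the capacity of a large consumer hard drive.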

“We can do genomically-inspired diagnosis, genomically-inspired treatment with specifically targeted drugs, and develop a genomically-inspired understanding of the pathways that contribute to this, but it’s a nontrivial, petascale data problem,” he said.

Grossman spoke about some of the projects he and his colleagues are working on to develop the kind of computing power and software to analyze petabyte- and exabyte-sized datasets. One of these, the Open Science Data Cloud, is a cloud-based infrastructure where researchers can upload their data and make it publicly available for analysis. They’re also working on building a secure cloud infrastructure that is compliant with HIPAA and HITECH regulations for protected data.

Grossman said that while the challenge of big data may seem daunting, it’s also an exciting opportunity for researchers to get in at the ground level.

“We have this problem in science right now. If you want to look at exabyte-scale science, you don’t have a lot of choices. You could go to Google. You could go to maybe a couple of other companies and do exabyte-scale computing, but if you want to do exabyte-scale science, it’s much harder,” he said. “If you’re beginning and just looking at how to have an impact, then big data and the million genome challenge is a really interesting way to do that because it’s wide open.”

For more on the Computation Institute and Petascale Day, check out their ScaleOut blog for posts on the hardware needed to compute at this level, and the applications to crunch the data.

About Matt Wood
Matt Wood is a senior science writer and manager of communications at the University of Chicago Medicine & Biological Sciences Division.