Shoveling Science’s Data Avalanche

A Google data center in Oregon 


As scientific technology advances in leaps and bounds, the amount of data generated by innovations such as sophisticated gene sequencers, electronic medical records, and high-resolution brain scans threatens to drown the scientists who operate them in numbers and bytes. The idea that a researcher can sit down with a spreadsheet, a statistics program and a thermos of coffee to do the data analysis for their experiment may soon be as outdated as running calculations with a quill pen and an abacus. So how will scientists find the needles in data haystacks that grow ever more monstrous?

The solution, said UIC’s Robert Grossman in a visiting lecture at the University of Chicago Monday, may lie in the same technology that – at its most basic – allows you to check your e-mail from every internet-ready computer in the world. Cloud computing, one of the tech industry’s biggest buzzwords at the moment, could clear the way for computation that is powerful, fast, and affordable enough to be useful for research that is increasingly too “data-intensive” to be handled in-house, Grossman said.

“Partly what’s going on is the amount of data in all these different dimensions is growing faster than our ability to manage it,” Grossman said. “The first time you can’t turn around and look at imaging data and medical records and all the different modalities of data that you can’t handle in your own lab, then your discipline has become data-intensive.”

Grossman knows whereof he speaks; he’s the director of the Laboratory for Advanced Computing at the University of Illinois at Chicago. Speaking to a roomful of scientists as part of the Chicago Biomedical Consortium – an ongoing collaboration between the University of Chicago, Northwestern University, and UIC – Grossman was thrilled to deliver an optimistic message to scientists facing their own data problems: there is a solution, and he can help.

Unfortunately, science is a little slow out of the blocks, Grossman said, as large government organizations such as the National Science Foundation and the Department of Energy that usually conduct large-scale computations have until recently retained a focus on more inflexible grid-style computing. Meanwhile, internet giants Google, Yahoo and Amazon have thrown hundreds of millions of dollars into data centers capable of the more nimble forms of computation afforded by cloud computing. All of us are familiar with the results of that investment: every time we log into Gmail, we’re accessing the Google cloud, and every time Amazon predicts a book we would be interested in or Yahoo customizes an advertisement based on our browsing history, that’s the cloud too.

Grossman emphasized that the cloud can help scientists do more than check their e-mail, particularly researchers without easy access to very expensive, massive computer grids. Smaller clouds devoted to scientific computing are forming at Grossman’s UIC center and at national laboratories such as Argonne, and pieces of the massive internet company clouds can be rented by anyone with a credit card (and considerable experience in computer programming, I imagine). Recently, University of Chicago researcher Kevin White has used Grossman’s Cistrack cloud to analyze the data collected in the modENCODE project, an effort to build a library of regulatory elements that control gene expression in fruit flies over the course of their entire lives – a data set far too large for traditional analysis.

While there’s still much to be done to create data centers for science that rival the data centers for Google ad placement, Grossman said that Chicago’s scientists are privileged to be near the epicenter of computer experts collaborating to tailor both the cloud and the grid to answer research questions.

“We’re the center for this because we all talk to each other and like each other,” Grossman said. “We’re at a critical mass in Chicago.”

Rob Mitchum is communications manager at the Computation Institute, a joint initiative between The University of Chicago and Argonne National Laboratory.

