At its root, cancer is a complex genetic disease. To develop new therapies, researchers must identify the mutations that cause cancer and figure out their biological roles. This is no easy task. There are more than 100 different types of cancer, each with distinct genetic fingerprints that can vary on an individual basis. For significant progress to be made, the genomes of many different types of cancer from many different patients must be sequenced and compared. Advances in sequencing technology, coupled with decreasing costs, have allowed scientists to generate an enormous amount of tumor data. The National Cancer Institute has funded several large-scale projects aimed at this task, with great success.
However, this wealth of data has come with limitations. These data are gathered by different research groups, with different technologies and protocols. They’re stored in different locations, using different software and management systems. They’re complex and just plain huge. A cancer researcher would need millions of dollars, several years and a dedicated team to set up the infrastructure necessary to analyze these datasets. Just downloading the data can take months. This has impeded research at all but the largest groups and institutions, and has stymied collaboration.
This is a big problem, but it’s one with a solution. The University of Chicago and the National Cancer Institute (NCI) are collaborating to establish the Genomic Data Commons, or GDC. It’s a first-of-its-kind facility that will comprehensively store cancer genomic data generated through NCI-funded research programs. Not only will the GDC centralize data, it will harmonize it – making sure that it’s all compatible and can be easily used and shared.
“The Genomic Data Commons has the potential to transform the study of cancer at all scales,” said Robert Grossman, Ph.D., Director of the GDC and professor in the Department of Medicine at the University of Chicago. “It supplies the data so that any researcher can test their ideas, from comprehensive ‘big-data’ studies to genetic comparisons of individual tumors to identify the best potential therapies for a single patient.”
With the GDC, any talented researcher will have access to the data they need to make discoveries. This effectively democratizes the information and dramatically speeds up the pace of research. Analyses that would have taken years and millions of dollars can be completed in days or even hours.
The GDC will use a modern, expandable informatics framework to handle these data, through an approach similar to what’s used by companies such as Google and Facebook. Initially, it will standardize and house datasets including the entirety of large-scale NCI programs such as The Cancer Genome Atlas (TCGA). These programs represent about 2.2 petabytes of legacy data, and the GDC will add another petabyte or more of storage each year to accommodate new NCI projects (for reference, one petabyte of music in .mp3 format would take about 2,000 years to play).
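The playback comparison can be checked with some quick arithmetic. The sketch below is only an illustration: the 128 kbit/s bitrate and the decimal definition of a petabyte (10^15 bytes) are assumptions, not figures from the GDC or NCI, and the function name is made up for this example.

```python
# Assumptions (not from the article): MP3 audio at a constant 128 kbit/s,
# and one petabyte taken as 10**15 bytes.
BITRATE_BITS_PER_SECOND = 128_000
PETABYTE_BYTES = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def playback_years(size_bytes, bitrate_bits_per_second=BITRATE_BITS_PER_SECOND):
    """Return how many years it would take to play size_bytes of audio."""
    seconds = size_bytes * 8 / bitrate_bits_per_second
    return seconds / SECONDS_PER_YEAR


years_per_petabyte = playback_years(PETABYTE_BYTES)
print(f"One petabyte of MP3 audio: about {years_per_petabyte:,.0f} years of playback")
```

Under these assumptions the result comes out just under 2,000 years, in line with the figure quoted above. A higher bitrate would shorten the playback time proportionally.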
The GDC is also an essential step on the road to precision medicine for cancer. A primary goal is to eventually allow any doctor to upload tumor data from individual patients and compare it against the thousands of other tumors in the GDC. By linking genetic data to extensive clinical information from patients and their response to treatment, the GDC will one day help clinicians find the most effective drug or course of treatment for individual genomes.
The GDC also creates a foundation for cloud-based cancer research, helping researchers analyze large-scale datasets and perform experiments remotely. And the open-source software being developed by the GDC has the potential to become a model for data-intensive research efforts for other diseases such as Alzheimer’s and diabetes, which desperately need similar large-scale, data-driven approaches to develop cures.
“The GDC is absolutely needed,” said Jean Zenklusen, MS, PhD, director of the TCGA program office at NCI. “NCI’s goal for the GDC is to be a resource for all investigators to generate hypotheses and make new discoveries from the data.”
The GDC builds upon the Bionimbus Protected Data Cloud, a pilot cloud-based system developed by Grossman that was the first to be approved by the National Institutes of Health to hold cancer genomic data from projects such as TCGA.
To read more, visit http://www.uchospitals.edu/news/2014/20141202-gdc.html