NCI Cancer Genomics Cloud Pilots

Bringing data and computation together to create knowledge that accelerates cancer research and enables precision medicine

The traditional model for analyzing genomic data involves individual researchers downloading data stored at a variety of locations, adding their own data, attempting to harmonize the data, and then computing over these data on local hardware. This model has been successful for many years, but has become unsustainable given the enormous growth of biomedical data since the advent of large-scale scientific programs that use next-generation sequencing technology. The size of the data makes access and analysis difficult for anyone but the best-resourced institutions, in terms of both storage and computing capability.

Goals of the Cloud Pilots

The Cancer Genomics Cloud Pilots are designed to explore innovative methods for accessing and computing on large genomic data. They aim to bring data and analysis together on a single platform by creating a set of data repositories with co-located computational capacity and an Application Programming Interface (API) that provides secure data access. In this model, applications are brought to the data, rather than the data being brought to the applications. The goals of the Cloud Pilots are to democratize access to NCI-generated genomic and related data and to create a cost-effective way to provide computational support to the cancer research community.

The Cloud Pilots Program

Contracts to develop the Cloud Pilots were awarded to three organizations: the Broad Institute, the Institute for Systems Biology (ISB), and Seven Bridges Genomics. Each of these groups is developing infrastructure and a set of tools to access, explore, and analyze molecular data. Key design principles for the clouds include: APIs for secure tool and data access; usability for biologists and clinicians as well as bioinformaticists and application developers; scalability; sustainability; extensibility to new data types without major refactoring; and open-source, non-viral software licenses.

All three Cloud Pilots have chosen to implement their systems through commercial cloud providers and are collaborating on adopting common standards. Beyond these commonalities, the three project teams have distinct system designs, data presentation, and analysis resources to serve the cancer research community.

The Cloud Pilots will be available to researchers in early 2016.

The Cancer Genome Atlas Data

All three Cloud Pilots will host a core data set from The Cancer Genome Atlas (TCGA). TCGA is a comprehensive effort launched by NCI and the National Human Genome Research Institute in 2006 to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies to matched tumor-normal pairs. TCGA has collected tissue samples from, and is characterizing, 33 different types of cancer, including 10 rare cancers. TCGA has successfully demonstrated that a national, shared infrastructure for the generation of cancer genomic data, where individual labs pool their efforts and contribute their data, enables researchers to make and validate important discoveries and achieve economies of scale.

All three Cloud Pilots will host these core TCGA data:

| Data Type | Description |
|---|---|
| Clinical | Available clinical information for each participant (may include demographic information, treatment information, survival data, etc.) |
| Biospecimen | Information on how samples from each participant were processed by the Biospecimen Core Resource Center (BCR) |
| DNA-Seq | Whole exome sequence for both tumor and normal sample for each participant; whole genome sequence for select participants |
| RNA-Seq | mRNA sequence for each participant's tumor sample |
| SNP array | Probe signals for each participant's tumor sample |
| Mutations and Variations | Somatic and germline mutation calls for each participant |

TCGA is expected to generate approximately 2.5 petabytes (PB) of data by its projected completion in 2016. Maintaining local copies of all of the data is not feasible, and downloads can take weeks or months to complete. For precision medicine to move forward, data access and computing resources must be made available to the broadest set of researchers possible.
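
To give a rough sense of why downloads at this scale take weeks or months, the following minimal sketch estimates the transfer time for the projected 2.5 PB. Only the 2.5 PB figure comes from the text above; the link speeds, the decimal definition of a petabyte, the `efficiency` parameter, and the function name `transfer_days` are illustrative assumptions, not figures from the Cloud Pilots program.

```python
# Back-of-the-envelope transfer-time estimate for the projected TCGA data volume.
# The 2.5 PB figure comes from the text; the link speeds and the efficiency
# factor are illustrative assumptions, not measured values.

PETABYTE_BYTES = 10**15   # assumption: decimal (SI) petabyte
SECONDS_PER_DAY = 86_400


def transfer_days(size_pb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Return the days needed to move size_pb petabytes over a link_gbps link.

    efficiency is the fraction of the nominal link speed actually sustained
    (1.0 means the link is fully and continuously used).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency must be in (0, 1]")
    total_bits = size_pb * PETABYTE_BYTES * 8
    effective_bits_per_second = link_gbps * 10**9 * efficiency
    return total_bits / effective_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical sustained link speeds, in gigabits per second.
    for gbps in (1, 10):
        print(f"{gbps:>2} Gbit/s: {transfer_days(2.5, gbps):.1f} days")
```

Under these assumptions, a fully used 1 Gbit/s link would need about 231 days and a 10 Gbit/s link about 23 days, before accounting for storage costs or interrupted transfers.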

Meeting the Big Data Challenge: NCI Genomic Data Commons

The NCI Center for Cancer Genomics (CCG) was established to lead the NCI effort in generating critical datasets for cataloging the alterations seen in human tumors, coordinating data unification and sharing efforts, and supporting the development of analytical tools and computational approaches aimed at improving the understanding of large-scale, multidimensional data. The CCG supports several large-scale cancer genome research programs, including The Cancer Genome Atlas (TCGA) and the Office of Cancer Genomics (OCG). OCG includes two initiatives that support the molecular characterization of cancer: the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative and the Cancer Genome Characterization Initiative (CGCI).

The NCI Genomic Data Commons (GDC) is working with the Cloud Pilots to implement a comprehensive ecosystem for cancer genomics data that will serve as a cohesive model for large-scale genomic data management and analysis. The GDC will provide an authoritative NCI reference data set that the Cloud Pilots can access for high-performance computing.

Cloud Pilots and the Genomic Data Commons: A Comprehensive Infrastructure for Cancer Genomic Data

The Cloud Pilots and the GDC have complementary goals, and the implementation teams are working together to promote interoperability among these systems.

Together, the systems create a cohesive model for large-scale genomic data management and analysis:

  • Data are generated through the TCGA and other NCI-funded genomics programs.
  • Data are validated, aggregated, harmonized, stored, and made available for query and download through the GDC as the authoritative NCI cancer genomics dataset.
  • The Cloud Pilots provide the computational capacity to effectively analyze these data and allow researchers to bring their own data and tools to the cloud.

Benefits of the Programs

Together, the Cloud Pilots and the GDC provide the research community with many significant benefits:

  • Democratize access to high-quality standardized clinical, biospecimen, and molecular data
  • Enable researchers across the cancer community to access tools and to compute on large volumes of data, regardless of local resource constraints
  • Provide capabilities to search, visualize, and analyze researchers' own data in combination with TCGA data
  • Provide consistent, programmatic access to the data and the ability for researchers to bring their own tools to the data
  • Harmonize data and analysis pipelines for consistency and reproducibility
  • Ensure security and appropriate access to controlled data