NCI Cancer Genomics Cloud Pilots

Bringing data and computation together to create knowledge that accelerates cancer research and enables precision medicine


For upcoming events and conferences or mentions of the NCI Cloud in the news, visit the News and Events page.

The traditional model for analyzing genomic data involves individual researchers downloading data stored at a variety of locations, adding their own data, attempting to harmonize the data, and then computing over these data on local hardware. This model was successful for many years, but it has become unsustainable given the enormous growth of biomedical data since the advent of large-scale scientific programs that use next-generation sequencing technology. The size of the data puts access and analysis out of reach for all but the best-resourced institutions, because few others have the required storage and computing capability.

Goals of the Cloud Pilots

The Cancer Genomics Cloud Pilots are designed to explore innovative methods for accessing and computing on large genomic data. They aim to bring data and analysis together on a single platform by creating a set of data repositories with co-located computational capacity and an Application Programming Interface (API) that provides secure data access. In this model, applications are brought to the data, rather than bringing the data to the applications. The goals of Cloud Pilots are to democratize access to NCI-generated genomic and related data and to create a cost-effective way to provide computational support to the cancer research community.

The Cloud Pilots Program

Three contracts to develop the Cloud Pilots were awarded: to the Broad Institute, the Institute for Systems Biology (ISB), and Seven Bridges Genomics. Each of these groups is developing infrastructure and a set of tools to access, explore, and analyze molecular data. Key design principles for the clouds include: APIs for secure tool and data access; usability for biologists and clinicians as well as bioinformaticists and application developers; scalability; sustainability; extensibility to new data types without major refactoring; and open-source, non-viral software licenses.

All three Cloud Pilots have chosen to implement their systems through commercial cloud providers and are collaborating on adopting common standards. Beyond these commonalities, the three project teams have distinct system designs, data presentation, and analysis resources to serve the cancer research community.

The Cloud Pilots will be available to researchers in early 2016.

The Cancer Genome Atlas Data

All three Cloud Pilots will host a core data set from The Cancer Genome Atlas (TCGA). TCGA is a comprehensive effort launched by NCI and the National Human Genome Research Institute in 2006 to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies to matched tumor-normal pairs. TCGA has collected tissue samples from, and is characterizing, 33 different types of cancer, including 10 rare cancers. TCGA has successfully demonstrated that a national, shared infrastructure for the generation of cancer genomic data, where individual labs pool their efforts and contribute their data, enables researchers to make and validate important discoveries and achieve economies of scale.

All three Cloud Pilots will host these core TCGA data:

  • Available clinical information for each participant (may include demographic information, treatment information, survival data, etc.)
  • Information on how samples from each participant were processed by the Biospecimen Core Resource Center (BCR)
  • Whole exome sequence for both tumor and normal sample for each participant; whole genome sequence for select participants
  • mRNA sequence for each participant's tumor sample
  • SNP array: probe signals for each participant's tumor sample
  • Mutations and variations: somatic and germline mutation calls for each participant

TCGA is expected to have generated approximately 2.5 petabytes (PB) of data by its projected completion in 2016. Maintaining local copies of all of the data is not feasible, and downloads can take weeks or months to complete. For precision medicine to move forward, data access and computing resources must be made available to the broadest set of researchers possible.
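
To put the download figure in perspective, the short calculation below estimates how long it would take to transfer 2.5 PB over a single network link. It is only a rough sketch: it assumes decimal petabytes (10^15 bytes), a link that sustains its full nominal rate for the whole transfer, and illustrative link speeds that are not taken from the program description. The function name `transfer_days` and the `efficiency` parameter are hypothetical.

```python
PETABYTE = 10**15  # bytes; assumes the decimal (SI) definition


def transfer_days(size_bytes: float,
                  link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Return the days needed to move size_bytes over one link.

    efficiency is the assumed fraction of the nominal link rate that is
    actually sustained (1.0 means the ideal, best-case transfer).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link rate must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


tcga_bytes = 2.5 * PETABYTE  # projected TCGA volume quoted above

# Illustrative link speeds (assumptions, not measured values).
for label, rate in [("100 Mbit/s", 100e6),
                    ("1 Gbit/s", 1e9),
                    ("10 Gbit/s", 10e9)]:
    print(f"{label:>10}: {transfer_days(tcga_bytes, rate):8.1f} days")
```

Under these assumptions, a fully used 10 Gbit/s link needs a little over three weeks and a 1 Gbit/s link more than seven months, which is consistent with the "weeks or months" estimate. Real transfers share bandwidth and rarely sustain the nominal rate, so actual times would be longer.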

Meeting the Big Data Challenge: NCI Genomic Data Commons

In response to the need for data access, NCI funded the Genomic Data Commons (GDC) program, designed to provide the cancer research community with a single, unified data service that consolidates molecular and clinical data from all current and future NCI cancer genomics projects, including TCGA, and from other cancer genomics projects that endorse broad data sharing. The GDC, developed by the University of Chicago through a subcontract with Leidos Biomedical Research, supports the hosting and harmonization of genomic and clinical data from cancer research programs, and the application of state-of-the-art methods for generating derived data (e.g., mutation calls, structural variants). The GDC will be available in spring 2016 and will provide the community with resources such as web-based tools and a data portal for retrieving data from and submitting data to the GDC and for processing data through GDC bioinformatics pipelines, as well as APIs for programmatic access to the data. Resources will be maintained in a secure data center that includes user support and documentation.

Cloud Pilots and the Genomic Data Commons: A Comprehensive Infrastructure for Cancer Genomic Data

The Cloud Pilots and the GDC have complementary goals, and the implementation teams are working together to promote interoperability among these systems.

Together, the systems create a cohesive model for large-scale genomic data management and analysis:

  • Data are generated through the TCGA and other NCI-funded genomics programs.
  • Data are validated, aggregated, harmonized, stored, and made available for query and download through the GDC as the authoritative NCI cancer genomics dataset.
  • The Cloud Pilots provide the computational capacity to effectively analyze these data and allow researchers to bring their own data and tools to the cloud.

Benefits of the Programs

Together, the Cloud Pilots and the GDC provide the research community with many significant benefits:

  • Democratize access to high-quality standardized clinical, biospecimen, and molecular data
  • Enable researchers across the cancer community to access tools and to compute on large volumes of data, regardless of local resource constraints
  • Provide capabilities to search, visualize, and analyze researchers' own data in combination with TCGA data
  • Provide consistent, programmatic access to the data and the ability for researchers to bring their own tools to the data
  • Harmonize data and analysis pipelines for consistency and reproducibility
  • Ensure security and appropriate access to controlled data