
NCI Cancer Genomics Cloud Pilots

Bringing data and computation together to create knowledge that accelerates cancer research and enables precision medicine

Upcoming Events and Conferences

December 2-4, Scientific Workshop of the ICGC
The Broad Institute will present at the session 'Precision Medicine and Big Data.' Bring your questions about their cloud platform and pilot.

December 2nd, Medical Population Genetics Meeting
Have questions about the Broad Institute's FireCloud? The Broad Institute will present FireCloud at this meeting. Find out more at the Broad Institute Workshops page.

Wednesday, December 9th, 11 a.m. to 12 p.m. ET: CBIIT Speaker Series
Ilya Shmulevich, Ph.D., of the Institute for Systems Biology, one of the three Cloud Pilots, joins the series to discuss the ISB-CG platform. Learn more on the CBIIT Speaker Series Wiki.

The traditional model for analyzing genomic data involves individual researchers downloading data stored at a variety of locations, adding their own data, attempting to harmonize the data, and then computing over these data on local hardware. This model has been successful for many years, but has become unsustainable given the enormous growth of biomedical data since the advent of large-scale scientific programs that use next-generation sequencing technology. The size of the data makes access and analysis difficult for anyone but the best-resourced institutions, in terms of both storage and computing capability.

Goals of the Cloud Pilots

The Cancer Genomics Cloud Pilots are designed to explore innovative methods for accessing and computing on large genomic data. They aim to bring data and analysis together on a single platform by creating a set of data repositories with co-located computational capacity and an Application Programming Interface (API) that provides secure data access. In this model, applications are brought to the data, rather than the data being brought to the applications. The goals of the Cloud Pilots are to democratize access to NCI-generated genomic and related data and to create a cost-effective way to provide computational support to the cancer research community.

The Cloud Pilots Program

Three contracts to develop the Cloud Pilots were awarded: to the Broad Institute, the Institute for Systems Biology (ISB), and Seven Bridges Genomics. Each of these groups is developing infrastructure and a set of tools to access, explore, and analyze molecular data. Key design principles for the clouds include APIs for secure tool and data access; usability for biologists and clinicians as well as bioinformaticists and application developers; scalability; sustainability; extensibility to new data types without major refactoring; and open source, non-viral software licenses.

All three Cloud Pilots have chosen to implement their systems through commercial cloud providers and are collaborating on adopting common standards. Beyond these commonalities, the three project teams have distinct system designs, data presentation, and analysis resources to serve the cancer research community.

The Cloud Pilots will be available to researchers in early 2016.

The Cancer Genome Atlas Data

All three Cloud Pilots will host a core data set from The Cancer Genome Atlas (TCGA). TCGA is a comprehensive effort launched by NCI and the National Human Genome Research Institute in 2006 to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies to matched tumor-normal pairs. TCGA has collected tissue samples from, and is characterizing, 33 different types of cancer, including 10 rare cancers. TCGA has successfully demonstrated that a national, shared infrastructure for the generation of cancer genomic data, where individual labs pool their efforts and contribute their data, enables researchers to make and validate important discoveries and achieve economies of scale.

All three Cloud Pilots will host these core TCGA data:

  • Available clinical information for each participant (may include demographic information, treatment information, survival data, etc.)
  • Information on how samples from each participant were processed by the Biospecimen Core Resource Center (BCR)
  • Whole exome sequence for both tumor and normal sample for each participant; whole genome sequence for select participants
  • mRNA sequence for each participant's tumor sample
  • SNP array: probe signals for each participant's tumor sample
  • Mutations and Variations: somatic and germline mutation calls for each participant

TCGA is expected to generate approximately 2.5 petabytes (PB) of data by its projected completion in 2016. Maintaining local copies of all of the data is not feasible, and downloads can take weeks or months to complete. For precision medicine to move forward, data access and computing resources must be made available to the broadest set of researchers possible.
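
To make the download problem concrete, the following minimal Python sketch estimates how long it would take to transfer the full 2.5 PB over a single network link. Only the 2.5 PB figure comes from the text above. The link speeds, the decimal definition of a petabyte, and the function name transfer_days are illustrative assumptions, and real transfers would also be slowed by protocol overhead and shared bandwidth.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Return the days needed to move size_bytes over one network link.

    efficiency is the fraction of the nominal link speed actually
    achieved (1.0 means a perfectly sustained transfer).
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


# 2.5 PB as stated for TCGA; assumes decimal units (1 PB = 10**15 bytes).
TCGA_BYTES = 2.5e15

# Hypothetical institutional link speeds, chosen only for illustration.
ASSUMED_LINKS = {
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
    "10 Gbit/s": 10e9,
}

for name, bits_per_second in ASSUMED_LINKS.items():
    days = transfer_days(TCGA_BYTES, bits_per_second)
    print(f"{name:>10}: {days:8.1f} days")
```

Under these assumptions, a fully sustained 10 Gbit/s link needs about 23 days and a 1 Gbit/s link about 231 days, which is consistent with the weeks-to-months range described above.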

Meeting the Big Data Challenge: NCI Genomic Data Commons

In response to the need for data access, NCI funded the Genomic Data Commons (GDC) program. The GDC is designed to provide the cancer research community with a single, unified data service that consolidates molecular and clinical data from all current and future NCI cancer genomics projects, including TCGA, as well as from other cancer genomics projects that endorse broad data sharing. The GDC, developed by the University of Chicago through a subcontract with Leidos Biomedical Research, supports the hosting and harmonization of genomic and clinical data from cancer research programs, and the application of state-of-the-art methods for generating derived data (e.g., mutation calls, structural variants). The GDC will be available in spring 2016. It will provide the community with web-based tools and a data portal for retrieving data from and submitting data to the GDC and for processing data through GDC bioinformatics pipelines, as well as APIs for programmatic access to the data. Resources will be maintained in a secure data center that includes user support and documentation.

Cloud Pilots and the Genomic Data Commons: A Comprehensive Infrastructure for Cancer Genomic Data

The Cloud Pilots and the GDC have complementary goals, and the implementation teams are working together to promote interoperability among these systems.

Together, the systems create a cohesive model for large-scale genomic data management and analysis:

  • Data are generated through the TCGA and other NCI-funded genomics programs.
  • Data are validated, aggregated, harmonized, stored, and made available for query and download through the GDC as the authoritative NCI cancer genomics dataset.
  • The Cloud Pilots provide the computational capacity to effectively analyze these data and allow researchers to bring their own data and tools to the cloud.

Benefits of the Programs

Together, the Cloud Pilots and the GDC provide the research community with many significant benefits:

  • Democratize access to high-quality standardized clinical, biospecimen, and molecular data
  • Enable researchers across the cancer community to access tools and to compute on large volumes of data, regardless of local resource constraints
  • Provide capabilities to search, visualize, and analyze researchers' own data in combination with TCGA data
  • Provide consistent, programmatic access to the data and the ability for researchers to bring their own tools to the data
  • Harmonize data and analysis pipelines for consistency and reproducibility
  • Ensure security and appropriate access to controlled data
