NCI Biomedical Informatics Blog
The National Cancer Informatics Program (NCIP) will continue to make data portals and collections developed or supported as part of the NCI cancer Biomedical Informatics Grid® (caBIG®) program available to the biomedical-informatics and cancer-research community.
In some instances, the data portals and collections listed here were created using specific caBIG-developed applications as the basis for their structure and functionality. These include the caArray Data Portal, the caIntegrator Data Portal, the Cancer Genome Workbench, and the National Biomedical Imaging Archive (NBIA).
In other instances, caBIG program staff collaborated with other NCI divisions, offices, or centers, other NIH entities, academic centers, or private organizations to provide the informatics infrastructure needed to manage and disseminate large, complex data collections. This group includes the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) study, The Cancer Genome Atlas (TCGA), the Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging And moLecular Analysis (I-SPY I) trial, the Cancer Models Database (caMOD), the cancer Nanotechnology Laboratory (caNanoLab) Portal, the REpository for Molecular BRAin Neoplasia DaTa (REMBRANDT), the Cancer Genome Anatomy Project (CGAP), and the Pathway Interaction Database (PID).
Because several of these collections are dynamic (investigators are continuing to submit data to them), this list will be updated quarterly. Any new data collections created using NCIP resources will be listed here as well, so check this page regularly.
The NCI CBIIT instance of the caArray Data Portal provides researchers with access to public data sets generated by 183 microarray experiments. While most of the data sets are derived from samples of human disease, other organisms are represented in the collection: Drosophila melanogaster, Mus musculus, and Rattus norvegicus. Assay types include:
- Gene expression
- Comparative genomic hybridization (copy-number changes)
- Single-nucleotide polymorphism (SNP)
- Exon (alternative splicing)
Two experiments (woost-00035 for SNP profiling data; woost-00041 for transcript profiling data) provide profiling data for a panel of more than 300 cancer cell lines that GlaxoSmithKline (GSK) released to NCI in 2008. The SNP data and transcript data can be downloaded via FTP.
Cancer types represented in the caArray data sets are too numerous to be listed here, but researchers can quickly identify those of interest through a simple search. The caArray Data Portal search function is based on entering a keyword in conjunction with choosing a field from one of two categories: "Experiments" or "Samples." See the caArray User's Guide.
caArray is continuing to accept data submitted by members of the community.
The NCI CBIIT instance of the caIntegrator Data Portal provides researchers with centralized access to public genomic, clinical, and imaging data drawn from the following studies, each of which focuses on specific disease types:
- TARGET: childhood acute lymphoblastic leukemia (contains subject-annotation data focusing on survival for 255 subjects and microarray-based gene-expression data mapped to 207 subjects). TARGET is a project of the NCI Office of Cancer Genomics.
- TCGA Radiology Project: human glioblastoma multiforme (GBM) (contains subject-annotation data focusing on survival for 196 subjects, microarray-based gene-expression data mapped to 199 subjects, and 787 mapped MRI images). TCGA is jointly supported by the NCI and the National Human Genome Research Institute (NHGRI) and overseen by the NCI Office of Cancer Genomics, TCGA Program Office.
- I-SPY I Trial: locally advanced breast cancer (contains subject-annotation data focusing on survival for 149 subjects and two sets of microarray-based gene-expression data, the first mapped to 129 subjects and the second to 20 subjects). The I-SPY I trial was a collaboration between NCI and 10 academic Cancer Centers.
- The Director's Challenge Lung Study (DCLS): human lung adenocarcinoma (contains subject-annotation data focusing on survival for 497 subjects and microarray-based gene-expression data mapped to 462 subjects with associated clinical and pathological annotations). DCLS is a project of the NCI Division of Cancer Treatment and Diagnosis (DCTD) Cancer Diagnosis Program.
- TCGA GBM Project: human GBM (contains subject-annotation data focusing on survival for 557 subjects and five sets of microarray-based gene-expression data obtained using various platforms — the first set contains data from 100 samples mapped to subjects; the second, 275; the third, 226; the fourth, 296; and the fifth, 539). TCGA is jointly supported by the NCI and the National Human Genome Research Institute (NHGRI) and overseen by the NCI Office of Cancer Genomics, TCGA Program Office.
- Colon Cancer kNowledge Utility Toolbox (CoCANUT): primary human colon cancer, polyps, metastases, and matched normal mucosa (contains subject-annotation data focusing on survival for 390 subjects and microarray-based RNA expression data mapped to 390 subjects). CoCANUT, which has concluded, was supported by the NCI Division of Cancer Treatment and Diagnosis (DCTD) Cancer Diagnosis Program.
The caIntegrator search function enables researchers to conduct searches across the studies listed above and across disparate data types in order to facilitate integration. Consult the caIntegrator User's Guide for details.
CGAP, an initiative of the NCI Office of Cancer Genomics, offers researchers access to public databases containing a wide variety of data types. Researchers can download much of the CGAP data in a tab-separated ASCII format via FTP.
- cDNA libraries: data resources include the Digital Gene Expression Displayer; the cDNA Library Finder; the cDNA xProfiler; and the SAGE Digital Gene Expression Displayer.
- Gene information: data resources include the Wellcome Trust Cancer Gene Census and Catalogue of Somatic Mutations in Cancer (COSMIC); the Gene Ontology (GO) Browser; and the Gene Library Summarizer.
- Gene expression: data resources include the Digital Gene Expression Displayer; the SAGE Digital Gene Expression Displayer; and the SAGE Anatomic Viewer.
- Transcriptome: data resources include the NCI/Affymetrix Human Transcriptome Project and the SAGE Absolute Level Lister.
- Chromosomal aberrations: data resources include the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer and FISH-mapped BAC Clones.
- Single-nucleotide polymorphisms (SNPs): data resources include SNP500 Cancer and GeneWindow.
- Short hairpin RNAs (shRNAs): data resources include the shRNA Clone Library and the shRNA Validation Project.
- Pathways: data resources include BioCarta Pathways on CGAP.
Cancer types represented in the CGAP data collections are too numerous to be listed here. The CGAP website does not offer a global search function, but many of the data-mining tools available on the site perform searches across multiple databases.
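Because much of the CGAP data is distributed as tab-separated ASCII, a downloaded file can be read with standard tooling. The following is a minimal Python sketch, not a CGAP-provided tool. The column names (`gene_symbol`, `library`, `tag_count`) and the sample rows are hypothetical placeholders, since the actual column layout differs from file to file.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CGAP file. Real files
# have their own column layouts, so treat these names as placeholders.
SAMPLE_TSV = (
    "gene_symbol\tlibrary\ttag_count\n"
    "TP53\tlib_A\t12\n"
    "KRAS\tlib_A\t7\n"
    "TP53\tlib_B\t5\n"
)


def read_tsv(handle):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def total_by_column(rows, key, value):
    """Sum an integer column, grouped by the values of another column."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + int(row[value])
    return totals


rows = read_tsv(io.StringIO(SAMPLE_TSV))
totals = total_by_column(rows, "gene_symbol", "tag_count")
print(totals)
```

For a real download, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="ascii")`, where `path` is wherever the file was saved locally.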
With the NCI CBIIT instance of the Cancer Genome Workbench (CGWB), researchers can visualize and analyze, in an integrated manner, somatic-mutation, gene-expression, copy-number variation, next-generation sequencing, and methylation data generated by multi-platform genome-wide assays. CGWB provides access to data drawn from a number of projects including The Cancer Genome Atlas (TCGA), the Therapeutically Applicable Research to Generate Effective Treatments (TARGET), the Tumor Sequencing Project (TSP), and the Catalogue of Somatic Mutations in Cancer (COSMIC), as well as data relating to the NCI60 cell lines. The application offers investigators three main analytical views: an integrated view using the University of California at Santa Cruz Genome Browser; a heat-map view that associates gene-expression and copy-number data with clinical data; and Bambino, an alignment viewer for next-generation sequencing data.
The NCI CBIIT instance of caMOD is a publicly accessible database that provides researchers with detailed information about animal models of human cancers. It was developed in collaboration with the Mouse Models of Human Cancer Consortium (MMHCC) in the Division of Cancer Biology. The database contains 6,084 records. Species represented are Mus musculus, Rattus norvegicus, Rattus rattus, Danio rerio, Felis catus, Canis familiaris, Capra hircus, Mesocricetus auratus, Equus caballus, Oryctolagus cuniculus, and Ovis aries. Information elements include organism, strain, genetic profile, histopathology, derived cell lines, images, carcinogenic agents, and therapeutic trials in which the models were used.
The animal models represented in the database exhibit neoplastic disease types classified as follows:
- Primary tumors in major organ systems: cardiovascular, digestive, endocrine gland, integument, lymphohematopoietic, musculoskeletal, nervous, reproductive, respiratory, special sensory organs, and urinary systems
- Tumors in specific anatomical sites: head or neck, prostate, mammary gland, brain, and liver
caMOD also contains data characterizing 106 animal models with transplantations.
Most of the data in caMOD has been extracted from the literature or submitted by investigators who have bred the animals or used them in research. The database also contains data from the Mouse Tumor Biology Database, part of the Mouse Genome Informatics Program at The Jackson Laboratory.
caMOD offers basic and advanced search functions. Researchers can also perform drug-screening searches, which employ data from the NCI Developmental Therapeutics Program. Consult the User's Guide for details.
caMOD is continuing to accept data submitted by members of the community.
The NCI CBIIT instance of caNanoLab is a nanoparticle annotation and data-sharing portal that provides researchers with data relating to the characterization of nanoparticle samples, protocols involving nanoparticles, and associated publications. It was developed for the NCI Nano Alliance, part of the NCI Center for Strategic Scientific Initiatives. The portal contains data describing a total of 989 samples of 17 distinct nanomaterial entities: biopolymers, carbon black particles, carbon nanotubes, carbon particles, dendrimers, emulsions, fullerenes, liposomes, metal oxides, metal particles, metalloids, nanohorns, nanorods, nanoshells, polymers, quantum dots, and silica particles. Sample characterizations include these information elements:
- Material composition
- Nanomaterial functions such as "therapeutic," "targeting," or "diagnostic imaging"
- Physico-chemical characteristics, including size, molecular weight, shape, physical state, surface chemistry, purity, solubility, and relaxivity
- In-vitro characterization, including cytotoxicity, blood contact properties, oxidative stress, and immune-cell functions
- In-vivo characterization, including pharmacokinetics and toxicology
The caNanoLab portal provides both basic and advanced search capabilities. Consult the caNanoLab User's Guide for further details.
caNanoLab is continuing to accept data submitted by members of the community.
The NCI CBIIT instance of NBIA is an image repository that provides researchers with access to medical imaging libraries (also known as "collections") obtained from patients evaluated for multiple types of cancer (or, in one case, osteoarthritis). Images are stored in the Digital Imaging and Communications in Medicine (DICOM) standard and accompanied by image markups, annotations, and metadata. A wide range of imaging modalities is represented in the collections: computed radiography (CR), computed tomography (CT), digital radiography (DX), hard copy (HC), histopathology, magnetic resonance (MR), nuclear medicine (NM), ophthalmic photography (OP), presentation state (PR), positron emission tomography (PT), radiographic imaging (conventional film/screen) (RG), radiotherapy dose (RTDOSE), radiotherapy plan (RTPLAN), radiotherapy treatment record (RTRECORD), radiotherapy structure set (RTSTRUCT), secondary capture (SC), SR document, ultrasound (US), and X-ray angiography (XA).
Many NBIA collections have recently been moved to The Cancer Imaging Archive, a project of the NCI Division of Cancer Treatment and Diagnosis Cancer Imaging Program. Those that remain a part of NBIA are:
- FDG-PET Lymphoma: 28,461 PET or CT images of 14 human lymphoma cases using (limited access)
- I-SPY: 5,054 MR or HC images of six human breast-cancer cases (limited access)
- Mouse Astrocytoma: 20,434 MR images of 48 mouse high-grade astrocytoma cases (limited access)
- Mouse Mammary: 25,998 MR images of 32 pre-invasive or invasive mouse mammary cancer cases (limited access)
- United Kingdom National Cancer Research Institute (NCRI): 985 MR or histopathology images of six prostate cancer cases (public access)
- Quantitative Imaging Biomarker Alliance (QIBA): 23,074 MR or PR images of 10 cases (cancer type not specified) (limited access)
- Roswell Strong: 17,691 thin-cut spiral CT images of 31 cases of metastatic non-small cell lung cancer (limited access)
- Virtual Colonoscopy: 686,257 abdominal CT images of 808 patients to detect colon polyps (public access)
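The collection figures above can be tabulated to compare how many images are publicly accessible versus limited access. This is only an illustrative Python sketch: the image counts, case counts, and access levels are copied from the list above, while the `Collection` record and the helper function are hypothetical names and not part of any NBIA interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    """One NBIA collection as described in the list above (hypothetical record type)."""
    name: str
    images: int
    cases: int
    access: str  # "public" or "limited"


# Figures transcribed from the list of collections remaining in NBIA.
COLLECTIONS = [
    Collection("FDG-PET Lymphoma", 28_461, 14, "limited"),
    Collection("I-SPY", 5_054, 6, "limited"),
    Collection("Mouse Astrocytoma", 20_434, 48, "limited"),
    Collection("Mouse Mammary", 25_998, 32, "limited"),
    Collection("NCRI", 985, 6, "public"),
    Collection("QIBA", 23_074, 10, "limited"),
    Collection("Roswell Strong", 17_691, 31, "limited"),
    Collection("Virtual Colonoscopy", 686_257, 808, "public"),
]


def images_by_access(collections):
    """Return the total image count for each access level."""
    totals = {}
    for c in collections:
        totals[c.access] = totals.get(c.access, 0) + c.images
    return totals


access_totals = images_by_access(COLLECTIONS)
public_names = [c.name for c in COLLECTIONS if c.access == "public"]
print(access_totals, public_names)
```

Keeping the collections as plain records makes it easy to update the figures when the list changes.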
NBIA provides researchers with the ability to conduct simple, advanced, or dynamic searches, which can be combined with keyword searches of curated annotation data. Consult the NBIA User's Guide for details.
Developed by NCI CBIIT in collaboration with the Nature Publishing Group (NPG), the PID provides researchers with information about known biomolecular interactions and cellular processes assembled into signaling pathways. It contains data describing 137 human pathways curated by NCI and NPG as well as 322 human pathways imported from BioCarta/Reactome. Pathway categories are provided courtesy of the Rat Genome Database Pathway Ontology.
Regulatory pathways are classified according to biological process:
- Cell death
- Immune response
- Replication, repair, gene expression, protein biosynthesis
Signaling pathways are classified according to molecular family or biological process:
- Calcium/calmodulin dependent signaling pathway
- Cell-adhesion signaling pathways
- Cytokine and chemokine mediated signaling pathways
- Glycoconjugated protein signaling pathway
- Growth factor signaling pathways
- Hormone signaling pathways
- Mitogen-activated protein kinase signaling pathways
- Phosphatase-mediated signaling pathways
- Ras superfamily mediated signaling pathways
- Signaling pathways involving second messengers
- Signaling pathways pertinent to development
- Signaling pathways pertinent to the brain
- Transcription factor mediated signaling pathways
- mTOR signaling pathways
Researchers can browse pathways by category or alphabetically. Advanced search capabilities and batch queries are available. Consult the PID User Guide for more information.
Links to other public data resources are provided: BioCarta; Entrez Gene; Gene Ontology; the Human Protein Reference Database; KEGG; Molecule Pages; Pathway Commons; PharmGKB; Reactome; The Cancer Genome Project; and UniProt.
PID will resume accepting data from members of the community in the near future.
The REMBRANDT data portal provides researchers with access to data characterizing human brain neoplasias, including astrocytoma, GBM, oligodendroglioma, and other less-well-defined types (mixed, unknown, unclassified) as well as normal controls. The portal contains subject-annotation data focusing on survival for 668 subjects and two microarray-based data sets, the first with copy-number data mapped to 241 subjects and the second with gene-expression data mapped to 579 subjects. A joint initiative of NCI and the National Institute of Neurological Disorders and Stroke (NINDS), REMBRANDT is a project of the NCI Division of Cancer Treatment and Diagnosis (DCTD) Cancer Diagnosis Program.