A multi-institutional team of scientists has developed a free, publicly accessible resource to aid in classification of patient tumor samples based on distinct molecular features identified by The Cancer Genome Atlas (TCGA) Network.
The resource comprises classifier models that can accelerate the design of cancer subtype-specific test kits for use in clinical trials and cancer diagnosis. This is an important advance because tumors belonging to different subtypes may vary in their response to cancer therapies.
The resource is the first of its kind to bridge the gap between TCGA’s immense data library and clinical implementation.
A paper describing the tools was published online today in Cancer Cell.
“TCGA defined molecular subtypes for each major type of cancer. With this resource, we aimed to provide the clinical and scientific communities with the tools to assign a newly diagnosed tumor to one of these established subtypes,” said Peter W. Laird, Ph.D., the Peter and Emajean Cook Endowed Chair in Epigenetics at Van Andel Institute and the study’s lead corresponding author. “Our new resource will be a powerful asset for creating clinical assays based on the diverse molecular variations between cancers.”
TCGA was a decade-long, National Cancer Institute-led effort to create detailed molecular maps of 33 cancer types. Unlike traditional approaches that define cancers based on the organ or tissue in which they arise, TCGA identified nuanced genomic, epigenomic, proteomic and transcriptomic characteristics that more precisely describe cancer subtypes.
Andrew D. Cherniack, Ph.D., of the Broad Institute of MIT and Harvard and Kyle Ellrott, Ph.D., of the Knight Cancer Institute at Oregon Health & Science University also are corresponding authors of the paper, which represents a collaborative effort between scientists from more than a dozen research organizations.
“Since many TCGA molecular subtypes were generated using hundreds or thousands of features from multiple data types, scientists and physicians have asked us for help subtyping their samples,” Cherniack said. “Our resource greatly simplifies this process.”
The team created the new resource by leveraging data from 8,791 TCGA cancer samples that represented 26 cancer cohorts and 106 cancer subtypes. They then used existing machine learning tools to develop and test nearly half a million models across six categories — gene expression, DNA methylation, miRNA, copy number, mutation calls and multi-omics — and selected those that performed best for inclusion in the online resource.
In total, the resource contains 737 ready-to-use models, which represent the top-performing models for each of the 26 cancer cohorts, five training algorithms and six data types.
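For readers who want a concrete picture of the selection step, the following is a minimal sketch and not the team's actual pipeline. It assumes that "top models" means the best-scoring model within each combination of cohort, training algorithm and data type. The record layout, cohort and algorithm labels, model IDs and scores are all hypothetical placeholders.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    cohort: str      # placeholder cohort label (hypothetical)
    algorithm: str   # placeholder training-algorithm label (hypothetical)
    data_type: str   # e.g. "gene_expression", "dna_methylation", "mirna"
    model_id: str    # hypothetical identifier for a trained model
    score: float     # hypothetical performance metric; higher is better


def select_top_models(results):
    """Keep the best-scoring model per (cohort, algorithm, data_type) group."""
    best = {}
    for r in results:
        key = (r.cohort, r.algorithm, r.data_type)
        # Replace the stored model only if this one scores strictly higher.
        if key not in best or r.score > best[key].score:
            best[key] = r
    return best


# Made-up example records, for illustration only.
results = [
    ModelResult("COHORT_A", "algo_a", "gene_expression", "m1", 0.81),
    ModelResult("COHORT_A", "algo_a", "gene_expression", "m2", 0.87),
    ModelResult("COHORT_A", "algo_b", "dna_methylation", "m3", 0.79),
    ModelResult("COHORT_B", "algo_a", "mirna", "m4", 0.90),
]

top = select_top_models(results)
for key, r in sorted(top.items()):
    print(key, r.model_id, r.score)
```

In this sketch, two candidate models compete within the first group and only the higher-scoring one is kept, so four candidates reduce to three selected models.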
“A major element of this effort was working to ensure that these models could be deployed by other groups onto new datasets,” Ellrott said. “All too often this type of work is difficult to replicate or apply to new samples.”
The resource may be accessed at https://github.com/NCICCGPO/gdan-tmp-models.