ETL pipeline development for TCGA data from GDC Portal
- Mentors
  - Avery Wang, Angelica Ochoa, Kyle Hernandez, Zhenyu Zhang
- Organization
  - cBioPortal for Cancer Genomics
The Genomic Data Commons (GDC) Portal is a large-scale genomic data repository that hosts data from NCI cancer genome projects in standardized formats suited to programmatic access. One of the major goals of cBioPortal is to serve cancer genomic data from a wide range of sources in an easy-to-analyze manner, so an extraction and transformation (ET) pipeline between the two platforms would benefit the cancer genomics research community as a whole.
A previous GSoC project laid the foundation for this work by developing a Spring Batch pipeline that takes a GDC manifest file specifying the desired data and produces a clinical file and a Genome Nexus annotated MAF suitable for cBioPortal import. This project will expand on that pipeline by adding new Spring Batch readers, processors, and writers, along with the corresponding pipeline logic, for copy number alteration (CNA) and mRNA expression data. The two data types are similar in format and, used together, help identify over- and under-expressed genes in a given sample.
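As a rough illustration of the reader/processor/writer split for CNA data, the sketch below uses plain Java with no Spring dependency. It is not the existing pipeline's code: the class `CnaMatrixSketch`, the `GeneCall` type, the three-column input layout (sample, gene, value), and the use of `NA` for missing calls are all assumptions made for this example. In a Spring Batch job, `parse` would roughly play the role of an `ItemProcessor` and `toMatrix` the role of an `ItemWriter` that emits a gene-by-sample matrix with a `Hugo_Symbol` column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class CnaMatrixSketch {

    /** One gene-level copy-number call for one sample (hypothetical shape). */
    public static final class GeneCall {
        public final String sampleId;
        public final String hugoSymbol;
        public final int value;

        public GeneCall(String sampleId, String hugoSymbol, int value) {
            this.sampleId = sampleId;
            this.hugoSymbol = hugoSymbol;
            this.value = value;
        }
    }

    /** Processor role: parse one line of the assumed "sample<TAB>gene<TAB>value" input. */
    public static GeneCall parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated columns: " + line);
        }
        return new GeneCall(fields[0].trim(), fields[1].trim(),
                Integer.parseInt(fields[2].trim()));
    }

    /** Writer role: pivot per-sample calls into tab-delimited gene-by-sample rows. */
    public static List<String> toMatrix(List<GeneCall> calls) {
        TreeSet<String> samples = new TreeSet<>();                 // sorted column order
        Map<String, Map<String, Integer>> byGene = new LinkedHashMap<>(); // first-seen row order
        for (GeneCall call : calls) {
            samples.add(call.sampleId);
            byGene.computeIfAbsent(call.hugoSymbol, g -> new LinkedHashMap<>())
                  .put(call.sampleId, call.value);
        }

        List<String> rows = new ArrayList<>();
        rows.add("Hugo_Symbol\t" + String.join("\t", samples));
        for (Map.Entry<String, Map<String, Integer>> entry : byGene.entrySet()) {
            StringBuilder row = new StringBuilder(entry.getKey());
            for (String sample : samples) {
                Integer value = entry.getValue().get(sample);
                // "NA" for a missing call is an assumption of this sketch.
                row.append('\t').append(value == null ? "NA" : value.toString());
            }
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical input lines; real GDC files would need their own reader.
        List<GeneCall> calls = new ArrayList<>();
        calls.add(parse("SAMPLE_A\tTP53\t-2"));
        calls.add(parse("SAMPLE_B\tTP53\t0"));
        calls.add(parse("SAMPLE_A\tEGFR\t2"));
        toMatrix(calls).forEach(System.out::println);
    }
}
```

The same pivot would likely apply to mRNA expression data with a floating-point value in place of the integer call, which is one reason the two data types can share much of their pipeline logic.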