Data pipeline for exchange of genomic variation between public repositories
- Mentors: Sundar Venkataraman, Cristina Yenyxe Gonzalez
- Organization: Global Alliance for Genomics and Health
The main goal of this project is to implement a mechanism that keeps the European Variation Archive (EVA) in sync with the latest human variant data submitted to dbSNP. Once imported, this information can be distributed through the EVA implementations of the GA4GH htsget and Beacon API specifications, as well as through the EVA website.
Acceptance criteria
Given a dbSNP FTP directory containing human variant information, the pipeline should parse the JSON files for each chromosome and write the variants they contain to the EVA.
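As a rough illustration of the input side of this criterion, the sketch below walks a local mirror of the directory and yields one parsed record at a time. It assumes the files have already been downloaded, that there is one bzip2-compressed file per chromosome, and that each line holds one JSON record; the file name pattern `refsnp-chr*.json.bz2` and the function name `iter_chromosome_records` are assumptions made for this example, not part of the project definition.

```python
import bz2
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_chromosome_records(
    directory: Path, pattern: str = "refsnp-chr*.json.bz2"
) -> Iterator[Tuple[str, dict]]:
    """Yield (file name, parsed record) for every non-empty line.

    Assumption: one compressed file per chromosome, one JSON object per line.
    """
    for path in sorted(directory.glob(pattern)):
        # Stream the file so a whole chromosome is never held in memory.
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield path.name, json.loads(line)
```

Streaming line by line is a deliberate choice here, since per-chromosome files can be large; a real pipeline would likely also need download, retry and checkpointing logic that this sketch leaves out.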
Needed tasks
- Construct an object model for dbSNP and parse the JSON files into objects of that model
- Convert the objects in the dbSNP model to the EVA variant object model
- Assemble each variant object from the component objects above, then use the variant writer from variation-commons to write the variants to the EVA
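The three tasks above can be sketched end to end as follows. This is a minimal illustration, not the project's design: the record layout (`rs_id`, `alleles`, `contig`, `position`, `deleted`, `inserted`) is a simplified, hypothetical stand-in for the real dbSNP JSON, `EvaVariant` is a hypothetical stand-in for the EVA variant model, and `InMemoryVariantWriter` only mimics the role of the variation-commons variant writer, whose actual API is not shown here. The 0-based to 1-based coordinate shift is likewise an assumption of the example.

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class DbsnpAllele:
    """Hypothetical dbSNP-side allele; position is assumed 0-based."""
    contig: str
    position: int
    deleted: str
    inserted: str


@dataclass(frozen=True)
class DbsnpRefSnp:
    """Hypothetical dbSNP-side record: one identifier, several alleles."""
    rs_id: str
    alleles: List[DbsnpAllele]


@dataclass(frozen=True)
class EvaVariant:
    """Hypothetical stand-in for the EVA variant model (1-based start)."""
    contig: str
    start: int
    reference: str
    alternate: str
    ids: Tuple[str, ...]


def parse_refsnp(line: str) -> DbsnpRefSnp:
    """Task 1: parse one JSON record into the dbSNP object model."""
    raw = json.loads(line)
    alleles = [DbsnpAllele(**allele) for allele in raw["alleles"]]
    return DbsnpRefSnp(rs_id="rs" + str(raw["rs_id"]), alleles=alleles)


def to_eva_variants(snp: DbsnpRefSnp) -> List[EvaVariant]:
    """Task 2: convert dbSNP objects to the EVA-side model."""
    variants = []
    for allele in snp.alleles:
        if allele.deleted == allele.inserted:
            continue  # reference allele, carries no variation
        variants.append(EvaVariant(
            contig=allele.contig,
            start=allele.position + 1,  # assumed 0-based to 1-based shift
            reference=allele.deleted,
            alternate=allele.inserted,
            ids=(snp.rs_id,),
        ))
    return variants


@dataclass
class InMemoryVariantWriter:
    """Task 3 placeholder: collects variants instead of writing to the EVA."""
    written: List[EvaVariant] = field(default_factory=list)

    def write(self, variants: List[EvaVariant]) -> None:
        self.written.extend(variants)
```

A short usage example under the same assumptions:

```python
line = json.dumps({"rs_id": 123, "alleles": [
    {"contig": "chr1", "position": 99, "deleted": "A", "inserted": "A"},
    {"contig": "chr1", "position": 99, "deleted": "A", "inserted": "G"},
]})
writer = InMemoryVariantWriter()
writer.write(to_eva_variants(parse_refsnp(line)))
```

Keeping the dbSNP model separate from the EVA model means changes to the source format only affect the parsing step; allele normalization for insertions and deletions is intentionally left out of this sketch.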