Contributor
Mihai Popescu

Data streaming in scientific workflows, implementation for Toil


Mentors
Lon Blauvelt, Michael R. Crusoe
Organization
Open Bioinformatics Foundation

Toil is an open-source Python workflow engine for writing data analysis pipelines in Python, WDL, and CWL (Common Workflow Language, an open standard for describing analysis workflows). Its scalability was demonstrated in the paper “Toil enables reproducible, open source, big biomedical data analyses”, published in Nature Biotechnology, which describes how Toil scaled to a 108-terabyte dataset on 32,000 cores in a public cloud.

This project aims to implement data streaming in Toil. Streaming speeds up analysis in two ways: it avoids slow disk and storage I/O, and it lets a tool start executing without waiting for its input data to finish downloading. The initial focus is on AWS S3.
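
The idea of streaming can be illustrated with a minimal Python sketch. It is not Toil's implementation: the function names (`stream_chunks`, `count_lines`, `run_streaming`) are hypothetical, and a generator stands in for the remote object. With S3, the chunks might instead come from something like boto3's streaming response body, which is an assumption here. A producer thread writes chunks into an OS pipe while the consumer reads from the other end, so processing begins before all the data has arrived and nothing is written to disk.

```python
import os
import threading


def stream_chunks(chunks, write_fd):
    """Producer: write chunks into the pipe as they arrive, then close it."""
    with os.fdopen(write_fd, "wb") as sink:
        for chunk in chunks:
            sink.write(chunk)
    # Leaving the 'with' block closes the write end, which signals EOF.


def count_lines(read_fd):
    """Consumer: stand-in for a tool that reads its input incrementally."""
    total = 0
    with os.fdopen(read_fd, "rb") as source:
        for _ in source:
            total += 1
    return total


def run_streaming(chunks):
    """Connect producer and consumer through a pipe; no temporary file is used."""
    read_fd, write_fd = os.pipe()
    producer = threading.Thread(target=stream_chunks, args=(chunks, write_fd))
    producer.start()
    result = count_lines(read_fd)  # starts reading while the producer is still writing
    producer.join()
    return result


# Hypothetical stand-in for chunks fetched from a remote object store.
remote_chunks = (b"read_%d\n" % i for i in range(1000))
line_count = run_streaming(remote_chunks)
```

In a real workflow the consumer would be the analysis tool itself, reading from a pipe or standard input instead of from a fully downloaded file.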