Contributor
Purav Biyani

A Nextflow Pipeline for Repeat Annotation


Mentors
Fergal Martin, Leanne Haggerty, Thiago Genez
Organization
Genome Assembly and Annotation
Technologies
python, perl, nextflow
Topics
bioinformatics, cloud, containerisation
This project proposes a Nextflow pipeline that performs repeat annotation and masking efficiently and accurately on large, repeat-rich genome sequences. The pipeline will support genome chunking and multiprocessing to make efficient use of computational resources.

The pipeline will take a genome sequence in FASTA format as input and run RepeatModeler to generate a de novo repeat library for that genome. It will then run RepeatMasker to mask and annotate repeats, Dust to mask and annotate low-complexity sequences, and TRF to mask and annotate tandem repeats. The results of these steps will be combined into a masked genome sequence in FASTA format and a set of repeat annotations in GTF format. In addition, the pipeline will run the tool RED to perform further masking and output a second masked genome sequence in FASTA format.

The pipeline will be deployed on the Embassy Cloud within the EMBL-EBI infrastructure for testing and scaling, with the aim of determining the cost of running it at scale.

The main problem this proposal addresses is the difficulty of identifying and masking repetitive elements in large and complex genome sequences. The pipeline will provide a detailed annotation of the repeats within the genome, making it easier for researchers to analyze its non-repetitive regions.

The deliverables are a Nextflow pipeline for repeat annotation and masking, a de novo repeat library for the input genome sequence, a masked genome sequence in FASTA format covering repeats, low-complexity sequences, and tandem repeats, an annotated repeats file in GTF format, and an additional masked genome sequence in FASTA format generated with RED.
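The genome chunking and result-combining steps described above can be illustrated with a minimal Python sketch. It is not the pipeline's implementation: the function names (`chunk_coordinates`, `merge_intervals`, `soft_mask`), the 0-based half-open interval convention, the optional chunk overlap, and the use of lowercase soft-masking are all assumptions made for illustration. The idea is that each chunk can be handed to a masking tool independently, and the intervals the tools report can then be merged before the sequence is masked.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical convention for this sketch: 0-based, half-open [start, end).
Interval = Tuple[int, int]


def chunk_coordinates(seq_len: int, chunk_size: int, overlap: int = 0) -> Iterator[Interval]:
    """Yield chunk coordinates that together cover a sequence of length seq_len."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap smaller than chunk_size")
    start = 0
    while start < seq_len:
        end = min(start + chunk_size, seq_len)
        yield (start, end)
        if end == seq_len:
            break
        # Step back by `overlap` so a repeat spanning a boundary is seen whole.
        start = end - overlap


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals, e.g. hits from several tools."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def soft_mask(sequence: str, intervals: Iterable[Interval]) -> str:
    """Lowercase the bases inside the given intervals (soft-masking)."""
    chars = list(sequence)
    for start, end in merge_intervals(intervals):
        chars[start:end] = [base.lower() for base in chars[start:end]]
    return "".join(chars)


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    print(list(chunk_coordinates(len(genome), chunk_size=4, overlap=1)))
    # Intervals as they might be reported by different masking tools.
    print(soft_mask(genome, [(0, 2), (1, 3), (8, 10)]))
```

In a Nextflow setting, each chunk would typically become one task, and the merge would run once all tasks for a sequence have finished. Intervals reported per chunk would first need to be shifted by the chunk's start offset, which this sketch leaves out.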