Differentiating Real and Misaligned Introns with Machine Learning
- Mentors
- Jose Gonzalez, Adam Frankish
- Organization
- Genome Assembly and Annotation
- Technologies
- python
- Topics
- machine learning, genomics, bioinformatics
Advances in the accuracy of long-read sequencing technology have allowed us to explore novel transcript variants of known genes. Preventing potentially incorrect transcript and gene annotations is essential to the scientific community, as many researchers rely on the annotation for decision-making. An automated workflow has been developed to minimise the time needed to verify and annotate those transcript variants. However, current workflows use a very strict rule set, and hence many of the novel transcript variants are rejected.
To address the issue that strict filters reject most of the legitimate introns,
we developed IntronOrNot (ION), a machine learning model that predicts
whether an intron is real or misaligned. The model accepts coordinates, .bed files, and .gtf files as input. The prediction script is easy to use and achieves results comparable
to those of a sequence-based deep learning intron predictor. A standalone script
that extracts introns from .gtf files was also developed.
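
To illustrate the intron-extraction step, the following is a minimal sketch of how introns can be derived from the exon records of a .gtf file. It is not the project's own script: the function names `extract_introns` and `to_bed_line` are hypothetical, and the sketch assumes tab-separated GTF lines with a `transcript_id` attribute, 1-based inclusive coordinates, and that an intron is the gap between two consecutive exons of the same transcript.

```python
import re
from collections import defaultdict


def extract_introns(gtf_lines):
    """Return (chrom, start, end, strand, transcript_id) tuples for introns.

    Coordinates are 1-based and inclusive, as in GTF. Hypothetical helper,
    not the IntronOrNot implementation.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only exon features are needed to infer introns.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))
        exons[key].append((int(fields[3]), int(fields[4])))

    introns = []
    for (chrom, strand, transcript_id), spans in exons.items():
        spans.sort()
        # Each gap between consecutive exons is treated as one intron.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - prev_end > 1:
                introns.append(
                    (chrom, prev_end + 1, next_start - 1, strand, transcript_id)
                )
    return introns


def to_bed_line(intron):
    """Format an intron as a BED line (0-based start, end-exclusive)."""
    chrom, start, end, strand, transcript_id = intron
    return f"{chrom}\t{start - 1}\t{end}\t{transcript_id}\t0\t{strand}"
```

Usage might look like `with open("annotation.gtf") as handle: introns = extract_introns(handle)`, where `annotation.gtf` is a placeholder file name. Grouping by transcript rather than by gene keeps each intron tied to the transcript model that supports it, at the cost of reporting the same intron once per transcript that contains it.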