Using Deep Learning to Identify Features of Protein-Coding Genes
- Mentors
- Fergal Martin, Leanne Haggerty
- Organization
- Genome Assembly and Annotation
- Technologies
- python, tensorflow, CNN, Transformer
- Topics
- machine learning, bioinformatics, genetics
Accurate gene annotation in eukaryotes based solely on genomic data has been a significant challenge in biology since the introduction of next-generation sequencing technologies and the resulting rapid increase in available data. Traditional methods either rely on homology searches to map open reading frames to previously identified protein-coding genes or use additional experimental data, e.g., transcriptomic data. The first approach can produce inaccurate results if the genome of interest is not at least somewhat related to an already annotated genome. The second approach is hindered because gathering transcriptomic data is labor-intensive and expensive.

For that reason, there is high demand for models that predict the location of protein-coding genes solely from intrinsic features of the DNA sequence. Although this is theoretically possible, existing methods that use, for instance, hidden Markov models to detect protein-coding genes based on known gene features are often inaccurate.

In this project, we will train a deep learning Transformer model to extract features of protein-coding genes and gain deeper insight into the exact properties that lead to translation. The workflow has two stages: first, a Conditional Random Field model is trained to recognize candidate gene regions; second, these regions are used as input for a more fine-grained hybrid Transformer and Convolutional Neural Network (CNN) model. The final pipeline will be tested against a benchmark of gold-standard annotations, as well as against various test sets, to evaluate the influence of parameters such as genome sequence quality, protein length, and gene structure complexity.
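As an illustration of the benchmarking step, the following is a minimal sketch of how predicted coding regions might be scored against a gold-standard annotation at the nucleotide level. It is not the project's evaluation code: the interval representation (0-based, half-open `(start, end)` pairs on a single sequence), the function names `covered_positions` and `base_level_scores`, and the choice of precision, recall, and F1 as metrics are all assumptions made for this example.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed representation: 0-based, half-open [start, end) intervals
# on a single sequence (e.g. one chromosome or scaffold).
Interval = Tuple[int, int]


def covered_positions(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of nucleotide positions covered by the intervals."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def base_level_scores(
    predicted: Iterable[Interval], gold: Iterable[Interval]
) -> Dict[str, float]:
    """Compute nucleotide-level precision, recall, and F1 (hypothetical helper)."""
    pred = covered_positions(predicted)
    true = covered_positions(gold)
    true_positives = len(pred & true)

    # Guard against empty predictions or an empty gold standard.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
gold_regions = [(100, 200), (300, 400)]
predicted_regions = [(150, 250), (300, 350)]
scores = base_level_scores(predicted_regions, gold_regions)
print(scores)
```

Building position sets is simple but memory-hungry, so a sketch like this suits small test sets; for whole genomes an interval-sweep or a dedicated interval library would likely be preferable. Running the same function on subsets of the benchmark (for example, grouped by protein length or number of exons) is one way the influence of individual parameters could be compared.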