Applying machine learning techniques to characterising and naming lncRNA genes
- Mentors: Daniel Zerbino
- Organization: Genes, Genomes and Variation
Advances in RNA sequencing technologies have revealed the complexity of our genome. Long non-coding RNAs (lncRNAs) make up the majority of the non-coding transcriptome. Understanding the significance of this RNA world is one of the most important challenges in biology today, and the lncRNAs within it represent a gold mine of potential new biomarkers and drug targets. Their characterization is still at a preliminary stage: to date, very few lncRNAs have been characterized in detail. It is nonetheless clear that lncRNAs are important regulators of gene expression, and they are thought to have a wide range of functions in cellular and developmental processes.

Many databases annotate lncRNAs (for example RefSeq, GENCODE, Ensembl, SGD and TAIR). We will use machine learning techniques to compare two sets of lncRNA calls (Ensembl / GENCODE and RefSeq) and determine which calls are incorrect.

Goal of the Project: Implement a machine learning model (a second-pass filter) that classifies the calls produced by RefSeq and GENCODE (or Ensembl) as credible (true positives) or not credible (false positives).
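As a rough illustration of what such a second-pass filter could look like, the sketch below compares two toy call sets by interval overlap and trains a small logistic regression on the result. Everything in it is an assumption made for illustration: the project description does not specify a model, features, or data format. The coordinates, the labels, the `coding_score` feature and all function names are hypothetical, and a real implementation would more likely use an established library and features derived from actual annotation files.

```python
import math

# A call is (chrom, start, end, strand) with half-open coordinates.
# All coordinates below are made-up toy values, not real annotations.

def jaccard(a, b):
    """Jaccard overlap of two calls; 0.0 if chromosome or strand differ."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    inter = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    union = (a[2] - a[1]) + (b[2] - b[1]) - inter
    return inter / union if union else 0.0

def best_overlap(call, other_set):
    """Best Jaccard overlap between one call and any call in the other set."""
    return max((jaccard(call, o) for o in other_set), default=0.0)

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that a call is credible."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy call sets standing in for the two annotation sources.
set_a = [("chr1", 100, 200, "+"), ("chr1", 500, 900, "-")]
set_b = [("chr1", 150, 250, "+"), ("chr2", 500, 900, "-")]

# Hypothetical features per call: [overlap with the other set, coding_score].
# Labels are invented: 1 = credible call, 0 = false positive.
X = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.1, 0.8], [0.0, 0.9], [0.2, 0.7]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)

# Score each call in set_a, assuming a low coding_score of 0.1 for both.
for call in set_a:
    features = [best_overlap(call, set_b), 0.1]
    print(call, round(predict_proba(w, b, features), 3))
```

The overlap feature captures agreement between the two sources, while the second feature stands in for any per-transcript evidence (here, a placeholder for a coding-potential score). In practice the labels would have to come from a curated set of validated lncRNAs rather than the invented values used here.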