Investigating and Implementing Compact Data Representation of Homology Relationship
- Mentors
- Ivana Piližota, David Thybert
- Organization
- Genome Assembly and Annotation
- Technologies
- python, c++, Data representation, Succinct data structures, Data structure and algorithm design
- Topics
- bioinformatics, data structure, Data representation
A key challenge in modern bioinformatics is managing and storing the growing volume of biological data in a way that is both space-efficient and scalable. Traditionally, biological data are stored as human-readable flat files or as entries in a conventional relational database. A drawback of these approaches is that the space required to maintain the data is becoming increasingly unmanageable, which significantly limits scalability. Query time also grows with the size of the database. As the problem statement notes, Ensembl currently stores homology data as tuples in a relational database, which makes the whole database large and hard to scale. A natural way to store the data compactly is to exploit the intrinsic hierarchical structure of homology relationships. We propose multiple hierarchical data structures and formatting methods to improve the space efficiency of homology databases, along with the metrics that matter when designing such data structures and formats. These approaches are proposed with application to real gene homology databases in mind. As part of the project, we will implement the representations in Python and/or C++ and evaluate them using the proposed metrics.
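To illustrate why a hierarchical representation can be more compact than pairwise tuples, the following is a minimal sketch rather than the project's actual design. It assumes a toy gene tree whose internal nodes are tagged with an evolutionary event, and it derives the pairwise relationship of two genes from their lowest common ancestor (speciation gives orthologs, duplication gives paralogs). The gene names, node names, and the helper functions `ancestors`, `lca`, `homology_type`, and `all_pairs` are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy gene tree stored as child -> parent pointers.
# Leaves are genes; internal nodes carry the inferred evolutionary event.
PARENT = {
    "n_root": None,
    "n_dup": "n_root",
    "geneA_human": "n_dup",
    "geneB_human": "n_dup",
    "geneA_mouse": "n_root",
}
EVENT = {"n_root": "speciation", "n_dup": "duplication"}


def ancestors(node):
    """Return the path from a node up to the root, including the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Return the lowest common ancestor of two nodes in the tree."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def homology_type(a, b):
    """Classify a gene pair from the event at its lowest common ancestor."""
    return "ortholog" if EVENT[lca(a, b)] == "speciation" else "paralog"


def all_pairs():
    """Expand the tree into the explicit pairwise tuples it encodes."""
    leaves = sorted(n for n in PARENT if n not in EVENT)
    return {(a, b): homology_type(a, b) for a, b in combinations(leaves, 2)}


if __name__ == "__main__":
    for pair, kind in all_pairs().items():
        print(pair, kind)
```

In this sketch a tree over n genes needs one parent pointer per node, whereas the explicit tuple form needs one row per gene pair, which is n(n-1)/2 rows. The trade-off is that each pairwise lookup now costs a tree traversal, which is one of the metrics such a design would need to measure.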