Contributor
Gabriel Simonetto

BioSerDe: (De)Serializing bioinformatics file formats to alternative representations


Mentors
mmilton, Marko Malenic, brainstorm
Organization
Global Alliance for Genomics and Health
Technologies
Rust
Topics
bioinformatics, Data representation, Serialization/Deserialization, data formats
Throughout its history, the field of bioinformatics has used many different file formats to express genomics data, most of them designed to be simple to work with and human-readable. However, with the recent research on big data analysis, formats designed for performance could greatly improve the ability to work with large genomics datasets. This project aims to be a step towards discovering the advantages of the many possible formats in many possible scenarios, by making it easy to translate any one of them into another.

Such an objective seems useful for:
- Discovering the optimal format for performance-critical applications.
- Easy conversion of data from one tool to another, for bioinformatics researchers and hackers.

We will achieve this by using an Internal Representation (IR) of the semantic data expressed in each format, which will then allow translation from one format to another. This IR will probably be built on one of the widely used standard serialization formats, such as Protocol Buffers or Amazon’s Ion, since these technologies are already widely adopted, perform well, and have many features built in.
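To make the IR idea concrete, here is a minimal sketch in Rust using only the standard library. It is not the project's actual design: the `SeqRecord` type and the `parse_fasta`, `to_fasta` and `to_tsv` functions are hypothetical names chosen for illustration, and a plain struct stands in for an IR that the project expects to build on Protocol Buffers or Ion. The point it shows is that each format only needs a reader into the IR and a writer out of it, so any reader can be paired with any writer.

```rust
/// Hypothetical internal representation (IR) of one sequence record.
/// A real IR would have to model much more (qualities, alignments, variants).
#[derive(Debug, Clone, PartialEq)]
pub struct SeqRecord {
    pub id: String,
    pub description: Option<String>,
    pub sequence: String,
}

/// Reader: parse minimal FASTA text into IR records.
/// Assumes headers start with '>' and sequences may span several lines.
pub fn parse_fasta(input: &str) -> Result<Vec<SeqRecord>, String> {
    let mut records: Vec<SeqRecord> = Vec::new();
    for (n, raw) in input.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            // The first whitespace-delimited token is the id, the rest is the description.
            let mut parts = header.splitn(2, char::is_whitespace);
            let id = parts.next().unwrap_or("").to_string();
            if id.is_empty() {
                return Err(format!("line {}: empty record id", n + 1));
            }
            let description = parts
                .next()
                .map(|d| d.trim().to_string())
                .filter(|d| !d.is_empty());
            records.push(SeqRecord { id, description, sequence: String::new() });
        } else {
            // Sequence lines are appended to the most recent record.
            match records.last_mut() {
                Some(rec) => rec.sequence.push_str(line),
                None => return Err(format!("line {}: sequence data before any header", n + 1)),
            }
        }
    }
    Ok(records)
}

/// Writer: serialize IR records back to single-line FASTA.
pub fn to_fasta(records: &[SeqRecord]) -> String {
    let mut out = String::new();
    for r in records {
        out.push('>');
        out.push_str(&r.id);
        if let Some(d) = &r.description {
            out.push(' ');
            out.push_str(d);
        }
        out.push('\n');
        out.push_str(&r.sequence);
        out.push('\n');
    }
    out
}

/// Writer: serialize IR records to a tab-separated table.
/// Assumes descriptions contain no tab characters (no escaping is done).
pub fn to_tsv(records: &[SeqRecord]) -> String {
    let mut out = String::from("id\tdescription\tsequence\n");
    for r in records {
        let desc = r.description.as_deref().unwrap_or("");
        out.push_str(&format!("{}\t{}\t{}\n", r.id, desc, r.sequence));
    }
    out
}
```

With this shape, converting FASTA to TSV is `to_tsv(&parse_fasta(text)?)`. Supporting a further format means adding one reader and one writer against `SeqRecord`, rather than one converter per pair of formats.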