Reference genomes are currently represented as a linear sequence of nucleotides, akin to a long string. Such a sequence does not represent any single genome but rather a consensus. The workaround used today is to hold variation data in VCF files, which do not update the reference, so the reference always represents the genome as it was, not as it is or as it is evolving. This way of representing genomes is not ideal. There is a need to represent the reference in a data structure that contains its inherent variation. Different methods have been tried, and the variation graph is a promising one. The graph represents variation within the genome as alternative paths through the graph, and conserved regions as nodes with no alternative path to the next node; the nodes are then indexed for querying and alignment. Moreover, variation graphs hold an advantage for rapidly evolving genomes and for short-read data that might otherwise be thrown out because it has no place to align to in the reference; with a variation graph, such short reads can align to alternative nodes.
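
As a rough illustration of the idea above, the following minimal Python sketch builds a tiny variation graph with one conserved node on each side of a single-nucleotide variant, enumerates the alternative paths, and builds a simple k-mer index over the nodes. The class `VariationGraph`, its methods, and the toy sequences are all hypothetical names and data chosen for this example; it assumes an acyclic graph and is not the API or indexing scheme of any existing tool.

```python
from collections import defaultdict


class VariationGraph:
    """Toy directed acyclic graph whose nodes carry nucleotide sequences."""

    def __init__(self):
        self.seq = {}                   # node id -> nucleotide sequence
        self.edges = defaultdict(list)  # node id -> successor node ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def paths(self, start, end):
        """Yield every path from start to end as a list of node ids."""
        stack = [[start]]
        while stack:
            path = stack.pop()
            if path[-1] == end:
                yield path
                continue
            for nxt in self.edges[path[-1]]:
                stack.append(path + [nxt])

    def spell(self, path):
        """Concatenate the node sequences along a path."""
        return "".join(self.seq[n] for n in path)

    def kmer_index(self, k, start, end):
        """Map each k-mer to the ids of the nodes in which it begins."""
        index = defaultdict(set)
        for path in self.paths(start, end):
            # For every base of the spelled path, record the node that owns it.
            owners = [n for n in path for _ in self.seq[n]]
            text = self.spell(path)
            for i in range(len(text) - k + 1):
                index[text[i:i + k]].add(owners[i])
        return index


# Hypothetical example: conserved flanks (nodes 1 and 4) around a T/C variant.
g = VariationGraph()
g.add_node(1, "ACG")   # conserved region
g.add_node(2, "T")     # reference allele
g.add_node(3, "C")     # alternative allele
g.add_node(4, "TGCA")  # conserved region
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    g.add_edge(src, dst)

haplotypes = sorted(g.spell(p) for p in g.paths(1, 4))
index = g.kmer_index(4, 1, 4)

linear_reference = "ACGTTGCA"
read = "CTGC"  # short read carrying the alternative allele
print(haplotypes)
print(read in linear_reference, index.get(read))
```

In this toy case the read `CTGC` has no exact match in the linear reference string, yet the index places it on the alternative node, which is the situation described above for short reads that would otherwise have nowhere to align.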


Njagi Mwaniki


  • George Githinji
  • Pjotr Prins