When using mass spectrometry to analyze protein samples, it is essential to use an appropriate protein database. Most of the time, the sample comes from a single organism whose genome (and therefore proteome) is known. In this case, choosing a suitable database does not pose a problem.
However, if the sample contains one or more unknown organisms, it can be difficult to choose an adequate database. There are several approaches to this problem, but they frequently depend on assumptions and are therefore not guaranteed to generate a suitable database.
It would therefore be valuable to be able to score candidate databases, ensuring that the most suitable one is used.
To achieve this, de novo sequences are derived from the mass spectra and appended to the given protein database. Subsequently, a standard protein search engine is run against this combined database. After false discovery rate filtering, the numbers of identified de novo sequences and database sequences are counted. The quality of the database can then be calculated as follows:
database quality = # database sequences / (# database sequences + # de novo sequences)
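As a minimal sketch of the formula above, the calculation can be written in a few lines of Python. The function name `database_quality` and its parameters are hypothetical and are not part of OpenMS. The two counts are assumed to be the numbers of identified sequences that remain after false discovery rate filtering. Raising an error when both counts are zero is also an assumption of this sketch, since the ratio is undefined in that case.

```python
def database_quality(n_database: int, n_de_novo: int) -> float:
    """Return the fraction of identified sequences that come from the database.

    Hypothetical helper for illustration only; not an OpenMS function.
    Both arguments are counts taken after false discovery rate filtering.
    """
    if n_database < 0 or n_de_novo < 0:
        raise ValueError("counts must be non-negative")
    total = n_database + n_de_novo
    if total == 0:
        # No identified sequences at all: the ratio is undefined.
        raise ValueError("no identified sequences; quality is undefined")
    return n_database / total


# Example: 80 database sequences and 20 de novo sequences
print(database_quality(80, 20))  # 0.8
```

A value close to 1 means that almost all identifications come from the database itself, while a value close to 0 means that the de novo sequences explain the spectra better than the database does.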
This project aims to implement this metric in OpenMS.