The aim of this project is to write support code for obtaining accurate transcriptions, exemplar pronunciations, and phonetic and part-of-speech labels.

An important concern in natural language processing is the availability of gold-standard data, such as speech transcripts, part-of-speech tagged sentences, and word pronunciation examples, which are used to train and evaluate speech and text processing algorithms. Producing such data requires extensive collection, processing, and validation of information from a variety of sources.

The data must be normalized across these sources and evaluated for quality, quantity, and other metrics. This calls for a structured system in which data collection and management tasks can be carried out effectively. Such a system is especially important to CMU Sphinx contributors, who need this data to train and evaluate their speech and text models.

The project integrates a Flask- and SQLite-based leaderboard web application with a transcription record system into a broader data aggregation and processing framework for the tasks described above.
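
To make the idea concrete, here is a minimal sketch of how transcription records and a contributor leaderboard might share one SQLite store. It uses only Python's standard sqlite3 module and leaves out the Flask layer. The table name, column names, and the helper functions open_db, add_transcription, and leaderboard are hypothetical and chosen for illustration; they are not taken from the project's code. The normalization step (collapsing whitespace and lowercasing) is likewise only an assumed example of what normalizing across sources could mean.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS transcriptions (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    contributor TEXT NOT NULL,   -- who submitted the transcript
    utterance   TEXT NOT NULL,   -- identifier of the audio clip
    transcript  TEXT NOT NULL    -- normalized transcript text
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_transcription(conn, contributor, utterance, transcript):
    """Store one transcript after a light, assumed normalization step."""
    # Collapse runs of whitespace and lowercase the text.
    normalized = " ".join(transcript.split()).lower()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO transcriptions (contributor, utterance, transcript) "
            "VALUES (?, ?, ?)",
            (contributor, utterance, normalized),
        )


def leaderboard(conn, limit=10):
    """Rank contributors by number of submitted transcripts."""
    rows = conn.execute(
        "SELECT contributor, COUNT(*) AS n FROM transcriptions "
        "GROUP BY contributor ORDER BY n DESC, contributor ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

In a Flask application, a view function could call leaderboard and render the returned rows. Ranking by submission count is only one possible metric; a quality-weighted score would fit the same table layout.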



Rishi Rajasekaran


  • lanceculnane
  • James Salsman