Contributor
Yifan Xiong

Collect Pronunciation Dictionaries from Wiktionary


Mentors
John Mark Vandenberg, Imran Sheikh, Arseniy Gorin
Organization
CMU Sphinx

The Collect Pronunciation Dictionaries from Wiktionary project aims to expand the pronunciation dictionaries in CMUSphinx with new words and additional languages drawn from Wikimedia Foundation projects. Current Sphinx dictionaries cover only a limited set of words and languages, which makes it difficult to meet the needs of applications, so expanding them is an urgent need. Reusing existing pronunciation sources is critical to improving system performance and to supporting recently coined words and more languages. One valuable source is Wiktionary, a multilingual, web-based project to create a free-content dictionary of all words in all languages. Although Wiktionary contains pronunciations for many words in multiple languages in a standard notation such as IPA, it is not easy to parse those pronunciations from pages that use different formats, or to convert phonemes from different languages into one common format, such as CMUBET, that Sphinx can use. This project will address those problems and produce at least 10 pronunciation dictionaries, which will be tested on several ASR benchmarks for Sphinx.
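
The two steps described above, parsing IPA out of page source and converting it to a Sphinx-style phone set, can be sketched roughly as follows. This is only an illustration, not the project's implementation: the `{{IPA|lang|/.../}}` wikitext shape is an assumption about how entries are marked up, the mapping table is a tiny hypothetical subset, and the function names `extract_ipa`, `ipa_to_cmubet`, and `dict_line` are made up for this example.

```python
import re

# Assumed wikitext shape: {{IPA|<lang>|/<transcription>/}}. Real pages vary widely.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|([a-z-]+)\|/([^/]+)/")

# Hypothetical, deliberately tiny IPA -> CMUBET-style table for illustration only.
IPA_TO_CMUBET = {
    "tʃ": "CH", "k": "K", "æ": "AE", "t": "T", "d": "D", "ɔ": "AO", "ɡ": "G",
}

# Marks that carry no phone of their own in this sketch (stress, syllable break, length).
IGNORED = "ˈˌ.ː"


def extract_ipa(wikitext):
    """Return (language, transcription) pairs found in the page source."""
    return [(m.group(1), m.group(2)) for m in IPA_TEMPLATE.finditer(wikitext)]


def ipa_to_cmubet(ipa):
    """Convert an IPA string to a list of phones using greedy longest match."""
    keys = sorted(IPA_TO_CMUBET, key=len, reverse=True)
    phones = []
    i = 0
    while i < len(ipa):
        for key in keys:
            if ipa.startswith(key, i):
                phones.append(IPA_TO_CMUBET[key])
                i += len(key)
                break
        else:
            if ipa[i] in IGNORED:
                i += 1  # skip stress and length marks
            else:
                raise ValueError(f"no mapping for IPA symbol {ipa[i]!r}")
    return phones


def dict_line(word, phones):
    """Format one dictionary entry as the word followed by its phones."""
    return f"{word} {' '.join(phones)}"


if __name__ == "__main__":
    page = "* {{IPA|en|/kæt/}}"
    for lang, ipa in extract_ipa(page):
        print(lang, dict_line("cat", ipa_to_cmubet(ipa)))
```

Matching the longest symbol first keeps multi-character phonemes such as `tʃ` from being split into separate phones, and raising an error on unmapped symbols makes gaps in a language's table visible rather than silently dropping sounds.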