In the past decade, Deep Neural Networks (DNNs) have been demonstrated to excel at a wide range of learning tasks, including speech recognition. More recently, robust DNN-based speech recognition techniques have been developed and adopted by all major commercial systems, including those from Google and Amazon. In this setting, the GMM-HMM based speech recognition framework used by Sphinx is severely outdated and requires an upgrade to perform on par with state-of-the-art systems.
I propose a two-tiered approach to acoustic scoring that uses concepts from Convolutional Neural Networks (CNNs) for feature extraction and Gated Recurrent Units (GRUs) for acoustic modelling. The first tier consists of convolutional filters and pooling layers that produce convolutional features from each audio frame. The second tier consists of a single GRU-based DNN that estimates the output probabilities for each HMM state. The optimal state sequence can then be found using Viterbi decoding. Over the course of the project, this setup will be optimised by experimenting with different structural parameters for the DNNs as well as different learning techniques.
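The two-tiered pipeline can be sketched in plain NumPy as follows. All dimensions, weight initialisations, and helper names here are illustrative assumptions rather than part of the proposal; a real system would use trained weights, a proper HMM topology, and a DNN toolkit rather than random matrices.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the proposal): 40 filterbank
# coefficients per frame, 64 conv filters of width 5 over time, max pooling
# over pairs of frames, 128 GRU units, and 48 HMM states.
rng = np.random.default_rng(0)
n_frames, n_mel = 20, 40
n_filt, k = 64, 5
n_hidden, n_states = 128, 48

frames = rng.standard_normal((n_frames, n_mel))  # stand-in for real audio features

# Tier 1: convolutional filters over time, ReLU, then max pooling.
W_conv = 0.1 * rng.standard_normal((n_filt, k, n_mel))
conv = np.stack([
    np.maximum(0.0, np.einsum('fkm,km->f', W_conv, frames[t:t + k]))
    for t in range(n_frames - k + 1)
])                                               # (16, n_filt)
pooled = conv.reshape(-1, 2, n_filt).max(axis=1) # (8, n_filt)

# Tier 2: a single GRU layer mapping each pooled feature vector to
# per-state posteriors through a softmax output layer.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Wz = 0.1 * rng.standard_normal((n_hidden, n_filt + n_hidden))
Wr = 0.1 * rng.standard_normal((n_hidden, n_filt + n_hidden))
Wh = 0.1 * rng.standard_normal((n_hidden, n_filt + n_hidden))
W_out = 0.1 * rng.standard_normal((n_states, n_hidden))

h = np.zeros(n_hidden)
rows = []
for x in pooled:
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                          # update gate
    r = sigmoid(Wr @ xh)                          # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h]))
    h = (1.0 - z) * h + z * h_cand
    logits = W_out @ h
    e = np.exp(logits - logits.max())
    rows.append(e / e.sum())                      # softmax -> state posteriors
probs = np.array(rows)                            # (8, n_states)

# Viterbi decoding over the per-frame posteriors, with uniform transitions
# and priors as a placeholder for a real HMM topology.
def viterbi(log_obs, log_trans, log_init):
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_uniform = np.full((n_states, n_states), -np.log(n_states))
path = viterbi(np.log(probs + 1e-12), log_uniform,
               np.full(n_states, -np.log(n_states)))
```

The sketch only runs a forward pass; training the two tiers jointly (and tuning the structural parameters mentioned above) is the substance of the project.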