Red Hen Lab's current Audio Pipeline can be extended to support speech recognition. This project proposes developing a deep neural network speech-to-text module for the pipeline, based on the Deep Speech paper. The aim is to use both the audio and visual modalities to achieve speech recognition.
The initial goal is to extend the current Deep Speech model (audio only) to Red Hen Lab's TV news video datasets. The next goal is to develop a multi-modal speech-to-text system (AVSR) by extracting visual features and concatenating them with the audio inputs.
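The multi-modal input can be formed by frame-aligning the two streams and joining them along the feature axis. A minimal NumPy sketch, where the frame counts and feature dimensions are illustrative assumptions rather than Red Hen specifics:

```python
import numpy as np

# Hypothetical per-frame features (shapes are assumptions for illustration):
# 100 time frames of 26 audio coefficients (e.g. filterbank features)
audio_feats = np.random.rand(100, 26)
# 100 frame-aligned visual features (e.g. a lip-region embedding per frame)
visual_feats = np.random.rand(100, 50)

# Concatenate along the feature axis so each time step carries both modalities.
av_feats = np.concatenate([audio_feats, visual_feats], axis=1)
print(av_feats.shape)  # (100, 76)
```

The model's input layer would then be widened to accept the combined feature dimension; in practice the two streams also need to be resampled to a common frame rate before concatenation.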
I plan to develop four versions of the Speech to Text module over the 12-week period.
- Version 1: Rewrite the Deep Speech model to support audio inputs from the Red Hen Lab datasets.
- Version 2: Improve the results using either an N-gram language model or a spell-check system.
- Version 3: Extract visual features, concatenate them with the audio features, and modify Deep Speech's input layer accordingly.
- Version 4: Improve the results using the same approach as in Version 2, plus tracking the actual speaker's lips, etc.
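For Version 2, one lightweight post-processing option is a spell-check pass that snaps decoded words to the closest entry in a known vocabulary. A toy sketch using Python's standard-library `difflib`; the vocabulary and decoded string are made up for illustration, and a real system would build the lexicon from the training transcripts (or use a proper N-gram language model during decoding instead):

```python
import difflib

# Toy vocabulary; an assumption for illustration only.
VOCAB = ["news", "weather", "report", "tonight", "breaking"]

def correct(word, vocab=VOCAB):
    """Replace a decoded word with its closest in-vocabulary match, if any."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.7)
    return matches[0] if matches else word

# Hypothetical noisy output from the acoustic model.
decoded = "braking newz reprot"
fixed = " ".join(correct(w) for w in decoded.split())
print(fixed)  # breaking news report
```

This character-similarity approach only repairs near-miss spellings; an N-gram model goes further by rescoring whole hypotheses with word-sequence probabilities, which is why the proposal considers both.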