Contributor: Karolina Stosio

Audio embedding space in a MultiTask architecture

Mentors: Karan Singla
Organization: Red Hen Lab

Auditory stimuli such as music, radio recordings, movie soundtracks, and everyday speech are widely used in research. While it is easy for a human to recognize the emotional load of Bach symphonies, how can a computer do the same? Current algorithms can analyze low-level features such as signal energy. These features are far from capturing how the stimuli actually sound to us, and the best we can do is ask a human subject to judge. This project aims to bridge the gap between human and machine understanding of sound by building an audio embedding space. Such embeddings have proven extremely successful for text (word2vec) and images (image recognition with convolutional neural networks such as VGG). The aim of this project is to offer similar insight into the nature of audio stimuli. The embedding space will be obtained from a machine learning model trained to perform various audio tasks. The deliverables are: the architecture (a data loading and preprocessing pipeline, a deep neural network for universal feature extraction, and classifiers trained separately for each task), ready-to-load optimized parameters, and a use-case tutorial (GitHub repository).
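
To make the planned structure concrete, here is a minimal sketch of a shared feature extractor with one classifier head per task. It is an illustration under stated assumptions, not the project's implementation: the task names ("emotion", "genre"), the layer sizes, and the helper names are hypothetical, a single untrained linear layer stands in for the deep network, and frame energy is used as input only because it is the low-level feature mentioned above.

```python
import math
import random


def frame_energy(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame (a low-level feature)."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


class Linear:
    """Dense layer with randomly initialised (untrained) weights."""

    def __init__(self, n_in, n_out, rng):
        scale = 1.0 / math.sqrt(n_in)
        self.w = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


class MultiTaskModel:
    """One shared encoder producing the embedding, plus one head per task."""

    def __init__(self, n_features, embed_dim, task_classes, seed=0):
        rng = random.Random(seed)
        self.encoder = Linear(n_features, embed_dim, rng)
        # Hypothetical tasks: each gets its own classifier on the shared embedding.
        self.heads = {name: Linear(embed_dim, n_classes, rng)
                      for name, n_classes in task_classes.items()}

    def embed(self, features):
        return [math.tanh(v) for v in self.encoder(features)]

    def predict(self, features, task):
        return softmax(self.heads[task](self.embed(features)))


# One second of a 440 Hz sine sampled at 16 kHz, split into 10 frames.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
features = frame_energy(signal, frame_len=1600)

model = MultiTaskModel(n_features=10, embed_dim=4,
                       task_classes={"emotion": 3, "genre": 5})
embedding = model.embed(features)                  # shared across all tasks
emotion_probs = model.predict(features, "emotion")
genre_probs = model.predict(features, "genre")
```

Both heads read the same `embedding`, so training on several tasks would shape one common representation; that shared vector is what an audio embedding space would expose for reuse.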