This project aims to build a large-scale speaker recognition system for tagging speakers in CNN news recordings on top of the existing Red Hen audio processing pipeline. The pipeline's current speaker recognition system, although fully functional, is not yet ready to process the news videos archived in NewsScape. The main reason is a lack of training data from relevant speakers to enroll into the system. Having human experts extract such training data manually is inefficient, time-consuming, or even infeasible at large scale. However, a significant part of the archive (videos recorded over a decade, since 2007) comes with transcripts (the tpt files) that contain caption and speaker information and are saved along with the videos. These transcripts can potentially be used to extract speaker training data automatically, provided they are accurately aligned with the audio. The core idea of this proposal is therefore to establish a workflow in the current audio pipeline that goes from Gentle alignment to accurate timestamps of speech boundaries, then to speaker training data, and finally to identified speakers.
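As a rough illustration of the middle step of that workflow (from aligned words to speaker training segments), here is a minimal sketch in Python. It assumes word-level alignments shaped like Gentle's JSON output, where each word carries a `case` field and, when aligned, `start` and `end` times in seconds. The `speaker_turns` structure, the `Segment` class, the `speaker_segments` function, and the `min_duration` threshold are hypothetical names introduced only for this example; they are not the actual tpt format or any part of the Red Hen pipeline.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A candidate training segment for one speaker (hypothetical type)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def speaker_segments(aligned_words, speaker_turns, min_duration=1.0):
    """Turn word-level alignments into per-speaker time spans.

    aligned_words: list of dicts in Gentle-style form; successfully aligned
        words have case == "success" plus "start" and "end" times.
    speaker_turns: list of (speaker, first_word_index, last_word_index)
        tuples, an assumed structure derived from the transcript.
    min_duration: assumed threshold; shorter spans are dropped.
    """
    segments = []
    for speaker, first, last in speaker_turns:
        # Keep only words that the aligner actually located in the audio.
        aligned = [
            w for w in aligned_words[first:last + 1]
            if w.get("case") == "success"
        ]
        if not aligned:
            continue
        start, end = aligned[0]["start"], aligned[-1]["end"]
        if end - start >= min_duration:
            segments.append(Segment(speaker, start, end))
    return segments
```

Segments produced this way would still need filtering (for example, for overlapping speech or misattributed captions) before being used to enroll speakers.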





  • Mark Turner
  • Jacek Wozny