Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Red Hen currently has a preliminary ASR pipeline, but it needs substantial improvement. This proposal has two parts. The first is to improve the ASR pipeline in three steps: find a source of correct transcripts for the shows; use a different method to cut the audio; and train new models on the data. The second is to build a concrete Chinese NLP pipeline covering basic tasks such as data ingestion, word segmentation, and part-of-speech tagging.



Ziyi Liu


  • Zhaoqing Xu