Contributor
Saurabh Shah

Improve the OCR subsystem


Mentors
Abhinav Shukla, Anshul Maheshwari
Organization
CCExtractor Development

CCExtractor's current text extraction system for burned-in subtitles depends on input parameters such as conf_thresh, subcolor, and whiteness_thresh, which are rather arbitrary and may vary from one video to another. The text localization algorithm also gives poor results in many cases because it classifies regions as text or non-text inefficiently. The current hardsubx system lacks ticker text extraction, which must be added. In addition, DVB subtitle extraction gives poor results in some cases.

The goal of this project is to implement a text localization and binarization pipeline that does not depend on any input parameter other than the video file. The new localization algorithm would also improve the OCR results and make the classification of frames into text and non-text regions more efficient. The project also aims to add ticker text extraction to the current hardsubx system. Finally, DVB subtitle extraction generates noise on the text regions, so an additional filtering step needs to be added to improve the results for DVB subtitles as well.
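
As an illustration of what parameter-free binarization can look like, the sketch below derives a threshold from the pixel histogram itself using Otsu's method, so no value like whiteness_thresh has to be supplied by the user. The project description does not say which binarization technique will be used, so the choice of Otsu's method is an assumption made here for illustration, and the names otsu_threshold and binarize are hypothetical and not part of CCExtractor.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance.

    `pixels` is a flat sequence of grayscale values in the range 0..255.
    Illustrative only: Otsu's method is an assumed technique, not one
    named by the project description.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (above threshold) or 0 (at or below it)."""
    return [1 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Synthetic frame region: dark background with bright subtitle pixels.
    region = [10] * 50 + [200] * 50
    t = otsu_threshold(region)
    print(t, sum(binarize(region, t)))
```

Because the threshold is computed per region from the data, the same code adapts to videos with different subtitle brightness. A real pipeline would likely apply it only inside localized text regions rather than to the whole frame.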