Experiment with Tesseract OCR and CCExtractor on video streams with hard-coded subtitles

Much of the Chinese video available from various sources, including informally sourced DVDs, has hard-coded subs. For transcrobed subtitles to be useful, these subs will need to be removed. Worse, text versions of the subtitles can also be difficult to find. What is needed is for the system to detect hard-coded subtitles (there may be more than one stream), OCR them (the Chinese ones at least), blur them, and write a new, enriched subtitle stream over the blur.
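As a rough illustration of the "OCR, then write a new subtitle stream" step, here is a minimal sketch that collapses per-frame OCR results into timed SRT cues. It assumes the OCR step (whatever it ends up being) yields `(timestamp_seconds, text)` samples at a fixed interval. That input format, the `Cue` class and all function names are hypothetical, not part of CCExtractor, Tesseract or Transcrobes.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def frames_to_cues(frames, sample_interval=0.5, tolerance=1e-6):
    """Merge consecutive samples that carry the same text into one cue.

    frames: iterable of (timestamp_seconds, recognised_text) in time order
    (assumed format). Empty text means no subtitle was found in that sample.
    Each sample is assumed to cover [timestamp, timestamp + sample_interval).
    """
    cues = []
    for ts, text in frames:
        text = text.strip()
        if not text:
            continue
        last = cues[-1] if cues else None
        if last and last.text == text and ts <= last.end + tolerance:
            # Same subtitle is still on screen: extend the current cue.
            last.end = ts + sample_interval
        else:
            cues.append(Cue(start=ts, end=ts + sample_interval, text=text))
    return cues


def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Render cues as SRT text: index, time range, text, blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        time_range = f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}"
        blocks.append(f"{index}\n{time_range}\n{cue.text}\n")
    return "\n".join(blocks)
```

Real OCR output will be noisier than this assumes (the same subtitle may be read slightly differently from frame to frame), so an exact string match is probably too strict and some fuzzy comparison would be needed in practice.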

CCExtractor claims to be able to OCR hard-coded subs using Tesseract OCR internally, but there appear to be bugs (https://github.com/CCExtractor/ccextractor/issues/1008). The code that calls Tesseract looks a little dubious (to a non-C++ dev, admittedly!), but this would be a truly massive feature for Transcrobes, at least for L2 Chinese, so it is definitely worth taking a look.

If this can be made to work reasonably well, then all that is left is to reliably (and automatically) find where the subtitle streams sit in the frame. Locating them may be a prerequisite for the OCR part, may be provided by Tesseract/CCExtractor (as coordinates of the characters found), or may need a combination of the two. In any case, the system will also need to know what to blur, so this is important.
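To make the "what to blur" question concrete, here is a minimal sketch that turns OCR character or word boxes into a single rectangle to blur. It assumes each box is a `(left, top, width, height, confidence)` tuple in pixels, which is roughly the information Tesseract can report per word, and it assumes the subtitles sit in the bottom band of the frame. The function name, the tuple layout and the thresholds are all hypothetical and would need tuning against real video.

```python
def subtitle_region(boxes, frame_width, frame_height,
                    min_conf=60.0, band=0.35, padding=8):
    """Return (left, top, width, height) of the area to blur, or None.

    boxes: iterable of (left, top, width, height, confidence) tuples
    (assumed format). Boxes are kept only if the OCR confidence is at least
    min_conf and the box starts inside the bottom `band` fraction of the
    frame (a heuristic for where subtitles usually are).
    """
    band_top = frame_height * (1 - band)
    kept = [b for b in boxes if b[4] >= min_conf and b[1] >= band_top]
    if not kept:
        return None

    # Bounding rectangle around every kept box, grown by `padding` pixels.
    left = min(b[0] for b in kept) - padding
    top = min(b[1] for b in kept) - padding
    right = max(b[0] + b[2] for b in kept) + padding
    bottom = max(b[1] + b[3] for b in kept) + padding

    # Clamp to the frame so the blur never reaches outside the picture.
    left, top = max(0, left), max(0, top)
    right, bottom = min(frame_width, right), min(frame_height, bottom)
    return (left, top, right - left, bottom - top)
```

With two hard-coded streams stacked in the lower part of the frame (for example Chinese above English), this returns one rectangle covering both. Streams placed elsewhere, such as at the top of the frame, would need a different heuristic, for example clustering boxes by vertical position across many frames.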
