Adds regex testing script
We could use this MR for collaboration, @fvpotvin. I don't see it as necessary to merge this until we nail down the solution, so any changes can simply be added as commits.
Process:
- Download & extract the following OCR archive link
- run jq against one of the files with a command similar to:
jq '.[].text_annotations[]?.text' yt_unfiltered_3QJggoLsCxE.mp4.json > yt_unfiltered_3QJggoLsCxE.txt
- ingest the file via something like:
ruby ./scanner-ingest-all.rb yt_unfiltered_3QJggoLsCxE.txt
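For reference, the jq extraction step above could also be sketched in Ruby, which may help with automating the process later. This is only a rough sketch: `extract_texts` and `write_texts` are hypothetical helpers that are not part of this MR, and the assumed JSON layout (a top-level collection whose elements may carry a `text_annotations` array of objects with a `text` key) is inferred solely from the jq filter.

```ruby
require "json"

# Hypothetical helper (not part of this MR): collects the "text" value of
# every entry under "text_annotations", roughly like
#   jq '.[].text_annotations[]?.text'
# Elements without a "text_annotations" array are skipped, and annotations
# without a "text" value are dropped (jq would print null for those).
def extract_texts(data)
  items = data.is_a?(Hash) ? data.values : Array(data)
  items.flat_map do |item|
    annotations = item.is_a?(Hash) ? item["text_annotations"] : nil
    next [] unless annotations.is_a?(Array)

    annotations.map { |annotation| annotation.is_a?(Hash) ? annotation["text"] : nil }
  end.compact
end

# Hypothetical wrapper: parses one OCR JSON file and writes one JSON-encoded
# string per line, which is the shape jq produces when run without -r.
# Returns the number of texts written.
def write_texts(json_path, txt_path)
  texts = extract_texts(JSON.parse(File.read(json_path)))
  File.write(txt_path, texts.map { |text| "#{text.to_json}\n" }.join)
  texts.length
end
```

Something like `write_texts("yt_unfiltered_3QJggoLsCxE.mp4.json", "yt_unfiltered_3QJggoLsCxE.txt")` would then stand in for the jq command. Whether the ingest script expects the quoted form or raw text has not been checked here.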
To Do:
- figure out why we can't just ingest the individual files of the archive before formatting (i.e. yt_unfiltered_3QJggoLsCxE.mp4.json)
- automate ingesting archives
- compare results against our known truth file
Edited by Greg Alfaro