Adds regex testing script

Greg Alfaro requested to merge galfaro-branch into main

We can use this MR to collaborate, @fvpotvin. It doesn't need to be merged until we nail down the solution, so any changes can simply be pushed here as commits.

Process:

  1. Download & extract the following OCR archive link
  2. Run jq against one of the files with a command similar to
     jq '.[].text_annotations[]?.text' yt_unfiltered_3QJggoLsCxE.mp4.json > yt_unfiltered_3QJggoLsCxE.txt
  3. Ingest the resulting file via something like:
     ruby ./scanner-ingest-all.rb yt_unfiltered_3QJggoLsCxE.txt
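
For discussion, here is a rough Ruby sketch of the jq and ingest steps above, applied to every file in an extracted archive. It is not part of this MR's script. It assumes the top-level JSON value is an array (or object) of per-frame entries and that text_annotations, when present, is an array of objects with a text key. The helper names extract_texts, convert_dir and ingest_all are hypothetical. Note that jq without -r prints JSON-quoted strings, so the sketch writes the same quoted form.

```ruby
require "json"

# Roughly mirrors: jq '.[].text_annotations[]?.text'
# Assumption: each frame is an object whose "text_annotations" (if present)
# is an array of objects with a "text" key.
def extract_texts(json_string)
  parsed = JSON.parse(json_string)
  frames = parsed.is_a?(Hash) ? parsed.values : parsed
  frames.flat_map do |frame|
    annotations = frame.is_a?(Hash) ? frame["text_annotations"] : nil
    next [] unless annotations.is_a?(Array)

    annotations.map { |annotation| annotation["text"] }
  end
end

# Writes one .txt next to each .json in dir and returns the .txt paths.
# Each line is a JSON-quoted string, matching jq's default (non -r) output.
def convert_dir(dir)
  Dir.glob(File.join(dir, "*.json")).sort.map do |json_path|
    txt_path = json_path.sub(/(\.mp4)?\.json\z/, ".txt")
    lines = extract_texts(File.read(json_path)).map(&:to_json)
    File.write(txt_path, lines.map { |line| "#{line}\n" }.join)
    txt_path
  end
end

# Runs the ingest script on every converted file.
# Assumption: the script sits in the working directory, as in the command above.
def ingest_all(dir, script: "./scanner-ingest-all.rb")
  convert_dir(dir).each { |txt_path| system("ruby", script, txt_path) }
end
```

Calling ingest_all with the path of the extracted archive would convert and ingest each file in turn; convert_dir can be run on its own to inspect the generated .txt files first.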

To Do:

  • figure out why we can't ingest the archive's individual files directly, before formatting them with jq (i.e. yt_unfiltered_3QJggoLsCxE.mp4.json).
  • automate ingesting archives
  • compare results against our known truth file
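
As a starting point for the comparison item, here is a minimal Ruby sketch. It assumes both the scanner results and the known truth file are plain text with one entry per line, which this MR does not state, so the loading step will likely need adjusting. The helper names load_entries and compare_to_truth are hypothetical.

```ruby
require "set"

# Assumption: one entry per line; blank lines and surrounding whitespace ignored.
def load_entries(path)
  File.readlines(path, chomp: true).map(&:strip).reject(&:empty?).to_set
end

# Returns how many entries matched, which truth entries were missed,
# and which found entries do not appear in the truth set.
def compare_to_truth(found, truth)
  {
    matched: (found & truth).size,
    missed: (truth - found).to_a.sort,
    unexpected: (found - truth).to_a.sort
  }
end
```

Passing the two sets from load_entries to compare_to_truth gives a quick summary to track while the regexes change.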
Edited by Greg Alfaro
