Adds regex testing script (!5)

Greg Alfaro requested to merge galfaro-branch into main Sep 06, 2023

We could use this MR for collaboration, @fvpotvin. I don't think it needs to be merged until we nail down the solution, so any changes can simply be added as commits.

Process:

Download and extract the following OCR archive: link
Run jq against one of the files with a command similar to:

jq '.[].text_annotations[]?.text' yt_unfiltered_3QJggoLsCxE.mp4.json > yt_unfiltered_3QJggoLsCxE.txt

Ingest the resulting text file via something like:

ruby ./scanner-ingest-all.rb yt_unfiltered_3QJggoLsCxE.txt
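
For anyone reading the jq filter above for the first time, here is a rough Ruby equivalent of what `.[].text_annotations[]?.text` does. This is only a sketch: `extract_texts` is a hypothetical helper, not part of this MR, and it assumes each OCR file holds a top-level array (or object) of per-frame objects that may carry a `text_annotations` array.

```ruby
require "json"

# Hypothetical helper (not part of this MR): mirrors the jq filter
# '.[].text_annotations[]?.text' under the assumptions stated above.
def extract_texts(doc)
  # jq's `.[]` iterates array elements or object values.
  frames = doc.is_a?(Hash) ? doc.values : Array(doc)
  frames.flat_map do |frame|
    annotations = frame.is_a?(Hash) ? frame["text_annotations"] : nil
    # Like jq's `[]?`, skip frames whose text_annotations is missing or null.
    next [] unless annotations.is_a?(Array)
    annotations.map { |a| a.is_a?(Hash) ? a["text"] : nil }
  end
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  doc = JSON.parse(File.read(ARGV[0]))
  # jq without -r prints each string JSON-encoded (quoted, escapes kept),
  # so emit the same form here.
  extract_texts(doc).each { |text| puts text.to_json }
end
```

Note that because the jq command is run without `-r`, each line of the .txt file is a JSON-quoted string, which may matter for the regexes being tested.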

To Do:

Figure out why we can't just ingest the individual files of the archive directly, before formatting them with jq (i.e. yt_unfiltered_3QJggoLsCxE.mp4.json).
Automate ingesting archives.
Compare results against our known truth file.
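
As a possible starting point for the "automate ingesting archives" item, the two manual steps could be looped over every file in an already-extracted archive directory. This is a sketch under assumptions: `text_path_for` and `commands_for` are hypothetical helpers, the directory is passed as the first argument, and jq and scanner-ingest-all.rb are invoked exactly as in the process above.

```ruby
require "shellwords"

JQ_FILTER = ".[].text_annotations[]?.text"

# Hypothetical helper: map an OCR result path to the text file name used in
# the manual process, e.g. foo.mp4.json -> foo.txt.
def text_path_for(json_path)
  unless json_path.end_with?(".mp4.json")
    raise ArgumentError, "expected a .mp4.json file, got #{json_path}"
  end
  json_path.sub(/\.mp4\.json\z/, ".txt")
end

# Hypothetical helper: build the two shell commands from the manual process
# (jq extraction, then ingest) for a single OCR file.
def commands_for(json_path)
  txt = text_path_for(json_path)
  [
    "jq #{Shellwords.escape(JQ_FILTER)} #{Shellwords.escape(json_path)} > #{Shellwords.escape(txt)}",
    "ruby ./scanner-ingest-all.rb #{Shellwords.escape(txt)}"
  ]
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  # ARGV[0] is assumed to be the directory the archive was extracted into.
  Dir.glob(File.join(ARGV[0], "*.mp4.json")).sort.each do |path|
    commands_for(path).each do |cmd|
      system(cmd) or abort("command failed: #{cmd}")
    end
  end
end
```

Building the commands separately from running them keeps the file-name mapping easy to check before anything touches the real archive.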

Edited Sep 07, 2023 by Greg Alfaro
