Implement Pattern Matching for Personal Data
What?
After extracting text from videos, check for patterns of personal data leaks (phone numbers and personal email addresses).
Why?
Phone numbers and email addresses are commonly leaked types of personal data which can (somewhat) reliably be detected via Regular Expressions.
The bug bounty program rewards people for reporting leaked personal data (via YouTube or elsewhere), so this could save our company money and work.
How?
Use big ugly regular expressions to match phone numbers and emails. Here are some examples:
Phone number:
- (\b(\+\d{1,2}\s)?\(?\d{3}\)?[\s+.-]\d{3}[\s+.-]\d{4}\b)|((?:\+|%2B)[1-9]\d{6,14}\b) (https://github.com/ankane/pdscan/blob/master/internal/rules.go#L57)
- \\b(\\+\\d{1,2}[-\\. ])?\\(?\\d{3}\\)?[-\\. ]\\d{3}[-\\. ]\\d{4}\\b (https://github.com/americanexpress/earlybird/blob/main/config/rules/content.yaml#L56)
- ((\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4} (https://stackoverflow.com/a/56450924)
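As a rough sketch of how the first phone pattern above could be applied to extracted text, the following Python snippet runs it with the standard `re` module. The function name `find_phone_numbers` and the sample text are made up for illustration; nothing here is an existing part of our tooling.

```python
import re
from typing import List

# First phone pattern from the list above (pdscan), copied unchanged.
PHONE_PATTERN = re.compile(
    r"(\b(\+\d{1,2}\s)?\(?\d{3}\)?[\s+.-]\d{3}[\s+.-]\d{4}\b)"
    r"|((?:\+|%2B)[1-9]\d{6,14}\b)"
)


def find_phone_numbers(text: str) -> List[str]:
    """Return every substring of `text` that the pattern matches, in order."""
    # finditer + group(0) gives the full match; findall would return
    # tuples of capture groups instead.
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Call 555-123-4567 or +14155552671 today"
    print(find_phone_numbers(sample))
```

The first alternative requires a separator between digit groups, so an unseparated ten-digit run is only caught by the second alternative, and only when it has a leading `+` or `%2B`.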
Email:
- \b[\w][\w+.-]+(@|%40)[a-z\d-]+(\.[a-z\d-]+)*\.[a-z]+\b (https://github.com/ankane/pdscan/blob/master/internal/rules.go#L53)
- ([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?) (https://github.com/americanexpress/earlybird/blob/main/config/rules/content.yaml#L56)
Or the following regex from StackOverflow (https://stackoverflow.com/a/201378):
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
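A minimal sketch of email matching, using the shorter pdscan pattern from the list above rather than the long StackOverflow one. Two things here are assumptions and not taken from the linked sources: the `re.IGNORECASE` flag (added because the pattern's domain part only lists lowercase letters) and the helper name `find_emails`.

```python
import re
from typing import List

# pdscan email pattern from the list above, copied unchanged.
# IGNORECASE is an assumption so that upper-case domains also match.
EMAIL_PATTERN = re.compile(
    r"\b[\w][\w+.-]+(@|%40)[a-z\d-]+(\.[a-z\d-]+)*\.[a-z]+\b",
    re.IGNORECASE,
)


def find_emails(text: str) -> List[str]:
    """Return every substring of `text` that the pattern matches, in order."""
    # group(0) is the full match, including a URL-encoded "%40" if present.
    return [match.group(0) for match in EMAIL_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or bob%40mail.test.org."
    print(find_emails(sample))
```

Note that this pattern needs at least two characters before the `@`, so a one-character local part is not matched.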
Additional Info/Thoughts
I suspect we have a bunch of email addresses exposed in YouTube videos where alerting would not be helpful or necessary (e.g. @gitlab.com (can be obtained in Git history), @example.com (used in demos)), so we'd want to refine this.
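One possible refinement, sketched here under assumptions, is to drop matches whose domain is on an ignore list before alerting. The names `IGNORED_DOMAINS`, `email_domain` and `filter_reportable` are hypothetical, and the list contains only the two domains mentioned above.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical ignore list; only the two domains named in this issue.
IGNORED_DOMAINS: FrozenSet[str] = frozenset({"gitlab.com", "example.com"})


def email_domain(address: str) -> str:
    """Return the lower-cased domain part of a matched address."""
    # Matches may contain a URL-encoded "@" (%40), so normalize first.
    normalized = address.replace("%40", "@")
    return normalized.rsplit("@", 1)[-1].lower()


def filter_reportable(
    addresses: Iterable[str],
    ignored_domains: FrozenSet[str] = IGNORED_DOMAINS,
) -> List[str]:
    """Keep only addresses whose domain is not on the ignore list."""
    return [a for a in addresses if email_domain(a) not in ignored_domains]


if __name__ == "__main__":
    found = ["dev@gitlab.com", "jane@EXAMPLE.com", "someone@gmail.com"]
    print(filter_reportable(found))
```

The comparison is an exact domain match, so a subdomain of an ignored domain would still be reported; whether that is the desired behavior is an open question.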