
Redact PII during data preprocessing

Alexander Chueshev requested to merge preprocessing-redact-pii into main

To redact PII, we follow the same approach as the HF SantaCoder project.

Entities this change can identify:

  • emails, using a regular expression
  • IPv4/IPv6 addresses, using regular expressions plus additional filters to remove false positives
    • we mask only public IP addresses
    • we require that the IP address is not a popular DNS address like 8.8.8.8
  • secrets, using detect-secrets
    • we require the secret to look like gibberish
    • we do not mask hashes if keywords such as hash or md5 appear in the context of the secret
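
A minimal sketch of the email and IP detection filters described above, using only the Python standard library. The regular expressions, the POPULAR_DNS set, and all function names are illustrative assumptions, not the patterns used in this merge request or in SantaCoder; secret detection via detect-secrets is not shown.

```python
import ipaddress
import re

# Assumed, deliberately simple patterns; the real ones may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_CANDIDATE_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical allowlist of well-known public DNS resolvers that stay unmasked.
POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"}


def find_emails(text):
    """Return all substrings that look like email addresses."""
    return EMAIL_RE.findall(text)


def should_mask_ip(candidate):
    """True only for valid, public addresses that are not popular DNS servers."""
    try:
        ip = ipaddress.ip_address(candidate)  # accepts both IPv4 and IPv6
    except ValueError:
        return False  # regex false positive such as 999.1.1.1
    if candidate in POPULAR_DNS:
        return False
    return ip.is_global  # private, loopback, and reserved ranges are skipped


def find_public_ips(text):
    """Return IPv4 candidates in the text that pass the filters."""
    return [m for m in IPV4_CANDIDATE_RE.findall(text) if should_mask_ip(m)]
```

Validating candidates with ipaddress rather than a stricter regex keeps the pattern short and lets the same filter handle both address families.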

More details in https://arxiv.org/abs/2301.03988 (Section 4)

Masks we apply to redact PII:

  • emails => random example email with the format xxxx@example.com
  • public IPs => private IP addresses (v4 or v6, matching the version of the original address) randomly selected from a predefined list
  • secrets => random string of the same length
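
The three masks above could be sketched as follows. The replacement address lists, the four-letter local part for emails, the alphanumeric alphabet for secrets, and the function names are all assumptions made for illustration; the predefined list used in this merge request is not reproduced here.

```python
import ipaddress
import random
import string

# Hypothetical replacement pools; the actual predefined lists may differ.
PRIVATE_IPV4 = ["10.0.0.1", "172.16.0.1", "192.168.0.1"]
PRIVATE_IPV6 = ["fd00::1", "fd12:3456:789a::1"]


def mask_email(rng):
    """Return a random address of the form xxxx@example.com."""
    user = "".join(rng.choices(string.ascii_lowercase, k=4))
    return f"{user}@example.com"


def mask_ip(original, rng):
    """Return a private address of the same IP version as the original."""
    version = ipaddress.ip_address(original).version
    return rng.choice(PRIVATE_IPV4 if version == 4 else PRIVATE_IPV6)


def mask_secret(secret, rng):
    """Return a random alphanumeric string with the same length as the secret."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choices(alphabet, k=len(secret)))
```

Passing a random.Random instance (for example random.Random(0)) instead of using the module-level generator makes the masking reproducible across preprocessing runs.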

Ref: ai-assist#22 (closed)

Edited by Alexander Chueshev
