# PII Indexer

Search a data dump without storing **personally identifiable information** (PII).

## Installation

(From repo root)

`pip3 install .`

### Uninstallation

`pip3 uninstall pii-indexer`

### Upgrading

(From repo root)

`git pull && pip3 install --upgrade .`

## Usage

### Dependencies

For any file type listed in the [textract docs](https://textract.readthedocs.io/en/latest/#currently-supporting) that does not say "via python builtins", you will need to install the listed dependency on your system.

### Indexing

(From repo root)

`pii-indexer [input files]`

or, if your PATH is not fully configured:

`python3 -m pii_indexer [input files]`

For example:

PowerShell:

`pii-indexer (Get-ChildItem -Path .\dump\folder1 -Filter *.csv -Recurse -File -Name) (Get-ChildItem -Path .\dump\folder2 -Filter *.csv -Recurse -File -Name)`

Bash:

`pii-indexer ./dump/folder1/**/*.csv ./dump/folder2/**/*.csv`

(You may need to run `shopt -s globstar` first so that `**` matches recursively.)

Sometimes there are too many files to pass in a single command. In that case, you can use [GNU Parallel](https://www.gnu.org/software/parallel/) to run one instance of pii-indexer per folder, using all available CPU cores. Each folder gets its own SQLite file and log file:

```bash
cd
mkdir -p ../output
ls -d * | parallel "pii-indexer -d ../output/{}.sqlite ./{}/**/*.csv > ../output/{}.log"
```

### Exporting

Once you have SQLite files, you can generate a single SQL command file by running:

`pii-export -o my_export.sql ./output/*.sqlite`
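
If GNU Parallel is not available, the per-folder split described under Indexing can be approximated with a small Python helper. This is only a sketch: `build_index_commands` is a hypothetical name and not part of pii-indexer, and the only thing it assumes about the CLI is the `-d` database option and the positional input files shown in the commands above. It builds one argument list per immediate subfolder of the dump and does not run anything itself.

```python
from pathlib import Path


def build_index_commands(dump_root, output_dir):
    """Return one pii-indexer argument list per immediate subfolder of dump_root.

    Hypothetical helper: it only assembles commands, it does not execute them.
    """
    dump_root = Path(dump_root)
    output_dir = Path(output_dir)
    commands = []
    for folder in sorted(p for p in dump_root.iterdir() if p.is_dir()):
        # Recursive search, comparable to ./folder/**/*.csv with globstar on
        csv_files = sorted(str(p) for p in folder.rglob("*.csv") if p.is_file())
        if not csv_files:
            continue  # skip folders that contain nothing to index
        database = output_dir / f"{folder.name}.sqlite"
        # Assumes the -d option and positional inputs used elsewhere in this README
        commands.append(["pii-indexer", "-d", str(database), *csv_files])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)`, optionally from a `concurrent.futures.ThreadPoolExecutor` to use several cores at once. Passing the files as a list also avoids relying on shell glob settings such as `globstar`.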