# PII Indexer

Search a data dump without storing **personally identifiable information** (PII).

## Installation

(From repo root)

`pip3 install .`

### Uninstallation

`pip3 uninstall pii-indexer`

### Upgrading

(From repo root)

`git pull && pip3 install --upgrade .`

## Usage

### Dependencies

For any file type in the [textract docs](https://textract.readthedocs.io/en/latest/#currently-supporting)
that is not marked "via python builtins", you will need to install the listed dependency on your system.

### Indexing

(From repo root)

`pii-indexer [input files]`

or, if the command is not on your PATH:

`python3 -m pii_indexer [input files]`

For example:

PowerShell: `pii-indexer (Get-ChildItem -Path .\dump\folder1 -Filter *.csv -Recurse -File -Name) (Get-ChildItem -Path .\dump\folder2 -Filter *.csv -Recurse -File -Name)`

Bash: `pii-indexer ./dump/folder1/**/*.csv ./dump/folder2/**/*.csv`
(you may need to run `shopt -s globstar` first)
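
To see why `globstar` matters before pointing `pii-indexer` at a real dump, the sketch below builds a throwaway directory tree and counts what the same pattern matches with the option off and on. It assumes bash 4 or later; the directory layout and the variable names (`tmp`, `without`, `with`) are made up for illustration and are not part of this project.

```bash
#!/usr/bin/env bash
# Throwaway tree: one CSV at the top level, one nested, one deeply nested.
tmp=$(mktemp -d)
mkdir -p "$tmp/folder1/a/b"
touch "$tmp/folder1/top.csv" "$tmp/folder1/a/mid.csv" "$tmp/folder1/a/b/deep.csv"

# Without globstar, ** behaves like a single *, so only files exactly
# one directory below folder1 match.
shopt -u globstar
without=("$tmp"/folder1/**/*.csv)

# With globstar, ** matches zero or more directory levels.
shopt -s globstar
with=("$tmp"/folder1/**/*.csv)

echo "without globstar: ${#without[@]} file(s)"
echo "with globstar:    ${#with[@]} file(s)"

rm -rf "$tmp"
```

If the second count is not larger than the first on your own data, either the folders are only one level deep or the shell in use does not support `globstar`.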

If there are too many files to pass in a single invocation, you can
use [GNU Parallel](https://www.gnu.org/software/parallel/) to run one instance of pii-indexer
per folder, using all available CPU cores:

```bash
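# Run from the directory that holds one sub-folder per data set.
cd <the extracted archive directory>
mkdir -p ../output
# One pii-indexer job per entry: -d names that job's SQLite file, stdout goes to a matching log.
ls -d * | parallel "pii-indexer -d ../output/{}.sqlite ./{}/**/*.csv > ../output/{}.log"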
```

### Exporting

Once you have SQLite files, you can generate a single SQL command file by running:

`pii-export -o my_export.sql ./output/*.sqlite`