Commit 1270952f authored by Andy Castille

use textract

parent 061f847a
Pipeline #160606240 failed with stage
in 49 seconds
@@ -20,6 +20,11 @@ Be able to search a data dump without storing **personally-identifiable informat
## Usage
### Dependencies
For any file type listed in the [textract docs](https://textract.readthedocs.io/en/latest/#currently-supporting)
that does not say "via python builtins", you will need to install the dependency listed on your system.
### Indexing
(From repo root)
@@ -35,6 +40,7 @@ For example:
PowerShell: `pii-indexer (Get-ChildItem -Path .\dump\folder1 -Filter *.csv -Recurse -File -Name) (Get-ChildItem -Path .\dump\folder2 -Filter *.csv -Recurse -File -Name)`
Bash: `pii-indexer ./dump/folder1/**/*.csv ./dump/folder2/**/*.csv`
(you may need to run `shopt -s globstar` first)
Sometimes, there are too many files to do this all at once. In this case, you can
use [GNU Parallel](https://www.gnu.org/software/parallel/) to run one instance of pii-indexer
......
+import textract
from pii_indexer.database import Database
from pii_indexer.patterns import PATTERNS, normalize
@@ -10,9 +12,8 @@ class Scanner:
         self._current_file_name: str = str()
     def scan(self, file_path: str):
-        file = open(file_path)
         self._current_file_name = file_path
-        file_contents = file.read()
+        file_contents = textract.process(file_path)
         print(
             f"Scanning file: {file_path} [{len(file_contents)}] ".ljust(
                 FORMAT_WIDTH, "-"
@@ -31,5 +32,3 @@ class Scanner:
self.database.add_data(
data_type, text, self._current_file_name, index,
)
-        file.close()
@@ -20,4 +20,5 @@ setuptools.setup(
"pii-export=pii_indexer:export",
],
),
+    install_requires=["textract"],
)
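
Review note: as far as I know, `textract.process` returns `bytes` rather than `str`, whereas the old `file.read()` returned `str`. If the expressions in `PATTERNS` are compiled from `str` (an assumption, since that module is not part of this diff), matching them against the raw output would raise a `TypeError`, and `len(file_contents)` would report bytes instead of characters. A minimal sketch of one way to handle this follows; the `extract_text` helper and its `extractor` parameter are hypothetical names, not part of the commit.

```python
from typing import Callable


def extract_text(
    file_path: str,
    extractor: Callable[[str], bytes],
    encoding: str = "utf-8",
) -> str:
    """Run a bytes-returning extractor and decode its output to str."""
    raw = extractor(file_path)
    # Replace undecodable bytes instead of raising, so one malformed
    # file does not abort a long indexing run.
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Stand-in extractor for demonstration only. In the scanner this
    # would be textract.process, assumed here to return UTF-8 bytes.
    def fake_extractor(path: str) -> bytes:
        return "name: José, phone: 555-0100".encode("utf-8")

    text = extract_text("example.csv", fake_extractor)
    print(type(text).__name__, len(text))
```

In `Scanner.scan` this would amount to `file_contents = extract_text(file_path, textract.process)`. Passing the extractor in as a parameter also lets the decoding step be tested without textract's system dependencies installed.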