
document_analyzer

Mayan EDMS document analyzer. Define analyzer classes and assign them to document types. Results are stored in a param/value table related to the document version.


Description

This app refactors the Mayan exif app (https://gitlab.com/mayan-edms/exif) into a generic document analyzer app. It makes it easy to add document analysis functionality to Mayan EDMS and store the results in a generic table. The results can be used in Mayan indexes to structure your documents.

The analyzer is started after OCR has finished (post_document_version_ocr.connect(..)).

Available Analyzers

At the moment there are two Analyzers available:

  • document_analyzer.backends.exiftool.EXIFTool

This is the reused exiftool extension (https://gitlab.com/mayan-edms/exif).

  • document_analyzer.backends.regex.RegexTool

This is a simple regex-based analyzer. It makes it possible to configure a Python regex with a named group to parse the content of a document. The group name is used as the attribute name under which the matched string is stored in the result table. Example config strings:

  • Put the first date found in the content of the Document into the result attribute "date":

first;(?P<date>(?:3[0-1]|[0-2]?[\d])(?:/|\-|\.)(?:1[0-2]|[0-1]?[\d])(?:/|\-|\.)(?:[\d]{4}))

  • Put the first name (Tele2, Apple, Microsoft, or Billa) found in the content of the document into the result attribute "Creator":

first;(?i)(?P<Creator>Tele2|Apple|Microsoft|Billa)
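
The following is a minimal sketch of how such a config string could be interpreted. It is not the actual RegexTool code. The helper name apply_regex_config is hypothetical, and it assumes that the part before the semicolon selects the match mode ("first") and the part after it is the pattern. The date pattern is written with \d for digits and an escaped dot as separator, which is an assumption about the intended regex.

```python
import re


def apply_regex_config(config, text):
    """Hypothetical helper: split a 'mode;pattern' string and apply it to text.

    Returns a list of (param, value) tuples, one per matched named group.
    """
    mode, _, pattern = config.partition(";")
    if mode != "first":
        # Only the "first" mode shown in the README examples is sketched here.
        raise ValueError(f"unsupported mode: {mode!r}")
    match = re.search(pattern, text)
    if match is None:
        return []
    # Each named group becomes one result parameter.
    return [
        (name, value)
        for name, value in match.groupdict().items()
        if value is not None
    ]


# Assumed intended form of the date pattern (digits as \d, dot escaped).
DATE_CONFIG = (
    r"first;(?P<date>(?:3[0-1]|[0-2]?\d)(?:/|\-|\.)"
    r"(?:1[0-2]|[0-1]?\d)(?:/|\-|\.)(?:\d{4}))"
)
CREATOR_CONFIG = r"first;(?i)(?P<Creator>Tele2|Apple|Microsoft|Billa)"

sample = "Invoice from apple, issued 24.12.2017, due 15/01/2018."
date_result = apply_regex_config(DATE_CONFIG, sample)
creator_result = apply_regex_config(CREATOR_CONFIG, sample)
print(date_result)
print(creator_result)
```

In this sketch only the first match is kept, so the second date in the sample text is ignored. Because of the (?i) flag, the creator is stored as it appears in the text, not as written in the pattern.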

Analyzer ideas

  • find empty pages
  • b/w or color document
  • create a fingerprint for a document (for duplicate search ..)

Configure a new Analyzer

Setup->Analyzers->Actions->create Analyzer

Where to find Document Analyzer in the Frontend

  • Setup->Analyzers ... create/edit/delete Analyzers
  • Tools->Analyze all documents ... run all Analyzers over all documents (be aware that this can be time and resource consuming)
  • Document->Versions->Analyzer result

Create an index based on the Analyzer

{{ document.analyzer_value_of.Name_of_the_Analyzer_result_parameter }}

Contribute

It is easy to write your own analyzer class, configure it, and make it available in the frontend. The backends folder contains two examples: exiftool.py (reused from Roberto Rosario's https://gitlab.com/mayan-edms/exif app) and regex.py, a simple configurable regular expression analyzer (e.g. configure a regex to find dates in your document text, i.e. the OCR result). The result of an analyzer class has to be a list of (param, value) tuples, e.g. [(param1, value1), (param2, value2), ...].
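
As a rough illustration of that result format, here is a sketch of a standalone analyzer. The class name WordCountAnalyzer and the method name analyze are hypothetical and not part of document_analyzer. The real base class and method signature should be taken from backends/exiftool.py and backends/regex.py.

```python
class WordCountAnalyzer:
    """Hypothetical analyzer that reports simple statistics about the text."""

    def analyze(self, content):
        # content is assumed to be the OCR text of one document version.
        words = content.split()
        # The result is a list of (param, value) tuples, as the README requires.
        return [
            ("word_count", str(len(words))),
            ("is_empty", str(not words)),
        ]


results = WordCountAnalyzer().analyze("Invoice 2017 from Tele2")
print(results)
```

The values are returned as strings here on the assumption that the generic result table stores text. That is a design choice of this sketch, not something the project confirms.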

License

This project is open sourced under the MIT License.

Installation

  • clone the sources from GitLab to your local environment.

  • add a link from your mayan/apps folder to the document_analyzer folder:

    cd /yourmayanroot/apps
    ln -s /yourgitroot/document_analyzer/document_analyzer/ .
    

In your settings/local.py file add document_analyzer to your INSTALLED_APPS list:

INSTALLED_APPS += (
    'document_analyzer',
)

Run the migrations for the app:

mayan-edms.py migrate

Settings

There are two analyzer classes developed for now:

  • exiftool: all stuff reused from https://gitlab.com/mayan-edms/exif
  • regextool:

Requirements

ExifTool http://www.sno.phy.queensu.ca/~phil/exiftool/