DPI value for OCR not taken from document metadata
This was discussed in https://groups.google.com/forum/#!topic/mayan-edms/Bfb3v2C_9ks
I have observed two behaviors:
- Embedding a JPEG into a PDF degrades OCR quality, even if the underlying JPEG data is unchanged and the scaling is preserved according to the JPEG metadata. I worked around this by changing line 37 of /usr/local/lib/python2.7/dist-packages/mayan/apps/converter/backends/python.py to
pdftoppm = pdftoppm.bake('-jpeg', '-r', '300')
(adding the expected DPI value via the "-r" parameter when calling pdftoppm)
Of course, this is just a dirty hack. I propose solving this properly by either
- Exposing the DPI as a per-document-type setting
- Actually extracting the DPI value from the PDF metadata per document (I think this requires converting the page dimensions to DPI, which could introduce rounding issues), maybe falling back to a global or per-document-type value if we are not able to determine the DPI value.
Maybe this is also a nice two-stage solution for later development?
- The hack above even improves the recognition quality over that of the original JPEG. Since the original JPEG had its DPI value set in its metadata, I think that the DPI values of image files aren't carried over to OCR either. The same solutions should apply.
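To make the second proposal more concrete, here is a rough sketch of the DPI calculation and the fallback chain. This is not Mayan code: the function names, the `DEFAULT_DPI` constant and the per-document-type argument are all made up for illustration. It assumes the embedded image fills the whole page and that the image size in pixels and the page size in PDF points (1 point = 1/72 inch) have already been read from the file by some other means. Only the `-jpeg` and `-r` options from the hack above are used for pdftoppm.

```python
POINTS_PER_INCH = 72.0  # PDF page dimensions are given in points (1/72 inch)
DEFAULT_DPI = 300       # hypothetical global fallback, same value as the hack


def estimate_dpi(pixel_width, pixel_height, page_width_pt, page_height_pt):
    """Estimate the DPI of an image that fills a whole PDF page.

    Returns None if any dimension is missing or not positive, so the
    caller can fall back to a configured value.
    """
    dimensions = (pixel_width, pixel_height, page_width_pt, page_height_pt)
    if any(value is None or value <= 0 for value in dimensions):
        return None

    dpi_x = pixel_width / (page_width_pt / POINTS_PER_INCH)
    dpi_y = pixel_height / (page_height_pt / POINTS_PER_INCH)

    # Page sizes are rarely exact multiples, so the two axes can differ
    # slightly. Take the larger one and round to a whole number.
    return int(round(max(dpi_x, dpi_y)))


def resolve_dpi(detected_dpi=None, document_type_dpi=None,
                default_dpi=DEFAULT_DPI):
    """Pick the first usable value: detected, per document type, global."""
    for candidate in (detected_dpi, document_type_dpi):
        if candidate:
            return candidate
    return default_dpi


def pdftoppm_arguments(dpi):
    """Build the arguments that the hack above passes to pdftoppm."""
    return ['-jpeg', '-r', str(dpi)]


if __name__ == '__main__':
    # A US Letter page (612 x 792 points) holding a 2550 x 3300 pixel scan.
    detected = estimate_dpi(2550, 3300, 612, 792)
    print(pdftoppm_arguments(resolve_dpi(detected, document_type_dpi=200)))
```

Rounding to a whole number is where the rounding issue mentioned above would show up, so a real implementation might want to snap to common scanner resolutions instead. The same `resolve_dpi` fallback could be reused for plain image files whose metadata carries no DPI value.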
I'll attach test files and comparisons later today.