Many companies need to process large amounts of documents, which in many cases are not digitized. When a document is scanned (or a scanned image is sent), plenty of distortions can arise – some examples being skewing, noise, pages in random order, etc. The project was to map each scanned image to a specific (pre-defined) document and page.
The data was provided in the form of scanned images. The first step was to convert these into some readable format. We chose to simply run OCR over the images to produce text (poor OCR quality was acceptable, as we'll show later, as long as at least several words were detected correctly). Many of the distortions applied to the documents were tolerable (noise, barrel distortion, slight rotations), since at least some parts of the documents remained readable, so the OCR could still detect several words correctly.
The only major problem we faced was 180-degree rotated images (scanned upside-down). This wasn't handled automatically by the OCR software we used (the open-source library Tesseract).
The other problem was the large amount of data – over 16000 images – which took a long time to digitize.
We started the OCR process relatively late, but since we modularized the pipeline, we could develop the remaining logic while the OCR process was running. Each image took around 10 s to go through OCR, and the whole run took over 12 hours on 4 cores (3 GHz each).
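A minimal sketch of how such a run could look, assuming the pytesseract wrapper around Tesseract and Pillow; the folder name and the 4-process pool are illustrative:

```python
# Sketch only: assumes pytesseract (Tesseract wrapper) and Pillow are installed.
from multiprocessing import Pool
from pathlib import Path

from PIL import Image
import pytesseract


def ocr_to_file(path):
    """Run Tesseract on one scanned image and store the text next to it."""
    text = pytesseract.image_to_string(Image.open(path))
    Path(str(path) + ".txt").write_text(text)


if __name__ == "__main__":
    images = sorted(Path("scans").glob("*.png"))   # hypothetical input folder
    with Pool(processes=4) as pool:                # mirrors the 4 cores mentioned above
        pool.map(ocr_to_file, images)
```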
Since the flipped images were a problem, we designed a simple heuristic to decide whether an OCR-ed image was flipped. Because flipped images produced almost no valid words, we used an English dictionary to measure the proportion of "proper" words detected on each page. If this percentage was low (less than 20%), we flipped the image and ran OCR again. If that produced better results (and in >99% of cases it did), we saved the flipped version; if not, we reverted to the original one (apparently just a scan of very bad quality).
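A minimal sketch of this check, assuming an English word list already loaded into a Python set (e.g. from a system dictionary file); the 20% threshold matches the one described above, the rest is illustrative:

```python
import re

from PIL import Image
import pytesseract


def valid_word_ratio(text, dictionary):
    """Fraction of OCR-ed words that appear in the English dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)


def ocr_with_flip_check(path, dictionary, threshold=0.2):
    """OCR the image; if too few valid words, retry upside-down and keep the better result."""
    image = Image.open(path)
    text = pytesseract.image_to_string(image)
    if valid_word_ratio(text, dictionary) < threshold:
        flipped = pytesseract.image_to_string(image.rotate(180))
        if valid_word_ratio(flipped, dictionary) > valid_word_ratio(text, dictionary):
            return flipped   # the flipped version reads better, keep it
    return text              # otherwise keep the original (possibly just a bad scan)
```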
We also converted the given PDF forms to text. Since the number of forms was very small, we decided not to automate this step: we used an online tool and then split the output into pages with a custom script. This could, however, easily be automated if needed (e.g. for more forms).
The main idea was to match the words in the (pages of the) forms against the words extracted by the OCR. We made three major improvements (see the sketch after this list):
Use only words with 3 or more characters (shorter words are very likely to appear in many documents)
Use unique words only (common words like "and" and "the" may otherwise skew the results)
Calculate a certainty score (also known as a "confidence level") for each image.
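A minimal sketch of the matching with the first two improvements, assuming the OCR-ed document and the form pages are already plain strings; all names are illustrative:

```python
import re


def word_set(text):
    """Unique lowercase words with 3 or more characters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 3}


def match_score(document_text, page_text):
    """Fraction of the document's unique words that also occur in the form page."""
    doc_words = word_set(document_text)
    if not doc_words:
        return 0.0
    return len(doc_words & word_set(page_text)) / len(doc_words)


def classify(document_text, form_pages):
    """Pick the form page with the highest score; form_pages maps page id -> text."""
    return max(form_pages, key=lambda page: match_score(document_text, form_pages[page]))
```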
Without any of these improvements our accuracy was around 98%, which was very high to begin with. Using the first two optimizations brought it up to 99.4%, with only a few samples classified incorrectly. If we knew which ones were likely to be wrong, we could fix them by hand (at that level of accuracy, fewer than 100 samples are wrong).
The third optimization is important in many prediction problems, as uncertain cases can be treated differently. We started by using the number of words in the document that also appear in the form, divided by the total number of words in the document (or in the form).
Although this produced relatively good results, there are cases in which two pages are almost identical: even when the fraction of matching words is very high (~80% in some good scans), we can still be very uncertain which of the pages it is (document "CIT0001E-2", pages 3 and 4, being an example). Our approach was to normalize the confidence by taking the best and second-best matches and subtracting their percentages. A document with a low score (e.g. 12%) but no other remotely close page (< 2%) was still considered fine (12% - 2% = 10% margin). A document with a very high score but another page scoring almost as high (e.g. 80% and 79%) would yield a low confidence (80% - 79% = 1%). This gave us a clear indication of which classifications may have been wrong, so we could check them by hand.
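A sketch of this normalization, reusing the hypothetical match_score helper from the sketch above:

```python
def confidence(document_text, form_pages):
    """Best match minus second-best match; a small margin flags an uncertain case."""
    scores = sorted(
        (match_score(document_text, text) for text in form_pages.values()),
        reverse=True,
    )
    best = scores[0]
    second = scores[1] if len(scores) > 1 else 0.0
    return best - second
```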
Once the data is prepared (images are OCR-ed and forms are converted to text), the evaluation itself is very quick. In terms of time complexity our solution is O(N * M), where N is the number of characters in the OCR-ed document, and M is the number of pages in the forms.
Since the number of form pages was only 73, and a typical document has no more than 1000 words (a few thousand characters), the complete processing of all 16425 documents takes just a few seconds.
We used the training data to evaluate our model (though not for training per se). The results proved to be extremely good (>99.5% accuracy, with most of the wrong samples having very low confidence, so they can be reviewed by hand).
We noticed that most of the remaining errors were caused by two almost identical pages in one of the documents. We decided to write custom logic for these cases, which relies on specific features of those pages (e.g. a unique word that is present in one but not the other, and the string "page 3" in one versus "page 4" in the other). We used this heuristic to distinguish between the two automatically, so we don't have to check these by hand.
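A simplified sketch of such a tie-breaker; the marker strings are illustrative, and the real check also relied on a word unique to one of the pages:

```python
def disambiguate_similar_pages(document_text):
    """Tie-breaker for the two near-identical pages; returns None if no marker is found."""
    text = document_text.lower()
    if "page 3" in text and "page 4" not in text:
        return 3
    if "page 4" in text and "page 3" not in text:
        return 4
    return None   # fall back to the regular match score
```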
We ran the algorithm (a few seconds) on all the test data and created a CSV with the required structure. We manually checked the 30 results with the lowest confidence score to detect errors (and, as expected, found a few, which we corrected by hand).
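A sketch of the export and review-selection step, assuming the results are (filename, predicted page, confidence) tuples; the column names and output path are illustrative:

```python
import csv


def write_results(results, path="results.csv"):
    """Write all predictions to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "predicted_page", "confidence"])
        writer.writerows(results)


def lowest_confidence(results, n=30):
    """The n predictions with the smallest confidence margin, for manual review."""
    return sorted(results, key=lambda row: row[2])[:n]
```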
Sources should be available in GitLab. We decided to use separate scripts for separate tasks, since this allowed us to work on specific tasks (e.g. the matching logic) while others were executing in the background (e.g. OCR). The scripts are:
Splitter.py – Splits the output text document from the PDF forms into pages
OCRTrainData.py – Runs OCR on random files from the training data
OCRTestData.py – Runs OCR on all test files, using a dictionary to determine whether rotations are needed
TrainingClassifier.py – Classifies all OCR-ed files from the training data and validates accuracy using file paths as labels