Handling documents of different types and quality is an integral part of any business workflow, so a robust document comprehension system (OCR + NER) is essential.
Unfortunately, none of the commercial offerings is a silver bullet for practical RPA implementations. Without a comprehensive document comprehension system, only spot automations are possible.
Over the course of multiple complex RPA implementations, we evaluated the popular commercial OCR offerings. We are sharing our findings here.
In real-world use cases, documents can be images, low quality, noisy and so on. Document comprehension deals with two types of input: image and text. Text in image documents is extracted using OCR technology. While automating insurance use cases in which customer declaration forms, Acord forms and quotations had to be read and extracted, we benchmarked various OCR providers on the market. We present the results here for the benefit of others.
This blog is intended for developers, project managers and entrepreneurs who want to understand how different OCR services perform in terms of accuracy on image-based documents. It will also help them understand which OCR suits their requirements.
This blog will cover:
We hereby are not getting into the details of
Do you want to prioritize an OCR solution that places fewer restrictions on gaining business traction?
Then read on.
The challenges faced in the process of identifying an OCR and doing entity extraction are:
Tesseract is the leading open source OCR solution. Anyone who wants to get started with an open source OCR to build an MVP can pick Tesseract as a first try. Tesseract is actively developed by a community and is supported by Google (as of June 2019).
A neural-net-based OCR engine mode was recently made available in Tesseract 4.0. It gives improved accuracy on image documents with high noise (documents that are not well scanned).
Will Tesseract help with all problems and all domains?
Tesseract 4.0 gives decent accuracy on well-scanned image documents, but that accuracy might still not be enough to gain business traction. For example, an OCR-based solution for the banking domain will face restrictions: Tesseract still makes errors when reading financial numbers, currency and KYC information from documents, and such errors can have a huge impact in the finance domain.
Also, image documents have to be preprocessed before they are fed to Tesseract. Although some preprocessing steps are common (increasing DPI, converting to grayscale, deskewing, etc.), a lot of preprocessing is specific to the type of noise in the document. For instance, we have to apply a filter that either increases or decreases blur, depending on how the image document was generated. We shouldn’t apply every preprocessing step to every document, because that will decrease accuracy. Tools such as OpenCV and ImageMagick can be used for these preprocessing steps.
Apart from preprocessing, we have to choose model parameters such as the page segmentation mode and the OCR engine mode, each of which is specialized for different document layouts and noise.
We have to devise an automated workflow that picks the preprocessing steps and model parameters for each document specification. If we can’t automate that choice, we will end up with a lot of configuration in the application for every document specification.
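To make the idea of per-document configuration concrete, here is a minimal sketch in Python. The `DocumentProfile` class, the `PROFILES` table and every value in it are hypothetical and only for illustration; real values would have to come from testing on your own samples. The sketch only builds command lines. It assumes the ImageMagick `convert` binary with its `-colorspace`, `-deskew`, `-sharpen` and `-blur` options, and the Tesseract CLI with its `--psm` (page segmentation mode) and `--oem` (OCR engine mode) options.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    """Hypothetical per-document-type settings (illustrative only)."""
    name: str
    psm: int = 3            # Tesseract page segmentation mode (3 = automatic)
    oem: int = 1            # Tesseract OCR engine mode (1 = neural net LSTM)
    grayscale: bool = True
    deskew: bool = True
    sharpen: bool = False   # decrease blur, e.g. for out-of-focus photos
    blur: bool = False      # increase blur, e.g. for grainy or speckled scans


# Assumed example profiles, not measured recommendations.
PROFILES = {
    "invoice": DocumentProfile("invoice", psm=6),
    "driving_license": DocumentProfile("driving_license", psm=11, blur=True),
}


def imagemagick_command(profile, src, dst):
    """Build an ImageMagick command with only the steps this profile needs."""
    cmd = ["convert", src]
    if profile.grayscale:
        cmd += ["-colorspace", "Gray"]
    if profile.deskew:
        cmd += ["-deskew", "40%"]
    if profile.sharpen:
        cmd += ["-sharpen", "0x1"]
    if profile.blur:
        cmd += ["-blur", "0x1"]
    cmd.append(dst)
    return cmd


def tesseract_command(profile, image, lang="eng"):
    """Build a Tesseract command that prints the recognized text to stdout."""
    return ["tesseract", image, "stdout", "-l", lang,
            "--psm", str(profile.psm), "--oem", str(profile.oem)]


if __name__ == "__main__":
    profile = PROFILES["driving_license"]
    print(" ".join(imagemagick_command(profile, "scan.png", "clean.png")))
    print(" ".join(tesseract_command(profile, "clean.png")))
```

Each command list could be passed to `subprocess.run` once both tools are installed. Keeping the settings in one table per document type is one way to stop the configuration from spreading across the application.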
Choosing the best preprocessing steps and model parameters will improve Tesseract's accuracy. But this accuracy might still not be enough to solve some business problems!
Training Tesseract on a large number of image documents per document type (e.g. license, invoice, bill), with the text marked manually, will improve accuracy. The challenge is that this requires both a huge number of image documents and the knowledge to train Tesseract's neural net.
Google Vision vs Microsoft OCR vs Nuance Omnipage SDK vs ABBYY Finereader:
Google Vision and Microsoft OCR are both leading online OCR service providers. They offer common features such as text detection, object detection, document label detection, landmark detection and logo detection. Both can yield high-accuracy text extraction from noisy image documents (smudged, unclear text, skewed) such as identity documents.
Both Google Vision and Microsoft OCR provide an API endpoint that accepts the image document and returns JSON output containing the extracted text along with its coordinate information. Both have good word segmentation and line segmentation. Decent word segmentation and improved text detection accuracy help the entity extraction module, if there is one further down the pipeline.
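As an illustration of why the coordinates matter for entity extraction, here is a minimal sketch that rebuilds text lines from word boxes. Each provider has its own JSON schema, so the sketch assumes the response has already been mapped to a simplified, hypothetical record with the keys `text`, `x`, `y` (top-left corner) and `h` (box height). The function name and the `tolerance` parameter are ours, not part of either API.

```python
def group_words_into_lines(words, tolerance=0.5):
    """Group word boxes into text lines using their vertical centres.

    `words` is a list of dicts with the hypothetical keys
    text, x, y (top-left corner) and h (height), all in pixels.
    """
    lines = []  # each entry: {"y": centre of the first word, "words": [...]}
    for word in sorted(words, key=lambda w: w["y"] + w["h"] / 2):
        centre = word["y"] + word["h"] / 2
        # Same line if the centre is within a fraction of the word height.
        if lines and abs(centre - lines[-1]["y"]) <= tolerance * word["h"]:
            lines[-1]["words"].append(word)
        else:
            lines.append({"y": centre, "words": [word]})
    # Within a line, read the words from left to right.
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x"]))
        for line in lines
    ]


if __name__ == "__main__":
    sample = [
        {"text": "Number:", "x": 80, "y": 12, "h": 20},
        {"text": "Policy", "x": 10, "y": 10, "h": 20},
        {"text": "12345", "x": 170, "y": 11, "h": 20},
        {"text": "Name:", "x": 10, "y": 50, "h": 20},
        {"text": "Jane", "x": 80, "y": 52, "h": 20},
    ]
    for line in group_words_into_lines(sample):
        print(line)
```

With lines rebuilt this way, a key such as "Policy Number:" and its value end up in the same string, which makes simple pattern-based entity extraction easier. Strongly skewed pages would need deskewing first, since this grouping assumes roughly horizontal text.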
Both Nuance Omnipage and ABBYY Finereader are mainly used for on-prem OCR. Most clients prefer on-prem OCR because they don’t want their data transferred outside their firewall.
Nuance Omnipage SDK comes with add-on features such as OMR, document classification, ICR and an RPA-style entity extraction tool, whereas ABBYY Finereader is a plain vanilla OCR tool for text detection. Their pricing differs accordingly.
Both applications are designed to support OCR extraction through a graphical user interface, and the majority of the OCR work has to be run inside their desktop applications. Nuance Omnipage SDK provides an API that can be integrated with your application. ABBYY Finereader provides hot folder functionality for batch processing OCR files, which might prevent the developer from achieving real-time extraction from image documents.
We tried two types of document (ordinary documents and identity documents) on all five OCR engines. In our context, ordinary documents are documents such as scanned insurance copies and invoices, whereas identity documents are documents such as driving licenses and passports.
Identity documents carry intentional synthetic noise, so OCR accuracy is lower than on ordinary documents.
The figure below shows the accuracy statistics for all five OCR services on the different document types.
Google and Microsoft OCR perform well on both types of document. On ordinary documents, Nuance and ABBYY Finereader perform on par with Google and Microsoft OCR.
The figure below shows the error classification for all five OCR services on the different document types.
On highly noisy image documents, Microsoft OCR fails to detect some regions completely and recognizes no text at all from those regions. This can be a restriction if an application relies only on Microsoft OCR.
For Microsoft OCR, 85% of the errors are due to “missing character detection”, whereas for Google Vision the figure is 53%. These errors are hard to fix with NLP programs.
For Microsoft OCR, 15% of the errors are due to “similar/special character replacement”, whereas for Google Vision the figure is 47%. These errors can be reduced using an NLP program.
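To show why replacement errors are the more tractable kind, here is a minimal rule-based sketch in Python. It is a simple stand-in, not a full NLP program. It assumes we already know from the document layout that a field should be numeric, and it maps look-alike letters back to digits. The mapping table and the `fix_numeric_field` helper are our own illustration, not something any of the OCR services provide.

```python
# Letters that OCR engines commonly confuse with digits (illustrative list).
CONFUSABLE_DIGITS = str.maketrans({
    "O": "0", "o": "0",
    "I": "1", "l": "1",
    "S": "5", "B": "8", "Z": "2",
})

# Characters we accept in a numeric field after correction.
NUMERIC_CHARS = set("0123456789.,-/ ")


def fix_numeric_field(value):
    """Replace look-alike letters with digits in a field known to be numeric.

    If the corrected value still contains non-numeric characters, the
    original value is returned unchanged so that it can be flagged for review.
    """
    fixed = value.translate(CONFUSABLE_DIGITS)
    if all(ch in NUMERIC_CHARS for ch in fixed):
        return fixed
    return value


if __name__ == "__main__":
    for raw in ["1O,5OO.OO", "4l2S", "ABC"]:
        print(raw, "->", fix_numeric_field(raw))
```

A missing character leaves nothing to map back, which is why the "missing character detection" errors are much harder to repair after the fact.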
Both Nuance and ABBYY Finereader yield lower accuracy on high-noise documents such as identity documents. Tesseract, with the proper filters and model parameters applied, yields similar accuracy on high-noise documents.
For ABBYY Finereader, 62% of the errors are due to “missing character detection”, whereas for Nuance Omnipage SDK the figure is 52%. These errors are hard to fix with NLP programs.
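For readers who want to produce a similar breakdown on their own documents, here is one possible way to count the two error categories. It assumes a manually verified ground-truth string is available for each OCR output. This is a sketch of the general idea using Python's standard `difflib` module, not the exact procedure behind the numbers above, and the `classify_errors` name and its category labels are ours.

```python
from difflib import SequenceMatcher


def classify_errors(expected, actual):
    """Count character-level OCR errors by category.

    missing  : characters in the ground truth that the OCR dropped
    replaced : characters the OCR swapped for a different character
    extra    : characters the OCR added that are not in the ground truth
    """
    counts = {"missing": 0, "replaced": 0, "extra": 0}
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            counts["missing"] += i2 - i1
        elif tag == "insert":
            counts["extra"] += j2 - j1
        elif tag == "replace":
            # Pair up characters one to one; any surplus on either side
            # counts as missing or extra.
            paired = min(i2 - i1, j2 - j1)
            counts["replaced"] += paired
            counts["missing"] += (i2 - i1) - paired
            counts["extra"] += (j2 - j1) - paired
    return counts


if __name__ == "__main__":
    print(classify_errors("DL 4021", "DL 4O21"))  # digit 0 read as letter O
    print(classify_errors("POLICY", "POLCY"))     # letter I dropped
```

Summing these counts over a test set and dividing each category by the total gives percentage shares per OCR engine.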