Document Scanner (📃🔍) - OCR & Information Extraction

CONTENT

Introduction
The OCR Landscape
Limitations of current OCR APIs
Should I even consider using OCR then?
What defines a good OCR product?
Document Scanner & OCR
Document Scanner Information Extraction Pipeline
How we use Document Scanner at Box8?
Where to next?

1. Introduction

Simply defined, OCR (optical character recognition) is a set of computer vision tasks that convert scanned documents and images into machine-readable text. It takes images of documents, invoices and receipts, finds the text in them and converts it into a format that machines can process more easily. Whether you want to read information off ID cards or read the numbers on a bank cheque, OCR is what will drive your software.

You might need to read the characters on a cheque and extract the account number, amount, currency, date, etc. But how do you know which characters correspond to which field?

2. The OCR Landscape

The OCR APIs offered by many providers solve only a narrow set of use cases and are hard to customize.

More often than not, a business planning to use OCR technology needs an in-house team to build on the OCR API available to them to apply it to their use case. The OCR technology available in the market today is mostly a partial solution to the problem.

3. Limitations of current OCR APIs

Require a considerable amount of post-processing
Work well only under specific constraints
Struggle with tilted text in images
Struggle with handwritten text, cursive fonts and varying font sizes
Struggle with noisy or blurry images

4. Should I even consider using OCR then?

The short answer is yes.

Anywhere there is a lot of paperwork or manual effort involved, OCR technology can enable image- and text-based process automation. Digitizing information accurately makes business processes smoother, easier and a lot more reliable, and reduces the manpower required to execute them. For big organizations that have to deal with a lot of forms, invoices, receipts, etc., being able to digitize all the information, store and structure the data, and make it searchable and editable is a step closer to a paper-free world.

Think about the following use cases:

Legal documents - digitizing different kinds of documents (affidavits, judgments, filings, etc.), storing them in a database and making them searchable.
Table extraction - automatically detecting tables in a document and getting the text in each cell and the column headings for research, data entry, data collection, etc.
Banking - analyzing cheques, reading and updating passbooks, ensuring KYC compliance, analyzing applications for loans, accounts and other services.
Healthcare - having patients' medical records, history of illnesses, diagnoses, medication, etc. digitized and made searchable for the convenience of doctors.
Invoices - automating the reading of bills, invoices and receipts, and extracting products, prices, date-time data and company/service names for the retail and logistics industries.

5. What defines a good OCR product?

How it deals with the images coming in
How it performs in real-world problems
How it uses the machine-readable text

6. Document Scanner & OCR

Document Scanner is a data capture solution built to retrieve data from image documents. It takes an image and extracts the required data in near real-time.

Document Scanner was built to solve the above-mentioned problems. We have been able to productize an OCR pipeline that goes beyond character recognition and returns structured, usable information.

7. Document Scanner Information Extraction Pipeline

To extract information from an image, Document Scanner needs a set of rules that tell it which data points to look for and where to look for them in the image document. These rules generally describe what a particular data point looks like and where it can be located in the document.
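
To make the idea of rules concrete, here is a minimal sketch in Python. It is not Document Scanner's actual implementation: the `Token` and `Rule` types, the regex patterns and the normalized regions are all assumptions made for illustration. Each rule pairs a pattern (what the data point looks like) with a region of the page (where to look for it).

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Token:
    """One piece of OCR output: the text and its centre, normalized to 0..1."""
    text: str
    x: float
    y: float


@dataclass
class Rule:
    """Hypothetical rule: what a data point looks like and where to find it."""
    name: str
    pattern: str                               # regex describing the value
    region: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def in_region(token: Token, region: Tuple[float, float, float, float]) -> bool:
    x_min, y_min, x_max, y_max = region
    return x_min <= token.x <= x_max and y_min <= token.y <= y_max


def extract(tokens: List[Token], rules: List[Rule]) -> Dict[str, Optional[str]]:
    """Return the first token that satisfies each rule, or None if none does."""
    result: Dict[str, Optional[str]] = {}
    for rule in rules:
        result[rule.name] = None
        for token in tokens:
            if in_region(token, rule.region) and re.fullmatch(rule.pattern, token.text):
                result[rule.name] = token.text
                break
    return result


# Made-up OCR output for a cheque: date at the top right, amount on the right.
tokens = [
    Token("12/03/2021", 0.85, 0.10),
    Token("4500.00", 0.80, 0.50),
]
rules = [
    Rule("date", r"\d{2}/\d{2}/\d{4}", (0.6, 0.0, 1.0, 0.3)),
    Rule("amount", r"\d+\.\d{2}", (0.6, 0.3, 1.0, 0.7)),
    Rule("account_number", r"\d{12}", (0.0, 0.7, 0.6, 1.0)),
]
fields = extract(tokens, rules)
```

A field that no token satisfies comes back as `None`, which gives the caller an easy way to spot data points that could not be read from the image.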

Benefits of Document Scanner

Saves precious time by providing structured, usable information that can be used for auto-filling forms, etc.
Prevents errors due to user entry.
Increases overall productivity.
Is compatible with local languages.

8. How we use Document Scanner at Box8?

Employee Onboarding Using Aadhaar

Creating an employee requires filling in multiple fields and is quite time-consuming. The input also has to be accurate, because fields like Document No (Aadhaar No.) need to be unique. A mistake means refilling the form, which further increases the time taken to create an employee. Using Document Scanner with Aadhaar speeds up the process and helps eliminate errors due to user input.
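
As a rough illustration of the auto-fill step, the sketch below pulls a 12-digit number out of OCR text and checks it against numbers already on record. It assumes the number appears either as 12 consecutive digits or as three groups of four separated by single spaces. The function names, the status values and the sample number are made up for this example and are not Box8's actual code.

```python
import re
from typing import Dict, Optional, Set

# 12 digits, optionally split into groups of four by single spaces,
# and not part of a longer run of digits.
AADHAAR_PATTERN = re.compile(r"(?<!\d)(\d{4})[ ]?(\d{4})[ ]?(\d{4})(?!\d)")


def find_aadhaar_number(ocr_text: str) -> Optional[str]:
    """Return the first 12-digit candidate with spaces removed, or None."""
    match = AADHAAR_PATTERN.search(ocr_text)
    if match is None:
        return None
    return "".join(match.groups())


def prefill_document_no(ocr_text: str, existing: Set[str]) -> Dict[str, Optional[str]]:
    """Suggest a value for the Document No field and flag duplicates early."""
    number = find_aadhaar_number(ocr_text)
    if number is None:
        return {"document_no": None, "status": "not_found"}
    if number in existing:
        return {"document_no": number, "status": "duplicate"}
    return {"document_no": number, "status": "ok"}


# Made-up OCR text; the number is a placeholder, not a real Aadhaar number.
sample_text = "Name: Test User\nDOB: 01/01/1990\n1234 5678 9012"
suggestion = prefill_document_no(sample_text, existing=set())
```

Flagging a duplicate before the form is submitted is what saves the refill: the user finds out about the clash while the document is still in hand.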

Digitization of Vendor Invoices

We use Document Scanner to extract data from a vendor invoice, such as GST info, the invoice number, information about the items purchased, the total amount to be paid, etc. It also helps us identify whether any of this information is missing or differs from the data entered manually, and automatically raise disputes.
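
The comparison step could look something like the sketch below. It assumes both the extracted invoice data and the manual entry are available as simple field-to-value dictionaries. The field names, the sample values and the `find_discrepancies` helper are hypothetical, and how a dispute is actually raised is left out.

```python
from typing import Dict, List, Optional


def find_discrepancies(
    extracted: Dict[str, Optional[str]],
    entered: Dict[str, Optional[str]],
) -> List[str]:
    """List fields that are missing or differ between the scan and manual entry."""
    issues: List[str] = []
    for name in sorted(set(extracted) | set(entered)):
        scanned = extracted.get(name)
        manual = entered.get(name)
        if scanned is None:
            issues.append(f"{name}: missing on invoice")
        elif manual is None:
            issues.append(f"{name}: missing in manual entry")
        elif scanned.strip().lower() != manual.strip().lower():
            issues.append(f"{name}: invoice has {scanned!r}, entry has {manual!r}")
    return issues


# Made-up values for illustration only.
extracted = {"invoice_number": "INV-001", "gst_info": None, "total_amount": "1180.00"}
entered = {"invoice_number": "inv-001", "gst_info": "PLACEHOLDER", "total_amount": "1100.00"}

issues = find_discrepancies(extracted, entered)
if issues:
    # A real system would raise a dispute here; this sketch only prints.
    print("\n".join(issues))
```

Comparing values after trimming whitespace and ignoring case keeps harmless formatting differences from turning into disputes. Amounts would probably deserve a numeric comparison in practice.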

9. Where to next?

Automated intelligent structured field extraction

As discussed above, to get structured information from Document Scanner we use a set of rules that tell it which data points to look for and where to find them in the document. Now think of a system that takes this to the next level. What if, instead of giving it rules, we just tell it what data points we want and it looks for them automatically?

Sounds interesting, right? It might seem simple at first, so let me break it down for you. Not only does the system need to understand what a data point is and tell data points apart, it also needs to understand what the user wants. It's not so simple now, is it? 😛

If you have any ideas on how we can approach this problem, don't hesitate to contact us. 📧

Thanks for reading. 👍