OCR and Text Extraction

Affinda's solution is designed to work effectively on both machine-readable documents and scanned images or files.

Machine-readable documents

To reduce processing times and ensure accuracy, Affinda only applies Optical Character Recognition (OCR) technology to documents that are not machine-readable. When a document is uploaded, Affinda checks whether it has a 'text layer' associated with it. If it does (and the text layer contains more than 25 words), no OCR is applied, because Affinda assumes that the text layer matches the document.
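The check described above can be sketched in a few lines of Python. This is only an illustration, not Affinda's actual implementation: the function name `needs_ocr` and the constant `MIN_TEXT_LAYER_WORDS` are hypothetical, and the sketch assumes that a text layer with 25 words or fewer is treated the same as a missing one.

```python
from typing import Optional

# Threshold taken from the description above: a text layer is trusted
# only if it contains more than 25 words. The constant name is hypothetical.
MIN_TEXT_LAYER_WORDS = 25


def needs_ocr(text_layer: Optional[str],
              min_words: int = MIN_TEXT_LAYER_WORDS) -> bool:
    """Return True when OCR should be applied to a document (illustrative).

    `text_layer` is the embedded text of the document, or None when the
    document has no text layer at all (e.g. a scan or a photo).
    """
    if text_layer is None:
        return True
    # Count whitespace-separated words in the text layer.
    word_count = len(text_layer.split())
    # Assumption: a text layer at or below the threshold is not trusted.
    return word_count <= min_words
```

For example, `needs_ocr(None)` returns `True` for a scanned image, while a PDF whose text layer contains 26 or more words returns `False` and skips OCR.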

Documents without a text layer

Where there is no text layer (e.g. a scanned invoice or a photo of an invoice), Affinda applies OCR as the first step to extract the text from the document. Once the text is extracted, Affinda can extract the key data just as it would from a machine-readable document. Because of this additional OCR step, processing times for these documents will be higher.

OCR is not always 100% accurate, and its output is affected by scan quality, stamps and other imperfections in the document. The confidence levels used to auto-validate fields take into account how confident the OCR technology is that it has extracted the text correctly.
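To illustrate how OCR confidence could feed into auto-validation, here is a minimal Python sketch. It does not show how Affinda computes its confidence levels. The function names, the "take the lower of the two scores" rule and the 0.9 threshold are all assumptions made for the example.

```python
from typing import Optional


def field_confidence(extraction_confidence: float,
                     ocr_confidence: Optional[float]) -> float:
    """Combine extraction and OCR confidence for one field (illustrative).

    `ocr_confidence` is None for machine-readable documents, where no OCR
    was applied. Assumption: the combined score is the lower of the two.
    """
    if ocr_confidence is None:
        return extraction_confidence
    return min(extraction_confidence, ocr_confidence)


def can_auto_validate(extraction_confidence: float,
                      ocr_confidence: Optional[float],
                      threshold: float = 0.9) -> bool:
    """Return True when the combined confidence meets a hypothetical threshold."""
    return field_confidence(extraction_confidence, ocr_confidence) >= threshold
```

Under these assumptions, a field extracted with confidence 0.95 from a poor scan with OCR confidence 0.6 would not be auto-validated, whereas the same field from a machine-readable document would be.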

What if the text layer on a document is incorrect?

From time to time, a document may be submitted with a text layer that does not fully match the content of the document itself. Whilst this is uncommon, it means that Affinda will not have applied OCR and so will not be able to extract the data accurately.

In the minority of cases where this occurs, users can select 'Run OCR', which applies OCR to the document and re-parses the data.