OCR and Text Extraction

Affinda's solution is designed to work effectively on both machine-readable documents and scanned images and files. For scanned images or documents where textual information is contained within images (e.g. logos), Affinda uses Optical Character Recognition (OCR) to convert the images into machine-readable data ahead of classification and data extraction.

OCR Options

Affinda provides four options that control if and when OCR is applied to each document. While applying OCR can increase overall performance, it adds processing cost and time (0.5 to 1 second per page), so applying OCR might not be suitable for all use cases.

By default, new Workspaces will have 'Partial' OCR enabled.
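As a rough illustration of the time cost quoted above, the following Python sketch estimates the extra processing time for a document. It is not part of any Affinda API: the function name and parameters are hypothetical, and it assumes every page is passed through OCR at the quoted rate of 0.5 to 1 second per page.

```python
def estimate_ocr_overhead_seconds(page_count, low_per_page=0.5, high_per_page=1.0):
    """Return a (low, high) estimate of added OCR time in seconds.

    Hypothetical helper for illustration only. The per-page defaults are
    the 0.5 to 1 second range quoted in this article, and the estimate
    assumes OCR runs on every page.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return (page_count * low_per_page, page_count * high_per_page)


low, high = estimate_ocr_overhead_seconds(40)
print(f"Estimated added OCR time for 40 pages: {low:.0f} to {high:.0f} seconds")
```

Modes that run OCR on only part of a document may add less time than this estimate suggests.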

1. Skip

OCR is never applied, even if no text layer is found. Suitable for use cases where speed / cost is most important. Not recommended for most use cases.

2. Auto Detect

OCR is applied only to documents where little or no text layer is found. Affinda applies OCR over the entire document if the text layer contains fewer than 25 words, and the OCR-extracted text overwrites any existing text layer. If a text layer with over 25 words is found, OCR is not applied.

Suitable for a use case like resume parsing where processing speed is typically important and there is rarely any text contained in images within the document.

3. Partial (recommended)

For every uploaded document, OCR is applied to the elements of the document that have no text layer. This preserves the original machine-readable text while also extracting additional information from images and pages that lack a text layer. A typical example is an invoice where the supplier name and business number are contained within the header image / logo. Combining the text layer with the OCR-extracted text ensures comprehensive results.

4. Always Full

OCR is applied to all documents and is used in place of any existing machine-readable text layer. Typically only recommended when the text layer in a document is frequently incorrect and needs to be corrected.
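
The four options above can be summarised as a small decision function. The Python sketch below is only an illustration of the behaviour described in this article, not Affinda's implementation or API: the enum, function name, and return values are hypothetical, and treating a text layer of exactly 25 words as sufficient is an assumption.

```python
from enum import Enum


class OcrMode(Enum):
    # Hypothetical names mirroring the four options described above.
    SKIP = "skip"
    AUTO_DETECT = "auto_detect"
    PARTIAL = "partial"
    ALWAYS_FULL = "always_full"


# Word-count threshold quoted for Auto Detect.
AUTO_DETECT_MIN_WORDS = 25


def plan_ocr(mode, text_layer_word_count):
    """Return (ocr_scope, keep_text_layer) for a document.

    ocr_scope is "none", "full", or "elements_without_text".
    keep_text_layer is False when OCR output replaces the text layer.
    """
    if mode is OcrMode.SKIP:
        # OCR is never applied, even if no text layer is found.
        return ("none", True)
    if mode is OcrMode.AUTO_DETECT:
        # Assumption: exactly 25 words counts as a usable text layer.
        if text_layer_word_count < AUTO_DETECT_MIN_WORDS:
            return ("full", False)
        return ("none", True)
    if mode is OcrMode.PARTIAL:
        # Keep the text layer and OCR only the parts that lack one.
        return ("elements_without_text", True)
    if mode is OcrMode.ALWAYS_FULL:
        # OCR output is used in place of any existing text layer.
        return ("full", False)
    raise ValueError(f"Unknown OCR mode: {mode!r}")


print(plan_ocr(OcrMode.AUTO_DETECT, 10))
print(plan_ocr(OcrMode.PARTIAL, 300))
```

For example, a scanned resume with an empty text layer would be fully OCR'd under Auto Detect, while a digital invoice with a logo would keep its text layer under Partial and have only the logo region OCR'd.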

What if the text layer on a document is incorrect?

From time to time, a document may be submitted with a text layer that does not fully match the data in the document itself. This is uncommon, but when it happens Affinda will have relied on the incorrect text layer instead of applying OCR, so it will not be able to extract the data accurately.

In the rare cases where this occurs, users can select 'Apply OCR', which applies OCR to the document and re-parses the data.