# Pre-Processing Overview

> Overview of pre-processing steps Affinda applies to incoming documents, including format conversion, deskewing, OCR, splitting, and classification.

Pre-processing involves the automated cleaning, organizing, and structuring of uploaded files to prepare the documents for data extraction.

## Importance of Pre-Processing

By ensuring documents are well prepared before extraction, pre-processing:

* Reduces the likelihood of errors in later stages
* Improves the speed and accuracy of data extraction
* Enables the seamless handling of various document types and formats

## Key Pre-Processing Actions

* **Invalid File Handling:** Identifies issues that prevent a document from being processed, such as insufficient text, an unsupported file type, a corrupted file, or password protection.
* **File Format Conversion:** Converts incoming files (e.g., images, PDFs with embedded data) into a PDF format suitable for processing.
* **Remove Duplicates:** Workspaces can be configured to identify and reject documents that have already been processed. This option is set in Workflow Settings; see [Remove Duplicates](/configuration/duplicates) for more information.
* **OCR (Optical Character Recognition):** Extracts text from scanned or image-based documents using OCR, with the aim of high accuracy across document types and formats. See [OCR and Text Extraction](/configuration/ocr) for more information.
* **Language Detection:** Automatically identifies the language of the document to ensure high-accuracy extraction.
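
To make the rejection categories above more concrete, here is a minimal client-side sketch of a pre-flight check that could run before files are uploaded. It is not Affinda's implementation: the `preflight` function, the `ASSUMED_SUPPORTED_SUFFIXES` allow-list, and the use of a SHA-256 content hash to spot duplicates are all assumptions made for illustration. It does not attempt to detect password protection or insufficient text.

```python
import hashlib
from pathlib import Path

# Hypothetical allow-list for this sketch only; consult the platform's
# documentation for the file types it actually supports.
ASSUMED_SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".docx"}


def preflight(path: Path, seen_hashes: set[str]) -> str:
    """Return "ok" or a short rejection reason for a local file."""
    suffix = path.suffix.lower()
    if suffix not in ASSUMED_SUPPORTED_SUFFIXES:
        return "unsupported file type"

    data = path.read_bytes()
    if not data:
        return "empty file"

    # Rough heuristic: a well-formed PDF starts with the "%PDF-" header.
    if suffix == ".pdf" and not data.startswith(b"%PDF-"):
        return "corrupted file"

    # Assumption: two files are duplicates if their bytes are identical.
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        return "duplicate"

    seen_hashes.add(digest)
    return "ok"
```

A caller would keep one `seen_hashes` set per batch and pass it to every `preflight` call, skipping any file whose result is not `"ok"`. Byte-level hashing only catches exact copies; a rescanned or re-exported version of the same document would not match.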

Pre-processing settings are located in your Workspace's *Workflow Settings*.

<img className="block dark:hidden border-2 border-gray-300 rounded-lg" src="https://mintcdn.com/affinda-44/8O48gu_z8QeuNsDM/images/preprocessingsettingslight.png?fit=max&auto=format&n=8O48gu_z8QeuNsDM&q=85&s=6d2ab6f04b87139d9bfd1846fef7f009" alt="Pre-processing Settings" style={{ width:"70%" }} width="1676" height="1198" data-path="images/preprocessingsettingslight.png" />

<img className="hidden dark:block border-2 border-gray-300 rounded-lg" src="https://mintcdn.com/affinda-44/8O48gu_z8QeuNsDM/images/preprocessingsettingsdark.png?fit=max&auto=format&n=8O48gu_z8QeuNsDM&q=85&s=7213822d315d621e405c6b3c89ea8b5c" alt="Pre-processing Settings" style={{ width:"70%" }} width="1676" height="1192" data-path="images/preprocessingsettingsdark.png" />

## Advanced Pre-Processing Settings

**Reading Order Model:** By default, the Affinda Platform uses proprietary reading-order algorithms to capture word sequences in visually rich documents in the order a human would read them, which leads to more accurate extractions.

**Split Words:** Separates incorrectly combined words ahead of extraction. Enabled by default.
