Pre-Processing Overview
Pre-processing involves the automated cleaning, organizing, and structuring of ingested files to optimize documents for extraction. Key actions include:
- Invalid File Handling: Identifying issues that prevent a document from being processed, such as too little text in the document, an unsupported file type, a corrupted file, or password protection.
- File Format Conversion: Converting input files (e.g., images, PDFs with embedded data) into a PDF format suitable for processing.
- De-duplication: Workspaces can be configured to identify and reject documents that have already been processed.
- OCR (Optical Character Recognition): Extracting text from scanned or image-based documents using advanced OCR technology, ensuring high accuracy for all document types and formats.
- Language Detection: Automatically identifying the language of the document to support high-accuracy extraction.
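As a rough illustration of how the invalid-file and de-duplication checks above might fit together, here is a minimal Python sketch. It is not the product's implementation: the `PreProcessor` class, the `SUPPORTED_EXTENSIONS` set, the `MIN_TEXT_CHARS` threshold, and the use of a SHA-256 content hash to spot duplicates are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import PurePath

# Hypothetical values for illustration; a real workspace defines its own.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}
MIN_TEXT_CHARS = 20


@dataclass
class PreProcessor:
    """Hypothetical pre-processing gate covering three of the checks above."""

    reject_duplicates: bool = True  # stands in for the workspace setting
    seen_hashes: set = field(default_factory=set)

    def check(self, filename: str, content: bytes, extracted_text: str) -> str:
        """Return "accepted" or a rejection reason for one ingested file."""
        # Invalid file handling: unsupported file type.
        if PurePath(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
            return "rejected: unsupported file type"
        # Invalid file handling: not enough text found in the document.
        if len(extracted_text.strip()) < MIN_TEXT_CHARS:
            return "rejected: not enough text"
        # De-duplication: assumed here to compare a hash of the raw bytes.
        digest = hashlib.sha256(content).hexdigest()
        if self.reject_duplicates and digest in self.seen_hashes:
            return "rejected: duplicate"
        self.seen_hashes.add(digest)
        return "accepted"
```

Calling `check` twice with the same bytes would accept the first file and reject the second as a duplicate, provided `reject_duplicates` is left enabled. Hashing raw bytes only catches exact copies; detecting re-scanned or re-saved versions of the same document would need a different comparison.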
Importance of Ingestion & Pre-Processing
By ensuring documents are well-prepared before extraction, pre-processing:
- Reduces the likelihood of errors in later stages.
- Improves the speed and accuracy of data extraction.
- Enables the seamless handling of various document types and formats.