Overview

When setting up document extraction in Affinda, one of the most important decisions you’ll make is how to configure your schema.
Your schema is the set of fields the system extracts and returns for each document. It covers not only the list of fields, but also their structure and the formatting applied to them.
This guide explains how to think about field configuration, what trade-offs to consider, and how to use advanced options carefully.

Why Schema Design Matters

The schema is more than just a list of fields. It determines:
  • What data the model will try to extract
  • How easy it is for your team to review and correct that data
  • How cleanly the extracted data can be passed into your other systems
A poorly designed schema can lead to confusing user experiences, low accuracy, or messy downstream integrations. A well-designed one avoids all that.

What to Consider When Designing Your Schema

Here are the key trade-offs to consider when deciding what fields to include, and how to configure them.

End-Process Requirements

Not every piece of data needs to come from the document.
  • Some fields might be optional or easier to fill in later.
  • If a field is nice to have but not essential, consider leaving it out.
  • Prioritize fields that are critical, hard to get elsewhere, or needed to trigger automation.

Model Accuracy

Some field types are harder for the AI to extract accurately—especially complex, nested, or grouped fields.
  • Simple fields = better accuracy
  • Every field adds risk: If the model struggles with a field, it may make errors—even if the rest of the document is simple.
  • More fields = more training needed: Improving accuracy for complex fields may require a custom model or more training data.
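
To make the "every field adds risk" point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every field is extracted correctly with the same probability and that errors on different fields are independent. Real documents will not behave this neatly, and the 98% figure is a made-up number, not an Affinda benchmark.

```python
def chance_document_is_error_free(per_field_accuracy: float, field_count: int) -> float:
    """Probability that every field on a document is extracted correctly.

    Assumes (hypothetically) equal and independent per-field accuracy.
    """
    if not 0.0 <= per_field_accuracy <= 1.0:
        raise ValueError("per_field_accuracy must be between 0 and 1")
    if field_count < 0:
        raise ValueError("field_count must not be negative")
    return per_field_accuracy ** field_count


# Compare schemas of different sizes at an illustrative 98% per-field accuracy.
for count in (5, 10, 30):
    chance = chance_document_is_error_free(0.98, count)
    print(f"{count} fields: {chance:.0%} chance of needing no corrections")
```

Under these assumptions, growing a schema from 10 to 30 fields drops the chance of a document needing no corrections from roughly 82% to roughly 55%, even though each individual field is still "98% accurate".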

User Review & Validation

Your team (or users) will often review documents and confirm or edit extracted data. The easier this is, the faster the process.
  • Simpler is better: Flat schemas with clear field names are easier to check.
  • Avoid overcomplication: Deeply nested fields or grouped data can slow reviewers down and increase mistakes.
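
One way to spot overcomplication early is to measure how deeply a draft schema nests. The sketch below is only an illustration: it represents a schema as a plain Python dict in which a flat field maps to a type name and a group maps to another dict. That representation, the field names, and the two-level limit are all assumptions made for this example, not Affinda's configuration format.

```python
def nesting_depth(schema: dict) -> int:
    """Return 1 for a flat schema, 2 if it contains groups, 3 for groups in groups, and so on."""
    child_depths = [nesting_depth(v) for v in schema.values() if isinstance(v, dict)]
    return 1 + max(child_depths, default=0)


def deep_fields(schema: dict, limit: int = 2, prefix: str = "", level: int = 1) -> list[str]:
    """List dotted paths of fields that sit deeper than `limit` levels."""
    paths = []
    for name, value in schema.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # A group: descend one level and keep building the dotted path.
            paths.extend(deep_fields(value, limit, path + ".", level + 1))
        elif level > limit:
            paths.append(path)
    return paths


# Hypothetical draft schema with a group nested inside another group.
invoice_schema = {
    "invoice_number": "text",
    "invoice_date": "date",
    "supplier": {
        "name": "text",
        "address": {"street": "text", "postcode": "text"},
    },
}

print(nesting_depth(invoice_schema))  # 3
print(deep_fields(invoice_schema))    # ['supplier.address.street', 'supplier.address.postcode']
```

Fields flagged this way are candidates for flattening, for example into top-level fields such as supplier_street and supplier_postcode, if reviewers find the nested version slow to check.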

Integration Complexity

Think about how you plan to use the extracted data.
  • Is your downstream system expecting a specific format?
  • Will someone need to clean or restructure the data before using it?
If your schema aligns with your downstream format, you’ll avoid a lot of post-processing work.
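
As a minimal sketch of what that post-processing can look like, the Python below renames extracted fields to the column names a downstream system expects and reports anything missing. All names here are hypothetical: neither the extracted field names nor the downstream columns come from Affinda or any particular system.

```python
# Hypothetical mapping from extracted field names to downstream column names.
FIELD_TO_COLUMN = {
    "invoice_number": "InvoiceNo",
    "invoice_date": "InvoiceDate",
    "total_amount": "GrossAmount",
}


def to_downstream_record(extracted: dict) -> tuple[dict, list[str]]:
    """Rename extracted fields to downstream columns and report any gaps."""
    record = {}
    missing = []
    for field, column in FIELD_TO_COLUMN.items():
        value = extracted.get(field)
        if value is None:
            missing.append(field)  # downstream needs it, extraction did not supply it
        else:
            record[column] = value
    return record, missing


record, missing = to_downstream_record(
    {"invoice_number": "INV-001", "total_amount": 120.5}
)
print(record)   # {'InvoiceNo': 'INV-001', 'GrossAmount': 120.5}
print(missing)  # ['invoice_date']
```

The closer your schema's field names and structure are to the downstream format, the smaller this mapping layer becomes. In the best case it disappears entirely.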

Higher-Complexity Field Options

The simplest and most common field type is a single, flat value (such as text, a number, or a date) taken from a clearly defined location on the document. These fields are the easiest for the model to extract accurately and for reviewers to check in the UI. Below, we walk through when more advanced field configuration options could be used and the potential downsides that need to be managed when they are introduced. These options are powerful, but they should be used with care. For information on how to configure these settings, see our Configuration Guide.

Final Tips for Success

  • Start simple: Begin with just the essential fields. Add more later if needed.
  • Review early: Test with real documents and see how the data looks.
  • Trim the fat: Remove unused or low-accuracy fields.
  • Talk to us: Affinda can help guide your configuration for optimal results.
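
For the "trim the fat" tip, one rough way to find low-accuracy fields is to look at how often reviewers had to correct each one. The sketch below assumes you have some log of review outcomes with one entry per reviewed field. The log format, the field names, and the 50% threshold are all hypothetical choices for illustration. It does not cover unused fields, which you would identify from your downstream usage instead.

```python
from collections import defaultdict


def correction_rates(review_log: list[dict]) -> dict[str, float]:
    """Share of reviewed values that needed correction, per field."""
    totals = defaultdict(int)
    corrected = defaultdict(int)
    for entry in review_log:
        totals[entry["field"]] += 1
        if entry["corrected"]:
            corrected[entry["field"]] += 1
    return {field: corrected[field] / totals[field] for field in totals}


def fields_to_reconsider(review_log: list[dict], threshold: float = 0.5) -> list[str]:
    """Fields whose correction rate is at or above the threshold, sorted by name."""
    rates = correction_rates(review_log)
    return sorted(field for field, rate in rates.items() if rate >= threshold)


# Hypothetical review log: po_number was corrected in 3 of 4 reviews.
review_log = (
    [{"field": "invoice_number", "corrected": False}] * 4
    + [{"field": "po_number", "corrected": True}] * 3
    + [{"field": "po_number", "corrected": False}]
)

print(fields_to_reconsider(review_log))  # ['po_number']
```

A flagged field is not automatically one to delete. It may instead be a candidate for simplification or additional training.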

Need Help?

If you’re unsure how to configure a field, want advice on best practices, or are working with complex document types, our team is here to help. Reach out any time.