Extraction best practices

Follow these practices to get consistent, high-quality extraction output and to troubleshoot issues quickly.

Schema design

Start small — Define a few fields, run extraction, then add fields as needed. Large schemas are harder to tune.
Use optional fields — Mark a field optional when the source may not always contain the data, so that a missing value does not fail the run.
Nested structures — Use nested objects when the document has clear structure (e.g. sections, repeated blocks). Don’t over-nest for flat content.
Version schemas — Use schema versioning so you can track which runs used which schema and roll back if needed.
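The schema practices above can be sketched in plain Python. This is an illustration only: the invoice fields, the `SCHEMA_VERSION` label, and the `parse_invoice` helper are hypothetical names chosen for the example, not part of any particular extraction API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical version label; store it with every run so results can be
# traced back to the schema that produced them.
SCHEMA_VERSION = "1.1.0"


@dataclass
class LineItem:
    """One repeated block in the document, modeled as a nested object."""
    description: str
    amount: float


@dataclass
class InvoiceSchema:
    """Hypothetical schema that starts with only a few fields."""
    invoice_number: str
    total: float
    # Optional: not every invoice states a due date, so a missing value
    # is recorded as None instead of failing the run.
    due_date: Optional[str] = None
    # Nested, repeated structure for the line-item table.
    line_items: List[LineItem] = field(default_factory=list)


def parse_invoice(raw: dict) -> InvoiceSchema:
    """Build a typed record from a raw extraction result (a plain dict)."""
    items = [LineItem(**item) for item in raw.get("line_items", [])]
    return InvoiceSchema(
        invoice_number=raw["invoice_number"],
        total=float(raw["total"]),
        due_date=raw.get("due_date"),
        line_items=items,
    )
```

Only `invoice_number` and `total` are required here, so a document without a due date or line items still parses. New fields can be added later together with a bump of the version label.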

Quality monitoring

Monitor extraction confidence — Use confidence scores (where available) to find low-quality extractions. Low confidence often means poor source quality or a schema mismatch.
Sample and review — Periodically sample output and compare it against the source documents. Spot-check edge cases such as tables, multi-column layouts, and scanned PDFs.
Iterate on problem docs — If certain document types or sources consistently score low, refine the schema or preprocessing for those types.
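As a rough sketch of these monitoring steps, the helpers below flag low-confidence fields, average confidence per document type, and draw a random sample for manual review. The result shape (a dict with `doc_type` and a per-field `confidence` mapping) and the 0.7 cutoff are assumptions made for the example; adapt them to whatever your extraction output actually contains.

```python
import random
from collections import defaultdict
from statistics import mean

# Assumed cutoff; tune it against your own reviewed samples.
LOW_CONFIDENCE = 0.7


def low_confidence_fields(result: dict, threshold: float = LOW_CONFIDENCE) -> list:
    """Return the names of fields whose confidence is below the threshold."""
    return sorted(
        name for name, score in result["confidence"].items() if score < threshold
    )


def mean_confidence_by_type(results: list) -> dict:
    """Average each document's mean field confidence, grouped by document type."""
    by_type = defaultdict(list)
    for r in results:
        by_type[r["doc_type"]].append(mean(list(r["confidence"].values())))
    return {doc_type: mean(scores) for doc_type, scores in by_type.items()}


def sample_for_review(results: list, k: int, seed: int = 0) -> list:
    """Pick up to k results at random for manual comparison with the source."""
    rng = random.Random(seed)
    return rng.sample(results, min(k, len(results)))
```

Document types with a consistently low average are the natural candidates for schema or preprocessing changes.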

When extraction confidence is low for some fields or documents:

Check the source — Is the document readable (not corrupted, not image-only without OCR)? Are tables or layout complex?
Review the schema — Are field types and expectations aligned with the content? Try making fields optional or relaxing constraints.
Use enrichment — Enable metadata extraction and named entity recognition (NER) where relevant; they can improve downstream quality even if raw field confidence is moderate.
Narrow the scope — Run extraction on a small set of representative docs, fix schema or sources, then scale back up.

If confidence remains low after tuning, consider splitting document types into separate schemas or pipelines so you can optimize per type.
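One way to split by document type is a small routing table that maps each type to its own schema, with a generic fallback. The schema identifiers and the `doc_type` field below are hypothetical; how documents are classified in the first place is outside this sketch.

```python
# Hypothetical registry: document type -> schema identifier.
SCHEMAS_BY_TYPE = {
    "invoice": "invoice_schema_v2",
    "contract": "contract_schema_v1",
}
# Fallback for types that do not have a dedicated schema yet.
DEFAULT_SCHEMA = "generic_schema_v1"


def group_by_schema(documents: list) -> dict:
    """Group document IDs by the schema that should process them."""
    groups = {}
    for doc in documents:
        schema_id = SCHEMAS_BY_TYPE.get(doc["doc_type"], DEFAULT_SCHEMA)
        groups.setdefault(schema_id, []).append(doc["doc_id"])
    return groups
```

Each group can then be run and tuned as its own pipeline, so a change made for one document type does not affect the others.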

Performance

Batch when possible — When processing many documents, use batch endpoints instead of submitting them one by one, to reduce per-request overhead.
Use workflows — Schedule extraction so it runs automatically; avoid ad-hoc large runs during peak hours if you have rate limits.
Cache when appropriate — If the same document is extracted repeatedly with the same schema, cache results to avoid re-running.
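Batching and caching can be combined in a thin wrapper like the one below. It is a minimal sketch under stated assumptions: `extract_batch` stands in for whatever batch call your client exposes (it is not a real API here), the cache is a plain dict, and the cache key is the SHA-256 of the document bytes plus the schema version.

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Tuple


def chunked(items: List[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cache_key(document: bytes, schema_version: str) -> Tuple[str, str]:
    """Same content and same schema version give the same key."""
    return hashlib.sha256(document).hexdigest(), schema_version


def extract_with_cache(
    documents: List[bytes],
    schema_version: str,
    extract_batch: Callable[[List[bytes], str], List[dict]],  # hypothetical batch call
    cache: Dict[Tuple[str, str], dict],
    batch_size: int = 50,
) -> List[dict]:
    """Return one result per document, calling extract_batch only for cache misses."""
    seen = set()
    misses = []
    for doc in documents:
        key = cache_key(doc, schema_version)
        # Skip cached documents and duplicates within this call.
        if key not in cache and key not in seen:
            seen.add(key)
            misses.append(doc)

    # Send the misses in batches rather than one request per document.
    for batch in chunked(misses, batch_size):
        for doc, result in zip(batch, extract_batch(batch, schema_version)):
            cache[cache_key(doc, schema_version)] = result

    return [cache[cache_key(doc, schema_version)] for doc in documents]
```

Because the schema version is part of the key, changing the schema naturally invalidates old entries. The sketch assumes `extract_batch` returns results in the same order as its input.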