Schema design

A well-designed schema helps schema-aware extraction produce consistent smart bites with high extraction confidence. This page covers the main design principles: starting small, using optional fields wisely, structuring nested data, and deciding when to version.

What a schema defines

Fields — Names, types (text, number, date, nested object, array), and whether they are required or optional.
Structure — How extracted content maps to your downstream systems (e.g. RAG, agents, databases). The schema is the contract between Bundata and your app.
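
The field properties listed above (name, type, required or optional) can be modeled in a few lines. This is a minimal sketch for reasoning about a schema, not Bundata's schema format: FieldSpec, the FieldType names, and describe are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical type vocabulary mirroring the types listed above.
FieldType = Literal["text", "number", "date", "object", "array"]


@dataclass(frozen=True)
class FieldSpec:
    """One schema field: a name, a type, and a required/optional flag."""
    name: str
    type: FieldType
    required: bool = False  # optional unless stated otherwise


def describe(schema: list[FieldSpec]) -> str:
    """Render a schema as one readable line per field."""
    lines = []
    for f in schema:
        flag = "required" if f.required else "optional"
        lines.append(f"{f.name}: {f.type} ({flag})")
    return "\n".join(lines)


# Illustrative contract fields; the required flags are assumptions.
contract_schema = [
    FieldSpec("parties", "array", required=True),
    FieldSpec("body", "text", required=True),
    FieldSpec("effective_date", "date"),
]

print(describe(contract_schema))
```

Writing the contract down in one place like this makes it easier to review with the team that owns the downstream system before any extraction runs.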

Start small and iterate

Begin with a few fields — For a contract schema, start with a handful of fields such as title, parties, effective_date, and body. Run extraction on a small set of documents and inspect the output.
Add fields as needed — Once baseline quality is good, add optional fields (e.g. termination_clause, governing_law). Avoid designing a large schema up front before validating on real documents.
Match document type — Use one schema per document type (contracts vs. invoices vs. policies) so extraction and confidence behave predictably.
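
One way to inspect output from a small trial run is to measure how often each field actually comes back filled. The sketch below assumes extracted records are available as plain Python dicts; fill_rates and the sample records are hypothetical and not part of any Bundata API.

```python
from collections import Counter


def fill_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of records with a non-empty value."""
    if not records:
        return {name: 0.0 for name in fields}
    filled = Counter()
    for record in records:
        for name in fields:
            # Treat missing keys, None, and empty containers/strings as unfilled.
            if record.get(name) not in (None, "", [], {}):
                filled[name] += 1
    return {name: filled[name] / len(records) for name in fields}


# Made-up sample output from a trial run on four contracts.
sample = [
    {"title": "Master Services Agreement", "parties": ["Acme", "Globex"],
     "effective_date": "2024-01-01", "body": "Agreement text"},
    {"title": "", "parties": ["Acme", "Initech"],
     "effective_date": None, "body": "Agreement text"},
    {"title": "NDA", "parties": ["Initech", "Globex"], "body": "Agreement text"},
    {"title": "Statement of Work", "parties": [],
     "effective_date": "2024-03-15", "body": "Agreement text"},
]

rates = fill_rates(sample, ["title", "parties", "effective_date", "body"])
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

A field that is filled in nearly every document is a candidate for required; one that is often empty is better left optional or investigated before more fields are added.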

Optional vs. required fields

Required fields — Use only for data that appears in every document of that type. Missing required data can cause run failures or low confidence.
Optional fields — Use for data that may be absent (e.g. optional clauses, attachments list). Optional fields improve robustness across document variants and reduce failed runs.

For contracts, for example, make body and parties required, and make amendments or exhibits optional if not every contract has them.
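
The difference between the two kinds of field can be made concrete with a small downstream check. This is a sketch under the assumption that an extracted record is a dict and that a missing value shows up as an absent key or None; check_record is a hypothetical helper, not a Bundata function.

```python
def check_record(record: dict, required: set[str], optional: set[str]) -> dict[str, list[str]]:
    """Classify problems in one extracted record.

    missing_required: required fields that are absent or None
    missing_optional: optional fields that are absent or None (informational)
    unexpected: keys the schema does not define
    """
    present = {key for key, value in record.items() if value is not None}
    return {
        "missing_required": sorted(required - present),
        "missing_optional": sorted(optional - present),
        "unexpected": sorted(set(record) - required - optional),
    }


# Contract example from the text: body and parties always expected,
# amendments and exhibits only sometimes present.
result = check_record(
    {"body": "Agreement text", "parties": ["Acme", "Globex"], "exhibits": None},
    required={"body", "parties"},
    optional={"amendments", "exhibits"},
)
print(result)
```

Only a non-empty missing_required list needs to be treated as a failure; missing optional fields are expected variation across documents.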

Nested structures

Use nesting when the document has a clear hierarchy, e.g. sections[] with title and content, or line_items[] for invoices.
Don’t over-nest — Deep nesting can make extraction and downstream consumption harder. Prefer a flat structure unless the domain clearly benefits from nested objects.
Arrays — Use arrays for repeated blocks (line items, signatories, sections). See Field types for array and object types.
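
To keep an eye on over-nesting, it can help to measure how deep a record or example payload actually goes. The helper below is an illustrative sketch; nesting_depth is a hypothetical name, and what counts as "too deep" is a judgment call for your own pipeline.

```python
def nesting_depth(value) -> int:
    """Depth of nested dicts and lists; scalars count as depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


# A flat record: one level (the record itself).
flat = {"title": "NDA", "body": "Agreement text"}

# An invoice with an array of line-item objects: record -> array -> object.
invoice = {
    "vendor": "Acme Corp",
    "line_items": [{"description": "Widget", "quantity": 2, "amount": 10.0}],
}

print(nesting_depth(flat), nesting_depth(invoice))
```

An array of flat objects, as in the invoice case, is usually easy to consume; objects nested inside those objects are where downstream handling tends to get harder.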

Real-world examples

Contract schema (minimal)

title (text, optional)
parties (text or array of strings)
effective_date (date, optional)
body (text)
metadata.source (text, optional) for source lineage
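
For comparison, the minimal contract schema above can be written out in JSON Schema notation, built here as a Python dict. JSON Schema is used only as a neutral, widely known notation; whether Bundata accepts this format is not stated on this page. Marking parties and body as required is an assumption based on them not being labeled optional.

```python
import json

contract_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Contract (minimal)",
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        # parties may be a single text value or an array of strings
        "parties": {
            "anyOf": [
                {"type": "string"},
                {"type": "array", "items": {"type": "string"}},
            ]
        },
        "effective_date": {"type": "string", "format": "date"},
        "body": {"type": "string"},
        # metadata.source carries source lineage
        "metadata": {
            "type": "object",
            "properties": {"source": {"type": "string"}},
        },
    },
    # Assumption: fields not marked optional in the list above are required.
    "required": ["parties", "body"],
}

print(json.dumps(contract_schema, indent=2))
```

Allowing two shapes for parties is flexible but pushes a type check onto every consumer; picking the array form alone is often simpler downstream.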

Invoice schema (minimal)

vendor, invoice_number, date, total
line_items (array of { description, quantity, amount })
Optional: tax, payment_terms
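
On the consuming side, the invoice fields above map naturally onto typed objects. The sketch below assumes the extracted output arrives as a dict with string values; the concrete types (Decimal for amounts, an ISO date string, integer quantities) and the names Invoice, LineItem, and invoice_from_dict are illustrative assumptions, not something the schema list specifies.

```python
import datetime as dt
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class LineItem:
    description: str
    quantity: int
    amount: Decimal


@dataclass
class Invoice:
    vendor: str
    invoice_number: str
    date: dt.date
    total: Decimal
    line_items: list[LineItem]
    tax: Optional[Decimal] = None          # optional field
    payment_terms: Optional[str] = None    # optional field


def invoice_from_dict(raw: dict) -> Invoice:
    """Build an Invoice from extracted output; required keys raise KeyError if missing."""
    return Invoice(
        vendor=raw["vendor"],
        invoice_number=raw["invoice_number"],
        date=dt.date.fromisoformat(raw["date"]),
        total=Decimal(raw["total"]),
        line_items=[
            LineItem(item["description"], int(item["quantity"]), Decimal(item["amount"]))
            for item in raw["line_items"]
        ],
        tax=Decimal(raw["tax"]) if raw.get("tax") is not None else None,
        payment_terms=raw.get("payment_terms"),
    )


# Made-up sample record for illustration.
inv = invoice_from_dict({
    "vendor": "Acme Corp",
    "invoice_number": "INV-001",
    "date": "2024-05-01",
    "total": "120.00",
    "line_items": [{"description": "Widget", "quantity": 2, "amount": "60.00"}],
})
print(inv)
```

Optional fields default to None, so a record without tax or payment_terms still loads, while a record missing vendor or total fails loudly.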

Versioning and change management

Version schemas when you add or remove fields or change types. Tie extraction runs to a schema version so source lineage and reprocessing are traceable. See Versioning.
Backward compatibility — Prefer adding optional fields over renaming or removing fields so existing pipelines and the Vector Catalog keep working.
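
The backward-compatibility guidance can be turned into a quick check between two schema versions. This sketch assumes each version is represented as a dict mapping field name to a type and a required flag; that representation and the breaking_changes helper are hypothetical, chosen only to illustrate the rule.

```python
def breaking_changes(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """List changes from old to new that could break existing consumers.

    Each schema maps field name -> {"type": str, "required": bool}.
    Adding an optional field is treated as safe and is not reported.
    """
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed or renamed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} ({spec['type']} -> {new[name]['type']})"
            )
    for name, spec in new.items():
        if name not in old and spec["required"]:
            problems.append(f"new required field: {name}")
    return problems


v1 = {
    "title": {"type": "text", "required": False},
    "body": {"type": "text", "required": True},
}
# v2 only adds an optional field: compatible.
v2 = {**v1, "governing_law": {"type": "text", "required": False}}
# v3 renames body to content: breaking.
v3 = {
    "title": {"type": "text", "required": False},
    "content": {"type": "text", "required": True},
}

print(breaking_changes(v1, v2))
print(breaking_changes(v1, v3))
```

A check like this can run before a new schema version is published, so that breaking changes are made deliberately and paired with a version bump.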

Common pitfalls

Too many required fields — Causes failures on valid documents that omit some data. Prefer optional fields and validate downstream if needed.
Schema doesn’t match document type — Using an invoice schema on contracts hurts confidence and completeness. Use type-specific schemas.
No versioning — Changing a schema without versioning makes it hard to debug and replay runs.

Next steps

Field types — Available types and usage.
Versioning — Schema versions and runs.
Extraction best practices — Quality and troubleshooting.