Build your first schema

This tutorial walks you through creating your first schema in Bundata so that schema-aware extraction produces output in the shape your downstream systems expect. You will open Schema Studio in the platform, add fields, save a version, and test the schema with a sample document (e.g. an invoice or contract).

What a schema is in Bundata

Schema-aware extraction pulls information from documents according to a defined output structure. Instead of returning generic text, Bundata uses context, document layout, and schema logic to identify the right fields, so outputs are predictable, validated, and ready for the Vector Catalog, search, agents, and workflows. A schema is that definition: field names, types (text, number, date, object, array), and whether each field is required or optional. Each extraction run uses a specific schema version, so every run against that version produces smart bites with the same structure.
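
To make the definition concrete, here is a minimal sketch of a schema modeled as plain Python data. This is an illustration only: the Field and Schema classes are hypothetical and are not Bundata's API or storage format. Only the five field types and the required/optional flag come from the description above.

```python
from dataclasses import dataclass

# Field types named in this tutorial.
FIELD_TYPES = {"text", "number", "date", "object", "array"}


@dataclass(frozen=True)
class Field:
    """One schema field (hypothetical model, not Bundata's API)."""
    name: str
    type: str
    required: bool = False  # optional unless stated otherwise

    def __post_init__(self):
        # Reject any type that is not one of the five listed above.
        if self.type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.type!r}")


@dataclass(frozen=True)
class Schema:
    """A named, versioned list of fields (hypothetical model)."""
    name: str
    version: int
    fields: tuple

    def required_names(self):
        return [f.name for f in self.fields if f.required]


contract = Schema(
    name="Contract metadata",
    version=1,
    fields=(
        Field("parties", "array", required=True),
        Field("effective_date", "date"),
        Field("termination_date", "date"),
    ),
)
print(contract.required_names())  # ['parties']
```

Making the classes frozen mirrors the idea that a given schema version should not change after runs have used it.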

Prerequisites

  • A Bundata account. See Sign up and Quickstart if needed.
  • Access to the platform. From the dashboard, open Schema Studio at Platform → Schema Studio (or your organization’s URL).
  • (Optional) A sample document (e.g. a PDF invoice or contract) to test extraction after you create the schema.

Step 1: Open Schema Studio

  1. Sign in to the Bundata Platform.
  2. In the app, go to Schema Studio (e.g. /platform/schema-studio).
  3. Confirm you are in the correct workspace if your account has multiple.

Schema Studio is where you design and version schemas that extraction uses. Product overview: Product → Schema Studio.

Step 2: Create a new schema

  1. Click Create schema (or equivalent).
  2. Give it a name that reflects the document type or use case, e.g. Invoice fields, Contract metadata, or Policy summary.
  3. Save so the schema exists before adding fields.

Using names tied to real document types (invoices, contracts, policies, procurement files) makes it easier to remember which schema to use for which runs.

Step 3: Add fields

Add the fields you want extraction to produce. Start small and expand later.

  • Common starting fields: title (text), body or content (text), page_number (number), source_path (text). For invoices: vendor, amount, date; for contracts: parties, effective_date, termination_date.
  • Field types: Use text for titles and content, number for counts and amounts, date for dates. See Field types for the full list and when to use each.
  • Required vs optional: Mark a field optional if some documents may not contain it (e.g. make amount optional if not every page has one). A required field can cause a failed extraction or an empty value when the source doesn’t contain the data. See Schema design.

Example for a simple invoice-style schema: vendor (text, optional), invoice_date (date, optional), total (number, optional), line_items (array of objects, optional).
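
The invoice example above could be written down as structured data along these lines. This is a sketch under assumptions: the dictionary keys (name, type, required, items) and the line-item sub-fields (description, amount) are hypothetical and chosen for illustration. Schema Studio's real export format is not shown in this tutorial.

```python
import json

# Hypothetical representation of the invoice-style schema described above.
invoice_schema = {
    "name": "Invoice fields",
    "fields": [
        {"name": "vendor", "type": "text", "required": False},
        {"name": "invoice_date", "type": "date", "required": False},
        {"name": "total", "type": "number", "required": False},
        {
            "name": "line_items",
            "type": "array",
            "required": False,
            # Each array element is an object; these sub-fields are assumed.
            "items": {
                "type": "object",
                "fields": [
                    {"name": "description", "type": "text", "required": False},
                    {"name": "amount", "type": "number", "required": False},
                ],
            },
        },
    ],
}

# Serialize for review or for checking into version control.
print(json.dumps(invoice_schema, indent=2))
```

Keeping every field optional at first matches the advice to start small and tighten the schema only after testing.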

Step 4: Save and version

  1. Save the schema.
  2. Create or note the schema version. Versioning ties extraction runs to a specific structure so you can reproduce results and maintain source lineage. See Schema versioning.

When you run extraction, you will select this schema (and optionally a version) so all output conforms to it.
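
The idea of pinning a run to a schema version can be sketched as follows. SchemaRegistry, publish, and get are hypothetical names for an in-memory stand-in and do not describe how Bundata stores versions. Falling back to the latest version when none is given is also an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaVersion:
    name: str
    version: int
    fields: tuple  # (field_name, field_type, required) triples


class SchemaRegistry:
    """In-memory stand-in for a versioned schema store (hypothetical)."""

    def __init__(self):
        self._versions = {}  # keyed by (name, version)

    def publish(self, name, fields):
        """Store fields as the next version; earlier versions stay unchanged."""
        latest = max((v for n, v in self._versions if n == name), default=0)
        schema = SchemaVersion(name, latest + 1, tuple(fields))
        self._versions[(name, schema.version)] = schema
        return schema

    def get(self, name, version=None):
        """Return one version; assume 'latest' when version is None."""
        if version is None:
            version = max(v for n, v in self._versions if n == name)
        return self._versions[(name, version)]


registry = SchemaRegistry()
v1 = registry.publish("Invoice fields", [("vendor", "text", False)])
v2 = registry.publish(
    "Invoice fields",
    [("vendor", "text", False), ("total", "number", False)],
)

# A run record that pins the exact version it used.
run = {
    "document": "sample-invoice.pdf",
    "schema": v1.name,
    "schema_version": v1.version,
}
pinned = registry.get(run["schema"], run["schema_version"])
print(pinned.version, len(pinned.fields))       # 1 1
print(registry.get("Invoice fields").version)   # 2
```

Because the run record stores both the schema name and the version number, the structure that run used can still be looked up after the schema has moved on to version 2.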

Step 5: Test with a sample document

  1. In Schema Studio, use the preview or test flow if available: run extraction on a sample document and inspect the output.
  2. Confirm that the output shape matches your schema and that key fields are populated as expected.
  3. If fields are missing or wrong, adjust the schema (e.g. add optional fields, fix types) and test again.

Testing before you run at scale helps you catch low extraction confidence and output that doesn’t match your downstream systems.
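
If you copy the test output out of the platform, a small script can check its shape for you. The sketch below assumes the output is available as a Python dictionary with dates already parsed. The check_output function, the type mapping, and the sample records are all hypothetical and are not part of Bundata.

```python
from datetime import date
from numbers import Number

# Python type expected for each schema field type (an assumed mapping).
PYTHON_TYPES = {
    "text": str,
    "number": Number,
    "date": date,
    "object": dict,
    "array": list,
}

# (name, type, required) triples mirroring the invoice example.
INVOICE_FIELDS = [
    ("vendor", "text", False),
    ("invoice_date", "date", False),
    ("total", "number", False),
    ("line_items", "array", False),
]


def check_output(fields, record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    known = {name for name, _, _ in fields}
    for name, type_name, required in fields:
        value = record.get(name)
        if value is None:
            if required:
                problems.append(f"missing required field: {name}")
            continue
        expected = PYTHON_TYPES[type_name]
        # No field type here accepts booleans, and bool is a subclass of int,
        # so reject it explicitly instead of letting it pass as a number.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {type_name}, got {type(value).__name__}"
            )
    for name in record:
        if name not in known:
            problems.append(f"unexpected field: {name}")
    return problems


good = {
    "vendor": "Acme Ltd",
    "invoice_date": date(2024, 3, 1),
    "total": 120.5,
    "line_items": [],
}
bad = {"vendor": "Acme Ltd", "total": "120.50", "po_number": "PO-7"}

print(check_output(INVOICE_FIELDS, good))  # []
print(check_output(INVOICE_FIELDS, bad))
# ['total: expected number, got str', 'unexpected field: po_number']
```

A string where a number is expected, as in the second record, is the kind of type mismatch that step 3 above asks you to fix in the schema before testing again.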

What happens behind the scenes

When you run extraction with this schema, Bundata partitions the document, applies context-aware extraction to fill each field from layout and content, and produces smart bites that conform to the schema. Those outputs are vector-ready intelligence: they can be indexed in the Vector Catalog, used for Vector Search, and fed to agents for grounded answers with source lineage.

Common mistakes

  • Too many required fields — Documents that omit a required field can cause failed extractions or empty values. Prefer optional fields and validate downstream if needed.
  • Schema doesn’t match document type — Using an invoice schema on contracts (or vice versa) hurts confidence and completeness. Use one schema per document type or use case.
  • Skipping versioning — Changing a schema without versioning makes it hard to know which runs used which structure. Always version before big changes.
  • No testing — Run extraction on a few sample documents (contracts, receipts, policies) and inspect output before scaling.
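
One way to decide which fields are safe to mark required is to measure how often each field is missing across a handful of test outputs. The sketch below assumes those outputs are plain dictionaries. The helper names and the sample records are made up for illustration.

```python
def missing_rates(field_names, records):
    """Fraction of records in which each field is absent or None."""
    if not records:
        raise ValueError("need at least one sample record")
    return {
        name: sum(1 for r in records if r.get(name) is None) / len(records)
        for name in field_names
    }


def safe_to_require(field_names, records):
    """Fields present in every sample; only these are candidates for required."""
    rates = missing_rates(field_names, records)
    return [name for name in field_names if rates[name] == 0.0]


# Made-up outputs from four sample invoices.
samples = [
    {"vendor": "Acme Ltd", "total": 120.5},
    {"vendor": "Globex", "total": None},
    {"vendor": "Initech", "total": 42.0, "invoice_date": "2024-03-01"},
    {"vendor": "Umbrella"},
]
fields = ["vendor", "invoice_date", "total"]

print(missing_rates(fields, samples))
# {'vendor': 0.0, 'invoice_date': 0.75, 'total': 0.5}
print(safe_to_require(fields, samples))  # ['vendor']
```

A field that is present in every sample is only a candidate for required. With a small sample, keeping it optional and validating downstream remains the safer default.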

Next steps