Schema versioning

Schema versioning lets you track changes to schemas over time and associate each extraction run with a specific version. That supports source lineage, reprocessing, and debugging when output shape or extraction confidence changes.

Why version schemas

  • Reproducibility — Re-run extraction with the same schema version to get consistent output.
  • Lineage — Know which schema version produced which smart bites in the Vector Catalog or downstream systems.
  • Rollback — If a new version causes regressions, you can revert to a previous version and re-run.

How versioning works

  • When you save a schema in Schema Studio, you can create a new version (e.g. v1, v2). Versions are immutable once created.
  • Each extraction run is tied to a schema (and optionally a version). Run metadata and source lineage can include the schema version.
  • When you query the Vector Catalog or inspect run results, you can filter or display by schema version to understand what structure to expect.
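
The bullets above can be sketched as a small in-memory model. This is only an illustration of the rules (immutable versions, runs that carry a version label, filtering by version); the names SchemaVersion, RunRecord, VersionRegistry, and runs_for_version are hypothetical and are not part of Schema Studio or the Vector Catalog API.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping, Optional, Tuple


@dataclass(frozen=True)
class SchemaVersion:
    """One saved snapshot of a schema (hypothetical model)."""
    schema_name: str
    version: str                  # e.g. "v1", "v2"
    fields: Mapping[str, str]     # field name -> type name


@dataclass(frozen=True)
class RunRecord:
    """Metadata for one extraction run (hypothetical model)."""
    run_id: str
    schema_name: str
    schema_version: Optional[str]  # None when the run was not pinned to a version


class VersionRegistry:
    """Stores versions and refuses to overwrite an existing one."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str], SchemaVersion] = {}

    def create(self, schema_name: str, version: str,
               fields: Dict[str, str]) -> SchemaVersion:
        key = (schema_name, version)
        if key in self._versions:
            # Versions are immutable once created: never replace a snapshot.
            raise ValueError(f"{schema_name} {version} already exists")
        # Copy the fields and wrap them read-only so the snapshot cannot drift.
        snapshot = SchemaVersion(schema_name, version,
                                 MappingProxyType(dict(fields)))
        self._versions[key] = snapshot
        return snapshot

    def get(self, schema_name: str, version: str) -> SchemaVersion:
        return self._versions[(schema_name, version)]


def runs_for_version(runs: List[RunRecord], schema_name: str,
                     version: str) -> List[RunRecord]:
    """Return only the runs produced with the given schema version."""
    return [r for r in runs
            if r.schema_name == schema_name and r.schema_version == version]
```

Copying the field mapping into a read-only view means a later edit to the caller's dictionary cannot silently change a saved version, which is the property that makes old runs reproducible.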

When to create a new version

  • Adding optional fields — New optional fields are backward compatible; creating a version helps track when they were introduced.
  • Changing types or required fields — Changing a field's type, or whether it is required, can break downstream consumers. Create a new version and update pipelines to use it; consider re-running extraction for affected documents.
  • Deprecating fields — Prefer leaving a deprecated field in place as optional and unused; when you do remove or rename it, do so in a new version.
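
As a rough illustration of the rules above, the following sketch compares two schema definitions and labels each difference as compatible or breaking. The Field type and both function names are hypothetical. Treating an added required field as breaking is an assumption, and a rename shows up as one removal plus one addition.

```python
from typing import Dict, List, NamedTuple, Tuple


class Field(NamedTuple):
    """Hypothetical field description: a type name and a required flag."""
    type: str
    required: bool


def classify_changes(old: Dict[str, Field],
                     new: Dict[str, Field]) -> List[Tuple[str, str]]:
    """Return (kind, message) pairs, where kind is 'compatible' or 'breaking'."""
    changes: List[Tuple[str, str]] = []
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before is None:
            if after.required:
                # Assumption: a new required field is treated as breaking.
                changes.append(("breaking", f"added required field {name}"))
            else:
                changes.append(("compatible", f"added optional field {name}"))
        elif after is None:
            changes.append(("breaking", f"removed field {name}"))
        elif before.type != after.type:
            changes.append(
                ("breaking",
                 f"{name} changed type {before.type} -> {after.type}"))
        elif before.required != after.required:
            was = "required" if before.required else "optional"
            now = "required" if after.required else "optional"
            changes.append(("breaking", f"{name} changed from {was} to {now}"))
    return changes


def needs_migration(changes: List[Tuple[str, str]]) -> bool:
    """True when at least one change can break downstream consumers."""
    return any(kind == "breaking" for kind, _ in changes)
```

A check like this could run before saving a new version, so that breaking changes are noted in the version description and pipelines are updated deliberately.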

Best practices

  • Version before big changes — Before a major schema change, save the current state as a version so you can compare or roll back.
  • Document versions — In Schema Studio or your runbooks, note what changed in each version (e.g. “v2: added line_items.tax”).
  • Re-run selectively — When you need new fields or fixes, re-run extraction with the new schema version only for the documents that need it, or use workflow orchestration to refresh the Vector Catalog over time.
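
Selective re-runs can be sketched as a simple filter over the version each document was last extracted with. The function names and the input mapping are hypothetical, and the sketch assumes version labels follow the "v<integer>" pattern used in the examples above; it does not call any real catalog API.

```python
from typing import Dict, List, Optional


def version_number(label: str) -> int:
    """Parse a label such as "v2" into 2 (assumes the "v<integer>" convention)."""
    if not label.startswith("v") or not label[1:].isdigit():
        raise ValueError(f"unexpected version label: {label!r}")
    return int(label[1:])


def documents_to_rerun(last_version_by_doc: Dict[str, Optional[str]],
                       introduced_in: str) -> List[str]:
    """List documents last extracted before `introduced_in`.

    `last_version_by_doc` maps a document id to the schema version of its
    most recent run, or None when the run was not pinned to a version.
    """
    threshold = version_number(introduced_in)
    stale = []
    for doc_id, label in last_version_by_doc.items():
        # Unversioned runs are treated as stale because their shape is unknown.
        if label is None or version_number(label) < threshold:
            stale.append(doc_id)
    return sorted(stale)
```

For example, if a field was introduced in v2, only documents last extracted with v1 (or with no recorded version) are returned, and documents already at v2 or later are left alone.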

Common pitfalls

  • Changing schema without versioning — Makes it hard to know why output changed or how to reproduce old results.
  • Removing required fields without a new version — Downstream code may assume the field exists; remove the field in a new version and migrate consumers to that version.
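
One way downstream code can stay safe across such a migration is to branch on the schema version recorded with each result. The sketch below is hypothetical throughout: the record layout (a "schema_version" key next to a "data" payload) and the po_number field are assumptions made for illustration, not the actual output format.

```python
from typing import Any, Callable, Dict, Optional


def _po_number_v1(data: Dict[str, Any]) -> Optional[str]:
    # In this hypothetical v1, po_number is required, so index directly.
    return data["po_number"]


def _po_number_v2(data: Dict[str, Any]) -> Optional[str]:
    # In this hypothetical v2, po_number was removed; report it as absent.
    return None


# One reader per known schema version.
READERS: Dict[str, Callable[[Dict[str, Any]], Optional[str]]] = {
    "v1": _po_number_v1,
    "v2": _po_number_v2,
}


def read_po_number(record: Dict[str, Any]) -> Optional[str]:
    """Read po_number according to the version that produced the record."""
    version = record.get("schema_version")
    reader = READERS.get(version)
    if reader is None:
        # Fail loudly on unknown versions instead of guessing the shape.
        raise ValueError(f"unsupported schema version: {version!r}")
    return reader(record["data"])
```

Raising on an unknown version surfaces an unversioned or unexpected schema change immediately, instead of letting it appear later as a missing-key error.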

Next steps