Schema versioning

Schema versioning lets you track changes to schemas over time and associate each extraction run with a specific version. That supports source lineage, reprocessing, and debugging when output shape or extraction confidence changes.

Why version schemas

Reproducibility — Re-run extraction with the same schema version to get consistent output.
Lineage — Know which schema version produced which smart bites in the Vector Catalog or downstream systems.
Rollback — If a new version causes regressions, you can revert to a previous version and re-run.

How versioning works

When you save a schema in Schema Studio, you can create a new version (e.g. v1, v2). Versions are immutable once created.
Each extraction run is tied to a schema and, optionally, to a specific version of that schema. Run metadata and source lineage can include the schema version.
When you query the Vector Catalog or inspect run results, you can filter or display by schema version to understand what structure to expect.
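
The relationships described above can be sketched in a few lines of Python. This is only an illustrative model, not the product's API: the class names SchemaVersion and ExtractionRun, their attributes, and the helper runs_for_version are all hypothetical, chosen to show immutable versions, runs that record a version, and filtering by version.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class SchemaVersion:
    """One immutable snapshot of a schema (hypothetical model)."""
    schema_name: str
    version: str                 # e.g. "v1", "v2"
    fields: Mapping[str, str]    # field name -> type name

    def __post_init__(self) -> None:
        # Copy into a read-only view so the field map cannot be mutated
        # after the version is created.
        object.__setattr__(self, "fields", MappingProxyType(dict(self.fields)))


@dataclass(frozen=True)
class ExtractionRun:
    """Run metadata that records which schema (and version) was used."""
    run_id: str
    document_id: str
    schema_name: str
    schema_version: Optional[str] = None  # the version is optional


def runs_for_version(
    runs: List[ExtractionRun], schema_name: str, version: str
) -> List[ExtractionRun]:
    """Return only the runs produced with the given schema version."""
    return [
        run for run in runs
        if run.schema_name == schema_name and run.schema_version == version
    ]


# Example data (made up for illustration).
invoice_v1 = SchemaVersion("invoice", "v1", {"invoice_id": "string"})
example_runs = [
    ExtractionRun("run-1", "doc-a", "invoice", "v1"),
    ExtractionRun("run-2", "doc-b", "invoice", "v2"),
    ExtractionRun("run-3", "doc-c", "invoice"),  # no version recorded
]
v1_runs = runs_for_version(example_runs, "invoice", "v1")
```

In this sketch, assigning to invoice_v1.version or adding a key to invoice_v1.fields raises an error, which mirrors the rule that a version cannot change once created.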

When to create a new version

Adding optional fields — New optional fields are backward compatible; creating a version helps track when they were introduced.
Changing types or required fields — Can break downstream consumers. Create a new version and update pipelines to use it; consider re-running extraction for affected documents.
Deprecating fields — Prefer leaving a deprecated field optional and unused rather than removing it right away; when you do remove or rename fields, introduce a new version.
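
The rules above can be expressed as a small comparison between two schema versions. The following is a minimal sketch under stated assumptions: the Field tuple, the dict-based schema representation, and classify_changes are hypothetical and not part of Schema Studio. It treats any change to a field's required flag as breaking, and it also treats a newly added required field as breaking, which is an assumption beyond what the list states.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    """Hypothetical field definition: a type name and a required flag."""
    type: str
    required: bool


def classify_changes(
    old: Dict[str, Field], new: Dict[str, Field]
) -> Dict[str, List[str]]:
    """Sort the differences between two schema versions into two buckets."""
    compatible: List[str] = []
    breaking: List[str] = []

    # Fields that exist only in the new version.
    for name in sorted(new.keys() - old.keys()):
        if new[name].required:
            # Assumption: a new required field is treated as breaking.
            breaking.append(f"added required field: {name}")
        else:
            compatible.append(f"added optional field: {name}")

    # Fields that exist only in the old version. A rename shows up here
    # as a removal plus an addition.
    for name in sorted(old.keys() - new.keys()):
        breaking.append(f"removed field: {name}")

    # Fields present in both versions.
    for name in sorted(old.keys() & new.keys()):
        if old[name].type != new[name].type:
            breaking.append(f"changed type: {name}")
        elif old[name].required != new[name].required:
            breaking.append(f"changed required flag: {name}")

    return {"compatible": compatible, "breaking": breaking}


# Example schemas (made up for illustration).
v1 = {"invoice_id": Field("string", True), "total": Field("number", True)}
v2 = {**v1, "line_items.tax": Field("number", False)}
v3 = {"invoice_id": Field("string", True), "total": Field("string", True)}

v1_to_v2 = classify_changes(v1, v2)
v1_to_v3 = classify_changes(v1, v3)
```

A non-empty "breaking" list is the signal to update pipelines and consider re-running extraction for affected documents.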

Best practices

Version before big changes — Before a major schema change, save the current state as a version so you can compare or roll back.
Document versions — In Schema Studio or your runbooks, note what changed in each version (e.g. “v2: added line_items.tax”).
Re-run selectively — When you need new fields or fixes, re-run extraction with the new schema version only for the documents that need it, or use workflow orchestration to refresh the Vector Catalog over time.
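
Selective re-runs come down to choosing the documents whose most recent run used a different schema version. The sketch below assumes you already have a mapping from document ID to the schema version of its latest run; how you obtain that mapping (run metadata, lineage queries) is outside this example, and the function name documents_to_rerun is hypothetical.

```python
from typing import Dict, List, Optional, Set


def documents_to_rerun(
    latest_version_by_doc: Dict[str, Optional[str]],
    target_version: str,
    candidates: Optional[Set[str]] = None,
) -> List[str]:
    """Pick documents whose latest run did not use target_version.

    If candidates is given, only those document IDs are considered, so the
    re-run can be limited to documents that need the new fields or fixes.
    """
    selected: List[str] = []
    for doc_id, version in sorted(latest_version_by_doc.items()):
        if version == target_version:
            continue  # already extracted with the target version
        if candidates is not None and doc_id not in candidates:
            continue  # not in the set that needs the change
        selected.append(doc_id)
    return selected


# Example data (made up for illustration).
latest = {"doc-a": "v1", "doc-b": "v2", "doc-c": "v1", "doc-d": None}
everything_stale = documents_to_rerun(latest, "v2")
only_needed = documents_to_rerun(latest, "v2", candidates={"doc-a", "doc-b"})
```

The returned list can then be fed to whatever triggers extraction, either in one batch or spread over time by a scheduler.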

Common pitfalls

Changing schema without versioning — Makes it hard to know why output changed or how to reproduce old results.
Removing required fields without a new version — Downstream code may assume the field exists; use a new version and migrate.
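
One way to protect downstream code from these pitfalls is to have it dispatch on the schema version recorded with each result rather than assume a field exists. The sketch below is hypothetical throughout: the record layout (a "schema_version" key and a "fields" dict), the field names total and amount, and the idea that a v2 renamed total to amount are invented purely for illustration.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def _from_v1(fields: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical v1 shape: the value is stored under "total".
    return {"amount": fields["total"]}


def _from_v2(fields: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical v2 shape: "total" was renamed to "amount".
    return {"amount": fields["amount"]}


# One normalizer per known schema version.
NORMALIZERS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "v1": _from_v1,
    "v2": _from_v2,
}


def normalize(record: Record) -> Dict[str, Any]:
    """Convert a versioned record to one shape, failing loudly if unknown."""
    version = record.get("schema_version")
    normalizer = NORMALIZERS.get(version)
    if normalizer is None:
        raise ValueError(f"unsupported or missing schema version: {version!r}")
    return normalizer(record["fields"])


# Example records (made up for illustration).
old_record = {"schema_version": "v1", "fields": {"total": 10}}
new_record = {"schema_version": "v2", "fields": {"amount": 12}}
normalized = [normalize(old_record), normalize(new_record)]
```

Failing on an unknown or missing version makes an unversioned schema change visible immediately instead of surfacing later as a missing-field error.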

Next steps

Schema design — Designing schemas and choosing optional vs. required fields.
Extraction runs — How runs reference schemas.
Vector Catalog: Lineage — Lineage and freshness.