Schema versioning
Schema versioning lets you track changes to schemas over time and associate each extraction run with a specific version. That supports source lineage, reprocessing, and debugging when output shape or extraction confidence changes.
Why version schemas
- Reproducibility — Re-run extraction with the same schema version to get consistent output.
- Lineage — Know which schema version produced which smart bites in the Vector Catalog or downstream systems.
- Rollback — If a new version causes regressions, you can revert to a previous version and re-run.
How versioning works
- When you save a schema in Schema Studio, you can create a new version (e.g. v1, v2). Versions are immutable once created.
- Each extraction run is tied to a schema (and optionally a version). Run metadata and source lineage can include the schema version.
- When you query the Vector Catalog or inspect run results, you can filter or display by schema version to understand what structure to expect.
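The behavior described above can be illustrated with a small in-memory model. This is only a sketch: `SchemaVersion`, `SchemaRegistry`, `runs_for_version`, and the run-metadata keys (`schema`, `schema_version`) are hypothetical names chosen for illustration, not the Schema Studio or Vector Catalog API.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class SchemaVersion:
    """One immutable snapshot of a schema (hypothetical model, not a product API)."""
    schema_name: str
    version: str               # e.g. "v1", "v2"
    fields: MappingProxyType   # field name -> type name, read-only view


class SchemaRegistry:
    """In-memory stand-in for a store of saved schema versions."""

    def __init__(self):
        self._versions = {}

    def create_version(self, schema_name, version, fields):
        key = (schema_name, version)
        if key in self._versions:
            # Versions are immutable once created, so never overwrite one.
            raise ValueError(f"{schema_name} {version} already exists")
        snapshot = SchemaVersion(schema_name, version, MappingProxyType(dict(fields)))
        self._versions[key] = snapshot
        return snapshot

    def get(self, schema_name, version):
        return self._versions[(schema_name, version)]


def runs_for_version(runs, schema_name, version):
    """Filter run metadata records by the schema version they recorded."""
    return [
        r for r in runs
        if r["schema"] == schema_name and r.get("schema_version") == version
    ]


registry = SchemaRegistry()
registry.create_version("invoice", "v1", {"total": "number"})
registry.create_version("invoice", "v2", {"total": "number", "tax": "number"})

# Hypothetical run metadata; the version is optional, as in the text above.
runs = [
    {"run_id": "run-1", "schema": "invoice", "schema_version": "v1"},
    {"run_id": "run-2", "schema": "invoice", "schema_version": "v2"},
    {"run_id": "run-3", "schema": "invoice"},  # version not recorded
]
v2_runs = runs_for_version(runs, "invoice", "v2")
```

Note that `run-3` cannot be matched to any version, which is the practical cost of leaving the version off a run.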
When to create a new version
- Adding optional fields — New optional fields are backward compatible; creating a version helps track when they were introduced.
- Changing types or required fields — Can break downstream consumers. Create a new version and update pipelines to use it; consider re-running extraction for affected documents.
- Deprecating fields: Prefer leaving the field in place as optional and unused rather than removing it right away; when you do remove or rename a field, introduce a new version.
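One way to apply these rules is to compare two schema definitions and sort the differences into breaking and backward-compatible changes. The sketch below assumes a hypothetical schema representation (a dict mapping each field name to its `type` and `required` flag); `classify_change` is an illustrative helper, not a product feature. Treating a newly added required field as breaking is also an assumption, made because output from older runs will not contain it.

```python
def classify_change(old_fields, new_fields):
    """Compare two schema versions and list breaking vs. compatible changes.

    Each argument maps field name -> {"type": str, "required": bool}
    (a hypothetical representation used only for this example).
    """
    breaking, compatible = [], []

    for name, old in old_fields.items():
        new = new_fields.get(name)
        if new is None:
            breaking.append(f"removed field: {name}")
        elif new["type"] != old["type"]:
            breaking.append(f"type changed: {name} ({old['type']} -> {new['type']})")
        elif new["required"] != old["required"]:
            breaking.append(f"required flag changed: {name}")

    for name, new in new_fields.items():
        if name not in old_fields:
            if new["required"]:
                # Assumption: older output lacks this field, so treat as breaking.
                breaking.append(f"added required field: {name}")
            else:
                compatible.append(f"added optional field: {name}")

    return {"breaking": breaking, "compatible": compatible}


v1 = {
    "total": {"type": "number", "required": True},
    "notes": {"type": "string", "required": False},
}
v2 = {
    "total": {"type": "string", "required": True},   # type change
    "notes": {"type": "string", "required": False},  # unchanged
    "tax": {"type": "number", "required": False},    # new optional field
}
report = classify_change(v1, v2)
```

A non-empty `breaking` list suggests updating downstream pipelines and considering a re-run for affected documents, while a report with only `compatible` entries is mainly useful for tracking when fields were introduced.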
Best practices
- Version before big changes — Before a major schema change, save the current state as a version so you can compare or roll back.
- Document versions — In Schema Studio or your runbooks, note what changed in each version (e.g. “v2: added line_items.tax”).
- Re-run selectively — When you need new fields or fixes, re-run extraction with the new schema version only for the documents that need it, or use workflow orchestration to refresh the Vector Catalog over time.
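Selective re-runs can be planned by comparing each document's most recent schema version with the target version. The following is a minimal sketch under stated assumptions: version labels follow a `v<integer>` convention, and `latest_run_by_doc` is a hypothetical mapping you would build from your own run metadata. Neither helper is a product API.

```python
def version_number(label):
    """Turn a label like 'v2' into 2 (assumes the 'v<integer>' convention)."""
    if not label.startswith("v"):
        raise ValueError(f"unexpected version label: {label!r}")
    return int(label[1:])


def select_for_rerun(latest_run_by_doc, target_version, wanted_docs=None):
    """Return document IDs whose latest run used a version older than the target.

    If wanted_docs is given, only those documents are considered, so a re-run
    can be limited to the documents that actually need the new fields.
    """
    target = version_number(target_version)
    selected = []
    for doc_id, run in sorted(latest_run_by_doc.items()):
        if wanted_docs is not None and doc_id not in wanted_docs:
            continue
        if version_number(run["schema_version"]) < target:
            selected.append(doc_id)
    return selected


# Hypothetical metadata: the latest run recorded for each document.
latest_run_by_doc = {
    "doc-a": {"schema_version": "v1"},
    "doc-b": {"schema_version": "v2"},
    "doc-c": {"schema_version": "v1"},
}

all_stale = select_for_rerun(latest_run_by_doc, "v2")
only_needed = select_for_rerun(latest_run_by_doc, "v2", wanted_docs={"doc-b", "doc-c"})
```

Here `all_stale` covers every document still on an older version, while `only_needed` narrows the re-run to a chosen subset.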
Common pitfalls
- Changing schema without versioning — Makes it hard to know why output changed or how to reproduce old results.
- Removing required fields without a new version — Downstream code may assume the field exists; use a new version and migrate.
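Downstream code can guard against these pitfalls by checking the schema version before reading fields. The sketch below is illustrative only: the record layout (`schema_version`, `data`, `line_items`) and the `total_tax` helper are assumptions, with v2 adding an optional `tax` to each line item as in the "v2: added line_items.tax" example.

```python
def total_tax(record):
    """Sum line-item tax for a record, or return None if its version has no tax.

    Raises ValueError when the version is missing or unknown, rather than
    guessing at the output shape.
    """
    version = record.get("schema_version")
    if version is None:
        raise ValueError("record has no schema_version; output shape is unknown")

    items = record["data"]["line_items"]
    if version == "v1":
        return None  # v1 has no tax field in this hypothetical schema
    if version == "v2":
        # Assumption: tax is optional in v2, so default missing values to 0.0.
        return sum(item.get("tax", 0.0) for item in items)
    raise ValueError(f"unsupported schema version: {version}")


rec_v1 = {"schema_version": "v1", "data": {"line_items": [{"amount": 10.0}]}}
rec_v2 = {
    "schema_version": "v2",
    "data": {"line_items": [{"amount": 10.0, "tax": 1.0}, {"amount": 5.0, "tax": 0.5}]},
}

tax_v1 = total_tax(rec_v1)
tax_v2 = total_tax(rec_v2)
```

Failing loudly on an unversioned or unknown record makes it obvious when output was produced by a schema the consumer has not been migrated to.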
Next steps
- Schema design: How to design schemas and choose between optional and required fields.
- Extraction runs — How runs reference schemas.
- Vector Catalog: Lineage — Lineage and freshness.