Lineage & quality
The Vector Catalog preserves source lineage for every smart bite: which document and extraction run produced it. Lineage supports grounded answers, auditing, and quality monitoring. This page covers lineage, freshness, and how to use quality signals.
Source lineage
Source lineage is the link from a catalog result back to:
- Source document — Original file or record (e.g. S3 key, document ID).
- Extraction run — Run ID and schema version used.
- Collection — Which collection the bite was indexed into.
Use lineage to:
- Cite sources — Show users which document and section a grounded answer came from.
- Debug — When search returns wrong or stale content, trace back to the run and document.
- Compliance — Prove where data came from for audit and governance.
Lineage is stored as metadata on each bite and returned with search results. Expose it in your app or agent UI so answers are traceable.
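As an illustration of exposing lineage, the sketch below turns the lineage metadata on one search result into a citation string. The result shape and the field names (`metadata`, `source_document`, `run_id`, `schema_version`, `collection`) are assumptions made for this example, not the catalog's documented response format; map them to whatever your search results actually return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    # All field names here are hypothetical; adapt them to your result payload.
    source_document: str  # e.g. an S3 key or document ID
    run_id: str           # extraction run that produced the bite
    schema_version: str   # schema version used by that run
    collection: str       # collection the bite was indexed into


def lineage_from_result(result: dict) -> Lineage:
    """Read lineage fields from the (assumed) `metadata` dict of a search result."""
    meta = result["metadata"]
    return Lineage(
        source_document=meta["source_document"],
        run_id=meta["run_id"],
        schema_version=meta["schema_version"],
        collection=meta["collection"],
    )


def format_citation(lineage: Lineage) -> str:
    """Build a human-readable citation to show next to a grounded answer."""
    return (
        f"Source: {lineage.source_document} "
        f"(run {lineage.run_id}, schema {lineage.schema_version}, "
        f"collection {lineage.collection})"
    )


# Example search result with made-up values.
result = {
    "text": "Invoice total is 1,200 EUR.",
    "metadata": {
        "source_document": "s3://example-bucket/invoices/inv-001.pdf",
        "run_id": "run-42",
        "schema_version": "v3",
        "collection": "invoices",
    },
}
print(format_citation(lineage_from_result(result)))
```

Keeping lineage in a small typed object makes it easy to reuse the same citation in an app UI, an agent response, or an audit log.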
Freshness
- Freshness — How up to date the catalog is relative to your source systems. Workflows that run on a schedule or on source events keep the catalog fresh.
- Stale data — If extraction or ingestion hasn’t run recently, new or updated documents won’t appear in search. Use Workflows and Triggers & scheduling to keep collections updated.
- Re-indexing — When you change schema or re-run extraction, re-ingest the new smart bites into the collection so search reflects the latest output.
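One way to detect stale data is to compare each source document's last-modified time with the time it was last indexed. The sketch below assumes you can collect both timestamps yourself (for example from your source system and from run metadata). The function name and the two input dictionaries are hypothetical and not part of the product.

```python
from datetime import datetime, timezone


def find_stale_documents(
    source_modified: dict[str, datetime],
    last_indexed: dict[str, datetime],
) -> list[str]:
    """Return IDs of documents that changed after their last indexing,
    or that were never indexed at all."""
    stale = []
    for doc_id, modified_at in source_modified.items():
        indexed_at = last_indexed.get(doc_id)
        if indexed_at is None or indexed_at < modified_at:
            stale.append(doc_id)
    return sorted(stale)


# Made-up timestamps: doc-a is fresh, doc-b changed after indexing,
# doc-c has never been indexed.
source_modified = {
    "doc-a": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "doc-b": datetime(2024, 1, 5, tzinfo=timezone.utc),
    "doc-c": datetime(2024, 1, 6, tzinfo=timezone.utc),
}
last_indexed = {
    "doc-a": datetime(2024, 1, 2, tzinfo=timezone.utc),
    "doc-b": datetime(2024, 1, 3, tzinfo=timezone.utc),
}
print(find_stale_documents(source_modified, last_indexed))  # ['doc-b', 'doc-c']
```

The returned IDs are candidates for re-extraction and re-ingestion, whether triggered manually or by a scheduled workflow.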
Quality metrics
- Extraction confidence — Carried from extraction runs. Low-confidence bites may be filtered out or flagged in the UI. See Extraction best practices.
- Completeness — Whether required schema fields were populated. Gaps may indicate poor document quality or a schema mismatch.
- Coverage — Proportion of source documents that have been successfully extracted and indexed. Monitor failed runs and fix connectors or schemas.
Use these signals to tune extraction, schema, and workflows so the catalog stays accurate and trustworthy for RAG and agents.
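The three signals above can be computed with simple arithmetic once you have the underlying data. The following is a minimal sketch under stated assumptions: bites are plain dictionaries with a hypothetical `confidence` key, the 0.8 threshold is an arbitrary example value, and the function names are illustrative rather than a product API.

```python
def completeness(bite_fields: dict, required_fields: list[str]) -> float:
    """Fraction of required schema fields that hold a non-empty value."""
    if not required_fields:
        return 1.0
    populated = sum(
        1 for name in required_fields
        if bite_fields.get(name) not in (None, "", [])
    )
    return populated / len(required_fields)


def coverage(indexed_doc_ids: set[str], source_doc_ids: set[str]) -> float:
    """Fraction of source documents that were extracted and indexed."""
    if not source_doc_ids:
        return 1.0
    return len(indexed_doc_ids & source_doc_ids) / len(source_doc_ids)


def split_by_confidence(bites: list[dict], threshold: float = 0.8):
    """Separate bites to keep from low-confidence bites to flag or filter."""
    kept = [b for b in bites if b["confidence"] >= threshold]
    flagged = [b for b in bites if b["confidence"] < threshold]
    return kept, flagged


# Made-up inputs for illustration.
fields = {"total": 1200, "vendor": "", "date": "2024-01-05"}
required = ["total", "vendor", "date", "currency"]
print(completeness(fields, required))                   # 0.5
print(coverage({"a", "b", "c"}, {"a", "b", "c", "d"}))  # 0.75

bites = [{"id": "b1", "confidence": 0.95}, {"id": "b2", "confidence": 0.4}]
kept, flagged = split_by_confidence(bites)
print([b["id"] for b in kept], [b["id"] for b in flagged])  # ['b1'] ['b2']
```

Tracking these numbers per collection over time makes it easier to see whether a schema change or connector fix actually improved the catalog.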
Usage in agents and workflows
- Agents — Use lineage in agent responses so users see which document and run support each part of a grounded answer. See Grounding from catalog.
- Workflows — Pipeline runs that write to the catalog should attach run and document IDs so lineage is end-to-end. See Workflows overview.
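On the write side, a pipeline step can stamp run and document IDs onto each bite before it is indexed. The sketch below shows one possible shape for that step. The `attach_lineage` helper, the bite dictionaries, and the metadata keys are all assumptions for illustration. The actual write call to the catalog is left out because its API is not described on this page.

```python
def attach_lineage(
    bites: list[dict],
    *,
    source_document: str,
    run_id: str,
    schema_version: str,
    collection: str,
) -> list[dict]:
    """Return copies of the bites with lineage merged into their metadata.

    Existing metadata keys are preserved; the input bites are not mutated.
    """
    lineage = {
        "source_document": source_document,
        "run_id": run_id,
        "schema_version": schema_version,
        "collection": collection,
    }
    return [
        {**bite, "metadata": {**bite.get("metadata", {}), **lineage}}
        for bite in bites
    ]


# Made-up bites from a single extraction run.
raw_bites = [
    {"text": "Invoice total is 1,200 EUR.", "metadata": {"page": 1}},
    {"text": "Payment is due in 30 days."},
]
stamped = attach_lineage(
    raw_bites,
    source_document="s3://example-bucket/invoices/inv-001.pdf",
    run_id="run-42",
    schema_version="v3",
    collection="invoices",
)
print(stamped[0]["metadata"])
```

Attaching lineage in one place, just before the write, keeps every bite from the same run consistent and avoids gaps when tracing a search result back to its source.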
Next steps
- Collections — Configure collections and metadata.
- Extraction runs — Run metadata and confidence.
- Agents grounding — Use lineage in agent answers.