Vector Catalog overview

The Vector Catalog is Bundata’s managed store for vector-ready intelligence: embeddings and indexed smart bites from your extraction and enrichment pipelines. It powers Vector Search (semantic retrieval) and grounded answers for agents, with source lineage and metadata for filtering, citations, and audit. This page explains what the Vector Catalog is, when to use it, how it fits the rest of the platform, and how to get the most out of it.

What the Vector Catalog is

The Vector Catalog is a managed service that:

  • Stores smart bites — Chunks of document content plus metadata (source, date, document type, etc.) produced by context-aware extraction. Each bite can carry source lineage (document ID, run ID, schema version).
  • Stores embeddings — Vector representations of that content, generated during ingestion or supplied via API. Embeddings enable semantic similarity search.
  • Organizes by collections — You create collections (e.g. “contracts”, “invoices”, “policies”) and send ingestion output to a collection. Search and agents query one or more collections. See Collections.

The catalog sits at the center of your document intelligence layer: extraction and workflow orchestration write to it, and Vector Search and agents read from it.
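To make the three pieces above concrete, here is a minimal in-memory sketch of how a smart bite, its lineage, and a collection relate to each other. It is an illustration only, not the Bundata API: the class names (Lineage, SmartBite, Collection), the field names, and the sample values are all assumptions chosen to mirror the terms used on this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Lineage:
    """Hypothetical lineage record: where a bite came from."""
    document_id: str
    run_id: str
    schema_version: str


@dataclass
class SmartBite:
    """Hypothetical bite: chunk text, metadata, lineage, optional embedding."""
    text: str
    metadata: dict[str, str]
    lineage: Lineage
    # None until an embedding is generated at ingestion or supplied by the caller.
    embedding: Optional[list[float]] = None


@dataclass
class Collection:
    """Hypothetical named container for bites, e.g. "contracts"."""
    name: str
    bites: list[SmartBite] = field(default_factory=list)

    def add(self, bite: SmartBite) -> None:
        self.bites.append(bite)


# Sample data invented for illustration.
contracts = Collection(name="contracts")
contracts.add(
    SmartBite(
        text="Either party may terminate with 30 days written notice.",
        metadata={
            "source": "msa.pdf",
            "date": "2024-01-15",
            "document_type": "contract",
        },
        lineage=Lineage(document_id="doc-001", run_id="run-042", schema_version="v1"),
    )
)
print(contracts.name, len(contracts.bites), contracts.bites[0].lineage.document_id)
```

Keeping lineage as its own immutable record, separate from free-form metadata, is one way to make sure every bite can be traced back to a document and a run regardless of which metadata fields a given collection uses.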

When to use the Vector Catalog

  • RAG — Retrieve relevant chunks for an LLM so answers are based on your documents. The catalog is the retrieval backend. See Vector Search overview.
  • Agents — Ground agent responses with retrieved chunks and source lineage so answers are traceable. See Agents grounding.
  • Semantic search — Let users ask questions in natural language over contracts, policies, or operational docs. See Vector Search overview and Semantic retrieval.
  • Unified index — Instead of maintaining your own vector store and ETL, use the catalog so Bundata handles indexing, embedding, and source lineage in one place.
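The retrieval step behind the RAG and semantic search cases above can be sketched as a nearest-neighbor ranking over embeddings. The example below uses plain cosine similarity over a small in-memory list. It is a conceptual sketch, not a description of how the catalog ranks results: the function names, the dictionary layout, and the three-dimensional toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], bites: list[dict], k: int = 2) -> list[dict]:
    """Return the k bites whose embeddings are most similar to the query."""
    ranked = sorted(
        bites,
        key=lambda bite: cosine_similarity(query, bite["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings invented for illustration; real embeddings have many more dimensions.
bites = [
    {"text": "Termination requires 30 days notice.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Invoices are due within 45 days.", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Either party may end the agreement early.", "embedding": [0.8, 0.2, 0.1]},
]
query_embedding = [1.0, 0.0, 0.0]  # stands in for an embedded user question

results = top_k(query_embedding, bites, k=2)

# In a RAG setup, the retrieved text would be passed to the LLM as context.
context = "\n".join(bite["text"] for bite in results)
print(context)
```

With these toy vectors, the two termination-related bites rank above the invoice bite, which is the behavior a semantic search over contracts would be expected to show.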

If you need to keep vectors entirely in your own infrastructure, use destination connectors to write smart bites (and optionally embeddings) to your storage and build your own index. See Destination connectors.

How it fits the rest of Bundata

  • Extraction — Extraction runs produce smart bites. Workflows or API steps send that output to a catalog collection. See Extraction overview and Extraction runs.
  • Schemas — Collections can be associated with a schema so ingested content is validated and metadata is consistent. See Schema Studio overview and Collections.
  • Vector Search — You query the catalog by collection (and optional metadata filters) to get ranked smart bites for RAG and agents. See Vector Search overview.
  • Workflows — Workflow steps can write to the catalog on a schedule or when new documents arrive, so the catalog stays fresh. See Workflows overview and Triggers & scheduling.
  • Lineage and quality — The catalog preserves source lineage and can reflect extraction confidence. Use both for grounded answers, audit, and quality tuning. See Lineage & quality.
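The flow described in this list (extraction output is validated against a schema, written to a collection, then queried with metadata filters) can be sketched in a few lines. Everything below is hypothetical: the required-field set, the function names, and the sample bites are assumptions for illustration, and the real schema and filter behavior are defined by Schema Studio and Vector Search.

```python
# Hypothetical schema: metadata keys every bite in this collection must carry.
REQUIRED_METADATA = {"source", "date", "document_type"}


def validate_bite(bite: dict) -> list[str]:
    """Return a list of problems; an empty list means the bite passes."""
    problems = []
    missing = REQUIRED_METADATA - set(bite.get("metadata", {}))
    if missing:
        problems.append("missing metadata: " + ", ".join(sorted(missing)))
    if not bite.get("text", "").strip():
        problems.append("empty text")
    return problems


def ingest(collection: list[dict], bites: list[dict]) -> list[tuple[dict, list[str]]]:
    """Append valid bites to the collection; return rejected bites with reasons."""
    rejected = []
    for bite in bites:
        problems = validate_bite(bite)
        if problems:
            rejected.append((bite, problems))
        else:
            collection.append(bite)
    return rejected


def filter_by_metadata(collection: list[dict], **filters: str) -> list[dict]:
    """Keep only bites whose metadata matches every given key/value pair."""
    return [
        bite
        for bite in collection
        if all(bite["metadata"].get(key) == value for key, value in filters.items())
    ]


# Sample extraction output invented for illustration.
collection: list[dict] = []
incoming = [
    {
        "text": "Payment is due in 30 days.",
        "metadata": {"source": "inv-17.pdf", "date": "2024-03-01", "document_type": "invoice"},
    },
    {
        "text": "Late fees apply after 45 days.",
        "metadata": {"source": "inv-18.pdf"},  # incomplete on purpose
    },
]

rejected = ingest(collection, incoming)
invoices = filter_by_metadata(collection, document_type="invoice")
print(len(collection), len(rejected), len(invoices))
```

Rejecting incomplete bites at write time, rather than at query time, is what keeps metadata filters reliable: a filter on document_type only works if every stored bite actually has that field.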

Key concepts

  • Smart bites — The unit of storage: a chunk of text, metadata, and (optionally) an embedding. Produced by extraction; consumed by search and agents.
  • Collections — Named containers for bites and embeddings. One collection per use case or document type (e.g. contracts, invoices) keeps search and agents scoped. See Collections.
  • Source lineage — Traceability from a bite back to the source document and extraction run. Essential for citations and compliance. See Lineage & quality.
  • Freshness — Keep the catalog up to date with workflow orchestration so new and updated documents are searchable. Stale data leads to missing or outdated context. See Lineage & quality.
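As an illustration of why source lineage matters for citations, the sketch below turns the lineage carried by retrieved bites into numbered references. The dictionary layout, the function name build_citations, and the citation format are assumptions for this example. How a real application formats citations is a design choice outside the catalog.

```python
def build_citations(bites: list[dict]) -> tuple[list[str], list[str]]:
    """Tag each bite with a citation number and build one reference per document."""
    numbers: dict[str, int] = {}    # document_id -> citation number
    runs: dict[str, set[str]] = {}  # document_id -> run IDs seen for it
    cited_lines = []
    for bite in bites:
        lineage = bite["lineage"]
        doc_id = lineage["document_id"]
        if doc_id not in numbers:
            numbers[doc_id] = len(numbers) + 1
        runs.setdefault(doc_id, set()).add(lineage["run_id"])
        cited_lines.append(f'{bite["text"]} [{numbers[doc_id]}]')
    references = [
        f"[{number}] {doc_id} (runs: {', '.join(sorted(runs[doc_id]))})"
        for doc_id, number in numbers.items()
    ]
    return cited_lines, references


# Retrieved bites invented for illustration.
retrieved = [
    {"text": "Notice period is 30 days.", "lineage": {"document_id": "doc-001", "run_id": "run-042"}},
    {"text": "Invoices are due in 45 days.", "lineage": {"document_id": "doc-002", "run_id": "run-043"}},
    {"text": "Notice must be in writing.", "lineage": {"document_id": "doc-001", "run_id": "run-042"}},
]

cited_lines, references = build_citations(retrieved)
for line in cited_lines + references:
    print(line)
```

Bites from the same document share one citation number, so an answer built from several chunks of a single contract still points to one source entry.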

Common mistakes

  • One giant collection — Prefer multiple collections (or strong metadata) by document type and use case so search and agents don’t mix unrelated content.
  • No lineage — Ensure extraction and ingestion preserve document and run IDs so grounded answers can cite sources. See Lineage & quality.
  • Stale catalog — Schedule or trigger workflows so ingestion runs regularly. Monitor run history. See Workflows monitoring.
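Two of the mistakes above, missing lineage and a stale catalog, lend themselves to simple automated checks. The sketch below shows one possible shape for such checks over data you have already exported or fetched. The function names, the dictionary layout, and the seven-day threshold are assumptions for illustration, not catalog features.

```python
from datetime import datetime, timedelta, timezone

# Lineage keys assumed to be required for a bite to be citable.
REQUIRED_LINEAGE = ("document_id", "run_id")


def bites_missing_lineage(bites: list[dict]) -> list[dict]:
    """Return bites that lack a non-empty document ID or run ID."""
    return [
        bite
        for bite in bites
        if not all(bite.get("lineage", {}).get(key) for key in REQUIRED_LINEAGE)
    ]


def is_stale(last_ingested_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the most recent ingestion is older than the allowed age."""
    return now - last_ingested_at > max_age


# Sample data invented for illustration.
bites = [
    {"text": "Clause A", "lineage": {"document_id": "doc-001", "run_id": "run-042"}},
    {"text": "Clause B", "lineage": {"document_id": "doc-002"}},  # no run ID
    {"text": "Clause C"},                                          # no lineage at all
]
now = datetime(2024, 6, 10, tzinfo=timezone.utc)
last_run = datetime(2024, 6, 1, tzinfo=timezone.utc)

missing = bites_missing_lineage(bites)
stale = is_stale(last_run, now, max_age=timedelta(days=7))
print(len(missing), stale)
```

The right staleness threshold depends on how often source documents change. A collection of policies revised quarterly can tolerate a much longer gap than one fed by daily invoices.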

Next steps