Collections

A collection in the Vector Catalog is a named container for smart bites and their embeddings. You create collections to organize content by use case or schema (e.g. contracts, invoices, policy docs) and to scope Vector Search and agent grounding to the right document set.

What a collection contains

Indexed smart bites — Chunks of text plus metadata from your extraction runs. Each bite can include source lineage (document ID, run ID, schema version).
Embeddings — Vector representations of the content for semantic retrieval. Bundata generates embeddings during ingestion, or you can supply your own via the API where supported.
Metadata — Fields you define (source, date, document type, etc.) for filtering and display in search results and grounded answers.
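
To make the three parts concrete, here is a minimal sketch of how one indexed smart bite could be modeled in memory. The class and field names (SmartBite, SourceLineage, and so on) are illustrative assumptions, not the platform's actual data model, and the embedding values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SourceLineage:
    # Hypothetical field names mirroring the lineage items listed above.
    document_id: str
    run_id: str
    schema_version: str


@dataclass
class SmartBite:
    text: str                  # chunk of extracted text
    embedding: List[float]     # vector representation used for semantic retrieval
    lineage: SourceLineage     # where the chunk came from
    metadata: Dict[str, Any] = field(default_factory=dict)  # user-defined fields


bite = SmartBite(
    text="Either party may terminate with 30 days' written notice.",
    embedding=[0.12, -0.03, 0.44],  # toy 3-dimensional vector, for illustration only
    lineage=SourceLineage(document_id="doc-001", run_id="run-42", schema_version="v3"),
    metadata={"source": "upload", "document_type": "contract", "effective_date": "2024-01-15"},
)
```

Keeping lineage separate from free-form metadata is a design choice in this sketch: lineage fields are always present, while metadata varies by schema.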

Creating a collection

In the Platform, open the Vector Catalog and choose Create collection.
Name the collection (e.g. contracts-prod, invoices-2024).
Associate a schema (optional but recommended) so ingested documents are validated and metadata is consistent.
Configure indexing — Embedding model, chunk size, and metadata fields to index for filtering. See product-specific docs for options.
Connect a pipeline — Point extraction or workflow output to this collection so new smart bites are ingested automatically.
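
The choices made in the steps above can be captured as a small configuration object and checked before you create the collection. This is a sketch under stated assumptions: CollectionConfig, its fields, the default values, and the lowercase-hyphenated naming rule are all hypothetical and do not represent a Bundata API or a documented platform constraint.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed naming convention (lowercase words joined by hyphens), matching the
# examples contracts-prod and invoices-2024. Not a documented platform rule.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


@dataclass
class CollectionConfig:
    name: str
    schema_name: Optional[str] = None                 # optional but recommended
    embedding_model: str = "example-embedding-model"  # placeholder model name
    chunk_size: int = 512                             # placeholder default
    indexed_metadata_fields: List[str] = field(default_factory=list)
    pipeline: Optional[str] = None                    # extraction or workflow output


def validate(config: CollectionConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not NAME_PATTERN.match(config.name):
        problems.append("name should be lowercase words separated by hyphens")
    if config.schema_name is None:
        problems.append("no schema associated (optional but recommended)")
    if config.chunk_size <= 0:
        problems.append("chunk_size must be positive")
    if not config.indexed_metadata_fields:
        problems.append("no metadata fields indexed for filtering")
    if config.pipeline is None:
        problems.append("no pipeline connected; new smart bites will not be ingested automatically")
    return problems


good = CollectionConfig(
    name="contracts-prod",
    schema_name="contract-v3",
    indexed_metadata_fields=["source", "document_type", "effective_date"],
    pipeline="contracts-extraction",
)
incomplete = CollectionConfig(name="Contracts Prod")
```

Calling validate(good) returns an empty list, while validate(incomplete) reports the naming, schema, metadata, and pipeline gaps.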

When to use one vs. many collections

One collection per document type or use case — e.g. one for contracts, one for invoices. Simplifies filtering and keeps search and agents scoped.
Separate collections per environment — e.g. contracts-staging and contracts-prod so testing doesn’t affect production search.
Avoid one giant collection unless you always filter by metadata; smaller, focused collections often give clearer source lineage and better relevance.
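
One way to apply this guidance is to derive the collection name from document type and environment, then group incoming smart bites accordingly. The helper names and the dictionary layout below are assumptions for illustration only.

```python
from typing import Dict, List


def collection_name(document_type: str, environment: str) -> str:
    # Hypothetical convention: <document type>-<environment>, e.g. contracts-prod.
    return f"{document_type}-{environment}"


def route(bites: List[dict], environment: str) -> Dict[str, List[dict]]:
    """Group bites by target collection so each collection stays focused."""
    grouped: Dict[str, List[dict]] = {}
    for bite in bites:
        name = collection_name(bite["document_type"], environment)
        grouped.setdefault(name, []).append(bite)
    return grouped


bites = [
    {"document_id": "doc-001", "document_type": "contracts"},
    {"document_id": "doc-002", "document_type": "invoices"},
    {"document_id": "doc-003", "document_type": "contracts"},
]
staging = route(bites, "staging")
```

Here staging has two keys, contracts-staging and invoices-staging, so a test run never writes into a production collection.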

Metadata and filtering

Index metadata that you will filter on in Vector Search (e.g. source, document_type, effective_date). See Vector Search: Filtering.
Source lineage — Preserve document and run identifiers so grounded answers can cite the exact source.
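
The sketch below shows why both points matter, using plain in-memory records rather than the Vector Search API (whose actual filter syntax is described in Vector Search: Filtering). The record layout, function names, and citation format are assumptions.

```python
from datetime import date
from typing import List

records = [
    {"document_id": "doc-001", "run_id": "run-42",
     "document_type": "contract", "effective_date": date(2024, 1, 15)},
    {"document_id": "doc-002", "run_id": "run-42",
     "document_type": "invoice", "effective_date": date(2024, 3, 1)},
    {"document_id": "doc-003", "run_id": "run-43",
     "document_type": "contract", "effective_date": date(2023, 6, 30)},
]


def filter_records(items: List[dict], document_type: str, effective_from: date) -> List[dict]:
    """Keep records of one document type that took effect on or after a date."""
    return [
        r for r in items
        if r["document_type"] == document_type and r["effective_date"] >= effective_from
    ]


def cite(record: dict) -> str:
    # Hypothetical citation format built from the preserved lineage identifiers.
    return f'{record["document_id"]}@{record["run_id"]}'


hits = filter_records(records, "contract", date(2024, 1, 1))
citations = [cite(r) for r in hits]
```

Filtering is only possible on fields that were stored with each record, and the citation can only be as precise as the identifiers that were preserved.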

Common pitfalls

No schema or inconsistent metadata — Results become hard to filter and explain. Align the collection schema with the extraction schema.
Too few or too many collections — Strike a balance between “one big bucket” and “a collection per document”; use document type and environment as a guide.
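
Inconsistent metadata is easy to detect before ingestion. The following sketch reports which records are missing fields you intend to filter on; the function name, record layout, and required-field list are assumptions, not a built-in platform check.

```python
from typing import Dict, Iterable, List


def find_metadata_gaps(records: Iterable[dict], required_fields: List[str]) -> Dict[str, List[str]]:
    """Map each document ID to the required metadata fields it lacks."""
    gaps: Dict[str, List[str]] = {}
    for record in records:
        metadata = record.get("metadata", {})
        # Treat absent, None, and empty-string values as missing.
        missing = sorted(f for f in required_fields if metadata.get(f) in (None, ""))
        if missing:
            gaps[record["document_id"]] = missing
    return gaps


sample = [
    {"document_id": "doc-001",
     "metadata": {"source": "upload", "document_type": "contract", "effective_date": "2024-01-15"}},
    {"document_id": "doc-002",
     "metadata": {"source": "", "document_type": "invoice"}},
]
gaps = find_metadata_gaps(sample, ["source", "document_type", "effective_date"])
```

In this example only doc-002 is reported, with effective_date and source listed as missing.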

Next steps

Vector Catalog overview — Concepts and platform fit.
Lineage & quality — Source lineage and freshness.
Vector Search overview — Querying collections.