What is Bundata?

Bundata is a document intelligence platform that turns unstructured documents—PDFs, emails, web pages, Word files, and more—into smart bites: schema-aware, vector-ready units optimized for AI. The goal is to maximize accuracy for retrieval-augmented generation (RAG), agents, search, and automated workflows.

When to use Bundata

Use Bundata when you need to:

Power RAG or semantic search — Ingest and chunk documents, enrich with metadata, and index in the Vector Catalog for semantic retrieval.
Build agents — Give agents grounded, traceable context from your document base with source lineage and extraction confidence.
Automate document workflows — Run extraction and enrichment in pipelines, with triggers, scheduling, and downstream integrations.
Replace DIY pipelines — Avoid maintaining custom ETL for parsing, chunking, and embedding; use production-ready connectors and schemas instead.
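
To make the semantic retrieval use case concrete, here is a minimal sketch of ranking smart bites by cosine similarity to a query vector. It does not use the Bundata API: the bite dictionaries, the `cosine` and `top_k` helpers, and the two-dimensional vectors are all assumptions made for illustration, standing in for whatever the Vector Catalog or your own vector store does internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vector, bites, k=2):
    """Return the k bites whose vectors are most similar to the query vector."""
    ranked = sorted(bites, key=lambda b: cosine(query_vector, b["vector"]), reverse=True)
    return ranked[:k]

# Hypothetical smart bites with toy 2-dimensional embeddings.
bites = [
    {"id": "a", "text": "Payment terms are net 30.", "vector": [1.0, 0.0]},
    {"id": "b", "text": "Office opening hours.", "vector": [0.0, 1.0]},
    {"id": "c", "text": "Late payment fees apply.", "vector": [0.9, 0.1]},
]

results = top_k([1.0, 0.0], bites, k=2)
```

In a RAG setup, the text of the returned bites would be passed to the model as context, along with their metadata for citation.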

How it fits together

Bundata sits between your raw documents and your AI applications. You connect sources (S3, SharePoint, Confluence, etc.), define schemas in Schema Studio, and run extraction. Output flows to the Vector Catalog for search and agents, or to your own storage and APIs. The full lifecycle is: ingest → transform → enrich → chunk → embed → deliver.
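
The lifecycle above can be pictured as a chain of small stages, each taking the output of the previous one. The sketch below is only a conceptual model in plain Python, not Bundata's implementation or SDK: every function name, the hard-coded sample text, the fixed-size chunking, and the toy two-number "embedding" are assumptions chosen to keep the example self-contained.

```python
def ingest(source):
    # Stand-in for reading a document through a connector (S3, SharePoint, ...).
    return {"source": source, "raw": "  Invoice 42\n\nTotal due: 100 USD  "}

def transform(doc):
    # Collapse runs of whitespace so later stages see clean text.
    return {**doc, "text": " ".join(doc["raw"].split())}

def enrich(doc):
    # Attach metadata that travels with every chunk.
    return {**doc, "metadata": {"source": doc["source"], "chars": len(doc["text"])}}

def chunk(doc, size=16):
    # Naive fixed-size chunking; real chunkers usually respect document structure.
    text = doc["text"]
    return {**doc, "chunks": [text[i:i + size] for i in range(0, len(text), size)]}

def embed(doc):
    # Toy vector: [chunk length, digit count]. A real embedding model goes here.
    vectors = [[float(len(c)), float(sum(ch.isdigit() for ch in c))] for c in doc["chunks"]]
    return {**doc, "vectors": vectors}

def deliver(doc):
    # Emit one record per chunk, ready to index or store.
    return [
        {"text": c, "vector": v, "metadata": doc["metadata"]}
        for c, v in zip(doc["chunks"], doc["vectors"])
    ]

def run_pipeline(source):
    doc = ingest(source)
    for stage in (transform, enrich, chunk, embed):
        doc = stage(doc)
    return deliver(doc)

bites = run_pipeline("s3://example-bucket/invoice.pdf")
```

Keeping each stage as a separate function mirrors the idea that stages can be configured or swapped independently, for example changing the chunking strategy without touching enrichment.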

Key concepts

Smart bites — Chunks of document content plus metadata and (optionally) embeddings, ready for vector search and RAG.
Vector Catalog — Bundata’s managed store for embeddings and indexed smart bites; supports semantic search and agent grounding.
Schema-aware extraction — Extraction respects your schema so output is consistent and predictable for downstream systems.
Source lineage — Trace each result back to the original document and the extraction run that produced it, for audit and debugging.
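
One way to picture how these concepts relate is as a small data model. The sketch below is hypothetical and does not reflect Bundata's actual output format: the `SmartBite` and `Lineage` classes, their field names, the `conforms` helper, and the invoice schema are all assumptions used to show content, metadata, an optional embedding, and lineage living together in one record.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Lineage:
    document_uri: str  # where the original document lives
    run_id: str        # which extraction run produced this bite

@dataclass
class SmartBite:
    text: str
    metadata: Dict[str, object]
    lineage: Lineage
    embedding: Optional[List[float]] = None  # embeddings are optional

    def is_vector_ready(self) -> bool:
        return self.embedding is not None

def conforms(metadata: Dict[str, object], schema: Dict[str, type]) -> bool:
    """True if metadata has exactly the schema's fields, each with the expected type."""
    return set(metadata) == set(schema) and all(
        isinstance(metadata[name], expected) for name, expected in schema.items()
    )

# Hypothetical schema: field name -> expected Python type.
INVOICE_SCHEMA = {"invoice_number": str, "total": float}

bite = SmartBite(
    text="Invoice INV-001, total due 100.00 USD",
    metadata={"invoice_number": "INV-001", "total": 100.0},
    lineage=Lineage(document_uri="s3://example-bucket/invoice.pdf", run_id="run-0001"),
)
```

Here `conforms(bite.metadata, INVOICE_SCHEMA)` checks the consistency that schema-aware extraction is meant to provide, and `bite.lineage` carries what is needed to trace a search hit back to its source.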

Common mistakes

Skipping schema design — Without a clear schema, extraction output can be inconsistent. Start with a small schema and iterate.
Ignoring extraction confidence — Low-confidence fields may need schema or source improvements; use confidence scores to tune quality.
Treating Bundata as a generic vector DB — Bundata is a document intelligence layer that prepares and enriches content. Pair it with your own vector store, or use the Vector Catalog.
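
As an illustration of putting extraction confidence to use, the sketch below separates fields that can be accepted automatically from fields that should be reviewed. The input shape (a value and a confidence per field), the `split_by_confidence` helper, and the 0.8 threshold are assumptions for this example, not Bundata's documented response format or a recommended setting.

```python
def split_by_confidence(fields, threshold=0.8):
    """Split extracted fields into accepted values and names needing review."""
    accepted = {
        name: f["value"] for name, f in fields.items() if f["confidence"] >= threshold
    }
    needs_review = sorted(
        name for name, f in fields.items() if f["confidence"] < threshold
    )
    return accepted, needs_review

# Hypothetical extraction result: each field has a value and a confidence in [0, 1].
extracted = {
    "invoice_number": {"value": "INV-001", "confidence": 0.97},
    "total": {"value": 100.0, "confidence": 0.91},
    "due_date": {"value": "2024-13-01", "confidence": 0.42},
}

accepted, needs_review = split_by_confidence(extracted)
```

Fields that repeatedly land in the review list are good candidates for a schema change or a better source document.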

Next steps

Quickstart — Run your first extraction.
Core concepts — Accounts, workspaces, connectors, and pipelines.
Extraction overview — How extraction runs and what you get back.