Glossary
Definitions for key terms used in Bundata documentation and the platform. The definitions match the product and website, so terminology stays consistent when you move from the marketing site to the docs.
Core product terms
Bundata — A document intelligence platform that turns unstructured documents into structured, usable layers for AI systems. It helps organizations extract context-aware fields, define output schemas, generate smart bites, and use those outputs across search, agents, and workflows on one governed foundation.
Document intelligence layer — The platform layer that turns raw documents into structured, usable data for AI. It includes ingestion, context-aware extraction, schema-aware extraction, vectorization, and delivery to the Vector Catalog, search, agents, and workflows.
Unstructured documents — Input content that is not rigidly structured (e.g. PDFs, emails, contracts, invoices, policies, procurement files, operational records). Bundata processes these into smart bites and vector-ready intelligence.
Schema-aware extraction — Extraction that pulls information from documents according to a defined output structure (schema). Instead of generic text, Bundata identifies the right fields from context, document layout, and schema logic so outputs are predictable, validated, and ready for downstream systems.
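As an illustration of what "predictable, validated" output can mean, the sketch below checks an extracted record against a small schema. It is a minimal sketch in plain Python: `FieldSpec`, `INVOICE_SCHEMA`, and `validate` are hypothetical names invented for this example and do not represent Bundata's schema format or API.

```python
from dataclasses import dataclass


# Hypothetical schema representation (an assumption, not Bundata's format).
@dataclass(frozen=True)
class FieldSpec:
    name: str
    type_: type
    required: bool = True


# Hypothetical schema for an invoice document.
INVOICE_SCHEMA = [
    FieldSpec("invoice_number", str),
    FieldSpec("total_amount", float),
    FieldSpec("due_date", str, required=False),
]


def validate(record: dict, schema: list[FieldSpec]) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for spec in schema:
        if spec.name not in record:
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        if not isinstance(record[spec.name], spec.type_):
            problems.append(f"wrong type for {spec.name}")
    # Fields outside the schema are reported so output stays predictable.
    allowed = {spec.name for spec in schema}
    for key in record:
        if key not in allowed:
            problems.append(f"unexpected field: {key}")
    return problems
```

A record that passes this kind of check has exactly the fields a downstream system expects, which is the property the definition above describes.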
Context-aware extraction — Extraction that uses document context and layout (e.g. headings, tables, sections) to produce structured output and smart bites. Complements schema-aware extraction by applying the schema with awareness of how the document is organized.
Smart bites — Bundata’s context-rich units of extracted intelligence. Each is a structured field or semantic unit from a document that carries its meaning, its source context, and information about how it can be used downstream. Smart bites are schema-aware, vector-ready, searchable, and linkable; extraction produces them for the Vector Catalog, retrieval-augmented generation (RAG), and agents.
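To make the idea of a smart bite concrete, here is one way such a unit could be modeled as a data structure. This is a sketch under assumptions: the `SmartBite` class, its field names, and `embedding_text` are hypothetical and are not taken from Bundata's actual data model.

```python
from dataclasses import dataclass, field


# Hypothetical model of a smart bite; all field names are assumptions.
@dataclass
class SmartBite:
    field_name: str       # the schema field this value fills
    value: str            # the extracted value
    source_context: str   # e.g. the heading or table the value came from
    document_id: str      # the document the value was extracted from
    links: list[str] = field(default_factory=list)  # ids of related bites

    def embedding_text(self) -> str:
        """Build the text that could be passed to an embedding model."""
        return f"{self.field_name}: {self.value} ({self.source_context})"


bite = SmartBite("payment_terms", "Net 30", "Section 4: Payment", "doc-001")
```

Keeping the source context next to the value is what lets the same unit serve search, linking, and citation without a second lookup.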
Vector-ready intelligence — Structured output prepared for retrieval, semantic search, and AI workflows. Bundata turns extracted fields into schema-shaped units that can be embedded, indexed, linked, and used across agents, search systems, and operational pipelines.
Catalog and search
Vector Catalog — Bundata’s managed store for embeddings and indexed smart bites. It supports Vector Search (semantic retrieval) and grounded answers for agents, with source lineage and metadata for filtering and citations.
Vector Search — Semantic retrieval over the Vector Catalog using natural language or embedding queries. Returns ranked smart bites for RAG, agent grounding, and search over your document base.
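The ranking step in semantic retrieval can be illustrated with a brute-force cosine similarity search over a tiny in-memory catalog. This is a generic sketch of the technique, not a description of how the Vector Catalog indexes or scores results; `cosine`, `search`, and the toy vectors are assumptions made for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec: list[float], catalog: dict, top_k: int = 2) -> list[tuple]:
    """Return the top_k (bite_id, score) pairs, best match first."""
    scored = [(bite_id, cosine(query_vec, vec)) for bite_id, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy catalog: hypothetical smart bite ids mapped to 2-dimensional embeddings.
CATALOG = {"bite-a": [1.0, 0.0], "bite-b": [0.0, 1.0], "bite-c": [1.0, 1.0]}
results = search([1.0, 0.0], CATALOG)
```

Production systems typically replace the linear scan with an approximate nearest-neighbor index, but the ranked output has the same shape.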
Grounded answers — Agent or RAG responses that cite specific retrieved chunks and source documents. Grounding uses source lineage so users can verify answers and auditors can trace decisions.
Quality and traceability
Source lineage — Traceability from a result (e.g. a smart bite or search hit) back to the original document and extraction run. Essential for citations, compliance, and debugging.
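The following sketch shows how lineage metadata might be attached to an answer as a citation. The `Lineage` record and the `cite` helper are hypothetical; in particular, the `page` field is an assumption added for illustration, since the definition above only names the original document and the extraction run.

```python
from dataclasses import dataclass


# Hypothetical lineage record; field names are assumptions.
@dataclass(frozen=True)
class Lineage:
    document_id: str
    extraction_run_id: str
    page: int  # assumed locator within the document


def cite(answer: str, lineage: Lineage) -> str:
    """Append a human-readable source reference to an answer."""
    return (
        f"{answer} [source: {lineage.document_id}, "
        f"page {lineage.page}, run {lineage.extraction_run_id}]"
    )


cited = cite("Payment is due in 30 days.", Lineage("doc-001", "run-42", 3))
```

With the document and run identifiers in the citation, a reviewer can trace the answer back to both the source file and the extraction that produced it.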
Extraction confidence — A measure of how reliable an extracted field or run is. Used for quality tuning and troubleshooting; low confidence may indicate schema mismatch or poor source quality.
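One common way to act on confidence values is to flag fields that fall below a threshold for review. The sketch below assumes confidence is reported per field as a number between 0 and 1 and uses an arbitrary threshold of 0.8; both are assumptions for illustration, and `flag_low_confidence` is a hypothetical helper, not a Bundata function.

```python
def flag_low_confidence(confidences: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the names of fields whose confidence is below the threshold, sorted."""
    return sorted(name for name, score in confidences.items() if score < threshold)


# Hypothetical per-field confidence values from one extraction run.
run_confidences = {"invoice_number": 0.97, "total_amount": 0.55, "due_date": 0.79}
needs_review = flag_low_confidence(run_confidences)
```

Fields that are flagged repeatedly across runs are a reasonable starting point when checking for schema mismatch or poor source quality.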
Schema and workflow
Schema — The definition of structure and fields for extraction output. Designed in Schema Studio or via API; each extraction run uses a specific schema (and optionally a specific schema version) so output is consistent and schema-aware.
Workflow — An orchestrated pipeline of steps (e.g. ingest, extract, validate, deliver) that can run on a schedule or in response to a trigger. Workflows implement workflow orchestration for the document intelligence layer.
Workflow orchestration — Running ingestion, extraction, validation, and delivery as a coordinated pipeline so document processing is repeatable and production-ready. Bundata Workflows provide this for the platform.
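The ingest, extract, validate, deliver sequence described for workflows can be sketched as a list of functions run in order. This is a toy illustration in plain Python: every function name, the `"key: value"` parsing rule, and the in-memory `DELIVERED` sink are assumptions, and none of it reflects the Bundata Workflows API.

```python
def ingest(state: dict) -> dict:
    # Stand-in for reading a document; the raw text is already in the state.
    return {**state, "text": state["raw"].strip()}


def extract(state: dict) -> dict:
    # Toy extraction: parse "key: value" lines into fields.
    fields = {}
    for line in state["text"].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return {**state, "fields": fields}


def validate(state: dict) -> dict:
    # Stop the pipeline if a required field is missing.
    if "invoice_number" not in state["fields"]:
        raise ValueError("missing invoice_number")
    return state


DELIVERED: list[dict] = []  # hypothetical downstream sink


def deliver(state: dict) -> dict:
    DELIVERED.append(state["fields"])
    return state


def run_workflow(steps, state: dict):
    """Run each step in order and record the names of the steps that completed."""
    completed = []
    for step in steps:
        state = step(state)
        completed.append(step.__name__)
    return state, completed


STEPS = [ingest, extract, validate, deliver]
final_state, completed = run_workflow(STEPS, {"raw": " invoice_number: INV-7\ntotal: 42 \n"})
```

Because each step takes and returns the same state shape, the same step list can be rerun on every document, which is what makes the processing repeatable.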