Extraction overview

Bundata extraction turns unstructured documents into structured, schema-aware output. It applies context-aware parsing and optional enrichment (metadata, NER, image and table descriptions) to produce smart bites ready for the Vector Catalog, RAG, and agents.

What extraction does

The extraction pipeline has four main stages:

Partition

Break the document into elements: titles, paragraphs, tables, images. Layout-aware parsing preserves structure so that tables and sections are correctly identified.
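As a rough sketch of what partitioning produces (the `Element` record and the toy heuristic below are illustrative assumptions, not Bundata's actual output shape), a partitioner turns raw lines into a list of typed elements:

```python
from dataclasses import dataclass

# Hypothetical element record; Bundata's actual partition output may differ.
@dataclass
class Element:
    kind: str   # "title", "paragraph", "table", or "image"
    text: str
    page: int

def partition(lines):
    """Toy partitioner: treats short ALL-CAPS lines as titles and
    everything else as paragraphs. Real layout-aware parsing also
    detects tables and images from visual structure."""
    elements = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        kind = "title" if stripped.isupper() and len(stripped) < 60 else "paragraph"
        elements.append(Element(kind=kind, text=stripped, page=1))
    return elements

doc = ["QUARTERLY REPORT", "Revenue grew 12% year over year."]
els = partition(doc)
```

The point is the output shape: downstream stages operate on typed elements, not raw text.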

Extract

Pull fields and metadata according to your schema. Schema-aware extraction ensures output matches the structure you defined in Schema Studio.
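To illustrate what "schema-aware" means in practice, here is a minimal validation sketch; the field names and the schema dictionary format are assumptions for illustration, not the Schema Studio format:

```python
# Hypothetical schema: each field declares a type and whether it is required.
schema = {
    "invoice_number": {"type": str, "required": True},
    "total": {"type": float, "required": True},
    "po_number": {"type": str, "required": False},
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, spec in schema.items():
        if field not in record:
            if spec["required"]:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], spec["type"]):
            problems.append(f"wrong type for {field}")
    return problems

record = {"invoice_number": "INV-1042", "total": 249.5}
problems = validate(record, schema)
```

Extraction output that fails checks like these points at either a schema gap or a source-quality problem.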

Enrich (optional)

Add named entities (NER), image descriptions, table descriptions, and table-to-HTML. Enrichment improves retrieval and grounding for RAG and agents.
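A toy version of the NER step, to show the shape of enriched output (a real enrichment pass would use a trained model; the entity types and element shape here are assumptions):

```python
import re

# Regex stand-ins for a real NER model.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MONEY = re.compile(r"\$\d+(?:\.\d{2})?")

def enrich(element):
    """Attach named entities found in the element's text."""
    entities = (
        [{"type": "DATE", "text": m} for m in DATE.findall(element["text"])]
        + [{"type": "MONEY", "text": m} for m in MONEY.findall(element["text"])]
    )
    return {**element, "entities": entities}

el = enrich({"text": "Paid $12.50 on 2024-03-01."})
```

Entities attached this way become extra signals for retrieval and grounding rather than free text buried in a paragraph.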

Chunk (optional)

Split content into chunks for embedding and retrieval. Chunking strategies affect how smart bites are indexed in the Vector Catalog.
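One common strategy, sketched minimally (fixed-size character chunks with overlap; the size and overlap defaults are arbitrary, and Bundata may offer other strategies):

```python
def chunk(text, size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap so
    that context spanning a boundary appears in both neighbors."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

parts = chunk("a" * 500, size=200, overlap=50)
```

Smaller chunks give more precise retrieval hits; larger chunks preserve more context per hit. The overlap keeps sentences that straddle a boundary retrievable.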

Output is normalized (e.g., as canonical JSON) so downstream systems receive a consistent structure. Each output item includes source lineage and, where applicable, an extraction confidence score.
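An illustrative output item (the exact field names are an assumption; the point is that fields, lineage, and confidence travel together):

```python
import json

# Hypothetical canonical-JSON item; lineage points back to the source document.
item = {
    "fields": {"invoice_number": "INV-1042", "total": 249.5},
    "lineage": {"source": "invoices/2024/INV-1042.pdf", "page": 1},
    "confidence": {"invoice_number": 0.98, "total": 0.91},
}
print(json.dumps(item, indent=2))
```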

When to use extraction

  • You need structured data from PDFs, DOCX, or other formats for RAG, search, or agents.
  • You want source lineage and extraction confidence for audit and quality.
  • You are building document workflows that trigger on new files or run on a schedule.

How it fits with the rest of Bundata

  • Schema Studio — Define the schema that extraction uses.
  • Vector Catalog — Ingestion pipelines often send extraction output to the catalog for search.
  • Workflows — Orchestrate extraction, validation, and delivery.
  • Agents — Use extracted content and catalog results to ground answers.

Extraction runs

An extraction run is one execution of the pipeline on a set of documents. You can run extraction on demand (UI or API) or via a scheduled workflow. Results include the extracted fields, metadata, and confidence scores where applicable.
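When triggering runs via the API, clients typically poll for a terminal status. The status names and polling shape below are assumptions, not Bundata's documented API; the stub stands in for the real status call:

```python
import itertools

def poll_until_done(get_status, max_polls=10):
    """Poll a run's status until it reaches a terminal state.

    `get_status` is any callable returning the current status string;
    the terminal names here ("succeeded", "failed") are illustrative.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
    return "timed_out"

# Stub standing in for an API call: queued -> running -> succeeded.
statuses = itertools.chain(["queued", "running", "succeeded"],
                           itertools.repeat("succeeded"))
result = poll_until_done(lambda: next(statuses))
```

In a real integration you would add a sleep between polls and respect any rate limits the API imposes.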

Common mistakes

  • Schema too loose or too strict — Start with a small set of fields and expand; use optional fields where appropriate.
  • Ignoring low-confidence fields — Use confidence to find bad sources or schema gaps. See Troubleshooting low-confidence extraction.
  • Skipping enrichment — For RAG and agents, metadata and NER improve retrieval and grounding.
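One way to act on confidence, sketched under the assumed output shape above (the 0.7 threshold is arbitrary; tune it per schema and use case):

```python
LOW_CONFIDENCE = 0.7  # hypothetical threshold, not a Bundata default

def flag_low_confidence(item, threshold=LOW_CONFIDENCE):
    """Return field names whose extraction confidence falls below the
    threshold, so they can be routed to review or re-extraction."""
    return sorted(
        field for field, score in item.get("confidence", {}).items()
        if score < threshold
    )

item = {
    "fields": {"total": 249.5, "po_number": "PO-77"},
    "confidence": {"total": 0.92, "po_number": 0.41},
}
flagged = flag_low_confidence(item)
```

Fields that are flagged repeatedly across runs usually indicate a bad source scan or a schema field that is too strict for the documents you feed it.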

Troubleshooting

  • Low extraction confidence — Check schema fit, source quality, and optional enrichment settings. See Best practices and troubleshooting low-confidence extraction.
  • Missing or wrong fields — Verify schema design (required vs optional, field types) and that the document type is supported. See Supported file types.
  • Slow or failed runs — Check Extraction runs for status, retries, and limits.

Next steps