Extraction & Enrichment

Beyond partitioning and chunking, Bundata enriches your documents so that each chunk (a "smart bite") carries the context that RAG, agents, and search need. Enrichment adds structured metadata, entities, and descriptions of images and tables, which improve retrieval quality and the accuracy of downstream answers.

Metadata extraction

Bundata automatically captures metadata and attaches it to document elements and chunks, so every piece of content can be traced, filtered, and ranked by where it came from:

Source details — Origin file, path, connector, and timestamps.
Document structure — Page numbers, section headings, and element types.
Custom metadata — Any fields you define in your schema or pipeline.

Metadata is especially useful for filtering, faceted search, and giving LLMs and agents clear provenance.
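
As a rough illustration of metadata-based filtering, the sketch below attaches source, structure, and custom fields to chunks and then selects by them. The `Chunk` class, the `filter_chunks` helper, and every field name (`source`, `page`, `element_type`, `department`) are assumptions made for this example; they are not Bundata's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Hypothetical chunk record (illustrative only, not Bundata's schema)."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_chunks(chunks: list[Chunk], **criteria: Any) -> list[Chunk]:
    """Keep chunks whose metadata equals every key/value pair in criteria."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in criteria.items())
    ]


sample_chunks = [
    Chunk("Quarterly results",
          {"source": "reports/q3.pdf", "page": 1,
           "element_type": "Title", "department": "finance"}),
    Chunk("Revenue grew 12% year over year.",
          {"source": "reports/q3.pdf", "page": 4,
           "element_type": "NarrativeText", "department": "finance"}),
    Chunk("Install the agent on each host.",
          {"source": "docs/setup.docx", "page": 2,
           "element_type": "NarrativeText", "department": "it"}),
]

# Combine a structural field with a custom field, as a faceted search would.
finance_text = filter_chunks(
    sample_chunks, department="finance", element_type="NarrativeText"
)
```

Because each result keeps its `source` and `page`, an LLM or agent that receives it can cite where the text came from.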

Named Entity Recognition (NER)

Identify and classify key entities in your text:

People — Names, roles, and related mentions.
Organizations — Companies, institutions, and groups.
Locations — Addresses, regions, and places.
Dates and times — When events occurred or deadlines apply.

NER is critical for building GraphRAG, knowledge graphs, and structured search. Entities can be stored as structured fields alongside your chunks so agents and search can use them precisely.
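
One way to picture entities stored "alongside your chunks" is an inverted index from entity to chunk. In the sketch below, the entities are hand-written stand-ins for NER output, and the `Entity`, `EnrichedChunk`, and `build_entity_index` names and the label strings are hypothetical, not part of any Bundata interface.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # assumed label set, e.g. "PERSON", "ORG", "LOCATION", "DATE"


@dataclass
class EnrichedChunk:
    text: str
    entities: list[Entity] = field(default_factory=list)


def build_entity_index(chunks: list[EnrichedChunk]) -> dict[tuple[str, str], list[int]]:
    """Map (label, lowercased entity text) to the positions of chunks that mention it."""
    index: dict[tuple[str, str], list[int]] = defaultdict(list)
    for position, chunk in enumerate(chunks):
        for entity in chunk.entities:
            index[(entity.label, entity.text.casefold())].append(position)
    return dict(index)


# Entities here are written by hand; a real pipeline would get them from NER.
ner_chunks = [
    EnrichedChunk("Acme Corp signed the lease in Berlin.",
                  [Entity("Acme Corp", "ORG"), Entity("Berlin", "LOCATION")]),
    EnrichedChunk("The deadline is 1 March 2025.",
                  [Entity("1 March 2025", "DATE")]),
    EnrichedChunk("Jane Doe joined Acme Corp as CFO.",
                  [Entity("Jane Doe", "PERSON"), Entity("Acme Corp", "ORG")]),
]

entity_index = build_entity_index(ner_chunks)
acme_positions = entity_index[("ORG", "acme corp")]
```

A lookup such as `entity_index[("ORG", "acme corp")]` returns exact matches on the entity field rather than relying on text similarity, which is the kind of precise access a knowledge graph or structured search needs.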

Image description

Generate detailed textual descriptions for images embedded in your documents. This makes visual content:

Accessible — Descriptions can be used for search and screen readers.
Searchable — Text descriptions are indexed and retrieved like other content.
Usable by LLMs — Models receive a text summary of the image instead of, or in addition to, the raw asset.

Use image descriptions when your documents contain figures, diagrams, or photos that matter for reasoning and retrieval.
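
The sketch below shows, under simplified assumptions, how a description lets an image take part in plain text search. The element dictionaries, the `"Image"` type name, the `description` key, and both helper functions are invented for illustration, and the description string is written by hand rather than generated by a model.

```python
def indexable_text(element: dict) -> str:
    """Return the text a search index would see for this element."""
    if element["type"] == "Image":
        # An image contributes its description; without one it is invisible to text search.
        return element.get("description", "")
    return element["text"]


def search_elements(elements: list[dict], term: str) -> list[dict]:
    """Case-insensitive substring search over each element's indexable text."""
    needle = term.casefold()
    return [e for e in elements if needle in indexable_text(e).casefold()]


doc_elements = [
    {"type": "NarrativeText", "text": "Figure 2 shows the deployment topology."},
    {"type": "Image", "path": "fig2.png",
     "description": "Diagram of three services behind a load balancer."},
    {"type": "Image", "path": "logo.png"},  # no description available
]

image_hits = search_elements(doc_elements, "load balancer")
```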

Table description

Automatically summarize the contents and structure of tables. Tables are often dense and hard for models to interpret as raw markup. Descriptions help by:

Translating table structure and key values into prose or structured summaries.
Making table content searchable and retrievable as context for RAG.

Table description is useful for financial reports, specifications, and any document where tables carry critical information.
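
To make the idea of "translating table structure and key values into prose" concrete, here is a deliberately simple rule-based sketch. It is a stand-in only: the `describe_table` function and its `max_rows` parameter are hypothetical, and nothing here reflects how Bundata actually generates descriptions.

```python
def describe_table(header: list[str], rows: list[list[str]], max_rows: int = 2) -> str:
    """Build a short prose summary: table shape first, then the first few rows."""
    parts = [
        f"Table with {len(rows)} rows and {len(header)} columns "
        f"({', '.join(header)})."
    ]
    for row in rows[:max_rows]:
        cells = ", ".join(f"{name}: {value}" for name, value in zip(header, row))
        parts.append(f"Row: {cells}.")
    return " ".join(parts)


revenue_summary = describe_table(
    ["Quarter", "Revenue"],
    [["Q1", "$1.2M"], ["Q2", "$1.5M"], ["Q3", "$1.4M"]],
)
```

The resulting string can be embedded and retrieved like any other chunk text, which is what makes the table's content reachable from a natural-language query.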

Table to HTML

Convert tables into clean, standards-compliant HTML that preserves their structure. This allows:

Accurate interpretation — LLMs can reason over well-formed table markup.
Consistent formatting — Same table structure across different source formats (PDF, DOCX, etc.).
Downstream use — HTML tables can be rendered in UIs or passed to tools that expect structured data.

Use table-to-HTML when downstream steps need the table's rows, columns, and headers intact rather than a prose summary.
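
A minimal sketch of the conversion, assuming the table has already been extracted as a header row plus data rows. The `table_to_html` function is written for this example and is not Bundata's implementation; it uses only `html.escape` from the Python standard library so that cell text cannot break the markup.

```python
from html import escape


def table_to_html(header: list[str], rows: list[list[str]]) -> str:
    """Render a header row and data rows as a well-formed HTML table."""
    head_cells = "".join(f"<th>{escape(cell)}</th>" for cell in header)
    body_rows = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        f"<table><thead><tr>{head_cells}</tr></thead>"
        f"<tbody>{body_rows}</tbody></table>"
    )


spec_html = table_to_html(["Item", "Note"], [["A&B", "<ok>"]])
```

The same function produces identical markup whether the rows came from a PDF or a DOCX file, which is the consistency property described above.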

How enrichment fits in the pipeline

A typical flow:

Partition — Extract document elements from the raw file.
Enrich — Add metadata, run NER, generate image and table descriptions, convert tables to HTML as needed.
Chunk — Split enriched content into smart bites (see Partitioning & Chunking).
Embed and index — Generate embeddings and store in your vector store or Vector Catalog.

Enrichment runs before or in tandem with chunking so that each chunk carries the right metadata and entity context.
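
The ordering can be sketched with toy stand-ins for the first three stages. Everything below is an assumption for illustration: the function names, the dictionary layout, the line-based partitioner, and the word-budget chunker are not Bundata's API or algorithms. The point is only that metadata added during enrichment is still present on each chunk afterwards.

```python
def partition(raw: str, source: str) -> list[dict]:
    """Toy partitioner: one element per non-empty line."""
    return [
        {"text": line.strip(), "metadata": {"source": source}}
        for line in raw.splitlines() if line.strip()
    ]


def enrich(elements: list[dict]) -> list[dict]:
    """Toy enrichment: record each element's position and word count."""
    for position, element in enumerate(elements):
        element["metadata"]["element_index"] = position
        element["metadata"]["word_count"] = len(element["text"].split())
    return elements


def merge(elements: list[dict]) -> dict:
    """Combine elements into one chunk that keeps their source and positions."""
    return {
        "text": " ".join(e["text"] for e in elements),
        "metadata": {
            "source": elements[0]["metadata"]["source"],
            "element_indices": [e["metadata"]["element_index"] for e in elements],
        },
    }


def chunk(elements: list[dict], max_words: int = 8) -> list[dict]:
    """Group consecutive elements until adding one would exceed max_words."""
    chunks, current, used = [], [], 0
    for element in elements:
        words = element["metadata"]["word_count"]
        if current and used + words > max_words:
            chunks.append(merge(current))
            current, used = [], 0
        current.append(element)
        used += words
    if current:
        chunks.append(merge(current))
    return chunks


raw_text = (
    "Quarterly report\n"
    "Revenue grew this quarter.\n"
    "Costs stayed flat across all regions."
)
pipeline_chunks = chunk(enrich(partition(raw_text, source="q3.txt")))
```

Here the chunker depends on the `word_count` field that enrichment added, and each chunk records which source elements it covers. Embedding and indexing would follow as a fourth step and are omitted from the sketch.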

Availability

Production (UI/API) — All enrichment types (metadata, NER, image description, table description, and table-to-HTML) are available in the Bundata UI and API.
Open source — Some enrichment types may be limited or require additional setup. See Open Source vs Enterprise for details.

Related pages

Partitioning & Chunking — How documents become chunks.
Overview — Key functionality and use cases.
API Reference — Enrichment and extraction endpoints.