Create your first extraction

This tutorial walks you through running your first extraction in Bundata. You will use the Platform UI to upload a document (or connect a source), choose or create a schema, run extraction, and review the output: smart bites with metadata and extraction confidence scores.

What extraction means in Bundata

Extraction turns unstructured documents (PDFs, contracts, invoices, policy docs) into structured, schema-aware output. Each run:

Partitions the document into elements (titles, paragraphs, tables, images).
Extracts fields and metadata according to your schema.
Optionally enriches with named entities, image descriptions, and table descriptions.
Produces smart bites: chunks of content plus metadata, ready for the Vector Catalog, RAG, or agents.

Output includes source lineage (which file and page the content came from) and extraction confidence (how reliable each extracted value is), so you can audit and tune quality.
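
To make the shape of a smart bite concrete, here is a minimal Python sketch. The class name `SmartBite` and every field name in it are illustrative assumptions, not Bundata's actual output format; check the output of a real run for the exact keys.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SmartBite:
    """Hypothetical model of one smart bite: content plus metadata."""

    text: str                      # the chunk of content
    fields: Dict[str, Any]         # values extracted according to the schema
    source_file: str               # lineage: which file the content came from
    page: int                      # lineage: which page the content came from
    confidence: Dict[str, float] = field(default_factory=dict)  # per-field scores

    def citation(self) -> str:
        """Build a human-readable pointer back to the source."""
        return f"{self.source_file}, page {self.page}"


# Sample values are made up for illustration.
bite = SmartBite(
    text="Payment is due within 30 days of the invoice date.",
    fields={"payment_terms_days": 30},
    source_file="contracts/acme-msa.pdf",
    page=4,
    confidence={"payment_terms_days": 0.93},
)
print(bite.citation())
```

Keeping lineage and confidence next to the extracted values is what lets you audit a single value without going back to the whole document.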

Prerequisites

A Bundata account. If you do not have one, see Sign up and Quickstart to get API keys and platform access.
A document to extract, for example a PDF, DOCX, or image file. See Supported file types for the full list.
(Optional) A schema. You can create one in Schema Studio or use a default/sample schema for your first run.

Step 1: Sign in and open Extraction Studio

Sign in to the Bundata Platform (or your organization’s Bundata URL).
From the dashboard, open Extraction Studio (or the extraction area in the Platform).
Confirm you are in the correct workspace if your account has multiple workspaces.

Step 2: Provide input (upload or connect a source)

You can run extraction on a single file or on files from a connected source.

Option A: Upload a document

Use Upload or drag-and-drop to add a file (e.g. a PDF contract or invoice).
Bundata supports 65+ file types; see Supported file types.

Option B: Connect a source

Connect a source (S3, Google Drive, SharePoint, etc.) from the source picker.
Select one or more files from that source to process. See Integrations overview and Source connectors.

Step 3: Choose or create a schema

The schema defines which fields extraction produces (e.g. title, body, metadata, custom fields).

Use an existing schema — Select a schema you or your team already created in Schema Studio.
Create a new schema — In Schema Studio, add fields (text, number, date, nested objects) and optionally mark fields as required. Start simple (e.g. title, body, source path) and expand later. See First schema and Schema design.

Extraction is schema-aware: output structure matches the schema so downstream systems (Vector Catalog, agents, workflows) get consistent data.
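
As a rough illustration of what "output structure matches the schema" means, the sketch below models a simple schema as a plain Python dictionary and checks a record against it. The dictionary layout, the field names, and the helper `schema_problems` are assumptions made for this example; Schema Studio has its own format.

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical schema: start simple, with only two required fields.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "title": {"type": str, "required": True},
    "body": {"type": str, "required": True},
    "source_path": {"type": str, "required": False},
    "effective_date": {"type": date, "required": False},
}


def schema_problems(record: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of mismatches between a record and the schema."""
    problems: List[str] = []
    for name, spec in schema.items():
        value = record.get(name)
        if value is None:
            # Missing optional fields are fine; missing required ones are not.
            if spec["required"]:
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"wrong type for {name}: expected {spec['type'].__name__}")
    for name in record:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


problems = schema_problems({"title": "Master Services Agreement"}, SCHEMA)
print(problems)
```

A downstream consumer can run a check like this before indexing, so that inconsistent records are caught early rather than at query time.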

Step 4: Run extraction

With the document(s) and schema selected, click Run extraction (or the equivalent control in your UI).
Wait for the job to complete. You can monitor progress in the extraction runs view. See Extraction runs for details on status, logs, and retries.
When the run finishes, open the results.
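
If you later script this step instead of watching the runs view, waiting for a job usually comes down to a polling loop. The sketch below is generic and assumes nothing about Bundata's API: `get_status` is a hypothetical callable you would supply, and the status names `queued`, `running`, `completed`, and `failed` are placeholders.

```python
import time
from typing import Callable

# Placeholder status names; a real system may use different ones.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_run(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until the run reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


# Simulated run: the status source is a fixed sequence and sleeping is skipped.
fake_statuses = iter(["queued", "running", "completed"])
final = wait_for_run(lambda: next(fake_statuses), sleep=lambda _seconds: None)
print(final)
```

Passing the status source and the sleep function as parameters keeps the loop testable without a live job.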

Step 5: Review output (smart bites and confidence)

After the run completes, you will see:

Smart bites — Chunks of content plus metadata. Each bite typically includes the extracted fields from your schema, source path, page or section, and optional enrichment (entities, image/table descriptions).
Extraction confidence — Per-field or per-bite confidence scores where applicable. Use these to spot low-quality extractions or schema gaps. See Best practices and troubleshooting low-confidence extraction.
Source lineage — Traceability back to the original file and location for audit and citations.

Review a few results to confirm that the schema fits and the quality is acceptable. If confidence is low or fields are missing, adjust the schema or improve the quality of the source document, then re-run.
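
One way to review results systematically is to list every field whose confidence falls below a threshold. The sketch assumes each bite is a dictionary with a `confidence` mapping from field name to a score between 0 and 1; that layout, the sample data, and the 0.8 threshold are assumptions for illustration only.

```python
from typing import Any, Dict, List, Tuple


def low_confidence_fields(
    bites: List[Dict[str, Any]], threshold: float = 0.8
) -> List[Tuple[int, str, float]]:
    """Return (bite index, field name, score) for scores below the threshold,
    lowest score first."""
    flagged: List[Tuple[int, str, float]] = []
    for index, bite in enumerate(bites):
        for name, score in bite.get("confidence", {}).items():
            if score < threshold:
                flagged.append((index, name, score))
    return sorted(flagged, key=lambda item: item[2])


# Made-up results for two bites.
sample_bites = [
    {"fields": {"title": "Invoice 1001", "total": "1,250.00"},
     "confidence": {"title": 0.97, "total": 0.62}},
    {"fields": {"title": "Invoice 1002", "total": "980.00"},
     "confidence": {"title": 0.74, "total": 0.91}},
]
flagged = low_confidence_fields(sample_bites)
for index, name, score in flagged:
    print(f"bite {index}: {name} scored {score}")
```

Fields that are flagged repeatedly across documents tend to point at a schema problem, while isolated low scores more often point at a poor-quality source.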

What happens behind the scenes

Bundata runs a pipeline that:

Parses the document (layout-aware for PDFs and similar formats).
Applies OCR if needed for scanned content.
Extracts fields according to your schema and optional enrichments.
Produces normalized output (e.g. canonical JSON) with metadata and lineage.

The output is vector-ready: you can send it to the Vector Catalog for indexing and then use Vector Search for semantic search and Agents for grounded answers.
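
The idea behind normalized output can be sketched with Python's standard `json` module: serialize each record with sorted keys and fixed separators so that the same content always yields the same text. The record layout below (`fields`, `metadata`, `lineage`) is a hypothetical example, not Bundata's canonical format.

```python
import json
from typing import Any, Dict


def to_canonical_json(record: Dict[str, Any]) -> str:
    """Serialize a record deterministically: sorted keys, no extra whitespace."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Two records with the same content but different key order.
record_a = {
    "fields": {"title": "Travel Policy", "body": "Employees must book in advance."},
    "metadata": {"language": "en"},
    "lineage": {"source_file": "policies/travel.pdf", "page": 2},
}
record_b = {
    "lineage": {"page": 2, "source_file": "policies/travel.pdf"},
    "metadata": {"language": "en"},
    "fields": {"body": "Employees must book in advance.", "title": "Travel Policy"},
}

canonical_a = to_canonical_json(record_a)
canonical_b = to_canonical_json(record_b)
print(canonical_a == canonical_b)
```

A deterministic serialization makes it straightforward to diff two runs of the same document or to detect unchanged records before re-indexing.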

Common mistakes

Schema too strict — Required fields that are often missing cause failed extractions or empty values. Prefer optional fields and add required only when necessary.
Ignoring confidence — Low confidence can indicate poor source quality or schema mismatch. Use confidence to tune schemas and sources. See Best practices.
Skipping schema — Running without a schema (or with a generic one) may produce output that does not match your downstream systems. Define a schema that matches your use case.
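
To decide which fields can safely be marked as required, it helps to measure how often each field comes back empty across a sample of results. The following sketch assumes results are available as plain dictionaries of extracted fields; the field names and sample records are invented for the example.

```python
from collections import Counter
from typing import Any, Dict, List


def missing_rates(records: List[Dict[str, Any]], field_names: List[str]) -> Dict[str, float]:
    """Return, per field, the fraction of records where the value is absent or empty."""
    if not records:
        return {name: 0.0 for name in field_names}
    missing: Counter = Counter()
    for record in records:
        for name in field_names:
            if record.get(name) in (None, ""):
                missing[name] += 1
    return {name: missing[name] / len(records) for name in field_names}


# Invented sample: po_number is missing in two of three records.
records = [
    {"title": "Invoice A", "po_number": None},
    {"title": "Invoice B", "po_number": "PO-1"},
    {"title": "Invoice C"},
]
rates = missing_rates(records, ["title", "po_number"])
for name, rate in rates.items():
    print(f"{name}: missing in {rate:.0%} of records")
```

A field with a high missing rate is a poor candidate for required; keeping it optional avoids failed extractions while you investigate why it is so often absent.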

Next steps

First schema — Build a schema tailored to your documents.
Extraction runs — Monitor and interpret extraction jobs.
Vector Catalog overview — Index extraction output for search.
Vector Search overview — Query with semantic search.
Best practices — Schema design and troubleshooting.