Extraction runs

An extraction run is one execution of the Bundata extraction pipeline on a set of documents. You can run extraction on demand from the Platform UI or API, or trigger it via workflow orchestration (schedule or event). This page covers how to start runs, monitor them, and use extraction confidence and source lineage in your workflows.

How to run extraction

From the Platform UI

Open Extraction Studio and select a source (a file upload or a configured connector).
Choose the schema to use. Ensure the schema matches the document type (e.g. contracts, invoices).
Optionally configure enrichment (metadata, NER, image/table description).
Click Run extraction. The job is queued and runs asynchronously.
View run status and results in the runs list; drill into source lineage to see which document produced which smart bites.

Via API

Call the extraction or batch endpoint, authenticating with your API key. Typical patterns:

Single document — POST to the extraction endpoint with file content or a reference (e.g. S3 URI).
Batch — POST a list of document references; the API returns a job ID. Poll the job status endpoint until the job completes or fails, then fetch the results.

See REST API overview and API reference for request/response shapes and error handling.
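
The batch pattern above (submit, poll, fetch) can be sketched as a small polling loop. This is a minimal sketch under stated assumptions, not the Bundata client: `get_status` is a hypothetical callable that stands in for a request to the job status endpoint and returns the state as a string, and the state names are the ones listed under Monitoring and status. Check the API reference for the actual paths and response fields.

```python
import time
from typing import Callable

# State names as listed on this page; the real API may define more.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    get_status: Callable[[str], str],  # hypothetical: wraps the job status endpoint
    job_id: str,
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for real HTTP calls: the job finishes on the third poll.
    fake_states = iter(["queued", "running", "completed"])
    final = wait_for_job(lambda _job_id: next(fake_states), "job-123", sleep=lambda _s: None)
    print(final)
```

Returning on failed as well as completed matters: a loop that waits only for completed would spin until the timeout whenever a run fails.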

Via workflows

Configure a workflow with an extraction step and a trigger (schedule or source event). Workflows run extraction on new or updated documents and send output to the Vector Catalog or destinations. See Workflows overview and Triggers & scheduling.

Monitoring and status

Run status — Runs move through states such as queued, running, completed, failed. Use the UI or the jobs API to check status.
Failure reasons — Failed runs return error codes and messages (e.g. invalid schema, unsupported file type, size limit exceeded). See Error codes and Error handling.
Partial success — For batch runs, some documents may succeed and others fail. Results include per-document status so you can retry or fix only the failed items.
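
To act on partial success, split the per-document results and resubmit only the failures. The sketch below assumes each result is a dictionary with `document`, `status`, and (for failures) `error` keys. Those field names are illustrative assumptions, not the documented response shape.

```python
from typing import Iterable


def split_by_status(results: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Separate per-document results into succeeded and failed lists."""
    succeeded: list[dict] = []
    failed: list[dict] = []
    for item in results:
        # "status" and "completed" are assumed names; adjust to the real response.
        if item.get("status") == "completed":
            succeeded.append(item)
        else:
            failed.append(item)
    return succeeded, failed


if __name__ == "__main__":
    batch_results = [
        {"document": "s3://bucket/a.pdf", "status": "completed"},
        {"document": "s3://bucket/b.tiff", "status": "failed", "error": "unsupported_file_type"},
    ]
    ok, bad = split_by_status(batch_results)
    # Only the failed references need to go into the next batch request.
    retry_refs = [item["document"] for item in bad]
    print(retry_refs)
```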

Extraction confidence

Extraction confidence indicates how reliable the extracted output is for a given field or document. Use it to:

Tune quality — Low confidence often means poor source quality, complex layout, or a schema mismatch. Refine the schema or preprocessing. See Best practices.
Filter downstream — In RAG pipelines or agents, you can filter out low-confidence chunks or surface the confidence score in the UI for human review.
Audit — Confidence plus source lineage supports compliance and audit trails.
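
Downstream filtering can be as simple as a threshold check. The sketch below assumes each smart bite carries a numeric `confidence` field between 0 and 1. Both the field name and the scale are assumptions, and the 0.8 threshold is an arbitrary starting point to tune against your own data.

```python
from typing import Iterable


def partition_by_confidence(
    bites: Iterable[dict], threshold: float = 0.8
) -> tuple[list[dict], list[dict]]:
    """Return (accepted, needs_review) based on an assumed 'confidence' field."""
    accepted: list[dict] = []
    needs_review: list[dict] = []
    for bite in bites:
        score = bite.get("confidence")
        # Bites with no score are routed to review rather than silently accepted.
        if score is not None and score >= threshold:
            accepted.append(bite)
        else:
            needs_review.append(bite)
    return accepted, needs_review


if __name__ == "__main__":
    sample = [
        {"id": "b1", "confidence": 0.93},
        {"id": "b2", "confidence": 0.41},
        {"id": "b3"},
    ]
    accepted, needs_review = partition_by_confidence(sample)
    print([b["id"] for b in accepted], [b["id"] for b in needs_review])
```

Keeping the low-confidence bites in a review list, instead of dropping them, preserves the material needed for human review and for auditing.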

Output and lineage

Smart bites — Run output is a set of smart bites (chunks plus metadata) conforming to your schema. Send them to the Vector Catalog or your own store.
Source lineage — Each bite can be traced back to the source document and run. Preserve lineage for grounded answers and troubleshooting.
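
When troubleshooting, it often helps to index bites by the document they came from. The sketch below assumes each bite exposes its lineage as a `source_document` field. That name is hypothetical, so substitute whatever lineage fields your run output actually contains.

```python
from collections import defaultdict
from typing import Iterable


def group_by_source(bites: Iterable[dict]) -> dict[str, list[dict]]:
    """Index bites by source document so each one can be traced back."""
    by_source: dict[str, list[dict]] = defaultdict(list)
    for bite in bites:
        # "source_document" is an assumed lineage field name.
        by_source[bite.get("source_document", "unknown")].append(bite)
    return dict(by_source)


if __name__ == "__main__":
    sample = [
        {"id": "b1", "source_document": "contract.pdf", "run_id": "run-1"},
        {"id": "b2", "source_document": "invoice.pdf", "run_id": "run-1"},
        {"id": "b3", "source_document": "contract.pdf", "run_id": "run-1"},
    ]
    for source, items in group_by_source(sample).items():
        print(source, [b["id"] for b in items])
```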

Common pitfalls

Wrong schema — Using a schema designed for contracts on invoices can yield low confidence or missing fields. Match schema to document type.
No retries — Transient failures (e.g. rate limits) should be retried with backoff rather than treated as permanent. See Rate limits and Error handling.
Ignoring failures — Monitor run history and set up alerts so failed runs are fixed or retried.
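
Retrying with backoff can be sketched as exponential delays with jitter. This is a generic pattern, not Bundata-specific behavior: `TransientError` is a hypothetical exception that your own client code would raise for retryable conditions such as a rate-limit response, and the delay values are placeholders. Follow Rate limits for the limits that actually apply.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. a rate-limit response)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Only errors you classify as transient are retried. Permanent failures such as an invalid schema propagate immediately, so they can be fixed rather than retried.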

Next steps

Best practices — Schema design and troubleshooting.
Workflows overview — Automate extraction runs.
Reference: API — Endpoints and response formats.