Supported file types

Bundata extraction supports more than 65 file types, covering the documents your teams actually use: contracts, invoices, policies, and correspondence in formats such as PDF, Word, HTML, and images. Extracted content is normalized into schema-aware output for the Vector Catalog and downstream systems.

Document formats

Category	Examples	Notes
PDF	`.pdf`	Native and scanned; OCR applied when needed. Layout-aware parsing for tables and sections.
Word	`.docx`, `.doc`	Headings, paragraphs, tables, and lists extracted. Page metadata used for page-based billing where applicable.
Presentations	`.pptx`, `.ppt`	Slides as logical units; text and structure preserved.
Web / markup	`.html`, `.htm`, `.md`, `.mdx`	Cleaned and structured; links and structure retained where useful.
Plain text	`.txt`, `.csv`	Chunked and optionally enriched; CSV can be parsed as tabular data.

Images

Format	Notes
`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp`	Supported for extraction. OCR extracts any text in the image; optional image description enrichment generates textual descriptions for RAG and accessibility.

Use image extraction for scanned forms, receipts, and diagrams so that their content becomes searchable and usable in downstream vector pipelines.

Emails and messages

Format	Notes
`.eml`, `.msg`	Headers, body, and attachments can be processed. Attachments are handled according to their own file type.

Email extraction is useful for processing support tickets, procurement threads, and legal correspondence.
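
The format tables above can be encoded as a small lookup for routing files on the client side before upload. This is a minimal sketch, not part of any Bundata SDK: the names `CATEGORY_BY_EXTENSION` and `categorize` are hypothetical, and the mapping covers only the extensions named on this page rather than the full list of supported types.

```python
from pathlib import Path
from typing import Optional

# Extensions named on this page only; the full supported list is longer.
CATEGORY_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Word", ".doc": "Word",
    ".pptx": "Presentations", ".ppt": "Presentations",
    ".html": "Web / markup", ".htm": "Web / markup",
    ".md": "Web / markup", ".mdx": "Web / markup",
    ".txt": "Plain text", ".csv": "Plain text",
    ".png": "Images", ".jpg": "Images", ".jpeg": "Images",
    ".tiff": "Images", ".bmp": "Images", ".webp": "Images",
    ".eml": "Emails and messages", ".msg": "Emails and messages",
}


def categorize(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if its extension
    is not one of those listed on this page (hypothetical helper)."""
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower())
```

A `None` result only means the extension is not in this local table; the Connectors reference remains the authority on what is actually supported.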

When to use which format

Contracts and legal — Prefer PDF or DOCX; use schema fields for parties, dates, and clauses. See Extraction overview.
Invoices and receipts — PDF or image; enable table extraction and optional image description for line items.
Policy and internal docs — HTML, PDF, or DOCX; use metadata (source, date) for source lineage and filtering in Vector Search.

Limits and sizing

Per-file size limits depend on your plan. See Limits and quotas.
Very large PDFs or high-resolution images may take longer to process; use batch APIs and workflow orchestration for predictable throughput.
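
One way to keep batch sizes predictable is to group files by total size before submitting them. The sketch below shows only that client-side grouping; it does not call any Bundata API, and the function name `batch_by_size` and the byte budget are assumptions chosen for illustration, not plan limits.

```python
from typing import List, Tuple


def batch_by_size(files: List[Tuple[str, int]],
                  max_batch_bytes: int) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches whose
    total size stays within max_batch_bytes (a hypothetical budget).
    A file larger than the budget is placed in a batch of its own."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    # Largest files first, so oversized files are isolated early.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

The greedy pass is not an optimal bin packing, but it is simple and keeps each batch under the chosen budget except where a single file exceeds it.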

Common pitfalls

Scanned PDFs without OCR — Ensure OCR is enabled for image-only pages so text is extracted.
Encrypted or password-protected files — Must be decrypted before upload; Bundata does not strip passwords.
Unsupported or corrupt files — Check the Connectors reference and error responses; unsupported types return clear errors.
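
Some of these pitfalls can be caught locally before upload. The following is a rough sketch under stated assumptions: `preflight` is a hypothetical helper, not a Bundata function, and the encryption check is a heuristic that looks for the `/Encrypt` key that encrypted PDFs carry in their trailer dictionary. It can produce false positives and is no substitute for the service's own error responses.

```python
from pathlib import Path
from typing import List


def preflight(path: str) -> List[str]:
    """Return a list of problems found in a local file; an empty list
    means none of these heuristic checks fired (hypothetical helper)."""
    p = Path(path)
    if not p.is_file():
        return ["not a file"]
    data = p.read_bytes()
    if not data:
        return ["file is empty"]
    problems: List[str] = []
    if p.suffix.lower() == ".pdf":
        if not data.startswith(b"%PDF-"):
            # PDF files normally begin with a "%PDF-" version header.
            problems.append("does not start with %PDF-; file may be corrupt")
        elif b"/Encrypt" in data:
            # Heuristic: encrypted PDFs reference an /Encrypt dictionary.
            problems.append("PDF appears to be encrypted; decrypt before upload")
    return problems
```

Reading the whole file into memory keeps the sketch short; for very large files a streamed scan would be more appropriate.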

Next steps

Extraction runs — How to run extraction on your files.
Extraction overview — Enrichment and output shape.
Reference: Connectors — Connector-specific format support.