Supported file types

Bundata extraction supports more than 65 file types, covering the documents your teams actually use: contracts, invoices, policies, and correspondence in PDF, Word, HTML, image, and email formats. Extracted content is normalized into schema-aware output for the Vector Catalog and downstream systems.

Document formats

| Category | Examples | Notes |
| --- | --- | --- |
| PDF | .pdf | Native and scanned; OCR applied when needed. Layout-aware parsing for tables and sections. |
| Word | .docx, .doc | Headings, paragraphs, tables, and lists extracted. Page metadata used for page-based billing where applicable. |
| Presentations | .pptx, .ppt | Slides as logical units; text and structure preserved. |
| Web / markup | .html, .htm, .md, .mdx | Cleaned and structured; links and structure retained where useful. |
| Plain text | .txt, .csv | Chunked and optionally enriched; CSV can be parsed as tabular data. |

Images

| Format | Notes |
| --- | --- |
| .png, .jpg, .jpeg, .tiff, .bmp, .webp | Supported for extraction. OCR is applied for text; optional image description enrichment generates textual descriptions for RAG and accessibility. |

Use image extraction for scanned forms, receipts, and diagrams so they become searchable and usable in vector-ready intelligence pipelines.

Emails and messages

| Format | Notes |
| --- | --- |
| .eml, .msg | Headers, body, and attachments can be processed. Attachments are handled according to their own file type. |

Useful for processing support tickets, procurement threads, and legal correspondence.
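
To make "handled according to their own file type" concrete, the sketch below uses Python's standard `email` package to split a raw .eml message into headers, body, and attachments, and records each attachment's extension so it can be routed to the matching extraction path. It is a local illustration only, not Bundata's implementation: `split_eml` is a hypothetical helper, and .msg files are not covered because the standard library does not parse them.

```python
from email import message_from_bytes, policy
from pathlib import PurePath


def split_eml(raw: bytes) -> dict:
    """Split a raw .eml message into headers, body text, and attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text body, falling back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    attachments = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        attachments.append(
            {
                "filename": name,
                # The extension decides which extraction path applies.
                "extension": PurePath(name).suffix.lower(),
                "size_bytes": len(part.get_payload(decode=True) or b""),
            }
        )
    return {
        "subject": msg["subject"],
        "from": msg["from"],
        "body": body_part.get_content() if body_part else "",
        "attachments": attachments,
    }
```

An invoice PDF attached to a procurement thread would appear in `attachments` with extension `.pdf`, so it would follow the PDF notes in the Document formats section.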

When to use which format

  • Contracts and legal — Prefer PDF or DOCX; use schema fields for parties, dates, and clauses. See Extraction overview.
  • Invoices and receipts — PDF or image; enable table extraction and optional image description for line items.
  • Policy and internal docs — HTML, PDF, or DOCX; use metadata (source, date) for source lineage and filtering in Vector Search.
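
As an illustration of the "schema fields" idea for contracts, the sketch below models parties, dates, and clauses as a plain Python dataclass and reports which fields came back empty. The class and field names are assumptions made for this example; Bundata's actual schema format is described in the Extraction overview and may look quite different.

```python
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ContractFields:
    """Hypothetical target fields for a contract; not Bundata's schema format."""

    parties: List[str] = field(default_factory=list)
    effective_date: Optional[date] = None
    clauses: List[str] = field(default_factory=list)


def missing_fields(record: ContractFields) -> List[str]:
    """List the fields that an extraction result left empty."""
    return [name for name, value in asdict(record).items() if not value]
```

A check like `missing_fields` can flag documents that need review, for example a scanned contract where the effective date was not recovered.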

Limits and sizing

  • Per-file size limits depend on your plan. See Limits and quotas.
  • Very large PDFs or high-resolution images may take longer; use batch APIs and workflow orchestration for predictable throughput.
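
One way to prepare for batch submission is to group files locally before calling any API. The sketch below sets aside files that exceed a per-file limit and packs the rest into batches under a byte budget. Both limits are placeholders (the real values depend on your plan), `plan_batches` is a hypothetical helper, and the batch APIs themselves are not shown.

```python
from typing import Dict, List, Tuple


def plan_batches(
    sizes: Dict[str, int],
    max_file_bytes: int,
    max_batch_bytes: int,
) -> Tuple[List[List[str]], List[str]]:
    """Group files into batches under a byte budget; return (batches, oversized)."""
    oversized = [name for name, size in sizes.items() if size > max_file_bytes]
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    # Process largest files first so the big ones are scheduled before small ones.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > max_file_bytes:
            continue  # already reported in `oversized`
        # Close the current batch if adding this file would exceed the budget.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches, oversized
```

Files returned in `oversized` would need to be split or compressed before upload, or handled under a plan with a higher limit.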

Common pitfalls

  • Scanned PDFs without OCR — Ensure OCR is enabled for image-only pages so text is extracted.
  • Encrypted or password-protected files — Must be decrypted before upload; Bundata does not strip passwords.
  • Unsupported or corrupt files — Check the Connectors reference and error responses; unsupported types return clear errors.
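
Some of these pitfalls can be caught before upload. The sketch below is a local pre-flight check under stated assumptions: `KNOWN_EXTENSIONS` contains only the extensions named on this page (the full supported list is longer), and the encryption test is a byte-level heuristic that looks for an `/Encrypt` entry, which can produce false positives. `preflight` is a hypothetical helper, not part of Bundata.

```python
from pathlib import Path

# Extensions listed on this page; the full supported list is longer.
KNOWN_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".html", ".htm", ".md", ".mdx",
    ".txt", ".csv", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp",
    ".eml", ".msg",
}


def preflight(path: Path) -> list:
    """Return a list of likely problems with a file; empty means none found."""
    problems = []
    ext = path.suffix.lower()
    if ext not in KNOWN_EXTENSIONS:
        problems.append(f"extension {ext or '(none)'} is not listed on this page")
    data = path.read_bytes()
    if not data:
        problems.append("file is empty")
    elif ext == ".pdf":
        if not data.startswith(b"%PDF-"):
            problems.append("missing %PDF- header; file may be corrupt")
        elif b"/Encrypt" in data:
            # Heuristic: encrypted PDFs reference an /Encrypt dictionary.
            problems.append("PDF appears to be encrypted; decrypt before upload")
    return problems
```

An empty result does not guarantee that extraction will succeed; it only rules out the problems this sketch knows how to detect.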

Next steps