Supported file types
Bundata extraction supports 65+ file types so you can process the documents your teams actually use: contracts, invoices, policy PDFs, Word docs, HTML, images, and more. Content is normalized into schema-aware output for the Vector Catalog and downstream systems.
Document formats
| Category | Examples | Notes |
|---|---|---|
| PDF | .pdf | Native and scanned; OCR applied when needed. Layout-aware parsing for tables and sections. |
| Word | .docx, .doc | Headings, paragraphs, tables, and lists extracted. Page metadata used for page-based billing where applicable. |
| Presentations | .pptx, .ppt | Slides as logical units; text and structure preserved. |
| Web / markup | .html, .htm, .md, .mdx | Cleaned and structured; links and structure retained where useful. |
| Plain text | .txt, .csv | Chunked and optionally enriched; CSV can be parsed as tabular data. |
Images
| Format | Notes |
|---|---|
| .png, .jpg, .jpeg, .tiff, .bmp, .webp | Supported for extraction. OCR is applied for text; optional image description enrichment generates textual descriptions for RAG and accessibility. |
Use image extraction for scanned forms, receipts, and diagrams so they become searchable and usable in vector-ready intelligence pipelines.
Emails and messages
| Format | Notes |
|---|---|
| .eml, .msg | Headers, body, and attachments can be processed. Attachments are handled according to their own file type. |
Useful for processing support tickets, procurement threads, and legal correspondence.
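As a rough illustration of the formats listed so far, the sketch below maps a filename to one of the categories on this page before upload. The extensions are copied from the tables above; the category labels and the `classify` helper are hypothetical names chosen for this example, not Bundata identifiers. Because Bundata supports more types than this page lists, a `None` result only means the extension does not appear here.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the tables on this page. The category labels are
# illustrative only and are not Bundata identifiers.
CATEGORY_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "word", ".doc": "word",
    ".pptx": "presentation", ".ppt": "presentation",
    ".html": "web", ".htm": "web", ".md": "web", ".mdx": "web",
    ".txt": "text", ".csv": "text",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".bmp": "image", ".webp": "image",
    ".eml": "email", ".msg": "email",
}


def classify(filename: str) -> Optional[str]:
    """Return the category for a filename, or None if its extension is not listed here."""
    # Path.suffix keeps the leading dot; lowercase it so "Contract.PDF" matches ".pdf".
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower())
```

A check like this can route files to different handling on the client side (for example, enabling OCR-related options only for the `image` and `pdf` categories) before anything is sent for extraction.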
When to use which format
- Contracts and legal — Prefer PDF or DOCX; use schema fields for parties, dates, and clauses. See Extraction overview.
- Invoices and receipts — PDF or image; enable table extraction and optional image description for line items.
- Policy and internal docs — HTML, PDF, or DOCX; use metadata (source, date) for source lineage and filtering in Vector Search.
Limits and sizing
- Per-file size limits depend on your plan. See Limits and quotas.
- Very large PDFs or high-resolution images may take longer; use batch APIs and workflow orchestration for predictable throughput.
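One way to prepare files for batch processing is to group them client-side so that no batch exceeds a chosen size budget. The sketch below is a generic greedy grouping, not a Bundata API: `batch_by_size` and `max_batch_bytes` are hypothetical names, and the budget is a value you pick yourself, not a documented Bundata limit.

```python
from typing import Iterable, List, Tuple


def batch_by_size(
    files: Iterable[Tuple[str, int]], max_batch_bytes: int
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches that do not exceed a size budget.

    Files are kept in their original order. A single file larger than the
    budget is placed in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the budget.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Keeping very large files in their own batch means one slow PDF or high-resolution image does not hold up a group of small files.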
Common pitfalls
- Scanned PDFs without OCR — Ensure OCR is enabled for image-only pages so text is extracted.
- Encrypted or password-protected files — Must be decrypted before upload; Bundata does not strip passwords.
- Unsupported or corrupt files — Check the Connectors reference and error responses; unsupported types return clear errors.
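Since encrypted files must be decrypted before upload, a quick local check can catch them early. The sketch below is a heuristic, not part of Bundata: an encrypted PDF carries an `/Encrypt` entry in its trailer, so the hypothetical helper `looks_encrypted_pdf` simply looks for that marker in the raw bytes. It can produce false positives if the same byte sequence appears elsewhere in the file, and it does not cover password-protected Word or other formats.

```python
def looks_encrypted_pdf(data: bytes) -> bool:
    """Heuristic check: does this look like an encrypted PDF?

    Returns True when a PDF header appears near the start of the data and
    the bytes contain an /Encrypt entry. This is an approximation; a PDF
    library gives a more reliable answer.
    """
    # Look for the "%PDF-" header within the first 1024 bytes.
    has_pdf_header = b"%PDF-" in data[:1024]
    return has_pdf_header and b"/Encrypt" in data
```

For a file on disk, the bytes can be read with `pathlib.Path(path).read_bytes()` and passed to the helper; files flagged this way can be decrypted before upload.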
Next steps
- Extraction runs — How to run extraction on your files.
- Extraction overview — Enrichment and output shape.
- Reference: Connectors — Connector-specific format support.