Partitioning & Chunking
Bundata turns raw, unstructured documents into structured content and then into chunks optimized for retrieval-augmented generation (RAG) and large language models (LLMs). Instead of relying only on character or token counts, Bundata uses document structure and semantics to produce more accurate, context-preserving units.
Why partitioning and chunking matter
- Better retrieval — Chunks that respect document structure (headings, lists, tables) and meaning improve relevance when your RAG system retrieves context.
- Fewer broken thoughts — Splitting in the middle of a sentence or paragraph hurts model performance. Document-aware partitioning reduces mid-sentence splits.
- Consistent output — A canonical structure (e.g. document elements or JSON) makes it easier to build pipelines, embeddings, and agents on top.
Partitioning
Partitioning is the first step: turning a raw file (PDF, DOCX, HTML, etc.) into document elements. Bundata extracts:
- Titles and headings
- Paragraphs and lists
- Tables (structure and content)
- Images (with optional descriptions)
- Metadata (page numbers, file info, etc.)
Partitioning functions handle format-specific logic (e.g. PDF vs HTML) so you get a uniform representation across file types. This is the foundation for cleaning, chunking, and enrichment.
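To make the idea of a uniform representation concrete, here is a minimal sketch of what a document-element model might look like. The dataclass, field names, and the toy partitioning heuristic are illustrative only, not Bundata's actual schema or API:

```python
from dataclasses import dataclass, field

# Illustrative element model; Bundata's real schema may differ.
@dataclass
class Element:
    category: str              # e.g. "Title", "NarrativeText", "Table"
    text: str
    metadata: dict = field(default_factory=dict)  # page number, source file, etc.

def partition_text(raw: str, source: str) -> list[Element]:
    """Toy partitioner: treat short ALL-CAPS lines as titles, everything else
    as narrative text. Real partitioners use format-specific layout analysis."""
    elements = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        category = "Title" if line.isupper() and len(line) < 60 else "NarrativeText"
        elements.append(Element(category, line, {"source": source}))
    return elements
```

For example, `partition_text("INTRO\nSome body text.", "report.txt")` yields a `Title` element followed by a `NarrativeText` element, both carrying source metadata, regardless of the input format.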
Chunking
Chunking takes partitioned content and splits it into units suitable for embedding and retrieval. Bundata supports:
- Semantic chunking — Group content by meaning and document structure (e.g. by section or topic) instead of fixed character limits.
- By-page and by-similarity — In production (UI/API), you can use by-page or by-similarity strategies for finer control and better RAG results.
- Configurable strategies — Choose chunk size and overlap, or use Bundata’s recommender to suggest an optimal configuration for your data.
Unlike naive text splitting, chunking that understands document hierarchy and semantics produces cleaner, more coherent chunks and more accurate downstream AI results.
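The difference from fixed-size splitting can be sketched with a simple structure-aware chunker that starts a new chunk at each section title and also enforces a size budget. The function name and the `(category, text)` pair format are illustrative, not Bundata's API:

```python
def chunk_by_title(elements, max_chars=500):
    """Group (category, text) pairs into chunks: a new chunk begins at each
    "Title" element or when the running size would exceed max_chars."""
    chunks, current, size = [], [], 0
    for category, text in elements:
        # Flush at a section boundary or when the size budget would be exceeded.
        if current and (category == "Title" or size + len(text) > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Because splits land on section boundaries, each chunk keeps a heading together with its body text instead of cutting paragraphs at an arbitrary character offset.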
Cleaning (optional)
Before or after chunking, you can run cleaning steps to:
- Remove boilerplate, headers, footers
- Normalize whitespace and formatting
- Strip content that hurts NLP or embedding quality
Cleaning improves the signal-to-noise ratio in your chunks and downstream models.
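A minimal cleaning pass along these lines might look as follows; the boilerplate patterns and function name are illustrative assumptions, and production cleaners would use format-aware rules:

```python
import re

# Illustrative boilerplate patterns: page footers, legal lines, etc.
BOILERPLATE = re.compile(r"^(page \d+|confidential|all rights reserved)", re.IGNORECASE)

def clean(text: str) -> str:
    """Drop boilerplate lines and normalize whitespace before embedding."""
    lines = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()   # collapse runs of whitespace
        if line and not BOILERPLATE.match(line):
            lines.append(line)
    return "\n".join(lines)
```

Running a pass like this before embedding keeps page footers and legal boilerplate out of the vector store, so retrieval scores reflect the actual content.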
Staging and destination
Once partitioned and chunked, data can be staged for downstream systems. Bundata can:
- Write to vector stores and databases
- Send to the Vector Catalog for search and agents
- Export via destination connectors (e.g. S3, data warehouses)
For full pipeline control, use Workflows to chain partitioning, chunking, enrichment, and embedding, then route results to the right destination.
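Conceptually, a workflow is an ordered chain of processing steps. The sketch below models that chaining as plain function composition; it is a stand-in for the idea, not Bundata's Workflows API, and the stage lambdas are hypothetical placeholders:

```python
def run_pipeline(raw, steps):
    """Apply processing steps (e.g. clean -> chunk -> embed) in order,
    feeding each step's output into the next."""
    data = raw
    for step in steps:
        data = step(data)
    return data

# Hypothetical stages for illustration only.
pipeline = [
    lambda text: text.strip(),                    # "clean": trim stray whitespace
    lambda text: text.split("\n\n"),              # "chunk": split on blank lines
    lambda chunks: [c.lower() for c in chunks],   # stand-in for an embedding step
]
```

For example, `run_pipeline("  Alpha\n\nBeta  ", pipeline)` returns `["alpha", "beta"]`; in a real workflow the final step would write the results to a vector store or destination connector.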
Related
- Extraction & Enrichment — Add metadata, NER, and image/table descriptions to your content.
- Quickstart — Run your first extraction and chunking pipeline.
- Open Source vs Enterprise — Chunking and recommender availability in open source vs UI/API.