Architecture

Structured Chunking

The practice of splitting S3-stored structured and semi-structured data (Parquet files, JSON documents, CSV records) into semantically meaningful chunks for embedding and retrieval, preserving row boundaries, schema context, and relational structure.

Summary

Where it fits

Structured chunking connects lakehouse data to vector indexing pipelines. Unlike unstructured document chunking (which splits by character or sentence), structured chunking respects data boundaries — a chunk might be a group of rows from a Parquet file, a JSON object with its schema, or a table partition with column metadata attached.

Misconceptions / Traps
  • Naive fixed-size chunking destroys tabular structure. Cutting a CSV or JSON record mid-row, or slicing a columnar Parquet file at arbitrary byte offsets, produces meaningless chunks. Chunking must operate on decoded records and respect record boundaries.
  • Including schema metadata in each chunk (column names, types, descriptions) improves retrieval relevance but increases embedding cost and storage.
  • Chunk size must balance retrieval precision (smaller chunks) against context completeness (larger chunks). For tabular data, a chunk per logical group (partition, date range, entity) often works better than a fixed token count.
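
A minimal sketch of the points above, assuming the records have already been read from S3 into CSV text (for Parquet, a reader library would supply the decoded rows instead). The names chunk_rows_by_group, group_key, and max_rows are hypothetical, chosen for illustration. Each chunk holds whole rows from one logical group and repeats the column names so the schema travels with the data:

```python
import csv
import io
from itertools import groupby


def chunk_rows_by_group(csv_text, group_key, max_rows=100):
    """Yield text chunks that each hold whole rows from one logical group."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    # Schema context repeated in every chunk (costs extra tokens per chunk).
    header = "columns: " + ", ".join(columns)
    # Sort so rows sharing a group value are adjacent for groupby.
    rows = sorted(reader, key=lambda r: r[group_key])
    for value, group in groupby(rows, key=lambda r: r[group_key]):
        group_rows = list(group)
        # Cap chunk size, but only ever cut between rows, never inside one.
        for start in range(0, len(group_rows), max_rows):
            batch = group_rows[start:start + max_rows]
            lines = [", ".join(f"{c}={row[c]}" for c in columns) for row in batch]
            yield {
                "group": value,
                "row_count": len(batch),
                "text": "\n".join([header, f"{group_key}: {value}"] + lines),
            }


sample = "date,region,sales\n2024-01-01,eu,10\n2024-01-01,us,20\n2024-01-02,eu,30\n"
for chunk in chunk_rows_by_group(sample, "date"):
    print(chunk["group"], chunk["row_count"])
```

Grouping by a partition-like column keeps related rows together; the max_rows cap is an assumed safeguard against a single group growing past what an embedding model can accept.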

Key Connections
  • scoped_to Vector Indexing on Object Storage, S3 — chunking S3-stored structured data
  • enables Embedding Generation — chunks are the input to embedding models
  • enables RAG over Structured Data — chunked structured data feeds RAG retrieval
  • depends_on Apache Parquet — the source format for most structured data on S3

Definition

What it is

The practice of splitting S3-stored documents into semantically meaningful chunks (sections, paragraphs, tables) with preserved structural metadata, optimized for embedding generation and retrieval-augmented generation.

Why it exists

Naive fixed-size chunking of documents stored on S3 loses context boundaries, splits tables mid-row, and produces embeddings that mix unrelated content. Structured chunking respects document semantics, producing higher-quality embeddings and more accurate retrieval.
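
As an illustration of respecting document semantics, the sketch below assumes the document has already been extracted to Markdown-like text, with "#" headings, blank lines between blocks, and table rows starting with "|". Those conventions and the function name chunk_document are assumptions for this example, not something the source prescribes. Each paragraph or table becomes one chunk, tables are never split, and every chunk carries its section heading as metadata:

```python
def chunk_document(text):
    """Split text into paragraph/table chunks tagged with their section heading."""
    chunks = []
    heading = None
    for block in text.split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        if lines[0].startswith("#"):
            # A heading updates the current section; it is not a chunk itself.
            heading = lines[0].lstrip("#").strip()
            lines = lines[1:]
            if not lines:
                continue
        # A block made only of "|" rows is treated as one indivisible table.
        is_table = all(line.lstrip().startswith("|") for line in lines)
        chunks.append({
            "section": heading,
            "kind": "table" if is_table else "paragraph",
            "text": "\n".join(lines),
        })
    return chunks


doc = "# Intro\n\nFirst paragraph.\n\n# Data\n\n| a | b |\n| 1 | 2 |\n\nClosing note."
for chunk in chunk_document(doc):
    print(chunk["section"], chunk["kind"])
```

Real PDF or HTML input would need a parsing step first; this sketch only covers the splitting stage once structure is visible in the text.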

Primary use cases

Document preparation for RAG systems, embedding-optimized document splitting, structured parsing of PDFs and HTML from S3 corpora.

Recent developments

Latest signals
  • Semantic chunking improved retrieval accuracy by 18% over fixed-size chunking (Kapa.ai 2026 benchmark). Sentence embeddings detect topic transitions, so ingestion produces variable-length chunks aligned with the document's logical structure. Per Firecrawl, "Best Chunking Strategies for RAG (and LLMs) in 2026".
  • Three categories: fixed-size, dynamic splitting, and hybrid approaches. The 2026 taxonomy distinguishes fixed-size (token-bounded), dynamic (sentence, paragraph, or section), and hybrid (sentence boundary plus size cap) chunking. Production deployments typically combine multiple strategies per content type. Per ByteTools, "RAG Chunking Best Practices 2026".
  • Pick semantic chunking for long-form prose and clause-level chunking for contracts, regulations, and SOPs. 2026 rule of thumb: semantic chunking by sentence similarity wins on long-form prose (research papers, transcripts, books); clause-level chunking wins wherever the legal or procedural unit is the natural chunk. Per Future AGI, "Evaluating RAG Chunking Strategies 2026".
  • Multiple chunking strategies per pipeline: Dify uses semantic chunking, LlamaIndex supports recursive chunking, and RAGFlow chunks by document structure. No single strategy wins everywhere, so production stacks route by content type. Per Firecrawl, "Best Chunking Strategies for RAG 2026".
  • Computational cost decision tree: token-based (cheap) → sentence/recursive (moderate) → semantic (expensive, needs embeddings) → LLM-based (most expensive but highest quality). The 2026 cost-awareness shift is to pick the simplest chunking method that meets your retrieval-quality bar; LLM-based parsing is reserved for high-stakes corpora where retrieval failures cost more than the embedding spend. Per DasRoot, "RAG Chunking Strategies: Document Splitting" (April 2026).
  • Five semantic chunking best practices: dynamic thresholds, buffer tuning, multi-modal awareness, chunk-overlap calibration, and structural inheritance. Extend.ai's March 2026 best-practices guide formalizes these five operational levers, giving practitioners a tuning playbook beyond "set window-size and pray." Per Extend, "Semantic Chunking 5 Best Practices" (March 2026).
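
The hybrid category in the taxonomy above (sentence boundary plus size cap) can be sketched as follows. This is a rough illustration under stated assumptions: word count stands in for a real token count, sentences are detected with a simple punctuation regex, and the names hybrid_chunks and max_words are hypothetical.

```python
import re


def hybrid_chunks(text, max_words=50):
    """Pack whole sentences into chunks without exceeding max_words per chunk."""
    # Split after ., !, or ? followed by whitespace; drop empty pieces.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would pass the cap.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(hybrid_chunks("One two three. Four five six. Seven eight nine ten.", max_words=6))
```

A single sentence longer than the cap is emitted as its own oversized chunk rather than being cut mid-sentence; a production splitter would need a fallback for that case.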
