Architecture

Structured Chunking

Summary

What it is

The practice of splitting S3-stored structured and semi-structured data (Parquet files, JSON documents, CSV records) into semantically meaningful chunks for embedding and retrieval, preserving row boundaries, schema context, and relational structure.

Where it fits

Structured chunking connects lakehouse data to vector indexing pipelines. Unlike unstructured document chunking (which splits by character or sentence), structured chunking respects data boundaries — a chunk might be a group of rows from a Parquet file, a JSON object with its schema, or a table partition with column metadata attached.
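
As a rough illustration of "a group of rows with its schema attached", the sketch below chunks CSV text into whole-row batches and prefixes each chunk with the column names. It uses only the Python standard library; the function names, the `rows_per_chunk` parameter, and the `s3://example-bucket/orders.csv` URI are hypothetical and chosen for the example, and reading the object from S3 is assumed to have happened already.

```python
import csv
import io
import json


def chunk_csv_rows(csv_text, rows_per_chunk=2, source="s3://example-bucket/orders.csv"):
    """Yield chunks of whole rows, each carrying the column names as schema context."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    batch, start = [], 0
    for index, row in enumerate(reader):
        batch.append(row)
        if len(batch) == rows_per_chunk:
            yield _make_chunk(batch, columns, source, start)
            batch, start = [], index + 1
    if batch:  # final, possibly shorter, chunk
        yield _make_chunk(batch, columns, source, start)


def _make_chunk(rows, columns, source, start):
    # One JSON line per row keeps record boundaries visible in the chunk text.
    body = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
    return {
        "text": "columns: " + ", ".join(columns) + "\n" + body,
        "metadata": {"source": source, "row_start": start, "row_count": len(rows)},
    }


# Hypothetical sample data standing in for an object already fetched from S3.
SAMPLE = "order_id,customer,total\n1,acme,10.50\n2,acme,3.00\n3,globex,7.25\n"
chunks = list(chunk_csv_rows(SAMPLE))
```

Each chunk's `text` is what would be passed to an embedding model, while `metadata` lets a retrieval hit be traced back to the source object and row range.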

Misconceptions / Traps
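  • Naive fixed-size chunking destroys tabular structure. Cutting serialized rows (CSV lines, JSON records) at a fixed character or token offset can split a record in half, and in a columnar format such as Parquet a raw byte range does not correspond to rows at all. Chunking must operate on decoded records and respect record boundaries.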
  • Including schema metadata in each chunk (column names, types, descriptions) improves retrieval relevance, but the same schema text is repeated in every chunk, which increases embedding cost and storage.
  • Chunk size must balance retrieval precision (smaller chunks) against context completeness (larger chunks). For tabular data, a chunk per logical group (partition, date range, entity) often works better than a fixed token count.
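
The difference between a chunk per logical group and a fixed-size chunk can be sketched as follows. This is a minimal illustration on in-memory records, not a prescribed implementation; `chunk_by_key`, `chunk_fixed`, and the sample fields are hypothetical names introduced for the example.

```python
from collections import defaultdict


def chunk_by_key(records, key):
    """Group whole records by a logical key (entity, date, partition value)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    # One chunk per group; dicts preserve insertion order, so output order is stable.
    return [{"group": value, "records": rows} for value, rows in groups.items()]


def chunk_fixed(records, size):
    """Baseline: fixed-count chunks that ignore logical grouping."""
    return [records[i:i + size] for i in range(0, len(records), size)]


# Hypothetical sample records.
records = [
    {"customer": "acme", "day": "2024-01-01", "total": 10.5},
    {"customer": "globex", "day": "2024-01-01", "total": 7.25},
    {"customer": "acme", "day": "2024-01-02", "total": 3.0},
    {"customer": "globex", "day": "2024-01-02", "total": 1.0},
]

by_customer = chunk_by_key(records, "customer")  # one chunk per customer
fixed = chunk_fixed(records, 3)                  # first chunk mixes both customers
```

Both functions keep records whole, but only the keyed version guarantees that a retrieved chunk is about a single entity. A very large group may still need to be split further, which is where the precision-versus-completeness trade-off reappears.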

Key Connections
  • scoped_to Vector Indexing on Object Storage, S3 — chunking S3-stored structured data
  • enables Embedding Generation — chunks are the input to embedding models
  • enables RAG over Structured Data — chunked structured data feeds RAG retrieval
  • depends_on Apache Parquet — the source format for most structured data on S3

Definition

What it is

The practice of splitting S3-stored documents into semantically meaningful chunks (sections, paragraphs, tables) with preserved structural metadata, optimized for embedding generation and retrieval-augmented generation.

Why it exists

Naive fixed-size chunking of documents stored on S3 loses context boundaries, splits tables mid-row, and produces embeddings that mix unrelated content. Structured chunking respects document semantics, producing higher-quality embeddings and more accurate retrieval.
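
For the document case, one possible shape of boundary-aware splitting is sketched below. It assumes the PDF or HTML source has already been converted to Markdown-like text (that conversion is out of scope here), and `chunk_document` along with its `section` and `kind` fields are hypothetical names for illustration.

```python
def chunk_document(text):
    """Split a Markdown-like document into paragraph and table chunks, tagged by section."""
    chunks, section, block = [], None, []

    def flush():
        # Emit the lines collected so far as one chunk, then reset the buffer.
        if block:
            is_table = all(line.lstrip().startswith("|") for line in block)
            chunks.append({
                "section": section,
                "kind": "table" if is_table else "paragraph",
                "text": "\n".join(block),
            })
            block.clear()

    for line in text.splitlines():
        if line.startswith("#"):   # heading: close the open block, start a new section
            flush()
            section = line.lstrip("#").strip()
        elif not line.strip():     # blank line: block boundary
            flush()
        else:
            block.append(line)
    flush()
    return chunks


# Hypothetical sample document.
DOC = """# Pricing

Plans are billed monthly.

| plan | price |
| ---- | ----- |
| pro  | 20    |

# Support

Email support is included.
"""
chunks = chunk_document(DOC)
```

Because a table is only ever emitted as a complete block, it cannot be split mid-row, and the `section` tag keeps each chunk tied to the heading it appeared under.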

Primary use cases

Document preparation for RAG systems, embedding-optimized document splitting, structured parsing of PDFs and HTML from S3 corpora.
