Architecture

RAG over Structured Data

Summary

What it is

An architecture pattern that uses retrieval-augmented generation (RAG) to answer natural language questions against structured data (Iceberg tables, Parquet files) stored in S3. It combines text-to-SQL or schema-aware retrieval with LLM generation.

Where it fits

RAG over Structured Data bridges the gap between LLM-assisted data systems and traditional analytics. Instead of embedding and retrieving unstructured documents, this pattern retrieves table schemas, column statistics, and sample data from S3-backed catalogs to ground the LLM's SQL generation or data summarization.
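The schema-retrieval step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `TableMeta`, `CATALOG`, and the keyword-overlap scoring are all hypothetical stand-ins, and a real system would more likely read metadata from the table catalog and rank it with embeddings.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Hypothetical catalog entry; a real catalog would supply these fields."""
    name: str
    description: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> comment


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; snake_case names split into their parts."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question: str, table: TableMeta) -> int:
    """Count question tokens that also appear in the table's metadata."""
    meta_text = " ".join(
        [table.name, table.description, *table.columns, *table.columns.values()]
    )
    return len(tokens(question) & tokens(meta_text))


def retrieve_schemas(question: str, catalog: list[TableMeta], k: int = 2) -> list[TableMeta]:
    """Return up to k tables with a non-zero overlap score, best first."""
    ranked = sorted(catalog, key=lambda t: score(question, t), reverse=True)
    return [t for t in ranked[:k] if score(question, t) > 0]


def build_prompt(question: str, tables: list[TableMeta]) -> str:
    """Render the retrieved schemas as grounding context ahead of the question."""
    lines = ["Answer using only these tables:"]
    for t in tables:
        lines.append(f"TABLE {t.name}: {t.description}")
        for col, comment in t.columns.items():
            lines.append(f"  {col}: {comment}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Illustrative catalog contents (invented for this sketch).
CATALOG = [
    TableMeta("orders", "Customer orders with revenue",
              {"order_id": "primary key", "revenue_usd": "order revenue in USD"}),
    TableMeta("web_logs", "Raw HTTP access logs",
              {"url": "requested path", "status": "HTTP status code"}),
]

question = "What was total revenue from orders last month?"
hits = retrieve_schemas(question, CATALOG, k=1)
prompt = build_prompt(question, hits)
```

Only the `orders` schema ends up in the prompt here, because its name, description, and column comments share words with the question. A table with no description or comments has little to match on, which is why metadata quality matters for this step.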

Misconceptions / Traps
  • RAG over structured data is not just "text-to-SQL." It also includes retrieving relevant table schemas, data dictionaries, and business glossaries to contextualize the LLM's response.
  • Generated SQL must be validated and sandboxed. An LLM-generated query against production Iceberg tables can produce incorrect results or scan excessive data if not constrained.
  • Schema retrieval quality depends on catalog metadata richness. Tables without descriptions, column comments, or meaningful names produce poor retrieval results.
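As one possible way to approach the validation point above, the sketch below checks a generated query before it reaches the engine. `ALLOWED_TABLES`, `MAX_ROWS`, and `guard_sql` are hypothetical names, and the regex-based checks are deliberately naive; a production guard would more plausibly use a real SQL parser. Note also that a row `LIMIT` caps result size but does not by itself bound the data scanned, so engine-level limits would still be needed.

```python
import re

ALLOWED_TABLES = {"orders", "customers"}  # hypothetical allow-list
MAX_ROWS = 1000                           # hypothetical row cap
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|merge|truncate)\b", re.I
)


def guard_sql(sql: str) -> str:
    """Return a constrained version of sql, or raise ValueError if rejected."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?i)select\b", stmt):
        raise ValueError("only SELECT statements are allowed")
    # Conservative: also rejects these words inside string literals.
    if FORBIDDEN.search(stmt):
        raise ValueError("statement contains a write or DDL keyword")
    # Naive table extraction: identifiers that directly follow FROM or JOIN.
    tables = {
        t.lower()
        for t in re.findall(r"(?i)\b(?:from|join)\s+([A-Za-z_][\w.]*)", stmt)
    }
    unknown = tables - ALLOWED_TABLES
    if not tables or unknown:
        raise ValueError(f"tables not on the allow-list: {sorted(unknown)}")
    # Append a row cap when the query does not already end with one.
    if not re.search(r"(?i)\blimit\s+\d+\s*$", stmt):
        stmt = f"{stmt} LIMIT {MAX_ROWS}"
    return stmt
```

For example, `guard_sql("SELECT * FROM orders;")` returns the query with ` LIMIT 1000` appended, while `guard_sql("DROP TABLE orders")` and `guard_sql("SELECT * FROM payroll")` both raise `ValueError`.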

Key Connections
  • scoped_to LLM-Assisted Data Systems, Lakehouse — LLM-powered analytics on S3 data
  • depends_on Natural Language Querying — the LLM capability that generates SQL
  • depends_on Metadata Enrichment & Tagging — rich metadata improves retrieval quality
  • enables AI-Safe Views — constrained views limit what RAG queries can access
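
The constrained-view idea in the last connection can be illustrated with a small sketch. SQLite (from the Python standard library) is used here only as a stand-in for a lakehouse query engine, and the table, view, and column names are invented. In a real deployment the base table would presumably also be hidden from the LLM's credentials through engine or catalog grants, which this sketch does not model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, region TEXT, email TEXT);
    INSERT INTO customers VALUES
        (1, 'EU', 'a@example.com'),
        (2, 'US', 'b@example.com');
    -- The view exposes only columns considered safe for LLM-generated queries.
    CREATE VIEW customers_ai_safe AS
        SELECT id, region FROM customers;
    """
)

# Queries generated by the LLM would be pointed at the view, not the base table.
cur = conn.execute("SELECT * FROM customers_ai_safe ORDER BY id")
safe_columns = [d[0] for d in cur.description]
rows = cur.fetchall()
```

Because the view omits `email`, even a `SELECT *` issued against it cannot return that column.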

Definition

What it is

A retrieval-augmented generation pattern where LLMs answer questions by retrieving relevant rows or aggregates from structured tables on S3 (Iceberg, Delta, Parquet) rather than from unstructured document corpora.

Why it exists

Standard RAG assumes unstructured text chunks. Many high-value enterprise datasets live in structured lakehouse tables on S3. RAG over structured data bridges the gap, letting LLMs generate SQL or query structured metadata to ground their answers in precise, tabular facts.
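
The flow described above (retrieve schema, generate SQL, execute, answer from the rows) might look roughly like this. It is a sketch under several assumptions: SQLite stands in for the query engine over S3 tables, `fake_llm_sql` is a hypothetical stub that replaces a real LLM call, and the final answer is formatted directly rather than summarized by a second LLM call.

```python
import sqlite3
from typing import Callable


def fake_llm_sql(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a fixed query."""
    return (
        "SELECT region, SUM(revenue_usd) AS revenue "
        "FROM orders GROUP BY region ORDER BY region"
    )


def answer(question: str, conn: sqlite3.Connection,
           generate_sql: Callable[[str], str]) -> str:
    # 1. Retrieve: pull table definitions to ground the generation step.
    schema = "\n".join(
        row[0]
        for row in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
    )
    prompt = f"Schema:\n{schema}\n\nWrite one SELECT statement answering: {question}"
    # 2. Generate: ask the (stubbed) model for SQL.
    sql = generate_sql(prompt)
    # 3. Execute: run the query and collect tabular facts.
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # 4. Answer: a real system would pass these facts back to the LLM.
    facts = "; ".join(
        ", ".join(f"{c}={v}" for c, v in zip(cols, row)) for row in rows
    )
    return f"{question} -> {facts}"


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (region TEXT, revenue_usd INTEGER);
    INSERT INTO orders VALUES ('EU', 100), ('EU', 50), ('US', 70);
    """
)
result = answer("Revenue by region?", conn, fake_llm_sql)
```

The answer is built only from rows the query returned, which is what distinguishes this pattern from letting the model answer from its own parameters.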

Primary use cases

Natural language querying of lakehouse tables, LLM-powered business intelligence, conversational analytics over S3-stored datasets.
