Natural Language Querying
Summary
What it is
Using LLMs to translate natural language questions into executable queries (SQL, API calls) over S3-backed datasets.
Where it fits
Natural language querying is the accessibility layer of S3-backed data systems. It lets business users ask questions in plain language and get results from Iceberg tables, Parquet files, or other S3-backed datasets without knowing SQL.
Misconceptions / Traps
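- Natural language to SQL is not a solved problem. LLMs can generate plausible-looking SQL that runs without error but answers the wrong question, for example by joining on the wrong key or applying the wrong filter. Guardrails (schema validation, result sampling, SQL review) are essential.
- Query accuracy depends heavily on schema metadata quality. Well-documented columns, table descriptions, and sample values give the model the context it needs to pick the right tables, columns, and filter values.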
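The guardrails mentioned above can be sketched in a few lines. This is a minimal illustration, not a production validator: the `orders` schema and the function names `check_generated_sql` and `sample_query` are hypothetical, and an in-memory SQLite database stands in for the real engine. SQLite's dialect differs from Trino and DuckDB, so a real deployment would more likely validate against the target engine itself.

```python
import sqlite3
from typing import Tuple

# Hypothetical schema for illustration; a real system would read it from a catalog.
SCHEMA_DDL = """
CREATE TABLE orders (
    order_id INTEGER,
    customer_id INTEGER,
    order_date TEXT,
    total_amount REAL
);
"""


def check_generated_sql(sql: str, schema_ddl: str = SCHEMA_DDL) -> Tuple[bool, str]:
    """Return (ok, reason) for a piece of LLM-generated SQL."""
    statement = sql.strip().rstrip(";").strip()

    # Guardrail 1: accept a single read-only statement.
    # This check is deliberately conservative: it also rejects CTEs and
    # any query that contains a semicolon inside a string literal.
    if ";" in statement:
        return False, "multiple statements are not allowed"
    first_word = statement.split(None, 1)[0].lower() if statement else ""
    if first_word != "select":
        return False, "only SELECT queries are allowed"

    # Guardrail 2: compile the query against an empty copy of the schema.
    # EXPLAIN prepares the statement without reading any data, so unknown
    # tables or columns and syntax errors surface here.
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute("EXPLAIN " + statement)
    except sqlite3.Error as exc:
        return False, f"schema validation failed: {exc}"
    finally:
        conn.close()
    return True, "ok"


def sample_query(sql: str, limit: int = 20) -> str:
    """Wrap a query so that only a few rows come back for human review."""
    inner = sql.strip().rstrip(";").strip()
    return f"SELECT * FROM ({inner}) AS preview LIMIT {limit}"


if __name__ == "__main__":
    print(check_generated_sql("SELECT customer_id FROM orders"))
    print(check_generated_sql("SELECT revenue FROM orders"))
    print(sample_query("SELECT * FROM orders;", 5))
```

A check like this catches queries that reference columns the model invented, but it cannot tell whether a valid query answers the question that was asked. That is why result sampling and SQL review remain part of the list.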
Key Connections
- depends_on General-Purpose LLM: requires language understanding and SQL generation
- augments Trino, DuckDB: generates SQL for these engines
- scoped_to LLM-Assisted Data Systems, Lakehouse
Definition
What it is
Using LLMs to translate natural language questions into executable queries (SQL, API calls) over S3-backed datasets, making data accessible to non-technical users.
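One common way to set up this translation is to hand the model a description of the available tables alongside the user's question. The sketch below shows only that prompt-assembly step. The `Table` and `Column` classes, the `build_prompt` function, and the prompt wording are all assumptions made for illustration. The call to the LLM and the execution of the returned SQL are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    dtype: str
    description: str = ""
    sample_values: List[str] = field(default_factory=list)


@dataclass
class Table:
    name: str
    description: str
    columns: List[Column]


def build_prompt(question: str, tables: List[Table], dialect: str = "Trino") -> str:
    """Assemble a text-to-SQL prompt from table metadata and a question."""
    lines = [
        f"Write one {dialect} SQL SELECT statement that answers the question.",
        "Use only the tables and columns listed below.",
        "",
    ]
    for table in tables:
        lines.append(f"Table {table.name}: {table.description}")
        for col in table.columns:
            entry = f"  - {col.name} ({col.dtype})"
            # Descriptions and sample values are optional, but they are the
            # metadata that tells the model what a column actually holds.
            if col.description:
                entry += f": {col.description}"
            if col.sample_values:
                entry += f" [examples: {', '.join(col.sample_values)}]"
            lines.append(entry)
        lines.append("")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical table metadata, as it might come from a data catalog.
    orders = Table(
        name="orders",
        description="One row per customer order.",
        columns=[
            Column("order_date", "date", "Date the order was placed", ["2024-01-15"]),
            Column("total_amount", "double", "Order value in USD"),
        ],
    )
    print(build_prompt("What was total revenue in January 2024?", [orders]))
```

The resulting string would be sent to whichever model the system uses, and the SQL that comes back should be treated as untrusted until it has been checked.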
Why it exists
S3-backed lakehouses contain valuable data accessible only through SQL or programming interfaces. Natural language querying removes this barrier, allowing business users to ask questions in plain language and get results from Iceberg, Parquet, or other S3-backed data.
Primary use cases
Self-service analytics over lakehouse data, natural language to SQL for Trino/DuckDB queries, conversational interfaces over S3-backed datasets.
Relationships
Outbound Relationships
scoped_to
depends_on
Inbound Relationships
Resources
Official AWS sample repository for natural language querying of S3 data using Athena and generative AI text-to-SQL, a reference architecture for the pattern.
AWS ML Blog detailing a production text-to-SQL architecture using Bedrock (Claude), Glue Data Catalog metadata, and Athena for querying S3 data lakes with natural language.
AWS Big Data Blog on improving text-to-SQL accuracy by enriching Glue Data Catalog metadata, addressing the schema-to-SQL grounding challenge for S3 data.