LLM Capability

Schema Inference

Summary

What it is

Using LLMs to infer or suggest schemas from semi-structured data (JSON, CSV, nested formats) stored in S3.

Where it fits

Schema inference automates the tedious process of determining what fields, types, and structures exist in S3 data. It accelerates onboarding new datasets and proposes schema evolution changes for existing tables.

Misconceptions / Traps

LLM-inferred schemas are suggestions, not ground truth. Always validate against actual data samples before applying to production tables.
Sampling matters. Schema inference from a small sample may miss rare fields or variant types that appear in the full dataset.
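
As a rough illustration of the validation step, the sketch below checks a proposed schema against parsed sample records and reports fields the schema missed along with type mismatches. The schema format (field name mapped to a Python type name), the `validate_schema` helper, and the sample records are all assumptions made for this example, not part of any particular tool.

```python
import json

# Hypothetical LLM-proposed schema: field name -> expected Python type name.
proposed_schema = {"id": "int", "name": "str", "price": "float"}

def validate_schema(schema, records):
    """Return fields absent from the schema and (row, field, type) mismatches."""
    unknown_fields = set()
    mismatches = []
    for index, record in enumerate(records):
        for field, value in record.items():
            if field not in schema:
                # A field the proposal never mentioned (for example a rare one).
                unknown_fields.add(field)
            elif value is not None and type(value).__name__ != schema[field]:
                mismatches.append((index, field, type(value).__name__))
    return sorted(unknown_fields), mismatches

# Illustrative newline-delimited JSON sample (made-up data).
lines = [
    '{"id": 1, "name": "widget", "price": 9.99}',
    '{"id": 2, "name": "gadget", "price": 12}',
    '{"id": 3, "name": "gizmo", "price": 4.5, "discount_code": "SPRING"}',
]
records = [json.loads(line) for line in lines]

unknown, mismatches = validate_schema(proposed_schema, records)
print(unknown)     # ['discount_code']
print(mismatches)  # [(1, 'price', 'int')]
```

Here the third record carries a field that a smaller sample would have missed, and the second record shows a variant type (an integer where a float was proposed). Running a check like this over a larger sample than the one used for inference is one way to surface both problems before a schema reaches a production table.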

Key Connections

depends_on General-Purpose LLM — requires language understanding for schema analysis
solves Schema Evolution — automates schema change proposals
augments Apache Iceberg — can suggest schema changes for Iceberg tables
scoped_to LLM-Assisted Data Systems, Table Formats

Definition

What it is

Using LLMs to infer, suggest, or validate schemas from semi-structured data (JSON, CSV with inconsistent headers, nested formats) stored in S3.

Why it exists

Semi-structured data arriving in S3 often has no declared schema. Manually inspecting files to determine field names, types, and nesting is tedious and error-prone. LLMs can analyze sample data and propose schemas automatically.
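
One way to prepare sample data for an LLM is to first condense it into a compact profile of field paths and observed types, which is usually far smaller than the raw files. The following is a minimal sketch of that profiling step only. The `profile` and `profile_records` helpers, the dotted-path notation, and the sample records are assumptions for illustration, and the call to the model itself is deliberately left out.

```python
import json
from collections import defaultdict

# Map Python types produced by json.loads to JSON type names.
JSON_TYPE_NAMES = {str: "string", bool: "boolean", int: "integer",
                   float: "number", type(None): "null"}

def profile(value, path, observed):
    """Record the type seen at each dotted path, recursing into nesting."""
    if isinstance(value, dict):
        if path:
            observed[path].add("object")
        for key, child in value.items():
            profile(child, f"{path}.{key}" if path else key, observed)
    elif isinstance(value, list):
        observed[path].add("array")
        for child in value:
            profile(child, f"{path}[]", observed)
    else:
        observed[path].add(JSON_TYPE_NAMES[type(value)])

def profile_records(records):
    """Summarize every field path and the set of types observed for it."""
    observed = defaultdict(set)
    for record in records:
        profile(record, "", observed)
    return {path: sorted(types) for path, types in sorted(observed.items())}

# Illustrative records (made-up data) with nesting and a variant type.
sample_records = [
    {"user": {"id": 1, "tags": ["a", "b"]}, "amount": 10},
    {"user": {"id": "u-2"}, "amount": 2.5, "note": None},
]
summary = profile_records(sample_records)
print(json.dumps(summary, indent=2))
```

The resulting summary shows, for instance, that `user.id` appears as both an integer and a string. A profile like this could be included in a prompt so that the model proposes a schema from observed evidence instead of from a single example record.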

Primary use cases

Schema suggestion for new S3 datasets, validation of inferred schemas against existing table definitions, automated schema evolution proposals.
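
The validation and evolution use cases can be sketched as a comparison between an existing table definition and a newly inferred schema. The example below is an assumption-laden illustration: schemas are plain dictionaries of column name to type name, `propose_evolution` is a hypothetical helper, and the set of safe widenings is loosely modeled on the int-to-long and float-to-double promotions that Apache Iceberg permits.

```python
# Assumed set of type changes treated as safe to apply automatically.
SAFE_WIDENINGS = {("int", "long"), ("float", "double")}

def propose_evolution(existing, inferred):
    """Compare two column -> type mappings and list proposed changes."""
    proposals = []
    for name, new_type in inferred.items():
        if name not in existing:
            proposals.append(("add_column", name, new_type))
        elif existing[name] != new_type:
            if (existing[name], new_type) in SAFE_WIDENINGS:
                proposals.append(("widen_type", name, new_type))
            else:
                # Incompatible change: leave the decision to a human.
                proposals.append(("needs_review", name, new_type))
    return proposals

# Illustrative schemas (made-up columns).
existing_table = {"id": "int", "price": "float", "name": "string"}
inferred_schema = {"id": "long", "price": "string", "name": "string",
                   "discount_code": "string"}

changes = propose_evolution(existing_table, inferred_schema)
for change in changes:
    print(change)
```

Columns that exist in the table but are absent from the inferred schema are intentionally not proposed for removal in this sketch, since a sample may simply have missed a rarely populated field.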

Relationships

Outbound Relationships

scoped_to

LLM-Assisted Data Systems, Table Formats

depends_on

General-Purpose LLM

solves

Schema Evolution

augments

Apache Iceberg

Inbound Relationships

enables

General-Purpose LLM, Code-Focused LLM

Resources

Docs, High

docs.databricks.com/aws/en/ingestion/cloud-object-storage/au...

Databricks Auto Loader documentation for automated schema inference from S3-stored files, sampling up to 50 GB to detect column types and handle schema evolution.

Docs, High

docs.aws.amazon.com/glue/latest/dg/edit-jobs-source-s3-files...

Official AWS Glue documentation for schema inference from S3 data sources, covering automatic type detection, crawler-based discovery, and Glue Studio's Infer Schema feature.

Paper, Medium

arxiv.org/html/2509.04632

Research paper on using LLMs for schema inference on tabular data repositories, inferring entity types, attributes, and relationships from column headers and cell values.