LLM Capability

Schema Inference

Summary

What it is

Using LLMs to infer or suggest schemas from semi-structured data (JSON, CSV, nested formats) stored in S3.

Where it fits

Schema inference automates the tedious process of determining which fields, types, and structures exist in S3 data. It speeds up onboarding of new datasets and can propose schema evolution changes for existing tables.

Misconceptions / Traps

  • LLM-inferred schemas are suggestions, not ground truth. Always validate against actual data samples before applying to production tables.
  • Sampling matters. A schema inferred from a small sample may miss rare fields or variant types that appear only elsewhere in the full dataset.
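
Both traps above come down to checking a proposed schema against real records. A minimal sketch of such a check follows, assuming the proposed schema is a flat mapping of field name to type name. The function validate_schema, the TYPE_CHECKS table, and the type names ("string", "long", "double", "boolean") are hypothetical and chosen for illustration; they are not tied to any particular table format.

```python
from typing import Any

# Hypothetical type names for illustration; a real table format defines its own.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "double": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_schema(proposed: dict[str, str], records: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems found when checking records against a proposed schema."""
    problems = []
    for i, record in enumerate(records):
        for field, value in record.items():
            if field not in proposed:
                # A field the sample used for inference never contained.
                problems.append(f"record {i}: field '{field}' missing from schema")
            elif value is not None and not TYPE_CHECKS[proposed[field]](value):
                # A variant type the proposed schema does not allow.
                problems.append(f"record {i}: field '{field}' is not {proposed[field]}")
    return problems


proposed = {"id": "long", "name": "string"}
records = [
    {"id": 1, "name": "a"},
    {"id": "2", "name": "b", "tags": "x"},  # variant type and an unseen field
]
for problem in validate_schema(proposed, records):
    print(problem)
```

Running the check on a larger or differently drawn sample than the one used for inference is what surfaces the rare fields and variant types; an empty result on the original sample alone says little.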

Key Connections

  • depends_on General-Purpose LLM — requires language understanding for schema analysis
  • solves Schema Evolution — automates schema change proposals
  • augments Apache Iceberg — can suggest schema changes for Iceberg tables
  • scoped_to LLM-Assisted Data Systems, Table Formats

Definition

What it is

Using LLMs to infer, suggest, or validate schemas from semi-structured data (JSON, CSV with inconsistent headers, nested formats) stored in S3.

Why it exists

Semi-structured data arriving in S3 often has no declared schema. Manually inspecting files to determine field names, types, and nesting is tedious and error-prone. LLMs can analyze sample data and propose schemas automatically.
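
Before an LLM sees anything, the sample usually has to be condensed into something small enough to put in a prompt. One possible approach, sketched here under the assumption that the data is newline-delimited JSON, is to flatten each record into dotted field paths and collect the JSON types observed at each path. The names profile_records and json_type are hypothetical, and the LLM call itself is deliberately left out.

```python
import json
from collections import defaultdict


def json_type(value):
    """Name the JSON type of a decoded Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def profile_records(lines):
    """Map each dotted field path to the sorted list of JSON types observed there."""
    observed = defaultdict(set)

    def walk(value, path):
        if path:
            observed[path].add(json_type(value))
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                walk(item, f"{path}[]")  # "[]" marks array elements

    for line in lines:
        if line.strip():
            walk(json.loads(line), "")
    return {path: sorted(types) for path, types in observed.items()}


sample = [
    '{"id": 1, "user": {"name": "a"}}',
    '{"id": "2", "tags": ["x"]}',
]
print(json.dumps(profile_records(sample), indent=2))
```

A path with more than one observed type (here, id) is exactly the kind of ambiguity worth passing to the model along with a few raw example values, rather than resolving silently.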

Primary use cases

Schema suggestion for new S3 datasets, validation of inferred schemas against existing table definitions, and automated schema evolution proposals.
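
The second and third use cases can be illustrated with a simple diff between an existing table definition and an inferred schema. This is a minimal sketch, assuming both are flat mappings of column name to type name; the function propose_evolution and the change labels ("add_column", "change_type", "not_observed") are hypothetical and do not correspond to any table format's API.

```python
def propose_evolution(existing, inferred):
    """Compare an inferred schema with an existing one and list proposed changes.

    Each change is a (kind, column, detail) tuple. Nothing is applied here;
    the output is a proposal intended for human review.
    """
    changes = []
    for name, new_type in inferred.items():
        if name not in existing:
            changes.append(("add_column", name, new_type))
        elif existing[name] != new_type:
            # Flagged only: whether the change is safe depends on the table format.
            changes.append(("change_type", name, f"{existing[name]} -> {new_type}"))
    for name, old_type in existing.items():
        if name not in inferred:
            # Absent from the sample is not proof the column is unused.
            changes.append(("not_observed", name, old_type))
    return changes


existing = {"id": "long", "name": "string", "legacy": "string"}
inferred = {"id": "long", "name": "string", "score": "double"}
for change in propose_evolution(existing, inferred):
    print(change)
```

Columns that exist in the table but were not seen in the sample are reported as not_observed rather than proposed for removal, since a small sample may simply have missed them.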

Relationships

Inbound Relationships

Resources