LLM Capability

Schema Drift Detection

Monitoring S3-stored datasets for unexpected schema changes — new columns, type changes, missing fields, structural shifts — and alerting before downstream consumers break.

Summary

What it is

Monitoring S3-stored datasets for unexpected schema changes — new columns, type changes, missing fields, structural shifts — and alerting before downstream consumers break.

Where it fits

Schema drift detection complements schema evolution. Table formats handle planned schema changes; drift detection catches unplanned ones, such as a data producer silently adding a column, changing a type, or dropping a field, before they propagate to dashboards and ML models.
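The three unplanned changes named above (an added column, a changed type, a dropped field) can be found with a plain structural comparison. The following is a minimal sketch, assuming each schema is available as a `{column: type_name}` mapping; the `SchemaDiff` class, the `diff_schemas` function, and the example column names are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    added: dict = field(default_factory=dict)         # column -> observed type
    removed: dict = field(default_factory=dict)       # column -> expected type
    type_changed: dict = field(default_factory=dict)  # column -> (expected, observed)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.type_changed)


def diff_schemas(expected: dict, observed: dict) -> SchemaDiff:
    """Compare two {column: type_name} mappings and collect the differences."""
    diff = SchemaDiff()
    for name, new_type in observed.items():
        if name not in expected:
            diff.added[name] = new_type
        elif expected[name] != new_type:
            diff.type_changed[name] = (expected[name], new_type)
    for name, old_type in expected.items():
        if name not in observed:
            diff.removed[name] = old_type
    return diff


# Hypothetical example: the producer changed one type, dropped one column,
# and added another without announcing it.
expected = {"order_id": "int64", "amount": "double", "currency": "string"}
observed = {"order_id": "string", "amount": "double", "region": "string"}
drift = diff_schemas(expected, observed)
```

A check like this only sees shape. It says nothing about whether a column that kept its name and type still means the same thing.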

Misconceptions / Traps
  • Schema drift is different from schema evolution. Evolution is intentional and managed; drift is unintentional and must be detected. Both need handling, but the tools are different.
  • LLM-based drift detection goes beyond the structural comparison that tools like Great Expectations handle. LLMs can detect semantic drift, where a field's meaning changes even though its type does not (for example, an amount field that switches from dollars to cents).
Key Connections
  • solves Schema Evolution — catches unplanned schema changes
  • augments Write-Audit-Publish — automated drift check in the audit step
  • depends_on General-Purpose LLM — for semantic drift detection
  • scoped_to LLM-Assisted Data Systems, Table Formats

Definition

What it is

Using LLMs to continuously monitor S3-stored datasets for unexpected schema changes — new columns, altered types, renamed fields, missing required fields — and alert before downstream pipelines break.
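Renamed fields are the hardest item in that list to catch structurally, because a rename looks like one dropped column plus one new column. One possible heuristic, sketched below, pairs removed and added columns that share a type and have similar names. The function name `rename_candidates`, the similarity threshold, and the sample columns are assumptions made for illustration; the output is a list of candidates to review, not a verdict.

```python
from difflib import SequenceMatcher


def rename_candidates(removed: dict, added: dict, min_similarity: float = 0.6):
    """Pair removed and added columns that share a type and have similar names.

    Both arguments are {column: type_name} mappings. Returns a list of
    (old_name, new_name, similarity) tuples, most similar first.
    """
    pairs = []
    for old_name, old_type in removed.items():
        for new_name, new_type in added.items():
            if old_type != new_type:
                continue  # a rename is assumed to keep the type
            score = SequenceMatcher(None, old_name.lower(), new_name.lower()).ratio()
            if score >= min_similarity:
                pairs.append((old_name, new_name, round(score, 2)))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical columns that disappeared from and appeared in a dataset.
removed = {"cust_id": "int64", "signup_ts": "timestamp"}
added = {"customer_id": "int64", "channel": "string"}
candidates = rename_candidates(removed, added)
```

Name similarity misses renames such as `ts` to `event_time`, which is one place where an LLM that reads column descriptions and sample values could add signal.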

Why it exists

Schema changes in S3-stored data (new JSON fields, altered CSV headers, Parquet column additions) propagate silently through data lakes. LLM-based detection can reason about what a field means, not only its name and type, so it can catch breaking changes that rule-based checks miss.
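A semantic check of this kind can be sketched as a prompt that shows the model a column's documented meaning next to baseline and current sample values. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, and the `ask_llm` callable, which stands in for whatever model client is in use. `fake_llm` is a canned stub so the example runs without a model.

```python
import json
from typing import Callable


def build_semantic_drift_prompt(column: str, description: str,
                                baseline_samples: list, current_samples: list) -> str:
    """Assemble a prompt asking whether a column's meaning has changed."""
    return (
        "You are reviewing a dataset column for semantic drift.\n"
        f"Column: {column}\n"
        f"Documented meaning: {description}\n"
        f"Baseline sample values: {json.dumps(baseline_samples)}\n"
        f"Current sample values: {json.dumps(current_samples)}\n"
        'Reply with JSON only: {"drifted": true or false, "reason": "<short explanation>"}'
    )


def check_semantic_drift(ask_llm: Callable[[str], str], **column_info) -> dict:
    """Send the prompt through ask_llm (hypothetical) and parse the reply defensively."""
    reply = ask_llm(build_semantic_drift_prompt(**column_info))
    try:
        verdict = json.loads(reply)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict):
        return {"drifted": None, "reason": "unparseable model reply"}
    return {"drifted": verdict.get("drifted"), "reason": verdict.get("reason", "")}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same canned verdict.
    return '{"drifted": true, "reason": "values look like cents, not dollars"}'


result = check_semantic_drift(
    fake_llm,
    column="amount",
    description="Order total in US dollars",
    baseline_samples=[19.99, 5.0, 120.5],
    current_samples=[1999, 500, 12050],
)
```

Treating an unparseable reply as "unknown" rather than "no drift" keeps a malformed model response from silently passing the check.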

Primary use cases

Automated schema monitoring for S3 data lakes, pre-ingestion schema validation, schema change impact analysis.
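Pre-ingestion validation can be illustrated with a small gate that infers field types from a batch of JSON records and compares them with a declared contract. This is a sketch under assumptions: the contract format (`{field: [allowed Python type names]}`), the function names, and the sample records are hypothetical, and only top-level fields are inspected.

```python
import json


def infer_schema(records: list) -> dict:
    """Map each top-level field to the set of Python type names seen for it."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def validate_batch(records: list, contract: dict) -> list:
    """Return sorted violation messages; an empty list means the batch passes."""
    observed = infer_schema(records)
    problems = []
    for name, allowed in contract.items():
        # A contract field must be present in every record of the batch.
        if any(name not in record for record in records):
            problems.append(f"missing field: {name}")
        unexpected = observed.get(name, set()) - set(allowed)
        if unexpected:
            problems.append(f"type change in {name}: {sorted(unexpected)}")
    for name in observed.keys() - contract.keys():
        problems.append(f"new field: {name}")
    return sorted(problems)


# Hypothetical JSON-lines batch about to be written to the lake.
lines = [
    '{"id": 1, "email": "a@example.com"}',
    '{"id": "2", "email": "b@example.com", "plan": "pro"}',
]
records = [json.loads(line) for line in lines]
contract = {"id": ["int"], "email": ["str"], "created_at": ["str"]}
problems = validate_batch(records, contract)
```

In a pipeline, a non-empty `problems` list would block the write or route the batch to quarantine, depending on how strict the contract is meant to be.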

Recent developments

Latest signals
  • Data drift vs model drift: distinct categories with distinct response patterns. Data drift is a change in the input distribution; model drift is a degradation in predictive performance, often as a result of data drift. Schema drift sits upstream of both: a changed type, a changed unit, or a missing field alters the values that reach the model, which can surface first as data drift and then as model drift. Per Orq.ai (Understanding Model Drift and Data Drift in LLMs 2026 Guide).
  • Semantic drift as an LLM-specific drift category. In LLM systems, semantic drift occurs when the embeddings produced by the model's embedding layer or generated for RAG retrieval shift relative to the vector space the system was calibrated against. This is distinct from schema-level shape changes, and from the field-level sense of a column whose meaning changes. Per Stack Pulsar (LLM Model Drift Detection 2026).
  • Statistical drift detection toolkit: PSI, KS, Wasserstein, JS/KL, chi-square. Production drift detection in 2026 standardizes on five measures: the Population Stability Index (PSI), the Kolmogorov-Smirnov (KS) test, Wasserstein distance, Jensen-Shannon and Kullback-Leibler divergence, and the chi-square test. Each suits a different kind of distribution change; for example, chi-square applies to categorical fields and KS to continuous ones. LLM/RAG drift adds embedding similarity distance over time as a sixth signal.
  • Tools: Evidently AI as the canonical integration target. Evidently AI emerged as the drift-detection framework most commonly integrated into pipelines in 2026. It covers statistical drift, embedding-based drift, automated evaluation pipelines, and LLM behavior degradation in one stack.
  • Automated re-training pipelines, real-time performance tracking, and continuous refinement. The 2026 best-practice guidance is to invest in infrastructure for all three. Per All Days Tech (Model Drift in Production 2026: Runbook).
  • Schema drift detection upstream of data drift as an early-warning signal. Schema changes caught before ingestion can be handled before they turn into data drift that reaches models. LLM-driven schema-drift detectors can parse upstream API specs, source database schemas, and Kafka topic registries, and flag breaking changes before the data lands in S3. Per DasRoot (How to Monitor LLM Drift in Production).
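Three of the signals listed above can be sketched with the standard library alone: PSI over equal-width bins, the two-sample KS statistic, and a cosine distance between embedding centroids. This is a minimal illustration, not how Evidently AI or any other framework computes them. The binning scheme, the `eps` smoothing, and the centroid-based embedding comparison are assumptions, and any alert thresholds would have to be tuned per dataset.

```python
import math
from bisect import bisect_right


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    p, q = fractions(baseline), fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def ks_statistic(baseline, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(baseline), sorted(current)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)


def centroid_cosine_distance(baseline_vecs, current_vecs):
    """Cosine distance between the mean embedding of each window."""
    def centroid(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    u, v = centroid(baseline_vecs), centroid(current_vecs)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


# Synthetic data: the current window is shifted toward the upper half of the range.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
psi_value = psi(baseline, shifted)
ks_value = ks_statistic(baseline, shifted)
```

The centroid comparison is the crudest possible embedding signal, since two very different clouds of vectors can share a mean. It is shown only to make the "embedding distance over time" idea concrete.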
