LLM Capability

Data Classification

Summary

What it is

Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — by topic, sensitivity level, or compliance category.

Where it fits

Data classification enables governance over S3 data lakes. It identifies PII, classifies documents by sensitivity, and routes data to the appropriate processing pipelines. These tasks become critical at scale, where manual review is impractical.
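
As a minimal sketch of the classification step described above, the following Python constrains the model to a fixed label set and falls back to the most restrictive label when the reply is not recognized. The label names, the `build_prompt` and `classify` helpers, and the `fake_llm` stand-in are all hypothetical; a real deployment would swap in a call to whichever model endpoint it uses.

```python
from typing import Callable

# Hypothetical label set; a real deployment would define its own taxonomy.
SENSITIVITY_LABELS = ("public", "internal", "confidential", "restricted")


def build_prompt(text: str, max_chars: int = 4000) -> str:
    """Build a prompt that asks for exactly one label from the fixed set."""
    labels = ", ".join(SENSITIVITY_LABELS)
    return (
        "Classify the document below by sensitivity level.\n"
        f"Answer with exactly one of: {labels}.\n\n"
        f"Document:\n{text[:max_chars]}"  # truncate to bound per-object cost
    )


def classify(text: str, llm: Callable[[str], str]) -> str:
    """Return a label from SENSITIVITY_LABELS for the given text.

    `llm` is any callable that maps a prompt to the model's text reply.
    Unrecognized replies fall back to the most restrictive label.
    """
    reply = llm(build_prompt(text)).strip().lower()
    return reply if reply in SENSITIVITY_LABELS else "restricted"


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "Confidential\n" if "social security" in prompt.lower() else "public"


label = classify("Employee record with social security number on file", fake_llm)
```

Failing closed (treating an unparseable reply as "restricted") is one possible design choice; it trades extra manual review for a lower risk of under-classifying sensitive data.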

Misconceptions / Traps

  • Classification accuracy varies by data type and domain. General-purpose LLMs may misclassify domain-specific content. Fine-tuned or domain-adapted models can improve accuracy.
  • Classification is not a substitute for proper access controls. Tagging data as "sensitive" does not protect it — IAM policies and encryption must enforce the classification.
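
To illustrate the second point, here is a hedged sketch of how a classification tag might be tied to enforcement: a bucket policy that denies reads of objects tagged as sensitive unless the caller carries a matching principal tag. It relies on the `s3:ExistingObjectTag/<key>` and `aws:PrincipalTag/<key>` condition keys; the bucket name, the `sensitivity` tag key, and the `clearance` principal tag are assumptions made for the example.

```python
import json


def deny_sensitive_reads_policy(
    bucket: str,
    tag_key: str = "sensitivity",   # hypothetical object tag written by the classifier
    tag_value: str = "sensitive",
) -> dict:
    """Build a bucket policy that blocks GetObject on tagged objects.

    Both condition operators must match for the Deny to apply: the object
    carries the sensitivity tag AND the caller's "clearance" principal tag
    is missing or different.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyReadOfSensitiveObjects",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value},
                    # "clearance" is a hypothetical principal tag name.
                    "StringNotEquals": {"aws:PrincipalTag/clearance": tag_value},
                },
            }
        ],
    }


policy = deny_sensitive_reads_policy("example-data-lake")
policy_json = json.dumps(policy, indent=2)  # document to attach as the bucket policy
```

A policy like this only holds if permission to change object tags is restricted as well; otherwise anyone who can retag an object can remove the protection.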

Key Connections

  • depends_on General-Purpose LLM — requires content understanding
  • augments Apache Iceberg — enriches table metadata with classification tags
  • constrained_by High Cloud Inference Cost — per-object processing is expensive
  • scoped_to LLM-Assisted Data Systems, Metadata Management
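
The cost constraint above can be made concrete with a back-of-the-envelope estimate. The sketch below multiplies object count by an assumed per-object token cost; the token counts and prices are placeholders chosen for illustration, not figures from any provider.

```python
def estimate_classification_cost(
    num_objects: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Estimate the total inference cost of classifying every object once."""
    per_object = (
        (avg_input_tokens / 1000) * usd_per_1k_input
        + (avg_output_tokens / 1000) * usd_per_1k_output
    )
    return num_objects * per_object


# Placeholder numbers: 10 million objects, 2,000 input tokens each,
# a 10-token label as output, and made-up per-token prices.
total_usd = estimate_classification_cost(
    num_objects=10_000_000,
    avg_input_tokens=2_000,
    avg_output_tokens=10,
    usd_per_1k_input=0.001,
    usd_per_1k_output=0.002,
)
# Roughly 20,200 USD under these assumed prices.
```

Because input tokens dominate, truncating or sampling object content before classification is usually the first lever for reducing this figure.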

Definition

What it is

Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — classifying documents by topic, sensitivity level, or compliance category.

Why it exists

S3 buckets accumulate vast quantities of unlabeled data. Classification enables governance (identifying PII), organization (routing data to correct processing pipelines), and discovery (finding relevant data across a large lake).

Primary use cases

PII detection in S3-stored documents, automated data governance tagging, content-based routing in data lake ingestion.
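
The tagging and routing use cases can be sketched as two small helpers: one that turns classification labels into the `TagSet` structure that boto3's `put_object_tagging` accepts in its `Tagging` argument, and one that picks a destination prefix from the category. The route table, prefix names, and helper names are hypothetical.

```python
from typing import Dict

# Hypothetical mapping from classification category to destination prefix.
ROUTES = {
    "pii": "quarantine/",
    "financial": "finance-pipeline/",
    "general": "standard-pipeline/",
}

MAX_S3_OBJECT_TAGS = 10  # S3 allows at most 10 tags per object


def to_tag_set(labels: Dict[str, str]) -> dict:
    """Convert labels into the Tagging structure used for S3 object tags.

    The result can be passed as, for example:
    s3.put_object_tagging(Bucket=..., Key=..., Tagging=to_tag_set(labels))
    """
    if len(labels) > MAX_S3_OBJECT_TAGS:
        raise ValueError("S3 objects support at most 10 tags")
    return {"TagSet": [{"Key": k, "Value": v} for k, v in sorted(labels.items())]}


def route_key(key: str, category: str) -> str:
    """Return the destination key; unknown categories go to manual review."""
    prefix = ROUTES.get(category, "manual-review/")
    return prefix + key.lstrip("/")


tagging = to_tag_set({"sensitivity": "confidential", "category": "pii"})
destination = route_key("raw/2024/report.pdf", "pii")
```

Sending unrecognized categories to a manual-review prefix keeps misclassified objects out of automated pipelines until a person has looked at them.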

Relationships

Inbound Relationships

Resources