Browse the Index

Object Storage

19 connections 3 resources

The storage paradigm of flat-namespace, HTTP-accessible binary objects with metadata. Data is addressed by bucket and key, not by filesystem path.
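
As a rough illustration of bucket-and-key addressing, the sketch below splits an `s3://` style URI into its two parts. The helper name `parse_s3_uri` and the example URI are hypothetical; the point is only that the slashes inside a key are ordinary characters, not directories.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Everything after the bucket is one opaque key; "/" has no special meaning.
    return parsed.netloc, parsed.path.removeprefix("/")


# Hypothetical bucket and key, used only for illustration.
bucket, key = parse_s3_uri("s3://example-bucket/raw/2024/events.parquet")
print(bucket, key)
```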

Lakehouse

14 connections 3 resources

The convergence of data lake storage (raw files on object storage) with data warehouse capabilities — ACID transactions, schema enforcement, SQL acces...

Data Lake

The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream...

Table Formats

15 connections 4 resources

The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collectio...

Vector Indexing on Object Storage

7 connections 3 resources

The practice of building and querying vector indexes over embeddings derived from data stored in S3.

LLM-Assisted Data Systems

14 connections 3 resources

The intersection of large language models and S3-centric data infrastructure. Scoped strictly to cases where LLMs operate on, enhance, or derive value...

Metadata Management

The discipline of maintaining catalogs, schemas, statistics, and descriptive information about objects and datasets stored in S3.

Data Versioning

2 connections 3 resources

Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback.

Technology

AWS S3

10 connections 4 resources

Amazon's fully managed object storage service — the origin and reference implementation of the S3 API.

MinIO

7 connections 4 resources

An open-source, S3-compatible object storage server designed for high performance and self-hosted deployment.

Ceph

A distributed storage system providing object, block, and file storage in a unified platform. It offers S3 compatibility through its RADOS Gateway (RGW).

Apache Ozone

A scalable, distributed object storage system in the Hadoop ecosystem with an S3-compatible interface.

Apache Iceberg

12 connections 4 resources

An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) ...

Delta Lake

8 connections 4 resources

An open table format and storage layer providing ACID transactions, scalable metadata, and schema enforcement on data stored in object storage. Origin...

Apache Hudi

7 connections 4 resources

A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage...

DuckDB

9 connections 3 resources

An in-process analytical database engine (like SQLite for analytics) that reads Parquet, Iceberg, and other formats directly from S3 without requiring...

Trino

9 connections 4 resources

A distributed SQL query engine for federated analytics across heterogeneous data sources, with deep support for S3-backed data lakes and lakehouses.

ClickHouse

A column-oriented DBMS designed for real-time analytical queries, with native support for reading from and writing to S3.

Apache Spark

9 connections 4 resources

A distributed compute engine for large-scale data processing — batch ETL, streaming, SQL, and machine learning — over S3-stored data.

LanceDB

A vector database that stores data in the Lance columnar format directly on object storage. Designed for serverless vector search without a separate i...

StarRocks

An MPP analytical database with native lakehouse capabilities, able to directly query S3 data in Parquet, ORC, and Iceberg formats.

Apache Flink

A distributed stream processing framework that processes data in real-time, with S3 as checkpoint store, state backend, and output sink.

Standard

S3 API

13 connections 3 resources

The HTTP-based API for object storage operations: PUT, GET, DELETE, LIST, multipart upload. The de facto standard for object storage interoperability...

Apache Parquet

16 connections 4 resources

A columnar file format specification designed for efficient analytical queries. Stores data by column, enabling predicate pushdown, projection pruning...

Apache Arrow

A cross-language in-memory columnar data format specification with libraries for zero-copy reads, IPC, and efficient analytics.

Iceberg Table Spec

The specification defining how a logical table is represented as metadata files, manifest lists, manifests, and data files on object storage. Provides...

Delta Lake Protocol

The specification for ACID transaction logs over Parquet files on object storage. Defines how writes, deletes, and schema changes are recorded in a JS...

Apache Hudi Spec

4 connections 4 resources

The specification for managing incremental data processing on object storage — record-level upserts, deletes, change logs, and timeline-based metadata...

ORC

Optimized Row Columnar file format specification — a columnar format with built-in indexing, compression, and predicate pushdown support, originally d...

Apache Avro

A row-based data serialization format with rich schema definition and built-in schema evolution support. Schemas are stored with the data.

Architecture

Lakehouse Architecture

23 connections 3 resources

A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table f...

Medallion Architecture

A layered data quality pattern — Bronze (raw), Silver (cleansed), Gold (business-ready) — with each layer stored on object storage.

Separation of Storage and Compute

9 connections 3 resources

The design pattern of keeping data in S3 while running independent, elastically scaled compute engines against it.

Hybrid S3 + Vector Index

A pattern that stores raw data on S3 and maintains a vector index over embeddings that points back to S3 objects.
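
A minimal sketch of this pattern, assuming a tiny in-memory list stands in for the vector index and a brute-force cosine scan stands in for a real approximate-nearest-neighbor structure. The URIs, vectors, and the `search` helper are all hypothetical; what matters is that each index entry carries a pointer back to the S3 object it was derived from.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical index: each embedding is stored next to the S3 URI it came from.
index = [
    ("s3://example-bucket/docs/a.txt", [1.0, 0.0, 0.0]),
    ("s3://example-bucket/docs/b.txt", [0.0, 1.0, 0.0]),
    ("s3://example-bucket/docs/c.txt", [0.9, 0.1, 0.0]),
]


def search(query: list[float], k: int = 2) -> list[str]:
    """Return the S3 URIs of the k entries most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [uri for uri, _ in ranked[:k]]


print(search([1.0, 0.0, 0.0]))
```

The search returns URIs rather than content, so the raw objects are fetched from S3 only for the handful of hits.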

Offline Embedding Pipeline

A batch pattern where embeddings are generated from S3-stored data on a schedule, with resulting vectors written back to object storage or a vector in...

Local Inference Stack

A pattern of running ML/LLM models on local hardware against data stored in or pulled from S3, avoiding cloud-based inference APIs.

Write-Audit-Publish

A data quality pattern where data lands in a raw S3 zone, undergoes validation, and is promoted to a curated zone only after passing audits.
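
The flow can be sketched with two dictionaries standing in for the raw and curated zones. The audit rule, the record shape, and the function names are assumptions made for illustration; production pipelines often audit a whole batch or snapshot rather than individual records.

```python
# Hypothetical raw zone: object key -> parsed record.
raw_zone = {
    "orders/1.json": {"order_id": 1, "amount": 25.0},
    "orders/2.json": {"order_id": 2, "amount": -5.0},  # fails the audit below
}
curated_zone: dict[str, dict] = {}


def audit(record: dict) -> bool:
    """Hypothetical check: an order needs an id and a non-negative amount."""
    return "order_id" in record and record.get("amount", -1) >= 0


def publish(raw: dict[str, dict], curated: dict[str, dict]) -> list[str]:
    """Promote records that pass the audit; return the keys that were rejected."""
    rejected = []
    for key, record in raw.items():
        if audit(record):
            curated[key] = record
        else:
            rejected.append(key)
    return rejected


rejected = publish(raw_zone, curated_zone)
print(sorted(curated_zone), rejected)
```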

Tiered Storage

Moving data between hot, warm, and cold storage tiers based on access frequency. S3 itself offers tiering (Standard, Infrequent Access, Glacier).
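
One way to picture an access-based tiering rule is a simple threshold function. The 30-day and 90-day cutoffs below are illustrative assumptions, not values taken from any provider, and the mapping of hot, warm, and cold to specific storage classes is likewise only a sketch.

```python
def choose_tier(days_since_last_access: int) -> str:
    """Map access recency to a tier; the 30- and 90-day cutoffs are illustrative only."""
    if days_since_last_access < 30:
        return "hot"   # e.g. Standard
    if days_since_last_access < 90:
        return "warm"  # e.g. Infrequent Access
    return "cold"      # e.g. Glacier


for days in (3, 45, 400):
    print(days, choose_tier(days))
```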

Pain Point

Small Files Problem

8 connections 2 resources

Too many small objects in S3 degrade query performance and increase API call costs. Each file requires a separate GET request, and S3 charges per-requ...

Cold Scan Latency

8 connections 2 resources

Slow first-query performance against S3-stored data, caused by object discovery, metadata fetching, and data transfer over HTTP.

Schema Evolution

10 connections 2 resources

Changing data schemas (adding columns, renaming fields, altering types) in S3-stored datasets without breaking downstream consumers.

Legacy Ingestion Bottlenecks

Older ETL systems designed for HDFS or traditional databases that cannot efficiently write to modern S3-based lakehouse architectures.

High Cloud Inference Cost

The expense of running LLM/ML inference via cloud APIs (per-token or per-request pricing) against S3 data at scale.

Object Listing Performance

The slowness and cost of listing large numbers of objects in S3's flat namespace using prefix-based scans. Listings are paginated at up to 1,000 objects per request.
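
To make the cost concrete, the arithmetic below estimates how many LIST requests a full scan of one prefix needs at 1,000 objects per page. The function name is hypothetical, and the estimate assumes every page comes back full.

```python
import math

PAGE_SIZE = 1_000  # maximum objects returned per LIST request


def list_requests_needed(object_count: int, page_size: int = PAGE_SIZE) -> int:
    """Requests needed to enumerate a prefix; even an empty prefix costs one call."""
    return max(1, math.ceil(object_count / page_size))


for count in (0, 1_000, 1_001, 10_000_000):
    print(count, list_requests_needed(count))
```

Within a single prefix these requests run one after another, because each page supplies the continuation token for the next.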

Metadata Overhead at Scale

4 connections 2 resources

Table format metadata (manifests, snapshots, statistics) grows as S3 datasets grow, eventually slowing planning, compaction, and garbage collection.

Partition Pruning Complexity

The difficulty of efficiently skipping irrelevant S3 objects during queries. Requires careful partitioning strategy, predicate pushdown, and metadata ...

Vendor Lock-In

Dependence on a single S3 provider's proprietary features, pricing, or integrations that makes migration difficult.

Egress Cost

The cost charged by cloud providers for data transferred out of their S3 service — to the internet, another region, or another cloud.

S3 Consistency Model Variance

2 connections 3 resources

The differences in consistency guarantees across S3-compatible storage providers. AWS S3 has been strongly consistent since December 2020; other providers may differ.

Lack of Atomic Rename

The S3 API has no atomic rename operation. Renaming requires copy-then-delete — a two-step, non-atomic process.
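
A minimal sketch of why this matters, using a plain dictionary as a stand-in for a bucket. The keys and the `rename` helper are hypothetical; a real client would issue a copy request followed by a delete request, with the same window between them.

```python
# Hypothetical bucket contents: key -> object bytes.
store = {"tmp/part-0.parquet": b"data"}


def rename(store: dict[str, bytes], src: str, dst: str) -> None:
    """Emulate rename as copy-then-delete; the two steps are not atomic."""
    store[dst] = store[src]  # step 1: copy
    # A reader listing the store here sees both keys; a crash here leaves both behind.
    del store[src]           # step 2: delete


rename(store, "tmp/part-0.parquet", "final/part-0.parquet")
print(sorted(store))
```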

Model Class

Embedding Model

A class of model that converts unstructured data (text, images, audio) into fixed-dimensional vector representations suitable for similarity search.

General-Purpose LLM

10 connections 3 resources

A large language model for broad text tasks. In scope when applied to metadata extraction, summarization, schema inference, or querying of S3-stored c...

Code-Focused LLM

An LLM specialized for code understanding and generation. A subtype of General-Purpose LLM with enhanced ability to work with structured and semi-stru...

Small / Distilled Model

2 connections 2 resources

A compact model (typically under 10B parameters) suitable for local or edge deployment, often distilled from a larger model to retain key capabilities...

LLM Capability

Embedding Generation

7 connections 2 resources

Converting unstructured content stored in S3 (documents, images, logs) into vector representations for similarity search.

Semantic Search

Querying S3-derived vector embeddings to find content by meaning rather than exact keyword match.

Metadata Extraction

Using LLMs to extract structured metadata (entities, categories, summaries, key-value pairs) from unstructured objects stored in S3.

Schema Inference

7 connections 3 resources

Using LLMs to infer or suggest schemas from semi-structured data (JSON, CSV, nested formats) stored in S3.

Data Classification

6 connections 2 resources

Using LLMs to categorize, tag, or label S3-stored objects based on content analysis — by topic, sensitivity level, or compliance category.

Natural Language Querying