# LLMS3: The S3 & Object Storage Ecosystem Index > LLMS3 is a curated index of the S3 and object storage ecosystem. It maps 61 nodes across 7 types (Topics, Technologies, Standards, Architectures, Pain Points, Model Classes, LLM Capabilities) and 192 authoritative resources. The index covers the technologies, standards, architectural patterns, and engineering challenges that define how data is stored, queried, and processed on S3-compatible object storage. LLMS3 is a structured knowledge base for the S3 and object storage ecosystem. It is designed to help engineers, architects, and LLMs navigate the landscape of technologies, standards, and patterns that surround S3-compatible object storage. This file is organized into sections. **Guides** come first — 8 cross-cutting guides that address common engineering decisions and trade-offs. Each guide references specific nodes from the index. After the guides, nodes are organized by type: **Topics** (navigational entry points — conceptual domains with no version or maintainer), **Technologies** (concrete tools, systems, or platforms with version histories and maintainers), **Standards** (format, protocol, or interface specifications that technologies implement), **Architectures** (repeatable system designs — blueprints, not products), **Pain Points** (concrete, recurring problems experienced by engineers operating S3-centric systems at scale), **Model Classes** (categories of ML/LLM models by their operational role in S3-centric systems), and **LLM Capabilities** (specific functions performed by models, scoped to operations on S3-stored data). The file concludes with a **Relationship Index** — a compact edge list showing how every node connects to every other node. ## Guides ### Guide 1: How S3 Shapes Lakehouse Design {#how-s3-shapes-lakehouse-design} #### Problem framing Every lakehouse architecture sits on object storage — almost always S3 or an S3-compatible store. But S3 is not a database, and its constraints fundamentally shape how lakehouses are designed. Engineers building lakehouses need to understand which S3 behaviors are features, which are limitations, and how table formats work around both. #### Relevant nodes - **Topics:** S3, Object Storage, Lakehouse, Table Formats - **Technologies:** AWS S3, MinIO, Ceph, Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, ClickHouse, StarRocks - **Standards:** S3 API, Apache Parquet, Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec - **Architectures:** Lakehouse Architecture, Separation of Storage and Compute, Medallion Architecture - **Pain Points:** Lack of Atomic Rename, Cold Scan Latency, Small Files Problem, Metadata Overhead at Scale, Object Listing Performance #### Decision path 1. **Choose your S3 layer.** AWS S3 for managed convenience, MinIO for self-hosted control, Ceph for unified storage needs. This choice determines consistency model, available features, and egress economics. 2. **Choose a table format.** This is the most consequential decision: - **Iceberg** if you need multi-engine access (Spark + Trino + Flink reading the same tables), hidden partitioning, and broad community adoption. - **Delta Lake** if you are in the Databricks ecosystem and want tight Spark integration with streaming+batch unification. - **Hudi** if your primary workload is CDC ingestion with record-level upserts. - All three use Parquet as the data file format. The difference is in metadata structure, commit protocol, and partition management. 3. 
**Understand the S3 constraints you are inheriting:** - **No atomic rename** → table commits require workarounds (DynamoDB for Delta, metadata pointers for Iceberg). Plan for this complexity. - **LIST is slow** → table formats reduce listing dependency through manifests, but metadata itself grows and must be maintained. - **Cold scan latency** → first queries are slow. Metadata-driven pruning (partition pruning, column statistics) is essential, not optional. - **Small files** → streaming writes and high-parallelism batch jobs produce small files by default. Compaction is mandatory. 4. **Choose your query engines.** Separation of storage and compute means multiple engines can read the same S3 data: - **Spark** for batch ETL and large-scale transformations - **Trino** for interactive federated queries - **DuckDB** for single-machine ad-hoc exploration - **StarRocks/ClickHouse** for low-latency dashboards 5. **Plan metadata operations.** Snapshot expiration, orphan file cleanup, manifest merging, and compaction are operational requirements, not optional maintenance tasks. At scale, these consume significant compute. #### What changed over time - Early data lakes on S3 had no table semantics — raw Parquet files with Hive-style partitioning and no transactions. - Table formats (Hudi 2016, Delta 2019, Iceberg 2018 graduated to Apache TLP 2020) added ACID, schema evolution, and time-travel. - AWS S3 moved from eventual to strong consistency (December 2020), eliminating a class of bugs but not the atomic rename gap. - Iceberg has converged toward becoming the de-facto standard, with Databricks adding Iceberg support alongside Delta. - Metadata management (catalogs, compaction, GC) has shifted from "nice to have" to a core operational requirement as lakehouse deployments have matured. #### Sources - https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf (Paper — the foundational lakehouse paper) - https://iceberg.apache.org/spec/ (Spec — Iceberg table format specification) - https://github.com/delta-io/delta/blob/master/PROTOCOL.md (Spec — Delta Lake protocol) - https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html (Docs — S3 performance patterns) - https://delta.io/blog/2022-05-18-multi-cluster-writes-to-delta-lake-storage-in-s3/ (Blog — how Delta handles S3's lack of atomic rename) - https://docs.databricks.com/aws/en/delta/s3-limitations (Docs — explicit S3 limitations for Delta Lake) - https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/ (Blog — table format comparison) ### Guide 2: Small Files Problem — Why It Exists and the Common Mitigations {#small-files-problem} #### Problem framing A dataset with 10 million 10KB files performs worse on S3 than the same data in 100 files of 1GB each. The small files problem is the most common performance issue in S3-based systems, and it is caused by how data is produced, not by S3 itself. Every S3 LIST call returns at most 1,000 objects, every GET has per-request latency, and analytical engines must open each file individually. #### Relevant nodes - **Topics:** S3, Object Storage, Table Formats - **Technologies:** Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Apache Flink, DuckDB, Trino - **Standards:** Apache Parquet - **Architectures:** Medallion Architecture, Lakehouse Architecture - **Pain Points:** Small Files Problem, Object Listing Performance, Cold Scan Latency #### Decision path 1. 
**Identify the root cause.** Small files come from three common sources: - **Streaming writes:** Flink/Spark Streaming commits one file per checkpoint interval per partition. With 100 partitions and 1-minute checkpoints, that is 100 files per minute. - **High-parallelism batch writes:** A Spark job with 1,000 tasks writing one file each produces 1,000 files per batch. - **Excessive partitioning:** Partitioning by high-cardinality columns (e.g., user_id) creates one file per partition value per write. 2. **Fix at the writer level (proactive):** - Reduce Spark write parallelism with `coalesce()` or `repartition()` before writing. - Increase Flink checkpoint intervals where freshness requirements allow. - Partition by low-cardinality columns (date, region) not high-cardinality ones. - Use Spark's Adaptive Query Execution (AQE) to coalesce small shuffle partitions. 3. **Fix at the table format level (reactive):** - **Iceberg:** Run `rewriteDataFiles` for compaction. Iceberg's hidden partitioning reduces over-partitioning risk. - **Delta Lake:** Use `OPTIMIZE` with Z-ordering or liquid clustering. Databricks Auto Compaction handles this automatically. - **Hudi:** Configure inline compaction for Merge-on-Read tables or run offline compaction jobs. 4. **Target file sizes.** For Parquet files on S3: - Analytical queries: 256MB–1GB per file - Streaming with near-real-time needs: 128MB minimum, compact to 256MB+ periodically - Below 100MB: almost always problematic 5. **Monitor continuously.** Small files accumulate over time. Set up monitoring for average file size per table/partition and alert when it drops below threshold. #### What changed over time - Early Hadoop data lakes had the same problem on HDFS, but HDFS NameNode memory limits forced engineers to address it. S3's limitless namespace hid the problem until query performance degraded. - Table formats introduced compaction as a first-class operation. Iceberg's `rewriteDataFiles`, Delta's `OPTIMIZE`, and Hudi's inline compaction all exist specifically because of this problem. - Auto-compaction features (Databricks Auto Optimize, Spark AQE) have shifted the solution from manual intervention to automated background maintenance. - The problem has not gone away — it has moved from "my job produces too many files" to "my compaction job cannot keep up with my write rate." #### Sources - https://delta.io/blog/2023-01-25-delta-lake-small-file-compaction-optimize/ (Blog — Delta Lake OPTIMIZE for small file compaction) - https://docs.databricks.com/aws/en/delta/tune-file-size (Docs — controlling file sizes in Delta) - https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html (Docs — S3 performance optimization) - https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html (Docs — 1,000-object LIST limit) - https://iceberg.apache.org/spec/ (Spec — Iceberg metadata tree structure) - https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/ (Blog — format comparison including compaction approaches) ### Guide 3: Why Iceberg Exists (and What It Replaces) {#why-iceberg-exists} #### Problem framing Before Iceberg, querying data on S3 meant pointing a Hive Metastore at a directory of Parquet files and hoping for the best. There were no transactions, schema changes required rewriting data, partition layouts were user-visible and fragile, and concurrent reads/writes produced unpredictable results. 
Iceberg replaces this entire stack of workarounds with a formal table specification. #### Relevant nodes - **Topics:** Table Formats, Lakehouse, S3 - **Technologies:** Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, Apache Flink - **Standards:** Iceberg Table Spec, Apache Parquet, S3 API - **Architectures:** Lakehouse Architecture - **Pain Points:** Schema Evolution, Small Files Problem, Partition Pruning Complexity, Metadata Overhead at Scale, Lack of Atomic Rename #### Decision path 1. **Understand what Iceberg replaces:** - **Hive-style partitioning** → Iceberg's hidden partitioning. Users no longer need to specify partition columns in queries; the table format handles pruning transparently. - **Schema rigidity** → Iceberg's column-ID-based schema evolution. Add, drop, rename, and reorder columns as metadata-only operations. No data rewrite required. - **No transactions** → Iceberg's snapshot isolation. Writers produce new snapshots; readers see consistent table state. Concurrent access is safe. - **Directory listing for file discovery** → Iceberg's manifest files. Query planners read manifests instead of listing S3 prefixes — eliminating the object listing bottleneck. 2. **Decide if Iceberg is right for your workload:** - **Yes** if you need multi-engine access (Spark, Trino, Flink, DuckDB all reading the same tables). - **Yes** if schema evolution is frequent and you cannot afford data rewrites. - **Yes** if you want vendor-neutral table format with the broadest ecosystem support. - **Consider alternatives** if you are deeply invested in Databricks (Delta Lake has tighter integration) or need CDC-first ingestion patterns (Hudi specializes here). 3. **Understand Iceberg's S3 constraints:** - Iceberg metadata is stored as files on S3. Metadata operations (commit, planning) are subject to S3 latency. - Atomic commits on S3 require a catalog (Hive Metastore, Nessie, AWS Glue) to coordinate metadata pointer updates. - Metadata grows with every commit. Snapshot expiration and orphan file cleanup are operational necessities. 4. **Plan for metadata maintenance from day one:** - Expire old snapshots regularly (`expireSnapshots`) - Remove orphan files that are no longer referenced - Compact manifests when manifest lists grow large - Monitor metadata file counts and planning times #### What changed over time - Iceberg started at Netflix (2018) to solve table management problems at Netflix's scale on S3. - Graduated to Apache Top-Level Project (2020), signaling broad industry adoption. - Multi-engine support expanded — from Spark-only to Spark, Trino, Flink, DuckDB, ClickHouse, StarRocks. - Iceberg REST catalog emerged as a standard catalog interface, reducing lock-in to specific metadata stores. - Databricks began supporting Iceberg alongside Delta, effectively acknowledging Iceberg's momentum as the cross-engine standard. 
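The maintenance tasks in step 4 of the decision path above map onto Iceberg's built-in Spark procedures. Below is a minimal PySpark sketch, assuming a Spark session already wired to an Iceberg catalog named `my_catalog` and a table `db.events` (both placeholder names); check procedure arguments against the Iceberg release you run:

```python
from datetime import datetime, timedelta, timezone
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime JAR is on the classpath and an Iceberg
# catalog named "my_catalog" points at your S3 warehouse.
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small data files into larger ones (the Guide 2 mitigation).
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire snapshots older than 7 days so unreferenced data files can be removed.
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '{cutoff}'
    )
""")

# Delete S3 objects no snapshot references (e.g., leftovers from failed writes).
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'db.events')")

# Merge small manifests so query planning reads less metadata.
spark.sql("CALL my_catalog.system.rewrite_manifests(table => 'db.events')")
```

In practice these run on a schedule (for example nightly), since each is itself a Spark job that consumes compute — which is exactly the operational cost the decision path warns about.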
#### Sources - https://iceberg.apache.org/spec/ (Spec — the authoritative table format specification) - https://iceberg.apache.org/docs/latest/ (Docs — official documentation) - https://iceberg.apache.org/docs/latest/aws/ (Docs — AWS/S3 integration specifics) - https://github.com/apache/iceberg (GitHub — canonical repository) - https://iceberg.apache.org/docs/latest/evolution/ (Docs — schema evolution mechanics) - https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/ (Blog — comparison with Delta and Hudi) - https://www.dremio.com/blog/table-format-partitioning-comparison-apache-iceberg-apache-hudi-and-delta-lake/ (Blog — partitioning strategy comparison) ### Guide 4: Where DuckDB Fits (and Where It Doesn't) {#where-duckdb-fits} #### Problem framing Engineers encounter S3-stored data constantly — Parquet files in data lakes, Iceberg tables in lakehouses, ad-hoc exports. Historically, exploring this data required setting up Spark clusters or Trino coordinators. DuckDB changes the equation by bringing fast columnar analytics to a single machine, reading directly from S3. But knowing when DuckDB is the right tool — and when it is not — prevents both over-engineering and under-performing. #### Relevant nodes - **Topics:** S3, Lakehouse - **Technologies:** DuckDB, Trino, Apache Spark, ClickHouse, StarRocks - **Standards:** Apache Parquet, Apache Arrow - **Pain Points:** Small Files Problem, Object Listing Performance, Cold Scan Latency #### Decision path 1. **Use DuckDB when:** - You need ad-hoc exploration of S3 data (quick SELECT against a few Parquet files) - You are developing and testing queries before deploying them to Spark or Trino - You need embedded analytics in an application (DuckDB runs in-process, no server needed) - Your data fits in a single machine's processing capacity (up to ~100GB of result sets, much more for streaming scans) - You want to query Iceberg tables on S3 without deploying a cluster 2. **Do not use DuckDB when:** - Data volume requires distributed processing (petabyte-scale joins, multi-TB shuffles) - You need concurrent multi-user access (DuckDB is single-process) - You need to write to table formats on S3 in production pipelines (use Spark/Flink) - You are querying millions of small files on S3 (DuckDB is constrained by S3 listing performance) 3. **DuckDB + S3 configuration:** - Use the `httpfs` extension for S3 access with credential configuration - DuckDB supports reading Parquet, CSV, JSON, and Iceberg directly from S3 URIs - Arrow integration enables zero-copy data exchange with Python analytics libraries - Parallel S3 reads improve throughput for larger datasets 4. **DuckDB vs. alternatives (quick reference):** - **DuckDB vs. Spark:** DuckDB for single-machine, interactive; Spark for distributed, production pipelines - **DuckDB vs. Trino:** DuckDB for local exploration; Trino for multi-user, multi-source, federated queries - **DuckDB vs. ClickHouse:** DuckDB for embedded/serverless; ClickHouse for persistent, low-latency dashboards - **DuckDB vs. StarRocks:** DuckDB for development; StarRocks for production analytics with caching #### What changed over time - DuckDB started as an academic project (CWI Amsterdam) focused on in-process OLAP — the "SQLite for analytics." - S3 support came via the `httpfs` extension, making DuckDB immediately useful for data lake exploration. 
- Iceberg support expanded DuckDB from "Parquet file reader" to "lakehouse query tool" — querying table format metadata, not just raw files. - The "DuckDB for everything" trend has led to engineers using it beyond its design envelope. Single-machine performance is excellent but has a ceiling. - Integration with Python (pandas, Polars, Arrow) has made DuckDB the default local analytics tool for data engineers. #### Sources - https://duckdb.org/docs/ (Docs — official documentation) - https://duckdb.org/docs/extensions/httpfs/s3api (Docs — S3 access configuration) - https://github.com/duckdb/duckdb (GitHub — source repository) - https://arrow.apache.org/docs/format/Columnar.html (Spec — Arrow format used for in-memory processing) ### Guide 5: Vector Indexing on Object Storage — What's Real vs. Hype {#vector-indexing-real-vs-hype} #### Problem framing Vector databases and semantic search are heavily marketed features in the AI ecosystem. For engineers building on S3, the question is practical: can you build production vector search over S3-stored data, and what are the real trade-offs? The answer depends on data volume, latency requirements, and whether you need a separate infrastructure layer. #### Relevant nodes - **Topics:** Vector Indexing on Object Storage, LLM-Assisted Data Systems, S3 - **Technologies:** LanceDB, AWS S3 - **Standards:** S3 API - **Architectures:** Hybrid S3 + Vector Index, Offline Embedding Pipeline, Local Inference Stack - **Model Classes:** Embedding Model, Small / Distilled Model - **LLM Capabilities:** Embedding Generation, Semantic Search - **Pain Points:** High Cloud Inference Cost, Cold Scan Latency #### Decision path 1. **Decide if you need vector search at all:** - **Yes** if your data is unstructured (documents, images, logs) and users need to find content by meaning. - **Yes** if you are building RAG systems grounded in S3-stored corpora. - **No** if your queries are structured (SQL filters, exact matches, aggregations). Table formats and SQL engines are the right tool. - **Maybe** if you want to combine semantic and structured search (hybrid search) — this is real but adds complexity. 2. **Choose your vector index architecture:** - **S3-native (LanceDB):** Vector indexes stored as files on S3. Serverless, no separate infrastructure, lowest operational overhead. Trade-off: higher query latency (S3 read on every query). - **Dedicated vector database (Milvus, Weaviate):** Separate infrastructure with in-memory indexes. Lower latency, higher throughput. Trade-off: another system to operate, and you store data in two places (S3 + vector DB). - **Managed service (OpenSearch, S3 Vectors):** Cloud-managed vector search. Trade-off: vendor lock-in and cost at scale. 3. **Plan your embedding pipeline:** - Source data lives in S3 → embedding model processes it → vectors are stored in the index - **Batch (Offline Embedding Pipeline):** Process S3 data on a schedule. Cost-predictable. Stale by design. - **Stream:** Embed on ingest. Fresh but expensive and operationally complex. - **Embedding model choice:** Commercial APIs (OpenAI) for quality, open-source (sentence-transformers) for cost/privacy, small/distilled models for local inference. 4. **Understand what's real vs. hype:** - **Real:** Vector search over thousands to millions of documents on S3. LanceDB handles 1B+ vectors on S3. RAG with S3-backed corpora works in production. - **Real:** Embedding costs dominate the total cost. The index itself is cheap; generating embeddings is not. 
- **Hype:** "Just add vector search to your data lake." Integration requires embedding pipelines, index maintenance, sync mechanisms, and relevance tuning. - **Hype:** "Vector search replaces SQL." It does not. It answers a different question (semantic similarity vs. predicate matching). #### What changed over time - Early vector databases (2020-2022) were standalone systems with no S3 story. Data had to be copied in. - S3-native vector search emerged (LanceDB, Lance format) to align with the separation of storage and compute principle. - AWS announced S3 Vectors — native vector storage in S3 itself — signaling that vector search is moving into the storage layer. - Embedding model costs dropped significantly (open-source models, quantized models, distillation). This makes the embedding pipeline more viable at S3 data scale. - The "RAG over S3 data" pattern has become a standard architecture, with AWS, Databricks, and LangChain providing reference implementations. #### Sources - https://aws.amazon.com/blogs/architecture/a-scalable-elastic-database-and-search-solution-for-1b-vectors-built-on-lancedb-and-amazon-s3/ (Blog — 1B vector search on S3) - https://lancedb.github.io/lancedb/ (Docs — S3-native vector database) - https://milvus.io/docs/overview.md (Docs — vector database with S3 backend) - https://milvus.io/docs/deploy_s3.md (Docs — Milvus S3 configuration) - https://aws.amazon.com/blogs/aws/introducing-amazon-s3-vectors-first-cloud-storage-with-native-vector-support-at-scale/ (Blog — S3 Vectors announcement) - https://sbert.net/ (Docs — open-source embedding models) - https://platform.openai.com/docs/guides/embeddings (Docs — commercial embedding API) - https://github.com/aws-samples/text-embeddings-pipeline-for-rag (GitHub — reference embedding pipeline) ### Guide 6: LLMs over S3 Data — Embeddings, Metadata, and Local Inference Constraints {#llms-over-s3-data} #### Problem framing LLMs can extract value from S3-stored data — generating embeddings, extracting metadata, classifying documents, inferring schemas, and translating natural language to SQL. But every one of these operations has a cost, and at S3 data volumes (terabytes to petabytes), the cost question dominates. Engineers need to understand which LLM capabilities are viable at their scale, how to control costs, and when local inference is the right answer. #### Relevant nodes - **Topics:** LLM-Assisted Data Systems, S3, Vector Indexing on Object Storage, Metadata Management - **Technologies:** LanceDB, AWS S3 - **Architectures:** Offline Embedding Pipeline, Local Inference Stack, Hybrid S3 + Vector Index - **Model Classes:** Embedding Model, General-Purpose LLM, Code-Focused LLM, Small / Distilled Model - **LLM Capabilities:** Embedding Generation, Semantic Search, Metadata Extraction, Schema Inference, Data Classification, Natural Language Querying - **Pain Points:** High Cloud Inference Cost, Egress Cost #### Decision path 1. **Assess your LLM use case against S3 data volume:** - **Embedding generation** at 1M documents: ~$50-500 via cloud API, ~$5-50 on local GPU. Viable at most scales. - **Metadata extraction** on 10M objects: ~$5,000-50,000 via cloud API. Only viable with prioritization (extract from high-value objects only) or local inference. - **Schema inference** is low-volume (run once per new dataset). Cloud API cost is negligible. - **Natural language querying** is per-query cost. Low volume, high value per query. Cloud API is usually fine. 
- **Data classification** at petabyte scale: requires local inference or AWS Macie for PII. Cloud LLM APIs are prohibitive. 2. **Choose your inference strategy:** - **Cloud API (OpenAI, Bedrock, SageMaker):** Highest quality, highest cost, zero infrastructure. Use for low-volume, high-value tasks (schema inference, NL querying). - **Managed local (SageMaker endpoints):** Medium cost, auto-scaling, AWS-managed. Use for medium-volume batch processing. - **Self-hosted local (vLLM, llama.cpp):** Lowest per-token cost at high volume, highest operational overhead. Use for high-volume embedding and classification. - **Small/distilled models:** Run on commodity hardware. Quality trade-off. Use when 90% accuracy is acceptable and volume makes cloud APIs prohibitive. 3. **Account for data movement costs:** - Cloud inference often requires moving S3 data to inference endpoints → egress charges. - Local inference with MinIO (on-premise S3) eliminates egress entirely. - Hybrid: keep models near data. Deploy inference in the same region/VPC as your S3 buckets. 4. **Structure your pipeline:** - Use the **Offline Embedding Pipeline** pattern for batch processing. Schedule daily/weekly. Idempotent and resumable. - Store embeddings back to S3 (Lance format, Parquet with vector columns, or dedicated vector store). - Use the **Hybrid S3 + Vector Index** pattern to make embedded data searchable. - Metadata extraction results → enrich table format metadata (Iceberg custom properties, Glue Data Catalog tags). 5. **Set quality expectations:** - LLM outputs are probabilistic. Schema inference suggestions need human review. Classification needs confidence thresholds. NL-to-SQL needs query validation. - Build validation into the pipeline, not as an afterthought. #### What changed over time - Early LLM-over-data workloads (2022-2023) used cloud APIs exclusively. Costs were high and scale was limited. - Open-source embedding models (sentence-transformers, E5) made local embedding generation viable. - Quantized inference (llama.cpp, GGML/GGUF) brought LLM inference to commodity hardware. - vLLM and model streaming from S3 (Run:ai Model Streamer) reduced cold-start latency for self-hosted inference. - AWS introduced S3 Vectors and S3 Metadata features, signaling that LLM-derived data enrichment is moving into the storage platform itself. - The cost-per-token of both cloud and local inference has dropped steadily, but S3 data volumes grow faster. The economic tension persists. 
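As a concrete instance of the Offline Embedding Pipeline pattern referenced in the decision path above, here is a minimal batch sketch using boto3 and sentence-transformers (see the sbert.net source below). The bucket, prefix, output key, and model name are placeholder choices; a production pipeline would add chunking, checkpointing, and idempotent re-runs:

```python
import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from sentence_transformers import SentenceTransformer

BUCKET = "my-corpus-bucket"          # placeholder bucket
PREFIX = "documents/txt/"            # placeholder prefix of plain-text objects
OUTPUT_KEY = "embeddings/batch.parquet"

s3 = boto3.client("s3")
model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model

keys, texts = [], []
paginator = s3.get_paginator("list_objects_v2")  # each page returns at most 1,000 keys
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        keys.append(obj["Key"])
        texts.append(body.decode("utf-8", errors="replace"))

# Batch-encode locally — no per-document cloud API calls, so cost scales with
# GPU/CPU time rather than tokens.
vectors = model.encode(texts, batch_size=64, show_progress_bar=True)

# Store embeddings back to S3 as Parquet with a vector column, keyed by object
# so search results can point back to the source data.
table = pa.table({
    "s3_key": keys,
    "embedding": [v.tolist() for v in vectors],
})
buf = io.BytesIO()
pq.write_table(table, buf)
s3.put_object(Bucket=BUCKET, Key=OUTPUT_KEY, Body=buf.getvalue())
```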
#### Sources - https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer/ (Docs — loading models from S3) - https://github.com/ggml-org/llama.cpp (GitHub — local inference engine) - https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/ (Blog — S3-backed model streaming) - https://docs.aws.amazon.com/sagemaker/latest/dg/inference-cost-optimization.html (Docs — inference cost optimization) - https://aws.amazon.com/bedrock/ (Docs — managed LLM service with S3 integration) - https://sbert.net/ (Docs — open-source embedding framework) - https://aws.amazon.com/blogs/storage/building-self-managed-rag-applications-with-amazon-eks-and-amazon-s3-vectors/ (Blog — self-managed RAG on S3) - https://engineering.grab.com/llm-powered-data-classification (Blog — LLM classification at petabyte scale) - https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide (Blog — inference unit economics) - https://aws.amazon.com/s3/features/metadata/ (Docs — S3 Metadata feature) ### Guide 7: Choosing a Table Format — Iceberg vs. Delta vs. Hudi {#choosing-a-table-format} #### Problem framing The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — all solve the same fundamental problem: adding transactional table semantics to files on S3. But they solve it differently, optimize for different workloads, and have different ecosystem affinities. This guide helps engineers choose. #### Relevant nodes - **Topics:** Table Formats, Lakehouse, S3 - **Technologies:** Apache Iceberg, Delta Lake, Apache Hudi, Apache Spark, Trino, DuckDB, Apache Flink - **Standards:** Iceberg Table Spec, Delta Lake Protocol, Apache Hudi Spec, Apache Parquet - **Architectures:** Lakehouse Architecture - **Pain Points:** Schema Evolution, Small Files Problem, Lack of Atomic Rename, Metadata Overhead at Scale, Vendor Lock-In #### Decision path 1. **Start with your primary engine:** - **Databricks/Spark-heavy:** Delta Lake has the tightest integration. Features like Auto Optimize, liquid clustering, and predictive I/O work best (or only) on Databricks. - **Multi-engine (Spark + Trino + Flink + DuckDB):** Iceberg. It was designed for engine-agnostic access from the start. Every major engine has a first-class Iceberg connector. - **CDC-first (Change Data Capture):** Hudi. Record-level upserts and incremental queries are Hudi's core strength. MoR table type is optimized for write-heavy, update-heavy workloads. 2. **Evaluate on S3-specific dimensions:** | Dimension | Iceberg | Delta Lake | Hudi | |-----------|---------|------------|------| | S3 atomic commit | Catalog-based pointer swap | Requires DynamoDB log store | Marker-based with lock provider | | Schema evolution | Column-ID-based, metadata-only | Enforced + evolvable | Schema-on-read + enforcement | | Partition management | Hidden partitioning (transparent) | User-managed (+ liquid clustering on Databricks) | User-managed | | Compaction | `rewriteDataFiles` | `OPTIMIZE` | Inline or offline compaction | | Multi-engine support | Broadest | Improving (Delta Kernel) | Moderate | | Metadata model | Manifest tree (prunable) | Flat JSON log (checkpointed) | Timeline (action-based) | 3. **Consider ecosystem momentum:** - Iceberg is converging toward becoming the industry standard. Snowflake, AWS, Google, and Databricks all support it. - Delta Lake remains strong in the Databricks ecosystem and is gaining multi-engine support via Delta Kernel. 
- Hudi adoption is concentrated in CDC-heavy and streaming-heavy environments (Uber, ByteDance). 4. **Do not over-invest in the choice:** - All three formats use Parquet as the data file format. Migration between formats is a metadata operation, not a data rewrite. - The trend is toward interoperability (Iceberg compatibility layers for Delta, UniForm for cross-format reading). The choice is becoming less permanent. #### What changed over time - 2016-2018: Hudi (then Hoodie) emerged at Uber for incremental ETL; Iceberg developed at Netflix for massive-scale table management; Delta developed at Databricks for reliable Spark pipelines. - 2019-2020: All three open-sourced and entered Apache or equivalent foundations. The "format war" narrative emerged. - 2021-2023: Iceberg gained momentum as the cross-engine standard. Snowflake, AWS (Athena/Glue), and Trino adopted it. - 2023-2024: Databricks announced UniForm (Delta tables readable as Iceberg) and direct Iceberg support, effectively hedging on format convergence. - The industry trend is toward Iceberg as the de-facto standard, with Delta and Hudi remaining viable in their core ecosystems. #### Sources - https://iceberg.apache.org/spec/ (Spec — Iceberg table format specification) - https://github.com/delta-io/delta/blob/master/PROTOCOL.md (Spec — Delta Lake protocol) - https://hudi.apache.org/tech-specs/ (Spec — Hudi technical specifications) - https://hudi.apache.org/docs/overview (Docs — Hudi overview and table types) - https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/ (Blog — comprehensive comparison) - https://www.dremio.com/blog/table-format-partitioning-comparison-apache-iceberg-apache-hudi-and-delta-lake/ (Blog — partitioning comparison) - https://docs.delta.io/latest/delta-storage.html (Docs — Delta S3 storage configuration) - https://www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in-perspective (Blog — open table format strategy) ### Guide 8: Egress, Lock-In, and the Case for S3-Compatible Alternatives {#egress-lock-in-s3-alternatives} #### Problem framing AWS S3 egress pricing and proprietary feature creep create a gravitational well: data flows in cheaply but flows out expensively. For organizations with multi-cloud strategies, data sovereignty requirements, or cost sensitivity, this creates a strategic problem. S3-compatible alternatives (MinIO, Ceph, Ozone) and open table formats offer a way out — but with real trade-offs. #### Relevant nodes - **Topics:** S3, Object Storage - **Technologies:** AWS S3, MinIO, Ceph, Apache Ozone - **Standards:** S3 API - **Architectures:** Separation of Storage and Compute, Tiered Storage, Local Inference Stack - **Pain Points:** Vendor Lock-In, Egress Cost, S3 Consistency Model Variance #### Decision path 1. **Quantify your lock-in exposure:** - How much data egress are you paying monthly? (Check AWS Cost Explorer, data transfer line items) - Which AWS-specific S3 features do you depend on? (S3 Select, S3 Inventory, S3 Object Lambda, S3 Intelligent-Tiering, S3 Glacier) - Could your table format, query engine, and ML pipeline run on a different S3-compatible store without modification? 2. **Evaluate S3-compatible alternatives:** - **MinIO:** Best for teams that want S3-compatible storage with zero egress on their own hardware. Highest S3 API coverage among alternatives. Single-binary deployment. - **Ceph:** Best for organizations that need unified storage (object + block + file) on a single platform. 
Higher operational complexity. - **Apache Ozone:** Best for organizations migrating from Hadoop/HDFS and needing both Hadoop FS and S3 API access. 3. **Assess trade-offs honestly:** - **Consistency:** MinIO provides strict consistency. Ceph and Ozone may differ — test your workload's assumptions. - **Feature coverage:** AWS-specific features (S3 Select, S3 Inventory, Glacier tiers) may not exist in alternatives. - **Operational cost:** Self-hosted storage has hardware, networking, staffing, and maintenance costs. Compare total cost of ownership, not just egress savings. - **Performance:** AWS S3 is a planet-scale distributed system. Self-hosted alternatives may not match throughput or durability at the same scale. 4. **Mitigate lock-in without full migration:** - Use open table formats (Iceberg, Delta, Hudi) instead of proprietary formats. Data stays portable even if the storage layer changes. - Use the S3 API as the interface contract. Avoid AWS-specific extensions where S3 API operations suffice. - Use **Tiered Storage** strategically — keep hot data in AWS S3 for performance, cold data on-premise for cost. - Use **Separation of Storage and Compute** — if you change storage layers, compute engines keep working. 5. **Hybrid architectures:** - Production data on AWS S3 + development/testing on MinIO → reduces AWS costs, maintains compatibility - Hot data in AWS S3 + archival on self-hosted MinIO → tiered by cost - Multi-cloud with Iceberg tables → same table format readable from any S3-compatible store #### What changed over time - Early cloud adoption treated egress costs as negligible. As data volumes grew, egress became a significant budget line. - AWS reduced some egress charges (free egress to CloudFront, lower cross-AZ pricing) but the fundamental incentive structure persists: data gravity toward AWS. - MinIO's growth accelerated as organizations sought S3-compatible alternatives for on-premise and edge deployments. - Open table formats reduced data format lock-in (no proprietary file formats), but infrastructure lock-in (IAM, VPC, monitoring, catalog integration) remains. - Cloud providers began offering competitive pricing (Cloudflare R2 with zero egress, Google Cloud free egress to specific destinations), creating pricing pressure that may reduce egress costs further. #### Sources - https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/ (Blog — AWS data transfer cost architecture) - https://docs.aws.amazon.com/cur/latest/userguide/cur-data-transfers-charges.html (Docs — understanding egress charges) - https://www.cloudzero.com/blog/aws-egress-costs/ (Blog — egress cost analysis) - https://www.cloudflare.com/learning/cloud/what-is-vendor-lock-in/ (Docs — vendor lock-in explained) - https://min.io/docs/minio/linux/index.html (Docs — MinIO documentation) - https://docs.ceph.com/en/latest/radosgw/s3/ (Docs — Ceph S3 gateway) - https://ozone.apache.org/ (Docs — Apache Ozone) - https://www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in-perspective (Blog — open formats reduce lock-in) - https://aws.amazon.com/s3/storage-classes/ (Docs — S3 storage classes for tiering) ## Topics ### S3 {#s3} **What it is:** Amazon's Simple Storage Service and the broader ecosystem of S3-compatible object storage. The root concept of this entire index. **Where it fits:** Every node in the index answers the question "How does this relate to S3?" 
S3 is not just a product — it is the API, the paradigm, and the ecosystem that the rest of the map is built around. **Misconceptions / traps:** - S3 is not a filesystem. It has no directories, no atomic rename, and no POSIX semantics. Treating it like a filesystem causes subtle bugs. - "S3-compatible" does not mean identical. Consistency guarantees, performance characteristics, and feature coverage vary across providers. **Key connections:** - Root topic — all other Topics connect inward via `scoped_to` - **Object Storage** `scoped_to` S3 — S3 is the dominant implementation of the object storage paradigm - **S3 API** `scoped_to` S3 — the HTTP interface that defines the ecosystem - **AWS S3** `scoped_to` S3 — the origin and reference implementation **Sources:** - https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html (Docs, High) - https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html (Docs, High) - https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html (Docs, High) ### Object Storage {#object-storage} **What it is:** The storage paradigm of flat-namespace, HTTP-accessible binary objects with metadata. Data is addressed by bucket and key, not by filesystem path. **Where it fits:** Object storage is the foundational layer beneath everything in this index. S3 is the dominant API; all technologies, table formats, and architectures in the map operate on top of object storage. **Misconceptions / traps:** - Object storage has no native directory hierarchy. Prefixes simulate folders but LIST operations scan linearly — not like `ls` on a filesystem. - Durability (11 9s) is not the same as availability or performance. Data is safe but access can be slow or throttled. **Key connections:** - `scoped_to` **S3** — S3 is the dominant object storage API - **Lakehouse** `scoped_to` Object Storage — lakehouses are built on object storage - **AWS S3**, **MinIO**, **Ceph**, **Apache Ozone** `scoped_to` Object Storage — concrete implementations - **Separation of Storage and Compute** `scoped_to` Object Storage — the pattern that decouples compute from data **Sources:** - https://aws.amazon.com/what-is/object-storage/ (Docs, High) - https://www.redhat.com/en/topics/data-storage/what-is-object-storage (Docs, High) - https://min.io/product/overview (Docs, High) ### Lakehouse {#lakehouse} **What it is:** The convergence of data lake storage (raw files on object storage) with data warehouse capabilities — ACID transactions, schema enforcement, SQL access, time-travel. **Where it fits:** Lakehouse sits between raw object storage and business analytics. It is the architectural layer where table formats (Iceberg, Delta, Hudi) add structure to S3 data, enabling SQL engines to query it reliably. **Misconceptions / traps:** - A lakehouse is not just "a data lake with SQL." The key differentiator is transactional guarantees — ACID, schema evolution, snapshot isolation — provided by table format specs. - Lakehouse does not eliminate ETL. It eliminates the second copy of data in a separate warehouse, but data still needs transformation. 
**Key connections:** - `scoped_to` **Object Storage** — the lakehouse stores all data on object storage - **Lakehouse Architecture** `scoped_to` Lakehouse — the concrete architectural pattern - **Apache Iceberg**, **Delta Lake**, **Apache Hudi** `scoped_to` Lakehouse — table format technologies - **Medallion Architecture** `scoped_to` Lakehouse — a data quality pattern within lakehouses - **Iceberg Table Spec**, **Delta Lake Protocol**, **Apache Hudi Spec** `scoped_to` Lakehouse — the specifications that define table semantics **Sources:** - https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf (Paper, High) - https://www.databricks.com/glossary/data-lakehouse (Docs, High) - https://docs.databricks.com/aws/en/lakehouse-architecture/ (Docs, High) ### Data Lake {#data-lake} **What it is:** The pattern of storing raw, heterogeneous data in object storage for later processing. Data arrives in its original form and is transformed downstream. **Where it fits:** Data lakes are the precursor to lakehouses. In the S3 world, a data lake is the simplest form — dump everything into S3 and figure out the schema later. Lakehouses add the structure that data lakes lack. **Misconceptions / traps:** - "Schema-on-read" does not mean "no schema." Without any schema management, data lakes become data swamps — undiscoverable and untrusted. - Data lakes and lakehouses are not mutually exclusive. Most lakehouses include raw data lake zones (e.g., Medallion Bronze layer). **Key connections:** - `is_a` **Object Storage** — a data lake is a use of object storage - `scoped_to` **S3** — S3 is the dominant storage layer for data lakes - **Apache Spark** `scoped_to` Data Lake — the primary compute engine for lake workloads - **Apache Flink** `scoped_to` Data Lake — streaming ingestion into lakes - **Write-Audit-Publish** `scoped_to` Data Lake — quality gating pattern for lake data **Sources:** - https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html (Docs, High) - https://aws.amazon.com/what-is/data-lake/ (Docs, High) ### Table Formats {#table-formats} **What it is:** The category of specifications (Iceberg, Delta, Hudi) that bring table semantics — schema, partitioning, ACID transactions, time-travel — to collections of files on object storage. **Where it fits:** Table formats bridge the gap between raw files on S3 and the structured tables that SQL engines expect. They are the enabling layer for lakehouse architectures. **Misconceptions / traps:** - Table formats are specifications, not databases. They define how metadata and data files are organized — the query engine is separate. - Choosing a table format is increasingly a convergent decision. Iceberg has become the de-facto standard, but Delta and Hudi remain relevant in their ecosystems. 
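To make the "specifications, not databases" point concrete: a table written by one engine can be read by an entirely different one, because the format is just metadata plus data files on S3. A minimal DuckDB sketch in Python, assuming the `httpfs` and `iceberg` DuckDB extensions and a placeholder table location; depending on DuckDB version and how the table was written, `iceberg_scan` may need to point at a specific metadata file rather than the table root:

```python
import duckdb

con = duckdb.connect()

# S3 access and Iceberg reading come from extensions, not the core engine.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Credentials and region for the S3-compatible endpoint (values are placeholders).
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    )
""")

# Query the table's current snapshot straight from its S3 metadata —
# no Spark cluster, no database server, just the format specification.
rows = con.execute("""
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/db/events')
""").fetchall()
print(rows)
```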
**Key connections:** - `scoped_to` **S3** — all table formats operate on S3-stored files - **Iceberg Table Spec**, **Delta Lake Protocol**, **Apache Hudi Spec** `scoped_to` Table Formats — the three major specifications - **Apache Parquet** `scoped_to` Table Formats — the dominant data file format under all three - **Schema Evolution** `scoped_to` Table Formats — the problem table formats exist to solve - **Metadata Overhead at Scale** `scoped_to` Table Formats — the problem table formats introduce **Sources:** - https://iceberg.apache.org/spec/ (Spec, High) - https://github.com/delta-io/delta/blob/master/PROTOCOL.md (Spec, High) - https://hudi.apache.org/docs/overview (Docs, High) - https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/ (Blog, Medium) ### Vector Indexing on Object Storage {#vector-indexing-on-object-storage} **What it is:** The practice of building and querying vector indexes over embeddings derived from data stored in S3. **Where it fits:** This topic connects the LLM side of the index to the storage side. Embeddings are generated from S3-stored content, indexed for similarity search, and the results point back to S3 objects. **Misconceptions / traps:** - Vector indexes are not a replacement for structured queries. They answer "what's semantically similar?" not "what matches this predicate?" - Storing vector indexes on S3 (e.g., LanceDB) is viable but query latency is higher than dedicated vector databases with in-memory indexes. **Key connections:** - `scoped_to` **Object Storage**, **S3** — vectors are derived from and point to S3 data - **LanceDB** `scoped_to` Vector Indexing on Object Storage — S3-native vector database - **Embedding Model** `scoped_to` Vector Indexing on Object Storage — produces the vectors - **Hybrid S3 + Vector Index** `scoped_to` Vector Indexing on Object Storage — the architectural pattern - **Embedding Generation** `scoped_to` Vector Indexing on Object Storage — the capability that feeds vectors **Sources:** - https://aws.amazon.com/blogs/architecture/a-scalable-elastic-database-and-search-solution-for-1b-vectors-built-on-lancedb-and-amazon-s3/ (Blog, High) - https://lancedb.github.io/lancedb/ (Docs, High) - https://milvus.io/docs/overview.md (Docs, High) ### LLM-Assisted Data Systems {#llm-assisted-data-systems} **What it is:** The intersection of large language models and S3-centric data infrastructure. Scoped strictly to cases where LLMs operate on, enhance, or derive value from S3-stored data. **Where it fits:** This topic anchors the AI/ML portion of the index. Every model class and LLM capability in the index connects here — and every connection must pass the S3 scope test: if S3 disappeared, the entry should disappear too. **Misconceptions / traps:** - This is not a general AI topic. Standalone chatbots, general AI trends, and models with no S3 data connection are out of scope. - LLM integration with S3 data is constrained by inference cost and data egress. The economic viability of LLM-over-S3 workloads depends on choosing between cloud APIs and local inference. 
**Key connections:** - `scoped_to` **S3** — all LLM work here is grounded in S3 data - **Embedding Model**, **General-Purpose LLM**, **Code-Focused LLM**, **Small / Distilled Model** `scoped_to` LLM-Assisted Data Systems — model classes - **Offline Embedding Pipeline**, **Local Inference Stack** `scoped_to` LLM-Assisted Data Systems — architectural patterns - **High Cloud Inference Cost** `scoped_to` LLM-Assisted Data Systems — the dominant cost constraint **Sources:** - https://aws.amazon.com/bedrock/ (Docs, High) - https://python.langchain.com/docs/tutorials/rag/ (Docs, High) - https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html (Docs, High) ### Metadata Management {#metadata-management} **What it is:** The discipline of maintaining catalogs, schemas, statistics, and descriptive information about objects and datasets stored in S3. **Where it fits:** Metadata management is the connective tissue between raw S3 storage and usable data. Without it, billions of objects are opaque blobs. With it, they become discoverable, governed, and queryable. **Misconceptions / traps:** - S3 object metadata (content-type, custom headers) is not the same as table metadata (schemas, partition info, statistics). Both exist but serve different purposes. - Metadata catalogs (Glue, HMS, Nessie) are not optional at scale. Without a catalog, every query engine must independently discover and interpret S3 data layout. **Key connections:** - `scoped_to` **Object Storage**, **S3** — metadata describes S3-stored data - **Metadata Overhead at Scale** `scoped_to` Metadata Management — the scaling problem - **Metadata Extraction** `scoped_to` Metadata Management — LLM-driven enrichment - **Data Classification** `scoped_to` Metadata Management — automated tagging of S3 objects **Sources:** - https://docs.aws.amazon.com/glue/latest/dg/components-overview.html (Docs, High) - https://github.com/apache/hive/tree/master/standalone-metastore (GitHub, High) - https://projectnessie.org/ (Docs, High) - https://open-metadata.org/ (Docs, High) ### Data Versioning {#data-versioning} **What it is:** Techniques for tracking and managing changes to datasets stored in object storage over time, including snapshots, branching, and rollback. **Where it fits:** S3 objects are immutable once written. Data versioning adds the concept of change history on top of that immutability — from S3's built-in object versioning to table format snapshots to Git-like branching with lakeFS. **Misconceptions / traps:** - S3 object versioning and dataset versioning are different things. S3 versioning tracks individual object changes; dataset versioning (Iceberg snapshots, lakeFS branches) tracks logical dataset state. - Versioning has storage cost implications. Every snapshot or version retains data, and garbage collection policies are essential at scale. **Key connections:** - `scoped_to` **Object Storage**, **S3** — versioning operates on S3-stored data **Sources:** - https://lakefs.io/ (Docs, High) - https://dvc.org/doc (Docs, High) - https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html (Docs, High) ## Technologies ### AWS S3 {#aws-s3} **What it is:** Amazon's fully managed object storage service — the origin and reference implementation of the S3 API. **Where it fits:** AWS S3 is the gravitational center of the ecosystem. It defined the API that became the de-facto standard, and most tools in this index were built to work with AWS S3 first and other providers second. 
**Misconceptions / traps:** - AWS S3 is now strongly consistent (read-after-write), but code written against the old eventual consistency model may still contain unnecessary workarounds. - S3 storage is cheap; S3 API calls and egress are not. Cost optimization requires understanding request pricing and transfer charges, not just storage GB. **Key connections:** - `implements` **S3 API** — the reference implementation of the standard - `enables` **Lakehouse Architecture** — provides the storage layer for lakehouses - `enables` **Separation of Storage and Compute** — foundational to the pattern - `used_by` **Medallion Architecture** — each layer stores data on S3 - `constrained_by` **Object Listing Performance**, **Lack of Atomic Rename**, **Egress Cost** — key operational limitations **Sources:** - https://docs.aws.amazon.com/s3/ (Docs, High) - https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html (Docs, High) - https://aws.amazon.com/s3/ (Docs, High) ### MinIO {#minio} **What it is:** An open-source, S3-compatible object storage server designed for high performance and self-hosted deployment. **Where it fits:** MinIO is the primary open-source alternative to AWS S3. It enables organizations to run the same S3 workloads on-premise, at the edge, or in any cloud — breaking vendor lock-in while keeping the S3 API contract. **Misconceptions / traps:** - MinIO implements the S3 API but is not AWS S3. Some AWS-specific features (S3 Select, S3 Inventory) may not be available or behave differently. - MinIO provides strict read-after-write consistency by default — stronger than historical AWS S3 behavior. **Key connections:** - `implements` **S3 API** — full S3-compatible interface - `enables` **Lakehouse Architecture** — can serve as the storage layer - `solves` **Vendor Lock-In** — S3-compatible self-hosted alternative - `constrained_by` **Lack of Atomic Rename** — same S3 API limitation applies - **LanceDB** `indexes` MinIO — vector search over MinIO-stored data **Sources:** - https://min.io/docs/minio/linux/index.html (Docs, High) - https://github.com/minio/minio (GitHub, High) - https://blog.min.io/ (Blog, High) ### Ceph {#ceph} **What it is:** A distributed storage system providing object, block, and file storage in a unified platform. S3 compatibility via its RADOS Gateway (RGW). **Where it fits:** Ceph is the enterprise-grade, self-managed storage platform for organizations that need S3-compatible object storage alongside block and file access from a single infrastructure. **Misconceptions / traps:** - Ceph is not just an object store — it is a unified storage platform. The S3 gateway is one component. Operational complexity is significantly higher than MinIO. - S3 API coverage in Ceph RGW is broad but not complete. Test specific API operations (multipart uploads, lifecycle policies) before production use. **Key connections:** - `implements` **S3 API** — via RADOS Gateway - `solves` **Vendor Lock-In** — self-hosted deployment option - `scoped_to` **S3**, **Object Storage** — participates in the S3-compatible ecosystem **Sources:** - https://docs.ceph.com/en/latest/ (Docs, High) - https://docs.ceph.com/en/latest/radosgw/s3/ (Docs, High) - https://github.com/ceph/ceph (GitHub, High) ### Apache Ozone {#apache-ozone} **What it is:** A scalable, distributed object storage system in the Hadoop ecosystem with an S3-compatible interface. **Where it fits:** Ozone bridges the legacy Hadoop world (HDFS, YARN, MapReduce) and the modern S3-based world. 
It gives Hadoop-native workloads an S3 API while also supporting the Hadoop filesystem interface. **Misconceptions / traps:** - Ozone is not a drop-in HDFS replacement. It has a different consistency model and metadata architecture (SCM + OM). - Adoption outside the Hadoop ecosystem is limited. If you don't have legacy Hadoop workloads, MinIO or AWS S3 are more practical choices. **Key connections:** - `implements` **S3 API** — S3-compatible interface for Hadoop environments - `solves` **Legacy Ingestion Bottlenecks** — migration path from HDFS - `scoped_to` **S3**, **Object Storage** — part of the S3-compatible ecosystem **Sources:** - https://ozone.apache.org/ (Docs, High) - https://ozone.apache.org/docs/current/interface/s3.html (Docs, High) - https://github.com/apache/ozone (GitHub, High) ### Apache Iceberg {#apache-iceberg} **What it is:** An open table format for large analytic datasets. Manages metadata, snapshots, and schema evolution for collections of data files (typically Parquet) on object storage. **Where it fits:** Iceberg is the central table format in the S3 ecosystem. It turns a pile of Parquet files on S3 into a reliable, evolvable, SQL-queryable table — without requiring a database server. It has become the de-facto standard across engines (Spark, Trino, Flink, DuckDB). **Misconceptions / traps:** - Iceberg is not a query engine. It is a table format specification plus libraries. You still need Spark, Trino, DuckDB, or another engine to query Iceberg tables. - Hidden partitioning is powerful but not magic. Poor sort order or excessive partition granularity still produces small files and slow queries. **Key connections:** - `implements` **Lakehouse Architecture** — the primary table format for lakehouses - `depends_on` **Apache Parquet** — default data file format - `solves` **Small Files Problem** (compaction), **Schema Evolution** (column-ID-based evolution), **Partition Pruning Complexity** (hidden partitioning) - `constrained_by` **Metadata Overhead at Scale**, **Lack of Atomic Rename** - `scoped_to` **Table Formats**, **Lakehouse** **Sources:** - https://iceberg.apache.org/docs/latest/ (Docs, High) - https://iceberg.apache.org/spec/ (Spec, High) - https://github.com/apache/iceberg (GitHub, High) - https://iceberg.apache.org/docs/latest/aws/ (Docs, High) ### Delta Lake {#delta-lake} **What it is:** An open table format and storage layer providing ACID transactions, scalable metadata, and schema enforcement on data stored in object storage. Originally developed at Databricks. **Where it fits:** Delta Lake is the table format native to the Databricks ecosystem. It competes with Iceberg and Hudi but has the strongest integration with Spark-based platforms. On S3, Delta Lake requires external coordination for atomic commits due to the lack of atomic rename. **Misconceptions / traps:** - Delta Lake on S3 requires a DynamoDB-based log store or equivalent for multi-writer safety. Without it, concurrent writes can corrupt the transaction log. - "Delta" and "Databricks" are closely associated, but Delta is open-source. However, some advanced features (liquid clustering, predictive optimization) are Databricks-proprietary. 
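The first trap above has a concrete configuration answer, described in the Delta Lake S3 documentation cited in Guides 1 and 7. A minimal PySpark sketch of that multi-cluster setup follows; the property names are taken from that documentation, the DynamoDB table, region, and paths are placeholders, and the exact keys should be verified against the Delta release in use:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark and delta-storage-s3-dynamodb JARs are on the classpath.
spark = (
    SparkSession.builder.appName("delta-s3-multi-writer")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Route s3a:// commits through the DynamoDB-backed LogStore so concurrent
    # writers cannot corrupt the _delta_log.
    .config("spark.delta.logStore.s3a.impl",
            "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate()
)

# Any Delta write to S3 now coordinates commits through the DynamoDB table.
df = spark.range(1000)
df.write.format("delta").mode("append").save("s3a://my-bucket/tables/events")
```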
**Key connections:** - `implements` **Lakehouse Architecture** — provides ACID on data lakes - `depends_on` **Delta Lake Protocol**, **Apache Parquet** — protocol spec and data format - `solves` **Schema Evolution** — schema enforcement with evolution support - `constrained_by` **Vendor Lock-In** (Databricks ecosystem affinity), **Lack of Atomic Rename** (S3 limitation) - `scoped_to` **Table Formats**, **Lakehouse** **Sources:** - https://docs.delta.io/latest/index.html (Docs, High) - https://github.com/delta-io/delta (GitHub, High) - https://github.com/delta-io/delta/blob/master/PROTOCOL.md (Spec, High) - https://docs.delta.io/latest/delta-storage.html (Docs, High) ### Apache Hudi {#apache-hudi} **What it is:** A table format and data management framework optimized for incremental data processing — upserts, deletes, and change data capture — on object storage. **Where it fits:** Hudi occupies the niche of record-level mutations on S3 data. Where Iceberg and Delta focus on batch analytics, Hudi's strength is CDC ingestion and near-real-time upserts — making it the choice for pipelines that need to update individual records. **Misconceptions / traps:** - Hudi has two table types (Copy-on-Write and Merge-on-Read) with very different performance profiles. Choosing the wrong one is a common early mistake. - Hudi's operational complexity (compaction scheduling, cleaning policies, indexing) is higher than Iceberg or Delta. Budget for operational overhead. **Key connections:** - `implements` **Lakehouse Architecture** — provides incremental processing on lakes - `depends_on` **Apache Hudi Spec**, **Apache Parquet** — specification and data format - `solves` **Legacy Ingestion Bottlenecks** (incremental ingestion), **Schema Evolution** - `scoped_to` **Table Formats**, **Lakehouse** **Sources:** - https://hudi.apache.org/docs/overview (Docs, High) - https://github.com/apache/hudi (GitHub, High) - https://hudi.apache.org/docs/s3_hoodie (Docs, Medium) ### DuckDB {#duckdb} **What it is:** An in-process analytical database engine (like SQLite for analytics) that reads Parquet, Iceberg, and other formats directly from S3 without requiring a server or cluster. **Where it fits:** DuckDB fills the gap between "I need to explore this S3 data" and "I need to deploy a Spark cluster." It brings fast columnar analytics to a single machine, reading S3 data directly — ideal for development, ad-hoc analysis, and embedded analytics. **Misconceptions / traps:** - DuckDB is single-node. It does not scale horizontally. For petabyte-scale queries, you still need Spark, Trino, or StarRocks. - DuckDB reads from S3 over HTTP. Performance is bottlenecked by network throughput and S3 request latency, especially with many small files. **Key connections:** - `depends_on` **Apache Parquet**, **Apache Arrow** — reads Parquet, processes in Arrow format - `constrained_by` **Small Files Problem**, **Object Listing Performance** — performance degrades with too many small S3 objects - **Natural Language Querying** `augments` DuckDB — LLMs can generate SQL for DuckDB - `scoped_to` **S3**, **Lakehouse** **Sources:** - https://duckdb.org/docs/ (Docs, High) - https://duckdb.org/docs/extensions/httpfs/s3api (Docs, High) - https://github.com/duckdb/duckdb (GitHub, High) ### Trino {#trino} **What it is:** A distributed SQL query engine for federated analytics across heterogeneous data sources, with deep support for S3-backed data lakes and lakehouses. **Where it fits:** Trino is the multi-engine query layer for S3 lakehouses. 
It queries Iceberg, Delta, Hudi, and raw Parquet on S3 through connectors — and can join S3 data with operational databases in a single query. **Misconceptions / traps:** - Trino is a query engine, not a storage engine. It reads from S3 but does not manage data. Writes go through table format commit protocols. - Trino requires a coordinator and workers — operational overhead is higher than DuckDB. Use DuckDB for single-user exploration; Trino for multi-user production queries. **Key connections:** - `depends_on` **Apache Parquet** — reads Parquet files from S3 - `used_by` **Lakehouse Architecture** — a primary query engine for lakehouses - `constrained_by` **Small Files Problem**, **Object Listing Performance** — performance affected by S3 access patterns - **Natural Language Querying** `augments` Trino — LLMs generate SQL for Trino - `scoped_to` **S3**, **Lakehouse** **Sources:** - https://trino.io/docs/current/ (Docs, High) - https://trino.io/docs/current/object-storage.html (Docs, High) - https://trino.io/docs/current/connector/iceberg.html (Docs, High) - https://github.com/trinodb/trino (GitHub, High) ### ClickHouse {#clickhouse} **What it is:** A column-oriented DBMS designed for real-time analytical queries, with native support for reading from and writing to S3. **Where it fits:** ClickHouse occupies the performance tier above pure lakehouse queries. It can use S3 as a storage backend (S3-backed MergeTree) while maintaining its own columnar indexes for sub-second query performance — bridging the gap between S3 data lakes and dedicated analytics databases. **Misconceptions / traps:** - ClickHouse with S3 storage is not the same as querying S3 directly. ClickHouse maintains local indexes and metadata for performance; it uses S3 for durability and cost. - The S3 table function (for ad-hoc S3 reads) and the S3-backed MergeTree engine (for persistent tables) are different features with different performance characteristics. **Key connections:** - `depends_on` **Apache Parquet** — reads/writes Parquet for S3 interop - `implements` **Separation of Storage and Compute** — S3-backed storage with independent compute - `scoped_to` **S3**, **Lakehouse** **Sources:** - https://clickhouse.com/docs (Docs, High) - https://clickhouse.com/docs/en/integrations/s3 (Docs, High) - https://github.com/ClickHouse/ClickHouse (GitHub, High) ### Apache Spark {#apache-spark} **What it is:** A distributed compute engine for large-scale data processing — batch ETL, streaming, SQL, and machine learning — over S3-stored data. **Where it fits:** Spark is the workhorse of the S3 data ecosystem. It is the primary engine for building and maintaining lakehouse tables (Iceberg, Delta, Hudi), running ETL pipelines, and processing data at petabyte scale. **Misconceptions / traps:** - Spark's S3 access goes through the Hadoop S3A connector, not a native S3 client. S3A configuration (committers, credential providers, connection pooling) is a common source of operational issues. - Spark produces small files by default when writing with high parallelism. Use coalesce, repartition, or table format compaction to control output file sizes. 
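The small-files trap above is usually addressed at write time rather than after the fact. A minimal PySpark sketch, with hypothetical bucket, prefixes, and an illustrative partition count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-write").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/bronze/events/")  # hypothetical input path

# Cap the number of output files: one file per DataFrame partition.
# 64 is illustrative; size it so each output file lands in the hundreds of MB.
(
    df.repartition(64)
      .write.mode("overwrite")
      .option("maxRecordsPerFile", 5_000_000)  # extra guardrail on per-file size
      .parquet("s3a://my-bucket/silver/events/")
)
```

Table-format compaction (Iceberg rewrite, Delta OPTIMIZE, Hudi clustering) is the complementary fix for files that have already landed.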
**Key connections:** - `used_by` **Lakehouse Architecture**, **Medallion Architecture** — the primary compute engine - `constrained_by` **Small Files Problem** — high parallelism produces many small output files - `scoped_to` **S3**, **Data Lake** **Sources:** - https://spark.apache.org/docs/latest/ (Docs, High) - https://spark.apache.org/docs/latest/cloud-integration.html (Docs, High) - https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html (Docs, High) - https://github.com/apache/spark (GitHub, High) ### LanceDB {#lancedb} **What it is:** A vector database that stores data in the Lance columnar format directly on object storage. Designed for serverless vector search without a separate index server. **Where it fits:** LanceDB is the S3-native option for vector search. Unlike Milvus or Pinecone, LanceDB stores both raw data and vector indexes as files on S3 — aligning with the separation of storage and compute principle and eliminating a separate infrastructure layer. **Misconceptions / traps:** - Serverless on S3 means higher query latency than in-memory vector databases. LanceDB trades latency for simplicity and cost. - LanceDB uses the Lance format, not Parquet. Data must be converted or ingested into Lance format for vector search. **Key connections:** - `indexes` **MinIO**, **AWS S3** — builds vector indexes over S3-stored data - `implements` **Hybrid S3 + Vector Index** — the canonical implementation of this pattern - `scoped_to` **Vector Indexing on Object Storage**, **S3** **Sources:** - https://lancedb.github.io/lancedb/ (Docs, High) - https://github.com/lancedb/lancedb (GitHub, High) - https://lancedb.github.io/lancedb/guides/storage/ (Docs, High) - https://github.com/lancedb/lance (GitHub, High) ### StarRocks {#starrocks} **What it is:** An MPP analytical database with native lakehouse capabilities, able to directly query S3 data in Parquet, ORC, and Iceberg formats. **Where it fits:** StarRocks bridges pure lakehouse queries (Trino) and dedicated analytical databases (ClickHouse). It can query S3 data directly like Trino but also cache hot data locally for sub-second performance, making it the choice when you need low-latency analytics over lakehouse data. **Misconceptions / traps:** - StarRocks' external table performance on S3 is comparable to Trino. The latency advantage comes from its local caching and materialized views — which require managing local storage. - Shared-data architecture on S3 is a newer feature. Evaluate maturity for your use case before production deployment. **Key connections:** - `depends_on` **Apache Parquet** — reads Parquet files from S3 - `used_by` **Lakehouse Architecture** — queries lakehouse data - `constrained_by` **Cold Scan Latency** — first-query performance limited by S3 access - `scoped_to` **S3**, **Lakehouse** **Sources:** - https://docs.starrocks.io/ (Docs, High) - https://github.com/StarRocks/starrocks (GitHub, High) - https://docs.starrocks.io/docs/deployment/shared_data/s3/ (Docs, Medium) ### Apache Flink {#apache-flink} **What it is:** A distributed stream processing framework that processes data in real-time, with S3 as checkpoint store, state backend, and output sink. **Where it fits:** Flink is the streaming complement to Spark's batch processing. In the S3 world, Flink continuously ingests data into lakehouse tables (Iceberg, Delta) and uses S3 for fault-tolerant checkpointing. 
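Pointing Flink checkpoints at S3 is configuration only; a minimal PyFlink sketch, assuming the S3 filesystem plugin (flink-s3-fs-hadoop or flink-s3-fs-presto) is installed under `plugins/` and using a hypothetical bucket:

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

# Sketch: fault-tolerant checkpoints written to S3.
# Config keys are standard Flink options; the bucket and intervals are illustrative.
conf = Configuration()
conf.set_string("state.checkpoints.dir", "s3://my-bucket/flink/checkpoints")
conf.set_string("execution.checkpointing.interval", "60 s")
conf.set_string("execution.checkpointing.min-pause", "10 s")

env = StreamExecutionEnvironment.get_execution_environment(conf)

env.from_collection([1, 2, 3]).print()  # placeholder pipeline
env.execute("s3-checkpoint-sketch")
```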
**Misconceptions / traps:** - Flink streaming writes to S3 inherently produce small files (one file per checkpoint interval per writer). Compaction is mandatory — either via the table format or a separate job. - Flink's S3 filesystem plugin requires careful configuration. The wrong S3 filesystem implementation (s3:// vs s3a:// vs s3p://) causes silent failures. **Key connections:** - `used_by` **Medallion Architecture**, **Lakehouse Architecture** — streaming data into lakehouse layers - `constrained_by` **Small Files Problem** — streaming writes produce many small files - `scoped_to` **S3**, **Data Lake** **Sources:** - https://nightlies.apache.org/flink/flink-docs-stable/ (Docs, High) - https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/s3/ (Docs, High) - https://github.com/apache/flink (GitHub, High) ## Standards ### S3 API {#s3-api} **What it is:** The HTTP-based API for object storage operations — PUT, GET, DELETE, LIST, multipart upload. The de-facto standard for object storage interoperability. **Where it fits:** The S3 API is the protocol layer that makes the entire ecosystem possible. Every object storage server (MinIO, Ceph, Ozone), every compute engine (Spark, DuckDB, Trino), and every table format operates against this API. **Misconceptions / traps:** - The S3 API is not formally standardized by any standards body. It is a de-facto standard defined by AWS's implementation. Compatibility varies across providers. - LIST is paginated at 1,000 objects per request with no server-side filtering beyond prefix. This is a fundamental performance constraint, not a configuration issue. **Key connections:** - `enables` **Lakehouse Architecture**, **Separation of Storage and Compute** — the interface that makes decoupled architectures possible - `solves` **Vendor Lock-In** — as a de-facto interoperability standard across providers - **AWS S3**, **MinIO**, **Ceph**, **Apache Ozone** `implements` S3 API — concrete implementations - `scoped_to` **S3** Note: Pain points **Object Listing Performance**, **Lack of Atomic Rename**, and **S3 Consistency Model Variance** reference S3 API as their origin in their definitions, but no formal edges connect S3 API to those pain points. **Sources:** - https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html (Docs, High) - https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html (Docs, High) - https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-auth-using-authorization-header.html (Spec, High) ### Apache Parquet {#apache-parquet} **What it is:** A columnar file format specification designed for efficient analytical queries. Stores data by column, enabling predicate pushdown, projection pruning, and compression. **Where it fits:** Parquet is the lingua franca of the S3 data ecosystem. Every table format (Iceberg, Delta, Hudi) defaults to Parquet as the data file format, and every query engine (Spark, DuckDB, Trino, ClickHouse) reads it natively. **Misconceptions / traps:** - Parquet is a file format, not a table format. A single Parquet file has no concept of schema evolution, transactions, or partitioning — those come from the table format layer. - Parquet row group size matters for S3 performance. Row groups that are too small increase S3 request overhead; too large wastes I/O for selective queries. 128MB-256MB is a common target. 
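Row group size is set at write time; a minimal pyarrow sketch (a local path is shown for brevity, but the same `write_table` call works against an S3 filesystem object):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Sketch: control Parquet row group size explicitly.
# row_group_size is in rows, so derive it from average row width:
# ~1 KB rows and a 128-256 MB target suggests low hundreds of thousands of rows.
n = 1_000_000
table = pa.table({
    "id": list(range(n)),
    "value": [i * 0.5 for i in range(n)],
})

pq.write_table(
    table,
    "events.parquet",        # or an s3:// path via pyarrow.fs.S3FileSystem
    row_group_size=250_000,  # illustrative; tune toward the size guidance above
    compression="zstd",
)
```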
**Key connections:** - `used_by` **DuckDB**, **Trino**, **Apache Spark**, **ClickHouse** — the universal analytics file format - `enables` **Lakehouse Architecture** — provides efficient columnar storage on S3 - `solves` **Cold Scan Latency** — columnar layout enables predicate pushdown, reducing I/O - `scoped_to` **S3**, **Table Formats** **Sources:** - https://parquet.apache.org/documentation/latest/ (Spec, High) - https://github.com/apache/parquet-format (GitHub, High) - https://parquet.apache.org/ (Docs, High) ### Apache Arrow {#apache-arrow} **What it is:** A cross-language in-memory columnar data format specification with libraries for zero-copy reads, IPC, and efficient analytics. **Where it fits:** Arrow sits between S3 storage (Parquet on disk) and compute (query execution in memory). It defines how columnar data is laid out in memory, eliminating serialization overhead when processing S3-stored Parquet data. **Misconceptions / traps:** - Arrow is an in-memory format, not a storage format. You do not "store Arrow files on S3" (though Arrow IPC files exist, they are not the primary use case). - Arrow and Parquet are complementary, not competing. Parquet is the on-disk format; Arrow is the in-memory format. Most engines read Parquet into Arrow for processing. **Key connections:** - `used_by` **DuckDB**, **Apache Spark** — in-memory processing format - `scoped_to` **S3**, **Table Formats** **Sources:** - https://arrow.apache.org/docs/format/Columnar.html (Spec, High) - https://arrow.apache.org/ (Docs, High) - https://github.com/apache/arrow (GitHub, High) ### Iceberg Table Spec {#iceberg-table-spec} **What it is:** The specification defining how a logical table is represented as metadata files, manifest lists, manifests, and data files on object storage. Provides ACID, schema evolution, hidden partitioning, and time-travel. **Where it fits:** The Iceberg spec is the blueprint that Apache Iceberg implements. It defines the metadata tree structure that turns a collection of Parquet files on S3 into a reliable, evolvable table — and enables any engine to read the same table consistently. **Misconceptions / traps:** - The spec defines behavior, not implementation. Different engines (Spark, Flink, Trino) may implement the spec at different levels of completeness. - Manifest files accumulate with every write. Without regular metadata cleanup (expire snapshots, remove orphan files), metadata overhead grows. **Key connections:** - `enables` **Lakehouse Architecture** — the specification that makes Iceberg-based lakehouses possible - `solves` **Schema Evolution** (column-ID-based evolution), **Partition Pruning Complexity** (partition specs in metadata) - `scoped_to` **Table Formats**, **Lakehouse** **Sources:** - https://iceberg.apache.org/spec/ (Spec, High) - https://github.com/apache/iceberg (GitHub, High) - https://iceberg.apache.org/docs/latest/ (Docs, High) ### Delta Lake Protocol {#delta-lake-protocol} **What it is:** The specification for ACID transaction logs over Parquet files on object storage. Defines how writes, deletes, and schema changes are recorded in a JSON-based commit log stored alongside data files. **Where it fits:** The Delta protocol is what makes Delta Lake tables transactional. The commit log serializes changes so concurrent readers and writers see consistent state — even on S3, where atomic rename is unavailable. **Misconceptions / traps:** - The Delta protocol requires either atomic rename or an external coordination mechanism (DynamoDB, Azure ADLS). 
On S3, multi-cluster writes are unsafe without a log store. - Protocol versions (reader/writer features) must be managed carefully. Upgrading to a newer protocol version may make older readers unable to open the table. **Key connections:** - `enables` **Lakehouse Architecture** — the spec that makes Delta Lake ACID possible - `solves` **Schema Evolution** — schema enforcement in the transaction log - **Delta Lake** `depends_on` Delta Lake Protocol - `scoped_to` **Table Formats**, **Lakehouse** **Sources:** - https://github.com/delta-io/delta/blob/master/PROTOCOL.md (Spec, High) - https://docs.delta.io/latest/index.html (Docs, High) - https://github.com/delta-io/delta (GitHub, High) ### Apache Hudi Spec {#apache-hudi-spec} **What it is:** The specification for managing incremental data processing on object storage — record-level upserts, deletes, change logs, and timeline-based metadata. **Where it fits:** The Hudi spec defines how to efficiently mutate individual records in S3-stored datasets. It is the specification behind Hudi's Copy-on-Write and Merge-on-Read table types, and its timeline abstraction tracks all changes. **Misconceptions / traps:** - The Hudi spec's timeline model is conceptually different from Iceberg's snapshot model and Delta's transaction log. Understanding the timeline abstraction is prerequisite to operating Hudi tables. - The RFC-based evolution model means the spec is a living document. Breaking changes can be introduced via RFCs. **Key connections:** - `enables` **Lakehouse Architecture** — makes incremental processing possible on data lakes - **Apache Hudi** `depends_on` Apache Hudi Spec - `scoped_to` **Table Formats**, **Lakehouse** **Sources:** - https://hudi.apache.org/tech-specs/ (Spec, High) - https://hudi.apache.org/docs/overview (Docs, High) - https://github.com/apache/hudi (GitHub, High) - https://github.com/apache/hudi/tree/master/rfc (Spec, High) ### ORC {#orc} **What it is:** Optimized Row Columnar file format specification — a columnar format with built-in indexing, compression, and predicate pushdown support, originally developed for the Hive ecosystem. **Where it fits:** ORC is the legacy columnar format in the Hadoop/Hive ecosystem. On S3, it serves the same role as Parquet — efficient columnar storage for analytical queries — but is primarily used in organizations with existing Hive investments. **Misconceptions / traps:** - ORC and Parquet are functionally similar for most workloads. The choice is usually driven by ecosystem (Hive → ORC, everything else → Parquet) rather than technical superiority. - ORC's built-in ACID support (for Hive) operates differently from table format ACID (Iceberg, Delta). They are not the same concept. **Key connections:** - `used_by` **Apache Spark**, **Trino** — supported as a data file format - `solves` **Cold Scan Latency** — columnar format enables predicate pushdown - `scoped_to` **S3**, **Table Formats** **Sources:** - https://orc.apache.org/specification/ (Spec, High) - https://orc.apache.org/docs/ (Docs, High) - https://github.com/apache/orc (GitHub, High) ### Apache Avro {#apache-avro} **What it is:** A row-based data serialization format with rich schema definition and built-in schema evolution support. Schemas are stored with the data. **Where it fits:** Avro is the ingestion format of the S3 ecosystem. Data flowing from Kafka, operational databases, and streaming systems into S3 often arrives in Avro — because Avro's schema-with-data approach handles the frequent schema changes typical of event streams. 
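Avro landed in S3 is typically rewritten to Parquet for analytics. A minimal PySpark sketch, assuming the external spark-avro package is available (for example via `--packages org.apache.spark:spark-avro_2.12:<your Spark version>`) and using a hypothetical bucket and an assumed `ingest_date` column:

```python
from pyspark.sql import SparkSession

# Sketch: convert Avro event data landed in S3 into Parquet for analytical reads.
spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

raw = spark.read.format("avro").load("s3a://my-bucket/landing/orders/")

(
    raw.write.mode("append")
       .partitionBy("ingest_date")   # assumes the landed records carry an ingest_date column
       .parquet("s3a://my-bucket/bronze/orders/")
)
```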
**Misconceptions / traps:** - Avro is a row-oriented format. It is efficient for writing and ingestion but inefficient for analytical queries compared to Parquet. Convert to Parquet after landing in S3. - Avro's schema evolution rules (backward/forward compatibility) are powerful but strict. Breaking changes silently corrupt data if compatibility modes are misconfigured. **Key connections:** - `used_by` **Apache Spark** — a supported input/output format - `solves` **Schema Evolution** — schema-with-data approach supports evolution - `scoped_to` **S3**, **Table Formats** **Sources:** - https://avro.apache.org/docs/current/specification/ (Spec, High) - https://github.com/apache/avro (GitHub, High) - https://avro.apache.org/ (Docs, High) ## Architectures ### Lakehouse Architecture {#lakehouse-architecture} **What it is:** A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access) by using a table format as the bridge layer. **Where it fits:** Lakehouse Architecture is the dominant architectural pattern in the S3 ecosystem. It eliminates the need for a separate data warehouse by adding reliability directly to S3-stored data through table formats. **Misconceptions / traps:** - A lakehouse does not mean "no data warehouse." It means the warehouse capabilities are applied to data lake storage. Some workloads may still benefit from a dedicated OLAP engine. - Lakehouse performance depends heavily on metadata management. Without catalog maintenance (snapshot expiration, orphan file cleanup), query planning degrades. **Key connections:** - `depends_on` **S3 API**, **Apache Parquet** — the storage interface and file format - `solves` **Cold Scan Latency** — metadata-driven query planning reduces unnecessary S3 scans - `constrained_by` **Metadata Overhead at Scale**, **Lack of Atomic Rename** - **Apache Iceberg**, **Delta Lake**, **Apache Hudi** `implements` Lakehouse Architecture - **Trino**, **Apache Spark**, **StarRocks**, **Apache Flink** `used_by` Lakehouse Architecture - `scoped_to` **Lakehouse**, **Object Storage** **Sources:** - https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf (Paper, High) - https://www.databricks.com/product/data-lakehouse (Docs, High) - https://docs.databricks.com/aws/en/lakehouse-architecture/ (Docs, High) ### Medallion Architecture {#medallion-architecture} **What it is:** A layered data quality pattern — Bronze (raw), Silver (cleansed), Gold (business-ready) — with each layer stored on object storage. **Where it fits:** Medallion is the most widely adopted data quality pattern within lakehouses. It organizes S3 data into progressive quality tiers, giving each tier a clear contract and making it safe for different consumers to read at different quality levels. **Misconceptions / traps:** - Three layers is a convention, not a rule. Some organizations use two layers; others add more. The pattern is about progressive refinement, not a fixed number of tiers. - Medallion does not solve the small files problem — it can worsen it. Each layer transformation may produce many small output files, especially with streaming Silver→Gold pipelines. 
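Both traps above (lakehouse metadata maintenance and per-layer small files) are normally handled with scheduled table-maintenance jobs. A minimal sketch using Iceberg's Spark procedures, where the catalog `my_catalog` and table `silver.events` are hypothetical:

```python
from pyspark.sql import SparkSession

# Sketch: routine maintenance for an Iceberg table on S3.
# Assumes an Iceberg-enabled Spark session with a configured catalog.
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small data files produced by streaming or high-parallelism writes.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'silver.events')")

# Expire old snapshots so manifest metadata stops growing without bound.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'silver.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Remove data files no longer referenced by any snapshot.
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'silver.events')")
```

In practice these run on a schedule (daily or hourly per table) rather than ad hoc.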
**Key connections:** - `is_a` **Lakehouse Architecture** — a specialization of the lakehouse pattern - `constrained_by` **Legacy Ingestion Bottlenecks**, **Small Files Problem** - **AWS S3** `used_by` Medallion Architecture — each layer resides on S3 - **Apache Spark**, **Apache Flink** `used_by` Medallion Architecture — compute engines for tier transformations - `scoped_to` **Lakehouse**, **Data Lake** **Sources:** - https://www.databricks.com/glossary/medallion-architecture (Docs, High) - https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion (Docs, High) - https://learn.microsoft.com/en-us/fabric/onelake/onelake-medallion-lakehouse-architecture (Docs, High) ### Separation of Storage and Compute {#separation-of-storage-and-compute} **What it is:** The design pattern of keeping data in S3 while running independent, elastically scaled compute engines against it. **Where it fits:** This is the foundational architectural principle of the S3 ecosystem. Every query engine, table format, and data pipeline in this index assumes storage and compute are separate — data stays in S3, compute spins up and down on demand. **Misconceptions / traps:** - Separation of storage and compute does not mean "no local storage." Caching, spill-to-disk, and local indexes are still used — the principle is that the source of truth is in S3. - Network latency between compute and S3 is the fundamental trade-off. Every query pays the cost of reading over HTTP instead of local disk. **Key connections:** - `depends_on` **S3 API** — the interface that enables decoupling - `solves` **Vendor Lock-In** — swap compute engines without moving data - `constrained_by` **Cold Scan Latency**, **Egress Cost** — the costs of network-based data access - **ClickHouse** `implements` Separation of Storage and Compute - `scoped_to` **S3**, **Object Storage** **Sources:** - https://docs.snowflake.com/en/user-guide/intro-key-concepts (Docs, High) - https://docs.databricks.com/aws/en/lakehouse-architecture/ (Docs, High) - https://www.databricks.com/glossary/data-lakehouse (Docs, High) ### Hybrid S3 + Vector Index {#hybrid-s3--vector-index} **What it is:** A pattern that stores raw data on S3 and maintains a vector index over embeddings that points back to S3 objects. **Where it fits:** This pattern bridges structured storage (S3) with semantic retrieval (vector search). It is the architecture behind RAG systems that ground LLM responses in S3-stored documents. **Misconceptions / traps:** - The vector index and the raw data can drift. If S3 objects are updated or deleted without updating the index, search results return stale or broken references. - Hybrid does not mean "query both simultaneously." Typically, vector search retrieves references first, then the application fetches the raw data from S3 in a second step. 
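The two-step flow described above (vector search returns references, then the raw objects are fetched from S3) looks roughly like the following LanceDB sketch; bucket names, the table schema, and the `embed` stub are hypothetical.

```python
import boto3
import lancedb

# Sketch of the hybrid pattern: the Lance-format index lives on S3,
# search returns rows that carry an s3_key pointer, and the raw object
# is fetched from S3 in a second step.
db = lancedb.connect("s3://my-bucket/lance")   # index files stored on S3
table = db.open_table("docs")                  # rows hold a vector plus s3_key metadata

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; must match the dimension used at index time.
    return [0.0] * 384

query_vec = embed("quarterly revenue by region")
hits = table.search(query_vec).limit(5).to_list()   # step 1: retrieve references

s3 = boto3.client("s3")
for hit in hits:
    obj = s3.get_object(Bucket="my-raw-bucket", Key=hit["s3_key"])  # step 2: fetch raw data
    document = obj["Body"].read()
```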
**Key connections:** - `depends_on` **S3 API** — raw data stored in S3 - `solves` **Cold Scan Latency** — pre-computed embeddings avoid scanning raw content - `constrained_by` **High Cloud Inference Cost** — generating embeddings is expensive - **LanceDB** `implements` Hybrid S3 + Vector Index - **Embedding Generation**, **Semantic Search** `enables` Hybrid S3 + Vector Index - `scoped_to` **Vector Indexing on Object Storage**, **S3** **Sources:** - https://aws.amazon.com/blogs/architecture/a-scalable-elastic-database-and-search-solution-for-1b-vectors-built-on-lancedb-and-amazon-s3/ (Blog, High) - https://milvus.io/docs/deploy_s3.md (Docs, High) - https://lancedb.github.io/lancedb/examples/serverless_lancedb_with_s3_and_lambda/ (Docs, High) ### Offline Embedding Pipeline {#offline-embedding-pipeline} **What it is:** A batch pattern where embeddings are generated from S3-stored data on a schedule, with resulting vectors written back to object storage or a vector index. **Where it fits:** This pattern is the cost-effective way to add semantic search to S3 data. Instead of real-time embedding on every query, data is vectorized in batch — keeping inference costs predictable and avoiding always-on GPU infrastructure. **Misconceptions / traps:** - "Offline" means batch, not "never updated." A daily or weekly refresh is typical. Freshness requirements determine the schedule. - Embedding pipeline failures can leave the vector index out of sync with S3 data. Idempotent, resumable pipelines are essential. **Key connections:** - `depends_on` **S3 API** — reads source data from and writes embeddings to S3 - `constrained_by` **High Cloud Inference Cost** — the motivating economic constraint - `scoped_to` **LLM-Assisted Data Systems**, **S3** **Sources:** - https://aws.amazon.com/blogs/big-data/generate-vector-embeddings-for-your-data-using-aws-lambda-as-a-processor-for-amazon-opensearch-ingestion/ (Blog, High) - https://github.com/aws-samples/text-embeddings-pipeline-for-rag (GitHub, High) - https://blog.skypilot.co/large-scale-embedding/ (Blog, Medium) ### Local Inference Stack {#local-inference-stack} **What it is:** A pattern of running ML/LLM models on local hardware against data stored in or pulled from S3, avoiding cloud-based inference APIs. **Where it fits:** This is the cost optimization pattern for LLM workloads over S3 data. When the volume of data to process is large enough, local inference (on-premise GPUs or edge devices) is orders of magnitude cheaper than per-token cloud API pricing. **Misconceptions / traps:** - "Local" does not mean "free." GPUs, power, cooling, and operational overhead have real costs. The break-even point depends on volume and model size. - Model quality may differ. Smaller local models (distilled, quantized) trade accuracy for cost. Evaluate whether the quality loss is acceptable for your use case. **Key connections:** - `solves` **High Cloud Inference Cost**, **Egress Cost** — eliminates per-token and egress charges - `scoped_to` **LLM-Assisted Data Systems**, **S3** **Sources:** - https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer/ (Docs, High) - https://github.com/ggml-org/llama.cpp (GitHub, High) - https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/ (Blog, High) ### Write-Audit-Publish {#write-audit-publish} **What it is:** A data quality pattern where data lands in a raw S3 zone, undergoes validation, and is promoted to a curated zone only after passing audits. 
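One common way to realize the land-validate-promote flow is with Iceberg branches and Spark; a minimal sketch, where the catalog, table, branch name, and audit check are all hypothetical and the exact SQL syntax depends on your Iceberg version:

```python
from pyspark.sql import SparkSession

# Sketch: Write-Audit-Publish using an Iceberg audit branch.
spark = SparkSession.builder.appName("wap-sketch").getOrCreate()

# Write: land new data on an audit branch instead of main.
spark.sql("ALTER TABLE my_catalog.curated.orders CREATE BRANCH IF NOT EXISTS audit")
spark.conf.set("spark.wap.branch", "audit")
incoming = spark.read.parquet("s3a://my-bucket/staging/orders/")
incoming.writeTo("my_catalog.curated.orders").append()

# Audit: validate the branch before anyone reads it from main.
bad_rows = spark.sql(
    "SELECT count(*) AS n FROM my_catalog.curated.orders VERSION AS OF 'audit' "
    "WHERE order_total < 0"
).first()["n"]

# Publish: fast-forward main to the audited branch only if checks pass.
if bad_rows == 0:
    spark.sql("CALL my_catalog.system.fast_forward('curated.orders', 'main', 'audit')")
```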
**Where it fits:** WAP is the quality gate for S3 data lakes. It prevents bad data from reaching production consumers by isolating writes in a staging area, running validation checks, and only publishing data that passes. **Misconceptions / traps:** - WAP requires branching or snapshot isolation. Without a table format that supports branches (Iceberg) or staging areas (lakeFS), implementing WAP on raw S3 is manual and error-prone. - Audit logic must be idempotent. If audits fail and data is re-submitted, the system must handle duplicates gracefully. **Key connections:** - `depends_on` **S3 API** — data lands in S3 for staging - `solves` **Schema Evolution** — catches incompatible changes before they affect consumers - `scoped_to` **Data Lake**, **S3** **Sources:** - https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/ (Blog, High) - https://iceberg.apache.org/docs/latest/ (Docs, High) ### Tiered Storage {#tiered-storage} **What it is:** Moving data between hot, warm, and cold storage tiers based on access frequency. S3 itself offers tiering (Standard, Infrequent Access, Glacier). **Where it fits:** Tiered storage is the cost optimization layer for S3 data. It ensures frequently accessed data is fast and expensive while archival data is slow and cheap — a critical pattern for large data lakes where 80%+ of data is rarely accessed. **Misconceptions / traps:** - Retrieval from cold tiers (Glacier, Deep Archive) has latency measured in minutes to hours. Do not tier data that might be needed for interactive queries. - S3 Intelligent-Tiering automates tier transitions but has per-object monitoring charges. For predictable access patterns, explicit lifecycle rules are cheaper. **Key connections:** - `solves` **Egress Cost** — keeps hot data close to compute, cold data in cheap tiers - `constrained_by` **Vendor Lock-In** — tiering policies and pricing are provider-specific - `scoped_to` **S3**, **Object Storage** **Sources:** - https://aws.amazon.com/s3/storage-classes/ (Docs, High) - https://kafka.apache.org/41/operations/tiered-storage/ (Docs, High) - https://docs.confluent.io/platform/current/clusters/tiered-storage.html (Docs, High)
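The explicit lifecycle rules mentioned in the traps above can be scripted directly against the S3 API; a minimal boto3 sketch, with a hypothetical bucket, prefix, and illustrative transition windows:

```python
import boto3

# Sketch: explicit lifecycle rules as the predictable-access-pattern alternative
# to Intelligent-Tiering. Bucket, prefix, and day thresholds are illustrative.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-historical-partitions",
                "Filter": {"Prefix": "bronze/events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```

Keep anything that interactive queries might touch out of the Glacier-class rules, since restores take minutes to hours.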