Architecture
Repeatable design patterns that combine multiple technologies to solve structural problems.
52 nodes
A unified architecture combining data lake storage (files on S3) with warehouse capabilities (ACID, schema enforcement, SQL access…
A layered data quality pattern — Bronze (raw), Silver (cleansed), Gold (business-ready) — with each layer stored on object storage…
The design pattern of keeping data in S3 while running independent, elastically scaled compute engines against it.
A pattern that stores raw data on S3 and maintains a vector index over embeddings that points back to S3 objects.
A batch pattern where embeddings are generated from S3-stored data on a schedule, with resulting vectors written back to object st…
A pattern of running ML/LLM models on local hardware against data stored in or pulled from S3, avoiding cloud-based inference APIs…
A data quality pattern where data lands in a raw S3 zone, undergoes validation, and is promoted to a curated zone only after passi…
Moving data between hot, warm, and cold storage tiers based on access frequency. S3 itself offers tiering (Standard, Infrequent Ac…
An erasure coding scheme that distributes data fragments and parity blocks across geographically separated sites, providing durabi…
An architecture placing NVMe flash as a high-performance local storage tier beneath the S3 API, serving hot objects with microseco…
An architecture that streams data directly from storage devices to GPU memory, bypassing the CPU and system memory entirely. Uses …
Using RDMA network transport for microsecond-level object storage access within high-performance computing clusters, bypassing ker…
Placing a cache layer (SSD, Alluxio, CDN, or in-memory cache) in front of S3 to serve frequently accessed objects with lower laten…
Using S3 as the durable repository for ML model checkpoints, trained model artifacts, training logs, and experiment metadata. A ce…
Streaming training data directly from S3 into GPU training loops during ML model training, avoiding the need to download entire da…
Storing ML feature vectors and embedding tables on S3 in columnar formats (Parquet, Lance), enabling cost-effective persistence an…
A continuous pipeline that regenerates vector embeddings as source data in S3 changes, keeping vector indexes in sync with the lat…
Bidirectional replication between two or more S3-compatible storage sites where all sites accept writes simultaneously, with confl…
A one-way replication pattern where data collected at edge S3-compatible storage nodes is continuously replicated to a central S3 …
Using S3 Object Lock to create a tamper-proof backup vault where backup data cannot be deleted or modified until the retention per…
A defense-in-depth backup architecture combining S3 Object Lock, air-gapped replication, anomaly detection on access patterns, and…
A metadata pattern that tracks which rows in a data file have been logically deleted or updated, using a compact bitmap instead of…
An architectural pattern adapting Log-Structured Merge-tree storage to object storage, where writes are batched into sorted append…
The background maintenance operation that merges many small data files into fewer, larger files within a table format (Iceberg, De…
The architecture pattern of capturing row-level changes (inserts, updates, deletes) from operational databases and applying them t…
The practice of restricting access to specific rows or columns within lakehouse tables based on user identity, role, or policy, en…
The combination of data encryption (at rest and in transit) with key management service (KMS) integration to protect S3-stored dat…
The set of architectural strategies for ensuring that multiple tenants (customers, business units, or environments) sharing an S3-…
The architecture pattern of using retrieval-augmented generation (RAG) to answer natural language questions against structured dat…
The practice of physically organizing data files within a table by the values of one or more columns, so that queries filtering on…
The practice of deliberately targeting optimal data file sizes (typically 128 MB to 1 GB for Parquet on S3) to balance S3 request …
The practice of recording a tamper-evident history of all data access, modification, and governance events within an S3-based lake…
The process of replacing personally identifiable information (PII) in S3-stored datasets with non-reversible or reversible tokens,…
The architectural decision between processing S3 data in periodic batch jobs (hourly/daily) versus continuous streaming ingestion,…
An architecture pattern where data ingestion into S3-based lakehouses is triggered by events (S3 notifications, Kafka messages, we…
The optimization technique used by table formats (especially Iceberg) to skip reading irrelevant manifest files during query plann…
Lakehouse design patterns that embed regulatory requirements (GDPR, CCPA, HIPAA, SOX) directly into the data architecture rather t…
The catalog-level capability to create lightweight named references (branches and tags) to specific table states, enabling isolate…
The practice of creating constrained, pre-filtered views over lakehouse tables that limit what data AI/LLM systems can access, pre…
The practice of splitting S3-stored structured and semi-structured data (Parquet files, JSON documents, CSV records) into semantic…
The discipline of designing, executing, and reporting reproducible performance tests for S3-based data systems, covering throughpu…
The practice of forecasting and provisioning storage, compute, and network resources for S3-based data systems based on projected …
Architectural approaches that combine multiple metadata systems (e.g., Glue Catalog for Iceberg tables, OpenMetadata for governanc…
Architectural strategies for enabling multiple table formats (Iceberg, Delta, Hudi), query engines (Spark, Trino, Flink), and cata…
A concurrency model for lakehouse table formats that uses distributed timelines rather than locks or optimistic retries, allowing …
A vector database architecture that separates index storage on object storage from query compute, using Inverted File Indexes (IVF…
The strategy of physically organizing table data files by column values so query engines can skip irrelevant files. On S3-backed l…
A security architecture where a control plane issues short-lived, narrowly scoped S3 credentials at query time rather than relying…
Automated rules that transition S3 objects between storage tiers (Standard → Infrequent Access → Glacier → Deep Archive) or expire…
The architectural pattern of using governed, ACID-transactional lakehouse tables on S3 as the single data substrate for AI/ML pipe…
An architectural pattern for co-locating heterogeneous data types — images, video, audio, PDFs, sensor streams — alongside structu…
A query-time data protection architecture that dynamically masks, tokenizes, or filters sensitive fields from S3-backed lakehouse …