Hybrid S3 + Vector Index
A pattern that stores raw data on S3 and maintains a vector index over embeddings that points back to S3 objects.
Summary
This pattern bridges object storage (S3) with semantic retrieval (vector search). It is the architecture behind RAG systems that ground LLM responses in S3-stored documents.
- The vector index and the raw data can drift. If S3 objects are updated or deleted without updating the index, search results return stale or broken references.
- Hybrid does not mean "query both simultaneously." Typically, vector search retrieves references first, then the application fetches the raw data from S3 in a second step.
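The two points above can be sketched together: a search step that returns references only, followed by a fetch step that resolves them and notices drift. This is a minimal illustration, not a production design. The `IndexEntry` type, the `search` and `fetch` helpers, and the in-memory `store` dictionary are all hypothetical; in a real system `store.get` would be replaced by an S3 GetObject call and `search` by the vector database's own query API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One vector-index row: an embedding plus a pointer back to the object."""
    key: str       # object key (hypothetical layout)
    etag: str      # version marker recorded when the embedding was computed
    vector: tuple  # the embedding itself


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, k=2):
    """Step 1: rank index entries by similarity; returns references only."""
    ranked = sorted(index, key=lambda e: cosine(e.vector, query_vector), reverse=True)
    return ranked[:k]


def fetch(store, entries):
    """Step 2: resolve references against the object store and flag drift."""
    results = []
    for entry in entries:
        obj = store.get(entry.key)  # stand-in for an S3 GetObject call
        if obj is None:
            results.append((entry.key, "missing", None))       # deleted after indexing
        elif obj["etag"] != entry.etag:
            results.append((entry.key, "stale", obj["body"]))  # changed after indexing
        else:
            results.append((entry.key, "ok", obj["body"]))
    return results


# In-memory stand-in for a bucket: key -> {"etag", "body"}.
store = {
    "docs/a.txt": {"etag": "v1", "body": "S3 pricing overview"},
    "docs/b.txt": {"etag": "v2", "body": "Vector search basics (edited)"},
}
index = [
    IndexEntry("docs/a.txt", "v1", (1.0, 0.0)),
    IndexEntry("docs/b.txt", "v1", (0.0, 1.0)),  # object was updated later
    IndexEntry("docs/c.txt", "v1", (0.7, 0.7)),  # object was deleted later
]

hits = search(index, query_vector=(0.6, 0.8), k=3)
resolved = fetch(store, hits)
```

Recording a version marker next to each embedding is one way to detect drift at read time. It does not repair the index; stale or missing entries would still need to be re-embedded or removed by a separate process.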
- depends_on: S3 API (raw data stored in S3)
- solves: Cold Scan Latency (pre-computed embeddings avoid scanning raw content)
- constrained_by: High Cloud Inference Cost (generating embeddings is expensive)
- scoped_to: Vector Indexing on Object Storage, S3
- LanceDB implements Hybrid S3 + Vector Index
- Embedding Generation, Semantic Search enables Hybrid S3 + Vector Index
Definition
A pattern that stores raw data (documents, media, logs) on S3 and maintains a vector index (embeddings + similarity search) that points back to the S3 objects.
S3 is excellent for durable, cheap storage of unstructured content, but it has no concept of semantic similarity. A vector index adds a semantic retrieval layer without duplicating the raw data.
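To make "without duplicating the raw data" concrete, the sketch below builds index rows that hold only an embedding and the location of the source object. Everything here is illustrative: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `Pointer` and `build_index` are hypothetical names, and the SHA-256 fingerprint stands in for whatever version marker (such as the object's ETag) a real pipeline would record.

```python
import hashlib
from dataclasses import dataclass

DIM = 8  # toy dimensionality; real embedding models use far more


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return vec


@dataclass
class Pointer:
    """Index row: where the object lives plus its embedding, never the body."""
    bucket: str
    key: str
    fingerprint: str  # stand-in for a version marker such as the ETag
    vector: list


def build_index(bucket, objects):
    """Embed each object once and keep only the vector and its location."""
    return [
        Pointer(
            bucket=bucket,
            key=key,
            fingerprint=hashlib.sha256(body.encode()).hexdigest(),
            vector=toy_embed(body),
        )
        for key, body in objects.items()
    ]


# Hypothetical bucket contents: key -> raw text.
objects = {
    "logs/2024-01-01.txt": "disk latency spike on node seven",
    "docs/runbook.md": "how to restart the ingest service",
}
index = build_index("example-bucket", objects)
```

The raw text stays in the object store; the index grows with the number of objects and the embedding dimension, not with the size of the documents.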
Typical uses include retrieval-augmented generation (RAG) over S3-stored corpora, semantic document search, and content recommendation systems backed by S3 data.
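For the RAG case, the last step after retrieval is usually assembling a prompt that cites the fetched passages. The sketch below assumes the passages have already been retrieved and fetched; `build_prompt`, the bucket name, and the prompt wording are all hypothetical choices, not a prescribed format.

```python
def build_prompt(question, passages, bucket="example-bucket"):
    """Assemble a grounded prompt; each passage is cited by its S3 URI."""
    lines = ["Answer using only the sources below. Cite sources by number.", ""]
    for i, (key, text) in enumerate(passages, start=1):
        lines.append(f"[{i}] s3://{bucket}/{key}")  # pointer back to the object
        lines.append(text)
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical (key, text) pairs, as returned by the fetch step.
passages = [
    ("docs/runbook.md", "Restart the ingest service with the deploy tool."),
    ("docs/faq.md", "The ingest service reads from the raw-events prefix."),
]
prompt = build_prompt("How do I restart ingest?", passages)
```

Keeping the S3 URI next to each passage lets the application trace a generated answer back to the stored object it was grounded in.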
Resources
- AWS Architecture Blog describing a production-grade 1B+ vector search solution built on LanceDB with S3 as the storage layer, demonstrating the hybrid pattern at scale.
- Official Milvus documentation for configuring S3 as the object storage backend for vector data and index persistence.
- LanceDB's official example of running a serverless vector database directly on S3 with AWS Lambda.