Billion-Scale Vector Search on S3 — Decoupling Compute and Storage
Problem Framing
Storing full-precision HNSW indices in RAM becomes economically unviable beyond approximately 100 million vectors: a billion 768-dimensional float32 vectors occupy ~3 TB of memory for the raw vectors alone, before accounting for the graph structure. Decoupled vector search separates index storage (on S3) from query compute, using IVF+PQ quantization to shrink the in-memory search footprint by approximately 64x and fetching full-precision vectors from S3 only during the re-ranking phase. Engineers need to understand this architecture, its latency characteristics, and when to use S3 Vectors versus a dedicated vector database.
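The footprint arithmetic is worth checking explicitly; a few lines suffice (the 48-byte PQ code size matches the compression discussed in the IVF+PQ section):

```python
# Back-of-envelope memory footprint for billion-scale vector search.
# Assumptions: 1B vectors, 768 dims, float32 (4 bytes/dim), 48-byte PQ codes.
N = 1_000_000_000
DIMS = 768
BYTES_PER_FLOAT32 = 4
PQ_BYTES_PER_VECTOR = 48

full_precision_bytes = N * DIMS * BYTES_PER_FLOAT32   # raw vectors on S3
pq_bytes = N * PQ_BYTES_PER_VECTOR                    # compressed codes in RAM

print(f"full precision: {full_precision_bytes / 1e12:.2f} TB")  # 3.07 TB
print(f"PQ-compressed:  {pq_bytes / 1e9:.0f} GB")               # 48 GB
print(f"compression:    {full_precision_bytes // pq_bytes}x")   # 64x
```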
Relevant Nodes
- Topics: S3, Vector Indexing on Object Storage
- Technologies: Amazon S3 Vectors, LanceDB
- Architectures: Decoupled Vector Search, RAG over Structured Data, Hybrid S3 + Vector Index
- Pain Points: High Cloud Inference Cost, Egress Cost
Decision Path
Assess your vector scale and access pattern. Below 10M vectors, in-memory HNSW in a dedicated database (Milvus, Qdrant) is straightforward and fast. Between 10M and 100M, cost optimization becomes relevant. Beyond 100M, decoupled architectures are often the only economically viable option.
- Determine your query latency requirement: sub-10ms requires in-memory, sub-100ms is achievable with S3-backed quantized search, sub-1s opens up full S3 scan approaches.
Understand IVF+PQ mechanics. An Inverted File index (IVF) partitions the vector space into clusters around trained centroids. Product Quantization (PQ) splits each vector into subvectors and replaces each subvector with a one-byte codebook index, compressing, for example, a 3072-byte vector (768 float32 dimensions) down to 48 bytes. At query time, the engine scans only the most relevant clusters using the compressed representations, then fetches full-precision vectors from S3 for the top-k candidates.
- The compression ratio determines the in-memory footprint: 64x compression means 1B vectors fit in ~50 GB of RAM instead of ~3 TB.
- Recall degrades with aggressive quantization — tune nprobe (clusters searched) and PQ segments to balance recall vs. latency.
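The PQ half of this can be sketched with plain NumPy. This is a toy illustration, not a production quantizer: dimensions and codebook training (a few Lloyd's iterations) are scaled down, and a real system would use Faiss or an equivalent library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scale: 128-dim vectors, 8 subspaces of 16 dims, 256 centroids per
# subspace so each code fits in one uint8 byte. Compression here is
# 128 * 4 bytes -> 8 bytes = 64x, matching the ratio discussed above.
N, D, M, K = 4000, 128, 8, 256
SUB = D // M
X = rng.standard_normal((N, D)).astype(np.float32)

def train_codebook(sub, k=K, iters=5):
    """Tiny Lloyd's k-means for one subspace (illustration only)."""
    cents = sub[rng.choice(len(sub), k, replace=False)].copy()
    for _ in range(iters):
        d = ((sub[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = sub[assign == j]
            if len(pts):
                cents[j] = pts.mean(0)
    return cents

codebooks = [train_codebook(X[:, m * SUB:(m + 1) * SUB]) for m in range(M)]

# Encode: each vector becomes M one-byte centroid ids.
codes = np.empty((N, M), dtype=np.uint8)
for m, cb in enumerate(codebooks):
    sub = X[:, m * SUB:(m + 1) * SUB]
    codes[:, m] = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)

print("compression:", X.nbytes // codes.nbytes, "x")

# Asymmetric-distance query: precompute per-subspace distance tables to the
# query, then approximate each vector's distance as a sum of table lookups.
q = rng.standard_normal(D).astype(np.float32)
tables = np.stack([((cb - q[m * SUB:(m + 1) * SUB]) ** 2).sum(-1)
                   for m, cb in enumerate(codebooks)])     # (M, K)
approx = tables[np.arange(M), codes].sum(1)                # (N,)
top = approx.argsort()[:10]                                # coarse candidates
```

The `top` candidates are what a decoupled engine would then re-rank against full-precision vectors fetched from S3.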
Architect the dual-runtime engine. The search pipeline has two phases:
- Coarse search (in-memory): Scan quantized centroids and PQ codes to identify candidate vectors. This runs on CPU/GPU compute with the compressed index in RAM.
- Re-ranking (S3-backed): Fetch full-precision vectors for the top candidates from S3 and compute exact distances. Latency depends on S3 GET performance and the number of candidates.
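The two-phase shape of the pipeline can be sketched end to end. Everything here is a stand-in: a lossy float16 copy plays the role of the compressed in-RAM index, and a dict plays the role of per-vector S3 GETs (a real system would issue ranged reads against the object store).

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, SHORTLIST, TOP_K = 10_000, 128, 100, 10
vectors = rng.standard_normal((N, D)).astype(np.float32)

# Phase 2's store: a dict standing in for S3 GETs of full-precision vectors.
s3_store = {i: vectors[i] for i in range(N)}

# Phase 1's in-memory index: a lossy float16 copy standing in for PQ codes.
compressed = vectors.astype(np.float16)

def search(query, shortlist=SHORTLIST, k=TOP_K):
    # Coarse search: approximate distances over the compressed in-RAM copy.
    approx = ((compressed - query.astype(np.float16)) ** 2).sum(1)
    candidates = np.argsort(approx)[:shortlist]
    # Re-ranking: fetch full precision "from S3", compute exact distances.
    exact = {i: float(((s3_store[i] - query) ** 2).sum()) for i in candidates}
    return sorted(exact, key=exact.get)[:k]

top = search(rng.standard_normal(D).astype(np.float32))
```

Note the latency split this implies: the coarse scan touches only RAM, while every candidate in the shortlist costs one object-store read, which is why the shortlist size directly drives re-ranking latency.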
Configure S3 Vectors or LanceDB. S3 Vectors provides a managed API for storing and querying vectors directly on S3 with ~100ms warm-query latency. LanceDB stores vectors in Lance format on S3 with embedded IVF-PQ indices, offering self-managed control with similar latency characteristics.
- S3 Vectors: zero infrastructure, pay-per-query, best for serverless RAG.
- LanceDB: open-source, self-hosted, supports multimodal data (vectors + metadata + images in one table).
Set up warm caching for hot queries. The first query against cold S3 data incurs higher latency (200–500 ms). Subsequent queries against the same index partitions hit the warm path and return faster. For latency-sensitive workloads, pre-warm frequently accessed partitions with scheduled probe queries.
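A pre-warming loop can be as simple as the sketch below. The `run_query` stub, the partition ids, and the interval are placeholders: substitute a cheap real probe query against your actual index and wire the function into whatever scheduler you already run.

```python
import time

# Hypothetical pre-warmer: keeps hot index partitions on S3's warm path.
HOT_PARTITIONS = [3, 17, 42]   # partitions your query traffic concentrates on
PROBE_INTERVAL_S = 300         # re-probe before the warm path goes cold

def run_query(partition_id):
    """Stub for a cheap throwaway probe query against one index partition."""
    return {"partition": partition_id, "warmed_at": time.time()}

def prewarm(partitions=HOT_PARTITIONS):
    # Issue one probe per hot partition; schedule this function every
    # PROBE_INTERVAL_S seconds (cron, a Lambda timer, APScheduler, ...).
    return [run_query(p) for p in partitions]

results = prewarm()
```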
Benchmark latency against your SLA. S3 Vectors targets ~100ms for warm queries. LanceDB on S3 achieves similar latency for quantized search but re-ranking latency depends on the number of S3 GETs. Benchmark with your actual vector dimensionality, dataset size, and concurrency requirements before committing to an architecture.
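A minimal latency harness for that benchmarking might look like this; the stub query function is a placeholder for your real S3-backed search call, and warm-up iterations are discarded so cold-start reads don't skew the percentiles.

```python
import statistics
import time

def benchmark(query_fn, queries, warmup=5):
    """Measure per-query latency and report p50/p95 in milliseconds."""
    for q in queries[:warmup]:           # discard cold-start measurements
        query_fn(q)
    samples = []
    for q in queries:
        t0 = time.perf_counter()
        query_fn(q)
        samples.append((time.perf_counter() - t0) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)   # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "n": len(samples)}

# Example run with a stub standing in for a real S3-backed search:
stats = benchmark(lambda q: sum(q), [[i, i + 1] for i in range(200)])
```

Run it with your production vector dimensionality and realistic concurrency; single-threaded numbers understate tail latency under load.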
What Changed Over Time
- Early vector databases (2020–2022) assumed all indices fit in memory. HNSW was the default algorithm, optimized for low-latency recall at moderate scale.
- IVF+PQ on object storage was pioneered by research systems (Faiss, ScaNN) but required manual index management on S3.
- LanceDB (2023) introduced the Lance format, enabling self-describing vector indices stored natively on S3 with random-access performance.
- Amazon S3 Vectors (2025) made S3 itself vector-aware, eliminating the need for a separate database process for basic similarity search.
- The architectural pattern has converged: quantized coarse search in memory, full-precision re-ranking from object storage. The debate is now about managed vs. self-managed, not about whether decoupling works.