Architecture

Decoupled Vector Search

A vector database architecture that separates index storage on object storage from query compute, using Inverted File Indexes (IVF) and Product Quantization (PQ) to compress the in-memory vector footprint by approximately 64x while fetching full-precision vectors from S3 only for final re-ranking.


Summary

What it is

Where it fits

Replaces monolithic, RAM-bound HNSW-based vector databases for billion-scale retrieval. The foundational architecture behind Amazon S3 Vectors and Databricks Vector Search. Enables enterprise RAG pipelines without provisioned vector database clusters by tiering the index to object storage and keeping only compressed cluster centroids in memory.

Misconceptions / Traps

Not prohibitively slow for every workload: warm query latency reaches approximately 100 ms, which is sufficient for non-real-time RAG.
Does not eliminate the need for embeddings — it changes where and how they are stored and queried.
Requires a dual-runtime query engine separating asynchronous I/O threads from CPU-bound distance computation to prevent network latency from starving compute cores.
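
The dual-runtime requirement in the last item can be illustrated with a minimal sketch. It assumes a toy in-memory dictionary in place of S3, and every name in it (FAKE_STORE, fetch_cluster, nearest_in_cluster, run_search) is hypothetical rather than taken from any real engine. The point is only the split: reads are awaited on an event loop, while distance computation runs on a separate worker pool.

```python
import asyncio
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an object store: cluster id -> full-precision vectors.
FAKE_STORE = {
    0: [[0.0, 0.0], [0.1, 0.2]],
    1: [[1.0, 1.0], [0.9, 1.1]],
}

async def fetch_cluster(cluster_id, delay=0.01):
    """I/O side: simulate a network read without blocking the event loop."""
    await asyncio.sleep(delay)  # stands in for an S3 byte-range GET
    return FAKE_STORE[cluster_id]

def nearest_in_cluster(query, vectors):
    """CPU side: exact Euclidean distance over the fetched vectors."""
    return min((math.dist(query, v), v) for v in vectors)

async def search(query, cluster_ids, pool):
    loop = asyncio.get_running_loop()
    # Issue all reads concurrently on the event loop.
    clusters = await asyncio.gather(*(fetch_cluster(c) for c in cluster_ids))
    # Hand distance computation to the worker pool so it does not stall pending I/O.
    partials = await asyncio.gather(
        *(loop.run_in_executor(pool, nearest_in_cluster, query, vecs) for vecs in clusters)
    )
    return min(partials)

def run_search(query, cluster_ids):
    with ThreadPoolExecutor(max_workers=2) as pool:
        return asyncio.run(search(query, cluster_ids, pool))

best = run_search([0.95, 1.0], [0, 1])  # (distance, vector) of the closest match
```

In CPython a thread pool does not give true parallelism for pure-Python arithmetic, so a production engine would more plausibly use native code or a process pool for the compute side; the thread pool here only keeps the example short.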

Key Connections

enables Amazon S3 Vectors — S3 Vectors implements this architecture natively at the storage layer
enables RAG over Structured Data — makes billion-scale semantic retrieval economically viable on S3
scoped_to Vector Indexing on Object Storage — the architecture that defines how vectors live on object storage

Definition

What it is

Why it exists

Storing full-precision HNSW graphs entirely in RAM becomes economically unviable beyond approximately 100 million vectors. Decoupled vector search maps IVF clusters to contiguous S3 object fragments, keeps only compressed centroids in memory, and issues concurrent byte-range reads to fetch full-precision vectors for final distance computation. This requires a dual-runtime query engine where asynchronous I/O threads are strictly separated from CPU-bound mathematical threads.
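
The query path described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by any product named here: a bytes object stands in for an S3 object, range_read stands in for an HTTP byte-range GET, the centroids are left uncompressed (PQ is omitted), and all function names are hypothetical.

```python
import math
import struct

DIM = 2
VEC_BYTES = DIM * 4  # float32 per component

def pack_clusters(clusters):
    """Lay clusters out contiguously; return the blob and per-cluster byte ranges."""
    blob, ranges, offset = bytearray(), [], 0
    for vectors in clusters:
        data = b"".join(struct.pack(f"<{DIM}f", *v) for v in vectors)
        ranges.append((offset, offset + len(data)))
        blob += data
        offset += len(data)
    return bytes(blob), ranges

def range_read(blob, start, end):
    """Stand-in for a byte-range read against an object store."""
    return blob[start:end]

def decode(data):
    """Unpack a byte fragment back into float32 vectors."""
    return [struct.unpack_from(f"<{DIM}f", data, i) for i in range(0, len(data), VEC_BYTES)]

def search(query, centroids, ranges, blob, nprobe=1, k=1):
    # Stage 1: rank clusters using only the in-memory centroids.
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    # Stage 2: read only the probed clusters and re-rank with exact distances.
    candidates = []
    for c in order[:nprobe]:
        for v in decode(range_read(blob, *ranges[c])):
            candidates.append((math.dist(query, v), v))
    return sorted(candidates)[:k]

# Toy data: two clusters of two vectors each, with hand-picked centroids.
clusters = [[(0.0, 0.0), (0.5, 0.0)], [(4.0, 4.0), (4.5, 4.0)]]
centroids = [(0.25, 0.0), (4.25, 4.0)]
blob, ranges = pack_clusters(clusters)
hits = search((4.4, 4.0), centroids, ranges, blob, nprobe=1, k=1)
```

Because each cluster occupies one contiguous byte range, a query that probes nprobe clusters needs only nprobe range reads, and those reads can be issued concurrently.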

Primary use cases

Billion-scale enterprise RAG pipelines, serverless vector retrieval via S3 Vectors, cost-efficient semantic search without provisioned vector database clusters.

Recent developments

Latest signals

LanceDB on S3 + AWS Lambda is the 2026 reference serverless implementation. Decoupled storage (Lance columnar format on S3) plus stateless query compute (Lambda) provides automatic scaling per workload with no always-on infrastructure to manage. This is the "vector database without a vector-database cluster" pattern. Per Grokipedia, "Vector Search in LanceDB", and AWS Architecture Blog, "Scalable Elastic Database for 1B+ Vectors on LanceDB + S3".
IVF-PQ at 200M vectors per bucket: partitions = √rows; subvectors = dims/16. In AWS's published reference implementation, each 200M-vector bucket holds LanceDB tables indexed with IVF-PQ on cosine distance, where the partition count is the square root of the row count and the subvector count is the vector dimensionality divided by 16. This is the empirical tuning formula that makes billion-scale serverless vector search work. Per AWS Architecture Blog, "Scalable 1B+ Vector Solution on LanceDB + S3".
LanceDB 2026 launches: Lance-native SQL via DuckDB, multi-bucket storage, 1.5M IOPS benchmarks. In Q1 2026, LanceDB shipped native DuckDB SQL retrieval against Lance, Uber-scale multi-bucket storage, and 1.5M IOPS benchmarks against object storage. By these benchmarks, the serverless-vector-search-on-S3 architecture is no longer constrained by IOPS ceilings. Per LanceDB, "AI-Native Multimodal Lakehouse".
Mindshare growth: 6.7% → 9.6% YoY (steepest among 9 leading vector databases). MarkTechPost's 2026 nine-database comparison ranks LanceDB as the fastest-growing entrant, driven by interest in serverless and multimodal architectures. The serverless-vector-search pattern is winning architectural mindshare. Per MarkTechPost, "Best Vector Databases in 2026: Pricing, Scale Limits, Architecture Tradeoffs".
IVF-PQ compresses in-memory footprint ~64×. Inverted File Indexes (IVF) and Product Quantization (PQ) keep only compressed centroids in RAM; full-precision vectors stay on S3 and are loaded via concurrent byte-range reads for final re-ranking. This is the technical mechanism that makes billion-scale serverless search work. Per LanceDB Blog, "IVF_PQ: Accelerate Vector Search by Creating Indices".
Dual-runtime engine requirement: async I/O threads separated from CPU-bound math. Production decoupled vector search requires strict separation of I/O-bound threads (fetching from S3) from CPU-bound threads (distance computation); single-threaded designs stall on I/O latency. This is the architectural constraint that determines which vector databases scale. Per Medium, "The Future of Vector Search: Exploring LanceDB for Billion-Scale Search".
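
The tuning rule and the ~64× figure in the signals above can be checked with simple arithmetic. The sketch below assumes float32 source vectors and 8-bit PQ codes (one byte per subvector); neither assumption is stated in the signals, and the 1024-dimension example and the helper names are hypothetical.

```python
import math

def ivf_pq_params(num_rows, dims):
    """Rule of thumb quoted above: partitions = sqrt(rows), subvectors = dims / 16."""
    return {
        "num_partitions": math.isqrt(num_rows),  # integer square root, rounded down
        "num_sub_vectors": dims // 16,
    }

def compression_ratio(dims, num_sub_vectors, bytes_per_component=4, bits_per_code=8):
    """Raw vector bytes divided by PQ code bytes (assumes float32 and 8-bit codes)."""
    raw_bytes = dims * bytes_per_component
    code_bytes = num_sub_vectors * bits_per_code / 8
    return raw_bytes / code_bytes

# Hypothetical workload: one 200M-row bucket of 1024-dimensional vectors.
params = ivf_pq_params(200_000_000, 1024)
ratio = compression_ratio(1024, params["num_sub_vectors"])
```

Under these assumptions a vector of d dimensions shrinks from 4d bytes to d/16 bytes, a ratio of 64 for any dimensionality divisible by 16. That is consistent with the ~64× figure, although the cited sources may arrive at it differently.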

Connections 5

Outbound 5

scoped_to 2

S3
Vector Indexing on Object Storage

enables 2

Amazon S3 Vectors
RAG over Structured Data

solves 1

Egress Cost

Resources 3

Docs High

aws.amazon.com/s3/features/vectors/

Official AWS documentation for S3 Vectors, the primary implementation of decoupled vector search on object storage, including index limits and pricing.

Blog High

aws.amazon.com/blogs/aws/introducing-amazon-s3-vectors-first...

AWS engineering blog detailing the architecture of native vector support in S3, covering IVF indexing and approximate nearest neighbor mechanics.

Blog Medium

dgallitelli95.medium.com/serverless-rag-on-aws-amazon-bedroc...

Practical guide to building serverless RAG pipelines using decoupled vector search on S3 with Amazon Bedrock integration.
