Guide 5

Vector Indexing on Object Storage — What's Real vs. Hype

Problem Framing

Vector databases and semantic search are heavily marketed features in the AI ecosystem. For engineers building on S3, the question is practical: can you build production vector search over S3-stored data, and what are the real trade-offs? The answer depends on data volume, latency requirements, and whether you are willing to run a separate infrastructure layer.

Relevant Nodes

  • Topics: Vector Indexing on Object Storage, LLM-Assisted Data Systems, S3
  • Technologies: LanceDB, AWS S3
  • Standards: S3 API
  • Architectures: Hybrid S3 + Vector Index, Offline Embedding Pipeline, Local Inference Stack
  • Model Classes: Embedding Model, Small / Distilled Model
  • LLM Capabilities: Embedding Generation, Semantic Search
  • Pain Points: High Cloud Inference Cost, Cold Scan Latency

Decision Path

  1. Decide if you need vector search at all:

    • Yes if your data is unstructured (documents, images, logs) and users need to find content by meaning.
    • Yes if you are building RAG systems grounded in S3-stored corpora.
    • No if your queries are structured (SQL filters, exact matches, aggregations). Table formats and SQL engines are the right tool.
    • Maybe if you want to combine semantic and structured search (hybrid search) — this is real but adds complexity.
  2. Choose your vector index architecture:

    • S3-native (LanceDB): Vector indexes stored as files on S3. Serverless, no separate infrastructure, lowest operational overhead. Trade-off: higher query latency, since queries read index data from S3. See the query sketch after this list.
    • Dedicated vector database (Milvus, Weaviate): Separate infrastructure with in-memory indexes. Lower latency, higher throughput. Trade-off: another system to operate, and you store data in two places (S3 + vector DB).
    • Managed service (OpenSearch, S3 Vectors): Cloud-managed vector search. Trade-off: vendor lock-in and cost at scale.
  3. Plan your embedding pipeline:

    • Source data lives in S3 → embedding model processes it → vectors are stored in the index
    • Batch (Offline Embedding Pipeline): Process S3 data on a schedule. Cost-predictable. Stale by design. See the pipeline sketch after this list.
    • Stream: Embed on ingest. Fresh but expensive and operationally complex.
    • Embedding model choice: Commercial APIs (OpenAI) for quality, open-source (sentence-transformers) for cost/privacy, small/distilled models for local inference.
  4. Understand what's real vs. hype:

    • Real: Vector search over thousands to millions of documents on S3. LanceDB handles 1B+ vectors on S3. RAG with S3-backed corpora works in production.
    • Real: Embedding generation dominates total cost. The index itself is cheap to store and serve; generating the embeddings is not (see the cost arithmetic after this list).
    • Hype: "Just add vector search to your data lake." Integration requires embedding pipelines, index maintenance, sync mechanisms, and relevance tuning.
    • Hype: "Vector search replaces SQL." It does not. It answers a different question (semantic similarity vs. predicate matching).

What Changed Over Time

  • Early vector databases (2020-2022) were standalone systems with no S3 story. Data had to be copied in.
  • S3-native vector search (LanceDB, built on the Lance format) emerged to align with the principle of separating storage and compute.
  • AWS announced S3 Vectors — native vector storage in S3 itself — signaling that vector search is moving into the storage layer.
  • Embedding model costs dropped significantly (open-source models, quantized models, distillation). This makes embedding pipelines more viable at the scale of S3-resident corpora.
  • The "RAG over S3 data" pattern has become a standard architecture, with AWS, Databricks, and LangChain providing reference implementations.

Sources