Local Inference Stack
Summary
What it is
A pattern of running ML/LLM models on local hardware against data stored in or pulled from S3, avoiding cloud-based inference APIs.
Where it fits
This is the cost optimization pattern for LLM workloads over S3 data. When the volume of data to process is large enough, local inference (on-premise GPUs or edge devices) can be orders of magnitude cheaper than per-token cloud API pricing.
Misconceptions / Traps
- "Local" does not mean "free." GPUs, power, cooling, and operational overhead have real costs. The break-even point depends on volume and model size.
- Model quality may differ. Smaller local models (distilled, quantized) trade accuracy for cost. Evaluate whether the quality loss is acceptable for your use case.
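A rough back-of-the-envelope comparison makes the break-even intuition concrete. Every figure in the sketch below (token volume, API price, GPU amortization, power, operational overhead) is an illustrative assumption, not a benchmark; substitute your own numbers.

```python
# Illustrative break-even sketch: local GPU vs. per-token cloud API pricing.
# All numbers are assumptions for demonstration only.

monthly_tokens = 5_000_000_000          # assumed tokens processed per month
api_price_per_1k_tokens = 0.0005        # assumed cloud API price (USD per 1K tokens)

gpu_monthly_amortization = 1_200.0      # assumed GPU hardware cost spread over its lifetime (USD/month)
power_and_cooling = 300.0               # assumed electricity + cooling (USD/month)
ops_overhead = 1_000.0                  # assumed share of operational overhead (USD/month)

cloud_cost = monthly_tokens / 1_000 * api_price_per_1k_tokens
local_cost = gpu_monthly_amortization + power_and_cooling + ops_overhead

# Break-even volume: the token count at which fixed local costs equal variable API spend.
# (Ignores the small marginal power cost per token on local hardware.)
break_even_tokens = local_cost / api_price_per_1k_tokens * 1_000

print(f"cloud: ${cloud_cost:,.0f}/month, local: ${local_cost:,.0f}/month")
print(f"break-even at ~{break_even_tokens:,.0f} tokens/month")
```

With these assumed numbers the two options cost the same at roughly 5 billion tokens per month; above that volume the fixed local costs win, below it the pay-per-token API does.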
Key Connections
solves: High Cloud Inference Cost, Egress Cost (eliminates per-token and egress charges)
scoped_to: LLM-Assisted Data Systems, S3
Definition
What it is
A pattern of running ML/LLM models on local hardware (on-premise or edge) against data stored in or pulled from S3, avoiding cloud-based inference APIs entirely.
Why it exists
Cloud inference costs scale linearly with volume. For organizations processing large amounts of S3-stored data, running models locally on owned hardware can be orders of magnitude cheaper — and eliminates egress charges.
Primary use cases
On-premise embedding generation, local metadata extraction from S3-stored documents, edge inference for IoT data stored in S3.
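As a concrete illustration of the first use case, the sketch below pulls plain-text documents from an S3 bucket with boto3 and generates embeddings locally with a sentence-transformers model. The bucket name, prefix, and model name are placeholders, and error handling is omitted; treat it as a minimal outline of the pattern rather than a production pipeline.

```python
import boto3
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical bucket and prefix; substitute your own.
BUCKET = "example-document-bucket"
PREFIX = "docs/"

s3 = boto3.client("s3")
model = SentenceTransformer("all-MiniLM-L6-v2")  # small model that runs on commodity CPU/GPU

def iter_documents(bucket: str, prefix: str):
    """Yield (key, text) pairs for plain-text objects under the prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            yield obj["Key"], body.decode("utf-8", errors="replace")

# Embedding runs on local hardware, so there are no per-token inference API charges.
keys, texts = zip(*iter_documents(BUCKET, PREFIX))
embeddings = model.encode(list(texts), batch_size=32, show_progress_bar=True)
print(f"embedded {len(keys)} documents into vectors of dim {embeddings.shape[1]}")
```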
Relationships
Outbound Relationships
scoped_to: LLM-Assisted Data Systems, S3
Resources
- Official vLLM documentation for loading and serving LLM models directly from S3 using the Run:ai Model Streamer, enabling S3-backed local inference (a hedged usage sketch follows this list).
- Canonical repository for llama.cpp, one of the most widely used C/C++ LLM inference engines for running quantized models on commodity hardware.
- NVIDIA technical blog demonstrating how streaming models from S3 reduces cold-start latency and infrastructure costs.
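To show what the vLLM / Run:ai Model Streamer resource describes in practice, here is a minimal sketch of loading a model directly from an S3 URI. The bucket path is a placeholder, the required extra and flag names depend on your vLLM version, and the linked documentation is the authoritative reference.

```python
# Minimal sketch, assuming a vLLM build with the Run:ai Model Streamer extra installed
# (the vLLM docs describe something like `pip install "vllm[runai]"`).
# The s3:// path below is a placeholder, not a real bucket.
from vllm import LLM, SamplingParams

llm = LLM(
    model="s3://example-model-bucket/Llama-3-8B-Instruct",  # model weights streamed from S3
    load_format="runai_streamer",                           # stream weights instead of downloading them first
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize the benefits of local inference over S3 data."], params)
print(outputs[0].outputs[0].text)
```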