Architecture

Local Inference Stack

Summary

What it is

A pattern of running ML/LLM models on local hardware against data stored in or pulled from S3, avoiding cloud-based inference APIs.

Where it fits

This is the cost-optimization pattern for LLM workloads over S3 data. Once the volume of data to process is large enough, local inference (on-premise GPUs or edge devices) can be orders of magnitude cheaper than per-token cloud API pricing.
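
As a rough illustration of where the break-even sits, the sketch below compares a flat monthly local-hardware budget against linearly scaling per-token API pricing. Every number here (price per 1K tokens, monthly hardware cost, sustained throughput) is an assumed placeholder, not a quoted price; substitute your own figures.

  # Back-of-the-envelope break-even sketch. All constants are illustrative
  # assumptions, not vendor quotes.
  CLOUD_PRICE_PER_1K_TOKENS = 0.01        # assumed blended $/1K tokens for a hosted API
  LOCAL_MONTHLY_COST = 3000.0             # assumed GPU amortization + power + ops, $/month
  LOCAL_THROUGHPUT_TOKENS_PER_SEC = 2000  # assumed sustained tokens/sec on local hardware

  def cloud_cost(tokens: float) -> float:
      """Cloud API cost scales linearly with token volume."""
      return tokens / 1000 * CLOUD_PRICE_PER_1K_TOKENS

  def local_capacity_tokens_per_month() -> float:
      """Tokens the local stack can process in a month at sustained throughput."""
      return LOCAL_THROUGHPUT_TOKENS_PER_SEC * 60 * 60 * 24 * 30

  # Break-even: the monthly token volume at which the flat local cost
  # matches the linearly scaling cloud bill.
  break_even_tokens = LOCAL_MONTHLY_COST / CLOUD_PRICE_PER_1K_TOKENS * 1000

  print(f"Break-even volume: {break_even_tokens / 1e6:.0f}M tokens/month")
  print(f"Local capacity:    {local_capacity_tokens_per_month() / 1e6:.0f}M tokens/month")
  print(f"Cloud cost at local capacity: ${cloud_cost(local_capacity_tokens_per_month()):,.0f}/month")

With these assumed numbers the break-even is around 300M tokens per month, and a fully utilized local stack processes far more than that, which is where the "orders of magnitude" claim can hold. Below the break-even volume, the cloud API is cheaper.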

Misconceptions / Traps

  • "Local" does not mean "free." GPUs, power, cooling, and operational overhead have real costs. The break-even point depends on volume and model size.
  • Model quality may differ. Smaller local models (distilled, quantized) trade accuracy for cost. Evaluate whether the quality loss is acceptable for your use case.

Key Connections

  • solves High Cloud Inference Cost, Egress Cost — eliminates per-token and egress charges
  • scoped_to LLM-Assisted Data Systems, S3

Definition

What it is

A pattern of running ML/LLM models on local hardware (on-premise or edge) against data stored in or pulled from S3, avoiding cloud-based inference APIs entirely.
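
A minimal sketch of the pattern, assuming boto3 for S3 access and llama-cpp-python for local inference; the bucket name, object keys, and model path are placeholders, not real resources.

  import boto3
  from llama_cpp import Llama

  s3 = boto3.client("s3")

  # 1. Pull the source document out of S3 (the only cloud interaction).
  obj = s3.get_object(Bucket="example-bucket", Key="docs/report.txt")
  document = obj["Body"].read().decode("utf-8")

  # 2. Run inference entirely on local hardware; no per-token API charges.
  llm = Llama(model_path="/models/local-model.gguf", n_ctx=4096)
  result = llm(
      # Rough character-based truncation to stay within the local context window.
      f"Summarize the following document in three sentences:\n\n{document[:8000]}",
      max_tokens=256,
  )
  summary = result["choices"][0]["text"].strip()

  # 3. Optionally write the derived output back to S3 alongside the source.
  s3.put_object(
      Bucket="example-bucket",
      Key="docs/report.summary.txt",
      Body=summary.encode("utf-8"),
  )

The same shape applies regardless of the model runtime: data comes out of S3 once, inference happens on hardware you own, and only derived artifacts (summaries, metadata, embeddings) go back to the cloud.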

Why it exists

Cloud inference costs scale linearly with volume. For organizations processing large amounts of S3-stored data, running models locally on owned hardware can be orders of magnitude cheaper, and it can also reduce egress charges when inference runs where the data already lives (for example, at the edge before upload, or against a local copy that is pulled from S3 only once).

Primary use cases

On-premise embedding generation, local metadata extraction from S3-stored documents, and edge inference for IoT data stored in S3.
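
A hedged sketch of the on-premise embedding use case, assuming sentence-transformers as the local model runtime; the bucket, prefix, and output layout are illustrative, and all-MiniLM-L6-v2 is just one example of a small model that runs comfortably on local hardware.

  import json
  import boto3
  from sentence_transformers import SentenceTransformer

  s3 = boto3.client("s3")
  model = SentenceTransformer("all-MiniLM-L6-v2")  # runs entirely on local CPU/GPU

  bucket = "example-bucket"
  paginator = s3.get_paginator("list_objects_v2")

  # Walk the text objects under a prefix, embed each one locally,
  # and store the resulting vector back in S3 as JSON.
  for page in paginator.paginate(Bucket=bucket, Prefix="docs/"):
      for item in page.get("Contents", []):
          key = item["Key"]
          if not key.endswith(".txt"):
              continue
          text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
          vector = model.encode(text).tolist()  # local inference, no API call
          s3.put_object(
              Bucket=bucket,
              Key=f"embeddings/{key}.json",
              Body=json.dumps({"key": key, "embedding": vector}).encode("utf-8"),
          )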

Relationships

Resources