SGLang
Definition
SGLang is an open-source LLM serving engine optimized for structured generation and prefix sharing, distributed under the Apache 2.0 license. Its core innovation, **RadixAttention**, uses a radix tree to identify and share KV-cache state across requests with overlapping prefixes. This improves throughput for workloads whose prompts share large structured prefixes (system instructions, few-shot examples, persistent context), because the shared prefix is computed once and then reused. RadixAttention `depends_on` remote storage backends for offloading evicted cold cache entries, which makes object storage such as S3 a natural durability target.
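To make the prefix-sharing idea concrete, here is a minimal sketch of a token-level prefix tree. It is not SGLang's implementation: a real radix tree compresses single-child chains into one edge and stores references to GPU KV-cache blocks, while this toy version only counts how many tokens could be reused. The class names, method names, and token IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token ID to the child node that extends this prefix.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix tree (an uncompressed stand-in for a radix tree)."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list) -> int:
        """Return how many leading tokens already have cached state."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list) -> int:
        """Record a prompt; return how many tokens had to be newly computed."""
        node, new_tokens = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node()
                new_tokens += 1
            node = node.children[tok]
        return new_tokens


cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]            # hypothetical token IDs for a shared system prompt
first = cache.insert(system_prompt + [1, 2])   # cold request: every token is new
second = cache.insert(system_prompt + [3])     # warm request: only the suffix is new
reused = cache.match_prefix(system_prompt + [8, 8])  # lookup for a third request
```

In this sketch the second request only adds its one-token suffix, and a third request sharing the same system prompt finds all five prefix tokens already present.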
Many LLM serving engines treat each request as independent and recompute the entire KV-cache for every prompt. Production workloads often have substantial prefix overlap: multi-tenant serving with shared system prompts, agentic workflows with persistent context, and structured generation with template prefixes. SGLang's RadixAttention exploits this redundancy at the engine level, and eviction to remote storage means the cached prefix tree is not bounded by GPU memory.
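The eviction behaviour described above can be illustrated with a small two-tier store. This is a sketch under stated assumptions, not SGLang's eviction code: it assumes a least-recently-used policy, caches whole prefixes rather than tree nodes, and uses a plain dictionary as a stand-in for a remote backend such as S3. `TieredPrefixStore`, `gpu_capacity`, and the byte-string payloads are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class TieredPrefixStore:
    """Toy two-tier cache: a bounded 'GPU' tier plus an unbounded 'remote' tier."""

    def __init__(self, gpu_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # prefix tuple -> KV payload, oldest entry first
        self.remote = {}           # stand-in for a remote storage backend

    def put(self, prefix: tuple, kv_blob: bytes) -> None:
        """Store a prefix in the GPU tier, spilling the coldest entries to remote."""
        self.gpu[prefix] = kv_blob
        self.gpu.move_to_end(prefix)            # mark as most recently used
        while len(self.gpu) > self.gpu_capacity:
            cold_prefix, cold_blob = self.gpu.popitem(last=False)
            self.remote[cold_prefix] = cold_blob

    def get(self, prefix: tuple):
        """Return the cached payload, promoting it from remote if needed."""
        if prefix in self.gpu:
            self.gpu.move_to_end(prefix)
            return self.gpu[prefix]
        if prefix in self.remote:
            blob = self.remote.pop(prefix)
            self.put(prefix, blob)              # promotion may evict another entry
            return blob
        return None                             # miss: the caller must recompute


store = TieredPrefixStore(gpu_capacity=2)
store.put((1, 2), b"kv-a")
store.put((1, 3), b"kv-b")
store.put((4,), b"kv-c")        # GPU tier is full, so (1, 2) spills to remote
restored = store.get((1, 2))    # fetched from remote; (1, 3) is evicted in turn
```

The point of the design is that a miss in the fast tier becomes a fetch rather than a full recomputation, at the cost of transfer latency from the slower tier.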
Typical use cases: structured generation with high prefix overlap, multi-tenant LLM serving, agentic workflows with persistent context, and function-calling pipelines with shared schema prefixes.