Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
Abstract
Offloading large language model state to host memory during inference can reduce serving cost by supporting larger models, longer inputs, and larger batch sizes, but existing approaches do not account for latency SLOs and therefore either violate SLOs frequently or waste host memory. This paper presents Select-N, a latency-SLO-aware memory offloading system for LLM serving. Select-N exploits the deterministic per-layer decoder computation time of modern LLMs and introduces the offloading interval as a tunable knob to capture the tradeoff between SLO compliance and host memory usage. Based on this abstraction, Select-N uses a two-stage approach: an offline stage to derive an effective interval range and a runtime stage to adapt the interval at inference-iteration granularity according to hardware conditions. Evaluation shows that Select-N consistently meets SLOs and improves throughput over existing mechanisms by up to 1.85x by maximizing host memory utilization.