Pie: A Programmable Serving System for Emerging LLM Applications
Abstract
Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built around a monolithic token-generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to integrate computation and I/O seamlessly, all within the application and without modifications to the serving system. Pie executes inferlets as WebAssembly modules, benefiting from WebAssembly's lightweight sandboxing. Our evaluation shows that Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while delivering 1.3x-3.4x better latency and throughput on agentic workflows by enabling application-specific optimizations.
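To make the inferlet model concrete, the sketch below imagines what a minimal inferlet might look like in Rust compiled to WebAssembly. The host-function names and signatures (kv_alloc, prefill, decode_step, kv_free) are illustrative assumptions, not Pie's actual API; the sketch only shows the inversion of control the abstract describes, in which the generation loop lives in application code that calls into fine-grained serving handlers.

```rust
// Hypothetical inferlet sketch. All host imports below are assumed
// names, not Pie's real interface. Compiled as a WebAssembly library
// (e.g., --crate-type=cdylib --target wasm32-unknown-unknown), these
// imports would be supplied by the serving system at instantiation.
extern "C" {
    fn kv_alloc(num_pages: u32) -> u64;               // reserve KV-cache pages, return a handle
    fn prefill(kv: u64, tokens_ptr: *const u32, len: u32); // run the prompt through the model
    fn decode_step(kv: u64) -> u32;                   // one forward pass -> next token id
    fn kv_free(kv: u64);                              // release the KV cache
}

const EOS: u32 = 2; // assumed end-of-sequence token id

/// Generate up to `max_new` tokens for a pre-tokenized prompt.
/// Because this loop runs inside the inferlet, an application could
/// stop early, manage the KV cache its own way, or interleave I/O
/// between steps, all without changes to the serving system.
pub fn generate(prompt: &[u32], max_new: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(max_new);
    unsafe {
        let kv = kv_alloc(64);
        prefill(kv, prompt.as_ptr(), prompt.len() as u32);
        for _ in 0..max_new {
            let tok = decode_step(kv);
            if tok == EOS {
                break; // application-defined stopping logic
            }
            out.push(tok);
        }
        kv_free(kv);
    }
    out
}
```

Under this reading, a different inferlet could replace the body of the loop with, say, speculative branching over forked KV caches or tool-call handling between decode steps, which is the kind of application-specific optimization the evaluation credits for the agentic-workflow gains.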
Publication
In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles