SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Shared Use of Next-token Prediction (SUN) is a novel AI serving method that improves GPU efficiency by enabling cross-model sharing of decode execution. The system decomposes Transformer models into task-specific prefill modules and a shared frozen decode module, achieving up to 2.0x higher throughput per GPU while maintaining accuracy comparable to full fine-tuning. An enhanced version, Quantized SUN (QSUN), delivers a further 45% inference speedup through low-bit quantization of the shared decode module.

New AI Serving Method SUN Boosts GPU Efficiency by Sharing Decode Workloads

A novel system for serving multiple large language models (LLMs) simultaneously promises to dramatically improve GPU utilization through a capability that conventional disaggregated serving lacks: cross-model sharing of decode execution. Researchers have introduced Shared Use of Next-token Prediction (SUN), a method that addresses the chronic inefficiency of memory-bound decoding in multi-model environments, where traditional per-model resource partitioning leads to severe GPU underutilization, especially under skewed workloads.

The core innovation of SUN lies in its architectural disaggregation of the Transformer model. It decomposes a standard decoder-only Transformer into two distinct modules: a task-specific prefill module and a shared, frozen decode module. By fine-tuning only the prefill component for each model or task, the system enables a single, universal decode module to be shared across different LLMs. This sharing in turn enables a model-agnostic decode routing policy that dynamically balances decode requests across a pooled set of GPU workers, maximizing hardware utilization and system throughput.
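Because every model shares the same frozen decode module, a router can dispatch any decode request to any pooled worker. The sketch below illustrates this idea with a hypothetical least-loaded routing policy; the class and parameter names (`DecodeRouter`, `request_tokens`) are illustrative assumptions, not the paper's API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class DecodeWorker:
    # Ordered by current load so the least-loaded worker is popped first.
    load: int
    gpu_id: int = field(compare=False)

class DecodeRouter:
    """Hypothetical model-agnostic router: since all models share one frozen
    decode module, any pooled GPU worker can serve any decode request."""

    def __init__(self, num_gpus: int):
        self.pool = [DecodeWorker(load=0, gpu_id=g) for g in range(num_gpus)]
        heapq.heapify(self.pool)

    def route(self, request_tokens: int) -> int:
        # Assign the request to the least-loaded worker, regardless of
        # which model produced the prefill.
        worker = heapq.heappop(self.pool)
        worker.load += request_tokens
        heapq.heappush(self.pool, worker)
        return worker.gpu_id

router = DecodeRouter(num_gpus=4)
assignments = [router.route(request_tokens=128) for _ in range(8)]
```

With uniform request sizes this degenerates to round-robin over the pool; the point is that no model-to-worker affinity is needed, which is exactly what per-model partitioning cannot offer.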

Performance Gains and Quantization Synergy

Empirical evaluations across diverse tasks and model families demonstrate that SUN achieves predictive accuracy comparable to full fine-tuning of individual models. Crucially, it maintains this performance while delivering superior system efficiency with fewer dedicated decode workers. In benchmark tests, SUN improves throughput per GPU by up to 2.0x over conventional disaggregated serving methods, while keeping the key decode latency metric, time-per-output-token (TPOT), within a 5% margin of the baseline.
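For readers unfamiliar with the metric, TPOT measures decode-phase latency averaged over the generated tokens after the first. A minimal sketch using the standard definition (this helper is illustrative, not from the SUN paper):

```python
def tpot_ms(total_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time-per-output-token: decode latency spread over every generated
    token after the first (standard serving-benchmark definition)."""
    return (total_latency_ms - ttft_ms) / (output_tokens - 1)

# A 2,050 ms request with a 50 ms first token and 101 output tokens
# decodes at 20 ms per token.
per_token = tpot_ms(total_latency_ms=2050.0, ttft_ms=50.0, output_tokens=101)
```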

Furthermore, the SUN architecture inherently facilitates advanced optimization techniques like low-bit quantization. An enhanced version, Quantized SUN (QSUN), leverages this by applying quantization specifically to the shared decode module. This synergy results in a 45% inference speedup compared to standard SUN, while maintaining comparable model accuracy and preserving all the core benefits of shared decoding execution.
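The paper does not spell out QSUN's exact quantization scheme here, but the idea of compressing only the shared decode module can be sketched with generic symmetric per-channel int8 quantization, a common low-bit baseline (the function names are assumptions):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-channel int8 quantization: a generic stand-in for the
    low-bit scheme QSUN applies to the shared decode module's weights."""
    # One scale per output channel (row), mapping the max magnitude to 127.
    scale = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # toy decode weight tile
q, s = quantize_int8(w)
max_err = float(np.abs(dequantize(q, s) - w).max())
```

Because the decode module is frozen and shared, it only needs to be quantized and calibrated once, and every model served on the cluster benefits; that is the synergy the QSUN result exploits.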

Why This Matters for AI Infrastructure

  • Unlocks GPU Efficiency: SUN directly tackles the problem of GPU underutilization in multi-LLM serving, a major cost and scalability bottleneck for AI providers.
  • Enables Practical Multi-Model Deployment: By making decode execution a shared resource, it becomes more feasible to serve a diverse portfolio of models on the same hardware cluster efficiently.
  • Future-Proofs with Quantization: The design naturally complements model compression techniques like quantization (QSUN), paving the way for even faster and more cost-effective inference.
  • Maintains Model Fidelity: The approach achieves these system-level gains without sacrificing the task-specific accuracy users expect from fully fine-tuned models.