New AI Serving Method SUN Boosts GPU Efficiency by Sharing Decode Workloads
A novel system for serving multiple large language models (LLMs) simultaneously improves GPU utilization by enabling something prior serving systems could not: cross-model sharing of decode execution. Researchers have introduced Shared Use of Next-token Prediction (SUN), a method that addresses the chronic inefficiency of memory-bound decoding in multi-model environments, where traditional resource partitioning leaves GPUs severely underutilized, especially under skewed workloads.
The core innovation of SUN lies in its architectural disaggregation of the Transformer model. It decomposes a standard decoder-only Transformer into two distinct modules: a task-specific prefill module and a shared, frozen decode module. By fine-tuning only the prefill component for each specific model or task, the system enables a single, universal decode module to be shared across different LLMs. This breakthrough allows for a model-agnostic decode routing policy that dynamically balances decode requests across a pooled set of GPU workers, maximizing hardware utilization and system throughput.
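Because every model shares the same frozen decode module, any decode request can be dispatched to any pooled worker. The routing idea can be sketched as a simple least-loaded scheduler; the class and worker names below are illustrative, not from the paper, and a production router would also weigh KV-cache placement and batch composition:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class DecodeWorker:
    # Workers are ordered by load so the least-busy one is popped first.
    load: int
    worker_id: str = field(compare=False)

class DecodeRouter:
    """Model-agnostic decode router (hypothetical sketch): since the decode
    module is shared and frozen, requests from different LLMs are
    interchangeable and can go to whichever pooled GPU worker is least loaded."""

    def __init__(self, worker_ids):
        self.heap = [DecodeWorker(0, wid) for wid in worker_ids]
        heapq.heapify(self.heap)

    def route(self, request_id: str) -> str:
        # Pop the least-loaded worker, assign the request, push it back.
        worker = heapq.heappop(self.heap)
        worker.load += 1
        heapq.heappush(self.heap, worker)
        return worker.worker_id

    def complete(self, worker_id: str) -> None:
        # Decrement load when a decode finishes (linear scan for simplicity).
        for w in self.heap:
            if w.worker_id == worker_id:
                w.load -= 1
                break
        heapq.heapify(self.heap)
```

Under skewed traffic, this is exactly where the pooled design pays off: a burst of requests for one model spreads across all decode workers instead of saturating that model's private partition.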
Performance Gains and Quantization Synergy
Empirical evaluations across diverse tasks and model families demonstrate that SUN achieves predictive accuracy comparable to full fine-tuning of individual models. Crucially, it maintains this performance while delivering superior system efficiency with fewer dedicated decode workers. In benchmark tests, SUN improves throughput per GPU by up to 2.0x over conventional disaggregated serving methods, while keeping time-per-output-token (TPOT), the key per-token latency metric, within 5% of the baseline.
Furthermore, the SUN architecture inherently facilitates advanced optimization techniques like low-bit quantization. An enhanced version, Quantized SUN (QSUN), leverages this by applying quantization specifically to the shared decode module. This synergy results in a 45% inference speedup compared to standard SUN, while maintaining comparable model accuracy and preserving all the core benefits of shared decoding execution.
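The appeal of quantizing the shared module is that one compressed copy of the decode weights serves every model in the pool. The article does not specify QSUN's quantization scheme, so the snippet below is only a generic symmetric per-tensor int8 sketch of the kind of low-bit compression that could be applied to the decode module's weights:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative, not QSUN's
    actual scheme): map float weights onto [-127, 127] with one scale."""
    scale = float(np.abs(weights).max()) / 127.0 if weights.size else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights; error is bounded by scale / 2.
    return q.astype(np.float32) * scale
```

Since the decode phase is memory-bound, shrinking its weights to 8 bits cuts the bytes moved per token, which is the plausible source of the reported speedup over full-precision SUN.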
Why This Matters for AI Infrastructure
- Unlocks GPU Efficiency: SUN directly tackles the problem of GPU underutilization in multi-LLM serving, a major cost and scalability bottleneck for AI providers.
- Enables Practical Multi-Model Deployment: By making decode execution a shared resource, it becomes more feasible to serve a diverse portfolio of models on the same hardware cluster efficiently.
- Future-Proofs with Quantization: The design naturally complements model compression techniques like quantization (QSUN), paving the way for even faster and more cost-effective inference.
- Maintains Model Fidelity: The approach achieves these system-level gains without sacrificing the task-specific accuracy users expect from fully fine-tuned models.