Best-of-$\infty$ -- Asymptotic Performance of Test-Time Compute

New research introduces Best-of-Infinity (Bo∞), an adaptive generation framework that efficiently approximates infinite majority voting for large language models. The method dynamically allocates inference-time compute based on answer agreement, achieving near-optimal performance with finite resources. The study also demonstrates that optimally weighted ensembles of multiple LLMs can surpass the performance of any individual constituent model.

Best-of-N to Infinity: New Research Proposes Adaptive, Efficient LLM Voting Scheme

New research proposes a method to harness the power of infinite majority voting for large language models (LLMs) without its prohibitive computational cost. The study, detailed in the preprint "Best-of-$\infty$ -- Asymptotic Performance of Test-Time Compute" (arXiv:2509.21091v2), introduces Best-of-Infinity (Bo∞) as a theoretical limit, presents an adaptive, practical framework that dynamically allocates inference-time compute based on answer agreement, and demonstrates that optimally weighted model ensembles can surpass any single constituent model.

The Theoretical Power and Practical Problem of Bo∞

The research begins by analyzing the best-of-N technique, where an LLM generates N candidate answers to a query, and the final output is selected by majority vote. The study proves that as N approaches infinity (Bo∞), the vote converges to the mode of the model's answer distribution, so accuracy saturates at the fraction of queries for which the model's most frequently generated answer is the correct one. However, this theoretical ideal presents a fundamental practical barrier: it requires an infinite test-time budget, making direct implementation impossible for real-world applications where computational resources are finite and costly.
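
As a toy illustration (not taken from the paper), a short simulation shows how majority-vote accuracy improves with N when the correct answer is the mode of the model's answer distribution, even without being a strict majority of the probability mass:

```python
import random
from collections import Counter

def majority_vote(samples):
    """Return the most common answer among the sampled candidates."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical answer distribution for one query: the correct answer "42"
# is the mode (p = 0.4) but holds less than half the probability mass.
answers, probs = ["42", "41", "43", "7"], [0.4, 0.3, 0.2, 0.1]

random.seed(0)
for n in (1, 5, 25, 125):
    trials = 2000
    correct = sum(
        majority_vote(random.choices(answers, probs, k=n)) == "42"
        for _ in range(trials)
    )
    print(f"N={n:>3}: accuracy ~ {correct / trials:.2f}")
```

Because "42" is the unique mode, the estimated accuracy climbs toward 1 as N grows, which is the Bo∞ limit for this query.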

Adaptive Generation: Efficiently Approaching the Infinity Limit

To bridge this gap between theory and practice, the authors propose an adaptive generation scheme. Instead of pre-defining a fixed, large N, the system dynamically determines how many samples are needed. It generates answers sequentially and stops once a sufficient level of agreement or consensus is reached among the outputs. This approach efficiently allocates inference-time computation, spending more resources on ambiguous queries that require more samples for a clear majority and fewer on straightforward ones, thereby closely approximating the Bo∞ performance with a finite, often substantially lower, average N.
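
One way to sketch such a stopping rule is a simple vote-margin test; this is an assumption for illustration, since the paper's exact agreement criterion is not reproduced in this summary:

```python
import random
from collections import Counter

def adaptive_vote(sample_answer, margin=4, max_samples=64):
    """Sample answers one at a time; stop once the leading answer is ahead
    of the runner-up by `margin` votes, or the budget runs out.
    Returns (chosen_answer, number_of_samples_used)."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead >= margin:
            break
    return counts.most_common(1)[0][0], n

random.seed(1)
easy = lambda: random.choices(["A", "B"], [0.9, 0.1])[0]   # clear-cut query
hard = lambda: random.choices(["A", "B"], [0.55, 0.45])[0] # ambiguous query
print(adaptive_vote(easy))   # typically stops after a handful of samples
print(adaptive_vote(hard))   # typically spends many more samples
```

The margin rule captures the key behavior described above: ambiguous queries, where the vote stays close, automatically receive more samples than clear-cut ones.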

Beyond Single Models: The Superiority of Weighted Ensembles

The framework's innovation extends beyond adaptive sampling for a single model. The researchers generalize it to weighted ensembles of multiple LLMs. In this setup, different models contribute to the candidate pool according to an assigned weight. The paper's key finding is that such a weighted mixture can achieve performance that outperforms any individual model within the ensemble. The challenge then becomes finding the optimal weighting scheme to maximize overall accuracy and reliability.
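
A toy simulation (with made-up per-model accuracies, not figures from the paper) shows why a mixture can beat every constituent: if two models excel on complementary query types, sampling from both pushes the per-sample probability of the correct answer above one half on all query types, so the majority vote recovers it everywhere:

```python
import random
from collections import Counter

random.seed(2)

# Assumed per-sample probability that each model answers correctly,
# by query type: model A excels at type 1, model B at type 2.
p_correct = {"A": {1: 0.8, 2: 0.3}, "B": {1: 0.3, 2: 0.8}}

def sample(model, qtype):
    return "right" if random.random() < p_correct[model][qtype] else "wrong"

def vote_accuracy(weights, n=201, trials=400):
    """Fraction of queries (half of each type) a weighted best-of-n vote
    answers correctly, each sample drawn from a model picked by `weights`."""
    wins = 0
    for t in range(trials):
        qtype = 1 if t % 2 == 0 else 2
        models = random.choices(["A", "B"], weights, k=n)
        votes = Counter(sample(m, qtype) for m in models)
        wins += votes.most_common(1)[0][0] == "right"
    return wins / trials

for w in ([1, 0], [0, 1], [0.5, 0.5]):
    print(w, vote_accuracy(w))
```

Each single model converges to the wrong answer on one query type (per-sample accuracy 0.3 < 0.5), capping it near 50% overall, while the equal-weight mixture has per-sample accuracy 0.55 on both types and wins most votes.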

Optimizing the Ensemble with Mixed-Integer Programming

To solve for the best possible ensemble, the researchers formulate the search for optimal ensemble weighting as a mixed-integer linear program (MILP). This mathematical framework allows for the efficient computation of the precise weight each model should have in the mixture to maximize the expected performance of the majority voting process. The use of MILP provides a rigorous and computationally tractable method for ensemble creation, moving beyond simple averaging or heuristic weighting strategies.
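
The paper's actual MILP formulation is not reproduced here; as a stand-in, the sketch below uses brute-force grid search over coarse weights and toy per-query accuracies to convey the objective being optimized, namely the fraction of queries on which the weighted per-sample probability of the correct answer exceeds one half (the Bo∞ success condition):

```python
from itertools import product

# Hypothetical held-out estimates: p[m][q] is the probability that a single
# sample from model m answers query q correctly (toy numbers, not the paper's).
p = {
    "A": [0.8, 0.3, 0.6],
    "B": [0.3, 0.8, 0.55],
    "C": [0.5, 0.5, 0.9],
}
models = list(p)

def boinf_accuracy(weights):
    """Under Bo-infinity, a query is answered correctly iff the weighted
    per-sample probability of the correct answer exceeds 1/2."""
    total = sum(weights.values())
    return sum(
        sum(weights[m] * p[m][q] for m in models) / total > 0.5
        for q in range(3)
    ) / 3

# Stand-in for the MILP: exhaustively score every coarse integer weighting.
best = max(
    (dict(zip(models, w)) for w in product(range(5), repeat=3) if any(w)),
    key=boinf_accuracy,
)
print(best, boinf_accuracy(best))
```

In this toy instance no single model exceeds 2/3 accuracy, yet a mixture answers all three queries correctly; the MILP in the paper solves this selection exactly and at scale rather than by enumeration.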

Experimental Validation and Performance Gains

The paper supports its theoretical contributions with extensive experiments. These empirical tests demonstrate the tangible effectiveness of both the adaptive generation scheme for single models and the optimally weighted ensemble approach. The results validate that the methods deliver significant performance improvements, efficiently leveraging compute to approach the theoretical benefits of large-scale and infinite sampling while remaining practically feasible.

Why This Matters: Key Takeaways for AI Development

  • Efficiency at Scale: The adaptive generation scheme provides a principled way to gain the benefits of massive sampling (Bo∞) with a smart, variable computational budget, making advanced inference techniques more viable.
  • Ensemble Superiority: The research formally shows that a properly weighted ensemble of LLMs is not just a fallback but a strategy that can definitively outperform every model in its pool, emphasizing the value of model diversity.
  • Practical Optimization: By framing ensemble weighting as a mixed-integer linear program, the work offers a rigorous, optimizable pipeline for creating state-of-the-art model mixtures, advancing beyond ad-hoc ensemble methods.
