Best-of-$\infty$ -- Asymptotic Performance of Test-Time Compute

Researchers have introduced a theoretical framework called bo∞ (best-of-infinity) that analyzes the asymptotic performance of best-of-N LLM selection with majority voting as N approaches infinity. The study, detailed in arXiv:2509.21091v2, demonstrates that aggregating infinitely many stochastic samples converges to highly reliable answers, though the infinite limit itself is computationally infeasible. To bridge this gap, the authors propose an adaptive generation scheme and extend the framework to optimally weighted multi-model ensembles, with the weights computed via mixed-integer linear programming (MILP).

Best-of-N LLM Selection with Majority Voting: A Path to Infinite-Scale Inference

Researchers have introduced a novel theoretical framework for optimizing large language model (LLM) performance by scaling a simple majority-voting technique to its logical extreme. The study, detailed in the preprint arXiv:2509.21091v2, analyzes the "best-of-N" approach, where an LLM generates N candidate answers and the final output is selected via majority vote. The analysis focuses on the asymptotic limit as N approaches infinity—a method dubbed bo∞ (best-of-infinity). While this theoretical limit demonstrates impressive performance gains, it is computationally infeasible, requiring an infinite inference-time budget. To bridge this gap, the authors propose an adaptive generation scheme and extend the framework to weighted, multi-model ensembles, formulating the optimal weighting as a solvable optimization problem.
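
The intuition behind the bo∞ limit can be sketched with a toy Monte Carlo simulation (the answer set and probabilities below are invented for illustration, not taken from the paper): when the correct answer is the model's single most likely output, the accuracy of a best-of-N majority vote climbs toward certainty as N grows.

```python
import random
from collections import Counter

# Toy stand-in for a stochastic LLM: returns the correct answer "A" with
# probability 0.4 and two distractors with probability 0.3 each. "A" is the
# modal answer, so majority voting converges to it as N grows.
def sample_answer(rng):
    return rng.choices(["A", "B", "C"], weights=[0.4, 0.3, 0.3])[0]

def best_of_n_accuracy(n, trials=2000, seed=0):
    """Estimate how often the majority vote over n samples picks 'A'."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        votes = Counter(sample_answer(rng) for _ in range(n))
        if votes.most_common(1)[0][0] == "A":
            correct += 1
    return correct / trials

for n in [1, 5, 25, 125]:
    print(n, best_of_n_accuracy(n))
```

A single sample is right only about 40% of the time here, while the vote over 125 samples is right in the large majority of trials, illustrating the asymptotic gain the paper formalizes.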

From Infinite Theory to Adaptive, Practical Implementation

The core insight of the bo∞ analysis is that aggregating a vast number of stochastic samples from an LLM converges, by simple majority rule, to the model's most probable answer; whenever the correct answer is the modal one, this yields highly reliable accuracy. This leverages the statistical principle that repeated sampling averages out the model's inherent randomness. However, generating an infinite number of candidates is impossible in practice. The proposed solution is an adaptive generation scheme that dynamically determines how many samples (N) are needed. Instead of fixing N beforehand, the system generates answers sequentially and stops once a pre-defined level of consensus among the outputs is reached, thereby allocating compute efficiently according to the difficulty or ambiguity of each query.
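
A minimal sketch of such an adaptive stopping rule (the specific threshold criterion, parameter names, and defaults here are illustrative assumptions, not the authors' exact rule):

```python
import random
from collections import Counter

def adaptive_majority(generate, consensus=0.6, min_samples=5, max_samples=100):
    """Sample answers one at a time; stop early once the leading answer
    holds at least `consensus` of the votes (after a minimum number of
    samples), or once the hard budget `max_samples` is exhausted."""
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[generate()] += 1
        answer, count = votes.most_common(1)[0]
        if n >= min_samples and count / n >= consensus:
            return answer, n  # early stop: strong agreement reached
    return votes.most_common(1)[0][0], max_samples  # budget exhausted

# Usage with a toy stochastic generator standing in for an LLM call:
rng = random.Random(42)
answer, used = adaptive_majority(lambda: rng.choices(["A", "B"], [0.8, 0.2])[0])
```

Easy queries, where the outputs agree quickly, terminate after a handful of samples, while ambiguous queries consume more of the budget, which is the cost-allocation behavior the paper is after.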

Extending to Optimal Weighted Ensembles of Multiple LLMs

Moving beyond a single model, the research extends the best-of-N paradigm to heterogeneous ensembles comprising multiple, potentially different LLMs. Here, the selection is not just a simple vote but a weighted majority, where each model's votes are scaled by an assigned weight. The study proves that such a weighted mixture can strictly outperform any single constituent model within the ensemble. Crucially, the authors show that finding the optimal ensemble weighting—the set of weights that maximizes expected performance—can be formulated as a mixed-integer linear programming (MILP) problem. This formulation allows for the efficient computation of the best weighting scheme using standard optimization solvers.
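
The weighted vote itself is straightforward once the weights are fixed; the sketch below uses invented model names and weights purely for illustration (the paper's contribution is computing the optimal weights via MILP, which is not reproduced here):

```python
from collections import defaultdict

def weighted_vote(model_answers, weights):
    """model_answers: {model_name: answer}; weights: {model_name: float}.
    Each model's answer receives a score equal to its weight; the answer
    with the highest total weight wins."""
    scores = defaultdict(float)
    for model, answer in model_answers.items():
        scores[answer] += weights[model]
    return max(scores, key=scores.get)

answers = {"model_a": "Paris", "model_b": "Lyon", "model_c": "Paris"}
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
print(weighted_vote(answers, weights))  # -> "Paris"
```

Note that with a sufficiently large weight, a single strong model can overrule a uniform-vote majority of weaker ones, which is precisely the degree of freedom the MILP formulation optimizes over.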

Experimental Validation and Performance Gains

The paper supports its theoretical contributions with extensive experiments. These demonstrate the practical effectiveness of both the adaptive stopping rule and the optimally weighted ensembles. The adaptive scheme is shown to achieve performance close to the theoretical bo∞ limit but with a dramatically reduced and variable sample count, making it a cost-effective strategy for real-world deployment. Furthermore, experiments with ensembles of diverse LLMs confirm that the computed optimal weights lead to significant performance improvements over uniform voting or using the best single model, validating the MILP-based optimization approach.

Why This Matters for AI Development

  • Efficient Inference: The adaptive generation scheme provides a principled method to dynamically allocate computational resources during LLM inference, balancing cost and performance intelligently.
  • Ensemble Superiority: The work offers a formal proof and a practical method (MILP optimization) showing that carefully weighted combinations of multiple LLMs can be more powerful than any individual component, a key insight for building state-of-the-art AI systems.
  • Bridging Theory and Practice: It translates the compelling but impractical theory of infinite sampling (bo∞) into actionable algorithms—adaptive stopping and optimal ensemble weighting—that are immediately applicable to improve the reliability and accuracy of current LLMs.