Best-of-N Selection with Majority Voting: Approaching the Best-of-Infinity Limit in LLM Inference
Researchers have introduced a new analysis of the "best-of-N" selection strategy for large language model (LLM) inference, in which the final answer is chosen by majority vote over N independent model samples. The study, detailed in the preprint "Best-of-N with Majority Voting for Large Language Models," examines the theoretical limit of this approach as N approaches infinity, a regime the authors term Best-of-Infinity (BoI). While BoI demonstrates remarkable performance, its requirement of an infinite test-time budget renders it impractical. To bridge this gap, the authors propose an adaptive generation scheme that dynamically allocates computational resources based on answer agreement, alongside a framework for creating optimally weighted ensembles of multiple LLMs.
Theoretical Limits and Practical Adaptive Strategies
The core analysis shows that accuracy under best-of-N with majority voting improves as N grows, with BoI as the performance ceiling. Majority voting suppresses individual sampling errors and inconsistencies by exploiting the "wisdom of the crowd" principle: as long as the correct answer is the single most likely response, its vote share pulls ahead of every alternative as N increases. Generating an unbounded number of samples, however, is computationally prohibitive for real-world applications.
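The effect is easy to see in a toy simulation (an illustration of the general principle, not the paper's experimental setup): assume each sample returns the correct answer with some probability, and wrong answers are spread across several alternatives. The answer distribution and trial counts below are arbitrary assumptions for demonstration.

```python
import random
from collections import Counter

def majority_vote_accuracy(p_correct, wrong_answers, n_samples,
                           trials=2000, seed=0):
    """Estimate the accuracy of majority voting over n_samples i.i.d. draws.

    Toy model: each sample is the correct answer with probability
    p_correct, otherwise one of `wrong_answers` chosen uniformly.
    Ties in the vote are broken arbitrarily (Counter insertion order).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = Counter()
        for _ in range(n_samples):
            if rng.random() < p_correct:
                votes["correct"] += 1
            else:
                votes[rng.choice(wrong_answers)] += 1
        winner, _ = votes.most_common(1)[0]
        hits += winner == "correct"
    return hits / trials

# Correct answer appears 40% of the time; each wrong answer only ~20%.
# Because "correct" is the modal answer, accuracy climbs toward 1 with N.
for n in (1, 5, 25, 101):
    print(n, majority_vote_accuracy(0.4, ["a", "b", "c"], n))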
To address this fundamental limitation, the researchers developed an adaptive generation scheme. Instead of using a fixed, large N, this method dynamically determines how many samples to generate based on the observed agreement among answers. The process stops once a clear consensus emerges, thereby efficiently allocating inference-time computation and making the power of large-N strategies accessible with finite, often substantially reduced, budgets.
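A minimal sketch of such an adaptive loop is shown below. The specific stopping rule (stop once the leading answer is a fixed number of votes ahead of the runner-up) is an illustrative assumption, not the paper's exact criterion, and `sample_answer` stands in for a real LLM call.

```python
import random
from collections import Counter
from typing import Callable, Tuple

def adaptive_majority_vote(sample_answer: Callable[[], str],
                           min_lead: int = 3,
                           max_samples: int = 50) -> Tuple[str, int]:
    """Draw answers one at a time and stop early once consensus emerges.

    Illustrative stopping rule (an assumption, not the paper's exact
    criterion): stop as soon as the leading answer is `min_lead` votes
    ahead of the runner-up, or when the sample budget is exhausted.
    Returns (chosen_answer, samples_used).
    """
    votes = Counter()
    for used in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= min_lead:
            return ranked[0][0], used
    return votes.most_common(1)[0][0], max_samples

# Stub standing in for a real LLM call (hypothetical answer distribution):
rng = random.Random(1)
answer, used = adaptive_majority_vote(
    lambda: "42" if rng.random() < 0.7 else "41")
print(answer, used)  # consensus usually forms well before the 50-sample cap
```

The appeal of this pattern is that easy queries, where samples agree immediately, terminate after a handful of generations, while only genuinely ambiguous queries consume the full budget.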
Optimal Weighted Ensembles of Multiple LLMs
Moving beyond sampling from a single model, the work extends the framework to weighted ensembles of multiple LLMs. The research demonstrates that a strategically combined mixture of different models can consistently outperform any single constituent model, even the strongest one. This finding underscores the value of diversity in model architectures and training data.
Critically, the paper formulates the search for the optimal ensemble weighting as a mixed-integer linear programming (MILP) problem. This formulation allows for the efficient computation of the precise weight assigned to each model in the ensemble to maximize expected performance on a given task, providing a principled and automated method for creating superior composite AI systems.
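As a simplified stand-in for the paper's MILP formulation, the sketch below exhaustively searches small integer weights for a weighted-vote ensemble on a validation set. The three models and their answers are hypothetical data invented for illustration; a real implementation would hand the same objective to an MILP solver rather than enumerate.

```python
from itertools import product
from collections import defaultdict

def weighted_vote(answers, weights):
    """Pick the answer with the highest total weight across models."""
    score = defaultdict(float)
    for ans, w in zip(answers, weights):
        score[ans] += w
    return max(score, key=score.get)

def ensemble_accuracy(model_answers, gold, weights):
    """Fraction of questions the weighted ensemble answers correctly."""
    per_question = zip(*model_answers)  # answers to each question, per model
    correct = sum(weighted_vote(q, weights) == g
                  for q, g in zip(per_question, gold))
    return correct / len(gold)

def best_integer_weights(model_answers, gold, max_w=3):
    """Exhaustive search over small integer weights -- a brute-force
    toy stand-in for the paper's MILP formulation."""
    best_w, best_acc = None, -1.0
    for w in product(range(max_w + 1), repeat=len(model_answers)):
        if not any(w):
            continue  # skip the all-zero weighting
        acc = ensemble_accuracy(model_answers, gold, w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Hypothetical validation answers for three models on six questions.
# Each model alone scores 4/6, but their errors fall on different questions.
gold = ["a", "b", "c", "d", "e", "f"]
m1 = ["a", "b", "c", "x", "e", "x"]
m2 = ["a", "x", "c", "d", "x", "f"]
m3 = ["x", "b", "x", "d", "e", "f"]
weights, acc = best_integer_weights([m1, m2, m3], gold)
print(weights, acc)
```

Because the models err on different questions, the best mixture answers all six correctly, beating every constituent model, which is exactly the phenomenon the weighted-ensemble analysis formalizes.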
Experimental Validation and Performance Gains
The proposed methods were evaluated extensively across multiple benchmarks. Results confirm that the adaptive scheme closely matches the performance of large-N sampling at a fraction of the computational cost. The optimally weighted ensembles likewise bore out the theoretical claim, achieving higher accuracy and robustness than any individual model in the mixture. Together, these experiments establish the practical viability of both approaches for enhancing LLM reliability and output quality.
Why This Matters for AI Development
- Bridging Theory and Practice: The work provides a theoretical understanding of best-of-N limits (BoI) and delivers practical, adaptive algorithms to harness near-optimal performance with finite compute, making advanced inference strategies viable for production systems.
- Efficient Compute Allocation: The adaptive generation scheme represents a shift towards compute-aware AI, dynamically investing resources where uncertainty is high, which is crucial for scalable and cost-effective deployment of large models.
- Superior Model Performance: The optimal ensemble framework offers a clear, mathematically grounded path to build systems that are more capable than any single available LLM, pushing the frontier of what is achievable with current model portfolios.
- Automated Ensemble Design: Formulating optimal weighting as a MILP problem removes guesswork from ensemble creation, allowing for the automated construction of high-performance, multi-model AI agents.