SorryDB: A Dynamic Benchmark for AI in Formal Mathematics
Researchers have introduced SorryDB, a novel, dynamically updating benchmark designed to rigorously evaluate AI's capability to contribute to real-world formal mathematics. Unlike static benchmarks composed of competition-style problems, SorryDB is built from 78 active formalization projects on GitHub, specifically extracting open Lean tasks—`sorry` placeholders, where a stated result still awaits its proof. The aim is for AI systems that improve on this benchmark to be more aligned with actual community needs and better at navigating the complex dependencies inherent in large-scale mathematical libraries.
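To make the task format concrete, here is a minimal (illustrative, not taken from SorryDB) Lean 4 snippet of the kind of placeholder the benchmark harvests—a theorem statement whose proof has been deferred with the `sorry` keyword:

```lean
-- A formalization file may state a result and defer its proof with `sorry`.
-- SorryDB-style tasks ask an AI system to replace the `sorry` with a valid proof.
theorem add_comm' (a b : Nat) : a + b = b + a := by
  sorry
```

A successful submission would supply a proof term or tactic script that the Lean compiler accepts in place of the `sorry`, within the project's actual dependency context.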
The benchmark's dynamic nature, continuously refreshed with new tasks from ongoing projects, directly addresses the critical issue of test-set contamination in AI evaluation. It provides a robust, real-time metric for an agent's ability to engage with novel, unseen formalization work, moving beyond performance on cached or memorized problems.
Evaluating Current AI Approaches on Real-World Tasks
In an initial evaluation, the research team assessed a diverse collection of AI methods on a snapshot of 1,000 tasks from SorryDB. The tested approaches included generalist large language models (LLMs), agentic systems that can plan and execute proof steps, and specialized symbolic provers. The results revealed a complementary landscape of capabilities.
The best-performing system was an agentic approach built on Gemini Flash. Crucially, however, the study found that this agent was "not strictly better" than the alternatives: off-the-shelf LLMs, specialized provers, and even a simple curated list of common Lean tactics each demonstrated distinct strengths on different types of tasks within the benchmark.
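The "curated list of common tactics" baseline is simple to picture. A sketch of the idea (illustrative; the actual tactic list used in the study is not specified here) is to fire a fixed sequence of standard closers at each goal and keep whichever succeeds, which Lean's `first` combinator expresses directly:

```lean
-- Sketch of a tactic-list baseline: try each common closing tactic in order.
-- Many routine `sorry`s in real projects fall to one of these.
example (n : Nat) : n + 0 = n := by
  first
  | rfl    -- definitional equality
  | simp   -- simplifier with default lemmas
  | omega  -- linear arithmetic over integers and naturals
```

That such a baseline wins on some tasks where LLM-based systems fail underscores how heterogeneous real formalization goals are.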
Why This Matters for AI and Mathematics
The development of SorryDB represents a significant shift in how AI for formal math is measured and advanced. Its design principles directly target the gap between academic benchmarks and practical utility.
- Alignment with Real Work: By sourcing tasks from GitHub projects, it pushes AI development toward tools that are genuinely usable by mathematicians working on complex, interdependent proofs.
- Mitigating Data Contamination: The continuously updating stream of tasks prevents models from simply memorizing benchmark answers, ensuring evaluations test true reasoning and generalization.
- Revealing Complementary Strengths: The benchmark shows that no single AI approach is dominant, highlighting the need for hybrid systems that combine the strategic planning of agents with the brute-force search of provers and the knowledge of LLMs.
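One natural way to exploit these complementary strengths is a portfolio system that tries cheap strategies first and escalates to expensive ones. The sketch below is purely illustrative (all names and the toy "solvers" are hypothetical, not part of SorryDB or the evaluated systems); it shows the dispatch pattern, not a real prover integration:

```python
# Hypothetical portfolio dispatcher: run strategies in order of cost and
# return the first successful proof attempt. Real solvers would invoke a
# Lean process, a symbolic prover, or an LLM agent; here they are stubs.
def attempt_proof(goal, strategies):
    """Return (strategy_name, proof) from the first strategy that succeeds,
    or (None, None) if every strategy fails."""
    for name, solver in strategies:
        proof = solver(goal)
        if proof is not None:
            return name, proof
    return None, None

# Toy stand-ins for real components (illustrative only):
def tactic_list(goal):
    # Pretend the fixed tactic list only closes "easy" arithmetic goals.
    return "by simp" if goal.endswith("= n") else None

def llm_agent(goal):
    # Pretend the agent can always produce a candidate proof script.
    return f"-- agent-generated proof attempt for: {goal}"

strategies = [("tactics", tactic_list), ("agent", llm_agent)]
print(attempt_proof("n + 0 = n", strategies))
print(attempt_proof("some hard theorem", strategies))
```

The ordering encodes a cost model: symbolic methods are fast and precise where they apply, so they run first, with the slower agentic search reserved as a fallback.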
By providing a living, community-sourced testbed, SorryDB establishes a more authoritative and trustworthy framework for progress in automated theorem proving and AI-assisted formal verification, steering the field toward practical impact.