SorryDB: A Dynamic Benchmark for AI in Formal Mathematics
Researchers have introduced SorryDB, a novel, dynamically updating benchmark designed to rigorously evaluate AI's capability to contribute to real-world formal mathematics. Unlike static benchmarks composed of competition-style problems, SorryDB is built from 78 active formalization projects on GitHub, specifically extracting open Lean tasks—`sorry` placeholders, where a stated result still awaits its proof. The aim is for AI systems that improve on this benchmark to be more aligned with actual community needs and better at navigating the complex dependencies inherent in large-scale mathematical libraries.
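To make the task format concrete, here is a minimal (illustrative, not taken from SorryDB) Lean 4 snippet of the kind of placeholder the benchmark harvests—a theorem statement whose proof has been deferred with the `sorry` keyword:

```lean
-- A formalization file may state a result and defer its proof with `sorry`.
-- SorryDB-style tasks ask an AI system to replace the `sorry` with a valid proof.
theorem add_comm' (a b : Nat) : a + b = b + a := by
  sorry
```

A successful submission would supply a proof term or tactic script that the Lean compiler accepts in place of the `sorry`, within the project's actual dependency context.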
The benchmark's dynamic nature, continuously refreshed with new tasks from ongoing projects, directly addresses the critical issue of test-set contamination in AI evaluation. It provides a robust, real-time metric for an agent's ability to engage with novel, unseen formalization work, moving beyond performance on cached or memorized problems.
Evaluating Current AI Approaches on Real-World Tasks
In an initial evaluation, the research team assessed a diverse collection of AI methods on a snapshot of 1,000 tasks from SorryDB. The tested approaches included generalist large language models (LLMs), agentic systems that can plan and execute proof steps, and specialized symbolic provers. The results revealed a complementary landscape of capabilities.
The best-performing system was an agentic approach built on Gemini Flash. Crucially, however, the study found that this agent was "not strictly better" than the alternatives: off-the-shelf LLMs, specialized provers, and even a simple curated list of common Lean tactics each demonstrated distinct strengths on different types of tasks within the benchmark.
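The "curated list of common tactics" baseline is simple to picture. A sketch of the idea (illustrative; the actual tactic list used in the study is not specified here) is to fire a fixed sequence of standard closers at each goal and keep whichever succeeds, which Lean's `first` combinator expresses directly:

```lean
-- Sketch of a tactic-list baseline: try each common closing tactic in order.
-- Many routine `sorry`s in real projects fall to one of these.
example (n : Nat) : n + 0 = n := by
  first
  | rfl    -- definitional equality
  | simp   -- simplifier with default lemmas
  | omega  -- linear arithmetic over integers and naturals
```

That such a baseline wins on some tasks where LLM-based systems fail underscores how heterogeneous real formalization goals are.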
Why This Matters for AI and Mathematics
The development of SorryDB represents a significant shift in how AI for formal math is measured and advanced. Its design principles directly target the gap between academic benchmarks and practical utility.
- Alignment with Real Work: By sourcing tasks from GitHub projects, it pushes AI development toward tools that are genuinely usable by mathematicians working on complex, interdependent proofs.
- Mitigating Data Contamination: The continuously updating stream of tasks prevents models from simply memorizing benchmark answers, ensuring evaluations test true reasoning and generalization.
- Revealing Complementary Strengths: The benchmark shows that no single AI approach is dominant, highlighting the need for hybrid systems that combine the strategic planning of agents with the brute-force search of provers and the knowledge of LLMs.
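One natural way to exploit these complementary strengths is a portfolio system that tries cheap strategies first and escalates to expensive ones. The sketch below is purely illustrative (all names and the toy "solvers" are hypothetical, not part of SorryDB or the evaluated systems); it shows the dispatch pattern, not a real prover integration:

```python
# Hypothetical portfolio dispatcher: run strategies in order of cost and
# return the first successful proof attempt. Real solvers would invoke a
# Lean process, a symbolic prover, or an LLM agent; here they are stubs.
def attempt_proof(goal, strategies):
    """Return (strategy_name, proof) from the first strategy that succeeds,
    or (None, None) if every strategy fails."""
    for name, solver in strategies:
        proof = solver(goal)
        if proof is not None:
            return name, proof
    return None, None

# Toy stand-ins for real components (illustrative only):
def tactic_list(goal):
    # Pretend the fixed tactic list only closes "easy" arithmetic goals.
    return "by simp" if goal.endswith("= n") else None

def llm_agent(goal):
    # Pretend the agent can always produce a candidate proof script.
    return f"-- agent-generated proof attempt for: {goal}"

strategies = [("tactics", tactic_list), ("agent", llm_agent)]
print(attempt_proof("n + 0 = n", strategies))
print(attempt_proof("some hard theorem", strategies))
```

The ordering encodes a cost model: symbolic methods are fast and precise where they apply, so they run first, with the slower agentic search reserved as a fallback.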
By providing a living, community-sourced testbed, SorryDB establishes a more authoritative and trustworthy framework for progress in automated theorem proving and AI-assisted formal verification, steering the field toward practical impact.