Major Audit Reveals Critical Flaws in MedCalc-Bench, a Key Medical AI Benchmark
A systematic audit of the widely used MedCalc-Bench has uncovered more than 20 critical errors in its clinical calculator implementations, calling into question the benchmark's validity for assessing genuine clinical reasoning in large language models (LLMs). The findings, detailed in a new paper (arXiv:2603.02222v1), show that a simple "open-book" prompting strategy lets models reach over 81% accuracy, surpassing even sophisticated reinforcement learning systems, and suggest that the benchmark primarily measures formula recall rather than diagnostic intelligence.
Benchmark Audit Uncovers Formula Inaccuracies and Runtime Bugs
The research team conducted a rigorous audit of the calculator code within the NeurIPS-published dataset. They identified and corrected more than 20 distinct errors, which ranged from critical formula inaccuracies affecting clinical validity to runtime bugs that could cause calculation failures. This discovery is significant because MedCalc-Bench is a cornerstone of evaluations like the HELM MedHELM leaderboard, where state-of-the-art direct prompting scores have plateaued around 35% on its Verified split.
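The paper does not enumerate its specific fixes in this summary, but the two error classes it names, formula inaccuracies and runtime bugs, are easy to picture. The sketch below is a hypothetical illustration (not taken from the dataset) using the well-known Cockcroft-Gault creatinine clearance: the buggy version applies the female adjustment additively instead of multiplicatively and assumes numeric input, while the corrected version fixes both.

```python
def crcl_buggy(age, weight_kg, scr_mg_dl, sex):
    # Formula inaccuracy: the 0.85 female adjustment is *added* rather
    # than multiplied in. Runtime bug: a string serum-creatinine value
    # raises a TypeError at the division.
    return (140 - age) * weight_kg / (72 * scr_mg_dl) + (0.85 if sex == "female" else 0)


def crcl_fixed(age, weight_kg, scr_mg_dl, sex):
    """Cockcroft-Gault creatinine clearance in mL/min."""
    scr = float(scr_mg_dl)  # coerce defensively; dataset fields may be strings
    if scr <= 0:
        raise ValueError("serum creatinine must be positive")
    base = (140 - age) * weight_kg / (72 * scr)
    return base * (0.85 if sex == "female" else 1.0)
```

For a 60-year-old, 72 kg female patient with SCr 1.0 mg/dL, the buggy version returns 80.85 mL/min while the correct value is 68.0 mL/min, a clinically meaningful gap of the kind the audit flags.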
Open-Book Prompting Surpasses Complex RL Training
In a striking demonstration, the researchers showed that a fundamental shift in the evaluation framing yields dramatically different results. By providing the model with the correct calculator specification at inference time—an "open-book" approach—accuracy for models like GLM-4.6V and GLM-4.7 soared from approximately 52% to 81-85%. This simple intervention, which requires no fine-tuning, outperforms the best previously published result of 74%, achieved by a complex reinforcement learning (RL) system trained with verifiable rewards.
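The open-book setup can be sketched in a few lines: rather than asking the model to recall the formula, the prompt carries the calculator specification alongside the patient note. The prompt wording, section labels, and spec format below are assumptions for illustration, not the paper's exact protocol.

```python
def build_open_book_prompt(spec: str, patient_note: str, question: str) -> str:
    """Assemble an 'open-book' prompt that supplies the calculator spec
    at inference time instead of relying on the model's memorized formulas."""
    return (
        "You are given the exact specification of the required clinical calculator. "
        "Use it; do not rely on memorized formulas.\n\n"
        f"CALCULATOR SPECIFICATION:\n{spec}\n\n"
        f"PATIENT NOTE:\n{patient_note}\n\n"
        f"QUESTION: {question}\n"
        "Show each substituted value, then report the final number."
    )


# Hypothetical example inputs:
spec = ("Cockcroft-Gault CrCl (mL/min) = "
        "(140 - age) * weight_kg * (0.85 if female) / (72 * SCr_mg_dL)")
prompt = build_open_book_prompt(
    spec,
    "62-year-old woman, 70 kg, serum creatinine 1.1 mg/dL",
    "What is the patient's creatinine clearance?",
)
```

The point of the paper's result is precisely that this framing change alone, with no training, moves accuracy more than sophisticated post-training did.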
New Upper Bound and a Call for Reframed Evaluation
Using GPT-5.2-Thinking, the study established a performance upper bound of 95-97% on the corrected benchmark. The residual errors were attributed not to model failure but to persistent ground-truth issues and ambiguities in the source dataset. Together, the findings point to a stark conclusion: MedCalc-Bench in its current form predominantly measures a model's ability to memorize medical formulas and execute precise arithmetic, not its capacity for nuanced clinical reasoning.
Why This Matters: Implications for AI Medical Evaluation
- Benchmark Validity is Critical: The discovery of pervasive errors in a published, high-profile dataset underscores the need for rigorous and ongoing audits of AI evaluation tools, especially in high-stakes fields like medicine.
- Framing Defines Results: The benchmark may be inadvertently testing the wrong skill. The dramatic performance leap with "open-book" access suggests it is better framed as a tool-use evaluation, assessing a model's ability to correctly utilize a provided clinical calculator, rather than as a test of embedded clinical knowledge.
- Pathways for Improvement: For developers, the research highlights that improving tool-use and specification following may be a more fruitful path than training models to internally memorize vast libraries of precise formulas. For the field, it calls for new benchmarks that more directly and robustly evaluate diagnostic reasoning and decision-making processes.
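If the benchmark is reframed as a tool-use evaluation, as the bullets above suggest, the harness itself can own the arithmetic and score only whether the model picks the right calculator and extracts the right arguments from the note. Everything below, the registry, the call format, and the scoring rule, is a hypothetical sketch of that design, not the benchmark's actual harness.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


# Registry of verified calculator implementations the harness trusts.
CALCULATORS = {"bmi": bmi}


def score_tool_call(model_call, gold_value, tol=0.01):
    """model_call: a dict like {"tool": "bmi", "args": {...}} parsed from
    the model's output. The harness executes the chosen calculator and
    checks the result against the gold answer within a relative tolerance,
    so the model is graded on selection and extraction, not arithmetic."""
    fn = CALCULATORS.get(model_call["tool"])
    if fn is None:
        return False
    result = fn(**model_call["args"])
    return abs(result - gold_value) <= tol * abs(gold_value)
```

Under this design, memorizing formula constants buys the model nothing; what is tested is specification following and correct value extraction, which is closer to how clinicians actually use calculators.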