CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

CUDABench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on text-to-CUDA generation, assessing their ability to produce functional and performant GPU kernels from natural language descriptions. It introduces novel evaluation metrics including a Performance-Score based on roofline analysis to measure hardware efficiency beyond mere compilation success. The benchmark reveals significant gaps between compilation rates and functional correctness in current models, highlighting key challenges in automated high-performance computing.

Introducing CUDABench: A New Benchmark for Evaluating LLMs on Text-to-CUDA Generation

Researchers have introduced CUDABench, a comprehensive new benchmark designed to rigorously evaluate the ability of Large Language Models (LLMs) to generate functional and performant GPU kernels directly from natural language descriptions. This work addresses a critical gap in AI programming evaluation, moving beyond simple code translation to assess the more complex and general task of text-to-CUDA generation, which is essential for automating high-performance computing. The benchmark introduces novel evaluation metrics, including a Performance-Score based on roofline analysis, to measure not just if the code compiles, but if it runs correctly and utilizes GPU hardware efficiently.

Beyond Compilation: Measuring Functional Correctness and Performance

Current benchmarks often focus narrowly on whether an LLM can produce syntactically correct CUDA code that compiles. CUDABench pioneers a more holistic assessment through its Generative Verification Pipeline. This pipeline evaluates three critical dimensions: compilation success, functional correctness via execution-based testing, and hardware efficiency. The novel roofline-based Performance-Score is particularly significant, as it measures how well the generated kernel utilizes memory bandwidth and computational throughput—key factors in real-world GPU programming. Early benchmarking reveals a telling discrepancy: while some models achieve high compilation rates, their functional correctness remains low, highlighting a major challenge for the field.
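The paper does not publish the exact formula behind the Performance-Score, but the roofline idea it builds on can be sketched briefly. In a roofline model, a kernel's attainable throughput is capped either by peak compute or by memory bandwidth times its arithmetic intensity; a natural score is achieved throughput divided by that ceiling. The names `KernelProfile`, `peak_flops`, and `peak_bw` below are illustrative, not taken from CUDABench itself:

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    flops: float        # floating-point operations executed by the kernel
    bytes_moved: float  # bytes read from and written to DRAM
    runtime_s: float    # measured kernel runtime in seconds

def performance_score(p: KernelProfile, peak_flops: float, peak_bw: float) -> float:
    """Roofline-style score: achieved throughput / attainable ceiling (1.0 = at the roofline)."""
    intensity = p.flops / p.bytes_moved               # arithmetic intensity, FLOP/byte
    attainable = min(peak_flops, peak_bw * intensity) # roofline ceiling for this intensity
    achieved = p.flops / p.runtime_s                  # measured FLOP/s
    return achieved / attainable

# A bandwidth-bound kernel: 1 GFLOP over 1 GB in 1 s, on a GPU with
# 1 TFLOP/s peak compute and 100 GB/s peak DRAM bandwidth.
score = performance_score(KernelProfile(1e9, 1e9, 1.0), peak_flops=1e12, peak_bw=1e11)
```

With arithmetic intensity of 1 FLOP/byte, the ceiling here is bandwidth-bound (100 GFLOP/s), so the example kernel's 1 GFLOP/s lands at a score of 0.01 — exactly the kind of "runs correctly but uses the hardware poorly" case the metric is designed to expose.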

The CUDABench-Set: A Breadth-Depth-Difficulty Evaluation Space

The benchmark's foundation is the CUDABench-Set, a carefully curated collection of problems spanning a wide evaluation space of Breadth, Depth, and Difficulty. It covers diverse, demanding application domains such as artificial intelligence (e.g., transformer layers), scientific computing (e.g., stencil computations), and data analytics. This domain variety tests the LLMs' grasp of specific algorithms and mathematical concepts beyond generic coding syntax. The set is designed to uncover weaknesses in domain-specific algorithmic knowledge and the models' ability to optimize for hardware-specific constraints, which are common failure points in automated code generation.
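The release does not specify the on-disk format of the CUDABench-Set, but a three-axis Breadth/Depth/Difficulty collection might be organized along these lines. The `BenchmarkProblem` schema and field names below are hypothetical, chosen only to illustrate how the evaluation space could be indexed and filtered:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkProblem:
    domain: str      # breadth axis: e.g. "ai", "scientific", "analytics"
    task: str        # depth axis: the concrete kernel, e.g. "softmax", "stencil-2d"
    difficulty: str  # difficulty axis: "easy" | "medium" | "hard"
    prompt: str      # natural-language description handed to the LLM

# Two illustrative entries spanning different domains and difficulties.
problems = [
    BenchmarkProblem("ai", "softmax", "easy",
                     "Write a CUDA kernel computing row-wise softmax of a matrix."),
    BenchmarkProblem("scientific", "stencil-2d", "hard",
                     "Write a CUDA kernel applying a 5-point 2D stencil to a grid."),
]

# Slicing the set along one axis, e.g. to report per-difficulty pass rates.
hard_problems = [p for p in problems if p.difficulty == "hard"]
```

Keeping the three axes as explicit fields lets results be aggregated per domain or per difficulty tier, which is what makes it possible to localize failures to, say, weak stencil knowledge rather than CUDA syntax in general.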

Key Findings and Challenges for LLM-Based GPU Programming

Benchmarking state-of-the-art models with CUDABench has yielded insightful findings that chart the path for future research. A primary challenge identified is the significant gap between compilation success and actual functional correctness, suggesting models often generate plausible-looking but logically flawed code. Furthermore, LLMs frequently demonstrate a lack of domain-specific knowledge and produce kernels with suboptimal utilization of GPU resources, leading to poor performance even when the code runs. These results underscore that generating high-performance computing code requires deep, integrated understanding of algorithms, hardware architecture, and performance engineering—a high bar for current generative AI.

Why This Matters: The Future of AI-Assisted High-Performance Computing

The release of CUDABench (available on GitHub) represents a major step forward in quantifying the capabilities and limitations of LLMs in a critical programming domain. For developers and researchers, it provides a standardized, rigorous framework to track progress. The benchmark's focus on execution-based verification and hardware-aware performance metrics moves the field toward evaluating practical utility rather than just syntactic correctness. As AI continues to automate complex tasks, robust benchmarks like CUDABench are essential for guiding development toward creating truly reliable and efficient AI programming assistants for high-performance computing.
