Benchmarking GenAI for Software Engineering: Challenges and Insights
Keynote at AISM @ ASE 2025, Seoul, South Korea, November 20, 2025
GenAI is rapidly reshaping software engineering, advancing capabilities in code generation, translation, testing, and issue analysis. However, current evaluation practices remain fragmented, inconsistent, and often irreproducible, making it difficult to assess genuine progress. In this talk, we will explore the challenges of systematically and transparently benchmarking GenAI for software engineering. We will present a unified framework that integrates key components (metrics, workloads, prompting strategies, and experimental procedures) to enable rigorous and comparable assessments across diverse tasks. Through practical examples, we will demonstrate how to achieve trustworthy, evidence-based, and reproducible evaluations of Large Language Models (LLMs) for software development.
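To make the abstract's four components concrete, here is a minimal sketch of how metrics, workloads, prompting strategies, and an experimental procedure might compose into one reproducible harness. This is purely illustrative and not the framework presented in the talk: every identifier (Workload, zero_shot, run_benchmark, the echo_model stub) is hypothetical.

```python
# Illustrative only: a toy harness wiring together the four components
# named in the abstract. Every identifier here is hypothetical.
import random
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Workload:
    """One benchmark item: a task input plus a reference answer."""
    task_id: str
    source: str
    reference: str


# Metric: scored per sample, aggregated per run.
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0


# Prompting strategies: interchangeable prompt builders for the same workload.
def zero_shot(w: Workload) -> str:
    return f"Translate this snippet to Python:\n{w.source}"


def instructed(w: Workload) -> str:
    return f"You are a careful engineer. Translate this snippet to Python:\n{w.source}"


# Experimental procedure: seeded task order, repeated trials, logged config.
def run_benchmark(
    model: Callable[[str], str],              # any text-in/text-out model stub
    workloads: list[Workload],
    prompt: Callable[[Workload], str],
    metric: Callable[[str, str], float] = exact_match,
    trials: int = 3,
    seed: int = 0,
) -> dict:
    rng = random.Random(seed)                 # seeded RNG: task order is replayable
    per_trial = []
    for _ in range(trials):
        order = list(workloads)
        rng.shuffle(order)                    # vary order across trials, reproducibly
        scores = [metric(model(prompt(w)), w.reference) for w in order]
        per_trial.append(statistics.mean(scores))
    return {
        "mean": statistics.mean(per_trial),
        "stdev": statistics.stdev(per_trial) if trials > 1 else 0.0,
        # Record the exact configuration so the run can be replicated.
        "config": {"prompt": prompt.__name__, "metric": metric.__name__,
                   "trials": trials, "seed": seed},
    }


if __name__ == "__main__":
    tasks = [Workload("t1", "print 1+1", "print(1 + 1)")]

    def echo_model(prompt_text: str) -> str:
        return "print(1 + 1)"                 # stand-in for a real LLM call

    print(run_benchmark(echo_model, tasks, zero_shot))
```

The design point is that the configuration (prompting strategy, metric, trial count, seed) is logged alongside the scores: that is what makes two runs comparable and a reported number replayable, which is the reproducibility concern the abstract raises.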