Keynote at AISM @ ASE 2025
Benchmarking GenAI for Software Engineering: Challenges and Insights
GenAI is rapidly reshaping software engineering, advancing capabilities in code generation, code translation, testing, and issue analysis. However, current evaluation practices remain fragmented, inconsistent, and often irreproducible, making it difficult to assess genuine progress.
In this talk, we explore the challenges of benchmarking GenAI for software engineering in a systematic and transparent way. We present a unified framework that integrates key components (metrics, workloads, prompting strategies, and experimental procedures) to enable rigorous, comparable assessments across diverse tasks. Through practical examples, we demonstrate how to achieve trustworthy, evidence-based, and reproducible evaluations of Large Language Models (LLMs) for software development.