Keynote at AISM @ ASE 2025
Benchmarking GenAI for Software Engineering: Challenges and Insights
GenAI is rapidly reshaping software engineering, advancing capabilities in code generation, code translation, testing, and issue analysis. However, current evaluation practices remain fragmented, inconsistent, and often irreproducible, making it difficult to assess genuine progress.
In this talk, we explore the challenges of benchmarking GenAI for software engineering in a systematic and transparent way. We present a unified framework that integrates key components (metrics, workloads, prompting strategies, and experimental procedures) to enable rigorous, comparable assessments across diverse tasks. Through practical examples, we demonstrate how to achieve trustworthy, evidence-based, and reproducible evaluations of Large Language Models (LLMs) for software development.