Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization (ESEC/FSE 2022 - Research Papers)

Mon 14 - Fri 18 November 2022 Singapore

Who

Lin Shi, Fangwen Mu, Xiao Chen, Song Wang, Junjie Wang, Ye Yang, Ge Li, Xin Xia, Qing Wang

Track

ESEC/FSE 2022 Research Papers

Time Zone

The program is currently displayed in (GMT+08:00) Beijing, Chongqing, Hong Kong, Urumqi.

When

Mon 14 Nov 2022 11:30 - 11:45 at SRC LT 51 - Empirical I Chair(s): Lingxiao Jiang

Abstract

Code summarization, the task of generating useful comments for given code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of these benchmark datasets, which are built from real-world projects. Are the benchmark datasets as good as expected? To bridge the gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to assess the data quality of the four benchmark datasets, based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such preprocessing noise is widespread in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in code summarization performance. We believe that the findings and insights will enable a better understanding of data quality in code summarization tasks and pave the way for relevant research and practice.
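
To make the idea of rule-based noise detection on code-comment pairs concrete, here is a minimal Python sketch. It is not the authors' tool: the rule names, regular expressions, and the helpers `detect_noise` and `clean_dataset` are hypothetical and chosen only for illustration. The rules are deliberately crude; the tag pattern, for instance, would also flag generic type syntax in a comment.

```python
import re
from typing import List, Tuple

# Hypothetical heuristics for illustration only; these are NOT the rules
# used by the cleaning tool described in the abstract.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
UNDER_DEVELOPMENT = re.compile(r"\b(TODO|FIXME|XXX)\b", re.IGNORECASE)


def detect_noise(code: str, comment: str) -> List[str]:
    """Return the names of the (assumed) noise rules that a pair triggers."""
    labels: List[str] = []
    text = comment.strip()
    if not text:
        labels.append("empty_comment")
    if HTML_TAG.search(text):
        labels.append("html_residue")
    if text.endswith("?"):
        labels.append("interrogation")
    if UNDER_DEVELOPMENT.search(text):
        labels.append("under_development")
    if not code.strip():
        labels.append("empty_code")
    return labels


def clean_dataset(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (code, comment) pairs that trigger no rule."""
    return [(code, comment) for code, comment in pairs
            if not detect_noise(code, comment)]
```

Under these assumptions, comparing a model trained on `clean_dataset(pairs)` with one trained on the raw `pairs` would mirror, in spirit, the comparative experiments the abstract describes.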

Link to Preprint

https://arxiv.org/abs/2207.05579

DOI

https://doi.org/10.1145/3540250.3549145

Lin Shi

ISCAS

China

Fangwen Mu

Institute of Software Chinese Academy of Sciences

Xiao Chen

Institute of Software at Chinese Academy of Sciences

China

Song Wang

York University

Canada

Junjie Wang

Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences

China

Ye Yang

Stevens Institute of Technology

United States

Ge Li

Peking University

China

Xin Xia

Huawei

China

Qing Wang

Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences

China

Session Program

Mon 14 Nov
Displayed time zone: Beijing, Chongqing, Hong Kong, Urumqi

11:00 - 12:30	Empirical I (Research Papers / Industry Paper) at SRC LT 51. Chair(s): Lingxiao Jiang (Singapore Management University)

11:00 15m Talk		What Improves Developer Productivity at Google? Code Quality (Industry Paper). Lan Cheng (Google), Emerson Murphy-Hill (Google), Mark Canning (Google), Ciera Jaspan (Google), Collin Green (Google), Andrea Knight (Google), Nan Zhang (Google), Liz Kammer (Google)
11:15 15m Talk		Understanding Why We Cannot Model How Long a Code Review Will Take: An Industrial Case Study (Industry Paper). Lawrence Chen (Meta), Peter Rigby (Concordia University; Meta), Nachiappan Nagappan (Facebook)
11:30 15m Talk		Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization (Research Papers). Lin Shi (ISCAS), Fangwen Mu (Institute of Software Chinese Academy of Sciences), Xiao Chen (Institute of Software at Chinese Academy of Sciences), Song Wang (York University), Junjie Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences), Ye Yang (Stevens Institute of Technology), Ge Li (Peking University), Xin Xia (Huawei), Qing Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences)
11:45 15m Talk		Leveraging Test Plan Quality to Improve Code Review Efficacy (Industry Paper). Lawrence Chen (Meta), Rui Abreu (Meta Platforms), Tobi Akomolede (Meta Platforms), Peter Rigby (Concordia University; Meta), Satish Chandra (Meta Platforms), Nachiappan Nagappan (Facebook)