Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study
The availability of large-scale, realistic vulnerability datasets is essential both for benchmarking existing techniques and for developing effective new data-driven approaches to software security. Yet such datasets are critically lacking. A promising solution is to generate such datasets by injecting vulnerabilities into real-world programs, which are abundantly available. Thus, in this paper, we explore the feasibility of vulnerability injection through neural code editing. Using a synthetic dataset and a real-world one, we investigate the potential and the gaps of three state-of-the-art neural code editors for vulnerability injection. We find that the studied editors have critical limitations on the real-world dataset, where the best accuracy is only 10.03%, versus 79.40% on the synthetic dataset. While the graph-based editors are more effective (successfully injecting vulnerabilities in up to 34.93% of the real-world testing samples) than the sequence-based one (zero successes), they still struggle with complex code structures and fall short on long edits, owing to the insufficient design of their preprocessing and deep learning (DL) models. We reveal the promise of neural code editing for generating realistic vulnerable samples: the generated samples help boost the effectiveness of DL-based vulnerability detectors by up to 49.51% in terms of F1 score. We also provide insights into the gaps in current editors (e.g., they are good at deleting but not at replacing code) and actionable suggestions for addressing them (e.g., designing effective editing primitives).
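To make the injection task concrete, the following is a minimal, hypothetical C sketch (not drawn from the paper's datasets) of the kind of edit a neural code editor would learn to produce: a one-line "replace" edit that turns a bounds-aware copy into a classic stack buffer overflow (CWE-787). For reference, the F1 score cited above is the harmonic mean of precision P and recall R, i.e., F1 = 2PR / (P + R).

/* Hypothetical illustration of a vulnerability-injecting edit;
 * this example is not taken from the paper's datasets. */
#include <stdio.h>
#include <string.h>

#define BUF_LEN 16

void copy_input(const char *src) {
    char buf[BUF_LEN];
    /* Original, safe code:
     *     strncpy(buf, src, BUF_LEN - 1);
     *     buf[BUF_LEN - 1] = '\0';
     * A single "replace" edit swaps the bounded copy for an
     * unbounded one, injecting an out-of-bounds write (CWE-787): */
    strcpy(buf, src);  /* injected vulnerability: no length check */
    printf("copied: %s\n", buf);
}

int main(void) {
    /* Any input longer than 15 bytes now overflows buf. */
    copy_input("this string is longer than sixteen bytes");
    return 0;
}

Note that the reverse direction (the "delete" primitive, e.g., removing a length guard before an otherwise-safe strcpy) is the kind of edit the study finds the editors handle best.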
Mon 14 Nov (displayed time zone: Beijing, Chongqing, Hong Kong, Urumqi)
14:00 - 15:30 | Community (Research Papers / Ideas, Visions and Reflections / Demonstrations / Industry Paper) at SRC LT 51. Chair(s): Dirk Riehle (University of Bavaria, Erlangen)

14:00 (15m, Talk) | In War and Peace: The Impact of World Politics on Software Ecosystems. Ideas, Visions and Reflections. Raula Gaikovina Kula (Nara Institute of Science and Technology), Christoph Treude (University of Melbourne). DOI.

14:15 (15m, Talk) | A Retrospective Study of One Decade of Artifact Evaluations. Research Papers. Stefan Winter (LMU Munich), Christopher Steven Timperley (Carnegie Mellon University), Ben Hermann (TU Dortmund), Jürgen Cito (TU Wien), Jonathan Bell (Northeastern University), Michael Hilton (Carnegie Mellon University), Dirk Beyer (LMU Munich). DOI.

14:30 (15m, Talk) | Understanding Skills for OSS Communities on GitHub. Research Papers. Jenny T. Liang (University of Washington), Thomas Zimmermann (Microsoft Research), Denae Ford (Microsoft Research). DOI, Pre-print, Media Attached.

14:45 (15m, Talk) | Achievement Unlocked: A Case Study on Gamifying DevOps Practices in Industry. Industry Paper. Patrick Ayoup (Concordia University), Diego Costa (Concordia University, Canada), Emad Shihab (Concordia University). DOI.

15:00 (7m, Talk) | iTiger: An Automatic Issue Title Generation Tool. Demonstrations. Ting Zhang (Singapore Management University), Ivana Clairine Irsan (Singapore Management University), Ferdian Thung (Singapore Management University), DongGyun Han (Royal Holloway, University of London), David Lo (Singapore Management University), Lingxiao Jiang (Singapore Management University).

15:08 (7m, Talk) | CodeMatcher: A Tool for Large-Scale Code Search Based on Query Semantics Matching. Demonstrations. Chao Liu (Chongqing University), Xuanlin Bao (Chongqing University), Xin Xia (Huawei), Meng Yan (Chongqing University), David Lo (Singapore Management University), Ting Zhang (Singapore Management University).

15:15 (15m, Talk) | Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study. Research Papers. Yu Nong (Washington State University), Yuzhe Ou (University of Texas at Dallas), Michael Pradel (University of Stuttgart), Feng Chen (University of Texas at Dallas), Haipeng Cai (Washington State University). DOI, Pre-print.