Generating Realistic Vulnerabilities via Neural Code Editing: An Empirical Study
The availability of large-scale, realistic vulnerability datasets is essential both for benchmarking existing techniques and for developing effective new data-driven approaches to software security. Yet such datasets are critically lacking. A promising solution is to generate them by injecting vulnerabilities into real-world programs, which are abundantly available. In this paper, we therefore explore the feasibility of vulnerability injection through neural code editing. Using a synthetic dataset and a real-world one, we investigate the potential and gaps of three state-of-the-art neural code editors for vulnerability injection. We find that the studied editors have critical limitations on the real-world dataset, where the best accuracy is only 10.03%, versus 79.40% on the synthetic dataset. While the graph-based editors are more effective (successfully injecting vulnerabilities in up to 34.93% of real-world testing samples) than the sequence-based one (no successful injections), they still struggle with complex code structures and fall short on long edits because of shortcomings in the design of their preprocessing and deep learning (DL) models. We nonetheless reveal the promise of neural code editing for generating realistic vulnerable samples: the generated samples help boost the effectiveness of DL-based vulnerability detectors by up to 49.51% in terms of F1 score. We also provide insights into the gaps in current editors (e.g., they are good at deleting code but not at replacing it) and actionable suggestions for addressing them (e.g., designing effective editing primitives).