Semi-supervised Pre-processing for Learning-Based Traceability Framework on Real-World Software Projects
The traceability of software artifacts has been recognized as an important factor in supporting various activities in software development processes.
However, traceability can be difficult and time-consuming to create and maintain manually, so automated approaches have gained much attention.
Unfortunately, existing automated approaches for traceability suffer from practical issues.
This paper aims to understand why state-of-the-art, ML-based trace link classifiers underperform when applied to real-world projects.
By investigating different industrial datasets, we found that two critical (and classic) challenges, i.e., data imbalance and data sparsity, affect traceability automation in real-world projects.
To overcome these challenges, we developed a framework called SPLINT that incorporates hybrid textual similarity measures and semi-supervised learning strategies to enhance learning-based traceability approaches.
We carried out experiments with six open-source platforms and ten industry datasets.
The results confirm that SPLINT achieves higher performance on the datasets from both communities. Specifically, the industrial datasets, which suffer significantly from data imbalance and sparsity, show average increases of over 14% in F2-score and over 8% in AUC.
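For readers unfamiliar with the metric, the F2-score is the standard F-beta measure with beta = 2, which weights recall more heavily than precision. The following is a minimal sketch of that standard definition; the function name `f_beta` is chosen here for illustration and is not taken from SPLINT.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta=2 gives the F2-score (recall-weighted)."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


if __name__ == "__main__":
    # With beta=2, high recall is rewarded more than high precision.
    print(f_beta(precision=0.5, recall=1.0))
    print(f_beta(precision=1.0, recall=0.5))
```

Because missed trace links are usually costlier than false candidates that an analyst can discard, a recall-weighted measure such as F2 is a common choice in this setting.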
The adjusted class-balancing and self-training policies used in SPLINT (CBST-Adjust) also work effectively for selecting pseudo-labels for minority classes from unlabeled trace sets, demonstrating SPLINT's practicality.
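The general idea behind class-balanced pseudo-label selection can be sketched as follows. This is an illustrative simplification, not the CBST-Adjust policy itself: it assumes each unlabeled trace has a list of predicted class probabilities, and it keeps the most confident fraction of predictions within each class rather than applying one global confidence threshold. The function name `select_pseudo_labels` and the `portion` parameter are hypothetical.

```python
from collections import defaultdict


def select_pseudo_labels(probabilities, portion=0.2):
    """Pick pseudo-labels per class (illustrative sketch, not CBST-Adjust).

    probabilities: list of per-sample class-probability lists.
    portion: fraction of each predicted class to keep (assumed parameter).
    Returns a sorted list of (sample_index, predicted_label) pairs.
    """
    by_class = defaultdict(list)
    for idx, probs in enumerate(probabilities):
        # Predicted label is the class with the highest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        by_class[label].append((probs[label], idx))

    selected = []
    for label, scored in by_class.items():
        # Rank by confidence within the class, most confident first.
        scored.sort(reverse=True)
        # Keep at least one sample so rare classes are not starved.
        keep = max(1, int(len(scored) * portion))
        selected.extend((idx, label) for _, idx in scored[:keep])
    return sorted(selected)


if __name__ == "__main__":
    # Class 1 (e.g. "link") is rare and predicted with low confidence.
    probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
             [0.6, 0.4], [0.95, 0.05], [0.45, 0.55]]
    print(select_pseudo_labels(probs, portion=0.4))
```

In this toy input, a single global threshold such as 0.8 would discard the only minority-class prediction (confidence 0.55), whereas the per-class ranking retains it.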