A Study on Identifying Code Author from Real Development
Identifying code authors is important in many research topics, and various approaches have been proposed. Although these approaches achieve promising results on their datasets, their true effectiveness is still in question. To the best of our knowledge, only one large-scale study was conducted to explore the impacts of related factors (\emph{e.g.}, the temporal effect and the distribution of files per author). This study selected Google Code Jam programs as their subjects, but such programs are quite different from the source files that programmers write in daily development. To understand their effectiveness and challenges, we replicate their study and use their approach to analyze source files that are retrieved from real projects. The prior study claims that the temporal effect and the distribution of files per author have only minor impacts on their trained models. In the contrast, we find that in 85.48% pairs of training and testing sets, the accuracy of a trained model is less effective when the temporal effect is considered, and in total, the average accuracy decreases by 0.4298. In addition, when we use the real distribution of files as inputs, their approach can accurately identify only one or two core code authors, although a project can have more than ten authors. By revealing the limitations of the prior approach, our study sheds lights on where to make future improvements.