Pair Programming Conversations with Agents vs. Developers: Challenges and Opportunities for SE Community
Recent research has shown feasibility of an interactive pair-programming conversational agent, but implementing such an agent poses three challenges: a lack of benchmark datasets, absence of software engineering specific labels, and the need to understand developer conversations. To address these challenges, we conducted a Wizard of Oz study with 14 participants pair programming with a simulated agent and collected 4,443 developer-agent utterances. Based on this dataset, we created 26 software engineering labels using an open coding process to develop a hierarchical classification scheme. To understand labeled developer-agent conversations, we compared the accuracy of three state-of-the-art transformer-based language models, BERT, GPT-2, and XLNet, which performed interchangeably. In order to begin creating a developer-agent dataset, researchers and practitioners need to conduct resource intensive Wizard of Oz studies. Presently, there exists vast amounts of developer-developer conversations on video hosting websites. To investigate the feasibility of using developer-developer conversations, we labeled a publicly available developer-developer dataset (3,436 utterances) with our hierarchical classification scheme and found that a BERT model trained on developer-developer data performed \textasciitilde10% worse than the BERT trained on developer-agent data, but when using transfer-learning, accuracy improved. Finally, our qualitative analysis revealed that developer-developer conversations are more implicit, neutral, and opinionated than developer-agent conversations. Our results have implications for software engineering researchers and practitioners developing conversational agents.