Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems (ESEC/FSE 2022 - Research Papers)

Mon 14 - Fri 18 November 2022 Singapore

Who

Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqiang Duan, Dan Pei

Track

ESEC/FSE 2022 Research Papers

Abstract

Fault localization in an online service system is challenging because its monitoring data is large in volume and variety, and because the dependencies across and within its components (e.g., services or databases) are complex. Furthermore, engineers require fault localization solutions to be actionable and interpretable, a requirement that existing research approaches cannot satisfy. The common industry practice is therefore that, for a specific online service system, experienced engineers focus on localizing recurring failures, drawing on their accumulated knowledge of the system and its historical failures. More specifically, 1) they can identify the underlying root causes and take mitigation actions once they pinpoint a group of indicative metrics on the faulty component; 2) their diagnosis knowledge is roughly based on how a failure might affect the components across the whole system.

Although this common practice is actionable and interpretable, it is largely manual, and thus slow and sometimes inaccurate. In this paper, we aim to automate the practice through machine learning. That is, we propose DejaVu, an actionable and interpretable fault localization approach for recurring failures in online service systems. For a specific online service system, DejaVu takes historical failures and the dependencies in the system as input and trains a localization model offline. For an incoming failure, the trained model recommends online where the failure occurs (i.e., the faulty components) and which kind of failure occurs (i.e., the indicative group of metrics), which makes the result actionable; the recommendations are further interpreted both globally and locally, which makes them interpretable. In an evaluation on 601 failures from three production systems and one open-source benchmark, DejaVu ranks the ground truths at an average position of 1.66 to 5.03 in a long candidate list, in less than one second, outperforming baselines by 54.52%.
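The reported result is an average position of the ground truths in a ranked candidate list. A minimal sketch of how such an average-rank figure could be computed is given below. It is only an illustration: the `(component, metric group)` candidate representation, the function names, and the example data are all assumptions, and the exact metric definition used in the paper may differ.

```python
from typing import Iterable, Sequence, Set, Tuple

# Hypothetical representation of one localization candidate:
# (component name, indicative metric group). Not taken from the paper.
Candidate = Tuple[str, str]


def average_rank(ranked: Sequence[Candidate], truths: Set[Candidate]) -> float:
    """Mean 1-based position of the ground-truth candidates in one ranked list."""
    if not truths:
        raise ValueError("at least one ground truth is required")
    positions = [i for i, c in enumerate(ranked, start=1) if c in truths]
    if len(positions) != len(truths):
        raise ValueError("each ground truth must appear exactly once in the ranking")
    return sum(positions) / len(positions)


def mean_average_rank(
    cases: Iterable[Tuple[Sequence[Candidate], Set[Candidate]]]
) -> float:
    """Average of the per-failure average ranks over all failures."""
    ranks = [average_rank(ranked, truths) for ranked, truths in cases]
    if not ranks:
        raise ValueError("at least one failure case is required")
    return sum(ranks) / len(ranks)


# Made-up example: two failures, three candidates each.
cases = [
    # Ground truth is ranked first, so the average rank is 1.0.
    ([("db1", "cpu"), ("svc_a", "latency"), ("db2", "sessions")],
     {("db1", "cpu")}),
    # Ground truths are ranked first and third, so the average rank is 2.0.
    ([("db2", "sessions"), ("db1", "cpu"), ("svc_a", "latency")],
     {("db2", "sessions"), ("svc_a", "latency")}),
]

print(mean_average_rank(cases))  # 1.5
```

A lower value is better under this definition: 1.0 would mean the ground truth is always the top recommendation.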

DOI

https://doi.org/10.1145/3540250.3549092

Zeyan Li

Tsinghua University

China

Nengwen Zhao

Tsinghua University

China

Mingjie Li

Tsinghua University

China

Xianglin Lu

Tsinghua University

China

Lixin Wang

China Construction Bank

China

Dongdong Chang

China Construction Bank

China

Xiaohui Nie

BizSeer

China

Li Cao

BizSeer

China

Wenchi Zhang

BizSeer

China

Kaixin Sui

BizSeer

China

Yanhua Wang

China Construction Bank

China

Xu Du

China Construction Bank

China

Guoqiang Duan

China Construction Bank

China

Dan Pei

Tsinghua University

China