ArXiv Preprint
Prompt and accurate detection of system anomalies is essential to ensure the
reliability of software systems. Unlike manual efforts that exploit all
available run-time information, existing approaches usually leverage only a
single type of monitoring data (often logs or metrics) or fail to make
effective use of the joint information among different types of data.
Consequently, many false predictions occur. To better understand the
manifestations of system anomalies, we conduct a systematic study of a large
amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates
that logs and metrics can manifest system anomalies collaboratively and
complementarily, and that neither alone is sufficient. Thus, integrating
heterogeneous data can help recover the complete picture of a system's health
status. In this context, we propose Hades, the first end-to-end semi-supervised
approach to effectively identify system anomalies based on heterogeneous data.
Our approach employs a hierarchical architecture to learn a global
representation of the system status by fusing log semantics and metric
patterns. It captures discriminative features and meaningful interactions from
heterogeneous data via a cross-modal attention module, trained in a
semi-supervised manner. We evaluate Hades extensively on large-scale simulated
data and datasets from Huawei Cloud. The experimental results demonstrate the
effectiveness of our model in detecting system anomalies. We also release the
code and the annotated dataset for replication and future research.
Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, Yongqiang Yang, Michael R. Lyu
2023-02-14