Below are my reading notes on "Rapid Deployment of Anomaly Detection Models for Large Number of Emerging KPI Streams" - Jeanva
Abstract
Rapid deployment of anomaly detection models for large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI streams.
We propose the first framework ADS that tackles the above problem, via clustering and semi-supervised learning.
with the labels of only the 5 cluster centroids of 70 historical KPI streams, ADS achieves an averaged best F-score of 0.92 on 81 new KPI streams
There are too many KPI streams to label anomalies by hand one by one, and tuning a model manually for every stream is equally infeasible. This paper proposes a method that automatically builds an anomaly detection model for each newly emerging KPI stream, using clustering and semi-supervised learning.
KEYWORDS
ADS (Anomaly Detection through Self-training)
INTRODUCTION
Specifically, when large number of KPI streams emerge continuously and frequently, operators need to deploy accurate anomaly detection models for these new KPI streams as quickly as possible (e.g., within 3 weeks at most)
A detection model can be deployed for a new KPI stream within 3 weeks.
For example, in a top gaming company G studied in this paper, on average over ten new games are launched per quarter, which results in more than 6000 new KPI streams per 10 days on average
Every 10 days, more than 6000 new KPI streams arrive for analysis, a very large number.
For traditional statistical algorithms, to achieve the best accuracy, operators have to manually select an anomaly detection algorithm and tune its parameters for each KPI stream, which is infeasible for the large number of emerging KPI streams
Traditional methods require selecting an algorithm and tuning its parameters for every KPI stream.
Supervised learning based methods [2], [10] require manually labeling anomalies for each new KPI stream, which is not feasible for the large number of emerging KPI streams either.
Supervised algorithms, in turn, require manually labeled anomalies for each KPI stream.
Unsupervised learning based methods [11], [12] do not require algorithm selection, parameter tuning, or manual labels, but they either suffer from low accuracy [14] or require large amounts of training data for each new KPI stream (e.g., six months worth of data) [12], which do not satisfy the requirement of rapid deployment (e.g., within 3 weeks) of accurate anomaly detection.
Unsupervised methods either have low accuracy or need large amounts of training data, typically half a year's worth.
In this paper, we propose ADS, the first framework that enables the rapid deployment of anomaly detection models (say at most 3 weeks) for large number of emerging KPI streams, without manual algorithm selection, parameter tuning, or new anomaly labeling for any newly emerging KPI streams
ADS needs only 3 weeks of data, and requires no algorithm selection and no manual labeling for new streams.
Our idea of ADS is based on the following two observations.
(1) In practice, many KPI streams (e.g., the number of queries per server in a well load balanced server cluster) are similar due to their implicit associations and similarities, thus potentially we can use the similar anomaly detection algorithms and parameters for these similar KPI streams.
(2) Clustering methods such as ROCKA [15] can be used to cluster many KPI streams into clusters according to their similarities. The number of clusters is largely determined by the nature of the service (e.g., shopping, gaming, social network, search) and the type of KPIs (e.g., number of queries, CPU usage, memory usage), but not by the scale of the entire system.
The idea rests on the high similarity among KPI streams: a handful of clusters roughly captures the characteristics of all the KPIs.
Thus for a given service, the number of clusters can be orders of magnitude smaller than the number of KPI streams, and there is a good chance that a newly emerging KPI stream falls into one of the existing clusters resulted from historical KPI streams.
Utilizing the above two observations, ADS proposes to (1) cluster all existing/historical KPI streams into clusters, (2) manually label the anomalies of all cluster centroids, (3) assign each newly emerging KPI stream into one of the existing clusters, and (4) combine the data of the new KPI stream (unlabeled) and its cluster centroid (labeled) and use semi-supervised learning [16] to train a new model for each new KPI stream.
Rough workflow: 1) cluster the historical KPI streams; 2) manually label the centroid KPI stream of each cluster; 3) classify each new KPI stream into one of the existing clusters; 4) run semi-supervised learning on the new KPI stream together with its cluster centroid.
Result: an anomaly detection model for the new KPI stream.
Semi-supervised learning can train a new model for a new KPI stream using an existing labeled KPI stream as long as these two KPI streams have similar data distribution, which is the case since they are in the same cluster.
Semi-supervised learning trains on a mix of labeled and unlabeled data; the centroid KPI streams are the ones that need labels.
BACKGROUND
Anomaly Detection for KPI Streams
Most anomaly detection algorithms, including traditional statistical algorithms, supervised learning based methods and unsupervised learning based methods, compute an anomaly score for a data point to denote how likely this data point is anomalous. Operators then set a threshold to determine whether each data point is anomalous or not. That is, only if the anomaly score at time t exceeds this threshold, \(x_{t}\) will be regarded as an anomalous data point.
Anomaly detection on a time series is performed point by point.
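As a minimal illustration of the point-wise scheme described above (the toy z-score scorer here stands in for a real detector; it is not from the paper):

```python
import numpy as np

def anomaly_scores(x: np.ndarray) -> np.ndarray:
    # toy scorer: deviation from the stream mean, in units of standard deviation
    return np.abs(x - x.mean()) / x.std()

def detect(x: np.ndarray, threshold: float) -> np.ndarray:
    # x_t is regarded as anomalous iff its anomaly score exceeds the threshold
    return anomaly_scores(x) > threshold
```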
From the above definition, we can see that anomaly detection for a KPI stream is essentially a two-class classification problem – classifying a data point into an anomalous data point or a normal one. Consequently, we can use the intuitive classification metrics of two-class classification methods, including precision, recall, and F-score, to evaluate the performance of anomaly detection algorithms.
Anomaly detection is a two-class classification problem.
Anomaly Detection Methods for KPI Streams
Over the years, diverse traditional statistical algorithms have been applied for KPI anomaly detection, including SVD [6], Wavelet [7], ARIMA [8], Time Series Decomposition [1], Holt-Winters [9], etc
These are some of the traditional statistical anomaly detection algorithms.
To achieve the best accuracy, operators have to manually select an anomaly detection algorithm and tune its parameters for each KPI stream.
The problem is that humans must select the algorithm and tune its parameters.
To address the problem posed by algorithm selection and parameter tuning, several supervised learning based methods such as EGADS [10] and Opprentice [2] have been proposed. For each KPI stream, these methods learn (traditional statistical) algorithm selection and parameter tuning from operators’ manual labels of KPI anomalies.
Supervised learning can handle algorithm selection and parameter tuning, but it requires manually labeled anomalies.
Unsupervised learning has emerged as a promising field in KPI anomaly detection. For example, isolation based methods [11] and variational autoencoders (VAE) [12] are applied in detecting anomalies in (KPI) streams. These methods are trained without manual labels, and thus they can be applied for large volume of KPI streams.
However, isolation based methods suffer from low accuracy [14] (see Section IV-C for more details). In addition, Donut [12], which is based on VAE, requires a long period (say six months) of training data for newly emerging KPI streams.
The problems with unsupervised algorithms are low accuracy and the long training period, requiring up to half a year of data.
Semi-supervised learning [16] is halfway between supervised and unsupervised learning. It uses unlabeled data to modify either parameters or models obtained from labeled data alone to maximize the learning performance. [20]–[22] use semi-supervised learning for anomaly detection in other domains, but are not designed for KPI streams (time series).
Semi-supervised methods sit in between: they use unlabeled data to adjust the parameters or models obtained from labeled data, maximizing learning performance. Using the CPLE semi-supervised method is the main highlight of this paper; see the CPLE paper for the details of the method.
Clustering of KPI Streams
If we can identify similar KPI streams, and group the huge volume of KPI streams into a few clusters, we can reduce the overhead in anomaly detection.
In this work, we adopt ROCKA [15], a rapid clustering algorithm for KPI streams based on their shapes.
It applies moving average to extract baselines which successfully reduce the biases of noises and anomalies.
In addition, it uses shape-based distance as the distance measure, and reduces its algorithm complexity to \(O(m \log m)\) using Fast Fourier Transform.
Finally, it uses DBSCAN to cluster KPI streams and chooses the centroid for every cluster.
For KPI stream clustering, refer to the ROCKA algorithm.
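A minimal sketch of the two ROCKA ingredients mentioned above, with my own function names (this is not the authors' code): moving-average baseline extraction and the FFT-based shape-based distance.

```python
import numpy as np

def extract_baseline(x: np.ndarray, w: int = 5) -> np.ndarray:
    # moving-average smoothing, as ROCKA uses to suppress noise and anomalies
    return np.convolve(x, np.ones(w) / w, mode="valid")

def sbd(x: np.ndarray, y: np.ndarray) -> float:
    # shape-based distance: 1 - max normalized cross-correlation over all shifts,
    # computed via FFT so the cost is O(m log m) rather than O(m^2)
    m = len(x)
    n = 1 << int(np.ceil(np.log2(2 * m - 1)))          # FFT size (power of two)
    cc = np.fft.irfft(np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n)), n)
    cc = np.concatenate((cc[-(m - 1):], cc[:m]))       # reorder to shifts -(m-1)..(m-1)
    return 1.0 - cc.max() / (np.linalg.norm(x) * np.linalg.norm(y))
```

Running DBSCAN over the pairwise `sbd` matrix (e.g., scikit-learn's `DBSCAN` with `metric="precomputed"`) would complete the clustering step.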
FRAMEWORK OF ADS
For historical KPI streams, ADS preprocesses them with filling missing points and standardization, clusters the preprocessed KPI streams using ROCKA (Section III-A), and extracts features (namely, output results of different anomaly detection algorithms and parameters) for the KPI streams on cluster centroids (Section III-B).
These operations, such as feature extraction, are performed only on the centroid KPI streams of the clusters.
Similarly, when a new KPI stream is generated, ADS preprocesses it, and classifies it into one of the above clusters (Section III-A), after which the features of this new KPI stream are extracted (Section III-B).
The same processing is applied to each new KPI stream.
Based on the features of the new KPI stream, and the features and labels of the new KPI stream’s cluster centroid, ADS trains a semi-supervised learning model using the CPLE algorithm (Section III-C). Finally, ADS detects anomalies in the new KPI stream based on the above model and the severity threshold (aThld henceforth).
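A sketch of this train-and-detect step under stated assumptions: CPLE has no standard library implementation, so scikit-learn's `SelfTrainingClassifier` stands in for the semi-supervised learner here, and interpreting aThld as a threshold on the predicted anomaly probability is my reading rather than the paper's exact definition.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

def train_and_detect(centroid_X, centroid_y, new_X, aThld=0.5):
    """centroid_X/centroid_y: detector-severity features and manual 0/1 labels of
    the cluster centroid; new_X: features of the (unlabeled) new KPI stream."""
    X = np.vstack([centroid_X, new_X])
    y = np.concatenate([centroid_y, -np.ones(len(new_X), dtype=int)])  # -1 marks unlabeled
    model = SelfTrainingClassifier(RandomForestClassifier(n_estimators=100))
    model.fit(X, y)
    # flag points of the new stream whose anomaly probability exceeds aThld
    return model.predict_proba(new_X)[:, 1] >= aThld
```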
Preprocessing and Clustering
In addition, to make KPI streams of different amplitudes and/or length scales comparable, we also standardize KPI streams. This way, different KPI streams are clustered based on their shapes, rather than the absolute values of amplitudes and/or length scales.
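A minimal sketch of this preprocessing, assuming linear interpolation for the missing points (the filling scheme is not specified here) and z-score standardization:

```python
import numpy as np

def preprocess(x: np.ndarray) -> np.ndarray:
    # fill missing points (NaNs) by linear interpolation, then standardize so
    # that clustering compares shapes rather than absolute amplitudes
    x = np.asarray(x, dtype=float).copy()
    missing = np.isnan(x)
    x[missing] = np.interp(np.flatnonzero(missing), np.flatnonzero(~missing), x[~missing])
    return (x - x.mean()) / x.std()
```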
As discussed in section II-C, ADS adopts ROCKA [15] to group KPI streams into a few clusters, and obtains a centroid KPI stream for each cluster.
Feature Extraction
The features of the new KPI stream, the features and the anomaly labels of the new KPI stream’s cluster centroid, together form the training set.
we use the output results of different anomaly detectors (namely, the anomaly severities measured by anomaly detectors) as the features of KPI streams. This way, each detector serves as a feature extractor.
When an anomaly detector receives an incoming data point of a KPI stream, it internally produces a non-negative value, called severity, to measure how anomalous that data point is.
Each base detector computes a non-negative severity value for every data point, which serves as one feature.
For example, historical average [24] applies how many times of standard deviations the point is away from the mean as the severity, based on the assumption that the KPI stream data follows Gaussian distribution;
Holt-Winters [9] uses a residual error (namely, the absolute difference between the actual value and the forecast value of each data point) to measure the severity.
The feature extraction, training, and classification (detection) in ADS are all designed to work with individual data points, not anomaly windows, so that the semi-supervised learning algorithm can have enough data for training.
Each data point of a KPI stream thus maps to many features, one per detector (and parameter configuration).
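A sketch of two of the detectors named above acting as feature extractors; the window size and Holt-Winters settings are illustrative choices, not the paper's configurations (the paper runs each detector under several parameter settings, each yielding one feature):

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def historical_average_severity(x: np.ndarray, window: int = 60) -> np.ndarray:
    # severity = how many standard deviations x_t lies from the mean of a
    # sliding history window (Gaussian assumption)
    sev = np.zeros(len(x))
    for t in range(window, len(x)):
        hist = x[t - window:t]
        sev[t] = abs(x[t] - hist.mean()) / (hist.std() + 1e-9)
    return sev

def holt_winters_severity(x: np.ndarray, seasonal_periods: int = 288) -> np.ndarray:
    # severity = absolute residual between the actual value and the
    # Holt-Winters fitted value (288 = points per day at 5-minute intervals)
    fit = ExponentialSmoothing(x, trend="add", seasonal="add",
                               seasonal_periods=seasonal_periods).fit()
    return np.abs(x - fit.fittedvalues)

def extract_features(x: np.ndarray) -> np.ndarray:
    # each detector contributes one severity column, i.e. one feature
    return np.column_stack([historical_average_severity(x), holt_winters_severity(x)])
```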
Semi-Supervised Learning
we try to learn a model which is based on both labeled and unlabeled data, namely a semi-supervised learning model.
self-training based methods [20] apply an existing model to “label” unlabeled data, and employ the newly labeled data together with the actual labeled data to retrain the model until the prediction result no longer changes or iteration ends.
In this work, we adopt CPLE [28], an extension model of self-training. CPLE is a resilient semi-supervised learning framework for any supervised learning classifier (base-model henceforth) including random forest, SVM, decision tree, etc. It takes the prediction probabilities of base-model as input to fully utilize the unlabeled data.
CPLE is somewhat like the EM algorithm: it first assigns random soft labels to the unlabeled points, then optimizes the parameters. The steps are:
1. Train on the labeled samples \(X\) to obtain the supervised estimate \(\theta_{sup}\);
2. Initialize the soft labels \(q\);
3. Compute the contrastive pessimistic likelihood (CPL) under this \(q\);
4. Maximize the CPL to obtain an estimate of \(\theta\);
5. Substitute \(\theta\) into the contrastive likelihood (CL) and minimize the CL to obtain a new estimate of \(q\);
6. Repeat steps 3-5 until the CPL converges, yielding the final estimate \(\theta_{semi}\).
Reference: https://zhuanlan.zhihu.com/p/53584825
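A compact, heuristic transcription of the loop above into code, using logistic regression as the base model. The pessimistic update of \(q\) here is a simple gradient step, not the authors' actual optimizer, so treat this purely as a sketch of the control flow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cple_sketch(X_lab, y_lab, X_unl, n_iter=20, lr=0.1):
    # Step 1: supervised estimate theta_sup from labeled data alone
    sup = LogisticRegression().fit(X_lab, y_lab)
    # Step 2: initialize the soft labels q of the unlabeled points
    q = np.full(len(X_unl), 0.5)
    model = sup
    for _ in range(n_iter):
        # Steps 3-4: maximize the CPL over theta given q, approximated by
        # refitting the base model on labeled data plus soft-labeled unlabeled
        # data (each unlabeled point appears twice, weighted by q and 1 - q)
        X = np.vstack([X_lab, X_unl, X_unl])
        y = np.concatenate([y_lab, np.ones(len(X_unl)), np.zeros(len(X_unl))])
        w = np.concatenate([np.ones(len(X_lab)), q, 1 - q])
        model = LogisticRegression().fit(X, y, sample_weight=w)
        # Step 5: pessimistic update of q -- a gradient step that minimizes the
        # contrastive likelihood, i.e. the gain of the semi-supervised model
        # over the supervised one
        p_semi = model.predict_proba(X_unl)[:, 1]
        p_sup = sup.predict_proba(X_unl)[:, 1]
        gain = (np.log(p_semi + 1e-12) - np.log(p_sup + 1e-12)) \
             - (np.log(1 - p_semi + 1e-12) - np.log(1 - p_sup + 1e-12))
        q = np.clip(q - lr * gain, 0.0, 1.0)
    return model  # the final semi-supervised estimate theta_semi
```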
EVALUATION
we compare the performance of ADS with that of supervised learning based methods such as Opprentice, and unsupervised learning based methods including iForest and Donut (Section IV-C).
Finally, to highlight the importance of semi-supervised learning, we compare ADS with the combination of ROCKA and Opprentice
Data Set
We randomly pick 70 historical KPI streams for clustering and 81 new ones for anomaly detection from a top global online game service.
| KPI | # KPI streams | Interval (minute) | Length (month) | #/percentage of anomalous data points per KPI stream |
|---|---|---|---|---|
| Success rate | 4 | 5 | 1 | 84/0.97% |
| Latency | 19 | 5 | 1 | 150/1.7% |
| # online players | 58 | 5 | 1 | 72/0.83% |
The 70 historical KPI streams are grouped into 5 clusters, and the 5 centroid streams are labeled; here, labels are assigned to every data point.
Features are then extracted for these 5 centroid KPI streams as well as for the 81 new KPI streams.
Therefore, we randomly selected 151 KPI streams in our evaluation. We believe that the 151 KPI streams are sufficient to evaluate ADS’s performance.
Evaluation Metrics
we use a simple strategy following [12]: if any point in an anomaly segment in the ground truth can be detected by a chosen threshold, we say this segment is detected correctly, and all points in this segment are treated as if they can be detected by this threshold. Meanwhile, the points outside the anomaly segments are treated as usual. The precision, recall, F-score and best F-score are then computed accordingly
Detection is judged per anomaly segment, i.e., a maximal run of contiguous anomalous points: detecting any point within a segment counts as detecting the whole segment, which avoids demanding an exact prediction for every individual point.
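A direct transcription of this segment-level strategy into code; enumerating candidate thresholds over the observed anomaly scores to obtain the best F-score is an implementation choice of mine:

```python
import numpy as np

def adjust(pred: np.ndarray, truth: np.ndarray) -> np.ndarray:
    # if any point inside a ground-truth anomaly segment is detected, mark the
    # whole segment as detected; points outside segments are left as-is
    pred = pred.copy()
    t, n = 0, len(truth)
    while t < n:
        if truth[t]:
            end = t
            while end < n and truth[end]:
                end += 1
            if pred[t:end].any():
                pred[t:end] = True
            t = end
        else:
            t += 1
    return pred

def best_f_score(scores: np.ndarray, truth: np.ndarray) -> float:
    truth = truth.astype(bool)
    best = 0.0
    for thr in np.unique(scores):
        pred = adjust(scores >= thr, truth)
        tp = np.sum(pred & truth)
        fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth)
        if tp:
            p, r = tp / (tp + fp), tp / (tp + fn)
            best = max(best, 2 * p * r / (p + r))
    return best
```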
Evaluation of The Overall Performance
For each of the 81 new KPI streams, we train ADS using the features extracted from the front 60% of it, as well as the features and manual labels of its cluster centroid (KPI stream). Then we use this ADS model to detect anomalies on the back 40% of this new KPI stream.
A separate model is trained for each KPI stream, using the front 60% for training and the back 40% for detection.
Figure 4 shows the cumulative distribution functions (CDFs) of the best F-scores of each KPI stream (of the new 81 KPI streams) using the above four methods. Note that in this CDF figure, being closer to the upper left means worse performance, and being closer to the bottom right means better performance
For each method we inspect the CDF of best F-scores: the fewer streams with low best F-scores, the better.
We can see that both ADS and Opprentice perform superior in detecting anomalies for KPI streams, with 80% of best F-scores over 0.8.
(1) iForest is sensitive to noises in the training set because it does not utilize any labels [11].
(2) Donut is a deep Bayesian model, which requires a lot of training data to get good results (say six months worth of KPI streams) [12]. However, as Table II shows, we have to train the model using 60% of one month, namely 18 days, worth of newly emerging KPI streams, which is a too small amount of training data for Donut.
ADS and Opprentice perform well across all the five clusters and much better than iForest and Donut, demonstrating that it is important to utilize anomaly labels in anomaly detection for newly emerging KPI streams.
Opprentice also performs reasonably well; it skips clustering and directly trains a random forest, using the labels of the front 60% to predict the back 40%.
Although Opprentice performs similarly to ADS, it needs much more labeling work. For example, in this setting, ADS is trained based on only five labeled KPI streams on cluster centroids, while Opprentice is trained using all the 81 labeled KPI streams.
ADS needs only 5 labeled streams, whereas Opprentice needs all 81 new KPI streams labeled, so Opprentice does not scale in practice.
Evaluation of CPLE
We adopt a robust semi-supervised learning model, CPLE, which is suitable for KPI anomaly detection and only requires similar (not necessarily the same) data distribution between the existing labeled KPI stream and the new KPI stream.
We set up ROCKA + Opprentice as follows. We first apply ROCKA to group the 70 historical KPI streams into five clusters, and classify the 81 new KPI streams into these clusters. For each cluster, we train Opprentice using the features and manual labels of its centroid KPI streams
Here the labels of the 81 new KPI streams are not used (they are treated as unlabeled). ROCKA + Opprentice actually turns out worse than plain Opprentice, and of course worse than ADS.
Here we explain why ADS performs better than ROCKA + Opprentice. KPI stream clustering methods such as ROCKA usually extract baselines (namely underlying shapes) from KPI streams and ignore fluctuations. However, the fluctuations of KPI streams can impact anomaly detection.
ADS addresses the above problem effectively using semi-supervised learning. In other words, it learns not only from the labels of the centroid KPI stream, but also from the fluctuation degree of the new KPI stream. This is consistent with the observation that the model trained based on both labeled and unlabeled data should not be worse than the one trained based only on the labeled data [28]
Reference
CPLE [28]
Detectors [2]
Evaluation [12]
