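We want to detect anomalies in the data below; the two points circled in red are clearly the anomalies.
1. Anomaly detection using the absolute value of the indicator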
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Read the data and forward-fill missing values
df = pd.read_csv(r'indicator.csv', sep=',')
df = df.ffill()
# Replace the timestamp column with an integer index 0..n-1
df['time'] = range(len(df))
plt.figure()
plt.scatter(df['time'], df['indicator'])
plt.show()

# Detect on the absolute value only
# reshape(-1, 1) turns the series into a single-column matrix
data = np.array(df['indicator']).reshape(-1, 1)

# Fit a One-Class SVM; nu is an upper bound on the fraction of
# training points that may be treated as outliers
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)  # 1 = normal, -1 = anomaly

# Plot the detection result: normal and anomalous points in different colors
df1 = df.copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
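Result of the OneClassSVM detection: the anomalies were not detected, while normal points were flagged as anomalous.
The detection above ignores time entirely, even though the data is a time series, so the next step is to bring time into the detection.
2. Anomaly detection using the absolute value plus the time index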
# Detect on the absolute value plus the time index
# Each row is (time, indicator)
data = df[['time', 'indicator']].to_numpy()

# Fit a One-Class SVM
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)  # 1 = normal, -1 = anomaly

# Plot the detection result
df1 = df.copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
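Result of the OneClassSVM detection: the anomalies are now detected, but some normal points are still flagged as anomalous. (The screenshot was cropped badly and has not been redone.)
Take mobile data traffic as an example: usage is low during working hours and high during the lunch break or the evening rest. If we detect on the absolute value alone, we may flag the high-usage moments at noon or the low-usage moments during work as anomalies, when in fact those points are normal.
In many cases an indicator changes continuously, for example traffic rate, website visit rate, or CPU usage, so we can run anomaly detection on the first difference (the indicator's rate of change) or on the second difference instead.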
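As a small illustration of the differences mentioned above, the sketch below applies `np.diff` to a made-up series (a steady ramp with a single spike; the values are an assumption, not the article's data). The ramp collapses to a constant in the first difference, so only the spike stands out.

```python
import numpy as np

# Hypothetical series: a steady ramp with one spike at position 4
x = np.array([1, 2, 3, 4, 12, 6, 7, 8])

first_diff = np.diff(x)        # x[t] - x[t-1]
second_diff = np.diff(x, n=2)  # difference of the first difference

print(first_diff.tolist())   # [1, 1, 1, 8, -6, 1, 1]
print(second_diff.tolist())  # [0, 0, 7, -14, 7, 0]
```

Note that a single spike produces two large first differences (entering and leaving the spike) and three non-zero second differences, so the flagged positions need to be mapped back to the original point.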
3. Anomaly detection using the first difference of the indicator
# Detect on the first difference of the indicator
data = np.array(df['indicator']).reshape(-1, 1)
# First difference x[t] - x[t-1]; the result has one row fewer than the input
data = data[1:] - data[:-1]

# Fit a One-Class SVM
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)  # 1 = normal, -1 = anomaly

# Plot the detection result
# The first row has no difference, so drop it; copy() avoids writing into a slice of df
df1 = df[1:].copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
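The detection result now matches our expectation.
The data above is too symmetric to be realistic, so we modify it slightly. In the modified data below, the two points circled in red are the anomalies.
We apply first-difference detection again, with the following result.
The anomalies are detected, but quite a few normal points are flagged as well. In practice, many time series are seasonal: within one cycle, different seasons behave differently, and that is normal. Plants, for example, grow quickly in spring and summer and slowly in autumn and winter; slow growth in autumn and winter is not an anomaly. Only slow growth during spring and summer is.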
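One plausible reading of why normal points get flagged: in a seasonal series the first difference itself depends on the season. The sketch below assumes that one clean day of the modified data follows the pattern in the test data at the end of this article (a slow rise from 1 to 19, then a fast fall) and counts how often each step size occurs.

```python
import numpy as np

# Assumed clean daily pattern: slow rise 1..19, then a fast fall back to 1
pattern = np.array(list(range(1, 20)) + [15, 11, 7, 3, 1])

# Count how often each first-difference value occurs within one day
steps, counts = np.unique(np.diff(pattern), return_counts=True)
for step, count in zip(steps.tolist(), counts.tolist()):
    print(f"step {step:+d}: {count} times per day")
```

Under this assumption the step +1 dominates, while the perfectly normal evening drops of -4 are rare, so a detector that only sees first differences has little basis for telling them apart from genuine anomalies.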
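So for time series anomaly detection we have to take periodicity into account. The indicators we monitor usually have a daily cycle, but how do we judge whether a series is actually periodic? We can compute the autocorrelation coefficient and use it to measure how strong the periodicity is. The autocorrelation coefficient is computed as
$$r_k = \frac{\gamma_k}{\gamma_0}$$
where k is the candidate period (the lag), $\gamma_k$ is the covariance between the time series and a copy of itself shifted by k time points, and, as a special case, $\gamma_0$ is the variance. The autocorrelation coefficient lies in [-1, 1]; the closer it is to 1, the stronger the autocorrelation at lag k and the stronger the periodicity.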
4. Computing the autocorrelation coefficient
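# Autocorrelation coefficient rk = yk / y0 at lag 24
u = df['indicator'].mean()
# Convert to NumPy arrays so the products below are positional;
# multiplying two pandas Series would align them on their index labels instead
s1 = df['indicator'][:-24].to_numpy() - u  # x[t]
s2 = df['indicator'][24:].to_numpy() - u   # x[t + 24]
# Element-wise products
s0 = s1 * s1
sk = s1 * s2
y0 = np.sum(s0)
yk = np.sum(sk)
rk = yk / y0
print(rk)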
With a period of 24, the autocorrelation coefficient comes out as 0.9706564686677905, which indicates that the data has a period of 24.
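When the period is not known in advance, one option is to scan a range of lags and pick the one with the highest score. The sketch below is an illustration under assumptions: it uses a synthetic, perfectly periodic series (one 24-point day repeated five times, without anomalies), and it scores each lag with the Pearson correlation from `np.corrcoef` rather than the ratio computed above. The helper name `lagged_correlation` is made up for this example.

```python
import numpy as np

# Hypothetical, perfectly periodic series: one 24-point day repeated 5 times
pattern = np.array(list(range(1, 20)) + [15, 11, 7, 3, 1], dtype=float)
x = np.tile(pattern, 5)


def lagged_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    return float(np.corrcoef(series[:-lag], series[lag:])[0, 1])


# Score every candidate lag from 1 to 36 and keep the best one
scores = {lag: lagged_correlation(x, lag) for lag in range(1, 37)}
best_lag = max(scores, key=scores.get)
print(best_lag, round(scores[best_lag], 3))
print(round(scores[12], 3))  # half a period: the series is out of phase with itself
```

On this synthetic series the scan selects lag 24, and the score at the half period (lag 12) is negative, which is why the coefficient has to be read on a scale from -1 to 1.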
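5. Anomaly detection using the period-over-period difference of the indicator
For a seasonal time series, we compare each point with the data from the corresponding season. The data above has a period of 24, so we take the difference between the value at the current time point and the value 24 time points earlier, and then run anomaly detection on that difference.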
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Read the data, forward-fill missing values, use an integer time index
df = pd.read_csv(r'indicator.csv', sep=',')
df = df.ffill()
df['time'] = range(len(df))

# Autocorrelation coefficient rk = yk / y0 at lag 24
u = df['indicator'].mean()
# NumPy arrays keep the products positional (pandas Series would align on index labels)
s1 = df['indicator'][:-24].to_numpy() - u  # x[t]
s2 = df['indicator'][24:].to_numpy() - u   # x[t + 24]
y0 = np.sum(s1 * s1)
yk = np.sum(s1 * s2)
rk = yk / y0
print(rk)

# reshape(-1, 1) turns the series into a single-column matrix
data = np.array(df['indicator']).reshape(-1, 1)
# Period-over-period difference: x[t] - x[t - 24]
data = data[24:] - data[:-24]

# Fit a One-Class SVM
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)  # 1 = normal, -1 = anomaly

# Plot the result; the first 24 points have no earlier period to compare with
df1 = df[24:].copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
The detection result matches our expectation.
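To see what the period-over-period difference does to the input before any model is fitted, the sketch below rebuilds the series in NumPy instead of reading `indicator.csv`. It assumes the clean daily pattern and the two anomalous values (4 at day 3, 12:00 and 8 at day 5, 20:00) that appear in the test data listed at the end of this article.

```python
import numpy as np

period = 24
# Assumed clean daily pattern, repeated for 5 days
pattern = np.array(list(range(1, 20)) + [15, 11, 7, 3, 1])
x = np.tile(pattern, 5)
x[60] = 4    # day 3, 12:00 (normally 13)
x[116] = 8   # day 5, 20:00 (normally 11)

# Compare every point with the same hour one day earlier
seasonal_diff = x[period:] - x[:-period]

# Map the non-zero differences back to positions in the original series
nonzero = np.flatnonzero(seasonal_diff)
flagged_times = (nonzero + period).tolist()
print(flagged_times)                   # [60, 84, 116]
print(seasonal_diff[nonzero].tolist()) # [-9, 9, -3]
```

All regular points difference to zero, so the anomalies stand out clearly. Position 84 is an echo: the point itself is normal, but it is compared against the anomalous value one period earlier, so a differencing approach like this can report it as well.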
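A periodic time series like the one above becomes approximately stationary once the seasonal pattern is removed, for example by the period-24 differencing used here. Anomaly detection for stationary time series usually differs from that for non-stationary series.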
Test data
time,indicator
2018-11-02-02 00:00:00,1
2018-11-02-02 01:00:00,2
2018-11-02-02 02:00:00,3
2018-11-02-02 03:00:00,4
2018-11-02-02 04:00:00,5
2018-11-02-02 05:00:00,6
2018-11-02-02 06:00:00,7
2018-11-02-02 07:00:00,8
2018-11-02-02 08:00:00,9
2018-11-02-02 09:00:00,10
2018-11-02-02 10:00:00,11
2018-11-02-02 11:00:00,12
2018-11-02-02 12:00:00,13
2018-11-02-02 13:00:00,14
2018-11-02-02 14:00:00,15
2018-11-02-02 15:00:00,16
2018-11-02-02 16:00:00,17
2018-11-02-02 17:00:00,18
2018-11-02-02 18:00:00,19
2018-11-02-02 19:00:00,15
2018-11-02-02 20:00:00,11
2018-11-02-02 21:00:00,7
2018-11-02-02 22:00:00,3
2018-11-02-02 23:00:00,1
2018-11-02-03 00:00:00,1
2018-11-02-03 01:00:00,2
2018-11-02-03 02:00:00,3
2018-11-02-03 03:00:00,4
2018-11-02-03 04:00:00,5
2018-11-02-03 05:00:00,6
2018-11-02-03 06:00:00,7
2018-11-02-03 07:00:00,8
2018-11-02-03 08:00:00,9
2018-11-02-03 09:00:00,10
2018-11-02-03 10:00:00,11
2018-11-02-03 11:00:00,12
2018-11-02-03 12:00:00,13
2018-11-02-03 13:00:00,14
2018-11-02-03 14:00:00,15
2018-11-02-03 15:00:00,16
2018-11-02-03 16:00:00,17
2018-11-02-03 17:00:00,18
2018-11-02-03 18:00:00,19
2018-11-02-03 19:00:00,15
2018-11-02-03 20:00:00,11
2018-11-02-03 21:00:00,7
2018-11-02-03 22:00:00,3
2018-11-02-03 23:00:00,1
2018-11-02-04 00:00:00,1
2018-11-02-04 01:00:00,2
2018-11-02-04 02:00:00,3
2018-11-02-04 03:00:00,4
2018-11-02-04 04:00:00,5
2018-11-02-04 05:00:00,6
2018-11-02-04 06:00:00,7
2018-11-02-04 07:00:00,8
2018-11-02-04 08:00:00,9
2018-11-02-04 09:00:00,10
2018-11-02-04 10:00:00,11
2018-11-02-04 11:00:00,12
2018-11-02-04 12:00:00,4
2018-11-02-04 13:00:00,14
2018-11-02-04 14:00:00,15
2018-11-02-04 15:00:00,16
2018-11-02-04 16:00:00,17
2018-11-02-04 17:00:00,18
2018-11-02-04 18:00:00,19
2018-11-02-04 19:00:00,15
2018-11-02-04 20:00:00,11
2018-11-02-04 21:00:00,7
2018-11-02-04 22:00:00,3
2018-11-02-04 23:00:00,1
2018-11-02-05 00:00:00,1
2018-11-02-05 01:00:00,2
2018-11-02-05 02:00:00,3
2018-11-02-05 03:00:00,4
2018-11-02-05 04:00:00,5
2018-11-02-05 05:00:00,6
2018-11-02-05 06:00:00,7
2018-11-02-05 07:00:00,8
2018-11-02-05 08:00:00,9
2018-11-02-05 09:00:00,10
2018-11-02-05 10:00:00,11
2018-11-02-05 11:00:00,12
2018-11-02-05 12:00:00,13
2018-11-02-05 13:00:00,14
2018-11-02-05 14:00:00,15
2018-11-02-05 15:00:00,16
2018-11-02-05 16:00:00,17
2018-11-02-05 17:00:00,18
2018-11-02-05 18:00:00,19
2018-11-02-05 19:00:00,15
2018-11-02-05 20:00:00,11
2018-11-02-05 21:00:00,7
2018-11-02-05 22:00:00,3
2018-11-02-05 23:00:00,1
2018-11-02-06 00:00:00,1
2018-11-02-06 01:00:00,2
2018-11-02-06 02:00:00,3
2018-11-02-06 03:00:00,4
2018-11-02-06 04:00:00,5
2018-11-02-06 05:00:00,6
2018-11-02-06 06:00:00,7
2018-11-02-06 07:00:00,8
2018-11-02-06 08:00:00,9
2018-11-02-06 09:00:00,10
2018-11-02-06 10:00:00,11
2018-11-02-06 11:00:00,12
2018-11-02-06 12:00:00,13
2018-11-02-06 13:00:00,14
2018-11-02-06 14:00:00,15
2018-11-02-06 15:00:00,16
2018-11-02-06 16:00:00,17
2018-11-02-06 17:00:00,18
2018-11-02-06 18:00:00,19
2018-11-02-06 19:00:00,15
2018-11-02-06 20:00:00,8
2018-11-02-06 21:00:00,7
2018-11-02-06 22:00:00,3
2018-11-02-06 23:00:00,1