[Original] Combining KMeans with a Deep-Learning AutoEncoder to Improve Clustering Results


I've been working on user profiling these past few days. The features are the amounts each user spends per goods category, and the raw data (a sample) looks like this:

id  goods_name      goods_amount
1   男士手袋         1882.0
2   淑女裝           2491.0
2   女士手袋         345.0
4   基礎內衣         328.0
5   商務正裝         4985.0
5   時尚             969.0
5   女飾品           86.0
6   專業運動         399.0
6   童裝(中大童)     2033.0
6   男士配件         38.0

 

As you can see, the same id has several purchase records, so the data can't be used directly. I wrote a Python script to reshape it: test.py

#!/usr/bin/python
#coding:utf-8
#Author:Charlotte
import pandas as pd
import numpy as np
import time

# Load the data file (you can load your own file in the format shown above)
x = pd.read_table('test.txt', sep=" ")

# Drop rows with NULL values (dropna returns a new DataFrame, so reassign)
x = x.dropna()

a1 = list(x.iloc[:,0])
a2 = list(x.iloc[:,1])
a3 = list(x.iloc[:,2])

# A is the list of goods categories
dicta = dict(zip(a2, zip(a1, a3)))
A = list(dicta.keys())
# B is the list of user ids
B = list(set(a1))

# data_class = pd.DataFrame(A, lista)

# Build the goods-category dictionary (category name -> column index)
a = np.arange(len(A))
lista = list(a)
dict_class = dict(zip(A, lista))
print dict_class

f = open('class.txt', 'w')
for k, v in dict_class.items():
    f.write(str(k) + '\t' + str(v) + '\n')
f.close()

# Time the assignment step
start = time.clock()

# Build one big dictionary holding the data: user id -> vector of amounts per category
dictall = {}
for i in xrange(len(a1)):
    if a1[i] in dictall:
        # existing user: fill in the amount for this category
        value = dictall[a1[i]]
        j = dict_class[a2[i]]
        value[j] = a3[i]
        dictall[a1[i]] = value
    else:
        # new user: start from an all-zero vector
        value = list(np.zeros(len(A)))
        j = dict_class[a2[i]]
        value[j] = a3[i]
        dictall[a1[i]] = value

# Convert the dictionary to a DataFrame (transpose so users become rows)
dictall1 = pd.DataFrame(dictall)
dictall_matrix = dictall1.T
print dictall_matrix

end = time.clock()
print "Time for the assignment step: %f s" % (end - start)

Output:

{'\xe4\xb8\x93\xe4\xb8\x9a\xe8\xbf\x90\xe5\x8a\xa8': 4, '\xe7\x94\xb7\xe5\xa3\xab\xe6\x89\x8b\xe8\xa2\x8b': 1, '\xe5\xa5\xb3\xe5\xa3\xab\xe6\x89\x8b\xe8\xa2\x8b': 2, '\xe7\xab\xa5\xe8\xa3\x85\xef\xbc\x88\xe4\xb8\xad\xe5\xa4\xa7\xe7\xab\xa5)': 3, '\xe7\x94\xb7\xe5\xa3\xab\xe9\x85\x8d\xe4\xbb\xb6': 9, '\xe5\x9f\xba\xe7\xa1\x80\xe5\x86\x85\xe8\xa1\xa3': 8, '\xe6\x97\xb6\xe5\xb0\x9a': 6, '\xe6\xb7\x91\xe5\xa5\xb3\xe8\xa3\x85': 7, '\xe5\x95\x86\xe5\x8a\xa1\xe6\xad\xa3\xe8\xa3\x85': 5, '\xe5\xa5\xb3\xe9\xa5\xb0\xe5\x93\x81': 0}

    0     1    2     3    4     5    6     7    8   9
1   0  1882    0     0    0     0    0     0    0   0
2   0     0  345     0    0     0    0  2491    0   0
4   0     0    0     0    0     0    0     0  328   0
5  86     0    0     0    0  4985  969     0    0   0
6   0     0    0  2033  399     0    0     0    0  38
Time for the assignment step: 0.004497 s

Because of the character encoding on Linux, the console prints the dictionary keys as escaped UTF-8 bytes; the readable mapping written to class.txt is:
專業運動    4
男士手袋    1
女士手袋    2
童裝(中大童)    3
男士配件    9
基礎內衣    8
時尚    6
淑女裝    7
商務正裝    5
女飾品    0

The resulting dictall_matrix is exactly the format we feed into the model: each row is a user id and each column is a goods category.
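As an aside, the same user × category matrix can also be built directly with pandas. A minimal sketch (my own addition, assuming the same test.txt header of id, goods_name, goods_amount; note that pivot_table averages duplicate (id, category) pairs by default, whereas the dictionary loop above keeps the last value seen):

#!/usr/bin/python
#coding:utf-8
import pandas as pd

# Read the raw purchase records (same format as test.txt above)
x = pd.read_table('test.txt', sep=" ").dropna()

# Pivot: one row per user id, one column per goods category,
# missing (id, category) combinations filled with 0
matrix = x.pivot_table(index='id', columns='goods_name',
                       values='goods_amount', fill_value=0)

print matrix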

 

   Now let's run the AE (auto-encoder) model. Briefly, it has three layers: input, hidden, and output. You feed the data in, encode it, then decode it; the cost function is the "difference" between output and input (there is a formula, shown below), and the smaller that difference, the better the objective. In other words, you feed in n-dimensional data and get n-dimensional data back out. What is the point of that? By itself, not much; the real use is compression. If your actual feature space has several thousand dimensions, feeding everything straight into the algorithm rarely works well, because not all features are informative. With an AE you can compress the data down to m dimensions (the number of hidden-layer nodes); if the decoded output is close to, roughly proportional to, the original data, that shows the hidden-layer representation is usable. In that sense it resembles dimensionality reduction, although auto-encoders can do far more than this. For details, see 梁博's blog post on the topic.
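For reference, the "difference" being minimized, the quantity Jw that the training loop below prints, is the mean squared reconstruction error over the M training samples; in LaTeX notation:

J_w = \frac{1}{M} \sum_{i=1}^{M} \lVert x_i - \hat{x}_i \rVert^2

where x_i is the i-th input vector and \hat{x}_i is its reconstruction at the output layer.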

 

梁博's post is written in C++, so here is a Python implementation (open-source code with a few small modifications):

#!/usr/bin/python
#coding:utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

class AutoEncoder():
    """ Auto Encoder
    layer      1     2    ...    ...    L-1    L
      W        0     1    ...    ...    L-2
      B        0     1    ...    ...    L-2
      Z              0     1     ...    L-3    L-2
      A              0     1     ...    L-3    L-2
    """

    def __init__(self, X, Y, nNodes):
        # training samples
        self.X = X
        self.Y = Y
        # number of samples
        self.M = len(self.X)
        # number of layers in the network
        self.nLayers = len(nNodes)
        # nodes at each layer
        self.nNodes = nNodes
        # parameters of the network
        self.W = list()
        self.B = list()
        self.dW = list()
        self.dB = list()
        self.A = list()
        self.Z = list()
        self.delta = list()
        for iLayer in range(self.nLayers - 1):
            self.W.append( np.random.rand(nNodes[iLayer]*nNodes[iLayer+1]).reshape(nNodes[iLayer],nNodes[iLayer+1]) )
            self.B.append( np.random.rand(nNodes[iLayer+1]) )
            self.dW.append( np.zeros([nNodes[iLayer], nNodes[iLayer+1]]) )
            self.dB.append( np.zeros(nNodes[iLayer+1]) )
            self.A.append( np.zeros(nNodes[iLayer+1]) )
            self.Z.append( np.zeros(nNodes[iLayer+1]) )
            self.delta.append( np.zeros(nNodes[iLayer+1]) )

        # value of the cost function
        self.Jw = 0.0
        # activation function (logistic function)
        self.sigmod = lambda z: 1.0 / (1.0 + np.exp(-z))
        # learning rate
        self.alpha = 2.5
        # number of iterations
        self.steps = 10000

    def BackPropAlgorithm(self):
        # clear accumulated values
        self.Jw -= self.Jw
        for iLayer in range(self.nLayers-1):
            self.dW[iLayer] -= self.dW[iLayer]
            self.dB[iLayer] -= self.dB[iLayer]
        # propagation (iteration over M samples)
        for i in range(self.M):
            # Forward propagation
            for iLayer in range(self.nLayers - 1):
                if iLayer==0: # first layer
                    self.Z[iLayer] = np.dot(self.X[i], self.W[iLayer])
                else:
                    self.Z[iLayer] = np.dot(self.A[iLayer-1], self.W[iLayer])
                self.A[iLayer] = self.sigmod(self.Z[iLayer] + self.B[iLayer])
            # Back propagation
            for iLayer in range(self.nLayers - 1)[::-1]: # reverse order
                if iLayer==self.nLayers-2: # output layer
                    self.delta[iLayer] = -(self.X[i] - self.A[iLayer]) * (self.A[iLayer]*(1-self.A[iLayer]))
                    self.Jw += np.dot(self.Y[i] - self.A[iLayer], self.Y[i] - self.A[iLayer])/self.M
                else:
                    # hidden layer: propagate the error back through the weights of the
                    # *next* layer (W[iLayer+1]), then apply the sigmoid derivative
                    self.delta[iLayer] = np.dot(self.W[iLayer+1], self.delta[iLayer+1]) * (self.A[iLayer]*(1-self.A[iLayer]))
                # accumulate dW and dB
                if iLayer==0:
                    self.dW[iLayer] += self.X[i][:, np.newaxis] * self.delta[iLayer][:, np.newaxis].T
                else:
                    self.dW[iLayer] += self.A[iLayer-1][:, np.newaxis] * self.delta[iLayer][:, np.newaxis].T
                self.dB[iLayer] += self.delta[iLayer]
        # update parameters
        for iLayer in range(self.nLayers-1):
            self.W[iLayer] -= (self.alpha/self.M)*self.dW[iLayer]
            self.B[iLayer] -= (self.alpha/self.M)*self.dB[iLayer]

    def PlainAutoEncoder(self):
        for i in range(self.steps):
            self.BackPropAlgorithm()
            print "step:%d" % i, "Jw=%f" % self.Jw

    def ValidateAutoEncoder(self):
        for i in range(self.M):
            print self.X[i]
            for iLayer in range(self.nLayers - 1):
                if iLayer==0: # input layer
                    self.Z[iLayer] = np.dot(self.X[i], self.W[iLayer])
                else:
                    self.Z[iLayer] = np.dot(self.A[iLayer-1], self.W[iLayer])
                self.A[iLayer] = self.sigmod(self.Z[iLayer] + self.B[iLayer])
                print "\t layer=%d" % iLayer, self.A[iLayer]

data=[]
index=[]
f=open('./data_matrix.txt','r')
for line in f.readlines():
    ss=line.replace('\n','').split('\t')
    index.append(ss[0])
    ss1=ss[1].split(' ')
    tmp=[]
    for i in xrange(len(ss1)):
        tmp.append(float(ss1[i]))
    data.append(tmp)
f.close()

x = np.array(data)
# standardize the features (zero mean, unit variance)
xx = preprocessing.scale(x)
nNodes = np.array([ 10, 5, 10])
ae3 = AutoEncoder(xx,xx,nNodes)
ae3.PlainAutoEncoder()
ae3.ValidateAutoEncoder()

# This is the toy example; the output shown below comes from it
# xx = np.array([[0,0,0,0,0,0,0,1], [0,0,0,0,0,0,1,0], [0,0,0,0,0,1,0,0], [0,0,0,0,1,0,0,0],[0,0,0,1,0,0,0,0], [0,0,1,0,0,0,0,0]])
# nNodes = np.array([ 8, 3, 8 ])
# ae2 = AutoEncoder(xx,xx,nNodes)
# ae2.PlainAutoEncoder()
# ae2.ValidateAutoEncoder()

 

The results below come from the toy example (the commented-out 8-3-8 case above); the real data runs on the server. This is just to show what the output means.

 

[0 0 0 0 0 0 0 1]
     layer=0 [ 0.76654705  0.04221051  0.01185895]
     layer=1 [  4.67403977e-03   5.18624788e-03   2.03185410e-02   1.24383559e-02
   1.54423619e-02   1.69197292e-03   2.34471751e-05   9.72956513e-01]
[0 0 0 0 0 0 1 0]
     layer=0 [ 0.08178768  0.96348458  0.98583155]
     layer=1 [  8.18926274e-04   7.30041977e-04   1.06452565e-02   9.94423121e-03
   3.47329848e-03   1.32582980e-02   9.80648863e-01   8.42319408e-08]
[0 0 0 0 0 1 0 0]
     layer=0 [ 0.04752084  0.01144966  0.67313608]
     layer=1 [  4.38577163e-03   4.12704649e-03   1.83408905e-02   1.59209302e-05
   2.32400619e-02   9.71429772e-01   1.78538577e-02   2.20897151e-03]
[0 0 0 0 1 0 0 0]
     layer=0 [ 0.00819346  0.37410028  0.0207633 ]
     layer=1 [  8.17965283e-03   7.94760145e-03   4.59916741e-05   2.03558668e-02
   9.68811657e-01   2.09241369e-02   6.19909778e-03   1.51964053e-02]
[0 0 0 1 0 0 0 0]
     layer=0 [ 0.88632868  0.9892662   0.07575306]
     layer=1 [  1.15787916e-03   1.25924912e-03   3.72748604e-03   9.79510789e-01
   1.09439392e-02   7.81892291e-08   1.06705286e-02   1.77993321e-02]
[0 0 1 0 0 0 0 0]
     layer=0 [ 0.9862938   0.2677048   0.97331042]
     layer=1 [  6.03115828e-04   6.37411444e-04   9.75530999e-01   4.06825647e-04
   2.66386294e-07   1.27802666e-02   8.66599313e-03   1.06025228e-02]

 

You can clearly see that layer 1 (the reconstruction) matches the original data, so we can take layer 0 (the hidden-layer activations) as the new, lower-dimensional data; a sketch of extracting that representation follows.
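The class above does not expose the hidden representation directly, so here is a minimal, hypothetical helper (my own addition, not part of the original code) showing one way to collect the layer-0 activations for every sample so they can be fed to the clustering step. It assumes a trained AutoEncoder instance such as ae3 above:

import numpy as np

def encode_samples(ae):
    # Run the first (encoding) step of the forward pass for every sample
    # and keep only the hidden-layer (layer 0) activations as the reduced features.
    hidden = []
    for i in range(ae.M):
        z = np.dot(ae.X[i], ae.W[0])
        a = ae.sigmod(z + ae.B[0])   # layer 0 activations
        hidden.append(a)
    return np.array(hidden)

# hypothetical usage: the reduced matrix to cluster on
# reduced = encode_samples(ae3)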

 

Finally we run the clustering. This part is simple with the sklearn package, just a few lines of code:

 

#!/usr/bin/python
# coding:utf-8
# Author :Charlotte

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn import metrics


# Load the reduced data (column 0 is the user id, the remaining columns are features)
data = pd.read_table('data_new.txt', header=None, sep=" ")
x = data.iloc[:, 1:]
card = data.iloc[:, 0]
x1 = np.array(x)
xx = preprocessing.scale(x1)
num_clusters = 5

clf = KMeans(n_clusters=num_clusters, n_init=1, n_jobs=-1, verbose=1)
clf.fit(xx)
print(clf.labels_)
labels = clf.labels_
# score is the silhouette coefficient
score = metrics.silhouette_score(xx, labels)
# clf.inertia_ (within-cluster sum of squared distances) helps judge whether the
# number of clusters is appropriate; the smaller it is, the tighter the clusters
print clf.inertia_
print score
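The number of clusters is fixed at 5 above. If you want to sanity-check other values, a minimal sketch (my own addition, reusing xx, KMeans and metrics from the script above) is to scan a few candidate k values and compare the two metrics:

# Scan several candidate cluster counts and print both metrics;
# a higher silhouette score and an "elbow" in inertia suggest a reasonable k.
for k in range(2, 11):
    clf_k = KMeans(n_clusters=k, n_init=10)
    labels_k = clf_k.fit_predict(xx)
    print k, clf_k.inertia_, metrics.silhouette_score(xx, labels_k)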

 

  The data here is only an example with few dimensions, so the effect is not obvious. The real data is roughly 300,000 rows by 142 dimensions; it was preprocessed with a MapReduce job and then reduced to 50 dimensions with the AE model. The clf.inertia_ and silhouette coefficient of the two versions differ markedly:

 

 

                      clf.inertia_       silhouette
base version          252666.064229      0.676239435
after the AE model    662.704257502      0.962147623

 

 

So the clf.inertia_ from clustering directly on the raw features, without the AE model, is several orders of magnitude larger than the clf.inertia_ after running the AE model; the effect of the AE is quite significant.

 

The above was put together quickly; if you spot any mistakes, corrections are welcome :)

