I've been working on user profiling these past few days. The features are the amounts each user spent on different product categories, and the raw data (a sample) looks like this:
```
id goods_name goods_amount
1 男士手袋 1882.0
2 淑女裝 2491.0
2 女士手袋 345.0
4 基礎內衣 328.0
5 商務正裝 4985.0
5 時尚 969.0
5 女飾品 86.0
6 專業運動 399.0
6 童裝(中大童) 2033.0
6 男士配件 38.0
```
As you can see, the same id has several purchase records, so this data can't be used directly. I wrote a Python script to process it: test.py
```python
#!/usr/bin/python
#coding:utf-8
#Author:Charlotte
import pandas as pd
import numpy as np
import time

# Load the data file (you can load your own file in the format shown above)
x = pd.read_table('test.txt', sep=" ")

# Drop NULL values (dropna returns a new DataFrame, so reassign it)
x = x.dropna()

a1 = list(x.iloc[:,0])
a2 = list(x.iloc[:,1])
a3 = list(x.iloc[:,2])

# A is the list of goods categories
dicta = dict(zip(a2, zip(a1, a3)))
A = list(dicta.keys())
# B is the list of user ids
B = list(set(a1))

# data_class = pd.DataFrame(A, lista)

# Build the goods-category dictionary (category name -> column index)
a = np.arange(len(A))
lista = list(a)
dict_class = dict(zip(A, lista))
print dict_class

f = open('class.txt', 'w')
for k, v in dict_class.items():
    f.write(str(k) + '\t' + str(v) + '\n')
f.close()

# Measure the running time
start = time.clock()

# Build a big dictionary holding the data: user id -> vector of amounts per category
dictall = {}
for i in xrange(len(a1)):
    if a1[i] in dictall.keys():
        value = dictall[a1[i]]
        j = dict_class[a2[i]]
        value[j] = a3[i]
        dictall[a1[i]] = value
    else:
        value = list(np.zeros(len(A)))
        j = dict_class[a2[i]]
        value[j] = a3[i]
        dictall[a1[i]] = value

# Convert the dictionary into a DataFrame (one row per user id)
dictall1 = pd.DataFrame(dictall)
dictall_matrix = dictall1.T
print dictall_matrix

end = time.clock()
print "Assignment step running time: %f s" % (end - start)
```
Output:
```
{'\xe4\xb8\x93\xe4\xb8\x9a\xe8\xbf\x90\xe5\x8a\xa8': 4, '\xe7\x94\xb7\xe5\xa3\xab\xe6\x89\x8b\xe8\xa2\x8b': 1, '\xe5\xa5\xb3\xe5\xa3\xab\xe6\x89\x8b\xe8\xa2\x8b': 2, '\xe7\xab\xa5\xe8\xa3\x85\xef\xbc\x88\xe4\xb8\xad\xe5\xa4\xa7\xe7\xab\xa5)': 3, '\xe7\x94\xb7\xe5\xa3\xab\xe9\x85\x8d\xe4\xbb\xb6': 9, '\xe5\x9f\xba\xe7\xa1\x80\xe5\x86\x85\xe8\xa1\xa3': 8, '\xe6\x97\xb6\xe5\xb0\x9a': 6, '\xe6\xb7\x91\xe5\xa5\xb3\xe8\xa3\x85': 7, '\xe5\x95\x86\xe5\x8a\xa1\xe6\xad\xa3\xe8\xa3\x85': 5, '\xe5\xa5\xb3\xe9\xa5\xb0\xe5\x93\x81': 0}
     0     1    2     3    4     5    6     7    8   9
1    0  1882    0     0    0     0    0     0    0   0
2    0     0  345     0    0     0    0  2491    0   0
4    0     0    0     0    0     0    0     0  328   0
5   86     0    0     0    0  4985  969     0    0   0
6    0     0    0  2033  399     0    0     0    0  38
Assignment step running time: 0.004497 s
```

Because of the character encoding on Linux, the dict keys print as escaped UTF-8 bytes; the class.txt written to disk is readable:

```
專業運動	4
男士手袋	1
女士手袋	2
童裝(中大童)	3
男士配件	9
基礎內衣	8
時尚	6
淑女裝	7
商務正裝	5
女飾品	0
```

The resulting dictall_matrix is exactly the format we feed to the model: each column is a goods category and each row is a user id.
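As an aside, the same wide matrix can also be built with pandas' pivot_table in a couple of lines. Here is a minimal sketch of my own (not from the original post), assuming the same three-column test.txt layout shown above:

```python
import pandas as pd

# A sketch assuming the same test.txt as above: pivot each (id, goods_name)
# pair into a wide user-by-category matrix of amounts, with 0 for missing categories.
x = pd.read_table('test.txt', sep=" ").dropna()
matrix = x.pivot_table(index='id', columns='goods_name',
                       values='goods_amount', aggfunc='sum', fill_value=0)
# Note: duplicate (id, category) pairs are summed here, whereas the
# dictionary loop above keeps only the last amount seen.
print(matrix)
```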
Now let's run the AE (auto-encoder) model. Briefly, an AE is very simple: it has three layers, input - hidden - output. You feed the data in, encode it, then decode it, and the cost function is the "difference" between the output and the input (there is a formula for this; see below); the smaller the difference, the better the objective. In other words, you put n-dimensional data in and get n-dimensional data back out. You might ask what the point of that is. On its own, not much; the main use is that it lets you compress the data. If your input is high-dimensional, say several thousand real features, throwing all of them into an algorithm doesn't necessarily work well, because not every feature is useful. With an AE you can compress the data down to m dimensions (the number of nodes in the hidden layer), and if the output reproduces the original data at roughly the same scale, that shows the hidden-layer representation is usable. In that sense it resembles dimensionality reduction, though AE models can do far more than this; for details see a blog post by 梁博.
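For reference, the reconstruction error that the code below accumulates in `Jw` is the mean squared difference between input and output (a sketch written from the `Jw` line in the code, with $x^{(i)}$ the i-th input sample and $\hat{x}^{(i)}$ the network's output for it):

$$ J = \frac{1}{M}\sum_{i=1}^{M}\left\lVert x^{(i)} - \hat{x}^{(i)} \right\rVert^{2} $$

The smaller this value gets during training, the closer the reconstruction is to the input.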
However, 梁博's post is written in C++; here is a Python version (open-source code with a few small changes):
```python
#!/usr/bin/python
#coding:utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

class AutoEncoder():
    """ Auto Encoder
    layer   1   2   ...   ...   L-1   L
      W     0   1   ...   ...   L-2
      B     0   1   ...   ...   L-2
      Z         0   1     ...   L-3   L-2
      A         0   1     ...   L-3   L-2
    """

    def __init__(self, X, Y, nNodes):
        # training samples
        self.X = X
        self.Y = Y
        # number of samples
        self.M = len(self.X)
        # layers of networks
        self.nLayers = len(nNodes)
        # nodes at layers
        self.nNodes = nNodes
        # parameters of networks
        self.W = list()
        self.B = list()
        self.dW = list()
        self.dB = list()
        self.A = list()
        self.Z = list()
        self.delta = list()
        for iLayer in range(self.nLayers - 1):
            self.W.append( np.random.rand(nNodes[iLayer]*nNodes[iLayer+1]).reshape(nNodes[iLayer],nNodes[iLayer+1]) )
            self.B.append( np.random.rand(nNodes[iLayer+1]) )
            self.dW.append( np.zeros([nNodes[iLayer], nNodes[iLayer+1]]) )
            self.dB.append( np.zeros(nNodes[iLayer+1]) )
            self.A.append( np.zeros(nNodes[iLayer+1]) )
            self.Z.append( np.zeros(nNodes[iLayer+1]) )
            self.delta.append( np.zeros(nNodes[iLayer+1]) )

        # value of cost function
        self.Jw = 0.0
        # activation function (logistic function)
        self.sigmod = lambda z: 1.0 / (1.0 + np.exp(-z))
        # learning rate 1.2
        self.alpha = 2.5
        # steps of iteration 30000
        self.steps = 10000

    def BackPropAlgorithm(self):
        # clear accumulated values
        self.Jw -= self.Jw
        for iLayer in range(self.nLayers-1):
            self.dW[iLayer] -= self.dW[iLayer]
            self.dB[iLayer] -= self.dB[iLayer]
        # propagation (iteration over M samples)
        for i in range(self.M):
            # Forward propagation
            for iLayer in range(self.nLayers - 1):
                if iLayer==0: # first layer
                    self.Z[iLayer] = np.dot(self.X[i], self.W[iLayer])
                else:
                    self.Z[iLayer] = np.dot(self.A[iLayer-1], self.W[iLayer])
                self.A[iLayer] = self.sigmod(self.Z[iLayer] + self.B[iLayer])
            # Back propagation
            for iLayer in range(self.nLayers - 1)[::-1]: # reversed order
                if iLayer==self.nLayers-2: # last layer
                    self.delta[iLayer] = -(self.X[i] - self.A[iLayer]) * (self.A[iLayer]*(1-self.A[iLayer]))
                    self.Jw += np.dot(self.Y[i] - self.A[iLayer], self.Y[i] - self.A[iLayer])/self.M
                else:
                    # propagate the error back through the weights above this layer
                    self.delta[iLayer] = np.dot(self.W[iLayer+1], self.delta[iLayer+1]) * (self.A[iLayer]*(1-self.A[iLayer]))
                # accumulate dW and dB
                if iLayer==0:
                    self.dW[iLayer] += self.X[i][:, np.newaxis] * self.delta[iLayer][:, np.newaxis].T
                else:
                    self.dW[iLayer] += self.A[iLayer-1][:, np.newaxis] * self.delta[iLayer][:, np.newaxis].T
                self.dB[iLayer] += self.delta[iLayer]
        # update weights and biases
        for iLayer in range(self.nLayers-1):
            self.W[iLayer] -= (self.alpha/self.M)*self.dW[iLayer]
            self.B[iLayer] -= (self.alpha/self.M)*self.dB[iLayer]

    def PlainAutoEncoder(self):
        for i in range(self.steps):
            self.BackPropAlgorithm()
            print "step:%d" % i, "Jw=%f" % self.Jw

    def ValidateAutoEncoder(self):
        for i in range(self.M):
            print self.X[i]
            for iLayer in range(self.nLayers - 1):
                if iLayer==0: # input layer
                    self.Z[iLayer] = np.dot(self.X[i], self.W[iLayer])
                else:
                    self.Z[iLayer] = np.dot(self.A[iLayer-1], self.W[iLayer])
                self.A[iLayer] = self.sigmod(self.Z[iLayer] + self.B[iLayer])
                print "\t layer=%d" % iLayer, self.A[iLayer]

data=[]
index=[]
f=open('./data_matrix.txt','r')
for line in f.readlines():
    ss=line.replace('\n','').split('\t')
    index.append(ss[0])
    ss1=ss[1].split(' ')
    tmp=[]
    for i in xrange(len(ss1)):
        tmp.append(float(ss1[i]))
    data.append(tmp)
f.close()

x = np.array(data)
# standardize the features
xx = preprocessing.scale(x)
nNodes = np.array([ 10, 5, 10])
ae3 = AutoEncoder(xx,xx,nNodes)
ae3.PlainAutoEncoder()
ae3.ValidateAutoEncoder()

# This is the toy example; the output shown below comes from it
# xx = np.array([[0,0,0,0,0,0,0,1], [0,0,0,0,0,0,1,0], [0,0,0,0,0,1,0,0], [0,0,0,0,1,0,0,0],[0,0,0,1,0,0,0,0], [0,0,1,0,0,0,0,0]])
# nNodes = np.array([ 8, 3, 8 ])
# ae2 = AutoEncoder(xx,xx,nNodes)
# ae2.PlainAutoEncoder()
# ae2.ValidateAutoEncoder()
```
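The loader above expects './data_matrix.txt' to have the user id, a tab, and then the space-separated feature values on each line. The original post doesn't show how that file was produced; a minimal sketch of writing dictall_matrix from test.py in that layout (my assumption, not the author's code) might look like this:

```python
# Hypothetical bridge between test.py and the AE script: write dictall_matrix
# as "id<TAB>v1 v2 ... vn", one user per line, the layout the loader expects.
f = open('data_matrix.txt', 'w')
for uid, row in dictall_matrix.iterrows():
    f.write(str(uid) + '\t' + ' '.join(str(v) for v in row) + '\n')
f.close()
```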
The results below come from the toy example; the real data was run on the server. This is just to give you a sense of what the output means.
```
[0 0 0 0 0 0 0 1]
     layer=0 [ 0.76654705  0.04221051  0.01185895]
     layer=1 [ 4.67403977e-03  5.18624788e-03  2.03185410e-02  1.24383559e-02
               1.54423619e-02  1.69197292e-03  2.34471751e-05  9.72956513e-01]
[0 0 0 0 0 0 1 0]
     layer=0 [ 0.08178768  0.96348458  0.98583155]
     layer=1 [ 8.18926274e-04  7.30041977e-04  1.06452565e-02  9.94423121e-03
               3.47329848e-03  1.32582980e-02  9.80648863e-01  8.42319408e-08]
[0 0 0 0 0 1 0 0]
     layer=0 [ 0.04752084  0.01144966  0.67313608]
     layer=1 [ 4.38577163e-03  4.12704649e-03  1.83408905e-02  1.59209302e-05
               2.32400619e-02  9.71429772e-01  1.78538577e-02  2.20897151e-03]
[0 0 0 0 1 0 0 0]
     layer=0 [ 0.00819346  0.37410028  0.0207633 ]
     layer=1 [ 8.17965283e-03  7.94760145e-03  4.59916741e-05  2.03558668e-02
               9.68811657e-01  2.09241369e-02  6.19909778e-03  1.51964053e-02]
[0 0 0 1 0 0 0 0]
     layer=0 [ 0.88632868  0.9892662   0.07575306]
     layer=1 [ 1.15787916e-03  1.25924912e-03  3.72748604e-03  9.79510789e-01
               1.09439392e-02  7.81892291e-08  1.06705286e-02  1.77993321e-02]
[0 0 1 0 0 0 0 0]
     layer=0 [ 0.9862938   0.2677048   0.97331042]
     layer=1 [ 6.03115828e-04  6.37411444e-04  9.75530999e-01  4.06825647e-04
               2.66386294e-07  1.27802666e-02  8.66599313e-03  1.06025228e-02]
```
You can clearly see that layer 1 (the reconstruction) corresponds to the original data, so we can take layer 0 (the hidden-layer activations) as the new, dimension-reduced data.
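To actually pull those layer-0 values out as a matrix, a small helper can re-run just the encoder half of the trained network. This is a sketch of my own (encode_samples is a hypothetical name, not part of the original script), assuming the trained ae3 object from above:

```python
import numpy as np

# Hypothetical helper: compute the hidden-layer (layer 0) activations for
# every sample and stack them into the reduced data matrix.
def encode_samples(ae):
    hidden = []
    for i in range(ae.M):
        z = np.dot(ae.X[i], ae.W[0])      # input -> hidden pre-activation
        a = ae.sigmod(z + ae.B[0])        # hidden activations (layer 0)
        hidden.append(a)
    return np.array(hidden)

# reduced = encode_samples(ae3)   # shape (M, nNodes[1]); this is what goes on to clustering
```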
Finally comes the clustering step, which is the easy part: with the sklearn package it's just a few lines of code:
```python
#!/usr/bin/python
# coding:utf-8
# Author :Charlotte

from matplotlib import pyplot
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy import sparse
import pandas as pd
import Pycluster as pc
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pickle
from sklearn.externals import joblib


# Load the data
data = pd.read_table('data_new.txt', header=None, sep=" ")
x = data.ix[:, 1:141]
card = data.ix[:, 0]
x1 = np.array(x)
xx = preprocessing.scale(x1)
num_clusters = 5

clf = KMeans(n_clusters=num_clusters, n_init=1, n_jobs=-1, verbose=1)
clf.fit(xx)
print(clf.labels_)
labels = clf.labels_
# score is the silhouette coefficient
score = metrics.silhouette_score(xx, labels)
# clf.inertia_ is used to judge whether the number of clusters is appropriate;
# the smaller the distance, the better the clustering
print clf.inertia_
print score
```
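The script above fixes num_clusters = 5. If you want to check other values of k, here is a small sketch of my own (not from the original post) that scans several k and compares inertia and the silhouette coefficient, reusing the scaled matrix xx from above:

```python
from sklearn.cluster import KMeans
from sklearn import metrics

# Assumes xx, the scaled feature matrix built in the script above.
for k in range(2, 10):
    clf = KMeans(n_clusters=k, n_init=10)
    labels = clf.fit_predict(xx)
    print("k=%d  inertia=%.3f  silhouette=%.3f"
          % (k, clf.inertia_, metrics.silhouette_score(xx, labels)))
```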
This data is just an example; with so few dimensions the effect is not obvious. The real data is about 300,000 rows by 142 dimensions. I wrote a MapReduce job to do the data processing and then reduced it to 50 dimensions with the AE model; the clf.inertia_ and silhouette coefficient of the two versions differ significantly:
|  | clf.inertia_ | silhouette |
| --- | --- | --- |
| base version | 252666.064229 | 0.676239435 |
| after the AE model | 662.704257502 | 0.962147623 |
As you can see, the clf.inertia_ from clustering directly without the AE model is several orders of magnitude larger than the clf.inertia_ after applying the AE model, and the silhouette coefficient improves as well, so the effect of the AE is quite significant.
The above was put together rather hastily; if there are any mistakes, corrections are welcome :)