Understanding SimRank

Figure 1. A bipartite graph
A bipartite graph is a graph whose nodes can be divided into two disjoint subsets such that every edge connects a node from one subset to a node from the other. Let I(v) and O(v) denote the in-neighbors and out-neighbors of node v, respectively. In the bipartite graph above, think of A and B as two people and a, b, c as three products; a directed edge means the person bought the product. The basic idea of SimRank is: if two entities are similar, then the entities related to them should also be similar. For example, in the figure above, if a and c are similar, then A and B should also be similar, because A is related to a and B is related to c.
The basic SimRank equation:
\begin{equation}s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\label{basic}\end{equation}
Here s(a,b) is the similarity between nodes a and b, with s(a,b)=1 when a=b, and $I_i(a)$ denotes the i-th in-neighbor of a. When $I(a)=\emptyset$ or $I(b)=\emptyset$, \eqref{basic} is defined to be 0. In one sentence, \eqref{basic} says: the similarity of a and b equals the average similarity between a's in-neighbors and b's in-neighbors. The parameter C is a damping factor, and its meaning can be understood as follows: if I(a)=I(b)={A}, then \eqref{basic} yields s(a,b)=C·s(A,A)=C, hence $C\in(0,1)$.
Applying \eqref{basic} to the bipartite graph of Figure 1 gives:
\begin{equation}s(A,B)=\frac{C_1}{|O(A)||O(B)|}\sum_{i=1}^{|O(A)|}\sum_{j=1}^{|O(B)|}s(O_i(A),O_j(B))\ \ \ \ for\ A\ne{B}\label{out}\end{equation}
\begin{equation}s(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\ \ \ \ for\ a\ne{b}\label{in}\end{equation}
Ignoring $C_1$ and $C_2$, \eqref{out} says the similarity of buyers A and B equals the average similarity between the items they bought, while \eqref{in} says the similarity of items a and b equals the average similarity between the buyers who bought them.
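For concreteness (a worked first step added here for illustration), take $C_1=C_2=0.8$ on Figure 1 and start from the base case where each node is similar only to itself. One application of \eqref{out} gives
$$s(A,B)=\frac{0.8}{2\times{2}}\big(s(a,b)+s(a,c)+s(b,b)+s(b,c)\big)=\frac{0.8}{4}\times{1}=0.2$$
since initially only $s(b,b)=1$; similarly \eqref{in} gives $s(a,b)=\frac{0.8}{1\times{2}}\big(s(A,A)+s(A,B)\big)=0.4$. Alternating the two equations propagates these scores until they converge.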
In a general (non-bipartite) graph, a node may have both in-neighbors and out-neighbors. In a paper-citation setting, for instance, a paper may both cite other papers and be cited by them. From the citing point of view, the similarity between two papers is
\begin{equation}s_1(a,b)=\frac{C_1}{|O(a)||O(b)|}\sum_{i=1}^{|O(a)|}\sum_{j=1}^{|O(b)|}s_2(O_i(a),O_j(b))\label{hout}\end{equation}
From the cited-by point of view, the similarity between two papers is
\begin{equation}s_2(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s_1(I_i(a),I_j(b))\label{hin}\end{equation}
Depending on the application, the final s(a,b) can be either of \eqref{hout} and \eqref{hin}, or a combination of the two.
Naive SimRank
The SimRank iteration:
$R_0(a,b)=\left\{\begin{matrix}0\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$
$R_{k+1}(a,b)=\left\{\begin{matrix}\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}R_k(I_i(a),I_j(b))\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$
$R_k(*,*)$ is nondecreasing in k, and $\lim_{k\to\infty}R_k(a,b)=s(a,b)$. In practice $R_k(*,*)$ converges quickly, so k does not need to be large.
The matrix form is given below: the elementwise iteration above is hard to parallelize, and once the graph grows even moderately large (say, over a hundred thousand nodes) its time and space costs on a single machine become prohibitive.
\begin{equation}\left\{\begin{matrix}S^{(0)}=(1-c)\cdot{I_n}\\S^{(k+1)}=c\cdot{Q^T}\cdot{S^{(k)}}\cdot{Q}+(1-c)\cdot{I_n}\end{matrix}\right.\label{juzhen}\end{equation}
S is the similarity matrix. Q is the transition probability matrix, each of whose columns sums to 1: if there is an edge from node i to node j, and j has n in-neighbors in total, then $Q_{i,j}=\frac{1}{n}$.
An upper bound on the iteration error:
\begin{equation}\|S^{(k)}-S\|_{max}\le{c}^{k+1}\ \ \ \ (\forall{k}=0,1,2\ldots)\label{err_ceil}\end{equation}
The max norm of a matrix is defined as:
$\|X_{p\times{q}}\|_{max}\stackrel{def}{=}\max_{i=1}^{p}\max_{j=1}^{q}|x_{i,j}|$
Now for a small exercise. Figure 2 is a web page link graph: a university's homepage links to the personal pages of professors A and B, the pages of professor B and student B link to each other, and so on. We want to compute the pairwise similarities of these 5 nodes from this link structure.

Figure 2. Web page link graph
First, store the directed graph above in a text file.
linkgraph
univ	profA	profB
profA	studentA
studentA	univ
profB	studentB
studentB	profB
The first column of each line is a node, and the remaining columns are its out-neighbors; that is, a walk on this graph may only follow the arrows. The bipartite case is different: \eqref{out} walks along the arrows of the bipartite graph while \eqref{in} walks against them, i.e. every edge may be traversed in both directions. The bipartite graph of Figure 1 can therefore be written as:
linkgraph_bipartite
A	a	b
B	b	c
a	A
b	A	B
c	B
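Before moving to the code, it helps to know what to expect. Numbering the nodes of linkgraph in order of first appearance (univ=0, profA=1, profB=2, studentA=3, studentB=4), the transition matrix that the script below builds and prints as "trans ratio" should be
$$Q=\begin{bmatrix}0 & 1 & 0.5 & 0 & 0\\0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 1\\1 & 0 & 0 & 0 & 0\\0 & 0 & 0.5 & 0 & 0\end{bmatrix}$$
where row i, column j holds $\frac{1}{|I(j)|}$ for each edge i→j; e.g. profB has two in-neighbors (univ and studentB), hence the two 0.5 entries in its column.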
naive_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np

nodes = []  # all nodes, stored in an array
nodesnum = 0  # total number of nodes
nodes_index = {}  # <node name, index of the node in the nodes array>
damp = 0.8  # damping factor
trans_matrix = np.matrix(0)  # transition probability matrix
sim_matrix = np.matrix(0)  # node similarity matrix


def initParam(graphFile):
    '''
    Build nodes, nodes_index, trans_matrix and the 0th-generation sim_matrix.
    Each input line must look like: node\toutneighbor\toutneighbor\t...
    or node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global nodesnum
    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for outneighbor in arr[1:]:
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = link_in.get(outneighborid, [])
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors
    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    for node, inneighbors in link_in.items():
        prob = 1.0 / len(inneighbors)  # 1/|I(node)|
        for neighbor in inneighbors:
            trans_matrix[neighbor, node] = prob  # columns sum to 1
    sim_matrix = np.identity(nodesnum) * (1 - damp)


def iterate():
    '''
    One iteration of \eqref{juzhen}: S = c*Q^T*S*Q + (1-c)*I.
    '''
    global sim_matrix
    sim_matrix = damp * np.dot(np.dot(trans_matrix.transpose(),
                               sim_matrix), trans_matrix) + (1 - damp) * np.identity(nodesnum)


def printResult(sim_node_file):
    '''
    Write out the pairwise node similarities.
    '''
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, descending
        neighbour = sorted(neighbour, key=lambda x: x[1], reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    print "nodes:"
    print nodes_index
    print "trans ratio:"
    print trans_matrix
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print sim_matrix


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_naive"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
Finally we obtain the pairwise similarities of the 5 nodes
nodesim_naive
univ	profB:0.10803511296	studentB:0.02203058176
profA	profB:0.36478881792	studentB:0.08159625216
profB	profA:0.36478881792	univ:0.10803511296	studentB:0.0642220032	studentA:0.03022258176
studentA	studentB:0.28216737792	profB:0.03022258176
studentB	studentA:0.28216737792	profA:0.08159625216	profB:0.0642220032	univ:0.02203058176
The square-and-cache method
\eqref{err_ceil} already shows that SimRank converges quickly; below is a technique that accelerates convergence further, the square-and-cache method.
\begin{equation}\left\{\begin{matrix}S_{\left \langle 2 \right \rangle }^{(0)}=(1-c)\cdot{I_n}\\S_{\left \langle 2 \right \rangle }^{(k+1)}=S_{\left \langle 2 \right \rangle }^{(k)}+c^{2^k}\cdot{(Q^{2^k})^T}\cdot{S_{\left \langle 2 \right \rangle }^{(k)}}\cdot{Q^{2^k}}\end{matrix}\right.\label{square_cache}\end{equation}
The square-and-cache method converges even more impressively:
\begin{equation}\|S_{\left \langle 2 \right \rangle }^{(k)}-S\|_{max}\le{c}^{2^k}\ \ \ \ (\forall{k}=0,1,2\ldots)\label{square_err_ceil}\end{equation}
Note that $Q^{2^k}=(Q^{2^{k-1}})^2$, so each iteration does not recompute $Q^{2^k}$ from scratch; it just squares the power cached from the previous iteration. This is the whole reason \eqref{square_cache} is faster than \eqref{juzhen}. In fact:
$S_{\left \langle 2 \right \rangle }^{(k)}=S^{(2^k-1)}$
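This identity is easy to check numerically; below is a small sketch (an illustration added here, not part of the original code) comparing k square-and-cache iterations against $2^k-1$ naive iterations on a random column-stochastic matrix:
check_square_cache.py
import numpy as np

def naive_steps(Q, c, steps):
    # the iteration of \eqref{juzhen}: S = c*Q^T*S*Q + (1-c)*I
    n = Q.shape[0]
    S = (1 - c) * np.identity(n)
    for _ in range(steps):
        S = c * Q.T.dot(S).dot(Q) + (1 - c) * np.identity(n)
    return S

def squared_steps(Q, c, k):
    # the iteration of \eqref{square_cache}: use the cached powers of c and Q,
    # then square both for the next round
    n = Q.shape[0]
    S = (1 - c) * np.identity(n)
    Qp, cp = Q, c
    for _ in range(k):
        S = S + cp * Qp.T.dot(S).dot(Qp)
        Qp = Qp.dot(Qp)
        cp = cp ** 2
    return S

n, c, k = 6, 0.8, 3
Q = np.random.rand(n, n)
Q /= Q.sum(axis=0)  # make every column sum to 1
print(np.allclose(squared_steps(Q, c, k), naive_steps(Q, c, 2 ** k - 1)))  # True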
Let us recompute the node similarities of Figure 2 with the square-and-cache method.
square_cache_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np

nodes = []  # all nodes, stored in an array
nodesnum = 0  # total number of nodes
nodes_index = {}  # <node name, index of the node in the nodes array>
damp = 0.8  # damping factor
trans_matrix = np.matrix(0)  # transition probability matrix
sim_matrix = np.matrix(0)  # node similarity matrix


def initParam(graphFile):
    '''
    Build nodes, nodes_index, trans_matrix and the 0th-generation sim_matrix.
    Each input line must look like: node\toutneighbor\toutneighbor\t...
    or node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global nodesnum
    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for outneighbor in arr[1:]:
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = link_in.get(outneighborid, [])
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors
    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    for node, inneighbors in link_in.items():
        prob = 1.0 / len(inneighbors)
        for neighbor in inneighbors:
            # transposed orientation compared with naive_simrank.py, so that
            # iterate() below computes Q^T*S*Q as trans*S*trans^T
            trans_matrix[node, neighbor] = prob
    sim_matrix = np.identity(nodesnum) * (1 - damp)


def iterate():
    '''
    One square-and-cache iteration per \eqref{square_cache}: use the cached
    powers of c and Q first, then square both for the next round.
    '''
    global trans_matrix
    global sim_matrix
    global damp
    sim_matrix = damp * np.dot(np.dot(trans_matrix,
                               sim_matrix), trans_matrix.transpose()) + sim_matrix
    damp = damp ** 2
    trans_matrix = np.dot(trans_matrix, trans_matrix)


def printResult(sim_node_file):
    '''
    Write out the pairwise node similarities.
    '''
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, descending
        neighbour = sorted(neighbour, key=lambda x: x[1], reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    print "nodes:"
    print nodes_index
    print "trans ratio:"
    print trans_matrix
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print sim_matrix


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_square"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
The final similarity values differ from the naive method's, but the similarity rankings are exactly the same.
Dimensionality reduction with Arnoldi iteration
Every iteration of \eqref{square_cache} multiplies matrices, and once the matrices grow even moderately large the time and space costs are staggering: multiplying two $n\times{n}$ matrices takes $O(n^3)$ time, and even Strassen's divide-and-conquer algorithm needs $O(n^{2.81})$, while also assuming (in its basic form) square matrices whose side length is a power of 2. If we can reduce the dimensionality of Q and S in \eqref{square_cache}, we will save a great deal of computation.
Arnoldi iteration is a method for reducing the dimensionality of a large sparse matrix (here, the transition probability matrix Q above).
\begin{equation}Q=V\cdot{H}\cdot{V^T}\label{decompose}\end{equation}
The columns of V are orthonormal (unit length, mutually orthogonal): $V=[v_0\mid{v_1}\mid\ldots\mid{v}_{r-1}]$, where $v_i$ is the i-th column of V.
H is an $r\times{r}$ upper Hessenberg matrix (note the nonzero subdiagonal below), where r is the rank of Q; it has the form:
$H_{r\times{r}}=\begin{bmatrix}h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,r}\\h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,r}\\0 & h_{3,2} & h_{3,3} & \cdots & h_{3,r}\\\vdots & \ddots & \ddots & \ddots & \vdots\\0 & \cdots & 0 & h_{r,r-1} & h_{r,r}\end{bmatrix}$
V and H are computed as follows:
step 1. $v_0=[1,0,0,\ldots]^T$
step 2. for $k \in [1,\alpha]$:
&nbsp;&nbsp;&nbsp;&nbsp;$v_k=Q\cdot{v_{k-1}}$
&nbsp;&nbsp;&nbsp;&nbsp;for $j \in [0,k)$:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$H[j][k-1]=v_j^T\cdot{v_k}$
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$v_k=v_k-H[j][k-1]\cdot{v_j}$
&nbsp;&nbsp;&nbsp;&nbsp;$norm2=\|v_k\|_2$
&nbsp;&nbsp;&nbsp;&nbsp;if $norm2=0$: break
&nbsp;&nbsp;&nbsp;&nbsp;$H[k][k-1]=norm2$
&nbsp;&nbsp;&nbsp;&nbsp;$v_k=\frac{v_k}{norm2}$
step 3. drop the last column of V and the last row of H
In step 2, if the outer for loop exits because norm2=0, the resulting H has side length equal to r, the rank of Q; if it exits because the preset loop cap $\alpha$ was reached, H has side length $\alpha$.
arnoldi.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np
from sparse_matrix import SparseMatrix


def arnoldi_iteration(Q, n, rank):
    '''
    Decompose Q so that Q*V = V*H.
    Q: input, an n*n SparseMatrix of rank r.
    V, H: outputs, numpy matrices. V has orthonormal columns (V^T*V = I);
    H is an upper Hessenberg matrix. During the iteration V is n*(r+1) and
    H is (r+1)*r; the trailing column/row are dropped before returning,
    so the results are V: n*r and H: r*r.
    rank caps the number of iterations, r <= rank.
    '''
    if rank > n or rank <= 0:
        rank = n
    V = np.zeros((n, 1))
    V[0, 0] = 1
    h_col_list = []
    k = 1
    while k <= rank:
        h_col = []
        v_k = Q.mulMatrixCol(V, k - 1)
        for j in range(k):
            # orthogonalize v_k against the j-th column of V
            product = np.dot(np.matrix(V[:, j]).reshape(n, 1).transpose(), v_k)[0, 0]
            h_col.append(product)
            v_k = v_k - product * (np.matrix(V[:, j]).reshape(n, 1))
        norm2 = np.linalg.norm(v_k, ord=2)
        if norm2 == 0:
            print "norm2=0, will break"
            break
        h_col.append(norm2)
        h_col_list.append(h_col)
        v_k = v_k / norm2
        V = np.hstack((V, np.matrix(v_k)))
        k += 1
    r = len(h_col_list)
    print "r=", r
    H = np.zeros((r, r))
    for i in range(r):
        h_col = h_col_list[i]
        for j in range(len(h_col)):
            if j < r:
                H[j, i] = h_col[j]
    V = V[:, :r]
    return (V, H)


if __name__ == '__main__':
    Q = SparseMatrix()
    Q.set(0, 1, 0.5)
    Q.set(0, 2, 1)
    Q.set(1, 0, 0.5)
    Q.set(1, 3, 1)
    Q.set(2, 0, 0.5)
    Q.set(3, 1, 0.5)
    (V, H) = arnoldi_iteration(Q, 4, -1)
    print "V="
    print V
    print "H="
    print H
    print "VHV^T="
    print np.dot(np.dot(V, H), V.transpose())
    print "QV="
    print Q.mulMatrix(V)
    print "VH="
    print np.dot(V, H)
sparse_matrix.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np


class SparseMatrix():
    '''
    A highly sparse matrix backed by a dict: each key is the 2-D coordinate
    of an element and the value is the element itself.
    '''

    def __init__(self):
        self.arr = {}

    def get(self, row, col):
        key = (row, col)
        if key in self.arr:
            return self.arr[key]
        else:
            return 0

    def set(self, row, col, value):
        key = (row, col)
        self.arr[key] = value

    def mul(self, vec):
        '''
        Multiply by a 1-D vector; returns a list.
        '''
        length = len(vec)
        rect = [0] * length
        for k, v in self.arr.items():
            i = k[0]
            j = k[1]
            rect[i] += v * vec[j]
        return rect

    def mulMatrixCol(self, matrix, col):
        '''
        Multiply by the col-th column of a matrix; returns an n*1 matrix.
        '''
        length = matrix.shape[0]
        rect = np.zeros((length, 1))
        for k, v in self.arr.items():
            i = k[0]
            j = k[1]
            rect[i, 0] += v * matrix[j, col]
        return rect

    def mulMatrix(self, matrix):
        '''
        Multiply by a matrix.
        '''
        col_num = matrix.shape[1]
        rect = self.mulMatrixCol(matrix, 0)
        for i in range(1, col_num):
            rect = np.hstack((rect, self.mulMatrixCol(matrix, i)))
        return rect

    def transmul(self, vec):
        '''
        Multiply the transpose of this matrix by a 1-D vector.
        '''
        length = len(vec)
        rect = [0] * length
        for k, v in self.arr.items():
            i = k[1]
            j = k[0]
            rect[i] += v * vec[j]
        return rect
Replace Q in \eqref{square_cache} with H and n in \eqref{square_cache} with the side length of H, then run the SimRank iteration. The resulting $\hat{S}$ is in fact the projection of S onto the Krylov subspace; S is restored from $\hat{S}$ via \eqref{back}.
\begin{equation}S=V\cdot{\hat{S}}\cdot{V^T}\label{back}\end{equation}
krylov_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np
import arnoldi
from sparse_matrix import SparseMatrix
from topk import TopkHeap

nodes = []  # all nodes, stored in an array
nodesnum = 0  # total number of nodes
nodes_index = {}  # <node name, index of the node in the nodes array>
damp = 0.8  # damping factor
sim_matrix = np.matrix(0)  # node similarity matrix in the Krylov subspace
# the transition matrix is decomposed as orthogonal*hessenberg*orthogonal.transpose()
alpha = 0
orthogonal = []  # nodesnum rows, alpha columns
hessenberg = np.matrix(0)  # alpha rows, alpha columns


def initParam(graphFile):
    '''
    Build nodes, nodes_index, trans_matrix and the 0th-generation sim_matrix.
    Each input line must look like: node\toutneighbor\toutneighbor\t...
    or node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global sim_matrix
    global nodesnum
    global orthogonal
    global hessenberg
    global alpha
    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for outneighbor in arr[1:]:
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = link_in.get(outneighborid, [])
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors
    nodesnum = len(nodes)
    print "node count is %d:" % (nodesnum)
    trans_matrix = SparseMatrix()  # the real transition probability matrix
    for node, inneighbors in link_in.items():
        prob = 1.0 / len(inneighbors)
        for neighbor in inneighbors:
            trans_matrix.set(node, neighbor, prob)
    (orthogonal, hessenberg) = arnoldi.arnoldi_iteration(
        trans_matrix, nodesnum, nodesnum / 100)  # cap H at 1/100 the size of trans_matrix
    alpha = hessenberg.shape[0]
    sim_matrix = np.identity(alpha) * (1 - damp)


def iterate():
    '''
    One square-and-cache iteration in the Krylov subspace: use the cached
    powers of c and H first, then square both for the next round.
    '''
    global hessenberg
    global sim_matrix
    global damp
    sim_matrix = damp * np.dot(np.dot(hessenberg,
                               sim_matrix), hessenberg.transpose()) + sim_matrix
    damp = damp ** 2
    hessenberg = np.dot(hessenberg, hessenberg)


def printResult(sim_node_file):
    '''
    Write out the pairwise node similarities.
    '''
    # globals can be read inside child processes, but modifying them there has no effect
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        heap = TopkHeap(10)  # keep only the 10 most similar nodes
        similarity = restoreSim(orthogonal, sim_matrix, i)
        for j in range(nodesnum):
            if i != j:
                sim = similarity[j]
                if sim > 1E-2:
                    heap.Push((j, sim))
        neighbour = heap.TopK()
        # sort by similarity, descending
        # TODO in production the sort can be skipped; it is only needed for tuning during validation
        neighbour = sorted(neighbour, key=lambda x: x[1], reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def restoreSim(A, B, i):
    '''
    Restore the similarities between node i and all other nodes,
    i.e. row i of S = A*B*A^T.
    '''
    m = A.shape[0]
    n = A.shape[1]
    part = [0] * n
    for q in range(n):
        for p in range(n):
            if q + 1 >= p:  # B is treated as upper Hessenberg: entries below the first subdiagonal are skipped
                part[q] += A[i, p] * B[p, q]
    rect = [0] * m
    for j in range(m):
        for q in range(n):
            rect[j] += A[j, q] * part[q]
    return rect


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print hessenberg


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_krylov"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
topk.py
import heapq
import random


class TopkHeap(object):
    '''
    Keep the k elements with the largest priority. Each pushed element is a
    (value, priority) pair; a min-heap keyed on priority holds the top k.
    '''

    def __init__(self, k):
        self.k = k
        self.data = []

    def Push(self, elem):
        if len(self.data) < self.k:
            heapq.heappush(self.data, (elem[1], elem))
        else:
            topk_small = self.data[0][0]  # smallest priority currently kept
            if elem[1] > topk_small:
                heapq.heapreplace(self.data, (elem[1], elem))

    def TopK(self):
        return [x[1] for x in reversed([heapq.heappop(self.data)
                                        for _ in xrange(len(self.data))])]


if __name__ == "__main__":
    K = 10
    list_rand = random.sample(xrange(1000000), 100)
    th = TopkHeap(K)
    for i in list_rand:
        th.Push(('a', i))
    print th.TopK()
    print sorted(list_rand, reverse=True)[0:K]
The final node similarities show local disagreements with the square-and-cache results, which shows that projecting S into the low-dimensional space loses precision. The restoreSim() function only restores row i of S from H and V, i.e. the similarities between node i and all other nodes. Restoring the whole S via \eqref{back} would again be a product of two large matrices (V in particular is large) and very time-consuming, so in practice restoreSim() should be invoked concurrently from multiple processes.
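For instance, here is a minimal sketch of such a parallel restore with multiprocessing (assumptions of this note, not the author's code: krylov_simrank.py is importable as a module, and the platform forks so that workers inherit the read-only globals; the worker function and pool size are made up):
parallel_restore.py
from multiprocessing import Pool

import krylov_simrank as ks


def restore_row(i):
    # each worker restores one row of S; forked children inherit the
    # read-only globals orthogonal and sim_matrix from the parent
    return (i, ks.restoreSim(ks.orthogonal, ks.sim_matrix, i))


if __name__ == '__main__':
    ks.simrank("../data/linkgraph", 10)
    pool = Pool(processes=4)
    for i, sims in pool.imap_unordered(restore_row, range(ks.nodesnum)):
        print "%s %s" % (ks.nodes[i], max(sims))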
Distributed SimRank
As the preceding sections show, even though the square-and-cache method cuts down the number of SimRank iterations and reducing the dimensionality of Q sidesteps the time and space cost of multiplying huge matrices, the final restoration of S still cannot avoid a large matrix product. The fundamental fix is to run the matrix arithmetic in parallel, e.g. on Hadoop. The following sketches how to implement matrix multiplication with MapReduce.
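A minimal Hadoop Streaming sketch of $C=A\cdot{B}$ (an illustration under assumed inputs, not production code): each input line is a cell triplet, "A\ti\tj\tvalue" for the left matrix or "B\tj\tk\tvalue" for the right one. Job 1 joins $A_{i,j}$ and $B_{j,k}$ on the shared index j and emits partial products keyed by (i,k); job 2 (with an identity mapper) sums them per output cell.
matmul_streaming.py
#!/usr/bin/env python
# coding=utf-8
import sys


def join_mapper():
    # job 1 map: re-key every cell by the shared index j
    for line in sys.stdin:
        name, r, c, v = line.strip("\n").split("\t")
        if name == "A":
            print "\t".join([c, "A", r, v])  # A[i,j] keyed by j
        else:
            print "\t".join([r, "B", c, v])  # B[j,k] keyed by j


def emit_products(a_cells, b_cells):
    for i, av in a_cells:
        for k, bv in b_cells:
            print "%s,%s\t%.10f" % (i, k, av * bv)


def join_reducer():
    # job 1 reduce: all A[*,j] and B[j,*] arrive grouped and sorted by j
    a_cells, b_cells, cur = [], [], None
    for line in sys.stdin:
        j, name, idx, v = line.strip("\n").split("\t")
        if j != cur:
            emit_products(a_cells, b_cells)
            a_cells, b_cells, cur = [], [], j
        (a_cells if name == "A" else b_cells).append((idx, float(v)))
    emit_products(a_cells, b_cells)


def sum_reducer():
    # job 2 reduce: sum the partial products per output cell (i,k)
    cur, total = None, 0.0
    for line in sys.stdin:
        key, v = line.strip("\n").split("\t")
        if key != cur:
            if cur is not None:
                print "%s\t%.10f" % (cur, total)
            cur, total = key, 0.0
        total += float(v)
    if cur is not None:
        print "%s\t%.10f" % (cur, total)


if __name__ == '__main__':
    step = sys.argv[1]  # "join_map", "join_reduce" or "sum_reduce"
    {"join_map": join_mapper, "join_reduce": join_reducer,
     "sum_reduce": sum_reducer}[step]()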
SimRank++
SimRank++ extends plain SimRank in two ways, both visible in the code below: an evidence factor $evidence(a,b)=\sum_{i=1}^{n}\frac{1}{2^i}$ (n being the number of common neighbors of a and b, capped at evidence_size), which rewards pairs that share more neighbors, and edge weights, where each neighbor's contribution is scaled by its normalized edge weight and by a spread factor $e^{-variance}$ computed from the weights incident to that neighbor. The data here is a bipartite graph of users and the positions (job postings) whose pages they viewed.
simrankpp.py
#!/usr/bin/env python
# coding=utf-8

from sym_matrix import SymMatrix
import mathutil
import math

user_index = {}  # user ids, starting from 0
position_index = {}  # position ids, starting from 0
link_out = {}  # adjacency map from user to position
link_in = {}  # adjacency map from position to user
user_common_count = SymMatrix()  # number of positions a pair of users both viewed
position_common_count = SymMatrix()  # number of users who viewed a pair of positions
user_sim_matrix = SymMatrix()  # user-user similarity matrix
position_sim_matrix = SymMatrix()  # position-position similarity matrix
user_damp = 0.8  # damping factor on the user side
position_damp = 0.8  # damping factor on the position side
evidence_size = 20
evidence = []  # precomputed evidence constants
user_weight_sum_variance = {}  # sum and variance of the edge weights on each user node
position_weight_sum_variance = {}  # sum and variance of the edge weights on each position node


def initEvidence():
    '''
    Precompute the evidence constants: evidence[n-1] = sum_{i=1..n} 2^-i.
    '''
    global evidence_size
    global evidence

    for i in range(evidence_size):
        evidence.append(0)
    evidence[0] = 0.5
    for i in range(1, evidence_size):
        evidence[i] = evidence[i - 1] + 1.0 / math.pow(2, i + 1)


def indexUserAndPosition(viewfile):
    '''
    Read the file recording which users viewed which position pages,
    and assign ids to users and positions.
    '''
    global user_index
    global position_index
    user_count = 0
    position_count = 0
    for line in open(viewfile, "r"):
        arr = line.strip("\n").split()
        user = arr[0]
        if user not in user_index:
            user_index[user] = user_count
            user_count += 1
        for ele in arr[1:]:
            brr = ele.split(":")
            if len(brr) == 2:
                position = brr[0]
                if position not in position_index:
                    position_index[position] = position_count
                    position_count += 1


def readLink(viewfile):
    '''
    Read the view records again and build the out-link and in-link
    adjacency maps, keeping the view counts as edge weights.
    '''
    global link_out
    global link_in
    global position_index
    global user_index

    for line in open(viewfile, "r"):
        arr = line.strip("\n").split()
        user = arr[0]
        userid = user_index[user]
        position_count_collection = {}
        for ele in arr[1:]:
            brr = ele.split(":")
            if len(brr) == 2:
                position = brr[0]
                positionid = position_index[position]
                count = int(brr[1])
                position_count_collection[positionid] = count
                user_count_collection = {}
                if positionid in link_in:
                    user_count_collection = link_in[positionid]
                user_count_collection[userid] = count
                link_in[positionid] = user_count_collection
        link_out[userid] = position_count_collection


def initWeightSumAndVariance():
    '''
    Fill user_weight_sum_variance and position_weight_sum_variance.
    '''
    global link_out
    global link_in

    for k, v in link_out.items():
        ps = []
        for p, c in v.items():
            ps.append(c)
        user_weight_sum_variance[k] = mathutil.getSumAndVariance(ps)
    for k, v in link_in.items():
        us = []
        for u, c in v.items():
            us.append(c)
        position_weight_sum_variance[k] = mathutil.getSumAndVariance(us)


def initSimMatrix():
    '''
    Initialize user_sim_matrix and position_sim_matrix.
    '''
    global link_out
    global link_in
    global user_sim_matrix
    global position_sim_matrix
    global user_common_count
    global position_common_count

    user_sim_matrix = SymMatrix()
    position_sim_matrix = SymMatrix()
    user_common_count = SymMatrix()
    position_common_count = SymMatrix()

    # pairwise similarity between users
    for k1, v1 in link_out.items():  # user-to-position links
        ps1 = []
        for p, c in v1.items():
            ps1.append(p)
        for k2, v2 in link_out.items():
            if k1 < k2:
                ps2 = []
                for p, c in v2.items():
                    ps2.append(p)
                common = mathutil.findCommonEle(
                    sorted(ps1), sorted(ps2))  # positions linked by both users
                commonLen = len(common)
                if commonLen > 0:
                    user_common_count.set(k1, k2, commonLen)
                    # two users linking to at least one common position start
                    # with similarity user_damp
                    user_sim_matrix.set(k1, k2, user_damp)
        user_sim_matrix.set(k1, k1, 1)  # every node has similarity 1 with itself

    # pairwise similarity between positions
    for k1, v1 in link_in.items():
        usr1 = []
        for u, c in v1.items():
            usr1.append(u)
        for k2, v2 in link_in.items():
            if k1 < k2:
                usr2 = []
                for u, c in v2.items():
                    usr2.append(u)
                common = mathutil.findCommonEle(sorted(usr1), sorted(usr2))
                commonLen = len(common)
                if commonLen > 0:
                    position_common_count.set(k1, k2, commonLen)
                    position_sim_matrix.set(k1, k2, position_damp)
        position_sim_matrix.set(k1, k1, 1)


def updateSim():
    '''
    One iteration of the SimRank++ similarity update.
    '''
    global link_out
    global link_in
    global user_common_count
    global user_sim_matrix
    global position_common_count
    global position_sim_matrix
    global user_damp
    global position_damp
    global evidence_size
    global evidence

    # pairwise similarity between users
    for k1, v1 in link_out.items():  # user-to-position links
        ps1 = []  # out-neighbours of the first user
        for p, c in v1.items():
            ps1.append(p)
        for k2, v2 in link_out.items():
            commCount = user_common_count.get(k1, k2)  # positions linked by both users
            if commCount is None:
                commCount = 0
            # pruning strategy for large data sets: node pairs without a common
            # in-neighbour (or out-neighbour) keep similarity 0 and are never updated
            if k1 < k2 and commCount > 0:
                ps2 = []  # out-neighbours of the second user
                for p, c in v2.items():
                    ps2.append(p)
                if commCount > evidence_size:
                    commCount = evidence_size
                edv = evidence[commCount - 1]
                sum_part = 0.0
                for p1 in ps1:
                    for p2 in ps2:
                        sim_p1_p2 = position_sim_matrix.get(p1, p2)
                        if sim_p1_p2 is None:
                            continue
                        spread_p1 = math.pow(
                            math.e, -position_weight_sum_variance[p1][1])
                        weight_u1_p1 = link_out[k1][p1]
                        weight_u1_sum = user_weight_sum_variance[k1][0]
                        normalized_weight_u1_p1 = 1.0 * weight_u1_p1 / weight_u1_sum
                        spread_p2 = math.pow(
                            math.e, -position_weight_sum_variance[p2][1])
                        weight_u2_p2 = link_out[k2][p2]
                        weight_u2_sum = user_weight_sum_variance[k2][0]
                        normalized_weight_u2_p2 = 1.0 * weight_u2_p2 / weight_u2_sum
                        sum_part += spread_p1 * normalized_weight_u1_p1 * \
                            spread_p2 * normalized_weight_u2_p2 * sim_p1_p2
                new_sim = edv * user_damp * sum_part
                user_sim_matrix.set(k1, k2, new_sim)

    # pairwise similarity between positions
    for k1, v1 in link_in.items():  # position-to-user links
        us1 = []  # in-neighbours of the first position
        for u, c in v1.items():
            us1.append(u)
        for k2, v2 in link_in.items():
            commCount = position_common_count.get(
                k1, k2)  # users linked by both positions
            if commCount is None:
                commCount = 0
            if k1 < k2 and commCount > 0:
                us2 = []  # in-neighbours of the second position
                for u, c in v2.items():
                    us2.append(u)
                if commCount > evidence_size:
                    commCount = evidence_size
                edv = evidence[commCount - 1]
                sum_part = 0.0
                for u1 in us1:
                    for u2 in us2:
                        sim_u1_u2 = user_sim_matrix.get(u1, u2)
                        if sim_u1_u2 is None:
                            continue
                        spread_u1 = math.pow(
                            math.e, -user_weight_sum_variance[u1][1])
                        weight_p1_u1 = link_in[k1][u1]
                        weight_p1_sum = position_weight_sum_variance[k1][0]
                        normalized_weight_p1_u1 = 1.0 * weight_p1_u1 / weight_p1_sum
                        spread_u2 = math.pow(
                            math.e, -user_weight_sum_variance[u2][1])
                        weight_p2_u2 = link_in[k2][u2]
                        weight_p2_sum = position_weight_sum_variance[k2][0]
                        normalized_weight_p2_u2 = 1.0 * weight_p2_u2 / weight_p2_sum
                        sum_part += spread_u1 * normalized_weight_p1_u1 * \
                            spread_u2 * normalized_weight_p2_u2 * sim_u1_u2
                new_sim = edv * position_damp * sum_part
                position_sim_matrix.set(k1, k2, new_sim)


def simrank(linkfile, iteration):
    initEvidence()
    indexUserAndPosition(linkfile)
    readLink(linkfile)
    initWeightSumAndVariance()
    print "iteration 0:"
    initSimMatrix()
    for i in range(iteration):
        print "iteration %d:" % (i + 1)
        updateSim()


def printResult(sim_user_file, sim_position_file):
    '''
    Write out the similarity results.
    '''
    global position_sim_matrix
    global user_sim_matrix
    global link_out
    global link_in
    usernum = len(link_out)  # number of user nodes
    positionnum = len(link_in)  # number of position nodes

    # user-user similarities
    f_out_user = open(sim_user_file, "w")
    for i in range(usernum):
        f_out_user.write(str(i) + "\t")
        neighbour = []
        for j in range(usernum):
            if i != j:
                sim = user_sim_matrix.get(i, j)
                if sim is None:
                    sim = 0
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, descending
        neighbour = sorted(neighbour, key=lambda x: x[1], reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(str(u) + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()

    # position-position similarities
    f_out_position = open(sim_position_file, "w")
    for i in range(positionnum):
        f_out_position.write(str(i) + "\t")
        neighbour = []
        for j in range(positionnum):
            if i != j:
                sim = position_sim_matrix.get(i, j)
                if sim is None:
                    sim = 0
                if sim > 0:
                    neighbour.append((j, sim))
        neighbour = sorted(neighbour, key=lambda x: x[1], reverse=True)
        for (p, sim) in neighbour:
            f_out_position.write(str(p) + ":" + str(sim) + "\t")
        f_out_position.write("\n")
    f_out_position.close()


if __name__ == '__main__':
    linkfile = "../data/delivery"
    sim_user_file = "../data/sim_user.txt"
    sim_position_file = "../data/sim_position.txt"
    simrank(linkfile, 5)
    printResult(sim_user_file, sim_position_file)
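For reference, the input format that readLink() parses is one user per line followed by position:view-count pairs; a hypothetical ../data/delivery sample (made-up ids and counts) might look like:
delivery
u1	p1:3	p2:1
u2	p1:2	p3:5
u3	p2:1	p3:1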
mathutil.py
#!/usr/bin/env python
# coding=utf-8

def findCommonEle(list1, list2):
    '''
    Return the common elements of two sequences. The elements of list1 and
    list2 must be sorted in ascending order.
    '''
    rect = []
    length1 = len(list1)
    length2 = len(list2)
    idx1 = 0
    idx2 = 0
    while idx1 < length1 and idx2 < length2:
        ele1 = list1[idx1]
        ele2 = list2[idx2]
        if ele1 == ele2:
            rect.append(ele1)
            idx1 += 1
            idx2 += 1
        elif ele1 < ele2:
            idx1 += 1
        else:
            idx2 += 1
    return rect


def getSumAndVariance(li):
    '''
    Compute the sum and variance of a sequence.
    '''
    assert len(li) > 0
    total = 0.0
    sumSquare = 0.0
    for ele in li:
        total += ele
        sumSquare += ele * ele
    mean = total / len(li)
    variance = sumSquare / len(li) - mean * mean
    return (total, variance)
SimRank based on random walks
The expected meeting distance model says: in a graph G, the similarity between vertices a and b is determined by the lengths of the paths that random walks starting from a and b traverse before they meet, and by how often they can meet. Under this model SimRank is computed as in \eqref{rw_sim} and \eqref{rw_itr}.
\begin{equation}s(a,b)=\sum_{t(a,b)\to(x,x)}{P[t]C^{l(t)}}\label{rw_sim}\end{equation}
\begin{equation}s^{k+1}(a,b)=\sum_{t(a,b)\to(x,x);l(t)\le{k+1}}{P[t]C^{l(t)}}\label{rw_itr}\end{equation}
$t(a,b)\to(x,x)$ denotes a pair of random-walk paths starting from a and b that first meet at the vertex x; both paths have the same length $l(t)$. Suppose the two paths are $t_1=(v_1,v_2,\cdots,v_m,x)$ and $t_2=(w_1,w_2,\cdots,w_m,x)$. The probability of walking $t_1$ is $P[t_1]=\prod_{i=1}^m{\frac{1}{\mid{O(v_i)}\mid}}$, the probability of walking $t_2$ is $P[t_2]=\prod_{i=1}^m{\frac{1}{\mid{O(w_i)}\mid}}$, and the probability that a and b meet at x via $t_1$ and $t_2$ is $P[t]=P[t_1]P[t_2]=\prod_{i=1}^m{\frac{1}{\mid{O(v_i)}\mid\mid{O(w_i)}\mid}}$.
Computing $s^{k+1}$ via random-walk paths requires the following steps (a sampling-based sketch follows the list):
- Find all paths in the graph of length at most $k+1$, recording each path's probability.
- For the paths that share the same endpoint and the same length, compute the partial similarity of their start points according to \eqref{rw_sim}.
- Add up the partial similarities obtained in step 2.
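These steps enumerate paths exhaustively. As a complementary illustration (a sketch added to this note, with the graph dictionary below made up from Figure 1), the truncated sum \eqref{rw_itr} can also be estimated by sampling pairs of walks and averaging $C^{l}$ over first meetings:
mc_simrank.py
#!/usr/bin/env python
# coding=utf-8
import random


def estimate_simrank(out_links, a, b, C=0.8, max_len=10, samples=100000):
    '''
    Monte Carlo estimate of \eqref{rw_itr}: sample pairs of walks along
    out-edges from a and b; a pair that first meets after l steps occurs
    with probability P[t] and contributes C**l.
    '''
    if a == b:
        return 1.0
    total = 0.0
    for _ in range(samples):
        u, v = a, b
        for l in range(1, max_len + 1):
            if not out_links.get(u) or not out_links.get(v):
                break  # one walk is stuck, so this tour never meets
            u = random.choice(out_links[u])
            v = random.choice(out_links[v])
            if u == v:
                total += C ** l  # first meeting after l steps
                break
    return total / samples


if __name__ == '__main__':
    # the bipartite graph of Figure 1 with every edge walkable in both directions
    graph = {"A": ["a", "b"], "B": ["b", "c"],
             "a": ["A"], "b": ["A", "B"], "c": ["B"]}
    print(estimate_simrank(graph, "A", "B"))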
