A Simple Introduction to SimRank
Figure 1. A bipartite graph
A bipartite graph is a graph whose nodes can be divided into two subsets such that every edge connects a node from one subset to a node from the other. We write I(v) and O(v) for the in-neighbors and out-neighbors of node v. In the bipartite graph above, think of A and B as two people and a, b, c as three products, with a directed edge meaning the person bought the product. The basic idea of SimRank is: if two entities are similar, the entities related to them should be similar too. In the figure above, for example, if a and c are similar, then A and B should also be similar, because A is related to a and B is related to c.
The basic SimRank equation:
\begin{equation}s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\label{basic}\end{equation}
s(a,b) is the similarity between nodes a and b, with s(a,b)=1 when a=b. $I_i(a)$ denotes the i-th in-neighbor of a. When $I(a)=\emptyset$ or $I(b)=\emptyset$, \eqref{basic} is defined to be 0. In one sentence, \eqref{basic} says: the similarity between a and b is the average similarity between a's in-neighbors and b's in-neighbors. The parameter C is a damping factor, and its role can be understood as follows: if $I(a)=I(b)=\{A\}$, then \eqref{basic} yields $s(a,b)=C\cdot{s(A,A)}=C$; a and b are merely related to A rather than being A itself, so their similarity should be discounted, hence $C\in(0,1)$.
Applying \eqref{basic} to the bipartite graph of Figure 1 gives:
\begin{equation}s(A,B)=\frac{C_1}{|O(A)||O(B)|}\sum_{i=1}^{|O(A)|}\sum_{j=1}^{|O(B)|}s(O_i(A),O_j(B))\ \ \ \ for\ A\ne{B}\label{out}\end{equation}
\begin{equation}s(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\ \ \ \ for\ a\ne{b}\label{in}\end{equation}
Ignoring $C_1$ and $C_2$, \eqref{out} says that the similarity between buyers A and B is the average similarity between the items they bought, while \eqref{in} says that the similarity between items a and b is the average similarity between the buyers who bought them.
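As a quick sanity check, apply \eqref{out} once to Figure 1, whose edge set (encoded later in the linkgraph_bipartite file) is A→{a,b} and B→{b,c}. Starting from the base case in which only s(x,x)=1 is non-zero:

$s(A,B)=\frac{C_1}{2\cdot{2}}\left[s(a,b)+s(a,c)+s(b,b)+s(b,c)\right]=\frac{C_1}{4}$

Only the s(b,b)=1 term survives; the more items two buyers have in common, the more such terms survive and the more similar the buyers come out.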
In a non-bipartite graph, a node may have both in-neighbors and out-neighbors. In a paper-citation setting, for instance, a paper can both cite other papers and be cited by them. From the citing perspective, the similarity between two papers is
\begin{equation}s_1(a,b)=\frac{C_1}{|O(a)||O(b)|}\sum_{i=1}^{|O(a)|}\sum_{j=1}^{|O(b)|}s_2(O_i(a),O_j(b))\label{hout}\end{equation}
From the cited-by perspective, the similarity between two papers is
\begin{equation}s_2(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s_1(I_i(a),I_j(b))\label{hin}\end{equation}
Depending on the application, the final s(a,b) can be \eqref{hout} alone, \eqref{hin} alone, or a combination of the two.
Naive SimRank
The SimRank iteration:
$R_0(a,b)=\left\{\begin{matrix}0\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$
$R_{k+1}(a,b)=\left\{\begin{matrix}\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}R_k(I_i(a),I_j(b))\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$
$R_k(*,*)$ is monotonically non-decreasing in k, and $\lim_{k\to\infty}R_k(a,b)=s(a,b)$. In practice $R_k(*,*)$ converges quickly, so k does not need to be large.
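For concreteness, here is a minimal per-pair sketch of this iteration; representing the graph as a dict of in-neighbor lists is an assumption made for illustration, not the file format used later.

#!/usr/bin/env python
# coding=utf-8
# Per-pair sketch of the R_k iteration; in_nbrs maps each node to its
# list of in-neighbours.
C = 0.8

def simrank_naive(in_nbrs, k_max):
    nodes = list(in_nbrs.keys())
    # R_0: 1 on the diagonal, 0 elsewhere
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(k_max):
        R_next = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    R_next[(a, b)] = 1.0
                    continue
                Ia, Ib = in_nbrs[a], in_nbrs[b]
                if not Ia or not Ib:
                    R_next[(a, b)] = 0.0
                    continue
                total = sum(R[(x, y)] for x in Ia for y in Ib)
                R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R

if __name__ == '__main__':
    # the bipartite graph of Figure 1, as in-neighbour lists
    g = {'A': ['a', 'b'], 'B': ['b', 'c'], 'a': ['A'], 'b': ['A', 'B'], 'c': ['B']}
    print simrank_naive(g, 10)[('A', 'B')]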
The matrix form is given below. The element-wise iteration above is hard to parallelize, and once the graph grows even moderately large (say, beyond a hundred thousand nodes) its time and space costs on a single machine become prohibitive.
\begin{equation}\left\{\begin{matrix}S^{(0)}=(1-c)\cdot{I_n}\\S^{(k+1)}=c\cdot{Q^T}\cdot{S^{(k)}}\cdot{Q}+(1-c)\cdot{I_n}\end{matrix}\right.\label{juzhen}\end{equation}
S is the similarity matrix. Q is the transition probability matrix, each of whose columns sums to 1: if node j can be reached from node i in one step, and there are n such nodes i in total, then $Q_{i,j}=\frac{1}{n}$.
An upper bound on the iteration error:
\begin{equation}\|S^{(k)}-S\|_{max}\le{c}^{k+1}\ \ \ \ (\forall{k}=0,1,2\ldots)\label{err_ceil}\end{equation}
where the max norm of a matrix is defined as:
$\|X_{p\times{q}}\|_{max}\stackrel{def}{=}\max_{i=1}^{p}\max_{j=1}^{q}\mid{x}_{i,j}\mid$
Now for a small exercise. Figure 2 is a web page link graph: a university home page links to the personal pages of professors A and B, the pages of professor B and student B link to each other, and so on. We want to compute the pairwise similarities of these 5 nodes from the link structure.
Figure 2. Web page link graph
First, store the directed graph above in a text file.
linkgraph
univ    profA   profB
profA   studentA
studentA    univ
profB   studentB
studentB    profB
The first column of each line is a node and the remaining columns are its out-neighbors; that is, a walk on this graph may only follow the arrows. The bipartite case is different: \eqref{out} walks along the arrows of the bipartite graph while \eqref{in} walks against them, so every edge may be traversed in both directions. The bipartite graph of Figure 1 can therefore be written as:
linkgraph_bipartite
A   a   b
B   b   c
a   A
b   A   B
c   B
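Back to the Figure 2 exercise: ordering the nodes as (univ, profA, profB, studentA, studentB), the transition matrix Q of \eqref{juzhen} that initParam() below builds from linkgraph is

$Q=\begin{bmatrix}0 & 1 & \frac{1}{2} & 0 & 0\\0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 1\\1 & 0 & 0 & 0 & 0\\0 & 0 & \frac{1}{2} & 0 & 0\end{bmatrix}$

Each column j distributes probability $\frac{1}{|I(j)|}$ over the in-neighbors of node j and sums to 1; for example, the profB column splits evenly between univ and studentB.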
naive_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np

nodes = []                   # all nodes, stored in an array
nodesnum = 0                 # total number of nodes
nodes_index = {}             # <node name, index of the node in the nodes array>
damp = 0.8                   # damping factor
trans_matrix = np.matrix(0)  # transition probability matrix
sim_matrix = np.matrix(0)    # node similarity matrix


def initParam(graphFile):
    '''Build nodes, nodes_index, trans_matrix and the 0th-generation sim_matrix.
    Each input line must look like: node\toutneighbor\toutneighbor\t... or
    node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum

    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        nodeid = -1
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for ele in arr[1:]:
            outneighbor = ele
            outneighborid = -1
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = []
            if outneighborid in link_in:
                inneighbors = link_in[outneighborid]
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors
    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    # column j of Q holds 1/|I(j)| for every in-neighbor of node j
    for node, inneighbors in link_in.items():
        num = len(inneighbors)
        prob = 1.0 / num
        for neighbor in inneighbors:
            trans_matrix[neighbor, node] = prob
    sim_matrix = np.identity(nodesnum) * (1 - damp)


def iterate():
    '''One iteration of the matrix update S <- c*Q^T*S*Q + (1-c)*I.'''
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum
    sim_matrix = damp * np.dot(
        np.dot(trans_matrix.transpose(), sim_matrix),
        trans_matrix) + (1 - damp) * np.identity(nodesnum)


def printResult(sim_node_file):
    '''Write out the computed similarities.'''
    global sim_matrix
    global nodes
    global nodesnum
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, descending
        neighbour = sorted(neighbour, cmp=lambda x, y: cmp(x[1], y[1]), reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    print "nodes:"
    print nodes_index
    print "trans ratio:"
    print trans_matrix
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print sim_matrix


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_naive"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
The final pairwise similarities of the 5 nodes:
nodesim_naive
univ    profB:0.10803511296 studentB:0.02203058176
profA   profB:0.36478881792 studentB:0.08159625216
profB   profA:0.36478881792 univ:0.10803511296  studentB:0.0642220032   studentA:0.03022258176
studentA    studentB:0.28216737792  profB:0.03022258176
studentB    studentA:0.28216737792  profA:0.08159625216 profB:0.0642220032  univ:0.02203058176
The Square Caching Method
\eqref{err_ceil} already shows that SimRank converges very quickly; below is a method that accelerates convergence even further: the square caching method.
\begin{equation}\left\{\begin{matrix}S_{\left \langle 2 \right \rangle }^{(0)}=(1-c)\cdot{I_n}\\S_{\left \langle 2 \right \rangle }^{(k+1)}=S_{\left \langle 2 \right \rangle }^{(k)}+c^{2^k}\cdot{(Q^{2^k})^T}\cdot{S_{\left \langle 2 \right \rangle }^{(k)}}\cdot{Q^{2^k}}\end{matrix}\right.\label{square_cache}\end{equation}
The square caching method converges dramatically faster:
\begin{equation}\|S_{\left \langle 2 \right \rangle }^{(k)}-S\|_{max}\le{c}^{2^k}\ \ \ \ (\forall{k}=0,1,2\ldots)\label{square_err_ceil}\end{equation}
Note that $Q^{2^k}=(Q^{2^{k-1}})^2$, so each iteration does not have to compute $Q^{2^k}$ from scratch: squaring the result cached from the previous iteration suffices. This caching of squares is the entire reason \eqref{square_cache} is faster than \eqref{juzhen}. In fact:
$S_{\left \langle 2 \right \rangle }^{(k)}=S^{(2^k-1)}$
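To see why, unroll both recursions. \eqref{juzhen} gives $S^{(k)}=(1-c)\sum_{i=0}^{k}c^{i}(Q^{i})^{T}Q^{i}$. If $S_{\left \langle 2 \right \rangle }^{(k)}=(1-c)\sum_{i=0}^{2^k-1}c^{i}(Q^{i})^{T}Q^{i}$, then the update term $c^{2^k}(Q^{2^k})^{T}S_{\left \langle 2 \right \rangle }^{(k)}Q^{2^k}$ in \eqref{square_cache} contributes exactly the missing terms $i=2^k,\ldots,2^{k+1}-1$ of the same series. Each iteration therefore doubles the number of accumulated terms, which is also where the $c^{2^k}$ bound in \eqref{square_err_ceil} comes from.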
Let us recompute the node similarities of Figure 2 with the square caching method.
square_cache_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np

nodes = []                   # all nodes, stored in an array
nodesnum = 0                 # total number of nodes
nodes_index = {}             # <node name, index of the node in the nodes array>
damp = 0.8                   # damping factor
trans_matrix = np.matrix(0)  # transition probability matrix
sim_matrix = np.matrix(0)    # node similarity matrix


def initParam(graphFile):
    '''Build nodes, nodes_index, trans_matrix and the 0th-generation sim_matrix.
    Each input line must look like: node\toutneighbor\toutneighbor\t... or
    node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum

    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        nodeid = -1
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for ele in arr[1:]:
            outneighbor = ele
            outneighborid = -1
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = []
            if outneighborid in link_in:
                inneighbors = link_in[outneighborid]
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors
    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    # unlike naive_simrank.py, rows (not columns) sum to 1 here,
    # i.e. trans_matrix stores Q^T
    for node, inneighbors in link_in.items():
        num = len(inneighbors)
        prob = 1.0 / num
        for neighbor in inneighbors:
            trans_matrix[node, neighbor] = prob
    sim_matrix = np.identity(nodesnum) * (1 - damp)


def iterate():
    '''One square-caching iteration of the similarity matrix.'''
    global trans_matrix
    global sim_matrix
    global damp
    # S <- S + c^{2^k} * (Q^{2^k})^T * S * Q^{2^k}
    # (trans_matrix currently holds (Q^T)^{2^k}, damp holds c^{2^k})
    sim_matrix = damp * np.dot(np.dot(trans_matrix, sim_matrix),
                               trans_matrix.transpose()) + sim_matrix
    # cache the squares for the next iteration
    damp = damp ** 2
    trans_matrix = np.dot(trans_matrix, trans_matrix)


def printResult(sim_node_file):
    '''Write out the computed similarities.'''
    global sim_matrix
    global nodes
    global nodesnum
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, descending
        neighbour = sorted(neighbour, cmp=lambda x, y: cmp(x[1], y[1]), reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    print "nodes:"
    print nodes_index
    print "trans ratio:"
    print trans_matrix
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print sim_matrix


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_square"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
The node similarities computed this way are not numerically identical to the naive method's, but the similarity ranking agrees with the naive method completely.
Dimensionality Reduction with Arnoldi Iteration
With \eqref{square_cache}, every iteration multiplies matrices, and for even moderately large matrices the time and space costs are enormous: multiplying two $n\times{n}$ matrices takes $O(n^3)$ time, and even Strassen's divide-and-conquer algorithm needs $O(n^{2.81})$, besides assuming square matrices whose side length is a power of 2. If we can reduce the dimensionality of Q and S in \eqref{square_cache}, we save a great deal of computation.
Arnoldi iteration is a way to reduce the dimensionality of a high-dimensional sparse matrix (here, the transition probability matrix Q above). It decomposes Q, approximately, as
\begin{equation}Q=V\cdot{H}\cdot{V^T}\label{decompose}\end{equation}
where the columns of V are orthogonal with unit norm: $V=[v_0\mid{v_1}\mid\ldots\mid{v}_{r-1}]$, with $v_i$ the i-th column of V.
H is an $r\times{r}$ upper Hessenberg matrix (upper triangular plus one subdiagonal), where r is the rank of Q, of the form:
$H_{r\times{r}}=\begin{bmatrix}h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,r}\\h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,r}\\0 & h_{3,2} & h_{3,3} & \cdots & h_{3,r}\\\vdots & \ddots & \ddots & \ddots & \vdots\\0 & \cdots & 0 & h_{r,r-1} & h_{r,r}\end{bmatrix}$
V and H are computed as follows:
step 1. $v_0=[1,0,0,\ldots]^T$
step 2. for $k \in [1,\alpha]$:
    $v_k=Q\cdot{v_{k-1}}$
    for $j \in [0,k)$:
        $H[j][k-1]=v_j^T\cdot{v_k}$
        $v_k=v_k-H[j][k-1]\cdot{v_j}$
    $norm2=\|v_k\|_2$
    if $norm2=0$: break
    $H[k][k-1]=norm2$
    $v_k=\frac{v_k}{norm2}$
step 3. drop the last column of V and the last row of H
In step 2, if the outer for loop exits because norm2=0, the resulting H has side length equal to the rank r of Q; if it exits because the user-imposed cap $\alpha$ on the loop count was reached, H has side length $\alpha$.
arnoldi.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np
from sparse_matrix import SparseMatrix


def arnoldi_iteration(Q, n, rank):
    '''Decompose Q such that QV = VH (approximately).
    Q is the input, a SparseMatrix with n rows, n columns and rank r.
    V and H are the outputs.
    V has n rows and r columns; its columns are orthonormal, so V^T.V = I.
    H is an r-by-r upper Hessenberg matrix.
    rank caps the number of iterations, r <= rank (rank <= 0 means no cap).
    '''
    if rank > n or rank <= 0:
        rank = n
    V = np.zeros((n, 1))
    V[0, 0] = 1
    h_col_list = []
    k = 1
    while k <= rank:
        h_col = []
        v_k = Q.mulMatrixCol(V, k - 1)
        for j in range(k):
            # modified Gram-Schmidt: project out each earlier basis vector
            product = np.dot(np.matrix(V[:, j]).reshape(n, 1).transpose(), v_k)[0, 0]
            h_col.append(product)
            v_k = v_k - product * (np.matrix(V[:, j]).reshape(n, 1))
        norm2 = np.linalg.norm(v_k, ord=2)
        if norm2 == 0:
            print "norm2=0, will break"
            break
        h_col.append(norm2)
        h_col_list.append(h_col)
        v_k = v_k / norm2
        V = np.hstack((V, np.matrix(v_k)))
        k += 1
    r = len(h_col_list)
    print "r=", r
    H = np.zeros((r, r))
    for i in range(r):
        h_col = h_col_list[i]
        for j in range(len(h_col)):
            if j < r:
                H[j, i] = h_col[j]
    V = V[:, :r]
    return (V, H)


if __name__ == '__main__':
    Q = SparseMatrix()
    Q.set(0, 1, 0.5)
    Q.set(0, 2, 1)
    Q.set(1, 0, 0.5)
    Q.set(1, 3, 1)
    Q.set(2, 0, 0.5)
    Q.set(3, 1, 0.5)
    (V, H) = arnoldi_iteration(Q, 4, -1)
    print "V="
    print V
    print "H="
    print H
    print "VHV^T="
    print np.dot(np.dot(V, H), V.transpose())
    print "QV="
    print Q.mulMatrix(V)
    print "VH="
    print np.dot(V, H)
sparse_matrix.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np


class SparseMatrix():
    '''A highly sparse matrix backed by a dict: the key is the 2-D coordinate
    of an element, the value is the element itself.'''

    def __init__(self):
        self.arr = {}

    def get(self, row, col):
        key = (row, col)
        if key in self.arr:
            return self.arr[key]
        else:
            return 0

    def set(self, row, col, value):
        key = (row, col)
        self.arr[key] = value

    def mul(self, vec):
        '''Multiply by a 1-D vector; returns a list.'''
        length = len(vec)
        rect = [0] * length
        for k, v in self.arr.items():
            i = k[0]
            j = k[1]
            rect[i] += v * vec[j]
        return rect

    def mulMatrixCol(self, matrix, col):
        '''Multiply by column col of a matrix; returns an n*1 matrix.'''
        length = matrix.shape[0]
        rect = np.zeros((length, 1))
        for k, v in self.arr.items():
            i = k[0]
            j = k[1]
            rect[i, 0] += v * matrix[j, col]
        return rect

    def mulMatrix(self, matrix):
        '''Multiply by a matrix.'''
        col_num = matrix.shape[1]
        rect = self.mulMatrixCol(matrix, 0)
        for i in range(1, col_num):
            rect = np.hstack((rect, self.mulMatrixCol(matrix, i)))
        return rect

    def transmul(self, vec):
        '''Multiply the transpose of this matrix by a 1-D vector.'''
        length = len(vec)
        rect = [0] * length
        for k, v in self.arr.items():
            # element (i, j) of the transpose is element (j, i) of this matrix
            i = k[1]
            j = k[0]
            rect[i] += v * vec[j]
        return rect
Substitute H for Q in \eqref{square_cache} and the side length of H for n (in $I_n$), then run the SimRank iteration. The resulting $\hat{S}$ is the projection of S onto the Krylov subspace; restore S from $\hat{S}$ with \eqref{back}.
\begin{equation}S=V\cdot{\hat{S}}\cdot{V^T}\label{back}\end{equation}
krylov_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np
import arnoldi
from sparse_matrix import SparseMatrix
from topk import TopkHeap

nodes = []                 # all nodes, stored in an array
nodesnum = 0               # total number of nodes
nodes_index = {}           # <node name, index of the node in the nodes array>
damp = 0.8                 # damping factor
sim_matrix = np.matrix(0)  # node similarity matrix in the Krylov subspace
# the transition matrix is decomposed as orthogonal*hessenberg*orthogonal.transpose()
alpha = 0
orthogonal = []               # nodesnum rows, alpha columns
hessenberg = np.matrix(0)     # alpha rows, alpha columns


def initParam(graphFile):
    '''Build nodes, nodes_index, trans_matrix and the 0th-generation sim_matrix.
    Each input line must look like: node\toutneighbor\toutneighbor\t... or
    node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global sim_matrix
    global damp
    global nodesnum
    global orthogonal
    global hessenberg
    global alpha

    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        nodeid = -1
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for ele in arr[1:]:
            outneighbor = ele
            outneighborid = -1
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = []
            if outneighborid in link_in:
                inneighbors = link_in[outneighborid]
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors
    nodesnum = len(nodes)
    print "node count is %d:" % (nodesnum)
    trans_matrix = SparseMatrix()  # the real transition probability matrix (stores Q^T)
    for node, inneighbors in link_in.items():
        num = len(inneighbors)
        prob = 1.0 / num
        for neighbor in inneighbors:
            trans_matrix.set(node, neighbor, prob)
    # cap H at 1/100 the size of trans_matrix; for graphs with fewer than
    # 100 nodes the cap is 0, which arnoldi_iteration treats as "no cap"
    (orthogonal, hessenberg) = arnoldi.arnoldi_iteration(
        trans_matrix, nodesnum, nodesnum / 100)
    alpha = hessenberg.shape[0]
    sim_matrix = np.identity(alpha) * (1 - damp)


def iterate():
    '''One square-caching iteration in the Krylov subspace.'''
    global hessenberg
    global sim_matrix
    global damp
    # S_hat <- S_hat + c^{2^k} * H^{2^k} * S_hat * (H^{2^k})^T
    sim_matrix = damp * np.dot(np.dot(hessenberg, sim_matrix),
                               hessenberg.transpose()) + sim_matrix
    # cache the squares for the next iteration
    damp = damp ** 2
    hessenberg = np.dot(hessenberg, hessenberg)


def printResult(sim_node_file):
    '''Write out the computed similarities.
    (globals can be read from child processes, but writes made there would
    not propagate back to the parent)
    '''
    global sim_matrix
    global orthogonal
    global nodes
    global nodesnum
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        heap = TopkHeap(10)  # keep only the 10 most similar nodes
        similarity = restoreSim(orthogonal, sim_matrix, i)
        for j in range(nodesnum):
            if i != j:
                sim = similarity[j]
                if sim > 1E-2:
                    heap.Push((sim, j))  # keyed by similarity
        neighbour = heap.TopK()
        # TODO in production the sort can be skipped; it is only needed while
        # validating the results and tuning parameters
        neighbour = sorted(neighbour, cmp=lambda x, y: cmp(x[0], y[0]), reverse=True)
        for (sim, u) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def restoreSim(A, B, i):
    '''Recover the similarities between node i and all other nodes,
    i.e. row i of S = A * B * A^T. B (the projected similarity matrix)
    is symmetric and in general dense.'''
    m = A.shape[0]
    n = A.shape[1]
    # part = (row i of A) * B
    part = [0] * n
    for q in range(n):
        for p in range(n):
            part[q] += A[i, p] * B[p, q]
    # rect = part * A^T
    rect = [0] * m
    for j in range(m):
        for q in range(n):
            rect[j] += A[j, q] * part[q]
    return rect


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print hessenberg


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_krylov"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
topk.py
import heapq
import random


class TopkHeap(object):
    '''Keep the k largest (key, payload) tuples, ordered by key
    (the first element of each tuple).'''

    def __init__(self, k):
        self.k = k
        self.data = []

    def Push(self, elem):
        if len(self.data) < self.k:
            heapq.heappush(self.data, elem)
        else:
            topk_small = self.data[0][0]
            if elem[0] > topk_small:
                heapq.heapreplace(self.data, elem)

    def TopK(self):
        return [x for x in reversed([heapq.heappop(self.data)
                                     for x in xrange(len(self.data))])]


if __name__ == "__main__":
    K = 10
    list_rand = random.sample(xrange(1000000), 100)
    th = TopkHeap(K)
    for i in list_rand:
        th.Push((i, 'a'))
    print [key for (key, payload) in th.TopK()]
    print sorted(list_rand, reverse=True)[0:K]
The node similarities obtained this way differ locally from the square caching method's, showing that projecting S onto the low-dimensional space loses some precision. The restoreSim() function recovers only row i of S from V and $\hat{S}$, i.e. the similarities between node i and all other nodes. Restoring the whole S via \eqref{decompose} would again mean multiplying two large matrices (V in particular is large) and would be very slow, so in practice restoreSim() should be called concurrently from multiple processes.
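A minimal sketch of such a concurrent driver, assuming the script above is importable as a module named krylov_simrank and that the platform forks worker processes (so they can read the parent's globals); the file name and worker count here are illustrative:

restore_parallel.py (hypothetical)

#!/usr/bin/env python
# coding=utf-8
# Hypothetical driver that restores rows of S concurrently.
from multiprocessing import Pool

import krylov_simrank


def restore_row(i):
    # forked child processes can read the parent module's globals
    return krylov_simrank.restoreSim(krylov_simrank.orthogonal,
                                     krylov_simrank.sim_matrix, i)


if __name__ == '__main__':
    krylov_simrank.simrank("../data/linkgraph", 10)
    pool = Pool(4)
    # rows[i] holds the similarities between node i and every other node
    rows = pool.map(restore_row, range(krylov_simrank.nodesnum))
    pool.close()
    pool.join()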
Distributed SimRank
As the previous sections show, even though the square caching method cuts down the number of SimRank iterations and the dimensionality reduction of Q sidesteps the cost of multiplying huge matrices, the final restoration of S still cannot avoid a large matrix multiplication. The fundamental remedy is to run the matrix operations in parallel on Hadoop. The following sketches how matrix multiplication can be implemented with MapReduce.
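The standard approach takes two jobs: the first joins the columns of A with the rows of B on their shared index and emits partial products keyed by output cell, and the second sums those partial products. Below is a minimal Hadoop Streaming style sketch of job 1; the input format, file name, and job wiring are assumptions, not code from the original post:

matmul_streaming.py (hypothetical)

#!/usr/bin/env python
# coding=utf-8
# Hadoop Streaming sketch for C = A * B (job 1 of 2).
# Assumed input lines: "<matrix tag> <row> <col> <value>", e.g. "A 0 3 0.5".
import sys
from itertools import groupby


def mapper():
    # re-key every element by the shared dimension: A[i,j] by j, B[j,k] by j
    for line in sys.stdin:
        tag, row, col, value = line.split()
        if tag == "A":
            print "%s\t%s\t%s\t%s" % (col, tag, row, value)
        else:
            print "%s\t%s\t%s\t%s" % (row, tag, col, value)


def reducer():
    # the framework sorts by key, so all A[*,j] and B[j,*] arrive together;
    # emit one partial product per output cell (i,k)
    for j, group in groupby(sys.stdin, lambda line: line.split("\t", 1)[0]):
        a_items = []
        b_items = []
        for line in group:
            _, tag, idx, value = line.strip("\n").split("\t")
            if tag == "A":
                a_items.append((idx, float(value)))
            else:
                b_items.append((idx, float(value)))
        for i, a in a_items:
            for k, b in b_items:
                print "%s,%s\t%.17g" % (i, k, a * b)


if __name__ == '__main__':
    # Hadoop Streaming would run this script as "matmul_streaming.py map"
    # for the map phase and "matmul_streaming.py reduce" for the reduce phase
    if sys.argv[1] == "map":
        mapper()
    else:
        reducer()

Job 2 is then an identity mapper plus a reducer that sums the values grouped on the "i,k" key, producing C[i,k].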
SimRank++
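The original post presents SimRank++ through code alone, so a brief orientation for the listing below: it runs on a weighted bipartite graph of users viewing job positions (input lines of the form user position:count ...), and extends plain SimRank in the two ways proposed in the SimRank++ paper (Antonellis et al.). First, an evidence factor rewards pairs with many common neighbors: for n common neighbors, $evidence=\sum_{i=1}^{n}\frac{1}{2^{i}}$, with n capped at evidence_size, which is exactly what initEvidence() precomputes. Second, the uniform average over neighbor pairs becomes a weighted one: each edge weight is normalized by the total weight on its source node and scaled by a spread factor $e^{-variance}$ computed from the weights on the target node, which is what initWeightSumAndVariance() prepares.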
simrankpp.py

#!/usr/bin/env python
# coding=utf-8

from sym_matrix import SymMatrix
import mathutil
import math

user_index = {}      # user ids, starting from 0
position_index = {}  # position ids, starting from 0
link_out = {}        # adjacency from user to position
link_in = {}         # adjacency from position to user
user_common_count = SymMatrix()      # number of positions viewed by both of two users
position_common_count = SymMatrix()  # number of users who viewed both of two positions
user_sim_matrix = SymMatrix()        # user similarity matrix
position_sim_matrix = SymMatrix()    # position similarity matrix
user_damp = 0.8      # user damping factor
position_damp = 0.8  # position damping factor
evidence_size = 20
evidence = []        # precomputed evidence constants
user_weight_sum_variance = {}      # sum and variance of all edge weights on a user node
position_weight_sum_variance = {}  # sum and variance of all edge weights on a position node


def initEvidence():
    '''Precompute the evidence constants.'''
    global evidence_size
    global evidence

    for i in range(evidence_size):
        evidence.append(0)
    evidence[0] = 0.5
    for i in range(1, evidence_size):
        evidence[i] = evidence[i - 1] + 1.0 / math.pow(2, i + 1)


def indexUserAndPosition(viewfile):
    '''Read the file of users viewing position pages and assign ids to
    users and positions.'''
    global user_index
    global position_index
    user_count = 0
    position_count = 0
    for line in open(viewfile, "r"):
        arr = line.strip("\n").split()
        user = arr[0]
        if user not in user_index:
            user_index[user] = user_count
            user_count += 1
        for ele in arr[1:]:
            brr = ele.split(":")
            if len(brr) == 2:
                position = brr[0]
                if position not in position_index:
                    position_index[position] = position_count
                    position_count += 1


def readLink(viewfile):
    '''Read the view records and build the out- and in-adjacency maps.'''
    global link_out
    global link_in
    global position_index
    global user_index

    for line in open(viewfile, "r"):
        arr = line.strip("\n").split()
        user = arr[0]
        userid = user_index[user]
        position_count_collection = {}
        for ele in arr[1:]:
            brr = ele.split(":")
            if len(brr) == 2:
                position = brr[0]
                positionid = position_index[position]
                count = int(brr[1])
                position_count_collection[positionid] = count
                user_count_collection = {}
                if positionid in link_in:
                    user_count_collection = link_in[positionid]
                user_count_collection[userid] = count
                link_in[positionid] = user_count_collection
        link_out[userid] = position_count_collection


def initWeightSumAndVariance():
    '''Compute user_weight_sum_variance and position_weight_sum_variance.'''
    global link_out
    global link_in

    for k, v in link_out.items():
        ps = []
        for p, c in v.items():
            ps.append(c)
        tup = mathutil.getSumAndVariance(ps)
        user_weight_sum_variance[k] = tup
    for k, v in link_in.items():
        us = []
        for u, c in v.items():
            us.append(c)
        tup = mathutil.getSumAndVariance(us)
        position_weight_sum_variance[k] = tup


def initSimMatrix():
    '''Initialize the similarity matrices position_sim_matrix/user_sim_matrix.'''
    global link_out
    global link_in
    global user_sim_matrix
    global position_sim_matrix
    global user_common_count
    global position_common_count

    user_sim_matrix = SymMatrix()
    position_sim_matrix = SymMatrix()
    user_common_count = SymMatrix()
    position_common_count = SymMatrix()

    # initial pairwise similarity between users
    for k1, v1 in link_out.items():  # links from user to position
        ps1 = []
        for p, c in v1.items():
            ps1.append(p)
        for k2, v2 in link_out.items():
            if k1 < k2:
                ps2 = []
                for p, c in v2.items():
                    ps2.append(p)
                # intersection of the positions the two users link to
                common = mathutil.findCommonEle(sorted(ps1), sorted(ps2))
                commonLen = len(common)
                if commonLen > 0:
                    user_common_count.set(k1, k2, commonLen)
                    # two users linking to at least one common position
                    # start with similarity user_damp
                    user_sim_matrix.set(k1, k2, user_damp)
        user_sim_matrix.set(k1, k1, 1)  # similarity with oneself is 1

    # initial pairwise similarity between positions
    for k1, v1 in link_in.items():
        usr1 = []
        for u, c in v1.items():
            usr1.append(u)
        for k2, v2 in link_in.items():
            if k1 < k2:
                usr2 = []
                for u, c in v2.items():
                    usr2.append(u)
                common = mathutil.findCommonEle(sorted(usr1), sorted(usr2))
                commonLen = len(common)
                if commonLen > 0:
                    position_common_count.set(k1, k2, commonLen)
                    position_sim_matrix.set(k1, k2, position_damp)
        position_sim_matrix.set(k1, k1, 1)


def updateSim():
    '''One iteration of the similarity update.'''
    global link_out
    global link_in
    global user_common_count
    global user_sim_matrix
    global position_common_count
    global position_sim_matrix
    global user_damp
    global position_damp
    global evidence_size
    global evidence

    # pairwise similarity between users
    for k1, v1 in link_out.items():  # links from user to position
        ps1 = []  # out-neighbours of the first user
        for p, c in v1.items():
            ps1.append(p)
        for k2, v2 in link_out.items():
            commCount = user_common_count.get(k1, k2)  # positions linked by both users
            if commCount == None:
                commCount = 0
            # pruning strategy for large data sets: node pairs without a common
            # in-neighbour (or out-neighbour) are treated as having similarity 0
            # and skipped (test only k1 < k2 here to disable the pruning)
            if k1 < k2 and commCount > 0:
                ps2 = []  # out-neighbours of the second user
                for p, c in v2.items():
                    ps2.append(p)
                if commCount > evidence_size:
                    commCount = evidence_size
                edv = evidence[commCount - 1]
                sum_part = 0.0
                for p1 in ps1:
                    for p2 in ps2:
                        sim_p1_p2 = position_sim_matrix.get(p1, p2)
                        if sim_p1_p2 == None:
                            continue
                        spread_p1 = math.pow(
                            math.e, -position_weight_sum_variance[p1][1])
                        weight_u1_p1 = link_out[k1][p1]
                        weight_u1_sum = user_weight_sum_variance[k1][0]
                        normalized_weight_u1_p1 = 1.0 * weight_u1_p1 / weight_u1_sum
                        spread_p2 = math.pow(
                            math.e, -position_weight_sum_variance[p2][1])
                        weight_u2_p2 = link_out[k2][p2]
                        weight_u2_sum = user_weight_sum_variance[k2][0]
                        normalized_weight_u2_p2 = 1.0 * weight_u2_p2 / weight_u2_sum
                        sum_part += spread_p1 * normalized_weight_u1_p1 * \
                            spread_p2 * normalized_weight_u2_p2 * sim_p1_p2
                new_sim = edv * user_damp * sum_part
                user_sim_matrix.set(k1, k2, new_sim)

    # pairwise similarity between positions
    for k1, v1 in link_in.items():  # links from position to user
        us1 = []  # in-neighbours of the first position
        for u, c in v1.items():
            us1.append(u)
        for k2, v2 in link_in.items():
            commCount = position_common_count.get(k1, k2)  # users linked by both positions
            if commCount == None:
                commCount = 0
            if k1 < k2 and commCount > 0:
                us2 = []  # in-neighbours of the second position
                for u, c in v2.items():
                    us2.append(u)
                if commCount > evidence_size:
                    commCount = evidence_size
                edv = evidence[commCount - 1]
                sum_part = 0.0
                for u1 in us1:
                    for u2 in us2:
                        sim_u1_u2 = user_sim_matrix.get(u1, u2)
                        if sim_u1_u2 == None:
                            continue
                        spread_u1 = math.pow(
                            math.e, -user_weight_sum_variance[u1][1])
                        weight_p1_u1 = link_in[k1][u1]
                        weight_p1_sum = position_weight_sum_variance[k1][0]
                        normalized_weight_p1_u1 = 1.0 * weight_p1_u1 / weight_p1_sum
                        spread_u2 = math.pow(
                            math.e, -user_weight_sum_variance[u2][1])
                        weight_p2_u2 = link_in[k2][u2]
                        weight_p2_sum = position_weight_sum_variance[k2][0]
                        normalized_weight_p2_u2 = 1.0 * weight_p2_u2 / weight_p2_sum
                        sum_part += spread_u1 * normalized_weight_p1_u1 * \
                            spread_u2 * normalized_weight_p2_u2 * sim_u1_u2
                new_sim = edv * position_damp * sum_part
                position_sim_matrix.set(k1, k2, new_sim)


def simrank(linkfile, iteration):
    initEvidence()
    indexUserAndPosition(linkfile)
    readLink(linkfile)
    initWeightSumAndVariance()
    print "iteration 0:"
    initSimMatrix()
    for i in range(iteration):
        print "iteration %d:" % (i + 1)
        updateSim()


def printResult(sim_user_file, sim_position_file):
    '''Write out the computed similarities.'''
    global position_sim_matrix
    global user_sim_matrix
    global link_out
    global link_in
    usernum = len(link_out)      # number of user nodes
    positionnum = len(link_in)   # number of position nodes

    # similarities between users
    f_out_user = open(sim_user_file, "w")
    for i in range(usernum):
        f_out_user.write(str(i) + "\t")
        neighbour = []
        for j in range(usernum):
            if i != j:
                sim = user_sim_matrix.get(i, j)
                if sim == None:
                    sim = 0
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, descending
        neighbour = sorted(neighbour, cmp=lambda x, y: cmp(x[1], y[1]), reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(str(u) + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()

    # similarities between positions
    f_out_position = open(sim_position_file, "w")
    for i in range(positionnum):
        f_out_position.write(str(i) + "\t")
        neighbour = []
        for j in range(positionnum):
            if i != j:
                sim = position_sim_matrix.get(i, j)
                if sim == None:
                    sim = 0
                if sim > 0:
                    neighbour.append((j, sim))
        neighbour = sorted(neighbour, cmp=lambda x, y: cmp(x[1], y[1]), reverse=True)
        for (p, sim) in neighbour:
            f_out_position.write(str(p) + ":" + str(sim) + "\t")
        f_out_position.write("\n")
    f_out_position.close()


if __name__ == '__main__':
    linkfile = "../data/delivery"
    sim_user_file = "../data/sim_user.txt"
    sim_position_file = "../data/sim_position.txt"
    simrank(linkfile, 5)
    printResult(sim_user_file, sim_position_file)
mathutil.py

#!/usr/bin/env python
# coding=utf-8

def findCommonEle(list1, list2):
    '''Return the common elements of two sequences.
    The elements of list1 and list2 must be sorted in ascending order.
    '''
    rect = []
    length1 = len(list1)
    length2 = len(list2)
    idx1 = 0
    idx2 = 0
    while idx1 < length1 and idx2 < length2:
        ele1 = list1[idx1]
        ele2 = list2[idx2]
        if ele1 == ele2:
            rect.append(ele1)
            idx1 += 1
            idx2 += 1
        elif ele1 < ele2:
            idx1 += 1
        else:
            idx2 += 1
    return rect


def getSumAndVariance(li):
    '''Compute the sum and variance of a sequence.'''
    assert len(li) > 0
    total = 0.0
    sumSquare = 0.0
    for ele in li:
        total += ele
        sumSquare += ele * ele
    mean = total / len(li)
    variance = sumSquare / len(li) - mean * mean
    return (total, variance)
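The sym_matrix module imported by simrankpp.py is not included in the original post. A minimal sketch consistent with how SymMatrix is used above (symmetric get/set keyed on an unordered node pair, returning None for absent entries) might look like this:

sym_matrix.py (reconstructed sketch)

#!/usr/bin/env python
# coding=utf-8

class SymMatrix():
    '''A sparse symmetric matrix backed by a dict. This is a reconstruction:
    the original sym_matrix.py is not shown in the post.'''

    def __init__(self):
        self.arr = {}

    def _key(self, row, col):
        # normalize the pair so (i, j) and (j, i) share one entry
        return (row, col) if row <= col else (col, row)

    def get(self, row, col):
        # returns None for absent entries, as the SimRank++ code expects
        return self.arr.get(self._key(row, col))

    def set(self, row, col, value):
        self.arr[self._key(row, col)] = value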
SimRank Based on Random Walks
The expected meeting distance model says: in a graph G, the similarity between vertices a and b is determined by the lengths of the paths along which random walks starting at a and b travel until they meet, and by how often they can meet. Under this model SimRank is computed as in \eqref{rw_sim} and \eqref{rw_itr}.
\begin{equation}s(a,b)=\sum_{t(a,b)\to(x,x)}{P[t]C^{l(t)}}\label{rw_sim}\end{equation}
\begin{equation}s^{k+1}(a,b)=\sum_{t(a,b)\to(x,x);l(t)\le{k+1}}{P[t]C^{l(t)}}\label{rw_itr}\end{equation}
$t(a,b)\to(x,x)$ denotes a pair of random walk paths starting at a and b that first meet at vertex x; both paths have the same length $l(t)$. Suppose the two paths are $t_1=(v_1,v_2,\cdots,v_m,x)$ and $t_2=(w_1,w_2,\cdots,w_m,x)$. The probability of walking $t_1$ is $P[t_1]=\prod_{i=1}^m{\frac{1}{\mid{O(v_i)}\mid}}$, the probability of walking $t_2$ is $P[t_2]=\prod_{i=1}^m{\frac{1}{\mid{O(w_i)}\mid}}$, and the probability that a and b meet at vertex x via $t_1$ and $t_2$ is $P[t]=P[t_1]P[t_2]=\prod_{i=1}^m{\frac{1}{\mid{O(v_i)}\mid\mid{O(w_i)}\mid}}$.
Computing $s^{k+1}$ by enumerating random walk paths takes the following steps (a sketch follows the list):
- Find all paths in the graph of length at most $k+1$ and record each path's probability.
- Pair up paths that share the same endpoint and the same length, and compute the partial similarity $P[t]C^{l(t)}$ of their starting points as in \eqref{rw_sim}.
- Sum up the partial similarities from step 2.
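A minimal sketch of these three steps, assuming the graph is a dict of out-neighbor lists; the example graph is the bipartite graph of Figure 1 with every edge walkable in both directions:

rw_simrank.py (hypothetical)

#!/usr/bin/env python
# coding=utf-8
# Sketch of the path-enumeration computation of s^{k+1}(a, b).

C = 0.8

def walks(graph, start, max_len):
    '''Enumerate all walks from start with 1..max_len edges,
    yielding (node sequence, probability of the walk).'''
    frontier = [([start], 1.0)]
    for _ in range(max_len):
        nxt = []
        for path, prob in frontier:
            outs = graph.get(path[-1], [])
            for o in outs:
                nxt.append((path + [o], prob / len(outs)))
        for path, prob in nxt:
            yield path, prob
        frontier = nxt

def simrank_rw(graph, a, b, k):
    '''s^{k+1}(a,b): sum of P[t] * C^{l(t)} over pairs of equal-length
    paths (length <= k+1) that FIRST meet at their common endpoint.'''
    if a == b:
        return 1.0
    sim = 0.0
    by_len_a = {}
    for path, prob in walks(graph, a, k + 1):
        by_len_a.setdefault(len(path), []).append((path, prob))
    for path_b, prob_b in walks(graph, b, k + 1):
        for path_a, prob_a in by_len_a.get(len(path_b), []):
            if path_a[-1] != path_b[-1]:
                continue
            # "first meeting": the walks must not coincide at any earlier step
            if any(x == y for x, y in zip(path_a[:-1], path_b[:-1])):
                continue
            l = len(path_a) - 1  # number of edges
            sim += prob_a * prob_b * C ** l
    return sim

if __name__ == '__main__':
    graph = {'A': ['a', 'b'], 'B': ['b', 'c'],
             'a': ['A'], 'b': ['A', 'B'], 'c': ['B']}
    print simrank_rw(graph, 'A', 'B', 5)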