1. Minkowski Distance: Computing User Similarity
The Minkowski distance generalizes both the Manhattan distance and the Euclidean distance.
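For two users' rating vectors $x$ and $y$ over the $n$ dimensions they share, the standard form is

$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$

where $r = 1$ gives the Manhattan distance and $r = 2$ the Euclidean distance.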
The larger r is, the more a large difference in a single dimension influences the overall result: as r grows, the sum is increasingly dominated by the biggest per-dimension difference, approaching the Chebyshev distance $\max_k |x_k - y_k|$ in the limit. So the choice of r controls how much one outlier dimension can sway the distance. (This sensitivity may also be one reason why many companies' recommendation algorithms are not very accurate.)
When making recommendations for a new user, we can compute the Minkowski distance to every other user over the dimensions they have in common. For a large two-dimensional rating table of this kind, pandas makes the processing very convenient.

Below is a worked example of the Minkowski distance computation:
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5,
                      "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5,
                      "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0,
                  "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0,
                  "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5,
                 "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0,
                    "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0,
                    "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                    "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0,
                 "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
                      "Slightly Stoopid": 2.5, "The Strokes": 3.0}}

def minkefu(rating1, rating2, n):
    """Computes the Minkowski distance of order n. Both rating1 and rating2
    are dictionaries of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}."""
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) ** n
            commonRatings = True
    if commonRatings:
        return distance ** (1 / n)  # take the n-th root of the summed differences
    else:
        return -1  # indicates no ratings in common

def computeNearestNeighbor(username, users):
    """Creates a sorted list of users based on their distance to username."""
    distances = []
    for user in users:
        if user != username:
            distance = minkefu(users[user], users[username], 2)
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances

def recommend(username, users):
    """Give a list of recommendations."""
    # first find the nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    recommendations = []
    # now find bands the neighbor rated that this user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if artist not in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety -- sort is more efficient
    return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse=True)

# example
print(recommend('Hailey', users))
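Running this prints Hailey's recommendations, taken from her nearest neighbor (Veronica, at Euclidean distance sqrt(2) over their two shared bands): [('Phoenix', 4.0), ('Blues Traveler', 3.0), ('Slightly Stoopid', 2.5)].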
2. Pearson Correlation Coefficient: Handling Recommendation Error from Subjective Rating Differences
Part 1 noted that the larger r is, the more a single dimension's difference dominates the result. On top of that, users apply their own subjective rating scales: some rate everything generously while others rate harshly, so two users with the same taste can still end up far apart. We need a measure that corrects for these individual differences, and that measure is the Pearson correlation coefficient.
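The standard definition, for the $n$ items two users have both rated, with $\bar{x}$ and $\bar{y}$ the users' mean ratings:

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$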
Evaluating this formula directly is comparatively expensive, since it takes multiple passes over the data: one pass to compute the means, and another to accumulate the deviations. A later approximation formula brings this down considerably.
The Pearson correlation coefficient, which ranges over [-1, 1], measures how correlated two vectors (users) are: if two users' opinions largely agree, the coefficient is close to 1; if they largely disagree, it is close to -1. Here we need to work out two questions:
(1) How do we compute the Pearson correlation coefficient between multi-dimensional vectors?
(2) How do we combine it with the Minkowski distance to improve our recommendation model?
For question (1), there is an approximation formula that needs only a single pass over the data:
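$r = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{\sqrt{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}\,\sqrt{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}}$

(This is the single-pass form that the code below implements.)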
Expressed in code:
from math import sqrt

def pearson(rating1, rating2):
    """Single-pass approximation of the Pearson correlation between
    two users' ratings over the items they have both rated."""
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0  # number of items rated by both users
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)
    if n == 0:
        return 0  # no ratings in common
    # now compute denominator
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator
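For example, with the users dictionary above, pearson(users['Angelica'], users['Bill']) evaluates to roughly -0.90: their tastes are almost exactly opposite.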
For question (2), consider a scenario: Anne wants a song to listen to, and the three users most similar to her have the following Pearson coefficients (each has also rated the candidate song):

Pearson 0.8, rated the song 3.5
Pearson 0.7, rated the song 5.0
Pearson 0.5, rated the song 4.5
All three users contribute to the recommendation, so how do we apportion the weights? Since 0.8 + 0.7 + 0.5 = 2, we can take each coefficient as its share of the total: 0.8/2 = 0.4, 0.7/2 = 0.35, and 0.5/2 = 0.25.
The song's final projected score is therefore 4.5 × 0.25 + 5 × 0.35 + 3.5 × 0.4 = 4.275.
The benefit of computing it this way is that the weighted opinions of several users are pooled, so a single user's personal taste or history cannot derail the recommendation on its own. This is exactly the k-nearest-neighbors idea discussed below; a sketch follows.
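A minimal sketch of this weighted scoring, reusing the users dictionary and the pearson function defined above (the helper name projected_rating is ours, introduced only for illustration):

def projected_rating(target, users, item, k=3):
    """Predict target's rating for item as a Pearson-weighted average
    of the k most similar users who have rated it."""
    # rank the other users who rated the item by similarity, highest first
    neighbors = sorted(
        ((pearson(users[target], users[u]), u)
         for u in users
         if u != target and item in users[u]),
        reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total <= 0:
        return None  # no usefully similar neighbors
    # each neighbor's rating is weighted by its share of the summed similarity
    return sum((sim / total) * users[u][item] for sim, u in neighbors)

print(projected_rating('Hailey', users, 'Phoenix'))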
3. Handling Sparse Matrices: Cosine Similarity
If the data is dense, compute distance with the Minkowski distance; if the data is sparse, compute similarity with cosine similarity, which also ranges over [-1, 1]. Cosine similarity compares only the angle between the two rating vectors, so the many unrated (zero) entries of a sparse matrix contribute nothing, rather than distorting the result the way they would distort a distance.
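In the standard form, $\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$. A minimal sketch over the same ratings dictionaries used above (the helper name cosine_similarity is ours, not from the original code):

from math import sqrt

def cosine_similarity(rating1, rating2):
    """Cosine similarity between two users' rating dictionaries.
    Items a user has not rated are implicitly zeros, which is why
    this suits sparse data: missing items add nothing to the dot product."""
    dot = 0.0
    for key in rating1:
        if key in rating2:
            dot += rating1[key] * rating2[key]
    norm1 = sqrt(sum(v * v for v in rating1.values()))
    norm2 = sqrt(sum(v * v for v in rating2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a user with no ratings has no direction
    return dot / (norm1 * norm2)

print(cosine_similarity(users['Angelica'], users['Veronica']))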
4. Avoiding Bad Recommendations Driven by Individual Bias: the k-Nearest-Neighbors Algorithm
See the scenario under question (2) in Section 2.
Python already has ready-made KNN libraries; the essence of the algorithm is to find the few points nearest to the target. For details, see: http://python.jobbole.com/83794/
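As an illustration, a minimal sketch using scikit-learn's NearestNeighbors (assuming scikit-learn is installed; the dense rating matrix here is made up for the example):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# hypothetical user-item rating matrix: one row per user, one column per item
ratings = np.array([[3.5, 2.0, 4.5, 5.0],
                    [2.0, 3.5, 4.0, 2.0],
                    [5.0, 1.0, 1.0, 3.0],
                    [3.0, 4.0, 4.5, 3.0]])

knn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(ratings)
# query user 0; the first hit is the user itself (distance 0), so skip it
distances, indices = knn.kneighbors(ratings[0:1])
print(indices[0][1:], distances[0][1:])  # the 2 nearest other users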