Pearson Correlation Coefficient

Reposted article. 2017-07-28 19:19. Category: MachineLearning

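The Pearson correlation coefficient is a somewhat more sophisticated way than Euclidean distance to judge how similar people's interests are. It measures how well two sets of data fit a straight line. It tends to give better results when the data is not well normalized, for example when one critic's scores are consistently harsher than another's.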
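The effect of unnormalized data can be illustrated with a small sketch. The two rating lists below are hypothetical, and the Euclidean similarity uses an assumed convention of 1 / (1 + distance), which this article does not define. The second critic has the same tastes as the first but scores every film 1.5 points higher.

```python
from math import sqrt

def euclidean_similarity(x, y):
    # Assumed convention (not from the article): 1 / (1 + distance),
    # so identical rating lists score exactly 1.0.
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1 / (1 + dist)

def pearson(x, y):
    # Mean-centred form of the Pearson correlation coefficient.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x) *
               sum((b - mean_y) ** 2 for b in y))
    return num / den if den else 0.0

# Hypothetical ratings: same preferences, different scoring habits.
strict = [1.0, 2.0, 3.0, 2.5, 1.5]
generous = [r + 1.5 for r in strict]

print(euclidean_similarity(strict, generous))  # well below 1
print(pearson(strict, generous))               # approximately 1.0
```

The constant offset inflates the Euclidean distance, while the Pearson score ignores it because each list is centred on its own mean.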

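In the figure, Mick LaSalle rated "Superman" 3 while Gene Seymour rated it 5, so the film is plotted at (3, 5). The figure also shows a straight line, called the best-fit line, which is drawn to come as close as possible to all the points on the chart. If the two critics gave identical ratings to every film, this line would be the diagonal and would pass through every point, giving a perfect correlation score of 1.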
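One way to make the best-fit line concrete is an ordinary least-squares fit. The article does not say how the line in the figure was drawn, so least squares is an assumption here, and `best_fit_line` is a name chosen for illustration. The ratings are the six films that both critics rated in the critics data of this article.

```python
# Ratings in the order: Lady in the Water, Snakes on a Plane, Just My Luck,
# Superman Returns, The Night Listener, You, Me and Dupree.
mick = [3.0, 4.0, 2.0, 3.0, 3.0, 2.0]   # Mick LaSalle (x axis)
gene = [3.0, 3.5, 1.5, 5.0, 3.0, 3.5]   # Gene Seymour (y axis)

def best_fit_line(x, y):
    # Ordinary least squares: slope = S_xy / S_xx, and the line passes
    # through the point (mean_x, mean_y).
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(best_fit_line(mick, gene))   # line for the two critics
print(best_fit_line(mick, mick))   # identical ratings: the diagonal y = x
```

When both lists are identical the fitted slope is 1 and the intercept is 0, which is the diagonal case described above.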

Given two variables X and Y, the Pearson correlation coefficient between them can be computed with the following formulas:

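Formula 1:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}$$

Formula 2:

$$\rho_{X,Y} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - \left(\sum X\right)^2}\,\sqrt{N\sum Y^2 - \left(\sum Y\right)^2}}$$

Formula 3:

$$\rho_{X,Y} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2}\,\sqrt{\sum (Y-\bar{Y})^2}}$$

Formula 4:

$$\rho_{X,Y} = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{\left(\sum X\right)^2}{N}\right)\left(\sum Y^2 - \frac{\left(\sum Y\right)^2}{N}\right)}}$$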

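The four formulas above are equivalent, where E denotes the mathematical expectation, cov denotes the covariance, and N is the number of values each variable takes.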
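The equivalence can be checked numerically. The sketch below implements each of the four forms as commonly stated (expectation form, N-scaled sums, mean-centred sums, raw sums) and evaluates them on the ratings that Lisa Rose and Gene Seymour share in this article's critics data. The function names are chosen for illustration only.

```python
from math import sqrt

def pearson_expectation(x, y):
    # Form 1: (E[XY] - E[X]E[Y]) / (sqrt(E[X^2] - E[X]^2) * sqrt(E[Y^2] - E[Y]^2))
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    ex2 = sum(a * a for a in x) / n
    ey2 = sum(b * b for b in y) / n
    return (exy - ex * ey) / (sqrt(ex2 - ex ** 2) * sqrt(ey2 - ey ** 2))

def pearson_scaled_sums(x, y):
    # Form 2: every term multiplied through by N.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

def pearson_centered(x, y):
    # Form 3: sums of deviations from the sample means.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def pearson_raw_sums(x, y):
    # Form 4: raw sums with the N division kept inside each term.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (sxy - sx * sy / n) / sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))

# Shared ratings: Lady in the Water, Snakes on a Plane, Just My Luck,
# Superman Returns, You, Me and Dupree, The Night Listener.
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]

results = [f(lisa, gene) for f in (pearson_expectation, pearson_scaled_sums,
                                   pearson_centered, pearson_raw_sums)]
print(results)  # four values that agree up to floating-point rounding
```

The raw-sum forms need only one pass over the data, while the mean-centred form is generally less sensitive to rounding when the values are large relative to their spread.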

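The Pearson correlation score algorithm first finds the items that both critics have rated, then computes the sum and the sum of squares of each critic's ratings, along with the sum of the products of their ratings. Finally, it uses formula four above to compute the Pearson correlation coefficient.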

critics = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,  
                         'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,  
                         'The Night Listener': 3.0},  
           'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,  
                            'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,  
                            'You, Me and Dupree': 3.5},  
           'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,  
                                'Superman Returns': 3.5, 'The Night Listener': 4.0},  
           'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,  
                            'The Night Listener': 4.5, 'Superman Returns': 4.0,  
                            'You, Me and Dupree': 2.5},  
           'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,  
                            'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,  
                            'You, Me and Dupree': 2.0},  
           'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,  
                             'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},  
           'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}}  
  
  
from math import sqrt  
  
def sim_pearson(prefs, p1, p2):  
    # Get the list of mutually rated items  
    si = {}  
    for item in prefs[p1]:  
        if item in prefs[p2]:  
            si[item] = 1  
  
    # If there are no ratings in common, return 0
    if len(si) == 0:  
        return 0  
  
    # Number of items both people have rated
    n = len(si)  
  
    # Sums of all the preferences  
    sum1 = sum([prefs[p1][it] for it in si])  
    sum2 = sum([prefs[p2][it] for it in si])  
  
    # Sums of the squares  
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])  
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])  
  
    # Sum of the products  
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])  
  
    # Calculate r (Pearson score)  
    num = pSum - (sum1 * sum2 / n)  
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))  
    if den == 0:  
        return 0  
  
    r = num / den  
  
    return r  
  
print(sim_pearson(critics, 'Lisa Rose', 'Gene Seymour'))
# Output: 0.396059017191
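
As a check on the printed value, formula four can be worked by hand for Lisa Rose (X) and Gene Seymour (Y), who share N = 6 films in the critics data. The intermediate names num and den mirror the variables in sim_pearson.

$$
\begin{aligned}
\sum X &= 18.0, \quad \sum Y = 19.5, \quad \sum X^2 = 55.0, \quad \sum Y^2 = 69.75, \quad \sum XY = 59.5 \\
\text{num} &= 59.5 - \frac{18.0 \times 19.5}{6} = 59.5 - 58.5 = 1.0 \\
\text{den} &= \sqrt{\left(55.0 - \frac{18.0^2}{6}\right)\left(69.75 - \frac{19.5^2}{6}\right)} = \sqrt{1.0 \times 6.375} \approx 2.5249 \\
r &= \frac{\text{num}}{\text{den}} = \frac{1.0}{2.5249} \approx 0.3961
\end{aligned}
$$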
