References:
1. Python Pearson correlation coefficient: https://www.cnblogs.com/lxnz/p/7098954.html
2. The three major correlation coefficients in statistics (Pearson, Spearman, Kendall): http://blog.sina.com.cn/s/blog_69e75efd0102wmd2.html
Pearson Correlation Coefficient
Focus on the formula after the first equals sign; what follows is the derivation, which you can ignore for now. As you can see, the Pearson correlation coefficient ρ(X,Y) of two variables (X, Y) equals their covariance cov(X,Y) divided by the product of their standard deviations (σX, σY):
ρ(X,Y) = cov(X,Y) / (σX · σY)
The denominator of the formula contains the variables' standard deviations, which means that when computing the Pearson correlation coefficient, neither standard deviation can be 0 (the denominator cannot be 0); in other words, the values of either variable cannot all be identical. If a variable shows no variation, the Pearson correlation coefficient cannot tell you whether it is correlated with another variable.
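This limitation is easy to see in code (a minimal sketch using NumPy; the data is made up): when one variable is constant, its standard deviation of 0 lands in the denominator and the result is undefined.

```python
import numpy as np

x = [7, 7, 7, 7]  # constant: standard deviation is 0
y = [1, 2, 3, 4]

# The Pearson denominator contains std(x) = 0, so the coefficient is undefined
r = np.corrcoef(x, y)[0, 1]
print(r)  # nan (NumPy also emits a divide-by-zero warning)
```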
The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient: a statistic that reflects the degree of linear correlation between two variables. The coefficient is denoted r; with n the sample size and X̄, Ȳ the means of the observed values of the two variables, it can be written as
r = Σ(Xi − X̄)(Yi − Ȳ) / sqrt(Σ(Xi − X̄)² · Σ(Yi − Ȳ)²)
r describes the strength of the linear relationship between the two variables; the larger |r| is, the stronger the correlation.
A simple classification of correlation strength by |r|:
0.8–1.0: very strong correlation
0.6–0.8: strong correlation
0.4–0.6: moderate correlation
0.2–0.4: weak correlation
0.0–0.2: very weak or no correlation
r describes the strength of the linear relationship between two variables, and takes values between -1 and +1. If r > 0, the variables are positively correlated: as one variable's value increases, so does the other's. If r < 0, they are negatively correlated: as one variable's value increases, the other's decreases. The larger |r| is, the stronger the correlation. Note that correlation here implies no causal relationship.
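The r formula can be computed directly from its definition and checked against NumPy (a minimal sketch; the sample data is invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# r = sum((Xi - X̄)(Yi - Ȳ)) / sqrt(sum((Xi - X̄)²) * sum((Yi - Ȳ)²))
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r = num / den

print(r)
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy
```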
Spearman correlation coefficient
The Spearman correlation coefficient is usually also called the Spearman rank correlation coefficient. A "rank" can be understood as an ordering or sorted position, so the coefficient is computed from the positions of the original data after sorting; this representation removes the restrictions that apply when computing the Pearson correlation coefficient. Its formula is:
ρs = 1 − 6Σdi² / (n(n² − 1))
The calculation proceeds as follows: first sort the data of the two variables (X, Y), then record the sorted positions (X′, Y′); these positions are called ranks. The differences between ranks are the di in the formula above, n is the number of data points, and substituting into the formula gives the result.
Substituting into the formula gives the Spearman correlation coefficient: ρs = 1 − 6×(1+1+1+9) / (6×(6² − 1)) = 1 − 72/210 ≈ 0.657
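The arithmetic can be verified in code (a sketch: the worked example's original data table is not reproduced here, so the squared rank differences are taken from the sum above, with the remaining differences assumed to be zero for n = 6):

```python
# Spearman's rho from rank differences: rho = 1 - 6 * sum(di^2) / (n * (n^2 - 1))
d_squared = [1, 1, 1, 9, 0, 0]  # squared rank differences (the nonzero ones appear in the sum above)
n = len(d_squared)              # n = 6

rho = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
print(round(rho, 3))  # 0.657
```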
Moreover, even when a variable's values show no variation, the denominator n(n² − 1) never becomes 0, so the incomputable divide-by-zero case of the Pearson coefficient does not arise. In addition, even if outliers appear, their ranks usually do not change much (an extremely large or small value simply ranks first or last), so their influence on the Spearman correlation coefficient is very small!
Because the Spearman correlation coefficient imposes no such conditions on the data, it applies to a much wider range of situations.
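The robustness to outliers can be demonstrated with SciPy (a small sketch; the data is invented, with one extreme value appended to otherwise perfectly linear data):

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 1000]  # last value is an outlier

# Pearson is dragged down by the outlier's magnitude, while Spearman
# only sees that the outlier is ranked last, so it remains exactly 1.0
print(pearsonr(x, y)[0])   # noticeably below 1.0
print(spearmanr(x, y)[0])  # 1.0
```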
Kendall correlation coefficient
The Kendall correlation coefficient, also called the Kendall rank correlation coefficient, is likewise a rank correlation coefficient, but the objects it operates on are categorical variables.
A categorical variable can be understood as a variable with categories, which can be:
unordered, e.g. sex (male, female) or blood type (A, B, O, AB);
ordered, e.g. obesity level (severely obese, moderately obese, mildly obese, not obese).
It is usually ordered categorical variables for which a correlation coefficient is needed.
Nc denotes the number of concordant pairs, i.e. pairs where the subjective and objective ratings agree in order, and Nd the number of discordant pairs, where they disagree. In the basic form, τ = (Nc − Nd) / (n(n − 1)/2).
For example, suppose judges grade contestants (excellent, fair, poor, etc.) and we want to see whether two (or more) judges apply consistent standards to the contestants; or suppose we want to check whether different hospitals' urine glucose test reports are consistent. In such cases the Kendall correlation coefficient is a suitable measure.
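The judges scenario can be sketched with SciPy (the scores are invented for illustration; ordinal grades are encoded as integers, e.g. 3 = excellent, 2 = fair, 1 = poor):

```python
from scipy.stats import kendalltau

# Two judges' grades for six contestants, on an ordinal scale
judge_a = [3, 3, 2, 2, 1, 1]
judge_b = [3, 2, 2, 2, 1, 1]

# kendalltau handles the tied grades (it computes the tau-b variant)
tau, p = kendalltau(judge_a, judge_b)
print(tau)  # strong positive agreement between the judges
```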
pandas implementation
pandas.DataFrame.corr()
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values
Parameters:
method : {‘pearson’, ‘kendall’, ‘spearman’}
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
Returns:y : DataFrame
import pandas as pd

df = pd.DataFrame({'A': [5, 91, 3], 'B': [90, 15, 66], 'C': [93, 27, 3]})
print(df.corr())
print(df.corr('spearman'))
print(df.corr('kendall'))

df2 = pd.DataFrame({'A': [7, 93, 5], 'B': [88, 13, 64], 'C': [93, 27, 3]})
print(df2.corr())
print(df2.corr('spearman'))
print(df2.corr('kendall'))
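The min_periods argument matters when columns contain missing values (a small sketch with made-up data): column pairs sharing fewer observations than min_periods get NaN instead of a coefficient computed from too few points.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [1.0, 2.0, np.nan, np.nan],
                   'C': [4.0, 3.0, 2.0, 1.0]})

# A and B share only 2 observations, so requiring at least 3 leaves that cell NaN,
# while A and C (4 shared observations) still get their coefficient (-1.0 here)
print(df.corr(min_periods=3))
```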
numpy implementation
numpy.corrcoef(x, y=None, rowvar=True, bias=<no value>, ddof=<no value>)
Return Pearson product-moment correlation coefficients.
Please refer to the documentation for cov for more detail. The relationship between the correlation coefficient matrix R and the covariance matrix C is R_ij = C_ij / sqrt(C_ii * C_jj).
The values of R are between -1 and 1, inclusive.
Parameters:
x : array_like
A 1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables. Also see rowvar below.
y : array_like, optional
An additional set of variables and observations. y has the same shape as x.
rowvar : bool, optional
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
bias : _NoValue, optional Has no effect, do not use. Deprecated since version 1.10.0.
ddof : _NoValue, optional Has no effect, do not use. Deprecated since version 1.10.0.
Returns:
R : ndarray — the correlation coefficient matrix of the variables.
import numpy as np

vc = [1, 2, 39, 0, 8]
vb = [1, 2, 38, 0, 8]
# Pearson correlation computed manually: cov(vc, vb) / (std(vb) * std(vc))
print(np.mean(np.multiply((vc - np.mean(vc)), (vb - np.mean(vb)))) / (np.std(vb) * np.std(vc)))
# corrcoef returns the correlation coefficient matrix (similarity of the vectors)
print(np.corrcoef(vc, vb))
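The example above passes two 1-D vectors; corrcoef also accepts a single 2-D array, where rowvar controls whether variables live in rows or columns (a small sketch with invented data):

```python
import numpy as np

# Three observations of two variables, stored column-wise (one variable per column)
data = np.array([[1.0, 10.0],
                 [2.0,  8.0],
                 [3.0,  6.0]])

# With rowvar=False, each column is treated as a variable; the two columns
# here are perfectly negatively correlated, so the off-diagonal entries are -1
m = np.corrcoef(data, rowvar=False)
print(m)
# Equivalent to transposing and using the default rowvar=True
print(np.corrcoef(data.T))
```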
Spearman’s Rank Correlation
Spearman’s rank correlation is named for Charles Spearman.
It may also be called Spearman's correlation coefficient and is denoted by the lowercase Greek letter rho (ρ). As such, it may be referred to as Spearman's rho.
This statistical method quantifies the degree to which ranked variables are associated by a monotonic function, meaning an increasing or decreasing relationship. As a statistical hypothesis test, the method assumes under the null hypothesis (H0) that the samples are uncorrelated.
The Spearman rank-order correlation is a statistical procedure that is designed to measure the relationship between two variables on an ordinal scale of measurement.
— Page 124, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, 2009.
The intuition for the Spearman’s rank correlation is that it calculates a Pearson’s correlation (e.g. a parametric measure of correlation) using the rank values instead of the real values. Where the Pearson’s correlation is the calculation of the covariance (or expected difference of observations from the mean) between the two variables normalized by the variance or spread of both variables.
Spearman’s rank correlation can be calculated in Python using the spearmanr() SciPy function.
The function takes two real-valued samples as arguments and returns both the correlation coefficient in the range between -1 and 1 and the p-value for interpreting the significance of the coefficient.
# calculate spearman's correlation
coef, p = spearmanr(data1, data2)
We can demonstrate the Spearman’s rank correlation on the test dataset. We know that there is a strong association between the variables in the dataset and we would expect the Spearman’s test to find this association.
The complete example is listed below.
# calculate the spearman's correlation between two variables
from numpy.random import rand
from numpy.random import seed
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)
# calculate spearman's correlation
coef, p = spearmanr(data1, data2)
print('Spearmans correlation coefficient: %.3f' % coef)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)
Running the example calculates the Spearman’s correlation coefficient between the two variables in the test dataset.
The statistical test reports a strong positive correlation with a value of 0.9. The p-value is close to zero, meaning that the probability of observing data like this if the samples were truly uncorrelated is very small, so we can reject the null hypothesis that the samples are uncorrelated at the 95% confidence level.
Spearmans correlation coefficient: 0.900
Samples are correlated (reject H0) p=0.000
Kendall’s Rank Correlation
Kendall’s rank correlation is named for Maurice Kendall.
It is also called Kendall's correlation coefficient, and the coefficient is often referred to by the lowercase Greek letter tau (τ). In turn, the test may be called Kendall's tau.
The intuition for the test is that it calculates a normalized score for the number of matching or concordant rankings between the two samples. As such, the test is also referred to as Kendall’s concordance test.
The Kendall’s rank correlation coefficient can be calculated in Python using the kendalltau() SciPy function. The test takes the two data samples as arguments and returns the correlation coefficient and the p-value. As a statistical hypothesis test, the method assumes (H0) that there is no association between the two samples.
# calculate kendall's correlation
coef, p = kendalltau(data1, data2)
We can demonstrate the calculation on the test dataset, where we do expect a significant positive association to be reported.
The complete example is listed below.
# calculate the kendall's correlation between two variables
from numpy.random import rand
from numpy.random import seed
from scipy.stats import kendalltau
# seed random number generator
seed(1)
# prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)
# calculate kendall's correlation
coef, p = kendalltau(data1, data2)
print('Kendall correlation coefficient: %.3f' % coef)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)
Running the example calculates Kendall's correlation coefficient as 0.709, indicating that the two variables are highly correlated.
The p-value is close to zero (and printed as zero), as with the Spearman’s test, meaning that we can confidently reject the null hypothesis that the samples are uncorrelated.
Kendall correlation coefficient: 0.709
Samples are correlated (reject H0) p=0.000
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- List three examples where calculating a nonparametric correlation coefficient might be useful during a machine learning project.
- Update each example to calculate the correlation between uncorrelated data samples drawn from a non-Gaussian distribution.
- Load a standard machine learning dataset and calculate the pairwise nonparametric correlation between all variables.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, 2009.
- Applied Nonparametric Statistical Methods, Fourth Edition, 2007.
- Rank Correlation Methods, 1990.
Articles
- Nonparametric statistics on Wikipedia
- Rank correlation on Wikipedia
- Spearman’s rank correlation coefficient on Wikipedia
- Kendall rank correlation coefficient on Wikipedia
- Goodman and Kruskal’s gamma on Wikipedia
- Somers’ D on Wikipedia
Summary
In this tutorial, you discovered rank correlation methods for quantifying the association between variables with a non-Gaussian distribution.
Specifically, you learned:
- How rank correlation methods work and which methods are available.
- How to calculate and interpret the Spearman’s rank correlation coefficient in Python.
- How to calculate and interpret the Kendall’s rank correlation coefficient in Python.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Spearman's Rank Correlation
Preliminaries
import numpy as np
import pandas as pd
import scipy.stats
Create Data
# Create two lists of random values
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 2, 4.5, 7, 6.5, 6, 9, 9.5]
Calculate Spearman’s Rank Correlation
Spearman’s rank correlation is the Pearson’s correlation coefficient of the ranked version of the variables.
# Create a function that takes in x's and y's
def spearmans_rank_correlation(xs, ys):
    # Calculate the rank of x's
    xranks = pd.Series(xs).rank()
    # Calculate the ranking of the y's
    yranks = pd.Series(ys).rank()
    # Calculate Pearson's correlation coefficient on the ranked versions of the data
    return scipy.stats.pearsonr(xranks, yranks)

# Run the function
spearmans_rank_correlation(x, y)[0]
# 0.90377360145618091
Calculate Spearman’s Correlation Using SciPy
# Just to check our results, here is Spearman's using SciPy
scipy.stats.spearmanr(x, y)[0]
# 0.90377360145618102