These concepts must not be confused, yet most people have probably never fully sorted them out.
This is very useful reading: Interpret the key results for Correlation
Basics
First, a demonstration of correlation:
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 8)
c <- data.frame(x = a, y = b)
plot(c)
cor(t(c))
> cor(t(c))
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
[4,]    1    1    1    1
Preliminary conclusions:
1. Correlation measures the linear relationship between two variables;
2. If, across different samples, x increases as y increases, then x and y are positively correlated; if x decreases as y increases, they are negatively correlated.
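A minimal sketch of conclusion 2, reusing the vectors from above: when one variable rises with the other, cor() returns +1; reversing one of them flips the sign.

```r
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 8)   # b = 2*a: perfectly positively correlated
cor(a, b)            # 1
cor(a, rev(b))       # -1: reversed b decreases as a increases
```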
Next, compute distances:
> dist(c, method = "euclidean")
         1        2        3
2 2.236068
3 4.472136 2.236068
4 6.708204 4.472136 2.236068
> sqrt(2^2 + 1^2)
[1] 2.236068
Preliminary conclusions:
1. A distance is computed between two points within a particular coordinate system;
2. Distance can represent similarity; for example, points 1 and 2 are more similar than points 1 and 4;
3. Euclidean distance is the familiar geometric distance, and the most intuitive.
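A quick check of conclusion 2 on the same data (the names m and d are just illustrative): each entry of dist() matches the euclidean formula sqrt(sum((x_i - y_i)^2)) applied to a pair of rows, and row 1 is indeed closer to row 2 than to row 4.

```r
m <- as.matrix(data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8)))
d <- as.matrix(dist(m, method = "euclidean"))
sqrt(sum((m[1, ] - m[2, ])^2))  # 2.236068, same as d[1, 2]
d[1, 2] < d[1, 4]               # TRUE: point 1 is more similar to point 2
```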
So when do we compute correlation, and when do we compute similarity?
For gene expression, correlation is the natural choice: co-expression analysis is all about correlation, i.e. genes A and B rising and falling together across different samples. Correlation is therefore computed between variables; I have yet to see it computed between samples, since that makes little sense. We can hardly say that, across these genes, certain samples rise and fall together.
For samples, similarity is the most common question: to know whether sample A is more similar to sample B or to sample C, just compute the distance between them in the shared gene coordinate space.
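A toy sketch of this convention, with made-up expression values and hypothetical sample/gene names: in R, cor() works column-wise and dist() works row-wise, so with samples as rows and genes as columns the same matrix yields gene-gene correlations and sample-sample distances.

```r
set.seed(1)
expr <- matrix(rnorm(12), nrow = 3,
               dimnames = list(paste0("sample", 1:3), paste0("gene", 1:4)))
cor(expr)   # 4x4 gene-gene correlation matrix (between variables/columns)
dist(expr)  # 3 pairwise sample-sample euclidean distances (between rows)
```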
Advanced
1. How do the different methods for computing correlation differ?
2. How do the different methods for computing distance differ?
3. What are the limitations of correlation analysis?
A brief look at the difference between pearson and spearman:
x <- 1:100
y <- exp(x)
cor(x, y, method = "pearson")   # 0.25
cor(x, y, method = "spearman")  # 1
plot(x, y)
Conclusion:
pearson detects linear correlation, while spearman can also detect a nonlinear relationship, as long as it is monotonic (a monotonic relationship, to use the technical term).
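One way to see why spearman handles monotonic relationships: it is simply the pearson correlation computed on the ranks, and a strictly monotonic transformation like exp() leaves the ranks unchanged.

```r
x <- 1:100
y <- exp(x)                     # nonlinear but strictly monotonic
cor(x, y, method = "spearman")  # 1
cor(rank(x), rank(y))           # 1: pearson on the ranks gives the same value
```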
Reference: Correlation (Pearson, Kendall, Spearman)
The effect of outliers on correlation
x <- 1:100
y <- 1:100
cor(x, y, method = "pearson")   # 1
y[100] <- 1000
cor(x, y, method = "pearson")   # 0.448793
cor(x, y, method = "spearman")  # 1
y[99] <- 0
cor(x, y, method = "spearman")  # 0.9417822
Conclusions:
A single outlier has a large effect on pearson, but only a small effect on spearman.
The Pearson correlation coefficient measures the strength of the linear relationship between two variables, but its value alone cannot fully reflect their true relationship.
The correlation coefficient is only meaningful when both variables have non-zero standard deviations.
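A quick illustration of the zero-standard-deviation case: when one variable is constant, the denominator of the correlation coefficient is zero, and cor() returns NA (with a warning).

```r
x <- c(5, 5, 5, 5)  # constant vector: sd(x) is 0
y <- 1:4
sd(x)               # 0
cor(x, y)           # NA, with a warning that the standard deviation is zero
```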
The differences among the Pearson, Kendall, and Spearman correlations
Distance measures
euclidean | maximum | manhattan | canberra | binary | minkowski
k-NN 4: which distance function?
euclidean:
Usual distance between the two vectors (2 norm aka L_2), sqrt(sum((x_i - y_i)^2)).
maximum:
Maximum distance between two components of x and y (supremum norm).
manhattan:
Absolute distance between the two vectors (1 norm aka L_1).
canberra:
sum(|x_i - y_i| / (|x_i| + |y_i|)). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
This is intended for non-negative values (e.g., counts), in which case the denominator can be written in various equivalent ways; Originally, R used x_i + y_i, then from 1998 to 2017, |x_i + y_i|, and then the correct |x_i| + |y_i|.
binary:
(aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
minkowski:
The p norm, the pth root of the sum of the pth powers of the differences of the components.
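The definitions above can be checked on a single pair of vectors (the values here are made up for illustration); each result below follows directly from the corresponding formula.

```r
x <- c(1, 0, 3)
y <- c(4, 2, 0)
m <- rbind(x, y)
dist(m, method = "euclidean")        # sqrt(3^2 + 2^2 + 3^2) = sqrt(22)
dist(m, method = "maximum")          # max(|1-4|, |0-2|, |3-0|) = 3
dist(m, method = "manhattan")        # 3 + 2 + 3 = 8
dist(m, method = "canberra")         # 3/5 + 2/2 + 3/3 = 2.6
dist(m, method = "binary")           # on/off bits (1,0,1) vs (1,1,0): 2/3
dist(m, method = "minkowski", p = 3) # (3^3 + 2^3 + 3^3)^(1/3)
```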
To be continued~