R 中Cluster包總結

本文轉載自查看原文 2013-04-19 03:55 8065 Cluster

1. 怎樣快速入門一個R包

這兩天查看了十幾個R包，也算是對看R入門一個R包有一些經驗了把～

所有的R包都附帶一個manual（有些R包還會有一個小manual，做簡介用，那就更好了）。把manual下下來，看看它的最前面有沒有一個對整個包general的介紹，如果有，那么一定要仔細閱讀，這個非常重要！它會把整個包的大致情況介紹：要解決什么問題，用什么算法解決的......另外，它還會介紹最重要的函數（其實一個R包經常用到的函數不會一般超過五個！）了解這些信息，便很容易上手了～

但是並不是所有的R包都會有這樣一個對本包的概述，對於這樣的情況，可以在R包的主頁上面找一個‘citation'的東西：這是與這個R包配套發表出來的論文。然后找到這篇論文后，去google拉，pubmed拉～～～搜一下把它下下來讀一讀就好了。當然，這個論文里面往往會有很多理論性特別強的大段公式，不過，看不懂跳過去就好，找到introduction，看看圖基本就了解了！

總之，這兩種方法配合起來使用入手一個R包還是蠻快的。

2. R 中的cluster包總結：

1）pvclust：這個包是為hierarchical clustering的檢測專門設計的，而且特別適合用在microarray的data上面，看到這篇論文的cited次數為300+就知道其受歡迎程度了。這個包解決的問題就是：在microarray data中，sample數量是固定的，我們根據這個sample來做一個clustering，然后得出有幾個cluster。可是，問題出來了：這些cluster都穩定么？你有沒有想過如果在加進去幾個sample的話你的cluster還會是這樣？還是你的cluster就散掉了重新組成別的樣子了。所以這個包提供了一個bootstrap resampling的方法來確定這個cluster是否穩定，並且在cluster上面表上相應的統計系數，還可以給robust的cluster套上紅色長方形做標記。

值得注意的一點是，它的主函數pvclust是針對列，而默認的hclust函數卻是針對行的，所以要想得到與hclust相同的結果一定要注意將matrix進行轉置！

library('pvclust')
library（car)
mydata = na.omit(mtcars)

#hierarchical clustering clustering
  # Ward Hierarchical Clustering
  d <- dist(t(mydata), method = "euclidean") # distance matrix
  fit <- hclust(d, method="ward") 
  plot(fit,  cex = .5) # normal hcluster dendrogram
  groups <- cutree(fit2, k=3) # cut tree into 3 clusters
  #draw dendogram with red borders around  the 3 clusters 
  rect.hclust(fit2, k=3, border="red")

#pv cluster
  # Ward Hierarchical Clustering with Bootstrapped p values
  library(pvclust)
  fit2 <- pvclust(mydata, method.hclust="ward",
                 method.dist="euclidean")
  plot(fit2, cex = .5) # display pv cluster dendogram
  # add rectangles around groups highly supported by the data
  pvrect(fit, alpha=.95) #alpha即為au值

正常hclust pvclust做圖

可以看到，兩個clust完全一樣，只是pvclust的圖有紅色和綠色的圖標，分別代表：

AU (Approximately Unbiased) p-value and BP (BootstrapProbability) value. AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to unbiased p-value than BP value computed by normal bootstrap resampling

2）mclust

這是一個model based clustering的包，與kmeans等聚類不同，mclust是基於統計模型的聚類：即假設不同的cluster來自不同的分布，而整個data則是一個mixed model 的分布。mclust提供幾種模型，如multivariate normal distribution等等，根據幾種不同的模型分別擬合data，利用特定算法找出相應的模型以及其對應的部分data。

library(mclust)
  fit <- Mclust(mydata)
  plot(fit) # plot results

這里會出現數張圖，第一張是類似下圖的一張圖，每種顏色代表一個模型（具體什么還沒仔細看），而mclust的任務就是找出哪個模型得到最大的BIC值（如圖可見，VEV在四個group的時候BIC最大），即得到組數！第二張圖則是如下右所示，是根據cluster的情況對數據進行lattice散點圖：不同顏色代表不同的組，根據對角線上的參數確定橫縱坐標。

mclust plot mclust第二張圖

不過mclust並不提供bootstrap的重抽樣，所以並不好用bootstrap方法來判斷一個cluster的穩定性。當然，會有其他度量來衡量其穩定性的。

3）. flexclust

flexclust中主要函數有兩個，第一個是bootFlexclust。這里重點說一下得出來的盒形圖：其

橫坐標為k值，縱坐標為RAND index，用來衡量cluster的穩定性：RAND Index = 1則說明cluster穩定，RAND Index = 0則說明cluster比較隨機，通過這個盒形圖可以看出cluster的趨勢

library('flexclust')
x <- matrix(runif(400), ncol=2)
cl <- FALSE #This is multi-core processor index
## 50 bootstrap replicates for speed in example,
## use more for real applications
mydata = na.omit(mtcars)
bcl <- bootFlexclust(mydata, k=2:7, nboot=50, FUN=cclust, multicore=cl)

bcl  #包含分組情況、一些index
summary(bcl) #Rand index隨k值（2-7）的分布情況

plot(bcl)  #Rand Index隨k值變化的盒形圖：橫坐標為k值，縱坐標為RAND index，用來衡量cluster的穩定性：RAND Index = 1則說明cluster穩定，RAND Index = 0則說明cluster比較隨機
densityplot(bcl, from=0) #Rand Index隨k值變化的密度圖， 橫坐標為RAND Index，縱坐標為密度

#compare with hclust 
plot(hclust(dist(mydata)))
bcl
summary(bcl)

另外一個常用函數為kcca

說是話感覺kcca和kmeans，pam什么的應該差不多，不知道它強在哪里，可能是它的做圖功能把～

kcca1 = kcca(na.omit(mtcars), k=2)
kcca1

image(kcca1)
points(na.omit(mtcars))

OK，今天先寫這么多把～以后有用得着的話在寫其他包～～貼一個R cluster資源匯總：

http://cran.r-project.org/web/views/Cluster.html

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 R語言常用包分類總結 R語言：recommenderlab包的總結與應用案例 redis cluster 實踐總結一個R包—reticulate—在R中調用Python 用R包中heatmap畫熱圖 linux系統中安裝R包在R中查看已經安裝了的包 R語言caret包中createFolds函數【R】R基本函數總結 Python實現R包brainwaver中的compute.FDR函數