Comparing Different Types of Clustering Methods in R

Reposted article, 2019-09-09 20:45

Original link: http://tecdat.cn/?p=6454

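Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine, and geospatial analysis. There are several types of clustering methods, including:

Partitioning methods
Hierarchical clustering
Fuzzy clustering
Density-based clustering
Model-based clustering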

Data preparation

Demo data set: the built-in R data set named USArrests
Remove missing data
Scale the variables to make them comparable

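# Load the packages used throughout (assumed: the source omits the library() calls)
library(magrittr)    # provides the %>% pipe
library(cluster)     # pam()
library(factoextra)  # get_dist(), fviz_*(), eclust(), get_clust_tendency()

# Load and prepare the data
my_data <- USArrests %>%
  na.omit() %>%          # Remove missing values (NA)
  scale()                # Scale variables

# View the first 3 rows
head(my_data, n = 3)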

##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288
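As a quick sanity check on the scaling step, the following base-R sketch confirms that every column of the scaled matrix ends up with mean 0 and standard deviation 1. The names `scaled`, `col_means` and `col_sds` are introduced here purely for illustration.

```r
# Scale USArrests with base R only (no pipe needed)
scaled <- scale(na.omit(USArrests))

# After scaling, every column should have mean ~0 and standard deviation 1
col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)

print(round(col_means, 10))
print(col_sds)
```

Scaling matters here because Assault is measured on a much larger numeric range than Murder, and an unscaled Euclidean distance would be dominated by it.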

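Distance measures

get_dist(): computes a distance matrix between the rows of a data matrix. Compared with the standard dist() function, it also supports correlation-based distance measures, including the "pearson", "kendall" and "spearman" methods.
fviz_dist(): visualizes a distance matrix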

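# get_dist() arguments after "U" are truncated in the source; assumed reconstruction
res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist,
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))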
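To make the idea of a correlation-based distance concrete, here is a minimal base-R sketch. It assumes the common definition d = 1 - r, where r is the Pearson correlation between two rows; the names `x` and `d_cor` are illustrative, and the result is not guaranteed to be identical to what get_dist() computes internally.

```r
x <- scale(USArrests)

# Pearson correlation between rows (states), turned into a dissimilarity:
# perfectly correlated profiles get distance 0, anti-correlated ones get 2
d_cor <- as.dist(1 - cor(t(x), method = "pearson"))

# Inspect the distances among the first three states
print(round(as.matrix(d_cor)[1:3, 1:3], 3))
```

Unlike Euclidean distance, this measure treats two states as close when their profiles have the same shape across variables, even if the magnitudes differ.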

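Partitioning clustering

Partitioning algorithms are clustering techniques that subdivide a data set into a set of k groups, where k is the number of groups specified in advance by the analyst.

An alternative to k-means clustering is k-medoids clustering, or PAM (Partitioning Around Medoids, Kaufman and Rousseeuw, 1990), which is less sensitive to outliers than k-means.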
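The difference in outlier sensitivity can be seen in a toy one-dimensional example. The sketch below is only an illustration of the underlying idea (a mean versus a medoid as the cluster representative), not a run of either algorithm; the data vector is made up.

```r
# One-dimensional toy group with a single extreme value
x <- c(1, 2, 3, 4, 100)

# k-means represents a cluster by its mean, which the outlier drags upward
centroid <- mean(x)

# PAM represents a cluster by a medoid: the observation with the smallest
# total distance to all other observations in the cluster
total_dist <- rowSums(as.matrix(dist(x)))
medoid <- x[which.min(total_dist)]

print(centroid)  # 22
print(medoid)    # 3
```

The medoid stays inside the bulk of the data, while the mean is pulled far away from every actual observation.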

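The following R code shows how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R.

Determining the optimal number of clusters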


fviz_nbclust(my_data, kmeans, method = "gap_stat")
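For comparison, a simpler criterion than the gap statistic is the "elbow" of the total within-cluster sum of squares. The sketch below uses base R only; it is a different heuristic from the one used above and may suggest a different k. The names `x` and `wss` are illustrative.

```r
x <- scale(USArrests)

set.seed(123)
# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Look for the "elbow": the k after which the decrease levels off
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the value equals (n - 1) * p = 49 * 4 = 196, because every scaled column has unit variance.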

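Computing and visualizing k-means clustering

set.seed(123)
# Compute k-means (this line is missing in the source; k = 3 matches the PAM
# call below, and nstart = 25 is an assumed value)
km.res <- kmeans(my_data, centers = 3, nstart = 25)

# Visualize
fviz_cluster(km.res, data = my_data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())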

# Compute PAM

pam.res <- pam(my_data, 3)
# Visualize
fviz_cluster(pam.res)

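Hierarchical clustering

Hierarchical clustering is an alternative to partitioning clustering for identifying groups in a data set. It does not require the number of clusters to be specified in advance.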

# Compute hierarchical clustering
res.hc <- USArrests %>%
  scale() %>%                    # Scale the data
  dist(method = "euclidean") %>% # Compute dissimilarity matrix (hclust() requires one)
  hclust(method = "ward.D2")     # Compute hierarchical clustering

# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4,                    # Cut in four groups
          color_labels_by_k = TRUE,         # Color labels by groups
          rect = TRUE                       # Add rectangle around groups
          )
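Because the tree is built without fixing the number of clusters, the cut can be chosen afterwards. The base-R sketch below shows one way to extract group labels from a tree with cutree(); the names `hc` and `groups` are illustrative.

```r
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")

# Cut the tree into 4 groups; returns one group label per state
groups <- cutree(hc, k = 4)

print(table(groups))   # group sizes
print(head(groups))    # labels of the first few states
```

Calling cutree() again with a different k reuses the same tree, so trying several cluster counts does not require recomputing the clustering.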

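Assessing clustering tendency

To assess clustering tendency, the Hopkins statistic and a visual approach can be used.

Hopkins statistic: if the value of the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the data set is significantly clusterable.
Visual approach: detects clustering tendency by counting the number of square-shaped dark (or colored) blocks along the diagonal of an ordered dissimilarity image.

R code: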

 
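# gradient.color is not defined in the source; these colors are assumed
gradient.color <- list(low = "steelblue", high = "white")
iris[, -5] %>%    # Remove column 5 (Species)
  scale() %>%     # Scale variables
  get_clust_tendency(n = 50, gradient = gradient.color)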

## $hopkins_stat
## [1] 0.8
## 
## $plot
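To show what a Hopkins-type statistic measures, here is a rough base-R sketch. Several variants of the statistic exist; this one follows the convention used in the text, where values close to 1 indicate clusterable data. The function name `hopkins_sketch` and its arguments are hypothetical, and its value will not match get_clust_tendency() exactly.

```r
hopkins_sketch <- function(x, m = 20, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)

  # w: nearest-neighbour distances from m sampled real points to the other data points
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  idx <- sample(n, m)
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: nearest-neighbour distances from m uniform random points
  # (drawn inside the bounding box of the data) to the data points
  rand <- sapply(seq_len(p), function(j) runif(m, min(x[, j]), max(x[, j])))
  u <- apply(rand, 1, function(pt) min(sqrt(colSums((t(x) - pt)^2))))

  # Clustered data: real points sit close together (small w) while random
  # points fall in empty space (large u), pushing the ratio towards 1
  sum(u) / (sum(u) + sum(w))
}

h <- hopkins_sketch(scale(iris[, -5]), m = 20)
print(h)
```

For uniformly scattered data, u and w are of similar size and the ratio stays near 0.5.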

Determining the optimal number of clusters

set.seed(123)

# Compute (the function name is missing in the source; the arguments match NbClust())
library(NbClust)
res.nbclust <- USArrests %>%
  scale() %>%
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 10,
          method = "complete", index = "all")

# Visualize
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .
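The majority rule in this summary is simply a vote count. As a small illustration, the tallies printed above can be reproduced in base R; the vector `votes` is copied by hand from that output.

```r
# Votes copied from the summary above: names are candidate cluster counts,
# values are how many indices proposed each one
votes <- c("0" = 2, "1" = 1, "2" = 9, "3" = 4, "4" = 6, "5" = 2, "8" = 1, "10" = 1)

# The majority rule picks the candidate with the most votes
best_k <- as.integer(names(votes)[which.max(votes)])
print(best_k)  # 2
```

Note that 4 clusters is a fairly close runner-up with 6 votes, so the choice of 2 is not unanimous.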

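Cluster validation statistics

In the R code below, we compute and evaluate the results of a hierarchical clustering.

Compute and visualize the hierarchical clustering: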

# Enhanced hierarchical clustering, cut in 3 groups
res.hc <- iris[, -5] %>%
  scale() %>%
  eclust("hclust", k = 3, graph = FALSE)

# Visualize with factoextra
fviz_dend(res.hc, palette = "jco",
          rect = TRUE, show_labels = FALSE)

Inspect the silhouette plot:

fviz_silhouette(res.hc)

##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32

Which samples have a negative silhouette width? Which cluster are they closer to?

# Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]

# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]

##     cluster neighbor sil_width
## 84        3        2   -0.0127
## 122       3        2   -0.0179
## 62        3        2   -0.0476
## 135       3        2   -0.0530
## 73        3        2   -0.1009
## 74        3        2   -0.1476
## 114       3        2   -0.1611
## 72        3        2   -0.2304
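A negative silhouette width means an observation is, on average, closer to a neighbouring cluster than to its own. The base-R sketch below computes the width s(i) = (b - a) / max(a, b) by hand, where a is the mean distance to the observation's own cluster and b is the smallest mean distance to any other cluster. It uses hclust() with Ward linkage as an assumed stand-in for the clustering above, so the numbers need not match the output shown.

```r
x  <- scale(iris[, -5])
d  <- as.matrix(dist(x))
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

sil_width <- sapply(seq_len(nrow(x)), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                       # exclude the observation itself
  if (!any(own)) return(0)              # singleton cluster: width defined as 0
  a <- mean(d[i, own])                  # mean distance to own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
})

# Average silhouette width per cluster, and the observations with negative width
print(round(tapply(sil_width, cl, mean), 2))
print(which(sil_width < 0))
```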

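Advanced clustering methods

Hybrid clustering methods

Hierarchical k-means clustering: a hybrid approach for improving k-means results
HCPC: hierarchical clustering on principal components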
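The idea behind hierarchical k-means can be sketched in base R: cut a hierarchical tree to obtain initial groups, then use the group means as starting centers for k-means instead of random ones. This is a minimal illustration under that assumption, with illustrative variable names, not the implementation of any particular package.

```r
x <- scale(USArrests)

# Step 1: hierarchical clustering, cut into k = 4 groups
hc  <- hclust(dist(x), method = "ward.D2")
grp <- cutree(hc, k = 4)

# Step 2: the group means become the initial centers (a k x p matrix)
init_centers <- apply(x, 2, function(col) tapply(col, grp, mean))

# Step 3: k-means refines the partition starting from those centers
km <- kmeans(x, centers = init_centers)

# Within-cluster sum of squares before and after the k-means refinement
wss_hc <- sum((x - init_centers[grp, ])^2)
print(c(hierarchical = wss_hc, refined = km$tot.withinss))
```

Starting from deterministic centers also removes the dependence on a random initialization, so repeated runs give the same partition.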

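Fuzzy clustering

Fuzzy clustering is also known as soft clustering. Standard clustering methods (k-means, PAM) assign each observation to exactly one cluster; this is called hard clustering. In fuzzy clustering, by contrast, an observation can belong to several clusters, with a degree of membership in each.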
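To illustrate what a soft assignment looks like, the sketch below computes fuzzy c-means style membership degrees for fixed cluster centers. It is a single membership step under assumed settings (centers taken from k-means, fuzzifier m = 2), not the full iterative algorithm; packaged implementations such as cluster::fanny() handle the complete procedure.

```r
x <- scale(USArrests)

set.seed(123)
centers <- kmeans(x, centers = 3, nstart = 25)$centers
m <- 2  # fuzzifier (assumed value); larger values give softer memberships

# Euclidean distance from every observation to every center (n x 3 matrix)
d <- sapply(seq_len(nrow(centers)), function(k) {
  sqrt(rowSums(sweep(x, 2, centers[k, ])^2))
})

# Membership of observation i in cluster k:
# u_ik = d_ik^(-2/(m-1)) / sum_j d_ij^(-2/(m-1))
w <- d^(-2 / (m - 1))
u <- w / rowSums(w)

print(round(head(u, 3), 3))  # each row sums to 1
```

A hard clustering corresponds to the special case where each row contains a single 1 and zeros elsewhere.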

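Model-based clustering

In model-based clustering, the data are regarded as coming from a distribution that is a mixture of two or more clusters. The method finds the model that best fits the data and estimates the number of clusters.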
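The mixture idea can be sketched with a small expectation-maximization (EM) loop for a two-component Gaussian mixture in one dimension. The simulated data, starting values and iteration count are all assumptions made for illustration, and the number of components is fixed at 2 here rather than estimated; dedicated packages such as mclust select it by comparing models.

```r
set.seed(42)
# Simulated data: two Gaussian clusters centered at 0 and 5
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Starting values for means, standard deviations and mixing proportions
mu    <- c(min(x), max(x))
sigma <- c(1, 1)
pi_k  <- c(0.5, 0.5)

for (iter in 1:100) {
  # E-step: responsibility of each component for each observation
  dens <- cbind(pi_k[1] * dnorm(x, mu[1], sigma[1]),
                pi_k[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # M-step: re-estimate parameters from the weighted observations
  nk    <- colSums(resp)
  mu    <- colSums(resp * x) / nk
  sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
  pi_k  <- nk / length(x)
}

print(round(rbind(mu = mu, sigma = sigma, proportion = pi_k), 2))
```

The responsibilities in `resp` are soft cluster assignments; taking the larger one per row gives a hard classification.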

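DBSCAN: density-based clustering

DBSCAN is a clustering method introduced by Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. The basic idea behind density-based clustering is derived from the intuitive way humans group points.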
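The density idea can be sketched in base R. The function below is a simplified illustration with hypothetical names (`dbscan_sketch`, `eps`, `min_pts`), not the reference implementation: a point is a core point if at least `min_pts` points (itself included) lie within distance `eps`, clusters grow outward from core points, and anything unreachable is labeled 0 (noise).

```r
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # Neighbours within eps of each point (each point counts as its own neighbour)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts

  labels <- integer(n)  # 0 = noise or not yet assigned
  cluster_id <- 0L
  for (i in seq_len(n)) {
    if (labels[i] != 0L || !is_core[i]) next
    cluster_id <- cluster_id + 1L       # start a new cluster at an unassigned core point
    labels[i] <- cluster_id
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] != 0L) next         # already assigned
      labels[j] <- cluster_id
      # Only core points extend the cluster; border points are absorbed but not expanded
      if (is_core[j]) queue <- c(queue, neighbors[[j]])
    }
  }
  labels
}

# Toy data: two tight blobs and one isolated point
set.seed(1)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.2), ncol = 2),
  c(20, 20)
)
labels <- dbscan_sketch(pts, eps = 1, min_pts = 4)
print(table(labels))
```

Unlike k-means, no number of clusters is supplied; it follows from `eps` and `min_pts`, and the isolated point is left unassigned instead of being forced into a cluster.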


