前言:
我根據自己的科研方向和實際工作,在利用R語言解決數據,特征和模型三方面的問題時,會搜集,閱讀,修改和遷移一些R代碼,利用【R語言】公眾號將其整理和歸總,分享給大家。一方面,希望這些R代碼能夠對大家解決實際問題有幫助或者啟示;另一方面,也希望大家嘗試從R代碼中學習和應用R語言。我只是R語言代碼的搬運工和傳播者,大家在使用這些R代碼的時候,有些什么新的啟示或者問題,請留言。依托【R語言】公眾號,我創建了R語言群,群友們每天都會就R語言的主題進行交流和分享。需要加入R語言群的朋友,可以掃碼加我的個人微信,請備注【姓名-入群】。我誠邀你加入群,大家相互學習和共同進步。
代碼:
最近在研究客群細分的問題,使用到了經典的聚類學習算法,K均值算法。
K均值算法的R語言代碼
##########################
#時間:2020-07-08
#########################
# 加載R包
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
# 數據准備
df <- USArrests
# 數據缺失值處理
df <- na.omit(df) # 刪除含有缺失值的樣本
# 數據標准化處理
df <- scale(df)
head(df)
# 基於距離度量的聚類學習
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# K均值算法
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
k2
fviz_cluster(k2, data = df)
df %>%
as_tibble() %>%
mutate(cluster = k2$cluster,
state = row.names(USArrests)) %>%
ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
geom_text()
# 不同的聚類數目對比分析
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
# 最佳的K數量確定
# 方法1 Elbow Method
set.seed(123)
# function to compute total within-cluster sum of square
wss <- function(k) {
kmeans(df, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
# extract wss for 2-15 clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
# 方法2 Average Silhouette Method
# function to compute average silhouette for k clusters
avg_sil <- function(k) {
km.res <- kmeans(df, centers = k, nstart = 25)
ss <- silhouette(km.res$cluster, dist(df))
mean(ss[, 3])
}
# Compute and plot wss for k = 2 to k = 15
k.values <- 2:15
# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouettes")
fviz_nbclust(df, kmeans, method = "silhouette")
# 方法3:Gap Statistic Method
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
fviz_gap_stat(gap_stat)
# 選擇最佳K值后重新實施K均值算法
# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(df, 4, nstart = 25)
print(final)
# 聚類學習的可視化效果
fviz_cluster(final, data = df)
# 聚類的中心點表示
USArrests %>%
mutate(Cluster = final$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
# 參考資料:
# https://uc-r.github.io/kmeans_clustering
最佳K=4后,重新執行K均值算法,可視化效果如下圖所示。
各個聚類的中心點坐標結果。
關於這段代碼有什么問題或者想法,請閱讀參考資料,或者添加我的微信,大家交流和討論。