R Language: Clustering and Classification


I. Clustering

 General steps:

  1. Choose appropriate variables
  2. Scale the data
  3. Screen for outliers
  4. Compute distances
  5. Select a clustering algorithm
  6. Use one or more clustering methods
  7. Determine the number of clusters
  8. Obtain the final cluster solution
  9. Visualize the results
  10. Interpret the clusters
  11. Validate the results

  (A minimal sketch of these steps on built-in data follows.)
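  As a hedged, minimal sketch of steps 2-9, illustrated on the built-in mtcars data rather than the case-study data used below:

m.scaled <- scale(mtcars)                     # step 2: standardize the variables
m.dist   <- dist(m.scaled)                    # step 4: Euclidean distance matrix
m.fit    <- hclust(m.dist, method = 'average')# steps 5-6: average-linkage clustering
plot(m.fit, hang = -1, cex = .8)              # step 9: visualize as a dendrogram
m.groups <- cutree(m.fit, k = 3)              # steps 7-8: cut into 3 clusters
table(m.groups)                               # step 10: inspect cluster sizes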

  1. Hierarchical Cluster Analysis

  Case study: the nutrient dataset from the flexclust package

    1. Based on 5 nutrition measures, what are the similarities and differences among the 27 types of fish, fowl, and meat?

    2. Is there a way to group these foods into a small number of meaningful classes?

    1.1 Computing distances

data(nutrient, package = 'flexclust')
head(nutrient, 4)
d <- dist(nutrient)
as.matrix(d)[1:4, 1:4]

    Conclusion: the larger the distance between two observations, the more dissimilar (heterogeneous) they are.
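    dist() uses Euclidean distance by default; the entry for the first two foods can be checked by hand (a minimal verification, reusing d from above):

sqrt(sum((nutrient[1,] - nutrient[2,])^2))  # manual Euclidean distance
as.matrix(d)[1, 2]                          # should give the same value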

    1.2 Average-linkage clustering

row.names(nutrient) <- tolower(row.names(nutrient))
nutrient.scaled <- scale(nutrient)
d2 <- dist(nutrient.scaled)
fit.average <- hclust(d2, method = 'average')
plot(fit.average, hang = -1, cex = .8, main = 'Average Linkage Clustering')

    Conclusion: by itself, the dendrogram only shows the similarities and dissimilarities among the foods' nutritional profiles; it does not settle the number of clusters.
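    In average linkage, the distance between two clusters is the mean of all pairwise distances between their members. Other linkages can be swapped in via hclust's method argument (a hedged comparison sketch, reusing d2 from above; not part of the original analysis):

fit.single   <- hclust(d2, method = 'single')    # nearest-neighbour distance
fit.complete <- hclust(d2, method = 'complete')  # farthest-neighbour distance
fit.ward     <- hclust(d2, method = 'ward.D2')   # minimizes within-cluster variance
par(mfrow = c(1, 3))
plot(fit.single,   hang = -1, cex = .6, main = 'Single Linkage')
plot(fit.complete, hang = -1, cex = .6, main = 'Complete Linkage')
plot(fit.ward,     hang = -1, cex = .6, main = 'Ward Linkage')
par(mfrow = c(1, 1))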

    1.3 Selecting the number of clusters

library('NbClust')
devAskNewPage(ask = TRUE)
nc <- NbClust(nutrient.scaled, distance = "euclidean",
              min.nc = 2, max.nc = 15, method = "average")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
        xlab = 'Number of Clusters', ylab = 'Number of Criteria',
        main = 'Number of Clusters Chosen by 26 Criteria')

    Conclusion: four candidate numbers of clusters tie for the most votes (2, 3, 5, and 15); choose the most interpretable among them (the analysis below uses 5).

    1.4 Obtaining the final cluster solution

# Cluster assignments
clusters <- cutree(fit.average, k = 5)
table(clusters)
# Describe the clusters
aggregate(nutrient, by = list(clusters = clusters), median)
aggregate(as.data.frame(nutrient.scaled), by = list(clusters = clusters), median)
plot(fit.average, hang = -1, cex = .8, main = 'Average Linkage Clustering\n5 Cluster Solution')
rect.hclust(fit.average, k = 5)

 

     Conclusions:

      1. sardines canned forms its own cluster, because its calcium content is much higher than the other foods
      2. beef heart is also a singleton, rich in protein and iron
      3. the foods from beef roast to pork simmered are high in energy and fat
      4. the foods from clams raw to clams canned are high in iron and low in protein
      5. the largest group, from mackerel canned to bluefish baked, is relatively low in iron

  2. Partitioning Cluster Analysis

  Case study: the wine dataset from the rattle.data package

    1. K-means clustering of the wine data

# Use a scree plot (total within-groups sum of squares vs. number of clusters)
# to help choose the number of clusters; a bend in the curve suggests a good k
wssplot <- function(data, nc = 15, seed = 1234){
  # for k = 1, the within-groups sum of squares is the total variance
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters',
       ylab = 'Within groups sum of squares')
}
data(wine, package = 'rattle.data')

head(wine)
df <- scale(wine[-1])
wssplot(df)
library(NbClust)
set.seed(1234)
# Determine the number of clusters
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
set.seed(1234)
# Fit the k-means cluster analysis
fit.km <- kmeans(df, 3, nstart = 25)
fit.km$size
fit.km$centers
aggregate(wine[-1], by = list(cluster = fit.km$cluster), mean)

    Conclusion: both the scree plot and the NbClust vote point to 3 clusters, which fit the data well.
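    To visualize the solution, one option (not in the original write-up) is to project the scaled data onto its first two principal components and color points by cluster:

pc <- prcomp(df)  # principal components of the scaled data
plot(pc$x[, 1:2], col = fit.km$cluster, pch = fit.km$cluster,
     main = 'K-means Clusters in PCA Space')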

     

# Use the adjusted Rand index to quantify the agreement
# between the wine Type variable and the cluster assignments
ct.km <- table(wine$Type, fit.km$cluster)
library(flexclust)
randIndex(ct.km)

  Conclusion: agreement is excellent; the adjusted Rand index is about 0.9 (it ranges from -1, no agreement, to 1, perfect agreement).
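  As a sanity check, a random partition should score near zero (a hedged sketch, assuming wine and flexclust are already loaded; random.assign is an illustrative name):

set.seed(1234)
random.assign <- sample(1:3, nrow(wine), replace = TRUE)  # random 3-way split
randIndex(table(wine$Type, random.assign))                # should be close to 0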

  Partitioning around medoids (PAM): because k-means is based on means, it is sensitive to outliers; a more robust approach is to partition around medoids. K-means is also typically limited to Euclidean distance, whereas PAM can use an arbitrary distance measure.

library(cluster)
set.seed(1234)
fit.pam <- pam(wine[-1], k = 3, stand = TRUE)
fit.pam$medoids
clusplot(fit.pam, main = 'Bivariate Cluster Plot')
ct.pam <- table(wine$Type, fit.pam$clustering)
randIndex(ct.pam)

  

  Conclusion: the adjusted Rand index drops from about 0.9 (k-means) to about 0.7, so PAM does not perform as well here.

  3. Avoiding Nonexistent Clusters

  3.1 Examining the dataset

library(fMultivar)
set.seed(1234)
df <- rnorm2d(1000, rho = .5)
df <- as.data.frame(df)
plot(df, main = 'Bivariate Normal Distribution with rho=0.5')

  Conclusion: the data contain no real clusters; they form a single bivariate normal cloud.

  3.2 Computing the number of clusters

library(NbClust)
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = 'kmeans')
dev.new()
barplot(table(nc$Best.n[1,]), xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")

  Conclusion: the criteria nevertheless vote for 3 clusters, even though we know none exist.

  3.3 Plotting the clusters

library(ggplot2)
library(cluster)
fit2 <- pam(df, k = 2)
df$clustering <- factor(fit2$clustering)
ggplot(data = df, aes(x = V1, y = V2, color = clustering, shape = clustering)) +
  geom_point() +
  ggtitle('Clustering of Bivariate Normal Data')

 

    Conclusion: PAM extracts 2 "clusters" from the bivariate normal data, an arbitrary split of a single cloud.

  3.4 Analyzing the clusters

# Column 4 of All.index holds the Cubic Cluster Criterion (CCC)
plot(nc$All.index[,4], type = 'o', ylab = 'CCC', xlab = 'Number of Clusters', col = 'blue')

    Conclusion: the CCC plot for the bivariate normal data reveals that no clusters exist; when the CCC values are all negative and decreasing for two or more clusters, the distribution is typically unimodal.

II. Classification

  Using machine learning to predict a binary classification outcome

  Case study: the Wisconsin breast cancer data. A training set is used to build logistic regression, decision tree, conditional inference tree, random forest, and support vector machine classifiers; a validation set is used to assess each model's effectiveness.

  1. Preparing the data

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds  <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")

breast <- read.table(url, sep=",", header=FALSE, na.strings="?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
                   "shapeUniformity", "maginalAdhesion",
                   "singleEpithelialCellSize", "bareNuclei",
                   "blandChromatin", "normalNucleoli", "mitosis", "class")
df <- breast[-1]
df$class <- factor(df$class, levels = c(2, 4), labels = c('benign', 'malignant'))
set.seed(1234)
# 70/30 split into training and validation sets
train <- sample(nrow(df), 0.7 * nrow(df))
df.train <- df[train, ]
df.validate <- df[-train, ]
table(df.train$class)
table(df.validate$class)
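  The raw file codes missing values as "?", which read.table converts to NA via na.strings; it is worth checking where they land before modeling (a small sketch using the df created above):

colSums(is.na(df))        # bareNuclei is the only column with missing values
sum(!complete.cases(df))  # number of rows containing at least one NA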

  2. Logistic regression

# Fit a logistic regression on the training set
fit.logit <- glm(class ~ ., data = df.train, family = binomial())
prob <- predict(fit.logit, df.validate, type = 'response')
# Classify the out-of-training samples
logit.pred <- factor(prob > .5, levels = c(FALSE, TRUE), labels = c('benign', 'malignant'))
# Evaluate the predictive accuracy
logit.perf <- table(df.validate$class, logit.pred, dnn = c('Actual', 'Predicted'))
logit.perf

  Conclusion: the model classifies about 97% of the validation cases correctly.
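  That figure is just the proportion on the diagonal of the confusion matrix, which can be computed directly (using logit.perf from above):

sum(diag(logit.perf)) / sum(logit.perf)  # overall accuracy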

  3. Decision tree

library(rpart)
set.seed(1234)
# Grow the tree
dtree <- rpart(class ~ ., data = df.train, method = 'class', parms = list(split = 'information'))
plotcp(dtree)
# Prune the tree
dtree.pruned <- prune(dtree, cp = .0125)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104, fallen.leaves = TRUE, main = 'Decision Tree')
# Classify the out-of-training samples
dtree.pred <- predict(dtree.pruned, df.validate, type = 'class')
dtree.perf <- table(df.validate$class, dtree.pred, dnn = c('Actual', 'Predicted'))
dtree.perf

 

   Conclusion: validation accuracy is about 96%.

  4. Conditional inference tree

library(party)
library(partykit)
fit.tree <- ctree(class ~ ., data = df.train)
plot(fit.tree, main = 'Conditional Inference Tree')
ctree.pred <- predict(fit.tree, df.validate, type = 'response')
ctree.perf <- table(df.validate$class, ctree.pred, dnn = c('Actual', 'Predicted'))
ctree.perf

  Conclusion: validation accuracy is about 97%.

  5. Random forest

library(randomForest)
set.seed(1234)
# Grow the forest (na.roughfix imputes missing values with column medians/modes)
fit.forest <- randomForest(class ~ ., data = df.train, na.action = na.roughfix, importance = TRUE)
importance(fit.forest, type = 2)
forest.pred <- predict(fit.forest, df.validate)
# Classify the out-of-training samples
forest.perf <- table(df.validate$class, forest.pred, dnn = c('Actual', 'Predicted'))
forest.perf

  Conclusion: validation accuracy is about 98%.
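  For a graphical view of variable importance, the randomForest package also provides varImpPlot() (a minimal sketch, reusing fit.forest from above):

varImpPlot(fit.forest, main = 'Variable Importance, Breast Cancer Model')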

  6. Support vector machine

library(e1071)
set.seed(1234)
fit.svm <- svm(class ~ ., data = df.train)
# svm() cannot handle missing values, so NA rows are dropped from the validation set
svm.pred <- predict(fit.svm, na.omit(df.validate))
svm.perf <- table(na.omit(df.validate)$class, svm.pred, dnn = c('Actual', 'Predicted'))
svm.perf

  Conclusion: validation accuracy is about 96%.

  7. Support vector machine with an RBF kernel

set.seed(1234)
# Tune the model over a grid of gamma and cost values
tuned <- tune.svm(class ~ ., data = df.train, gamma = 10^(-6:1), cost = 10^(-10:10))
tuned
# Refit with the winning parameters
fit.svm <- svm(class ~ ., data = df.train, gamma = .01, cost = 1)
svm.pred <- predict(fit.svm, na.omit(df.validate))
svm.perf <- table(na.omit(df.validate)$class, svm.pred, dnn = c('Actual', 'Predicted'))
svm.perf
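  The gamma = .01 and cost = 1 values above are read off the tuning printout; they can also be pulled programmatically (a small sketch using the tuned object from above; best.svm is an illustrative name):

tuned$best.parameters         # the winning gamma/cost combination
best.svm <- tuned$best.model  # the model already refit with those parameters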

  Conclusion: validation accuracy is about 97%.

  8. Writing a function to choose the best-performing classifier

# Summarize a 2 x 2 confusion matrix (rows = actual, columns = predicted)
performance <- function(table, n = 2){
  if(!all(dim(table) == c(2, 2))){
    stop('Must be a 2 x 2 table')
  }
  tn <- table[1, 1]  # true negatives
  fp <- table[1, 2]  # false positives
  fn <- table[2, 1]  # false negatives
  tp <- table[2, 2]  # true positives
  sensitivity <- tp / (tp + fn)  # proportion of malignant cases identified
  specificity <- tn / (tn + fp)  # proportion of benign cases identified
  ppp <- tp / (tp + fp)          # positive predictive value
  npp <- tn / (tn + fn)          # negative predictive value
  hitrate <- (tp + tn) / (tp + tn + fn + fp)  # overall accuracy
  result <- paste("Sensitivity = ", round(sensitivity, n),
                  "\nSpecificity = ", round(specificity, n),
                  "\nPositive Predictive Value = ", round(ppp, n),
                  "\nNegative Predictive Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, n), "\n", sep="")
  cat(result)
}

performance(logit.perf)
performance(dtree.perf)
performance(ctree.perf)
performance(forest.perf)
performance(svm.perf)

  Conclusion: of the classifiers compared here, the random forest gives the best overall performance on the validation set.

III. Data Mining with rattle

  Case study: predicting diabetes

library(rattle)
rattle()

  After loading the data and setting the variable roles, click Execute.

  Select the Model tab, choose a conditional inference tree as the prediction model, and click Draw to generate the tree diagram.

  Use the Evaluate tab to assess the model.

  Conclusion: only 35% of the diabetic patients are correctly identified; it is worth trying whether a random forest or a support vector machine fits better.

 

