I. Clustering
General workflow (steps 2-4 are sketched in the example after this list):
1. Choose appropriate variables
2. Scale the data
3. Screen for outliers
4. Compute distances
5. Select a clustering algorithm
6. Apply one or more clustering methods
7. Determine the number of clusters
8. Obtain the final cluster solution
9. Visualize the results
10. Interpret the clusters
11. Validate the results
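A minimal sketch of steps 2-4, using the built-in USArrests data purely for illustration (the dataset and the |z| > 3 outlier rule are assumptions, not part of the original notes):

dat <- as.data.frame(scale(USArrests))    # step 2: standardize each variable to mean 0, sd 1
outliers <- apply(abs(dat) > 3, 1, any)   # step 3: flag rows with any |z| > 3 as potential outliers
dat <- dat[!outliers, ]
d.ex <- dist(dat, method = "euclidean")   # step 4: pairwise Euclidean distances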
1. Hierarchical cluster analysis
Case: the nutrient dataset from the flexclust package.
Questions:
1. What are the similarities and differences among the 27 types of fish, fowl, and meat, based on five nutrient measures?
2. Is there a way to group these foods into a smaller number of meaningful categories?
1.1 Computing distances
data(nutrient, package = 'flexclust')
head(nutrient, 4)
d <- dist(nutrient)          # Euclidean distances between the 27 food types
as.matrix(d)[1:4, 1:4]
Conclusion: the larger the distance between two observations, the greater their heterogeneity.
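As a quick sanity check (a small sketch, not part of the original notes), the first off-diagonal entry of d can be reproduced by hand:

sqrt(sum((nutrient[1, ] - nutrient[2, ])^2))   # Euclidean distance computed manually
as.matrix(d)[1, 2]                             # should match the dist() result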
1.2 Average-linkage clustering
row.names(nutrient) <- tolower(row.names(nutrient))
nutrient.scaled <- scale(nutrient)              # standardize before clustering
d2 <- dist(nutrient.scaled)
fit.average <- hclust(d2, method = 'average')   # average-linkage hierarchical clustering
plot(fit.average, hang = -1, cex = .8, main = 'Average Linkage Clustering')
Conclusion: the dendrogram only shows the similarities and dissimilarities of the foods' nutrient profiles; it does not by itself say how many clusters to keep.
1.3 Choosing the number of clusters
library(NbClust)
devAskNewPage(ask = TRUE)
nc <- NbClust(nutrient.scaled, distance = "euclidean",
              min.nc = 2, max.nc = 15, method = "average")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
        xlab = 'Number of Clusters', ylab = 'Number of Criteria',
        main = 'Number of Clusters Chosen by 26 Criteria')
Conclusion: four solutions receive the most votes from the criteria (2, 3, 5, and 15 clusters); the most suitable of these is then chosen.
1.4 Obtaining the final cluster solution
# Cluster membership
clusters <- cutree(fit.average, k = 5)
table(clusters)
# Describe the clusters
aggregate(nutrient, by = list(clusters = clusters), median)
aggregate(as.data.frame(nutrient.scaled), by = list(clusters = clusters), median)
plot(fit.average, hang = -1, cex = .8,
     main = 'Average Linkage Clustering\n5 Cluster Solution')
rect.hclust(fit.average, k = 5)
Conclusions:
1. sardines canned forms its own cluster because of its high calcium content
2. beef heart is also a singleton, rich in protein and iron
3. the cluster from beef roast to pork simmered is high in energy and fat
4. the cluster from clams raw to clams canned is relatively high in iron
5. the cluster from mackerel canned to bluefish baked is relatively low in iron
2. Partitioning cluster analysis
Case: the wine dataset from the rattle.data package.
2.1 K-means clustering of the wine data
# Use a scree plot of within-groups sum of squares to choose the number of clusters
wssplot <- function(data, nc = 15, seed = 1234){
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))   # total within-group SS for k = 1
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters',
       ylab = 'Within groups sum of squares')
}
data(wine, package = 'rattle.data')
head(wine)
df <- scale(wine[-1])        # drop the Type label and standardize the 13 measurements
wssplot(df)
library(NbClust)
set.seed(1234)
# Determine the number of clusters
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
set.seed(1234)
# Fit the k-means cluster analysis
fit.km <- kmeans(df, 3, nstart = 25)
fit.km$size
fit.km$centers
aggregate(wine[-1], by = list(cluster = fit.km$cluster), mean)
Conclusion: a three-cluster solution fits the data well.
# Use the adjusted Rand index to quantify the agreement between wine Type and the cluster solution
ct.km <- table(wine$Type, fit.km$cluster)
library(flexclust)
randIndex(ct.km)
Conclusion: the adjusted Rand index is about 0.9, indicating excellent agreement between the wine types and the cluster solution.
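For reference (a small sketch, not part of the original notes), the adjusted Rand index ranges from -1 (no agreement) to 1 (perfect agreement); a purely diagonal cross-tabulation, here with hypothetical counts matching the three wine types, attains the maximum:

perfect <- as.table(diag(c(59, 71, 48)))   # hypothetical perfect agreement
randIndex(perfect)                         # returns 1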
2.2 Partitioning around medoids
Because k-means clustering is based on means, it is sensitive to outliers; a more robust method is partitioning around medoids (PAM). In addition, k-means is generally limited to Euclidean distance, while PAM can work with other distance measures.
library(cluster)
set.seed(1234)
fit.pam <- pam(wine[-1], k = 3, stand = TRUE)   # PAM on the standardized variables
fit.pam$medoids                                 # the medoid observations
clusplot(fit.pam, main = 'Bivariate Cluster Plot')
ct.pam <- table(wine$Type, fit.pam$clustering)
randIndex(ct.pam)
Conclusion: the adjusted Rand index drops from the previous 0.9 to about 0.7, so PAM does not do as well as k-means here.
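To illustrate that PAM is not tied to Euclidean distance, a minimal sketch (an assumed variation, not from the original notes) refits the model with Manhattan distance via the metric argument:

set.seed(1234)
fit.pam.man <- pam(wine[-1], k = 3, metric = "manhattan", stand = TRUE)
table(wine$Type, fit.pam.man$clustering)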
3. Avoiding nonexistent clusters
3.1 Examining the dataset
library(fMultivar)
set.seed(1234)
df <- rnorm2d(1000, rho = .5)   # 1,000 draws from a bivariate normal with correlation 0.5
df <- as.data.frame(df)
plot(df, main = 'Bivariate Normal Distribution with rho=0.5')
Conclusion: the data contain no real clusters.
3.2 Estimating the number of clusters
library(NbClust)
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = 'kmeans')
dev.new()
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
Conclusion: the criteria suggest three clusters, even though none actually exist.
3.3 Plotting the cluster solution
library(ggplot2)
fit2 <- pam(df, k = 2)                 # extract two clusters with PAM
df$clustering <- factor(fit2$clustering)
ggplot(data = df, aes(x = V1, y = V2, color = clustering, shape = clustering)) +
  geom_point() +
  ggtitle('Clustering of Bivariate Normal Data')
Conclusion: PAM extracts two clusters from the bivariate normal data, even though no true clusters are present.
3.4 Analyzing the cluster solution
# Plot the cubic cluster criterion (CCC), the fourth column of NbClust's index matrix
plot(nc$All.index[,4], type = 'o', ylab = 'CCC',
     xlab = 'Number of Clusters', col = 'blue')
Conclusion: the CCC plot for the bivariate normal data confirms that no clusters are present; when the CCC values are all negative and decreasing for two or more clusters, the distribution is typically unimodal.
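The hard-coded index 4 assumes the CCC is the fourth criterion in All.index; a slightly safer lookup by column name (a small defensive sketch, assuming the matrix carries the index names) would be:

ccc <- nc$All.index[, "CCC"]   # select the CCC column by name rather than by position
plot(ccc, type = 'o', ylab = 'CCC', xlab = 'Number of Clusters', col = 'blue')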
II. Classification
Using machine learning to predict a binary outcome.
Case study: the Wisconsin breast cancer data. A training set is used to build logistic regression, decision tree, conditional inference tree, random forest, and support vector machine classifiers; a validation set is used to assess each model's effectiveness.
1. Preparing the data
loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep = "")
breast <- read.table(url, sep = ",", header = FALSE, na.strings = "?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
                   "shapeUniformity", "maginalAdhesion",
                   "singleEpithelialCellSize", "bareNuclei",
                   "blandChromatin", "normalNucleoli", "mitosis", "class")
df <- breast[-1]                                   # drop the ID column
df$class <- factor(df$class, levels = c(2, 4),
                   labels = c('benign', 'malignant'))
set.seed(1234)
train <- sample(nrow(df), 0.7 * nrow(df))          # 70/30 train/validation split
df.train <- df[train, ]
df.validate <- df[-train, ]
table(df.train$class)
table(df.validate$class)
2. Logistic regression
# Fit the logistic regression on the training set
fit.logit <- glm(class ~ ., data = df.train, family = binomial())
prob <- predict(fit.logit, df.validate, type = 'response')
# Classify the validation-set cases
logit.pred <- factor(prob > .5, levels = c(FALSE, TRUE),
                     labels = c('benign', 'malignant'))
# Assess predictive accuracy with a confusion matrix
logit.pref <- table(df.validate$class, logit.pred,
                    dnn = c('Actual', 'Predicted'))
logit.pref
Conclusion: the model correctly classifies about 97% of the validation cases.
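Where the 97% figure comes from (a small sketch, not in the original notes): overall accuracy is the proportion of cases on the diagonal of the confusion matrix:

sum(diag(logit.pref)) / sum(logit.pref)   # (TN + TP) / total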
3. Decision tree
library(rpart)
set.seed(1234)
# Grow the tree
dtree <- rpart(class ~ ., data = df.train, method = 'class',
               parms = list(split = 'information'))
plotcp(dtree)
# Prune the tree
dtree.pruned <- prune(dtree, cp = .0125)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104, fallen.leaves = TRUE,
    main = 'Decision Tree')
# Classify the validation-set cases
dtree.pred <- predict(dtree.pruned, df.validate, type = 'class')
dtree.pref <- table(df.validate$class, dtree.pred,
                    dnn = c('Actual', 'Predicted'))
dtree.pref
Conclusion: validation accuracy is about 96%.
4. Conditional inference tree
library(party)
library(partykit)
fit.tree <- ctree(class ~ ., data = df.train)
plot(fit.tree, main = 'Conditional Inference Tree')
ctree.pred <- predict(fit.tree, df.validate, type = 'response')
ctree.pref <- table(df.validate$class, ctree.pred,
                    dnn = c('Actual', 'Predicted'))
ctree.pref
Conclusion: validation accuracy is about 97%.
5. Random forest
library(randomForest)
set.seed(1234)
# Grow the forest
fit.forest <- randomForest(class ~ ., data = df.train,
                           na.action = na.roughfix, importance = TRUE)
importance(fit.forest, type = 2)    # variable importance as mean decrease in Gini
forest.pred <- predict(fit.forest, df.validate)   # classify the validation-set cases
forest.pref <- table(df.validate$class, forest.pred,
                     dnn = c('Actual', 'Predicted'))
forest.pref
Conclusion: validation accuracy is about 98%.
6. Support vector machine
library(e1071)
set.seed(1234)
fit.svm <- svm(class ~ ., data = df.train)
svm.pred <- predict(fit.svm, na.omit(df.validate))   # svm() cannot handle missing values
svm.pref <- table(na.omit(df.validate)$class, svm.pred,
                  dnn = c('Actual', 'Predicted'))
svm.pref
Conclusion: validation accuracy is about 96%.
7. Support vector machine with an RBF kernel
set.seed(1234)
# Tune the model over a grid of gamma and cost values
tuned <- tune.svm(class ~ ., data = df.train,
                  gamma = 10^(-6:1), cost = 10^(-10:10))
tuned
# Refit with the best parameters found
fit.svm <- svm(class ~ ., data = df.train, gamma = .01, cost = 1)
svm.pred <- predict(fit.svm, na.omit(df.validate))
svm.pref <- table(na.omit(df.validate)$class, svm.pred,
                  dnn = c('Actual', 'Predicted'))
svm.pref
Conclusion: validation accuracy is about 97%.
8. Writing a function to choose the best-performing classifier
performance <- function(table, n = 2){
  if(!all(dim(table) == c(2, 2))){
    stop('Must be a 2 x 2 table')
  }
  tn <- table[1, 1]   # true negatives
  fp <- table[1, 2]   # false positives
  fn <- table[2, 1]   # false negatives
  tp <- table[2, 2]   # true positives
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  ppp <- tp / (tp + fp)     # positive predictive value
  npp <- tn / (tn + fn)     # negative predictive value
  hitrate <- (tp + tn) / (tp + tn + fn + fp)
  result <- paste("Sensitivity = ", round(sensitivity, n),
                  "\nSpecificity = ", round(specificity, n),
                  "\nPositive Predictive Value = ", round(ppp, n),
                  "\nNegative Predictive Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, n), "\n", sep = "")
  cat(result)
}
performance(logit.pref)
performance(dtree.pref)
performance(ctree.pref)
performance(forest.pref)
performance(svm.pref)
Conclusion: among the classifiers above, the random forest performs best in this case.
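To compare the models side by side (a small convenience sketch, not from the original notes), overall accuracy can also be pulled from each confusion matrix directly:

tables <- list(logit = logit.pref, dtree = dtree.pref, ctree = ctree.pref,
               forest = forest.pref, svm = svm.pref)
sapply(tables, function(t) sum(diag(t)) / sum(t))   # overall accuracy per model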
III. Data mining with rattle
Case: predicting diabetes.
library(rattle)
rattle()
After setting up these variables, click Execute.
Select the Model tab, choose the conditional inference tree as the prediction model, and click Draw to generate the plot.
Use the Evaluate tab to assess the model.
Conclusion: only 35% of the diabetic patients are correctly identified; we could try a random forest or a support vector machine to see whether they fit better.