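References:
《復雜數據統計方法》 (Statistical Methods for Complex Data), online sources, and the R help files
When to use: the response variable is categorical, and the predictors include several categorical variables or categorical variables with many levels.
I.
(1) Overview and example
Data source: http://archive.ics.uci.edu/ml/datasets/Cardiotocography
Predictors: LB - FHR baseline (beats per minute)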
AC - # of accelerations per second
FM - # of fetal movements per second
UC - # of uterine contractions per second
DL - # of light decelerations per second
DS - # of severe decelerations per second
DP - # of prolonged decelerations per second
ASTV - percentage of time with abnormal short term variability
MSTV - mean value of short term variability
ALTV - percentage of time with abnormal long term variability
MLTV - mean value of long term variability
Width - width of FHR histogram
Min - minimum of FHR histogram
Max - Maximum of FHR histogram
Nmax - # of histogram peaks
Nzeros - # of histogram zeros
Mode - histogram mode
Mean - histogram mean
Median - histogram median
Variance - histogram variance
Tendency - histogram tendency
CLASS - FHR pattern class code (1 to 10)
Response:
NSP - fetal state class code (N=normal; S=suspect; P=pathologic)
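(2) Generating the cross-validation folds
1. 10-fold cross-validation: concept (from Baidu Baike)
10-fold cross-validation is a commonly used method for assessing the accuracy of an algorithm. The data set is split into ten parts; in turn, nine parts serve as training data and the remaining part as test data. Each run yields an accuracy (or error rate), and the average of the 10 accuracies (or error rates) is taken as the estimate of the algorithm's accuracy. Usually the whole procedure is repeated several times (for example, 10 runs of 10-fold cross-validation) and the results are averaged again to give the final estimate.
The data are split into 10 parts because extensive experiments, on many data sets and with different learning techniques, suggest that 10 folds is about the right choice for obtaining a good error estimate, and there is some theoretical support for this as well. This is not the final word, however, and the question is still debated; 5-fold or 20-fold cross-validation appears to give results that differ little from 10-fold.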
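The partitioning step described above can be sketched in a few lines of base R. This is only a toy illustration: the sample size `n = 25` and the seed are arbitrary assumptions, and no model is fitted.

```r
# Toy illustration of 10-fold partitioning (base R only).
set.seed(1)                                   # arbitrary seed, for reproducibility
n <- 25                                       # hypothetical number of observations
Z <- 10                                       # number of folds

fold_id <- sample(rep(1:Z, length.out = n))   # random fold label for every row
folds <- split(seq_len(n), fold_id)           # row numbers of each test fold

sizes <- sapply(folds, length)                # test-set size in each round
train_sizes <- numeric(Z)
for (i in 1:Z) {
  test_idx <- folds[[i]]                      # fold i is the test set
  train_idx <- setdiff(seq_len(n), test_idx)  # the other nine folds are the training set
  # A model would be fitted on train_idx and its error rate measured on test_idx here.
  train_sizes[i] <- length(train_idx)
}
```

Every row lands in exactly one test fold, so each observation is used for testing once and for training nine times.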
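# Split the rows of w into Z folds, stratified by the factor in column D
Fold=function(Z=10,w,D,seed=7777){
  n=nrow(w)
  d=1:n
  dd=list()
  e=levels(w[,D])   # class labels of the response
  T=length(e)       # number of classes
  set.seed(seed)
  for(i in 1:T){
    d0=d[w[,D]==e[i]]   # rows in class i
    j=length(d0)
    ZT=rep(1:Z,ceiling(j/Z))[1:j]
    id=cbind(sample(ZT,length(ZT)),d0)   # shuffled fold label, row number
    dd[[i]]=id
  }
  mm=list()
  for(i in 1:Z){
    u=NULL
    for(j in 1:T) u=c(u,dd[[j]][dd[[j]][,1]==i,2])
    mm[[i]]=u   # rows of fold i
  }
  return(mm)
}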
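# Read the data
w=read.csv("CTG.NAOMIT.csv")
# Convert the last three dummy (categorical) variables to factors
F=21:23   # column numbers of the three categorical variables
for(i in F) w[,i]=factor(w[,i])
D=23      # position of the response variable
Z=10      # number of folds
n=nrow(w) # number of rows
mm=Fold(Z,w,D,8888)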
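A list of folds such as mm is typically used in a loop that fits on nine folds and tests on the tenth. The following is a minimal sketch under stated assumptions: the built-in `iris` data stand in for the CTG file, `stratified_folds` is a hypothetical helper that mirrors the idea of Fold() rather than reproducing it, and the classifier is `rpart` from the rpart package.

```r
library(rpart)  # recommended package, shipped with standard R installations

# Hypothetical helper: assign fold labels separately within each class of y
stratified_folds <- function(y, Z = 10, seed = 8888) {
  set.seed(seed)
  fold <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    fold[idx] <- sample(rep(1:Z, length.out = length(idx)))
  }
  split(seq_along(y), fold)            # list of Z vectors of row numbers
}

w_demo <- iris                         # stand-in data; Species is already a factor
mm_demo <- stratified_folds(w_demo$Species, Z = 10)

err <- numeric(length(mm_demo))
for (i in seq_along(mm_demo)) {
  test <- mm_demo[[i]]                                   # rows held out in this round
  fit <- rpart(Species ~ ., data = w_demo[-test, ])      # train on the other folds
  pred <- predict(fit, w_demo[test, ], type = "class")   # predict the held-out rows
  err[i] <- mean(pred != w_demo$Species[test])           # error rate of this round
}
cv_error <- mean(err)                  # cross-validated misclassification rate
cv_error
```

Stratifying by class keeps the class proportions in every fold close to those of the full data, which matters when some classes are rare.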
II. Decision tree classification (classification trees)
library(rpart.plot)  # also attaches rpart
(a=rpart(NSP~.,w))   # fit a decision tree to the full data set and print it
rpart.plot(a,type=2,extra=4)
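Besides printing and plotting the tree, it is often useful to compare fitted and observed classes. The sketch below assumes the built-in `iris` data as a stand-in for w (with Species in the role of NSP) and computes the error on the training data itself, which tends to be optimistic compared with a cross-validated estimate.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris)          # classification tree on all rows
pred <- predict(fit, iris, type = "class")      # fitted class for every row

# Confusion matrix: rows are observed classes, columns are predicted classes
conf_mat <- table(observed = iris$Species, predicted = pred)
conf_mat

# Resubstitution (training) error rate: share of off-diagonal counts
resub_err <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
resub_err
```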
Explanation of the rpart.plot arguments:
x :
An rpart object. The only required argument.
type:
Type of plot. Five possibilities:
0 The default in the help text quoted here (in recent versions of rpart.plot the default is type=2). Draw a split label at each split and a node label at each leaf.
1 Label all nodes, not just leaves. Similar to text.rpart's all=TRUE.
2 Like 1 but draw the split labels below the node labels. Similar to the plots in the CART book.
3 Draw separate split labels for the left and right directions.
4 Like 3 but label all nodes, not just leaves. Similar to text.rpart's fancy=TRUE. See also clip.right.labs.
extra :
Display extra information at the nodes. Possible values:
0 No extra information (the default in the help text quoted here; in recent versions of rpart.plot the default is extra="auto").
1 Display the number of observations that fall in the node (per class for class objects; prefixed by the number of events for poisson and exp models). Similar to text.rpart's use.n=TRUE.
2 Class models: display the classification rate at the node, expressed as the number of correct classifications and the number of observations in the node. Poisson and exp models: display the number of events.
3 Class models: misclassification rate at the node, expressed as the number of incorrect classifications and the number of observations in the node.
4 Class models: probability per class of observations in the node (conditioned on the node, sum across a node is 1).
5 Class models: like 4 but do not display the fitted class.
6 Class models: the probability of the second class only. Useful for binary responses.
7 Class models: like 6 but do not display the fitted class.
8 Class models: the probability of the fitted class.
9 Class models: the probabilities times the fraction of observations in the node (the probability relative to all observations, sum across all leaves is 1).
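For a class model, the per-class probabilities that extra=4 prints in the leaves should match what `predict(..., type = "prob")` returns for the observations falling in each leaf. A small sketch, assuming the built-in `iris` data rather than the CTG data:

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris)
prob <- predict(fit, iris, type = "prob")  # one row per observation, one column per class

unique(prob)          # one distinct probability vector per leaf
range(rowSums(prob))  # probabilities within a node sum to 1
```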
branch:
Controls the shape of the branch lines. Specify a value between 0 (V shaped branches) and 1 (square shouldered branches). Default is if(fallen.leaves) 1 else .2.
[Figures: the same tree drawn with branch=0 and with branch=1]
digits :
The number of significant digits in displayed numbers. Default 2.
rpart.plot(a,extra=4,digits=4)
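The arguments described so far can be combined in a single call. The sketch below is illustrative only: it assumes the built-in `iris` data instead of w, and it skips the plots if the rpart.plot package is not installed.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris)

if (requireNamespace("rpart.plot", quietly = TRUE)) {
  # Split labels below the node labels, per-class probabilities,
  # V-shaped branches, three significant digits
  rpart.plot::rpart.plot(fit, type = 2, extra = 4, branch = 0, digits = 3)

  # Separate left/right split labels on all nodes, per-class counts,
  # square-shouldered branches
  rpart.plot::rpart.plot(fit, type = 4, extra = 1, branch = 1)
} else {
  message("rpart.plot is not installed; skipping the plots")
}
```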