ML: Random Forest (1)


Introduction to Random Forest (Simplified)

With the increase in computational power, we can now choose algorithms that perform very intensive calculations. One such algorithm is “Random Forest”, which we will discuss in this article. While the algorithm is very popular in competitions (e.g. the ones running on Kaggle), the end output of the model is like a black box and hence should be used judiciously.

Before going any further, here is an example on the importance of choosing the best algorithm.

Importance of choosing the right algorithm

# Annotator's note: the author uses a movie he saw, Edge of Tomorrow, to illustrate what the best possible algorithm would look like. [The bracketed passage below is a very apt analogy.]

Yesterday, I saw a movie called “Edge of Tomorrow”. I loved the concept and the thought process behind the plot of this movie. Let me summarize the plot (without commenting on the climax, of course). Unlike other sci-fi movies, this movie revolves around one single power which is given to both sides (hero and villain): the ability to reset the day.

The human race is at war with an alien species called “Mimics”. The Mimics are described as a far more evolved alien civilization. The entire Mimic civilization is like a single complete organism. It has a central brain called “Omega” which commands all other organisms in the civilization and stays in contact with all of them every single second. “Alpha” is the main warrior species (like the nervous system) of this civilization and takes commands from “Omega”. “Omega” has the power to reset the day at any point in time.

Now, let’s wear the hat of a predictive analyst to analyze this plot. If a system has the ability to reset the day at any point in time, it will use this power whenever any of its warriors dies. Hence there will be no single battle in which an Alpha actually dies, and the brain “Omega” will repeatedly test for the best-case scenario to [maximize the deaths on the human side while constraining the number of Alpha deaths to zero] every single day. You can imagine this as “THE BEST” predictive algorithm ever made. It is literally impossible to defeat such an algorithm.

Let’s now get back to “Random Forests” using a case study.

Case Study

The following is the distribution of annual income Gini coefficients across different countries (a high Gini coefficient means a large gap between rich and poor in that country) :

[Figure: OECD income inequality (Gini coefficients), 2013]

Mexico has the second highest Gini coefficient and hence a very large gap between the annual incomes of rich and poor. Our task is to come up with an accurate predictive algorithm to estimate the annual income bracket of each individual in Mexico. The income brackets are as follows :

1. Below $40,000

2. $40,000 – $150,000

3. More than $150,000

The following information is available for each individual :

1. Age, 2. Gender, 3. Highest educational qualification, 4. Working in Industry, 5. Residence in Metro/Non-metro

We need to come up with an algorithm to give an accurate prediction for an individual who has the following traits :

1. Age : 35 years, 2. Gender : Male, 3. Highest Educational Qualification : Diploma holder, 4. Industry : Manufacturing, 5. Residence : Metro

In other words, our task is to use five fields about a person (age, gender, education, industry and place of residence) to predict that person's income bracket.

We will only talk about random forest to make this prediction in this article.

The algorithm of Random Forest

Random forest is like a bootstrapping algorithm applied to the decision tree (CART) model. Say we have 1000 observations in the complete population with 10 variables. Random forest builds multiple CART models, each on a different sample and a different set of initial variables. For instance, it will take a random sample of 100 observations and 5 randomly chosen initial variables (the choice is random every time) to build a CART model. It will repeat the process (say) 10 times and then make a final prediction for each observation. The final prediction is a function of the individual predictions; in the simplest case it is just their mean.
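The simplified procedure described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not how the randomForest package works internally: it uses the built-in iris data instead of a 1000-row population, the tree count, sample size and number of variables per tree are made-up values, and it picks the variables once per tree as the text describes, whereas standard random forest implementations pick candidate variables again at every split.

```r
# Sketch only: a hand-rolled "forest" of CART trees that follows the
# simplified description (random rows + random variables for each tree).
library(rpart)

set.seed(42)                        # assumed seed, for reproducibility only
data(iris)
predictors <- setdiff(names(iris), "Species")
n_trees <- 10                       # "repeat the process (say) 10 times"
n_rows  <- 100                      # random sample of 100 observations
n_vars  <- 2                        # hypothetical: 2 of the 4 predictors per tree

# Running sum of class probabilities, one row per observation
prob_sum <- matrix(0, nrow = nrow(iris), ncol = nlevels(iris$Species),
                   dimnames = list(NULL, levels(iris$Species)))

for (i in seq_len(n_trees)) {
  rows <- sample(nrow(iris), n_rows, replace = TRUE)  # bootstrap-style sample
  vars <- sample(predictors, n_vars)                  # random variable subset
  fit  <- rpart(reformulate(vars, response = "Species"),
                data = iris[rows, ], method = "class")
  prob_sum <- prob_sum + predict(fit, newdata = iris, type = "prob")
}

avg_prob   <- prob_sum / n_trees    # final prediction = mean of the predictions
final_pred <- colnames(avg_prob)[max.col(avg_prob, ties.method = "first")]
accuracy   <- mean(final_pred == iris$Species)
```

Here `avg_prob` holds the averaged class probabilities and `final_pred` the class with the highest average for each flower. The accuracy is measured on the same data the trees were grown from, so it says little about performance on new data.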

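The concept of bootstrapping

Bootstrapping takes a limited sample and, through repeated resampling, builds new samples that are sufficient to represent the distribution of the parent population.

From a single sample we can compute only one value of a statistic (for example the mean), so we cannot know how that statistic is distributed. With the bootstrap, however, we can simulate an approximate distribution of the mean. Once we have that distribution, a lot becomes possible (for example, using the result to make inferences about the actual population).

Implementing the bootstrap is simple. Suppose the sample you drew has size n :

Draw n observations from the original sample with replacement; these n draws form one new sample. Repeat this to form many new samples and compute the statistic on each of them, which gives a distribution for the statistic. How many new samples are appropriate? About 1000 is usually enough; if computation is cheap or higher precision is required, increase the number.

Finally, what does the accuracy of this method depend on? I am not sure yet. My guess is that it depends on the size n of the original sample and on the number of bootstrap samples, with larger values giving more precise results. I do not know the details; running experiments on a few known distributions should make it clear.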
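A minimal sketch of the bootstrap procedure, using only base R. The data (sepal lengths from the built-in iris dataset), the seed and the choice of 1000 resamples are assumptions made for illustration.

```r
# Bootstrap the sample mean: resample n values with replacement, B times.
set.seed(1)                         # assumed seed, for reproducibility only
x <- iris$Sepal.Length              # the "original sample", n = 150
n <- length(x)
B <- 1000                           # number of bootstrap samples

boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))

boot_se <- sd(boot_means)                        # bootstrap standard error of the mean
boot_ci <- quantile(boot_means, c(0.025, 0.975)) # simple percentile interval
```

The vector `boot_means` approximates the sampling distribution of the mean; `hist(boot_means)` would show its shape. Rerunning with a larger `B` or a different sample is one way to try the experiment suggested in the note above.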

Back to Case study

Disclaimer : The numbers in this article are illustrative.

Mexico has a population of 118 million. Say the random forest algorithm picks 10k observations with only one variable (for simplicity) to build each CART model. In total, we are looking at 5 CART models being built with different variables. In a real-life problem, you will have a larger population sample and different combinations of input variables.

Salary bands :

Band 1 : Below $40,000

Band 2 : $40,000 – $150,000

Band 3 : More than $150,000

Following are the outputs of the 5 different CART models :

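CART 1 : Variable Age

[Figure rf1: output of CART 1]

CART 2 : Variable Gender

[Figure rf2: output of CART 2]

CART 3 : Variable Education

[Figure rf3: output of CART 3]

CART 4 : Variable Residence

[Figure rf4: output of CART 4]

CART 5 : Variable Industry

[Figure rf5: output of CART 5]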

Using these 5 CART models, we need to come up with a single set of probabilities of belonging to each of the salary classes. For simplicity, we will just take the mean of the probabilities in this case study. Besides the simple mean, a vote method can also be used to come up with the final prediction. To arrive at the final prediction, let’s locate the following profile in each CART model :

# Below is a single observation (i.e. one person)

1. Age : 35 years, 2. Gender : Male, 3. Highest Educational Qualification : Diploma holder, 4. Industry : Manufacturing, 5. Residence : Metro (these 5 values are the new input from which we need to predict the person's salary band)

For each of these CART models, following is the distribution across salary bands :

[Table image DF: probability of each salary band from each CART model, with the final averaged probabilities]

Note: the table above gives, for the new input, the probability distribution over salary bands from each single-variable model, together with the final probabilities obtained by averaging them.

The final probability is simply the average of the probabilities for the same salary band across the different CART models. As this analysis shows, there is a 70% chance of this individual falling in class 1 (less than $40,000) and around a 24% chance of falling in class 2.
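The averaging and voting steps can be sketched as follows. The per-model probabilities in the original table were in an image that is not reproduced here, so the numbers below are hypothetical placeholders, chosen only so that their averages match the 70% and 24% quoted in the text.

```r
# HYPOTHETICAL per-model probabilities for the 35-year-old profile;
# only the averages (0.70, 0.24) come from the article.
probs <- matrix(c(0.90, 0.08, 0.02,    # CART 1: Age
                  0.70, 0.27, 0.03,    # CART 2: Gender
                  0.80, 0.15, 0.05,    # CART 3: Education
                  0.60, 0.32, 0.08,    # CART 4: Residence
                  0.50, 0.38, 0.12),   # CART 5: Industry
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("CART", 1:5),
                                c("Band1", "Band2", "Band3")))

# Method 1: simple mean of the probabilities in each salary band
mean_prob <- colMeans(probs)

# Method 2: voting, where each model votes for its most probable band
winners <- colnames(probs)[max.col(probs, ties.method = "first")]
votes   <- table(factor(winners, levels = colnames(probs)))
```

With these placeholder numbers both methods pick Band 1. In general the two can disagree, for example when a few models are very confident about one band while a narrow majority leans slightly toward another.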

End Notes

In many scenarios, random forest gives much more accurate predictions than simple CART/CHAID or regression models. These cases generally have a high number of predictive variables and a huge sample size. This is because it captures the variance of several input variables at the same time and enables a high number of observations to participate in the prediction. In some of the coming articles, we will discuss the algorithm in more detail and show how to build a simple random forest in R.


Comparing a CART model to Random Forest (Part 1)

I created my first simple regression model with my father in 8th standard (year: 2002) in MS Excel. Obviously, my contribution to that model was minimal, but I really enjoyed the graphical representation of the data. We tried validating all the assumptions etc. for this model. By the end of the exercise, we had 5 sheets of the simple regression model on 700 data points. The entire exercise was complex enough to confuse any person of average IQ (annotator: the author's IQ must be a bit above average!). When I look at my models today, which are built on millions of observations and use complex statistics behind the scenes, I realize how machine learning with sophisticated tools (like SAS, SPSS, R) has made our life easy.

Having said that, many people in the industry do not bother about the complex statistics that go on behind the scenes (annotator: I do care!). It is very important to understand the predictive power of each technique. No model is perfect in all scenarios. Hence, we need to understand the data and the surrounding ecosystem before coming up with a model recommendation (annotator: this advice is exactly what I advocate).

In this article, we will compare two widely used techniques, CART and random forest. The basics of random forest were covered in my last article. We will take a case study to build a strong foundation for this concept and use R to do the comparison. The dataset used in this article is a built-in dataset of R.

As the concept is pretty lengthy, we have broken this article down into two parts!

Background on Dataset “Iris”

Data set “iris” gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of Iris. The dataset has 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species. We intend to predict the species based on the 4 flower characteristic variables.

We will first load the dataset into R and then look at some of the key statistics. You can use the following code to do so.

library(ggplot2) # provides qplot
data(iris)
# look at the dataset
summary(iris)
# visually look at the dataset
qplot(Petal.Length,Petal.Width,colour=Species,data=iris)

[Figure plot1: Petal.Width against Petal.Length, colored by Species]

The three species seem to be well segregated from each other. The accuracy in predicting borderline cases determines the predictive power of the model. In this case, we will load two useful packages for making a CART model.

library(rpart) # CART (recursive partitioning) models
library(caret) # data splitting and the train() wrapper

After loading the libraries, we will divide the population into two sets: training and validation. We do this to make sure that we do not overfit the model. In this case, we use a 50-50 split for training and validation.

Generally, we keep training heavier to make sure that we capture the key characteristics. You can use the following code to make this split.

train.flag <- createDataPartition(y=iris$Species,p=0.5,list=FALSE) # a handy function from caret
training <- iris[train.flag,]
Validation <- iris[-train.flag,]
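Because the split is random, every run produces different training and validation sets, and therefore slightly different misclassification counts. Calling `set.seed()` before `createDataPartition` makes the split repeatable. The sketch below shows the same idea, a 50-50 split stratified by species, using base R only; it is an approximation of what the split does, not caret's implementation, and the seed and object names (`training_demo`, `validation_demo`) are made up for illustration.

```r
# Reproducible 50-50 split, stratified by Species, in base R.
set.seed(123)                                   # assumed seed; any fixed value works
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample half of the row indices within each species
train_idx <- unname(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = floor(0.5 * length(idx)))
})))

training_demo   <- iris[train_idx, ]
validation_demo <- iris[-train_idx, ]

table(training_demo$Species)                    # 25 flowers of each species
```

Stratifying keeps the class proportions identical in both halves, which matters more when the classes are unbalanced than it does for iris.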

Building a CART model

Once we have the two data sets and a basic understanding of the data, we build a CART model. We have used the “caret” and “rpart” packages to build this model. However, the traditional representation of the CART model is not graphically appealing in R. Hence, we have used a package called “rattle” to draw this decision tree. “Rattle” builds fancier and cleaner trees, which can be easily interpreted. Use the following code to build a tree and check it graphically (three packages are used in total: caret, rpart and rattle) :

modfit <- train(Species~.,method="rpart",data=training) 
library(rattle)
fancyRpartPlot(modfit$finalModel)

[Figure tree: decision tree drawn by fancyRpartPlot]

Validating the model

Now we need to check the predictive power of the CART model we just built. Here, we use the discordance rate (that is, the misclassifications made by the tree) as the decision criterion. We use the following code to do the same :

train.cart<-predict(modfit,newdata=training)
table(train.cart,training$Species)

train.cart   setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         22         0
  virginica       0          3        25

# Misclassification rate = 3/75

Only 3 misclassified observations out of 75 signifies good predictive power. In general, a model with a misclassification rate of less than 30% is considered to be a good model, but the range for a good model depends on the industry and the nature of the problem. Once we have built the model, we validate it on a separate data set to make sure that we are not overfitting. If we do overfit the model, validation will show a sharp decline in predictive power. It is also recommended to do an out-of-time validation of the model, which makes sure that the model is not time dependent. For instance, a model built in a festive season might not hold in regular times. For simplicity, we will only do an in-time validation of the model. We use the following code to do an in-time validation :

pred.cart<-predict(modfit,newdata=Validation)
table(pred.cart,Validation$Species)

pred.cart    setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         22         1
  virginica       0          3        24

# Misclassification rate = 4/75

As the calculations above show, the predictive power decreased in validation as compared to training. This is generally true in most cases, because the model is trained on the training data set and just overlaid on the validation data set. But it hardly matters whether the predictive power in validation is lower or higher than in training. What we need to check is that they are close enough. In this case, the two misclassification rates (3/75 and 4/75) are really close to each other. Hence, we see a stable CART model in this case study.
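The two rates can be computed directly from the confusion matrices rather than counted by hand. The sketch below re-enters the two tables printed above (rows are predictions, columns are actual species); the helper name `misclass_rate` is made up for this example.

```r
# Confusion matrices copied from the training and validation tables above.
classes <- c("setosa", "versicolor", "virginica")

train_cm <- matrix(c(25,  0,  0,
                      0, 22,  0,
                      0,  3, 25),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(predicted = classes, actual = classes))

valid_cm <- matrix(c(25,  0,  0,
                      0, 22,  1,
                      0,  3, 24),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(predicted = classes, actual = classes))

# Misclassification rate = share of observations off the diagonal
misclass_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

train_rate <- misclass_rate(train_cm)   # 3/75 = 0.04
valid_rate <- misclass_rate(valid_cm)   # 4/75, about 0.053
gap        <- valid_rate - train_rate   # 1/75, about 0.013
```

The same function works on the object returned by `table(pred.cart, Validation$Species)`, since a two-way table behaves like a matrix here. A small `gap` is what the text means by a stable model.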

Let’s now try to visualize the cases for which the prediction went wrong. The following code plots them :

correct <- pred.cart == Validation$Species
qplot(Petal.Length,Petal.Width,colour=correct,data=Validation)

[Figure misclassify: validation observations colored by whether the prediction was correct]

As you see from the graph, the predictions which went wrong were actually the borderline cases. We have already discussed that these are the cases which make or break the comparison between models. Most models will be able to categorize observations that are far away from each other. It takes a sharp model to distinguish these borderline cases.

End Notes :

In the next article, we will solve the same problem using a random forest algorithm. We hope that random forest will be able to make even better predictions for these borderline cases. But we can never generalize the order of predictive power between CART and random forest, or indeed any predictive algorithms, because every model has its own strengths. Random forest generally tends to have very high accuracy on the training population, because it uses many different characteristics to make a prediction. For the same reason, it sometimes overfits the data. We will see these observations graphically in the next article and talk in more detail about scenarios where random forest or CART comes out to be the better predictive model.

Comparing a Random Forest to a CART model (Part 2)

Random forest is one of the most commonly used algorithms in Kaggle competitions. Along with good predictive power, random forest models are pretty simple to build. We have previously explained the algorithm of a random forest (Introduction to Random Forest). This article is the second part of the series comparing a random forest with a CART model. In the first article, we used a built-in R dataset to predict the species of a flower. In this article we will build a random forest model on the same dataset and compare its performance with the previously built CART model. I did this experiment a week back and found the results very insightful. I recommend reading the first part (the last article) before reading this one.

Background on Dataset “Iris”

Data set “iris” gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of Iris. The dataset has 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species. We intend to predict the species based on the 4 flower characteristic variables.

We will first load the dataset into R and then look at some of the key statistics. You can use the following code to do so.

library(ggplot2) # provides qplot
data(iris)
# look at the dataset
summary(iris)
# visually look at the dataset
qplot(Petal.Length,Petal.Width,colour=Species,data=iris)

Results using CART Model

The first step we follow in any modeling exercise is to split the data into training and validation. You can use the following code for the split. (We will use the same split for random forest as well.)

Note: the data is split in the same way as for the CART model earlier.

train.flag <- createDataPartition(y=iris$Species,p=0.5,list=FALSE)
training <- iris[train.flag,]
Validation <- iris[-train.flag,]

The CART model gave the following results in training and validation :

Misclassification rate in training data = 3/75

Misclassification rate in validation data = 4/75

As you can see, the CART model gave decent results in terms of accuracy and stability. We will now fit the random forest algorithm on the same training dataset and validate it using the same validation dataset.

Building  a Random forest model

We have used the “caret”, “randomForest” and “randomForestSRC” packages to build this model. You can use the following code to generate a random forest model on the training dataset.

> library(randomForest)
> library(randomForestSRC)
> library(caret)
> modfit <- train(Species~ .,method="rf",data=training)
> train.RF<- predict(modfit,training)
> table(train.RF,training$Species)

train.RF     setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         0
  virginica       0          0        25

Misclassification rate in training data = 0/75 [This is simply awesome!]

Validating the model

Having built such an accurate model, we would like to make sure that we are not overfitting the model on the training data. This is done by validating the same model on an independent data set. We use the following code to do the same :

> Pred.RF<-predict(modfit,newdata=Validation)
> table(Pred.RF,Validation$Species)

Pred.RF      setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         22         0
  virginica       0          3        25

# Misclassification rate = 3/75

Only 3 misclassified observations out of 75 signifies good predictive power. However, we see a significant drop in the predictive power of this model when we compare it to the training misclassification. (Annotator's note: compared with the CART model's misclassification rate, the random forest model shows higher predictive power.)

Comparison between the two models

Till this point, everything was by the book. Here comes the tricky part. Once you have all the performance metrics, you need to select the best model as per your business requirement. Apart from business requirements, we will make this judgement based on 3 criteria in this case :

1. Stability : The model should have similar performance metrics across both training and validation. This is essential because a business can live with lower accuracy but not with lower stability. We will give the highest weight to stability. For this case let’s take it as 5.

2. Performance on Training data : This is an important metric, but nothing conclusive can be said based on it alone, because an overfit model is unacceptable yet will get a very high score on this parameter. Hence, we will give a low weight to this parameter (say 2).

3. Performance on Validation data : This metric catches overfit models and hence is an important metric. We will weight it higher than training performance and lower than stability. For this case let’s take it as 3.

Note that the weights and scores depend entirely on the business case. Following is a score table as per my judgement in this case.

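[Table image Comparison: scores of CART and random forest on stability, training performance and validation performance]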
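The weighted comparison can be worked through as below. The weights (5, 2, 3) come from the three criteria in the text. The individual scores were in a table image that is not reproduced here, so the scores in this sketch are hypothetical placeholders on an assumed 1 to 5 scale, picked only to show the arithmetic.

```r
# Weights from the text: stability 5, training performance 2, validation 3.
weights <- c(stability = 5, training = 2, validation = 3)

# HYPOTHETICAL scores (1 = worst, 5 = best); the article's own scores
# are not available.
scores <- rbind(CART         = c(5, 3, 3),
                RandomForest = c(3, 5, 4))
colnames(scores) <- names(weights)

# Weighted total per model = sum over criteria of score * weight
total <- drop(scores %*% weights)
best  <- names(which.max(total))
```

With these placeholder scores CART totals 40 and random forest 37, so stability's heavy weight outweighs the random forest's better accuracy. Different scores or weights can easily reverse the ranking, which is why the text stresses that both depend on the business case.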

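A question: how were the scores for stability, training performance and validation performance of the CART and random forest models in this table obtained? Were they assigned subjectively, or calculated from some probabilities? [I think the author simply assigned them subjectively. If you see it differently, comments are welcome so that we can all improve!]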

===================================================================================

As you can see from the table, although random forest gives better performance, I will still go ahead and use the CART model because of the stability factor. Another factor in favor of the CART model is the easy business justification. Random forest is very difficult to explain to people working in the field. CART models are simple cuts which can be justified by simple business reasons. But the choice of model is entirely dependent on the business requirement.

End Notes

Every model has its own strengths. Random forest, as seen from this case study, has very high accuracy on the training population, because it uses many different characteristics to make a prediction. For the same reason, it sometimes overfits the data. The CART model, on the other hand, is a simple model built from criterion cuts. This might be an oversimplification in some cases, but it works pretty well in most business scenarios. Although the choice of model may depend on the business requirement, it is always good to compare the performance of different models before making this call.

Entire scripts summarized

 

1) CART model

# load the iris dataset and inspect it
library(ggplot2) # provides qplot
data(iris)
# look at the dataset
summary(iris)
# visually look at the dataset
qplot(Petal.Length,Petal.Width,colour=Species,data=iris)
# load the rpart and caret packages
library(rpart)
library(caret)
# build the training and validation sets
train.flag <- createDataPartition(y=iris$Species,p=0.5,list=FALSE) # from the caret package
training <- iris[train.flag,]
Validation <- iris[-train.flag,]
# build the CART model
modfit <- train(Species~.,method="rpart",data=training)
library(rattle) # rattle is loaded to draw a nicer decision tree
fancyRpartPlot(modfit$finalModel)
# validate the model on the training and validation sets
train.cart<-predict(modfit,newdata=training)
table(train.cart,training$Species)
# Results: Misclassification rate = 3/75
pred.cart<-predict(modfit,newdata=Validation)
table(pred.cart,Validation$Species)
# Results: Misclassification rate = 4/75
# visualize the observations to see which ones were misclassified
correct <- pred.cart == Validation$Species
qplot(Petal.Length,Petal.Width,colour=correct,data=Validation)

2) Random Forest

# load the iris dataset and inspect it
library(ggplot2) # provides qplot
data(iris)
# look at the dataset
summary(iris)
# visually look at the dataset
qplot(Petal.Length,Petal.Width,colour=Species,data=iris)
# load the caret, randomForest and randomForestSRC packages
library(randomForest)
library(randomForestSRC)
library(caret)
# build the training and validation sets
train.flag <- createDataPartition(y=iris$Species,p=0.5,list=FALSE)
training <- iris[train.flag,]
Validation <- iris[-train.flag,]
# build the RF model
modfit <- train(Species~ .,method="rf",data=training)
# validate the model on the training and validation sets
train.RF <- predict(modfit,training)
table(train.RF,training$Species)
# Results: Misclassification rate in training data = 0/75
pred.RF<-predict(modfit,newdata=Validation)
table(pred.RF,Validation$Species)
# Results: Misclassification rate = 3/75
 

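Part I have not understood yet [what is the model scoring based on?]

[Table image Comparison: scores of CART and random forest on stability, training performance and validation performance]

References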

http://www.analyticsvidhya.com/blog/2014/06/comparing-random-forest-simple-cart-model/(Part1)

http://www.analyticsvidhya.com/blog/2014/06/comparing-random-forest-simple-cart-model/(Part2)

My code practice

#Step1: Load all the required packages up front

> require(ggplot2) # qplot from this package is used for plotting
> require(caret)
> require(rpart)
> require(rattle) # fancyRpartPlot from this package draws a nicer decision tree
> require(randomForest)
> require(randomForestSRC)

#Step2: Load the iris example dataset and inspect it

> data(iris)
> summary(iris)
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
> qplot(Petal.Length,Petal.Width,colour=Species,shape=Species,data=iris)
[Figure: Petal.Width against Petal.Length, with color and shape by Species]

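#Step3: Build the training and validation sets

> train.flag <- createDataPartition(y=iris$Species,p=0.5,list=FALSE)

# p is the proportion of the data to put into the training set; y is the vector of outcomes.
# createDataPartition is one of the data-splitting functions in the caret package; the rest of that family will be covered in another article.
> training <- iris[train.flag,]
> Validation <- iris[-train.flag,]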
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> head(train.flag)
     Resample1
[1,]         1
[2,]         3
[3,]         4
[4,]         5
[5,]         6
[6,]         7
> head(training)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
7          4.6         3.4          1.4         0.3  setosa
> head(Validation)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2           4.9         3.0          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
19          5.7         3.8          1.7         0.3  setosa
21          5.4         3.4          1.7         0.2  setosa

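Note: as the output shows, createDataPartition splits the data at random.

#Step4: Build the CART model and visualize the resulting decision tree

> modfit <- train(Species~.,method="rpart",data=training) # train comes from the caret package; see the other article for its usage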

### train returns a model object with the following components (the important ones were highlighted in red in the original post)

method: the modeling method
modfit$method: "rpart"
modelInfo: information about the model, containing the functions grid, loop, fit, predict, prob, predictors, varImp
modelType: the type of model
modfit$modelType: "Classification"
results: the fitting results
    cp  Accuracy     Kappa AccuracySD   KappaSD
1 0.00 0.9418214 0.9105524 0.05087244 0.0808062
2 0.48 0.6461598 0.4827485 0.17475020 0.2376012
3 0.50 0.5393118 0.3349623 0.13902303 0.1651445
pred
modfit$pred: NULL
bestTune: the best value found for the cp parameter
modfit$bestTune: cp = 0
call: the call used to train the model, train.formula(form = Species ~ ., data = training, method = "rpart")
dots: list()
metric: the metric used to judge the model
modfit$metric: "Accuracy"
control: the resampling scheme used and the data drawn in each resample, e.g. $method: "boot"; $repeats: 25; $p: 0.75; $index (indices of the rows drawn in each resample); $indexout (indices of the rows not drawn)
finalModel: the final model, a simple classification tree (used with fancyRpartPlot to draw the decision tree)
preProcess: preprocessing, NULL
trainingData: the data used to train the model
resample: the accuracy computed on each resample
    Accuracy     Kappa   Resample
1  0.9032258 0.8514377 Resample09
2  0.8888889 0.8272921 Resample05
... (rows omitted) ...
24 0.9642857 0.9459459 Resample21
25 0.9677419 0.9514107 Resample17
resampledCM
     cp cell1 cell2 cell3 cell4 cell5 cell6 cell7 cell8 cell9   Resample
1  0.00     7     0     0     0     7     3     0     0    13 Resample01
2  0.44     7     0     0     0    10     0     0    13     0 Resample01
... (rows omitted) ...
74 0.44     7     0     0     0     8     0     0     8     0 Resample25
75 0.50     7     0     0     0     8     0     0     8     0 Resample25
perfNames: performance metric names
modfit$perfNames: "Accuracy" "Kappa"
maximize
modfit$maximize: TRUE
yLimits: NULL
times: the time taken to train the model
$everything
   user  system elapsed
   1.01    0.03    1.05
terms
coefnames: the predictors Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
xlevels

> fancyRpartPlot(modfit$finalModel) # from the rattle package; draws a nicer rpart decision tree (note: the rpart.plot package also needs to be loaded here)
[Figure: decision tree drawn by fancyRpartPlot]

# Step5: Model validation

#1) Predict Species from the predictors in the training data
> train.cart <- predict(modfit,newdata=training)
> train.cart   # predicted values
 [1] setosa setosa setosa setosa setosa setosa setosa
 [8] setosa setosa setosa setosa setosa setosa setosa
[15] setosa setosa setosa setosa setosa setosa setosa
[22] setosa setosa setosa setosa versicolor versicolor versicolor
[29] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[36] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[43] virginica versicolor versicolor versicolor versicolor versicolor versicolor
[50] versicolor virginica virginica virginica virginica versicolor virginica
[57] virginica virginica virginica virginica virginica versicolor virginica
[64] virginica virginica virginica virginica virginica virginica virginica
[71] virginica virginica virginica virginica virginica
Levels: setosa versicolor virginica

> training$Species   # original values
 [1] setosa setosa setosa setosa setosa setosa setosa
 [8] setosa setosa setosa setosa setosa setosa setosa
[15] setosa setosa setosa setosa setosa setosa setosa
[22] setosa setosa setosa setosa versicolor versicolor versicolor
[29] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[36] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[43] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[50] versicolor virginica virginica virginica virginica virginica virginica
[57] virginica virginica virginica virginica virginica virginica virginica
[64] virginica virginica virginica virginica virginica virginica virginica
[71] virginica virginica virginica virginica virginica
Levels: setosa versicolor virginica

Comparing the predicted values with the original values identifies the wrongly predicted points (positions 43, 55 and 62; highlighted in red in the original post).

> table(train.cart,training$Species)
train.cart   setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         24         2
  virginica       0          1        23
# Rows are the predicted classes and columns are the original classes.
# So 2 observations that are actually virginica were predicted as versicolor, and 1 versicolor was predicted as virginica,
# giving a misclassification rate of (2+1)/(25+25+25) = 3/75

#2) Predict Species from the predictors in the validation set

> pred.cart<-predict(modfit,newdata=Validation)
> table(pred.cart,Validation$Species)
pred.cart    setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         7
  virginica       0          0        18
# By the same reasoning: 7 wrong predictions, misclassification rate = 7/75

# Step6: Visualize the observations to see which ones were misclassified

> correct <- pred.cart == Validation$Species
> correct
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
[71] TRUE TRUE TRUE FALSE TRUE
> qplot(Petal.Length,Petal.Width,colour=correct,data=Validation)

[Figure: validation observations colored by whether the CART prediction was correct]


# Step7: Build the RF model

> modfit <- train(Species~ .,method="rf",data=training) # the resulting modfit object has the same components as in the CART case

# Step8: Model validation

> pred <- predict(modfit,training)
> pred
 [1] setosa setosa setosa setosa setosa setosa setosa
 [8] setosa setosa setosa setosa setosa setosa setosa
[15] setosa setosa setosa setosa setosa setosa setosa
[22] setosa setosa setosa setosa versicolor versicolor versicolor
[29] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[36] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[43] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[50] versicolor virginica virginica virginica virginica virginica virginica
[57] virginica virginica virginica virginica virginica virginica virginica
[64] virginica virginica virginica virginica virginica virginica virginica
[71] virginica virginica virginica virginica virginica
Levels: setosa versicolor virginica
> training$Species
 [1] setosa setosa setosa setosa setosa setosa setosa
 [8] setosa setosa setosa setosa setosa setosa setosa
[15] setosa setosa setosa setosa setosa setosa setosa
[22] setosa setosa setosa setosa versicolor versicolor versicolor
[29] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[36] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[43] versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[50] versicolor virginica virginica virginica virginica virginica virginica
[57] virginica virginica virginica virginica virginica virginica virginica
[64] virginica virginica virginica virginica virginica virginica virginica
[71] virginica virginica virginica virginica virginica
Levels: setosa versicolor virginica
# Wow, zero errors!
> table(pred,training$Species)
pred         setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         0
  virginica       0          0        25
> train.RF<-predict(modfit,newdata=Validation)
> table(train.RF,Validation$Species)
train.RF     setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         3
  virginica       0          0        22
# Error rate = 3/75

# Step9: Visualize the observations to see which ones were misclassified

> correct <- train.RF == Validation$Species
> qplot(Petal.Length,Petal.Width,colour=correct,data=Validation)
[Figure: validation observations colored by whether the random forest prediction was correct]

