Learning the caret Package in R (Part 4): Building and Validating Models

Reposted article, 2018-01-17 15:25. Tags: R language / caret package

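This article covers model building and validation with the caret package. The main functions involved are train(), predict(), and confusionMatrix(), along with the ROC-plotting functions in the pROC package.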

Building the model

When building a model, its tuning parameters need to be optimized. In caret, the main function for this is train.

train(x, y, method = "rf", preProcess = NULL, ...,
  weights = NULL, metric = ifelse(is.factor(y), "Accuracy", "RMSE"),
  maximize = ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE),
  trControl = trainControl(), tuneGrid = NULL,
  tuneLength = ifelse(trControl$method == "none", 1, 3))

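x A matrix or data frame with samples in rows and features in columns. The columns must be named.
y The outcome for each sample, either numeric or a factor.
method The specific model to fit. A large number of models are supported; the full list can be looked up in the caret documentation.
preProcess A character vector naming the pre-processing methods applied to the predictors. The default is NULL (no pre-processing). Possible values are "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica" and "spatialSign".
weights A numeric vector of case weights. Only used by models that allow case weights.
metric The summary metric used to select the optimal model. By default, the possible values are "RMSE" and "Rsquared" for regression and "Accuracy" and "Kappa" for classification.
maximize A logical value: whether the metric should be maximized (TRUE) or minimized (FALSE).
trControl A list of values that define how the function runs, created with trainControl(). See below.
tuneGrid A data frame of candidate tuning values. The column names must match the names of the tuning parameters.
tuneLength An integer giving the granularity of the tuning parameter grid. By default, it is the number of levels generated for each tuning parameter.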
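
To make the arguments above concrete, here is a minimal sketch of a train() call. It is an illustration under assumptions that are not in the original article: it uses the built-in iris data instead of the spam data, picks method = "knn" (whose single tuning parameter is k), and the grid values and 5-fold cross-validation are arbitrary choices.

```r
library(caret)

set.seed(42)  # resampling is random, so fix the seed for reproducibility

# Assumption: iris is used only for illustration (4 numeric predictors,
# a 3-level factor outcome); the article itself works with spam
x <- iris[, 1:4]
y <- iris$Species

# tuneGrid: one column per tuning parameter, named after that parameter
knn_grid <- data.frame(k = c(3, 5, 7, 9))

fit <- train(
  x, y,
  method     = "knn",
  preProcess = c("center", "scale"),              # standardize the predictors
  metric     = "Accuracy",                        # y is a factor, so classification
  trControl  = trainControl(method = "cv", number = 5),
  tuneGrid   = knn_grid
)

print(fit$bestTune)                               # the selected value of k
print(fit$results[, c("k", "Accuracy", "Kappa")]) # resampled performance per k
```

Because tuneGrid is supplied explicitly, tuneLength is not needed here. Leaving tuneGrid as NULL and setting, for example, tuneLength = 4 would instead let train generate four candidate values of k itself.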

Next, a closer look at the trainControl function.

trainControl(method = "boot", number = ifelse(grepl("cv", method), 10, 25),
  repeats = ifelse(grepl("[d_]cv$", method), 1, NA), p = 0.75,
  search = "grid", initialWindow = NULL, horizon = 1,
  fixedWindow = TRUE, skip = 0, verboseIter = FALSE, returnData = TRUE,
  returnResamp = "final",.....)

method The resampling method: "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire training set), "oob" (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), "timeslice", "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV".
number The number of folds or the number of resampling iterations.
repeats For repeated k-fold cross-validation only: the number of complete sets of folds to compute.
p For leave-group-out cross-validation only: the proportion of samples used for training.
search Either "grid" or "random", describing how the tuning parameter grid is determined.
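
The following sketch shows how these arguments combine into a few common resampling setups. The specific numbers (3 repeats, 20 splits, an 80% training share) and the object names are illustrative assumptions, not values from the original article.

```r
library(caret)

# 10-fold cross-validation, with the whole procedure repeated 3 times
ctrl_repeatedcv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Bootstrap with 25 resamples (the defaults, written out explicitly)
ctrl_boot <- trainControl(method = "boot", number = 25)

# Leave-group-out CV: 20 random splits, 80% of samples used for training
ctrl_lgocv <- trainControl(method = "LGOCV", number = 20, p = 0.8)

# 5-fold CV with random search over the tuning parameters
ctrl_random <- trainControl(method = "cv", number = 5, search = "random")

# trainControl() returns a plain list, so the settings can be inspected
str(ctrl_repeatedcv[c("method", "number", "repeats", "p", "search")])
```

Any of these objects can then be passed to train() through its trControl argument.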

The spam data from the kernlab package is used for the experiments.