Common practices for the validation set and cross validation


If the available data is plentiful, a simple approach to model selection is to randomly split the dataset into three parts: a training set, a validation set, and a test set. The training set is used to train the models, the validation set is used to select a model, and the test set is used for the final evaluation of the learning method. Among the trained models of different complexity, we choose the one with the smallest prediction error on the validation set. Because the validation set contains enough data, model selection based on it is effective.
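As a concrete illustration, the following is a minimal Python/scikit-learn sketch of this three-way split; the synthetic dataset, the Ridge model family used as the set of "models of different complexity", and the 60/20/20 split ratios are assumptions made for the example, not part of the original text.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Illustrative synthetic data.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Carve out the test set first, then split the rest into training / validation (60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Train candidate models of different complexity (here: different regularization strengths)
# and keep the one with the smallest prediction error on the validation set.
best_alpha, best_val_err = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_err = mean_squared_error(y_val, model.predict(X_val))
    if val_err < best_val_err:
        best_alpha, best_val_err = alpha, val_err

# The untouched test set gives the final estimate of generalization error.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("best alpha:", best_alpha, "test MSE:", mean_squared_error(y_test, final_model.predict(X_test)))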

In many practical applications the data is not plentiful. To still choose a good model in that case, cross validation can be used.

k-fold cross validation: first, randomly partition the given data into k disjoint subsets of equal size. Then train the model on k-1 of the subsets and test it on the remaining subset. Repeat this process for each of the k possible choices of held-out subset, and finally select the model with the smallest average test error over the k evaluations.
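A minimal sketch of this procedure with scikit-learn's KFold follows; the choice of k = 5, the synthetic data, and the Ridge model are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 subsets, test on the remaining subset.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# The average of the k held-out errors is the cross-validation estimate of test error.
print("CV MSE: %.3f (+/- %.3f)" % (np.mean(fold_errors), np.std(fold_errors)))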

Summary: in practice, we learn the parameters on the training set, compute the error on the (cross-)validation set, choose the model with the smallest validation error, and finally estimate that model's generalization error on the test set.

Note that the purpose of k-fold cross validation is not to pick a particular trained model instance; rather, you already have a model and use cross validation to assess its accuracy. "Different models" here refers to different model families: "generally when we say 'a model' we refer to a particular method for describing how some input data relates to what we are trying to predict. We don't generally refer to particular instances of that method as different models. So you might say 'I have a linear regression model' but you wouldn't call two different sets of the trained coefficients different models."

Say we have two models, for example a linear regression model and a neural network. How can we say which model is better? We can do k-fold cross validation and see which one proves better at predicting the held-out points. But once we have used cross validation to select the better performing model, we train that model (whether it be the linear regression or the neural network) on all the data. We don't use the actual model instances we trained during cross validation for our final predictive model.
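A sketch of that procedure, assuming scikit-learn's LinearRegression and MLPRegressor as the two candidate models on a synthetic dataset (all of these choices are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

# Score each candidate with 5-fold CV (negative MSE, so larger is better).
cv_scores = {name: cross_val_score(est, X, y, cv=5, scoring="neg_mean_squared_error").mean()
             for name, est in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# The fold models are discarded; the chosen model is retrained on all the data.
final_model = candidates[best_name].fit(X, y)
print("selected:", best_name, cv_scores)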


Besides the model-selection role described above, the validation set is also used to guard against overfitting.

To make sure you don't overfit the network, you feed the validation dataset to the network and check whether its error stays within some range. Because the validation set is not used directly to adjust the weights of the network, a good error on the validation set (and on the test set) indicates that the network not only predicts well on the training examples but can also be expected to perform well on new examples that were not used during training.

Both the training set and the validation set are used during the model's training process. The training workflow:

for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
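The pseudocode above can be turned into a runnable sketch; here is one way to do it with scikit-learn's MLPClassifier and partial_fit. The synthetic dataset, the network size, and the 0.90 validation-accuracy threshold are assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
threshold = 0.90  # illustrative validation-accuracy threshold

for epoch in range(100):
    # One pass over the training data: propagate error and adjust the weights.
    net.partial_fit(X_train, y_train, classes=np.unique(y))
    train_acc = net.score(X_train, y_train)
    # The validation data is only used to measure accuracy, never to fit the weights.
    val_acc = net.score(X_val, y_val)
    print("epoch %d  train acc %.3f  val acc %.3f" % (epoch, train_acc, val_acc))
    if val_acc >= threshold:
        break  # threshold validation accuracy met: exit training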

1. I don't do a separate final training on all the training data; instead I average the responses of the 10 fold models on the test data as my final result. This may make for better CV results, as you're guaranteed to be using the same models you've got CV results for.

2. I use a holdout sample (usually ids 0-5k), but I occasionally change the holdout sample. My hardware isn't that great, so 5-fold CV is a bit time-consuming. It matched reasonably well with the LB. I also use a watchlist of 20% of the data to get the number of rounds before retraining, so I sort of have 3 holdout sets - LB, holdout, and watchlist.

3. I like to draw a lot of samples with n = size of the private-LB data to get an estimate of the private LB score (if train & test share the same distribution), or with n = size of the public LB to check the correlation between the local score and the public LB score.

4. Using a fixed hold-out set is rarely a good idea, because it's very prone to overfitting. Also look at the std of your CV, not just the mean. What you can do is monitor how your k-fold scores vary together and how your LB scores behave with respect to that. You will almost always see some patterns there, which can be used to draw conclusions.

edit: "rarely a good idea" is misleading. It should be something like this: on datasets where (stratified) k-fold CV is applicable, it is the safer bet compared to a single hold-out set.

5. My understanding of "out-of-fold" prediction is that you do the following (see the sketch after this list):

  1. Run k-fold CV, and for each run generate n*(1/k) predictions from training data of size n.
  2. Aggregate the k sets of n*(1/k) predictions, so that you have n predictions; this is what is referred to as the "out-of-fold" prediction.

And what you suggest is to sample over this out-of-fold prediction to estimate the error rate.

6. I didn't mean to say that single hold-out sets are a no-go; they have their applications. Forecasting problems are probably the best example. In competitions like that (e.g. Rossmann Store Sales), k-fold CV does not work very well because the data is not i.i.d. Gert mentions some other examples, like splits by geo-location. In such cases stratified CV could be bad, but often you can still define more than one hold-out set with the desired distribution. Another application is detecting leakage: if you don't want to waste submissions making sure your pre-processing is leakage-free, you can create a local private test set by putting aside some training data and treating it as test data. Don't look at this set during data exploration, and do not use its labels for any pre-processing.

My point is that a single hold-out set gets overfitted faster and is hence more dangerous to use if you do not have a good rapport with the god of overfitting. It's easy to do the wrong things after the LB tells you that overfitting has occurred. Besides, a single hold-out gives you no information about variance. So, if you are inexperienced with all the overfitting caveats, I would suggest preferring k-fold CV over a single hold-out where applicable.

7. Nevertheless, it seems 10-fold CV with out-of-fold prediction is very much an adequate solution.
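To make points 1, 4, 5 and 7 above concrete, here is a sketch of out-of-fold prediction that also averages the fold models' predictions on the test set and reports the mean/std of the fold scores; the synthetic data and the Ridge model are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_test, _ = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=1)

k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=0)
oof_pred = np.zeros(len(y))        # n out-of-fold predictions, one per training row
test_pred = np.zeros(len(X_test))  # averaged predictions of the k fold models
fold_scores = []

for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    oof_pred[val_idx] = model.predict(X[val_idx])   # n*(1/k) predictions per fold
    test_pred += model.predict(X_test) / k          # average over the fold models
    fold_scores.append(mean_squared_error(y[val_idx], oof_pred[val_idx]))

# Look at both the mean and the spread of the fold scores, as advised in point 4.
print("OOF MSE: %.3f  fold mean: %.3f  fold std: %.3f"
      % (mean_squared_error(y, oof_pred), np.mean(fold_scores), np.std(fold_scores)))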

 

Note:

On Kaggle, the public leaderboard is computed by validating submissions against a random fraction of the test set, while the private leaderboard is validated against the rest of the test set. The private leaderboard is released after the competition is over, and the final ranking is determined by it. People can do well on the public leaderboard yet do really badly on the private one because of overfitting.

 

 

 

 

References:

http://cvrs.whu.edu.cn/blogs/?p=154

https://www.kaggle.com/c/telstra-recruiting-network/discussion/19277

 

