Sharing a Q&A from Stack Exchange: https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation
Q: Is it always a good idea to train with the full dataset after cross-validation? Put another way, is it OK to train with all the samples in my dataset, without being able to check whether this particular fit overfits?
A: The way to think of cross-validation is as estimating the performance obtained using a method for building a model, rather than for estimating the performance of a model.
If you use cross-validation to estimate the hyper-parameters of a model and then use those hyper-parameters to fit a model to the whole dataset, then that is fine, provided that you recognise that the cross-validation estimate of performance is likely to be (possibly substantially) optimistically biased. This is because part of the model (the hyper-parameters) has been selected to optimise the cross-validation performance, so if the cross-validation statistic has a non-zero variance (and it will) there is the possibility of over-fitting the model selection criterion.
If you want to choose the hyper-parameters and estimate the performance of the resulting model, then you need to perform a nested cross-validation, where the outer cross-validation is used to assess the performance of the model, and within each outer fold an inner cross-validation is used to determine the hyper-parameters separately. You build the final model by using cross-validation on the whole set to choose the hyper-parameters and then build the classifier on the whole dataset using the optimised hyper-parameters.
This is of course computationally expensive, but worth it as the bias introduced by improper performance estimation can be large. See my paper
G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
However, it is still possible to have over-fitting in model selection (nested cross-validation just allows you to test for it). A method I have found useful is to add a regularisation term to the cross-validation error that penalises hyper-parameter values likely to result in overly-complex models, see
G. C. Cawley and N. L. C. Talbot, Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters, Journal of Machine Learning Research, vol. 8, pp. 841-861, April 2007.
So the answers to your question are: (i) yes, you should use the full dataset to produce your final model, as the more data you use the more likely it is to generalise well; but (ii) make sure you obtain an unbiased performance estimate via nested cross-validation, and potentially consider penalising the cross-validation statistic to further avoid over-fitting in model selection.
Cross-validation is best understood as estimating the performance of a model-building procedure, not the performance of a specific model.
If you use cross-validation to choose the hyper-parameters and then fit a model on the full dataset with the chosen hyper-parameters, that is fine. Bear in mind, though, that the performance estimate obtained from that cross-validation is likely to be optimistically biased: the hyper-parameters were chosen precisely to optimise the cross-validation score, so that score is no longer an accurate measure of the model's generalisation error.
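As a rough illustration, here is a minimal sketch of that workflow using scikit-learn (the synthetic data, the SVM classifier and the parameter grid are all illustrative assumptions, not part of the original answer): cross-validation picks the hyper-parameters, the best estimator is refit on the full dataset, and the score that was used to pick them should not be reported as the generalisation performance.

```python
# Minimal sketch: select hyper-parameters by cross-validation, then refit on all data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)                      # cross-validation selects the hyper-parameters
final_model = search.best_estimator_  # refit on the full dataset (refit=True by default)

# Caution: best_score_ was used to *choose* the hyper-parameters, so it is an
# optimistically biased estimate of generalisation performance.
print(search.best_params_, search.best_score_)
```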
If you need both to choose the hyper-parameters and to estimate the performance of the resulting model, use nested cross-validation: the outer cross-validation estimates the model's performance, while an inner cross-validation selects the hyper-parameters within each outer fold. Finally, the hyper-parameters chosen by cross-validation on the whole dataset, together with the full dataset itself, are used to build the final model.
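The nested procedure can be sketched as follows, again assuming scikit-learn, an SVM and an illustrative parameter grid: the inner GridSearchCV tunes the hyper-parameters within each outer training fold, while the outer cross_val_score measures how well the whole tune-then-fit procedure generalises.

```python
# Minimal sketch of nested cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # tunes hyper-parameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # estimates performance

tuned_model = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Unbiased estimate of the performance of the model-building method.
nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print("nested CV accuracy: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))

# Final model: tune on the full dataset, then refit the best estimator on all data.
tuned_model.fit(X, y)
final_model = tuned_model.best_estimator_
```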
Even so, over-fitting in model selection is still possible (nested cross-validation only lets you test for it; how, exactly?). One remedy is to add a regularisation term to the cross-validation error that penalises hyper-parameter values likely to produce overly complex models.
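As a very rough illustration of that idea (this is not the Bayesian regularisation scheme from the Cawley & Talbot 2007 paper, just a toy penalty), one could add a term to the cross-validation error that grows with hyper-parameter values known to produce more complex models, e.g. large C and gamma for an RBF SVM; the penalty weight is an arbitrary illustrative choice.

```python
# Toy sketch: pick hyper-parameters by minimising cross-validation error plus a
# crude complexity penalty, rather than the raw cross-validation error alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

lam = 0.01   # strength of the penalty on "complex" hyper-parameter settings (illustrative)
best = None
for C in [0.1, 1, 10, 100]:
    for gamma in [0.01, 0.1, 1]:
        cv_error = 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
        penalty = lam * (np.log(C) + np.log(gamma))  # grows with C and gamma
        criterion = cv_error + penalty
        if best is None or criterion < best[0]:
            best = (criterion, C, gamma)

print("selected hyper-parameters: C=%g, gamma=%g" % (best[1], best[2]))
```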
To sum up: (1) the final model should be built on the full dataset, since the more data you use, the better the model is likely to generalise; (2) make sure the performance estimate is unbiased; nested cross-validation and a penalty term on the cross-validation statistic are ways to counter bias in the performance estimate.