A Simple Introduction to GridSearch in sklearn, the Python Machine Learning Package

cross-validation

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k "folds":

1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

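The passage above is quoted from the description of CV in the sklearn documentation. It states a rule that holds for every form of cross-validation. In a real problem we first split the whole dataset into a train_set (for example 70%) and a test_set (30%), run cross-validation on the train_set and average the fold scores, and only then use the test_set to measure the accuracy of the model. We do not run cross-validation directly on the whole dataset (this was one of my own misconceptions about CV).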
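The workflow just described can be sketched as follows. This is only an illustration under assumptions that are not in the original post: the iris dataset, an SVC with fixed hyperparameters, and a 70/30 stratified split with `random_state=0`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Assumption: iris stands in for "the whole dataset".
X, y = load_iris(return_X_y=True)

# Hold out 30% as the final test set; the remaining 70% is the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = SVC(kernel="rbf", C=1.0, gamma="scale")

# Cross-validate on the training set only; the test set is not touched here.
fold_scores = cross_val_score(model, X_train, y_train, cv=5)
cv_accuracy = fold_scores.mean()

# Afterwards: refit on the full training set and evaluate once on the test set.
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

print("per-fold accuracy:", fold_scores)
print("mean CV accuracy: %.3f" % cv_accuracy)
print("held-out test accuracy: %.3f" % test_accuracy)
```

The test set appears only in the last step, so the reported test accuracy is not influenced by any decision made during cross-validation.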
k-fold
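I did not originally plan to write about cross-validation, but I realized I had quite a few misconceptions about it, so here it is. If you read this and spot a mistake, please point it out.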

1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.

Premise: the whole dataset has already been split into a training set D (70%) and a test set T (30%).

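The passage above is the entire k-fold procedure (at this point only the training set D is involved):
1. Split the training set D into k subsets of equal size, pick k-1 of them as the training data, and train a model, model1.
2. Use the remaining subset Di as the validation set (it plays the same role as a test set) to measure the accuracy of model1. For model evaluation methods, see the ones that sklearn implements.
3. Repeat the process k times so that each subset serves as the validation set exactly once, then average the accuracies. The average is the accuracy attributed to this classification method.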
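The three steps can be written out by hand. The sketch below is one possible rendering, not the author's code: the iris data, `k = 5`, the SVC settings, and the names `X_D`, `X_T` (for D and T) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# D (70%) is used below; T (30%) is set aside and never used in this loop.
X_D, X_T, y_D, y_T = train_test_split(X, y, test_size=0.3, random_state=0)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

accuracies = []
for train_idx, val_idx in kfold.split(X_D):
    # Step 1: train on the k-1 folds.
    model = SVC(kernel="rbf", C=1.0)
    model.fit(X_D[train_idx], y_D[train_idx])
    # Step 2: validate on the remaining fold D_i.
    accuracies.append(model.score(X_D[val_idx], y_D[val_idx]))

# Step 3: every sample was validated on exactly once; average the k scores.
mean_accuracy = float(np.mean(accuracies))
print("fold accuracies:", np.round(accuracies, 3))
print("mean accuracy: %.3f" % mean_accuracy)
```

In practice `cross_val_score` performs the same loop in one call; the explicit version is only meant to make the steps visible.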
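Someone may ask: which set of model weights θ does the averaged accuracy belong to? To answer that, you have to be clear about the goal. The point here is not to find the weights of a particular fitted model but to choose, for the problem at hand, a suitable model (for example a support vector machine) and suitable hyperparameters (for example the kernel function and C). The averaged accuracy above belongs to the combination of model and hyperparameters, not to any single set of weights.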
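A small experiment may make this concrete. It is a sketch under assumptions: LogisticRegression is swapped in for the SVM only because its `coef_` attribute exposes the fitted weights directly, and the iris data and `C=1.0` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# One model family with one fixed hyperparameter setting.
candidate = LogisticRegression(C=1.0, max_iter=1000)

# return_estimator=True keeps the model fitted in each fold.
result = cross_validate(candidate, X_train, y_train, cv=5,
                        return_estimator=True)

# Five folds produce five fitted models, each with its own weights.
fold_weights = [est.coef_ for est in result["estimator"]]
for i, w in enumerate(fold_weights):
    print("fold %d, weights for the first class:" % i, np.round(w[0], 3))

# The single reported number describes (model, hyperparameters).
print("mean accuracy for LogisticRegression(C=1.0): %.3f"
      % result["test_score"].mean())
```

The weights differ from fold to fold, while the mean score is one number attached to the candidate as a whole.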
GridSearch
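With k-fold understood, we can talk about GridSearch. When this post was written GridSearch used 3-fold cross-validation by default (since scikit-learn 0.22 the default is 5-fold), so it is hard to follow without understanding cross-validation first.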
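What it is for
GridSearch exists to solve the parameter tuning problem. A support vector machine (SVM), for example, has the commonly tuned parameters kernel, gamma, and C. Tuning them by hand is too slow, and a hand-written loop runs the candidates one after another rather than in parallel. That is why GridSearch appeared. With it you can directly find the best parameters among the candidates you list.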
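For comparison, here is a hand-written tuning loop next to the equivalent GridSearchCV call. It is a sketch with assumed values: the iris data, the small `C_values` and `gamma_values` lists, and `cv=3` are all chosen for illustration. `n_jobs` is the GridSearchCV parameter that controls parallelism; it is set to 1 here only for portability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

C_values = [1, 10, 100]
gamma_values = [1e-2, 1e-3]

# Hand-written tuning: one candidate after another, strictly sequential.
best_manual_score, best_manual_params = -1.0, None
for C in C_values:
    for gamma in gamma_values:
        candidate = SVC(kernel="rbf", C=C, gamma=gamma)
        score = cross_val_score(candidate, X_train, y_train, cv=3).mean()
        if score > best_manual_score:
            best_manual_score = score
            best_manual_params = {"C": C, "gamma": gamma}

# The same search delegated to GridSearchCV.
n_jobs = 1  # assumption for portability; -1 would use all CPU cores
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": C_values, "gamma": gamma_values},
                      cv=3, n_jobs=n_jobs)
search.fit(X_train, y_train)

print("manual loop :", best_manual_params, "%.3f" % best_manual_score)
print("GridSearchCV:", search.best_params_, "%.3f" % search.best_score_)
```

Both approaches score the same candidates on the same folds, so the best mean score agrees; GridSearchCV adds the bookkeeping and the option to fit candidates in parallel.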
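How to tune the parameters
The parameter grid is a dict (or a list of dicts). GridSearch feeds every combination of the values listed in each dict to the classifier.
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}, {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
How the candidates are evaluated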
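Once the parameters are supplied, the predictive ability of the model under each parameter combination has to be evaluated. GridSearch runs k-fold on the data it is given, computes the average score for each combination, picks the best combination, and returns it.
Normally GridSearch runs k-fold on the training set only and never uses the test set. The test set is kept until the end: once GridSearch has selected the best model, the test set is used to measure how well that model generalizes.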
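To see what "each parameter combination" and "average score" look like in practice, the sketch below expands the grid shown earlier with `ParameterGrid` and reads the per-candidate means from `cv_results_`. The iris dataset and the 70/30 split are assumptions made to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, ParameterGrid,
                                     train_test_split)
from sklearn.svm import SVC

tuned_parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]

# Each dict is expanded separately: 1*2*4 + 1*4 = 12 candidates.
candidates = list(ParameterGrid(tuned_parameters))
print(len(candidates), "candidates, first one:", candidates[0])

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# k-fold runs on the training set only.
search = GridSearchCV(SVC(), tuned_parameters, cv=5)
search.fit(X_train, y_train)

# One mean cross-validation score per candidate.
mean_scores = search.cv_results_["mean_test_score"]
for params, mean in zip(search.cv_results_["params"], mean_scores):
    print("%.3f  %r" % (mean, params))

print("best:", search.best_params_, "CV score %.3f" % search.best_score_)
# The held-out test set is used once, after the selection is done.
print("test accuracy: %.3f" % search.score(X_test, y_test))
```

By default GridSearchCV refits the best candidate on the whole training set, which is why `search.score` can be called on the test data directly.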

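Here is an example from the sklearn documentation, updated for current releases (the imports now come from sklearn.model_selection, and cv_results_ replaces grid_scores_):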
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Load the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Parameter grid for the grid search
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Scoring methods used to evaluate the model (if these are unclear,
# see the link in the k-fold section above)
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Build the GridSearchCV estimator with 5-fold cross-validation
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # Run k-fold on the training set only and select the best parameters
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    # Print the best parameters
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # Mean and spread of the fold scores for every candidate
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Measure how well the best model generalizes on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
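The example above follows the usual pattern. The SVC used in it supports multiclass classification. Internally it always trains one-vs-one (ovo) classifiers. The decision_function_shape parameter only changes the shape of the decision function output: 'ovo' gives one column per pair of classes, and 'ovr' (the default since scikit-learn 0.19) gives one column per class. See the SVC API documentation for details.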
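The effect of `decision_function_shape` is easy to check on the digits data, which has 10 classes. This is a small sketch; the linear kernel and the use of the first few samples are arbitrary choices.

```python
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)  # 10 classes

ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

ovr_shape = ovr.decision_function(X[:5]).shape  # one column per class
ovo_shape = ovo.decision_function(X[:5]).shape  # one column per class pair
print("ovr:", ovr_shape)
print("ovo:", ovo_shape)

# The setting changes the output layout, not the predicted labels.
same_labels = bool((ovr.predict(X[:50]) == ovo.predict(X[:50])).all())
print("same predictions:", same_labels)
```

With 10 classes there are 10 * 9 / 2 = 45 class pairs, which is the number of columns in the 'ovo' output.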
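A few points to note
1. Does GridSearch support multiclass classification?
GridSearch only assembles the parameter combinations, feeds the data to the model in k-fold fashion, and evaluates the result. It is not a new classification method in itself, so it works as long as the estimator you choose can handle multiclass problems. The handwritten digit recognition example above is a multiclass problem. The scoring method you choose must also support multiclass problems. If you evaluate the model with roc_auc, pay attention to the format of the data.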
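As one illustration of the data-format point, `roc_auc_score` in scikit-learn 0.22 or later accepts multiclass input only when it receives per-class probabilities and an explicit `multi_class` strategy. The sketch below assumes the iris dataset and an SVC with `probability=True`; neither comes from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# probability=True makes predict_proba available on SVC.
clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X_train, y_train)

# Multiclass AUC needs scores of shape (n_samples, n_classes).
proba = clf.predict_proba(X_test)

try:
    roc_auc_score(y_test, proba)  # multi_class defaults to "raise"
except ValueError as err:
    print("plain call rejected:", err)

auc = roc_auc_score(y_test, proba, multi_class="ovr")
print("one-vs-rest AUC: %.3f" % auc)
```

Metrics such as accuracy or the `_weighted` and `_macro` averages of precision and recall take plain label predictions and need no such preparation.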

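2. The estimator passed to GridSearch is sometimes nested, for example in ensemble learning with adaboost(), so GridSearch needs to support nested parameters. A double underscore __ marks a parameter as nested, that is, as belonging to the inner estimator. (I have not tried this myself; I have only seen others say so…) Of course, gridsearch also has an API aimed specifically at ensemble learning.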
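The double-underscore convention can be tried out with a Pipeline. A Pipeline is used here instead of AdaBoost because the name of AdaBoost's wrapped-estimator parameter differs between scikit-learn versions; the step names "scale" and "svc", the grid values, and the iris data are assumptions for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The SVC is nested inside the pipeline under the step name "svc".
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# get_params() lists nested names as "<step name>__<parameter>".
nested_names = sorted(n for n in pipe.get_params() if n.startswith("svc__"))
print(nested_names[:5])

# The same names are the keys of the parameter grid.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Calling `get_params()` on any composite estimator is a quick way to find the exact nested names that GridSearchCV will accept.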
This blog post on nested parameters has an example:
———2017.4.18
