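Overview
- How to use K-fold cross-validation to search for the best tuning parameters
- How to make the parameter search more efficient
- How to search over several tuning parameters at once
- What to do with the tuned parameters before making real predictions
- How to reduce the computational cost of the search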
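1. Review of K-fold cross-validation
The cross-validation procedure
- Choose a value for K (typically 10) and split the dataset into K equal parts (folds)
- Use K-1 folds as training data and the remaining fold as test data, and train the model
- Measure the model's predictive performance with an evaluation metric
- Repeat so that each fold serves once as test data, then average the K scores
Advantages of cross-validation
- Averaging over K splits reduces the variance that comes from relying on a single train/test split, so the performance estimate is more stable
- Cross-validation can be used to select tuning parameters, compare models, and select features
Disadvantages of cross-validation
- Cross-validation adds computational cost, because the model is trained K times; on large datasets this can make the process very slow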
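The procedure reviewed in this section can be sketched by hand with scikit-learn's KFold splitter. This is only an illustration: the shuffled folds, the random_state value, and the choice of a 5-neighbor KNN classifier are assumptions made for the example, not settings taken from the text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumed settings for illustration: K=10, shuffled folds, KNN with 5 neighbors
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=5)
    # train on the K-1 training folds
    model.fit(X[train_idx], y[train_idx])
    # measure accuracy on the single held-out fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# the cross-validated estimate is the average over the K folds
mean_accuracy = np.mean(fold_scores)
print(len(fold_scores), mean_accuracy)
```

In practice cross_val_score wraps this loop in a single call; writing it out mainly shows where the K model fits, and therefore the computational cost, come from.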
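2. Efficient parameter tuning with GridSearchCV
GridSearchCV runs cross-validation automatically for the model you give it, trying each candidate parameter value and recording its score. In effect, it replaces the for loop you would otherwise write to search over parameters.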
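For comparison, the kind of hand-written loop that GridSearchCV replaces might look like the sketch below. The variable names (k_values, mean_scores, best_k) are illustrative, and the range of 1 to 30 neighbors is assumed to match the search used in this section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# candidate values of n_neighbors (assumed: 1 through 30)
k_values = list(range(1, 31))

mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validated accuracy for this value of k
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    mean_scores.append(scores.mean())

# pick the first k that reaches the highest mean accuracy
best_index = max(range(len(k_values)), key=lambda i: mean_scores[i])
best_k = k_values[best_index]
print(best_k, mean_scores[best_index])
```

GridSearchCV does the same bookkeeping internally and additionally stores the per-candidate results and a refitted best model.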
In [1]:
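from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib inline
# GridSearchCV has lived in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.grid_search module was removed in 0.20
from sklearn.model_selection import GridSearchCV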
In [2]:
# read in the iris data
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
In [3]:
# define the parameter values that should be searched
k_range = list(range(1, 31))
print(k_range)
In [4]:
# create a parameter grid: a dict that maps each parameter name
# to the list of values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)
In [5]:
knn = KNeighborsClassifier(n_neighbors=5)
# instantiate the grid
# GridSearchCV takes much the same arguments as cross_val_score,
# plus param_grid, the parameter grid to search
# passing n_jobs=-1 runs the search in parallel (if your machine supports it)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
This grid search therefore runs 10-fold cross-validation for each of the 30 candidate values of n_neighbors, which amounts to 300 model fits in total.
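The amount of work can be checked before running the search. The sketch below uses scikit-learn's ParameterGrid helper to enumerate the candidates; the names n_candidates, n_folds, and n_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# same grid as in this section: 30 candidate values of n_neighbors
param_grid = {'n_neighbors': list(range(1, 31))}

n_candidates = len(ParameterGrid(param_grid))  # number of parameter settings
n_folds = 10                                   # cv=10
n_fits = n_candidates * n_folds                # cross-validation fits

print(n_candidates, n_fits)  # 30 300
```

With the default refit=True, GridSearchCV performs one additional fit on the full dataset after the search finishes.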
In [6]:
grid.fit(X, y)
In [7]:
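# view the complete results (a dict of arrays with one entry per candidate)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
grid.cv_results_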
In [8]:
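# examine the first candidate: its parameters, its 10 per-fold scores, and its mean score
# (cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20)
print(grid.cv_results_['params'][0])
print([grid.cv_results_['split%d_test_score' % i][0] for i in range(10)])
print(grid.cv_results_['mean_test_score'][0])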
In [9]:
# create a list of the mean scores only
# (cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20)
grid_mean_scores = list(grid.cv_results_['mean_test_score'])
print(grid_mean_scores)
In [10]:
# plot the results
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
In [11]:
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
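3. Searching multiple parameters at once
Here we search over two KNN parameters, n_neighbors and weights. The weights parameter defaults to 'uniform', which gives every neighbor an equal vote. The alternative, 'distance', weights each neighbor by the inverse of its distance, so closer neighbors count more and farther ones count less.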
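A tiny example can make the difference between the two weighting schemes concrete. The one-dimensional toy data below is invented purely for illustration: the query point sits close to a single class-0 sample and farther from two class-1 samples.

```python
from sklearn.neighbors import KNeighborsClassifier

# invented toy data: one class-0 point near the query, two class-1 points farther away
X_toy = [[0.0], [2.0], [2.1]]
y_toy = [0, 1, 1]
query = [[0.5]]

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# uniform: plain majority vote, two votes for class 1 against one for class 0
uniform_pred = uniform_knn.predict(query)[0]
# distance: votes weighted by 1/distance, 1/0.5 = 2.0 for class 0
# against 1/1.5 + 1/1.6 (about 1.29) for class 1
distance_pred = distance_knn.predict(query)[0]

print(uniform_pred, distance_pred)  # 1 0
```

Because the two settings can disagree like this, it is reasonable to treat weights as a tuning parameter and search over it together with n_neighbors.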
In [12]:
# define the parameter values that should be searched
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
In [13]:
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options)
print(param_grid)
In [14]:
# instantiate and fit the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
In [15]:
# view the complete results
# (cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20)
grid.cv_results_
In [16]:
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
4. Making predictions with the best parameters
In [17]:
# train your model using all data and the best known parameters
knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')
knn.fit(X, y)
# make a prediction on out-of-sample data
# (predict expects a 2D array with one row per sample)
knn.predict([[3, 5, 4, 2]])
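Here the model is retrained with the best parameters found by the search. At this stage all of the data can be used for training, so no samples are held out and wasted.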
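Rather than copying the best values by hand, they can be unpacked directly from best_params_. The sketch below is self-contained and rebuilds the two-parameter search from the previous section; final_knn and new_sample are illustrative names, and the sample values are the same made-up measurements used in the surrounding cells.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {'n_neighbors': list(range(1, 31)),
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)

# build a fresh model from the best parameters and train it on all of the data
final_knn = KNeighborsClassifier(**grid.best_params_)
final_knn.fit(X, y)

# one out-of-sample observation, as a 2D array with a single row
new_sample = [[3, 5, 4, 2]]
manual_pred = final_knn.predict(new_sample)
# with the default refit=True, best_estimator_ is already trained the same way
refit_pred = grid.best_estimator_.predict(new_sample)

print(grid.best_params_, manual_pred, refit_pred)
```

Unpacking best_params_ keeps the final model in sync with the search if the grid or the data changes later.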
In [18]:
# shortcut: GridSearchCV automatically refits the best model using all of the data
# (predict expects a 2D array with one row per sample)
grid.predict([[3, 5, 4, 2]])
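5. Reducing computational cost with RandomizedSearchCV
- RandomizedSearchCV addresses the high computational cost of searching over several parameters at once
- RandomizedSearchCV evaluates only a random subset of the parameter combinations, so you control the computational budget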
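To see what "a random subset" means in practice, the candidates can be drawn with scikit-learn's ParameterSampler helper, without fitting any model. This is a sketch under assumed settings: the parameter lists mirror the two-parameter search from section 3, and n_iter=10 and random_state=5 are example values.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {'n_neighbors': list(range(1, 31)),
              'weights': ['uniform', 'distance']}

# the full grid has 30 * 2 = 60 combinations
n_total = len(ParameterGrid(param_dist))

# a randomized search with n_iter=10 evaluates only 10 of them
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=5))

print(n_total, len(sampled))  # 60 10
for params in sampled[:3]:
    print(params)
```

With 10-fold cross-validation this cuts the work from 600 fits to 100, at the risk of missing the single best combination.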
In [19]:
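# RandomizedSearchCV has lived in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.grid_search module was removed in 0.20
from sklearn.model_selection import RandomizedSearchCV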
In [20]:
# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)
In [21]:
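# n_iter controls the number of parameter combinations that are sampled
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
rand.cv_results_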
In [22]:
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
In [23]:
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
    rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
    rand.fit(X, y)
    best_scores.append(round(rand.best_score_, 3))
print(best_scores)
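When a tuning parameter is continuous, such as the regularization strength of a regression model, specify a continuous distribution rather than a list of candidate values, so that RandomizedSearchCV can sample the whole range instead of a few fixed grid points.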
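As a rough sketch of that idea, the example below tunes the alpha penalty of a Ridge regression by sampling it from a continuous uniform distribution. Everything specific here is an assumption made for illustration: the diabetes dataset, the Ridge model, the range from 0.01 to 10.01, and the 5-fold setting do not come from the text above.

```python
from scipy.stats import uniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X_reg, y_reg = load_diabetes(return_X_y=True)

# uniform(loc, scale) is a continuous distribution on [loc, loc + scale];
# RandomizedSearchCV draws a fresh alpha from it for every iteration
param_dist = {'alpha': uniform(loc=0.01, scale=10)}

# Ridge is scored with R^2 by default; n_iter=10 draws 10 values of alpha
rand = RandomizedSearchCV(Ridge(), param_dist, n_iter=10, cv=5, random_state=0)
rand.fit(X_reg, y_reg)

best_alpha = rand.best_params_['alpha']
print(best_alpha, rand.best_score_)
```

Any object with an rvs method can serve as a distribution here; for penalties that span several orders of magnitude, a log-scaled distribution is often a better fit than a uniform one.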
Reposted from: http://blog.csdn.net/jasonding1354/article/details/50562522