xgboost parameter tuning
- Pick a relatively high learning rate. 0.1 is typical, but depending on the problem the ideal learning rate can fall anywhere between 0.05 and 0.3. Then pick the ideal number of trees for that learning rate. XGBoost has a very handy `cv` function that runs cross-validation at each boosting iteration and returns the ideal number of trees (see the sketch after this list).
- With the learning rate and number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). Different values can be tried while growing the trees; examples are given later.
- Tune xgboost's regularization parameters (lambda, alpha). They reduce model complexity and can therefore improve performance.
- Lower the learning rate and settle on the final parameters.
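For the first step, here is a minimal sketch of using `xgb.cv` to pick the number of trees at a fixed learning rate. It assumes a DMatrix named dtrain (built as in the examples that follow); the parameter values are illustrative, not tuned.
import xgboost as xgb
# Illustrative parameters: eta fixed at 0.1, the other values are placeholders to tune later
param = {'max_depth': 5, 'eta': 0.1, 'objective': 'binary:logistic'}
# Cross-validate for up to 500 rounds, stopping once the test error
# has not improved for 20 rounds
cv_results = xgb.cv(param, dtrain, num_boost_round=500, nfold=5,
                    metrics={'error'}, early_stopping_rounds=20, seed=0)
# With early stopping, cv returns one row per surviving round, so the
# number of rows estimates the ideal number of trees for this eta
print('suggested number of trees:', len(cv_results))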
1. Reading libsvm-format data and modeling with specified parameters
How to use xgboost
- ① Use xgboost's own data format + xgboost's native training API
- Read the data into xgb.DMatrix format (from libsvm files, or from dataframe.values providing X and y)
- Prepare a watch_list (the datasets to monitor and evaluate)
- xgb.train(dtrain)
- xgb.predict(dtest)
- ② Use the pandas DataFrame format + xgboost's sklearn interface
- estimator = xgb.XGBClassifier()/xgb.XGBRegressor()
- estimator.fit(df_train.values, df_target.values)
#!/usr/bin/python
import numpy as np
#import scipy.sparse
import pickle
import xgboost as xgb
# Basic example: read data from libsvm files and train a binary classifier
# The data is in libsvm format, e.g.:
#1 3:1 10:1 11:1 21:1 30:1 34:1 36:1 40:1 41:1 53:1 58:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 105:1 117:1 124:1
#0 3:1 10:1 20:1 21:1 23:1 34:1 36:1 39:1 41:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 120:1
#0 1:1 10:1 19:1 21:1 24:1 34:1 36:1 39:1 42:1 53:1 56:1 65:1 69:1 77:1 86:1 88:1 92:1 95:1 102:1 106:1 116:1 122:1
# Convert to DMatrix format
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')
# Set the hyperparameters
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# Set a watchlist to monitor model state during training
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)
# Predict with the model
preds = bst.predict(dtest)
# Compute the error rate
labels = dtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))
# Save the model to disk
bst.save_model('./model/0001.model')
[15:49:14] 6513x127 matrix with 143286 entries loaded from ./data/agaricus.txt.train
[15:49:14] 1611x127 matrix with 35442 entries loaded from ./data/agaricus.txt.test
[0] eval-error:0.042831 train-error:0.046522
[1] eval-error:0.021726 train-error:0.022263
Error rate: 0.021726
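The saved file can later be loaded back into a Booster for prediction; a minimal sketch, assuming the './model/0001.model' file written above:
# Reload the saved booster and predict with it again
bst2 = xgb.Booster(model_file='./model/0001.model')
preds2 = bst2.predict(dtest)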
2. Modeling with data in pandas DataFrame format
# Pima Indians Diabetes dataset. Fields include: number of pregnancies, plasma glucose concentration from an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skinfold thickness (mm),
# 2-hour serum insulin (μU/ml), body mass index (kg/(height in m)^2), diabetes pedigree function, age (years)
import pandas as pd
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
data.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
# Basic example: read data from a csv file and train a binary classifier
# Read the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
# Split the data into train and test
train, test = train_test_split(data)
# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)
# Set the parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic' }
# Set a watchlist to monitor model state during training
watchlist = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)
# Predict with the model
preds = bst.predict(xgtest)
# Compute the error rate
labels = xgtest.get_label()
print('Error rate: %f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds))))
# Save the model to disk
bst.save_model('./model/0002.model')
[0] eval-error:0.322917 train-error:0.21875
[1] eval-error:0.244792 train-error:0.168403
[2] eval-error:0.255208 train-error:0.182292
[3] eval-error:0.270833 train-error:0.170139
[4] eval-error:0.244792 train-error:0.144097
[5] eval-error:0.25 train-error:0.145833
[6] eval-error:0.229167 train-error:0.144097
[7] eval-error:0.25 train-error:0.145833
[8] eval-error:0.239583 train-error:0.147569
[9] eval-error:0.234375 train-error:0.140625
Error rate: 0.234375
3. Using xgboost's sklearn interface
#!/usr/bin/python
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib
# Basic example: read data from a csv file and train a binary classifier
# Read the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
# Split the data into train and test
train, test = train_test_split(data)
# Take the feature matrix X and the target y
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
train_X = train[feature_columns].values
train_y = train[target_column].values
test_X = test[feature_columns].values
test_y = test[target_column].values
# Initialize the model
xgb_classifier = xgb.XGBClassifier(n_estimators=20,
                                   max_depth=4,
                                   learning_rate=0.1,
                                   subsample=0.7,
                                   colsample_bytree=0.7)
# Fit the model
xgb_classifier.fit(train_X, train_y)
# Predict with the model
preds = xgb_classifier.predict(test_X)
# Compute the error rate
print('Error rate: %f' % ((preds != test_y).sum() / float(test_y.shape[0])))
# Save the model with joblib
joblib.dump(xgb_classifier, './model/0003.model')
Error rate: 0.276042
['./model/0003.model']
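The joblib-persisted sklearn-style model can be reloaded the same way; a short sketch reusing the variables from the example above:
# Reload the persisted classifier and score the held-out data again
loaded_clf = joblib.load('./model/0003.model')
print('Error rate after reload: %f' % ((loaded_clf.predict(test_X) != test_y).sum() / float(test_y.shape[0])))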
4. Cross-validation
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=0)
| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 0 | 0.006832 | 0.001012 | 0.006756 | 0.001407 |
| 1 | 0.002994 | 0.002806 | 0.002303 | 0.002524 |
| 2 | 0.001382 | 0.000352 | 0.001382 | 0.001228 |
| 3 | 0.001190 | 0.000658 | 0.001382 | 0.001228 |
| 4 | 0.001382 | 0.000282 | 0.001075 | 0.000921 |
| 5 | 0.000921 | 0.000506 | 0.001228 | 0.001041 |
| 6 | 0.000921 | 0.000506 | 0.001228 | 0.001041 |
| 7 | 0.000921 | 0.000506 | 0.001228 | 0.001041 |
| 8 | 0.000921 | 0.000506 | 0.001228 | 0.001041 |
| 9 | 0.000921 | 0.000506 | 0.001228 | 0.001041 |
5. Cross-validation with a preprocessing step
# Compute the ratio of negative to positive samples and adjust the sample weights accordingly
def fpreproc(dtrain, dtest, param):
    label = dtrain.get_label()
    ratio = float(np.sum(label == 0)) / np.sum(label == 1)
    param['scale_pos_weight'] = ratio
    return (dtrain, dtest, param)
# Run the preprocessing first to compute the sample weights, then cross-validate
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'auc'}, seed=0, fpreproc=fpreproc)
| | train-auc-mean | train-auc-std | test-auc-mean | test-auc-std |
|---|---|---|---|---|
| 0 | 0.999772 | 0.000126 | 0.999731 | 0.000191 |
| 1 | 0.999942 | 0.000044 | 0.999909 | 0.000085 |
| 2 | 0.999964 | 0.000035 | 0.999926 | 0.000084 |
| 3 | 0.999979 | 0.000036 | 0.999950 | 0.000089 |
| 4 | 0.999976 | 0.000043 | 0.999946 | 0.000098 |
| 5 | 0.999994 | 0.000010 | 0.999988 | 0.000020 |
| 6 | 0.999993 | 0.000012 | 0.999988 | 0.000020 |
| 7 | 0.999993 | 0.000012 | 0.999988 | 0.000020 |
| 8 | 0.999993 | 0.000012 | 0.999988 | 0.000020 |
| 9 | 0.999993 | 0.000012 | 0.999988 | 0.000020 |
6. Custom loss function and evaluation metric
print('running cross validation, with customized loss function')
# Custom loss function: it must return the first- and second-order gradients of the loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess
# Custom evaluation metric: measures the gap between predictions and the ground truth
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
watchlist = [(dtest,'eval'), (dtrain,'train')]
param = {'max_depth':3, 'eta':0.1, 'silent':1}
num_round = 5
# Train with the custom loss function
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)
# Cross-validation
xgb.cv(param, dtrain, num_round, nfold=5, seed=0, obj=logregobj, feval=evalerror)
running cross validation, with customized loss function
[0] eval-rmse:0.306902 train-rmse:0.306163 eval-error:0.518312 train-error:0.517887
[1] eval-rmse:0.17919 train-rmse:0.177276 eval-error:0.518312 train-error:0.517887
[2] eval-rmse:0.172566 train-rmse:0.171727 eval-error:0.016139 train-error:0.014433
[3] eval-rmse:0.269611 train-rmse:0.271113 eval-error:0.016139 train-error:0.014433
[4] eval-rmse:0.396904 train-rmse:0.398245 eval-error:0.016139 train-error:0.014433
| | train-error-mean | train-error-std | train-rmse-mean | train-rmse-std | test-error-mean | test-error-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.517887 | 0.001085 | 0.308880 | 0.005170 | 0.517886 | 0.004343 | 0.309038 | 0.005207 |
| 1 | 0.517887 | 0.001085 | 0.176504 | 0.002046 | 0.517886 | 0.004343 | 0.177802 | 0.003767 |
| 2 | 0.014433 | 0.000223 | 0.172680 | 0.003719 | 0.014433 | 0.000892 | 0.174890 | 0.009391 |
| 3 | 0.014433 | 0.000223 | 0.275761 | 0.001776 | 0.014433 | 0.000892 | 0.276689 | 0.005918 |
| 4 | 0.014433 | 0.000223 | 0.399889 | 0.003369 | 0.014433 | 0.000892 | 0.400118 | 0.006243 |
7. Predicting with only the first n trees
#!/usr/bin/python
import numpy as np
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.model_selection import train_test_split
# Basic example: read data from a csv file and train a binary classifier
# Read the data with pandas
data = pd.read_csv('./data/Pima-Indians-Diabetes.csv')
# Split the data into train and test
train, test = train_test_split(data)
# Convert to DMatrix format
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_column = 'Outcome'
xgtrain = xgb.DMatrix(train[feature_columns].values, train[target_column].values)
xgtest = xgb.DMatrix(test[feature_columns].values, test[target_column].values)
# Set the parameters
param = {'max_depth':5, 'eta':0.1, 'silent':1, 'subsample':0.7, 'colsample_bytree':0.7, 'objective':'binary:logistic' }
# Set a watchlist to monitor model state during training
watchlist = [(xgtest,'eval'), (xgtrain,'train')]
num_round = 10
bst = xgb.train(param, xgtrain, num_round, watchlist)
# Predict with only the first tree
ypred1 = bst.predict(xgtest, ntree_limit=1)
# Predict with the first 9 trees
ypred2 = bst.predict(xgtest, ntree_limit=9)
label = xgtest.get_label()
print('Error rate with only the first tree: %f' % (np.sum((ypred1 > 0.5) != label) / float(len(label))))
print('Error rate with the first 9 trees: %f' % (np.sum((ypred2 > 0.5) != label) / float(len(label))))
[0] eval-error:0.28125 train-error:0.203125
[1] eval-error:0.182292 train-error:0.1875
[2] eval-error:0.21875 train-error:0.184028
[3] eval-error:0.213542 train-error:0.175347
[4] eval-error:0.223958 train-error:0.164931
[5] eval-error:0.223958 train-error:0.164931
[6] eval-error:0.208333 train-error:0.164931
[7] eval-error:0.192708 train-error:0.15625
[8] eval-error:0.21875 train-error:0.15625
[9] eval-error:0.208333 train-error:0.147569
Error rate with only the first tree: 0.281250
Error rate with the first 9 trees: 0.218750
Using sklearn together with XGBoost
1. Modeling with XGBoost, evaluating with sklearn
import pickle
import xgboost as xgb
import numpy as np
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.datasets import load_iris, load_digits, load_boston
rng = np.random.RandomState(31337)
# Binary classification: confusion matrix
print("Binary classification of digits 0 and 1")
digits = load_digits(n_class=2)
y = digits['target']
X = digits['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("在2折數據上的交叉驗證")
for train_index, test_index in kf.split(X):
xgb_model = xgb.XGBClassifier().fit(X[train_index],y[train_index])
predictions = xgb_model.predict(X[test_index])
actuals = y[test_index]
print("混淆矩陣:")
print(confusion_matrix(actuals, predictions))
# Multi-class classification: confusion matrix
print("\nIris: multi-class classification")
iris = load_iris()
y = iris['target']
X = iris['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("在2折數據上的交叉驗證")
for train_index, test_index in kf.split(X):
xgb_model = xgb.XGBClassifier().fit(X[train_index],y[train_index])
predictions = xgb_model.predict(X[test_index])
actuals = y[test_index]
print("混淆矩陣:")
print(confusion_matrix(actuals, predictions))
# Regression: MSE
print("\nBoston housing price regression")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
print("在2折數據上的交叉驗證")
for train_index, test_index in kf.split(X):
xgb_model = xgb.XGBRegressor().fit(X[train_index],y[train_index])
predictions = xgb_model.predict(X[test_index])
actuals = y[test_index]
print("MSE:",mean_squared_error(actuals, predictions))
Binary classification of digits 0 and 1
Cross-validation over 2 folds
Confusion matrix:
[[87 0]
[ 1 92]]
Confusion matrix:
[[91 0]
[ 3 86]]
Iris: multi-class classification
Cross-validation over 2 folds
Confusion matrix:
[[19 0 0]
[ 0 31 3]
[ 0 1 21]]
Confusion matrix:
[[31 0 0]
[ 0 16 0]
[ 0 3 25]]
Boston housing price regression
Cross-validation over 2 folds
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 9.860776812557337
[15:53:36] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MSE: 15.942418468446029
2. Grid search for the best hyperparameters
# Tuning for the 2nd training approach: sklearn-interface regressor + GridSearchCV
print("Hyperparameter optimization:")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor()
param_dict = {'max_depth': [2,4,6],
'n_estimators': [50,100,200]}
clf = GridSearchCV(xgb_model, param_dict, verbose=1)
clf.fit(X,y)
print(clf.best_score_)
print(clf.best_params_)
Hyperparameter optimization:
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[15:53:37] WARNING: d:\build\xgboost\xgboost-0.90.git\src\objective\regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
(the same warning is repeated for each of the remaining fits)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
0.6001029721598573
{'max_depth': 4, 'n_estimators': 100}
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.7s finished
3. Early stopping
# Tuning for training approaches 1 and 2: early stopping
# Train on the training set, adding trees one at a time; watch performance on the validation set and stop adding/growing trees once the validation metric stops improving
X = digits['data']
y = digits['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
clf.fit(X_train,
y_train,
early_stopping_rounds=10,
eval_metric="auc",
eval_set=[(X_val, y_val)])
[0] validation_0-auc:0.999497
Will train until validation_0-auc hasn't improved in 10 rounds.
[1] validation_0-auc:0.999497
[2] validation_0-auc:0.999497
[3] validation_0-auc:0.999749
[4] validation_0-auc:0.999749
[5] validation_0-auc:0.999749
[6] validation_0-auc:0.999749
[7] validation_0-auc:0.999749
[8] validation_0-auc:0.999749
[9] validation_0-auc:0.999749
[10] validation_0-auc:1
[11] validation_0-auc:1
[12] validation_0-auc:1
[13] validation_0-auc:1
[14] validation_0-auc:1
[15] validation_0-auc:1
[16] validation_0-auc:1
[17] validation_0-auc:1
[18] validation_0-auc:1
[19] validation_0-auc:1
[20] validation_0-auc:1
Stopping. Best iteration:
[10] validation_0-auc:1
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)
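After early stopping, the xgboost 0.90 sklearn wrapper records the best round, so predictions can be limited to that many trees. A short sketch; the attribute names best_iteration/best_score/best_ntree_limit are those of this wrapper version:
# Report the best round found on the validation set
print('best iteration:', clf.best_iteration, 'best AUC:', clf.best_score)
# Predict using only the trees up to the best round
y_val_pred = clf.predict(X_val, ntree_limit=clf.best_ntree_limit)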
4. Feature importance
iris = load_iris()
y = iris['target']
X = iris['data']
xgb_model = xgb.XGBClassifier().fit(X,y)
print('Feature ranking:')
feature_names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Get the feature importances
feature_importances = xgb_model.feature_importances_
indices = np.argsort(feature_importances)[::-1]
for index in indices:
    print("Feature %s has importance %f" % (feature_names[index], feature_importances[index]))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
plt.title("feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices], color='b')
plt.xticks(range(len(feature_importances)), np.array(feature_names)[indices], color='b')
Feature ranking:
Feature petal_length has importance 0.595834
Feature petal_width has importance 0.358166
Feature sepal_width has importance 0.033481
Feature sepal_length has importance 0.012520
([<matplotlib.axis.XTick at 0x1ed5a5bc7b8>,
<matplotlib.axis.XTick at 0x1ed5a3e6278>,
<matplotlib.axis.XTick at 0x1ed5a65c780>,
<matplotlib.axis.XTick at 0x1ed5a669748>],
<a list of 4 Text xticklabel objects>)
5. Speeding up training with parallelism
import os
if __name__ == "__main__":
    try:
        from multiprocessing import set_start_method
    except ImportError:
        raise ImportError("Unable to import multiprocessing.set_start_method."
                          " This example only runs on Python 3.4")
    set_start_method("forkserver")
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_boston
    import xgboost as xgb
    rng = np.random.RandomState(31337)
    print("Parallel Parameter optimization")
    boston = load_boston()
    os.environ["OMP_NUM_THREADS"] = "2"  # or to whatever you want
    y = boston['target']
    X = boston['data']
    xgb_model = xgb.XGBRegressor()
    clf = GridSearchCV(xgb_model,
                       {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]},
                       verbose=1,
                       n_jobs=2)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)