Among machine learning models, the two most advanced tree models right now are xgboost and lightgbm. So how should we tune their parameters? Which parameters matter most and need adjusting, and which are secondary? And how do we invoke these two models in code? Below is a chart summarizing practical tuning experience for the three models xgboost, lightgbm, and catboost, along with the meaning and typical values of each parameter, for reference:
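As a quick textual companion to the chart, the most commonly tuned knobs correspond roughly as follows across the three libraries. This mapping is drawn from general experience, not from the chart itself, and is written as a Python dict purely for readability; all names are real library parameter names:

# Roughly corresponding tuning parameters across the three libraries
# (a commonly cited mapping from general experience, not the original chart)
common_params = {
    "tree depth":        {"xgboost": "max_depth",        "lightgbm": "max_depth",        "catboost": "depth"},
    "number of trees":   {"xgboost": "n_estimators",     "lightgbm": "n_estimators",     "catboost": "iterations"},
    "learning rate":     {"xgboost": "learning_rate",    "lightgbm": "learning_rate",    "catboost": "learning_rate"},
    "leaf complexity":   {"xgboost": "min_child_weight", "lightgbm": "num_leaves, min_child_samples", "catboost": "min_data_in_leaf"},
    "L2 regularization": {"xgboost": "reg_lambda",       "lightgbm": "reg_lambda",       "catboost": "l2_leaf_reg"},
    "row/column sampling": {"xgboost": "subsample, colsample_bytree",
                            "lightgbm": "subsample, colsample_bytree",
                            "catboost": "subsample, rsm"},
}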
1. Tuning XGBoost with grid search
The implementation code is as follows:
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Custom evaluation: AUC on the train and test sets
# (train, test, y_train, y_test are assumed to be defined beforehand)
def auc(m, train, test):
    return (metrics.roc_auc_score(y_train, m.predict_proba(train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(test)[:, 1]))

# Parameter tuning
model = xgb.XGBClassifier()
param_dist = {"max_depth": [10, 30, 50],
              "min_child_weight": [1, 3, 6],
              "n_estimators": [200],
              "learning_rate": [0.05, 0.1, 0.16]}
grid_search = GridSearchCV(model, param_grid=param_dist, cv=3,
                           verbose=10, n_jobs=-1)
grid_search.fit(train, y_train)
grid_search.best_estimator_

# Refit a model with the chosen parameters
model = xgb.XGBClassifier(max_depth=3, min_child_weight=1, n_estimators=20,
                          n_jobs=-1, verbosity=1, learning_rate=0.16)
model.fit(train, y_train)
print(auc(model, train, test))
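Instead of hardcoding the refit values, you can also feed the combination the search actually selected straight back into the classifier. A minimal sketch reusing the objects defined above:

# Reuse the best parameter combination found by the grid search
best_model = xgb.XGBClassifier(n_jobs=-1, **grid_search.best_params_)
best_model.fit(train, y_train)
print(auc(best_model, train, test))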
Here the custom auc function serves as the model's evaluation metric. The grid contains 3 × 3 × 1 × 3 = 27 parameter combinations, so with 3-fold cross-validation the run produces output like the following:
Fitting 3 folds for each of 27 candidates, totalling 81 fits
(0.7479275227922775, 0.7430946047035487)
2. Tuning LightGBM with grid search
The code is as follows:
import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Custom AUC metric; a native-API LightGBM model's predict()
# returns probabilities directly, so no [:, 1] indexing is needed
def auc2(m, train, test):
    return (metrics.roc_auc_score(y_train, m.predict(train)),
            metrics.roc_auc_score(y_test, m.predict(test)))

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [25, 50, 75],
              "learning_rate": [0.01, 0.05, 0.1],
              "num_leaves": [300, 900, 1200],
              "n_estimators": [200]}
grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=3,
                           scoring="roc_auc", verbose=5)
grid_search.fit(train, y_train)
grid_search.best_estimator_
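Note that a bare grid_search.best_estimator_ only displays its value in a notebook or REPL; in a plain script you would print the result explicitly, for example:

# Show the best cross-validated AUC and the parameters that achieved it
print(grid_search.best_score_)
print(grid_search.best_params_)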
# Train with LightGBM's native API
d_train = lgb.Dataset(train, label=y_train, free_raw_data=False)
params = {"max_depth": 3, "learning_rate": 0.1,
          "num_leaves": 900, "n_estimators": 20}

# Without categorical features
model2 = lgb.train(params, d_train)
print(auc2(model2, train, test))

# With categorical features
cate_features_name = ["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE",
                      "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]
model2 = lgb.train(params, d_train, categorical_feature=cate_features_name)
print(auc2(model2, train, test))
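For categorical_feature to take effect, those columns must already be integer-encoded or use the pandas category dtype; raw string columns will be rejected. A minimal sketch, assuming train is a pandas DataFrame containing the columns listed above:

# Convert the categorical columns to pandas' category dtype so that
# lgb.Dataset can hand them to LightGBM as true categorical features
for col in cate_features_name:
    train[col] = train[col].astype('category')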
3. Combining LightGBM with sklearn
The third way to use LightGBM is through its sklearn-style wrapper, which lets you evaluate LightGBM's results quickly with sklearn's tooling.
# coding: utf-8
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Load the data
print('Loading data...')
df_train = pd.read_csv('../data/regression.train.txt', header=None, sep='\t')
df_test = pd.read_csv('../data/regression.test.txt', header=None, sep='\t')

# Split out features and labels (column 0 is the label)
y_train = df_train[0].values
y_test = df_test[0].values
X_train = df_train.drop(0, axis=1).values
X_test = df_test.drop(0, axis=1).values

print('Starting training...')
# Instantiate LGBMRegressor directly;
# it behaves like any other sklearn regressor
gbm = lgb.LGBMRegressor(objective='regression',
                        num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)
# Fit with early stopping on a held-out evaluation set
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=5)

# Predict with the best iteration found by early stopping
print('Starting prediction...')
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

# Evaluate the predictions
print('The RMSE of the predictions is:')
print(mean_squared_error(y_test, y_pred) ** 0.5)
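GridSearchCV is imported above but never used. Since LGBMRegressor follows the sklearn estimator interface, it can be dropped into a grid search directly; a minimal sketch, with a parameter grid chosen purely for illustration:

# Grid-search the sklearn-style LightGBM regressor like any other estimator
param_grid = {'num_leaves': [15, 31, 63],
              'learning_rate': [0.01, 0.05, 0.1]}
gs = GridSearchCV(lgb.LGBMRegressor(objective='regression', n_estimators=20),
                  param_grid, cv=3)
gs.fit(X_train, y_train)
print(gs.best_params_)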
And that's the basic usage of XGBoost and LightGBM in code!