[sklearn] Classification and Regression on Toy Datasets (XGBoost in Practice)

Reposted article, 2019-11-10 18:21. Tag: sklearn

Classification

1. Handwritten digit recognition

from sklearn.datasets import load_digits

digits = load_digits()  # load the handwritten digits dataset
X = digits.data  # feature values
y = digits.target  # target values

X.shape, y.shape

((1797, 64), (1797,))

Split the data into a 70% training set and a 30% test set:

from sklearn.model_selection import train_test_split

# Split the dataset: 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1257, 64), (540, 64), (1257,), (540,))
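Because train_test_split shuffles the rows randomly, the scores reported in this article change from run to run. A minimal sketch of a reproducible split follows. The seed value (random_state=42) and the use of stratify=y are assumptions added for illustration and are not part of the original code.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# random_state=42 is an arbitrary seed chosen for illustration.
# stratify=y keeps the share of each digit equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The subset sizes stay the same as in the unseeded split; only the row assignment becomes repeatable.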

Using the default parameters:

import xgboost as xgb

model_c = xgb.XGBClassifier()
model_c

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1)

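Parameter notes:
max_depth – Maximum tree depth of the base learners.
learning_rate – Boosting learning rate.
n_estimators – Number of decision trees.
gamma – Penalty coefficient: the minimum loss reduction required to split a node.
booster – Which booster to use: gbtree, gblinear or dart.
n_jobs – Number of parallel threads.
reg_alpha – L1 regularization weight.
reg_lambda – L2 regularization weight.
scale_pos_weight – Balance between positive and negative class weights.
random_state – Random number seed.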

model_c.fit(X_train, y_train)  # train on the training data
model_c.score(X_test, y_test)  # compute accuracy on the test data

0.9592592592592593

Simple parameter tuning:

max_depth	test accuracy
1	0.912962962962963
2	0.9666666666666667
3	0.9592592592592593
4	0.9629629629629629
5	0.9537037037037037
6	0.9537037037037037

max_depth=2 gives the best result.

learning_rate	test accuracy
0.05	0.9148148148148149
0.1	0.9666666666666667
0.15	0.9777777777777777
0.2	0.9814814814814815
0.25	0.9851851851851852
0.3	0.9796296296296296
0.35	0.9851851851851852
0.4	0.9851851851851852

A learning_rate of 0.25 works well (0.35 and 0.4 reach the same accuracy).

n_estimators	test accuracy
50	0.9722222222222222
75	0.9777777777777777
100	0.9851851851851852
125	0.987037037037037
150	0.987037037037037
200	0.987037037037037

Setting n_estimators to 125 works well; larger values bring no further gain.

After tuning, accuracy rose from 95.9% to 98.7%.

from matplotlib import pyplot as plt
from matplotlib.pylab import rcParams
%matplotlib inline

# Set the figure size
rcParams['figure.figsize'] = [50, 10]

xgb.plot_tree(model_c, num_trees=1)

Regression

1. Boston house price prediction

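from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older release

boston = load_boston()
X = boston.data  # feature values
y = boston.target  # target values

# Split the dataset: 80% training data and 20% test data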
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((404, 13), (102, 13), (404,), (102,))

Default parameters:

model_r = xgb.XGBRegressor()
model_r

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
importance_type='gain', learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)

model_r.fit(X_train, y_train)  # train on the training data
model_r.score(X_test, y_test)  # compute the R^2 metric on the test data

0.811524182952107

References

scikit-learn toy datasets documentation.

Shiyanlou (實驗樓): XGBoost Gradient Boosting Basics course
