GBDT: Gradient-Boosted Trees and the Official Example

Reposted from Python進化論, 2018-05-09 10:32

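Gradient-boosted trees (GBTs) are ensembles of decision trees. They train decision trees iteratively in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to multiclass classification, and require no feature scaling. Spark.ml implements GBTs on top of its existing decision tree implementation.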

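Gradient boosting trains a sequence of decision trees one after another. In each iteration, the algorithm uses the current ensemble to predict the label of every training instance and compares the predictions with the true labels. The dataset is then re-labeled so that instances with poor predictions receive more emphasis. As a result, the decision tree fitted in the next iteration corrects the earlier mistakes.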
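The loop described above can be sketched for the squared-error case, where "re-labeling" amounts to fitting each new tree to the residuals of the current ensemble. This is a minimal illustration, not Spark.ml's implementation: the base learner is scikit-learn's DecisionTreeRegressor, the data are synthetic, and the names fit_boosted_trees, predict_boosted_trees and LEARNING_RATE are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

LEARNING_RATE = 0.1  # illustrative shrinkage factor, not taken from the article


def fit_boosted_trees(X, y, n_trees=50, max_depth=2):
    """Fit trees sequentially, each one on the residuals of the current ensemble."""
    init = float(np.mean(y))              # start from a constant prediction
    pred = np.full(len(y), init)
    trees, train_mse = [], []
    for _ in range(n_trees):
        residual = y - pred               # large where the ensemble predicts poorly
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)             # the residuals act as the new labels
        pred = pred + LEARNING_RATE * tree.predict(X)
        trees.append(tree)
        train_mse.append(float(np.mean((y - pred) ** 2)))
    return init, trees, train_mse


def predict_boosted_trees(init, trees, X):
    """Sum the initial constant and the shrunken contribution of every tree."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred = pred + LEARNING_RATE * tree.predict(X)
    return pred


# Synthetic one-dimensional regression problem
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

init, trees, train_mse = fit_boosted_trees(X, y)
print("training MSE after 1 tree:   %.4f" % train_mse[0])
print("training MSE after %d trees: %.4f" % (len(trees), train_mse[-1]))
```

Each new tree only has to model what the ensemble still gets wrong, so the training error shrinks as trees are added.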

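The loss function specifies how the instance labels are re-labeled. With each iteration, the gradient-boosted trees further reduce the value of this loss function on the training data. spark.ml provides one loss function for classification (log loss) and two for regression (squared error and absolute error).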
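To make the link between loss function and re-labeling concrete, the sketch below writes the three losses per sample together with their negative gradients, which are the values the next tree is fitted to. The formulas are common textbook forms and are an assumption here: labels for log loss are taken to be -1 or +1, and Spark's exact scaling constants may differ.

```python
import numpy as np


def squared_error(y, f):
    """Per-sample squared error, with a factor 1/2 for a clean gradient."""
    return 0.5 * (y - f) ** 2


def squared_error_neg_grad(y, f):
    """Negative gradient w.r.t. the prediction f: the ordinary residual."""
    return y - f


def absolute_error(y, f):
    """Per-sample absolute error."""
    return np.abs(y - f)


def absolute_error_neg_grad(y, f):
    """Negative gradient: only the sign of the residual."""
    return np.sign(y - f)


def log_loss(y, f):
    """Per-sample log loss for labels y in {-1, +1} and a real-valued score f."""
    return np.log1p(np.exp(-2.0 * y * f))


def log_loss_neg_grad(y, f):
    """Negative gradient: close to 0 for confident correct scores, up to 2*y otherwise."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * f))


# Targets and current ensemble predictions for three instances
y_reg = np.array([3.0, -1.0, 2.0])
f_reg = np.array([2.5, 0.5, 2.1])
print("squared-error targets: ", squared_error_neg_grad(y_reg, f_reg))
print("absolute-error targets:", absolute_error_neg_grad(y_reg, f_reg))

y_cls = np.array([1.0, -1.0, 1.0])
f_cls = np.array([2.0, 1.0, -0.5])   # second and third are misclassified
print("log-loss targets:      ", log_loss_neg_grad(y_cls, f_cls))
```

Under all three losses, instances that the current ensemble predicts badly get targets of larger magnitude (or at least a non-zero sign), which is how poor predictions gain emphasis in the next iteration.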

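Spark.ml supports gradient-boosted trees for binary classification and for regression, with both continuous and categorical features. Its implementation does not support multiclass classification.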


# -*- coding: utf-8 -*-
"""
Created on Wed May  9 09:53:30 2018

@author: admin
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

# #############################################################################
# Load data
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an
# older release.
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

# #############################################################################
# Fit regression model
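# Collect the keyword arguments in a dict: it can hold any number of them and
# is unpacked with ** at the call site, so no separate argument array is needed.
# ('ls' was renamed 'squared_error' in scikit-learn 1.0.)
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}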
clf = ensemble.GradientBoostingRegressor(**params)

clf.fit(X_train, y_train)
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)

# #############################################################################
# Plot training deviance

# compute test set deviance
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
         label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

# #############################################################################
# Plot feature importance
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, boston.feature_names[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
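
For readers on a recent scikit-learn, the same workflow can be sketched without load_boston. This is an adaptation under stated assumptions, not the official example: it swaps in the bundled load_diabetes dataset, leaves the loss at its default (squared error) so that no version-specific loss name is needed, and computes the per-iteration test error with mean_squared_error. Plotting is omitted; test_score and model.train_score_ can be passed to the same plt.plot calls as in the script above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Bundled regression dataset used here as a stand-in for the Boston data
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.1, random_state=13)

# Same hyperparameters as the script above; the loss is left at its default
model = GradientBoostingRegressor(n_estimators=500, max_depth=4,
                                  min_samples_split=2, learning_rate=0.01,
                                  random_state=0)
model.fit(X_train, y_train)

# Test-set squared error after each boosting iteration
test_score = np.array([mean_squared_error(y_test, y_pred)
                       for y_pred in model.staged_predict(X_test)])
best_iter = int(np.argmin(test_score)) + 1
print("Final test MSE: %.4f" % test_score[-1])
print("Lowest test MSE at iteration %d" % best_iter)

# Features ranked from most to least important
order = np.argsort(model.feature_importances_)[::-1]
for idx in order:
    print("%-4s %.3f" % (data.feature_names[idx], model.feature_importances_[idx]))
```

If the test error bottoms out well before the last iteration, a smaller n_estimators would probably serve just as well on this split.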

About the housing data:

- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

Reference: http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py

