1. Loss Functions
This section introduces xgboost regression. xgboost implements five kinds of regression objectives: squarederror, logistic, poisson, gamma, and tweedie. Below, the first two are derived and implemented; the remaining three are left to the next section.
squarederror
That is, a regression model whose loss function is the squared error:
\[L(y,\hat{y})=\frac{1}{2}(y-\hat{y})^2 \]
The first- and second-order derivatives are therefore:
\[\frac{\partial L(y,\hat{y})}{\partial \hat{y}}=\hat{y}-y\\ \frac{\partial^2 L(y,\hat{y})}{{\partial \hat{y}}^2}=1\\ \]
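Before wiring this into the booster, it is easy to sanity-check the derivation with a central finite difference. The snippet below is an illustrative sketch; the helper name squarederror_grad_hess is ours rather than part of ml_models:
import numpy as np

# analytic gradient/hessian of 0.5*(y - y_pred)^2 w.r.t. y_pred, as derived above
def squarederror_grad_hess(y, y_pred):
    return y_pred - y, np.ones_like(y)

# central finite-difference check at a single point
y_true, y_hat, eps = 3.0, 1.5, 1e-6
loss = lambda p: 0.5 * (y_true - p) ** 2
num_grad = (loss(y_hat + eps) - loss(y_hat - eps)) / (2 * eps)
g, h = squarederror_grad_hess(np.array([y_true]), np.array([y_hat]))
print(num_grad, g[0])  # both should be close to y_hat - y_true = -1.5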
logistic
Since this is a regression task, the target y is also passed through the sigmoid function (denoted \(\sigma(\cdot)\)). The loss function is:
\[L(y,\hat{y})=-\left[(1-\sigma(y))\log(1-\sigma(\hat{y}))+\sigma(y)\log(\sigma(\hat{y}))\right] \]
Using \(\sigma'(z)=\sigma(z)(1-\sigma(z))\), the first- and second-order derivatives are:
\[\frac{\partial L(y,\hat{y})}{\partial \hat{y}}=\sigma(\hat{y})-\sigma(y)\\ \frac{\partial^2 L(y,\hat{y})}{{\partial \hat{y}}^2}=\sigma(\hat{y})(1-\sigma(\hat{y}))\\ \]
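The same finite-difference check works for this loss as well; the sketch below defines its own sigmoid instead of importing the library's utils.sigmoid:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the sigmoid cross-entropy loss defined above
def logistic_loss(y, y_pred):
    return -((1 - sigmoid(y)) * np.log(1 - sigmoid(y_pred)) +
             sigmoid(y) * np.log(sigmoid(y_pred)))

y_true, y_hat, eps = 0.8, 0.3, 1e-6
num_grad = (logistic_loss(y_true, y_hat + eps) - logistic_loss(y_true, y_hat - eps)) / (2 * eps)
print(num_grad, sigmoid(y_hat) - sigmoid(y_true))  # the two should agree to ~1e-6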
2. Code Implementation
The overall flow is similar to that of GBDT regression, except that the first- and second-order derivatives must be computed in each round, and the base learner is replaced with the xgboost regression tree from the previous section.
import os
os.chdir('../')
import matplotlib.pyplot as plt
%matplotlib inline
from ml_models.ensemble import XGBoostBaseTree
from ml_models import utils
import copy
import numpy as np
"""
xgboost回歸樹的實現,封裝到ml_models.ensemble
"""
class XGBoostRegressor(object):
def __init__(self, base_estimator=None, n_estimators=10, learning_rate=1.0, loss='squarederror'):
"""
:param base_estimator: 基學習器
:param n_estimators: 基學習器迭代數量
:param learning_rate: 學習率,降低后續基學習器的權重,避免過擬合
:param loss:損失函數,支持squarederror、logistic
"""
self.base_estimator = base_estimator
self.n_estimators = n_estimators
self.learning_rate = learning_rate
if self.base_estimator is None:
            # default to a decision stump
self.base_estimator = XGBoostBaseTree()
        # homogeneous base learners: replicate one estimator n_estimators times
        if not isinstance(self.base_estimator, list):
estimator = self.base_estimator
self.base_estimator = [copy.deepcopy(estimator) for _ in range(0, self.n_estimators)]
        # heterogeneous base learners: a user-supplied list of estimators
else:
self.n_estimators = len(self.base_estimator)
self.loss = loss
def _get_gradient_hess(self, y, y_pred):
"""
獲取一階、二階導數信息
:param y:真實值
:param y_pred:預測值
:return:
"""
if self.loss == 'squarederror':
return y_pred - y, np.ones_like(y)
        elif self.loss == 'logistic':
            pred = utils.sigmoid(y_pred)
            return pred - utils.sigmoid(y), pred * (1 - pred)
def fit(self, x, y):
y_pred = np.zeros_like(y)
g, h = self._get_gradient_hess(y, y_pred)
for index in range(0, self.n_estimators):
self.base_estimator[index].fit(x, g, h)
y_pred += self.base_estimator[index].predict(x) * self.learning_rate
g, h = self._get_gradient_hess(y, y_pred)
    def predict(self, x):
        # scale every base learner's contribution by the learning rate,
        # consistent with how y_pred is accumulated in fit
        return np.sum([self.learning_rate * estimator.predict(x)
                       for estimator in self.base_estimator], axis=0)
# Test
data = np.linspace(1, 10, num=100)
target = np.sin(data) + np.random.random(size=100)  # add noise
data = data.reshape((-1, 1))
model = XGBoostRegressor(loss='squarederror')
model.fit(data, target)
plt.scatter(data, target)
plt.plot(data, model.predict(data), color='r')
plt.show()
model = XGBoostRegressor(loss='logistic')
model.fit(data, target)
plt.scatter(data, target)
plt.plot(data, model.predict(data), color='r')
plt.show()
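To attach a number to the two plots, a small follow-up cell (ours, not part of the original notebook) can print each model's training error; the exact values depend on the random noise drawn above:
# reuses data/target from the test cell above
for loss in ['squarederror', 'logistic']:
    model = XGBoostRegressor(loss=loss)
    model.fit(data, target)
    mse = np.mean((target - model.predict(data)) ** 2)
    print(loss, 'training MSE:', mse)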