Using the XGBoost Native API and the sklearn Interface (Titanic Data)

Reposted article, dated 2020-09-10 17:02. Tag: xgboost

Addendum (2021-03-11):

Official documentation: https://xgboost.readthedocs.io/en/latest/python/python_api.html

DMatrix

DMatrix is the data matrix used in XGBoost. It is XGBoost's internal data structure, optimized for both memory efficiency and training speed.

class xgboost.DMatrix(data, label=None, *, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, 
feature_types=None, nthread=None, group=None, qid=None, label_lower_bound=None, label_upper_bound=None, feature_weights=None, enable_categorical=False)

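Parameters:

data: the table of input features. Several data types are accepted, such as a pandas DataFrame or a numpy.array.

label: the target values y; accepts the same data types as data.

The remaining parameters are rarely needed; in everyday use these two are usually enough. One method is worth knowing:

get_label(): returns the y values stored in the DMatrix.

For everything else, consult the API documentation when the need arises.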

train

xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=None, 
early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None)

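Parameters:

params: the booster parameters, given as a dict

dtrain (DMatrix): the data to train on, i.e. the object returned by xgboost.DMatrix above

num_boost_round (int): the number of boosting iterations

evals (list of pairs (DMatrix, string)): a list of validation sets whose metrics are evaluated during training. The validation metrics help track the model's performance on the training and validation sets and are printed in a form such as train-auc:0.92495 valid-auc:0.91495. In code this list is often named watchlist.

obj (function): a custom objective function

early_stopping_rounds (int): activates early stopping. Training stops once the validation metric has not improved for this many consecutive rounds. It requires at least one entry in evals; if there are several, the last one is used.

verbose_eval (bool or int): if True, the evaluation metric is printed at every boosting round; if an integer n, it is printed every n rounds.

The return value is the trained model, a Booster object. Inspecting it shows the following methods and attributes:

Functions/methods: ['attr', 'attributes', 'boost', 'copy', 'dump_model', 'eval', 'eval_set', 'get_dump', 'get_fscore', 'get_score', 'get_split_value_histogram', 'inplace_predict', 'load_config', 'load_model', 'load_rabit_checkpoint', 'predict', 'save_config', 'save_model', 'save_rabit_checkpoint', 'save_raw', 'set_attr', 'set_param', 'trees_to_dataframe', 'update']

Attributes: ['best_iteration', 'best_ntree_limit', 'best_score', 'booster', 'feature_names', 'feature_types', 'handle']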

`xgboost.cv`

xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=None, 
early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)

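Mainly used to find good parameter values through cross-validation.

params (dict) – Booster parameters.
dtrain (DMatrix) – Data to be trained.
num_boost_round (int) – Number of boosting iterations. Often written as num_boost_round=model.get_params()['n_estimators'].
nfold (int) – Number of folds in CV.
stratified (bool) – Perform stratified sampling. Not commonly used; enable it when stratified sampling is needed.
folds (a KFold or StratifiedKFold instance or list of fold indices) – Sklearn KFolds or StratifiedKFolds object. Alternatively, the sample indices for each fold can be passed explicitly. For n folds, folds should be a list of n tuples. Each tuple is (in, out), where in is a list of indices used as training samples for the n-th fold and out is a list of indices used as test samples for the n-th fold.
metrics (string or list of strings) – Evaluation metrics to be watched in CV.
obj (function) – Custom objective function.
feval (function) – Custom evaluation function.
maximize (bool) – Whether to maximize feval.
early_stopping_rounds (int) – Activates early stopping. The cross-validation metric (the average of the validation metric computed over the CV folds) needs to improve at least once in every early_stopping_rounds rounds for training to continue. The last entry in the evaluation history represents the best iteration. If more than one metric is given in the eval_metric parameter in params, the last metric is used for early stopping.
fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return np.ndarray.
verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress is displayed when np.ndarray is returned. If True, progress is displayed at every boosting stage. If an integer is given, progress is displayed at every given verbose_eval boosting stage.
show_stdv (bool, default True) – Whether to display the standard deviation in the progress output. Results are not affected and always contain std.
seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
callbacks (list of callback functions) – List of callback functions applied at the end of each iteration. Predefined callbacks are available through the Callback API. Example: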
```
[xgb.callback.LearningRateScheduler(custom_rates)]
```

Typical usage:

import xgboost as xgb
from xgboost import XGBClassifier


def model_cv(model, X, y, cv_folds=5, early_stopping_rounds=50, seed=0):
    """Use xgb.cv with early stopping to pick the number of boosting rounds."""
    xgb_param = model.get_xgb_params()
    xgtrain = xgb.DMatrix(X, label=y)
    # The original passed xgb.callback.print_evaluation / early_stop, which are deprecated
    # or removed in newer XGBoost releases; the keyword arguments below do the same job.
    cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=model.get_params()['n_estimators'], nfold=cv_folds,
                      metrics='auc', seed=seed, early_stopping_rounds=early_stopping_rounds,
                      verbose_eval=True, show_stdv=False)
    # cvresult has one row per round kept, so this is the 0-based index of the best round
    num_round_best = cvresult.shape[0] - 1
    print('Best round num: ', num_round_best)
    return num_round_best


num_round = 500
seed = 0
max_depth = 4
min_child_weight = 1000
gamma = 0
subsample = 0.8
colsample_bytree = 0.8
scale_pos_weight = 1
reg_alpha = 1
reg_lambda = 1e-5
learning_rate = 0.1
# n_jobs and random_state are the sklearn-style names for nthread and seed
model = XGBClassifier(learning_rate=learning_rate, n_estimators=num_round, max_depth=max_depth,
                      min_child_weight=min_child_weight, gamma=gamma, subsample=subsample, reg_alpha=reg_alpha,
                      reg_lambda=reg_lambda, colsample_bytree=colsample_bytree, objective='binary:logistic',
                      n_jobs=4, scale_pos_weight=scale_pos_weight, random_state=seed)
# X, y: the feature matrix and labels, prepared elsewhere
num_round = model_cv(model, X, y)

2. Differences between the two versions

The native version is still the recommended choice.

I. Parameters of the native XGBoost API

1.1 General Parameters

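booster [default=gbtree]: one of gbtree, gblinear or dart; gbtree and dart use tree-based models, while gblinear uses a linear model;
silent [default=0]: 0 prints running messages, 1 suppresses them (later releases replace it with verbosity);
nthread [defaults to the maximum number of threads available if not set]: number of parallel threads used to run XGBoost;
disable_default_eval_metric [default=0]: flag to disable the default metric; set to >0 to disable;
num_pbuffer [set automatically by XGBoost, no need for the user to set it]: size of the prediction buffer, normally set to the number of training instances;
num_feature [set automatically by XGBoost, no need for the user to set it]: feature dimension used in boosting, set to the maximum feature dimension.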

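1.2 Parameters for Tree Booster:

eta [default=0.3, alias: learning_rate]: the learning rate, i.e. the step-size shrinkage applied to each new tree's contribution; smaller values make boosting more conservative and help prevent overfitting. range: [0, 1];
gamma [default=0, alias: min_split_loss]: minimum loss reduction required to make a further partition on a leaf node of the tree; the larger gamma is, the more conservative the algorithm. range: [0, ∞];
max_depth [default=6]: maximum depth of a tree; larger values make the model more complex and more likely to overfit. 0 means no limit;
min_child_weight [default=1]: minimum sum of instance weights needed in a child node. If a split would produce a leaf whose instance-weight sum is below min_child_weight, that node is not split further. In a linear regression task this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm;
max_delta_step [default=0]: maximum delta step allowed for each leaf output. 0 means no constraint; a positive value makes the update step more conservative. This parameter usually does not need to be set, but in logistic regression with extremely imbalanced classes, a value between 1 and 10 can help control the update;
subsample [default=1]: fraction of the training rows sampled for each tree; for example, subsample=0.5 means half of the training data is randomly sampled before a tree is grown, which helps prevent overfitting. range: (0, 1];
lambda [default=1, alias: reg_lambda]: L2 regularization coefficient;
alpha [default=0, alias: reg_alpha]: L1 regularization coefficient;
tree_method string [default=auto]: the tree construction algorithm. The distributed and external-memory versions only support tree_method=approx. Options: auto, exact, approx, hist, gpu_exact, gpu_hist (gpu_exact was removed in later releases)
- auto: use a heuristic to choose the fastest algorithm, as follows:
  - for small to medium datasets, the Exact Greedy Algorithm is used;
  - for large datasets, the Approximate Algorithm is used;
  - because the old behavior was to always use the Exact Greedy Algorithm on a single machine, the user gets a message when the Approximate Algorithm is chosen, to notify them of this choice.
- exact: Exact Greedy Algorithm
- approx: Approximate Algorithm
- hist: fast histogram-optimized approximate greedy algorithm. It uses some performance improvements such as bins caching;
- gpu_exact: run the Exact Greedy Algorithm on the GPU;
- gpu_hist: run the hist algorithm on the GPU;
max_leaves [default=0]: maximum number of leaf nodes; only relevant when grow_policy=lossguide;
max_bin [default=256]: only used when tree_method=hist; maximum number of discrete bins used to bucket continuous features;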

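1.3 Learning Task Parameters

objective [default=reg:linear] (reg:linear is named reg:squarederror in later releases)
- reg:linear: linear regression;
- reg:logistic: logistic regression;
- binary:logistic: logistic regression for binary classification; outputs a probability, which is why the code later thresholds predictions with > 0.5;
- binary:logitraw: logistic regression for binary classification; outputs the score before the logistic transformation;
- binary:hinge: hinge loss for binary classification; predicts 0 or 1 rather than a probability;
- multi:softmax: multiclass classification using the softmax objective; the number of classes must also be set with num_class=n;
- multi:softprob: same as softmax, but outputs the predicted probability of each sample belonging to each class;
- rank:pairwise: used for ranking tasks with XGBoost.
base_score [default=0.5]: initial prediction score of all instances (global bias). With a sufficient number of iterations, changing this value has little effect;
eval_metric [default according to objective]: the default depends on objective (rmse for regression, error for classification; from XGBoost 1.3 on, binary:logistic defaults to logloss). Many more metrics are listed in the official API documentation.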

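II. Parameters of the XGBoost sklearn interface

XGBoost builds an ensemble of CART trees, and CART (Classification And Regression Tree) can be used for both classification and regression. Only XGBoost classification is covered here; regression is similar, and the details can be found in the official XGBoost API documentation.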

class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)

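max_depth: int, maximum depth of the base learners;
learning_rate: float, the learning rate, equivalent to "eta" in the native version;
n_estimators: int, number of boosted trees to fit;
silent: boolean, whether to print messages while running boosting;
objective: string or callable, the learning task and corresponding learning objective, or a custom objective function; see objective in the native version;
booster: string, which booster to use: gbtree, gblinear or dart;
n_jobs: int, number of parallel threads used to run XGBoost;
gamma: float, minimum loss reduction required to make a further partition on a leaf node of the tree, i.e. the complexity cost of adding a new split;
min_child_weight: int, minimum sum of instance weights needed in a child node;
max_delta_step: int, maximum delta step allowed for each tree's weight estimation;
subsample: float, subsample ratio of the training instances;
colsample_bytree: float, subsample ratio of columns when constructing each tree;
colsample_bylevel: float, subsample ratio of columns for each split, in each level;
reg_alpha: float, equivalent to alpha in the native version; weight of the L1 regularization term;
reg_lambda: float, equivalent to lambda in the native version; weight of the L2 regularization term;
scale_pos_weight: float, balances the weights of the positive and negative classes;
base_score: initial prediction score of all instances (global bias);
random_state: int, random seed;
missing: float, optional, the value in the data to be treated as missing. If None, defaults to np.nan.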

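III. Code

Data dictionary

survival------whether the passenger survived; 0=No, 1=Yes
pclass------ticket class; 1=1st, 2=2nd, 3=3rd
sex------passenger's sex
Age------passenger's age
sibsp------number of siblings and spouses aboard
parch------number of parents and children aboard
ticket------ticket number
fare------ticket fare
cabin------cabin number
embarked------port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

Feature processing

Importing modules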

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn import preprocessing 
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
 
import warnings
warnings.filterwarnings('ignore')

Loading the training and test sets

train = pd.read_csv("D:\\Users\\Downloads\\《泰坦尼克號數據分析項目數據》\\train.csv", index_col=0)
test = pd.read_csv("D:/Users/Downloads/《泰坦尼克號數據分析項目數據》/test.csv", index_col=0)
train.info()  # print summary information about the training data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

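The output shows that the training set has 891 samples and 11 columns (the Survived label plus 10 features), occupying about 83.5 KB of memory. Two features have many missing values, Age (714 of 891 present) and Cabin (204 of 891 present), while Embarked has only two missing values. There are three dtypes: float64(2), int64(4), object(5).

The next step is to handle the missing values. Here, continuous features are filled with the column mean and categorical features with the column mode. Another option is to predict the missing values with a machine learning model and fill in the predictions; that approach is not covered here.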

def handle_na(train, test):  # fill missing values (Cabin is dropped later, not here)
    fare_mean = train['Fare'].mean()  # Fare has missing values in the test set; fill with the training mean
    test.loc[pd.isnull(test.Fare), 'Fare'] = fare_mean

    embarked_mode = train['Embarked'].mode()  # fill with the mode
    train.loc[pd.isnull(train.Embarked), 'Embarked'] = embarked_mode[0]

    age_mean = train['Age'].mean()  # fill Age with the training mean in both sets
    train.loc[pd.isnull(train.Age), 'Age'] = age_mean
    test.loc[pd.isnull(test.Age), 'Age'] = age_mean
    return train, test

new_train, new_test = handle_na(train, test)  # fill the missing values
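
The same mean/mode strategy can also be written with DataFrame.fillna and a dict of per-column fill values. This is an alternative sketch on a tiny made-up table, not the author's code; the frames train_df and test_df are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up tables standing in for the Titanic train/test frames.
train_df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "S", None, "C"],
})
test_df = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Embarked": ["Q", None],
})

# Statistics come from the training set only, so nothing leaks from the test set.
fill_values = {
    "Age": train_df["Age"].mean(),               # continuous feature: mean
    "Embarked": train_df["Embarked"].mode()[0],  # categorical feature: mode
}
train_filled = train_df.fillna(fill_values)
test_filled = test_df.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Computing the fill values once and applying them to both frames keeps the two sets consistent.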

Embarked, Sex and Pclass are categorical features, so they are one-hot encoded with get_dummies.

# One-hot encode Embarked, Sex and Pclass with get_dummies
new_train = pd.get_dummies(new_train, columns=['Embarked', 'Sex', 'Pclass'])
new_test = pd.get_dummies(new_test, columns=['Embarked', 'Sex', 'Pclass'])
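
Encoding the two frames separately only yields matching columns when every category appears in both. As a hedged sketch of one way to guard against a mismatch, the example below (with made-up frames train_df and test_df) reindexes the encoded test frame to the training columns.

```python
import pandas as pd

# Made-up frames: the test set happens to contain only one of the three ports.
train_df = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test_df = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.46, 13.0]})

train_enc = pd.get_dummies(train_df, columns=["Embarked"], dtype=float)
test_enc = pd.get_dummies(test_df, columns=["Embarked"], dtype=float)

# Align the test columns to the training layout; dummies absent from the test set become 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(list(train_enc.columns))
print(test_enc)
```

Without the reindex step, the two arrays would have different widths and the trained model could not be applied to the test set.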

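Next, drop the Name, Ticket, Cabin and Survived columns, which are not used as features here (PassengerId is already the index because of index_col=0).

target = new_train['Survived'].values
# Drop Name, Ticket, Cabin and Survived, then convert to NumPy arrays
# (astype(float) keeps the array numeric even when get_dummies yields bool columns)
df_train = new_train.drop(['Name','Ticket','Cabin','Survived'], axis=1).astype(float).values
df_test = new_test.drop(['Name','Ticket','Cabin'], axis=1).astype(float).values

Both the features and the labels are now NumPy arrays, a form that xgb.DMatrix and the sklearn interface both accept.

Using the native version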

X_train,X_test,y_train,y_test = train_test_split(df_train,target,test_size = 0.3,random_state = 1)

data_train = xgb.DMatrix(X_train, label=y_train)  # the native API requires converting the data to DMatrix
data_test = xgb.DMatrix(X_test, label=y_test)

# eval_metric is set explicitly so the log reports the error rate regardless of XGBoost version
param = {'max_depth': 5, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': 'error'}
watchlist = [(data_test, 'test'), (data_train, 'train')]  # validation sets evaluated at every round
n_round = 3
booster = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist)

# Compute the accuracy
y_predicted = booster.predict(data_test)   # predicted probabilities on the held-out split
y = data_test.get_label()   # DMatrix method that returns the labels

# y_predicted > 0.5 is a boolean array and y holds 0/1, so the comparison is True exactly
# when the prediction is correct; summing a boolean array counts the True values
accuracy = sum(y == (y_predicted > 0.5))
accuracy_rate = float(accuracy) / len(y_predicted)
print ('Total samples: {0}'.format(len(y_predicted)))
print ('Correct: {0}'.format(accuracy) )
print ('Accuracy: {0:.3f}'.format((accuracy_rate)))

[0]	test-error:0.231343	train-error:0.126806
[1]	test-error:0.227612	train-error:0.117175
[2]	test-error:0.223881	train-error:0.104334
Total samples: 268
Correct: 208
Accuracy: 0.776
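
The accuracy computation above can be reproduced on a handful of values to see exactly what the thresholding does. The probabilities below are made up for illustration; scikit-learn's accuracy_score is shown as an equivalent shortcut.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0], dtype=float)  # labels, as get_label() would return them
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1])   # hypothetical predicted probabilities

y_pred = (y_prob > 0.5).astype(int)                 # threshold probabilities into classes
n_correct = int((y_pred == y_true).sum())           # True counts as 1 when summed
acc_manual = n_correct / len(y_true)
acc_sklearn = accuracy_score(y_true, y_pred)

print(n_correct, acc_manual, acc_sklearn)
```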

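Using the sklearn interface version

The sklearn interface of XGBoost is used the same way as any other scikit-learn model; a brief example:

X_train,X_test,y_train,y_test = train_test_split(df_train,target,test_size = 0.3,random_state = 1)

# The parameter is learning_rate; the original post spelled it learn_rate, which XGBoost does not
# recognize, so the score printed below was obtained with the default learning rate
model = xgb.XGBClassifier(max_depth=3, n_estimators=200, learning_rate=0.01)
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)  # accuracy on the held-out split
print('test_score: {0}'.format(test_score))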

test_score: 0.7723880597014925

Comparing with other models

# Fit several models and compare their scores
from sklearn.model_selection import ShuffleSplit  # shuffles the data and splits it into train/test sets with a chosen test fraction
model_lr = LogisticRegression()   # logistic regression
model_rf = RandomForestClassifier(n_estimators=200)  # random forest
# sklearn interface; learning_rate replaces the unrecognized learn_rate of the original
model_xgb = xgb.XGBClassifier(max_depth=5, n_estimators=200, learning_rate=0.01)
models = [model_lr, model_rf, model_xgb]
model_name = ['LogisticRegression', 'RandomForest', 'XGBoost']

cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=1)
for i in range(3):
    print(model_name[i] + ":")
    model = models[i]
    for train_idx, test_idx in cv.split(df_train):
        model.fit(df_train[train_idx], target[train_idx])
        train_score = model.score(df_train[train_idx], target[train_idx])
        test_score = model.score(df_train[test_idx], target[test_idx])
        # {1} selects test_score; the original used {0} twice and so printed the train score in both columns
        print('train score: {0:.5f} \t test score: {1:.5f}'.format(train_score, test_score))

Output of the original run. Because its format string used {0} for both columns, the "test score" values below are copies of the train scores; the real test scores were not recorded.

LogisticRegression:
train score: 0.81220 	 test score: 0.81220
train score: 0.81701 	 test score: 0.81701
train score: 0.82183 	 test score: 0.82183
RandomForest:
train score: 0.98876 	 test score: 0.98876
train score: 0.99037 	 test score: 0.99037
train score: 0.99037 	 test score: 0.99037
XGBoost:
train score: 0.95185 	 test score: 0.95185
train score: 0.96629 	 test score: 0.96629
train score: 0.95345 	 test score: 0.95345
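
The manual ShuffleSplit loop can also be expressed with scikit-learn's cross_validate, which returns train and test scores as separate arrays and so avoids mixing them up when printing. This sketch uses synthetic data and LogisticRegression as stand-ins, not the Titanic arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic binary problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=1)
result = cross_validate(LogisticRegression(), X, y, cv=cv, return_train_score=True)

for train_score, test_score in zip(result["train_score"], result["test_score"]):
    print("train score: {0:.5f} \t test score: {1:.5f}".format(train_score, test_score))
```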

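A note: random_state has a surprisingly large influence here. Different values change which rows land in the training and test splits, so the resulting scores can differ noticeably, which contributes to the differences between the results above.

The loop below records the test score for 100 different values of random_state; the scores fluctuate quite a lot.

scores = []
for i in range(100):
    X_train,X_test,y_train,y_test = train_test_split(df_train,target,test_size = 0.3,random_state = i)

    # learning_rate, not learn_rate as in the original (XGBoost does not recognize learn_rate)
    model = xgb.XGBClassifier(max_depth=3, n_estimators=200, learning_rate=0.01)
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)  # accuracy on the held-out split
    print('{0} :test_score: {1}'.format(i,test_score))
    scores.append(test_score)
plt.plot(list(range(100)), scores)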
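
Besides plotting, the spread across splits can be summarized numerically. The sketch below does this on synthetic data with LogisticRegression as a lightweight stand-in for the XGBoost model, so the numbers it prints say nothing about the Titanic results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

scores = []
for seed in range(30):
    # Only the split changes between iterations; the data and model settings stay fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    scores.append(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))

scores = np.array(scores)
print("mean={:.3f} std={:.3f} min={:.3f} max={:.3f}".format(
    scores.mean(), scores.std(), scores.min(), scores.max()))
```

Reporting the mean and standard deviation over many splits gives a steadier picture than any single train/test split.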
