Eight Kaggle Power Tools (1): An Introduction to XGBoost, the Competition Favorite

Reposted from the original article, 2019-05-28 22:30. Machine Learning/ Package Man

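XGBoost is a highly effective machine learning model. It shows up in all kinds of competitions and is also widely used in industry. Using the Titanic dataset, this article briefly walks through building an XGBoost model and touches on data cleaning and feature engineering along the way. It serves as my study notes on XGBoost; corrections of any errors or omissions are welcome.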

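Dataset

The dataset comes from the Kaggle competition Titanic: Machine Learning from Disaster. The task is to use the passenger ID, name, sex and other fields in the dataset to predict, with a machine learning model, whether each passenger on board survived.

The sinking of the RMS Titanic was a famous maritime disaster that took place in the North Atlantic from late on the night of 14 April 1912 into the early hours of 15 April. It happened on the fifth day of the Titanic's maiden voyage from Southampton, England, to New York, USA; the ship was the largest ocean liner in the world at the time. Before she sideswiped an iceberg at 23:40 on Sunday, 14 April 1912, she had received six sea-ice warnings, but when the lookouts sighted the iceberg the ship was travelling at close to her maximum speed. Unable to turn quickly enough, she took a blow along her starboard side that opened seams in the hull and flooded 5 of her 16 watertight compartments. The Titanic was designed to withstand the flooding of only 4 watertight compartments, so she sank. --Wikipedia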

import pandas as pd
pd.options.mode.chained_assignment = None

titanic = pd.read_csv('Titanic/train.csv')

titanic.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

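Data Cleaning

Keeping the dataset clean is critical to modeling. The reliability of a dataset can be assessed along the following dimensions:

Data reliability: this is guaranteed by the original dataset.
Data version control: input data strongly affects model building. If the training input keeps changing, a correct model may never be produced; a sudden change in the upstream process that supplies the data propagates into model building.
Feature necessity: the number of features and model accuracy are not strictly positively correlated.
Feature correlation: keep the number of mutually correlated features as small as possible during modeling.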
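
For the data-versioning point in the list above, one simple way to notice that an input file has changed is to record a checksum when the model is trained and compare it later. The following is only a sketch under that assumption: the helper `file_sha256` and the throwaway `train.csv` written to a temporary directory are made up for illustration and are not part of the original workflow.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file; in practice the path would be the training CSV
with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, 'train.csv')
    with open(csv_path, 'w') as handle:
        handle.write('PassengerId,Survived\n1,0\n')
    recorded = file_sha256(csv_path)  # digest stored alongside the trained model

    # Simulate an upstream process that silently appends a row
    with open(csv_path, 'a') as handle:
        handle.write('2,1\n')
    data_changed = file_sha256(csv_path) != recorded

print(data_changed)
```

If the digests differ, the model was trained on a different version of the data than the one now on disk.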

In this example, Name and Ticket have little bearing on survival, so we can consider dropping them.

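# Keep three features; .copy() makes X an independent frame rather than a slice of titanic
X = titanic[['Pclass', 'Age', 'Sex']].copy()
y = titanic['Survived']

# Fill missing passenger ages with the mean age
X['Age'] = X['Age'].fillna(X['Age'].mean())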

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

X_train.head(5)

	Pclass	Age	Sex
110	1	47.000000	male
360	3	40.000000	male
364	3	29.699118	male
320	3	22.000000	male
296	3	23.500000	male
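
The feature-necessity and feature-correlation criteria listed under Data Cleaning can be checked directly in pandas. The sketch below is only illustrative: the `sample` frame is hand-made stand-in data rather than rows read from train.csv, and the names `missing_counts`, `feature_corr` and `age_filled` are introduced here for the example.

```python
import pandas as pd

# Tiny hand-made frame standing in for train.csv (values are illustrative only)
sample = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2],
    'Age': [22.0, 38.0, None, 35.0, None, 54.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# Missing values per column: candidates for imputation or removal
missing_counts = sample.isna().sum()
print(missing_counts)

# Pairwise Pearson correlation between numeric features; strongly correlated
# pairs carry overlapping information
feature_corr = sample[['Pclass', 'Age', 'Fare']].corr()
print(feature_corr.round(2))

# Mean imputation as used in this article; NaN entries are skipped by mean()
age_filled = sample['Age'].fillna(sample['Age'].mean())
print(age_filled)
```

Columns with many missing values or with a strong correlation to another feature are the first candidates to review before modeling.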

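Feature Engineering

Conventional programming focuses on the process of writing code, whereas machine learning and data analysis practitioners concentrate on how the features of the data are represented. Through feature engineering (new features can be derived from logical operations on the dataset's original features), the developer builds a good representation of the data's features. The main tasks of feature engineering include:

Mapping string values to integers
Mapping enumerated values with one-hot encoding

In this example, we map male and female in the Sex column of the Titanic dataset to the integers 0 and 1.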

X_train['Sex'] = X_train['Sex'].map({'male':0,'female':1})
X_test['Sex'] = X_test['Sex'].map({'male':0,'female':1})

# Inspect the result of the mapping
X_train.head(5)

	Pclass	Age	Sex
110	1	47.000000	0
360	3	40.000000	0
364	3	29.699118	0
320	3	22.000000	0
296	3	23.500000	0
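
One-hot encoding, the second task listed above, is not used in this article's model. A minimal sketch of what it could look like, assuming the Embarked column (port codes S, C and Q) were kept as a feature; the `ports` frame is hypothetical sample data:

```python
import pandas as pd

# Hypothetical mini-frame; Embarked holds port codes in the Titanic data
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(ports, columns=['Embarked'], prefix='Embarked', dtype=int)
print(encoded)
```

Unlike an integer mapping, this representation does not impose an artificial ordering on the categories, at the cost of one extra column per category.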

from sklearn.ensemble import RandomForestClassifier

titanic_rf = RandomForestClassifier()

titanic_rf.fit(X_train, y_train)

print('The accuracy of Random Forest Classifier on testing set:', titanic_rf.score(X_test, y_test))

The accuracy of Random Forest Classifier on testing set: 0.8026905829596412

from xgboost import XGBClassifier

titanic_xgb = XGBClassifier()
titanic_xgb.fit(X_train, y_train)

print('The accuracy of eXtreme Gradient Boosting Classifier on testing set:', titanic_xgb.score(X_test, y_test))

The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.8385650224215246

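Classification Report

Metrics commonly used in classification are precision, recall and the F1 score, defined as follows:

Precision: \(Precision = \frac{T_P}{(T_P + F_P)}\)
Recall: \(Recall = \frac{T_P}{(T_P + F_N)}\)
F1 score: \(F1 = 2 \times \frac{Precision \times Recall}{(Precision + Recall)}\)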
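
As a worked illustration of the three formulas, the sketch below counts true positives, false positives and false negatives by hand on a small made-up label list. The function `precision_recall_f1` and the sample labels are hypothetical and exist only for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 2 false positives, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.6 0.75 0.6667
```

The order of the two arguments matters: exchanging the true and predicted labels exchanges precision and recall, while F1 stays the same.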

from sklearn.metrics import classification_report

rf_result = titanic_rf.predict(X_test)
xgb_result = titanic_xgb.predict(X_test)

# classification_report expects the true labels first, then the predictions
print('Random forest model: \n ' + classification_report(y_test, rf_result, digits=4))
print('XGBoost model: \n ' + classification_report(y_test, xgb_result, digits=4))

Random forest model: 
               precision    recall  f1-score   support

           0     0.8125    0.8731    0.8417       134
           1     0.7848    0.6966    0.7381        89

   micro avg     0.8027    0.8027    0.8027       223
   macro avg     0.7987    0.7849    0.7899       223
weighted avg     0.8014    0.8027    0.8004       223

XGBoost model: 
               precision    recall  f1-score   support

           0     0.8311    0.9179    0.8723       134
           1     0.8533    0.7191    0.7805        89

   micro avg     0.8386    0.8386    0.8386       223
   macro avg     0.8422    0.8185    0.8264       223
weighted avg     0.8400    0.8386    0.8357       223

The weighted-average F1 scores of the random forest and XGBoost models are 0.8004 and 0.8357 respectively, so XGBoost comes out slightly ahead on the Titanic dataset.
