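1 Case background

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. The tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people were more likely to survive than others, such as women, children, and the upper class. In this case study, we ask you to analyze which kinds of people were likely to survive. In particular, we ask you to apply machine learning tools to predict which passengers survived the tragedy.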

Case: https://www.kaggle.com/c/titanic/overview
The features in the dataset we extracted include ticket class, survival, passenger class, age, embarkation, home.dest, room, boat, and sex.

Inspecting the data shows:

1 The passenger class (pclass: 1, 2, 3) is a proxy for socio-economic status.
2 The age column has missing values.
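
A quick way to confirm the second observation is to count the missing cells per column. The sketch below uses a small hypothetical table (`toy`, with made-up rows) in place of the real passenger data; only the column names follow the ones used later in this walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real passenger table.
toy = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd"],
    "age": [29.0, np.nan, 2.0, np.nan],
    "sex": ["female", "male", "female", "male"],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real data, the same call on the loaded DataFrame shows which columns need a missing-value strategy before training.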

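2 Step-by-step analysis

1. Load the data
2. Basic data processing
- 2.1 Select the feature values and the target value
- 2.2 Handle missing values
- 2.3 Split the dataset
3. Feature engineering (dictionary feature extraction)
4. Machine learning (decision tree)
5. Model evaluation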

3 Code implementation

Import the required modules

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

1. Load the data

# The data can be downloaded from GitHub
titanic = pd.read_csv("data/titanic/train.csv")
titanic

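2. Basic data processing

# 2.1 Select the feature values and the target value
# Column names must match the CSV header; Kaggle's train.csv capitalizes them (Pclass, Age, Sex, Survived)
x = titanic[["pclass", "age", "sex"]].copy()
y = titanic["survived"]
# 2.2 Handle missing values
# Fill missing ages with the mean age; assigning the result back avoids chained-assignment problems with inplace=True
x["age"] = x["age"].fillna(x["age"].mean())
# 2.3 Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)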
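
To see what steps 2.2 and 2.3 do, the same operations can be run on a tiny made-up table. The rows below are hypothetical, not taken from the Titanic data. With its default settings, train_test_split holds out 25% of the rows for testing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows; two ages are missing on purpose.
toy = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd", "1st", "2nd", "3rd", "1st"],
    "age": [29.0, np.nan, 2.0, 40.0, np.nan, 30.0, 25.0, 54.0],
    "sex": ["female", "male", "female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

x = toy[["pclass", "age", "sex"]].copy()
y = toy["survived"]

# Mean of the six known ages is 180 / 6 = 30.0; it replaces both NaN cells.
x["age"] = x["age"].fillna(x["age"].mean())

# Default test_size is 0.25, so 8 rows split into 6 training and 2 test rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
print(x["age"].tolist())
print(len(x_train), len(x_test))
```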

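3. Feature engineering (dictionary feature extraction)

The features contain categorical values, which need one-hot encoding (DictVectorizer). x.to_dict(orient="records") converts the DataFrame rows into the list of dictionaries that DictVectorizer expects.

# Convert x to dictionary data with x.to_dict(orient="records")
# [{"pclass": "1st", "age": 29.00, "sex": "female"}, {}]
# Convert to a list of dictionaries (do this only once; a list has no to_dict method)
x_train = x_train.to_dict(orient="records")
x_test = x_test.to_dict(orient="records")
# Feature transformation
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train)
# Only transform the test set, so it reuses the columns learned from the training set
x_test = transfer.transform(x_test)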

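4. Decision tree model training and model evaluation

In the decision tree API, if max_depth is not specified, the tree keeps splitting according to the information-entropy criterion until it can split no further (all leaves are pure, or all leaves hold fewer than min_samples_split samples). Here we can specify the tree depth to limit the size of the tree.

# 4. Machine learning (decision tree)
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=5)
estimator.fit(x_train, y_train)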
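
The effect of max_depth can be checked directly by comparing a limited tree with an unlimited one. The sketch below assumes synthetic noisy data generated with NumPy (not the Titanic data) and an arbitrary depth limit of 3, chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.RandomState(0)
features = rng.rand(200, 3)
labels = (features[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

# No depth limit: the tree grows until it can split no further.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
full_tree.fit(features, labels)

# Depth limit of 3: growth stops early, giving a much smaller tree.
shallow_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
shallow_tree.fit(features, labels)

print("full tree:   ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("shallow tree:", shallow_tree.get_depth(), "levels,", shallow_tree.get_n_leaves(), "leaves")
```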

5. Model evaluation

# 5. Model evaluation
# Mean accuracy on the test set
estimator.score(x_test, y_test)
# Predicted labels for the test set
estimator.predict(x_test)
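
For a classifier, score returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. The sketch below checks this on synthetic data (an assumption made for self-containment; the numbers say nothing about the Titanic model).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label is 1 when the first feature exceeds 0.5.
rng = np.random.RandomState(0)
features = rng.rand(200, 3)
labels = (features[:, 0] > 0.5).astype(int)

x_train, x_test, y_train, y_test = train_test_split(features, labels, random_state=22)
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
estimator.fit(x_train, y_train)

y_predict = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
# The same number computed by hand: share of matching predictions.
manual_accuracy = float(np.mean(y_predict == y_test))
print(score, manual_accuracy, accuracy_score(y_test, y_predict))
```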

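4 Decision tree visualization

4.1 Save the tree structure to a dot file

sklearn.tree.export_graphviz()

This function exports the tree in DOT format:
tree.export_graphviz(estimator, out_file='tree.dot', feature_names=['', ''])

# 6. Decision tree visualization
# feature_names must list the columns produced by DictVectorizer, in order (get_feature_names_out needs scikit-learn >= 1.0); the target 'survived' is not a feature
export_graphviz(estimator, out_file="./data/tree.dot", feature_names=list(transfer.get_feature_names_out()))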
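
When out_file is None, export_graphviz returns the DOT source as a string instead of writing a file, which is convenient for inspecting the output. The sketch below uses four hypothetical, already-encoded rows; the column names imitate what DictVectorizer would produce and are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical encoded rows: columns are age, sex=female, sex=male.
feature_names = ["age", "sex=female", "sex=male"]
features = np.array([
    [29.0, 1.0, 0.0],
    [45.0, 1.0, 0.0],
    [40.0, 0.0, 1.0],
    [2.0, 0.0, 1.0],
])
labels = np.array([1, 1, 0, 0])

estimator = DecisionTreeClassifier(criterion="entropy", random_state=0)
estimator.fit(features, labels)

# out_file=None returns the DOT text rather than writing it to disk.
dot_text = export_graphviz(estimator, out_file=None, feature_names=feature_names)
print(dot_text)
```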

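The contents of a dot file look like the following. Note that this sample comes from a tree trained on the iris dataset (its nodes split on petal and sepal measurements); the file generated for the Titanic tree has the same layout but uses the Titanic feature names.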

digraph Tree {
node [shape=box] ;
0 [label="petal length (cm) <= 2.45\nentropy = 1.584\nsamples = 112\nvalue = [39, 37, 36]"] ;
1 [label="entropy = 0.0\nsamples = 39\nvalue = [39, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.75\nentropy = 1.0\nsamples = 73\nvalue = [0, 37, 36]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="petal length (cm) <= 5.05\nentropy = 0.391\nsamples = 39\nvalue = [0, 36, 3]"] ;
2 -> 3 ;
4 [label="sepal length (cm) <= 4.95\nentropy = 0.183\nsamples = 36\nvalue = [0, 35, 1]"] ;
3 -> 4 ;
5 [label="petal length (cm) <= 3.9\nentropy = 1.0\nsamples = 2\nvalue = [0, 1, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
5 -> 6 ;
7 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
5 -> 7 ;
8 [label="entropy = 0.0\nsamples = 34\nvalue = [0, 34, 0]"] ;
4 -> 8 ;
9 [label="petal width (cm) <= 1.55\nentropy = 0.918\nsamples = 3\nvalue = [0, 1, 2]"] ;
3 -> 9 ;
10 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 0, 2]"] ;
9 -> 10 ;
11 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
9 -> 11 ;
12 [label="petal length (cm) <= 4.85\nentropy = 0.191\nsamples = 34\nvalue = [0, 1, 33]"] ;
2 -> 12 ;
13 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
12 -> 13 ;
14 [label="entropy = 0.0\nsamples = 33\nvalue = [0, 0, 33]"] ;
12 -> 14 ;
}

4.2 Display the structure on a website

http://webgraphviz.com/
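
If a website is not convenient, scikit-learn can also print the tree structure as plain text with sklearn.tree.export_text. This is offered as an alternative sketch, not as part of the original workflow; the encoded rows and column names below are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded rows: columns are age, sex=female, sex=male.
feature_names = ["age", "sex=female", "sex=male"]
features = np.array([
    [29.0, 1.0, 0.0],
    [45.0, 1.0, 0.0],
    [40.0, 0.0, 1.0],
    [2.0, 0.0, 1.0],
])
labels = np.array([1, 1, 0, 0])

estimator = DecisionTreeClassifier(criterion="entropy", random_state=0)
estimator.fit(features, labels)

# export_text renders each split and leaf as an indented text line.
tree_text = export_text(estimator, feature_names=feature_names)
print(tree_text)
```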

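5 Decision tree summary

Advantages:
- Simple to understand and interpret; the tree can be visualized.
Disadvantages:
- A decision tree learner can create an over-complex tree that does not generalize well to new data, so it overfits easily.
Improvements:
- Pruning (the CART algorithm)
- Random forest (a form of ensemble learning)
  Note: decision trees are widely used in important business decision processes because of their strong analytical ability, and they can also be used for feature selection.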
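
The two improvements can be tried out as follows. This is a minimal sketch on synthetic noisy data, assuming scikit-learn's cost-complexity pruning (the ccp_alpha parameter) as the pruning method and an arbitrary ccp_alpha of 0.01 and 50 trees; none of these values come from the original case, and the printed scores are not Titanic results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data, so an unpruned tree has noise to overfit.
rng = np.random.RandomState(0)
features = rng.rand(300, 3)
labels = (features[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)
x_train, x_test, y_train, y_test = train_test_split(features, labels, random_state=22)

# Unpruned tree: grows until every training sample is classified correctly.
full_tree = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
# Pruned tree: ccp_alpha > 0 removes branches that add little purity.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(x_train, y_train)
# Random forest: an ensemble of trees trained on bootstrap samples.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_train, y_train)

print("leaves:", full_tree.get_n_leaves(), "unpruned vs", pruned_tree.get_n_leaves(), "pruned")
for name, model in [("unpruned", full_tree), ("pruned", pruned_tree), ("forest", forest)]:
    print(name, "train:", model.score(x_train, y_train), "test:", model.score(x_test, y_test))
```

Comparing the train and test scores of the unpruned tree gives a rough picture of overfitting; whether pruning or a forest helps on a given dataset has to be checked empirically.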
