kaggle-titanic: A Data Analysis Walkthrough


1. Import the required packages

# -*- coding:utf-8 -*-

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Data-handling packages
import numpy as np
import pandas as pd

# Algorithm packages
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier

# Helpers for preprocessing and evaluation
# (these modules are from older scikit-learn; in recent versions they live in
# sklearn.model_selection, and Imputer became sklearn.impute.SimpleImputer)
from sklearn.preprocessing import Imputer, Normalizer, scale
from sklearn.cross_validation import train_test_split, StratifiedKFold, cross_val_score
from sklearn.feature_selection import RFECV
import sklearn.preprocessing as preprocessing
from sklearn.learning_curve import learning_curve
from sklearn.metrics import accuracy_score

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
%matplotlib inline

# Visualization settings
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8, 6

 

2. Load the data

train = pd.read_csv('base_data/train.csv')
test = pd.read_csv('base_data/test.csv')
full = train.append(test, ignore_index=True)  # keep train and test in the same format
titanic = full[:891]
titanic_pre = full[891:]
del train, test
print('DataSets: ', 'full: ', full.shape, 'titanic: ', titanic.shape)
 

3. Explore the data

# Peek at the data
titanic.head()

# Overall summary (only works for numeric columns)
titanic.describe()
>>>
              Age        Fare       Parch  PassengerId      Pclass       SibSp    Survived
count  714.000000  891.000000  891.000000   891.000000  891.000000  891.000000  891.000000
mean    29.699118   32.204208    0.381594   446.000000    2.308642    0.523008    0.383838
std     14.526497   49.693429    0.806057   257.353842    0.836071    1.102743    0.486592
min      0.420000    0.000000    0.000000     1.000000    1.000000    0.000000    0.000000
25%     20.125000    7.910400    0.000000   223.500000    2.000000    0.000000    0.000000
50%     28.000000   14.454200    0.000000   446.000000    3.000000    0.000000    0.000000
75%     38.000000   31.000000    0.000000   668.500000    3.000000    1.000000    1.000000
max     80.000000  512.329200    6.000000   891.000000    3.000000    8.000000    1.000000

# Summary of the string (non-numeric) columns
titanic.describe(include=['O'])
>>>
              Cabin Embarked     Name   Sex    Ticket
count           204      889      891   891       891
unique          147        3      891     2       681
top     C23 C25 C27        S  Graham,  male  CA. 2343
freq              4      644        1   577         7

# Check the whole dataset (training plus prediction rows) to see which
# columns have missing values and need filling
titanic.info()
titanic.isnull().sum().sort_values(ascending=False)
# The following columns have missing values:
>>>
Cabin       687
Age         177
Embarked      2

full.info()
full.isnull().sum().sort_values(ascending=False)
# The following columns have missing values:
>>>
Cabin       1014
Age          263
Embarked       2
Fare           1

 

Summary: the data contain 12 variables in total, 7 numeric and 5 categorical.

PassengerId (ignore): the passenger's ID number. It obviously has no bearing whatsoever on survival and only distinguishes rows, so we won't consider it.

Survived (target): the passenger's final survival status, the variable we are predicting. The mean already tells us that the overall survival rate is about 38%.
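A one-line sanity check against the titanic frame loaded above: since Survived takes only the values 0 and 1, its mean is exactly the survival rate.

print(titanic['Survived'].mean())  # 0.3838..., i.e. roughly 38% survived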

Pclass (keep): socio-economic class. This is clearly related to survival: the wealthy stayed in higher-class cabins, likely enjoyed better service, and tended to receive preferential treatment in an emergency, so it is obviously a variable to include. It takes the levels 1, 2 and 3, and level 1 has the highest rescue rate.

Name (keep): at first glance this variable looks useless; after all, a name by itself can't tell you whether someone survived. But look closely at the data and you'll see that every name contains a title such as Mr, Mrs or Miss, which hints at sex and age. So we simply turn the name into a categorical variable with the three states Mr, Mrs and Miss. To use it in a machine-learning model we need an encoding, and the most direct idea is 0, 1, 2. But is that reasonable? In terms of distance, Mr would then be closer to Mrs than to Miss, which is clearly wrong, because we regard the three states as equal.

So here comes a key point: for categorical variables like this the usual measure is one-hot encoding. With n states, each state is represented by an n-bit code that has a 1 in one position and 0 everywhere else. Viewed as vectors, these are the n basis vectors of an n-dimensional space, so the states are plainly on an equal footing. In this example we represent Mr, Mrs and Miss as 100, 010 and 001 respectively, as the sketch below shows.
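A minimal illustration (a toy sketch with made-up Title values, not the exact code used later in this post): pd.get_dummies produces precisely this kind of encoding.

import pandas as pd

titles = pd.Series(['Mr', 'Mrs', 'Miss', 'Mr'], name='Title')
print(pd.get_dummies(titles, prefix='Title'))
#    Title_Miss  Title_Mr  Title_Mrs
# 0           0         1          0
# 1           0         0          1
# 2           1         0          0
# 3           0         1          0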

Sex (keep): sex is certainly important. All of humanity honors "Lady First", so in a crisis the gentlemen would surely let the ladies escape first, and women's survival chances should be much higher. Like the title, sex is a categorical variable with equal states, so we will one-hot encode it in the same way.

Age (keep): like sex, age obviously plays a major role, since respecting the old and cherishing the young is valued everywhere. What matters for survival is mainly which age bracket a person falls into, so we turn age into a categorical variable, say child below 18, adult from 18 to 50, and elder above 50, and then one-hot encode it (see the pd.cut sketch below). There is one more problem: only 714 age values are present. Something this important can't be left incomplete, so we have to find a way to fill it in.
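A hedged sketch of that bracketing (the child/adult/elder edges below are the example just given, not the bins actually used in section 4.2):

import pandas as pd

ages = pd.Series([5, 22, 35, 64], name='Age')  # toy values
# Cut into the three example brackets, then one-hot encode the result
bands = pd.cut(ages, bins=[0, 18, 50, 100], labels=['child', 'adult', 'elder'])
print(pd.get_dummies(bands, prefix='Age'))  # one column per age bracket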

Key point number two: how do we handle missing values? The simplest approach is to throw away any sample with a missing value; this suits situations with plenty of samples where discarding a few is acceptable. It doesn't use the information fully, but it introduces no extra error. The go-through-the-motions approach is to fill missing values with the mean or the median, usually the most convenient option, but one that often brings in quite a bit of error. The more responsible approach is to estimate the missing variable from the other variables, which is usually more reliable, though not unconditionally so: any estimate inevitably carries error; it just feels better… A sketch of all three strategies follows.
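A minimal sketch of the three strategies (illustrative only; the post itself fills Age with the per-Sex/Pclass median in section 4.2):

import pandas as pd

df = pd.read_csv('base_data/train.csv')

# 1. Drop rows whose Age is missing: introduces no error, but wastes data
dropped = df.dropna(subset=['Age'])

# 2. Fill with a global statistic: easiest, but can distort the distribution
filled = df['Age'].fillna(df['Age'].median())

# 3. Estimate from other variables, e.g. the median Age per (Sex, Pclass) group
group_median = df.groupby(['Sex', 'Pclass'])['Age'].transform('median')
estimated = df['Age'].fillna(group_median)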

SibSp (keep): the number of siblings or spouses aboard. I honestly can't say how this affects the final outcome, but it may well be useful when predicting age.

Parch (keep): the number of parents or children aboard. Like the previous variable, I haven't found a particularly good way to use it directly, but it should likewise be solid for predicting age.

Ticket (ignore): the ticket number. Frankly, I have no idea what this cryptic number could possibly be good for, so I dropped it without hesitation.

Fare (keep): the ticket price. Its effect resembles that of social class: the higher the fare, the more upscale the service, so the chance of rescue should be correspondingly higher. This variable must be included.

Cabin (ignore): the cabin number. It may reveal a bit about cabin class, but honestly it has far too many missing values; filling them in would probably introduce more error than the information is worth, so with a heavy heart we say goodbye.
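Quantifying that claim from the missing-value counts above: 687 of the 891 training rows lack a Cabin value, so the check is one line.

print(titanic['Cabin'].isnull().mean())  # ≈ 0.77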

Embarked (keep): the port of embarkation. In principle this variable shouldn't matter much, but since it is a categorical variable with only three states, we might as well process it and put it into the model; who knows, it might help. It also has two missing values; rather than going to great lengths to predict them, we simply set them to S, the port where the most passengers boarded.

4. Analyzing the relationship between the features and the outcome (visualization)

4.1 Correlation heatmap

def plot_correlation_map(df):
    corr = df.corr()
    _, ax = plt.subplots(figsize=(12, 10))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    _ = sns.heatmap(
        corr,
        cmap=cmap,
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        annot_kws={'fontsize': 15},
        fmt='.2f'
    )

plot_correlation_map(titanic)

[Figure: correlation heatmap of the titanic features]

4.2 Pivot tables and data processing

# Data processing: drop columns we won't use
titanic = titanic.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)

# Relationship between Pclass and Survived
titanic[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
>>> Class 1 has the highest survival rate; the rates rank 1 > 2 > 3.

titanic[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
      Sex  Survived
0  female  0.742038
1    male  0.188908
>>> Women survived at a much higher rate than men.

titanic[['Pclass', 'Sex', 'Survived']].groupby(['Pclass', 'Sex'], as_index=False).mean()
>>>
   Pclass     Sex  Survived
0       1  female  0.968085
1       1    male  0.368852
2       2  female  0.921053
3       2    male  0.157407
4       3  female  0.500000
5       3    male  0.135447
>>> Sex and Pclass can be combined into a single feature; the combination separates the data better than either variable alone.

# Data processing. With a 0/1 encoding every male row would collapse to 0 in
# the Pclass*Sex product below; mapping to 1/4 keeps all six (Pclass, Sex)
# combinations distinct (products 1, 2, 3, 4, 8, 12).
titanic['Sex'] = titanic['Sex'].map({'female': 1, 'male': 4}).astype(int)
# titanic['Sex'] = pd.factorize(titanic.Sex)[0]
>>> Sex is mapped to 1 and 4.

titanic['Pclass*Sex'] = titanic.Pclass * titanic.Sex
titanic['Pclass*Sex'] = pd.factorize(titanic['Pclass*Sex'])[0]

titanic['Title'] = titanic.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
pd.crosstab(titanic['Title'], titanic['Sex'])
>>>
Sex         1    4
Title
Capt        0    1
Col         0    2
Countess    1    0
Don         0    1
Dr          1    6
Jonkheer    0    1
Lady        1    0
Major       0    2
Master      0   40
Miss      182    0
Mlle        2    0
Mme         1    0
Mr          0  517
Mrs       125    0
Ms          1    0
Rev         0    6
Sir         0    1

titanic['Title'] = titanic['Title'].replace(['Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir'], 'Male_Rare')
titanic['Title'] = titanic['Title'].replace(['Countess', 'Lady', 'Mlle', 'Mme', 'Ms'], 'Female_Rare')
pd.crosstab(titanic['Title'], titanic['Sex'])
>>>
Sex            1    4
Title
Female_Rare    6    0
Male_Rare      1   20
Master         0   40
Miss         182    0
Mr             0  517
Mrs          125    0

# Map titles to numbers. Not really principled, but it lets them appear in the
# correlation matrix; for actual training they are one-hot encoded with
# get_dummies later.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Female_Rare": 5, 'Male_Rare': 6}  # fixed the original 'Male_Rale' typo, which silently mapped Male_Rare to NaN
titanic['Title'] = titanic['Title'].map(title_mapping)
titanic['Title'] = titanic['Title'].fillna(0)
titanic.head()

# Number of siblings/spouses
titanic[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
   SibSp  Survived
1      1  0.535885
2      2  0.464286
0      0  0.345395
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
# The boundaries between groups are not very sharp.

# Number of parents/children
titanic[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
   Parch  Survived
3      3  0.600000
1      1  0.550847
2      2  0.500000
0      0  0.343658
5      5  0.200000
4      4  0.000000
6      6  0.000000

# Consider the combined family size instead
titanic['FamilySize'] = titanic['Parch'] + titanic['SibSp'] + 1  # this line was missing in the original; FamilySize is used below
titanic[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
   FamilySize  Survived
3           4  0.724138
2           3  0.578431
1           2  0.552795
6           7  0.333333
0           1  0.303538
4           5  0.200000
5           6  0.136364
7           8  0.000000
8          11  0.000000

titanic.loc[titanic['FamilySize'] == 1, 'Family'] = 0
titanic.loc[(titanic['FamilySize'] > 1) & (titanic['FamilySize'] < 5), 'Family'] = 1
titanic.loc[(titanic['FamilySize'] >= 5), 'Family'] = 2

grid = sns.FacetGrid(titanic, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();


[Figure: Age histograms faceted by Pclass (rows) and Survived (columns)]

dataset = titanic[['Age', 'Sex', 'Pclass']].copy()  # .copy() avoids a SettingWithCopyWarning below
guess_ages = np.zeros((2, 3))
l = [1, 4]  # the two Sex codes
for i in range(len(l)):
    for j in range(0, 3):
        guess_df = dataset[(dataset['Sex'] == l[i]) & (dataset['Pclass'] == j + 1)]['Age'].dropna()
        # age_mean = guess_df.mean()
        # age_std = guess_df.std()
        # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
        age_guess = guess_df.median()
        # Round the age guess to the nearest 0.5
        guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5
print(guess_ages)

for i in range(len(l)):
    for j in range(0, 3):
        dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == l[i]) & (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]

titanic['Age'] = dataset['Age'].astype(int)
titanic.head(10)

def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, shade=True)
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()

plot_distribution(titanic, var='Age', target='Survived')


[Figure: Age density (KDE) by Survived]

sns.FacetGrid(titanic, col='Survived').map(plt.hist, 'Age', bins=20)


[Figure: Age histograms by Survived]

titanic['AgeBand'] = pd.cut(titanic['Age'], 4)
titanic[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
>>>
           AgeBand  Survived
0   (0.34, 20.315]  0.458101
1  (20.315, 40.21]  0.397403
2  (40.21, 60.105]  0.390625
3   (60.105, 80.0]  0.227273
>>> Split Age into bands along these boundaries.

# Data processing (the original mixed `full` and `titanic` here; all rows
# should index `titanic`)
titanic.loc[titanic['Age'] <= 20, 'Age'] = 4
titanic.loc[(titanic['Age'] > 20) & (titanic['Age'] <= 40), 'Age'] = 5
titanic.loc[(titanic['Age'] > 40) & (titanic['Age'] <= 60), 'Age'] = 6
titanic.loc[titanic['Age'] > 60, 'Age'] = 7

 

titanic[['Age', 'Survived']].groupby(['Age'], as_index=False).mean()
>>>
   Age  Survived
0  4.0  0.458101
1  5.0  0.397403
2  6.0  0.390625
3  7.0  0.227273

titanic[['Pclass', 'Age', 'Survived']].groupby(['Pclass', 'Age'], as_index=False).mean()
>>>
    Pclass  Age  Survived
0        1  4.0  0.809524
1        1  5.0  0.741573
2        1  6.0  0.580645
3        1  7.0  0.214286
4        2  4.0  0.742857
5        2  5.0  0.423077
6        2  6.0  0.387097
7        2  7.0  0.333333
8        3  4.0  0.317073
9        3  5.0  0.223958
10       3  6.0  0.057143
11       3  7.0  0.200000

titanic['Pclass*Age'] = titanic.Pclass * titanic.Age
titanic[['Pclass*Age', 'Survived']].groupby(['Pclass*Age'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
    Pclass*Age  Survived
0          4.0  0.809524
4          8.0  0.742857
1          5.0  0.741573
2          6.0  0.580645
5         10.0  0.423077
7         14.0  0.333333
6         12.0  0.331169
8         15.0  0.223958
3          7.0  0.214286
10        21.0  0.200000
9         18.0  0.057143

titanic[['Sex', 'Age', 'Survived']].groupby(['Sex', 'Age'], as_index=False).mean().sort_values(by='Survived', ascending=False)
>>>
   Sex  Age  Survived
3    1  7.0  1.000000
1    1  5.0  0.786765
2    1  6.0  0.755556
0    1  4.0  0.688312
4    4  4.0  0.284314
6    4  6.0  0.192771
5    4  5.0  0.184739
7    4  7.0  0.105263

 

def plot_bar(df, cols):
    # Survival counts per category of the given column, as a stacked bar chart
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the chart's alpha
    Survived_0 = df[cols][df.Survived == 0].value_counts()
    Survived_1 = df[cols][df.Survived == 1].value_counts()
    data = pd.DataFrame({u'Survived': Survived_1, u'UnSurvived': Survived_0})
    data.plot(kind='bar', stacked=True)
    plt.title(cols + u"_survived")
    plt.xlabel(cols)
    plt.ylabel(u"count")
    plt.show()

plot_bar(titanic, cols='Embarked')

 

[Figure: stacked bar chart of survival counts by Embarked]

def plot_categories(df, cat, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, row=row, col=col)
    facet.map(sns.barplot, cat, target)
    facet.add_legend()

plot_categories(titanic, cat='Embarked', target='Survived')

 

[Figure: mean Survived per Embarked category]

# Embarked has only two missing values in the whole dataset, so fill them
# with the mode, then convert the port to a number
freq_port = titanic.Embarked.dropna().mode()[0]
titanic['Embarked'] = titanic['Embarked'].fillna(freq_port)
titanic[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
titanic.head()

 

# Fare also has a missing value in the prediction rows and needs filling.
# Fill it with the median, then split Fare into bands.
titanic['Fare'].fillna(titanic['Fare'].dropna().median(), inplace=True)

 

# plot_distribution was already defined above; reuse it for Fare
plot_distribution(titanic, var='Fare', target='Survived')

 

[Figure: Fare density (KDE) by Survived]

titanic['FareBand'] = pd.cut(titanic['Fare'], 4)
titanic[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
>>>
             FareBand  Survived
0   (-0.512, 128.082]  0.368113
1  (128.082, 256.165]  0.724138
2  (256.165, 384.247]  0.666667
3  (384.247, 512.329]  1.000000

titanic.loc[titanic['Fare'] <= 128.082, 'Fare'] = 0
titanic.loc[(titanic['Fare'] > 128.082) & (titanic['Fare'] <= 256.165), 'Fare'] = 1
titanic.loc[(titanic['Fare'] > 256.165) & (titanic['Fare'] <= 384.247), 'Fare'] = 2
titanic.loc[titanic['Fare'] > 384.247, 'Fare'] = 3
titanic['Fare'] = titanic['Fare'].astype(int)

titanic = titanic.drop(['Name', 'FareBand'], axis=1)
plot_correlation_map(titanic)


[Figure: correlation heatmap after feature engineering]

titanic['Sex'] = titanic['Sex'].astype('int').astype('str')
titanic = pd.get_dummies(titanic, prefix='Sex')
titanic.head()

titanic['Embarked'] = titanic['Embarked'].astype('str')
titanic = pd.get_dummies(titanic, prefix='Embarked')

titanic['Title'] = titanic['Title'].astype('str')
titanic = pd.get_dummies(titanic, prefix='Title')

plot_correlation_map(titanic)


[Figure: correlation heatmap after one-hot encoding]

5. The complete data-processing pipeline

train = pd.read_csv('base_data/train.csv')
test = pd.read_csv('base_data/test.csv')
full = train.append(test, ignore_index=True)  # keep train and test in the same format
titanic = full[:891]
titanic_pre = full[891:]
del train, test
print('DataSets: ', 'full: ', full.shape, 'titanic: ', titanic.shape)

full = full.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)

full['Sex'] = full['Sex'].map({'female': 1, 'male': 4}).astype(int)
full['Pclass*Sex'] = full.Pclass * full.Sex
full['Pclass*Sex'] = pd.factorize(full['Pclass*Sex'])[0]

full['Title'] = full.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
full['Title'] = full['Title'].replace(['Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir'], 'Male_Rare')
full['Title'] = full['Title'].replace(['Countess', 'Lady', 'Mlle', 'Mme', 'Ms'], 'Female_Rare')
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Female_Rare": 5, 'Male_Rare': 6}  # 'Male_Rale' typo fixed, as in section 4.2
full['Title'] = full['Title'].map(title_mapping)
full['Title'] = full['Title'].fillna(0)
full.head()

full['FamilySize'] = full['Parch'] + full['SibSp'] + 1
full.loc[full['FamilySize'] == 1, 'Family'] = 0
full.loc[(full['FamilySize'] > 1) & (full['FamilySize'] < 5), 'Family'] = 1
full.loc[(full['FamilySize'] >= 5), 'Family'] = 2

dataset = full[['Age', 'Sex', 'Pclass']].copy()
guess_ages = np.zeros((2, 3))
l = [1, 4]
for i in range(len(l)):
    for j in range(0, 3):
        guess_df = dataset[(dataset['Sex'] == l[i]) & (dataset['Pclass'] == j + 1)]['Age'].dropna()
        # age_mean = guess_df.mean()
        # age_std = guess_df.std()
        # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
        age_guess = guess_df.median()
        # Round the age guess to the nearest 0.5
        guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5
print(guess_ages)

for i in range(len(l)):
    for j in range(0, 3):
        dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == l[i]) & (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]

full['Age'] = dataset['Age'].astype(int)
full.loc[full['Age'] <= 20, 'Age'] = 4
full.loc[(full['Age'] > 20) & (full['Age'] <= 40), 'Age'] = 5
full.loc[(full['Age'] > 40) & (full['Age'] <= 60), 'Age'] = 6
full.loc[full['Age'] > 60, 'Age'] = 7
full.head()

full['Pclass*Age'] = full.Pclass * full.Age

freq_port = full.Embarked.dropna().mode()[0]
full['Embarked'] = full['Embarked'].fillna(freq_port)
full[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
full['Embarked'] = full['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
full.head()

# Fare has a missing value in the prediction rows; fill it with the median,
# then split Fare into the bands found above
full['Fare'].fillna(full['Fare'].dropna().median(), inplace=True)
full.loc[full['Fare'] <= 128.082, 'Fare'] = 0
full.loc[(full['Fare'] > 128.082) & (full['Fare'] <= 256.165), 'Fare'] = 1
full.loc[(full['Fare'] > 256.165) & (full['Fare'] <= 384.247), 'Fare'] = 2
full.loc[full['Fare'] > 384.247, 'Fare'] = 3
full['Fare'] = full['Fare'].astype(int)

full = full.drop(['Name'], axis=1)
full = full.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)

full['Sex'] = full['Sex'].astype('int').astype('str')
full = pd.get_dummies(full, prefix='Sex')
full.head()
full['Embarked'] = full['Embarked'].astype('str')
full = pd.get_dummies(full, prefix='Embarked')
full.head()
full['Title'] = full['Title'].astype('str')
full = pd.get_dummies(full, prefix='Title')

full = full.drop(['Survived'], axis=1)


6. Model selection

def COMPARE_MODEL(train_valid_X, train_valid_y):
    def cross_model(model, train_X, train_y):
        cvscores = cross_val_score(model, train_X, train_y, cv=3, n_jobs=-1)
        return round(cvscores.mean() * 100, 2)

    train_X, valid_X, train_y, valid_y = train_test_split(train_valid_X, train_valid_y, train_size=.78, random_state=0)

    def once_model(clf):
        clf.fit(train_X, train_y)
        y_pred = clf.predict(valid_X)
        return round(accuracy_score(y_pred, valid_y) * 100, 2)

    logreg = LogisticRegression()
    acc_log = cross_model(logreg, train_valid_X, train_valid_y)
    acc_log_once = once_model(logreg)

    xgbc = XGBClassifier()
    acc_xgbc = cross_model(xgbc, train_valid_X, train_valid_y)
    acc_xgbc_once = once_model(xgbc)

    svc = SVC()
    acc_svc = cross_model(svc, train_valid_X, train_valid_y)
    acc_svc_once = once_model(svc)

    knn = KNeighborsClassifier(n_neighbors=3)
    acc_knn = cross_model(knn, train_valid_X, train_valid_y)
    acc_knn_once = once_model(knn)

    gaussian = GaussianNB()
    acc_gaussian = cross_model(gaussian, train_valid_X, train_valid_y)
    acc_gaussian_once = once_model(gaussian)

    perceptron = Perceptron()
    acc_perceptron = cross_model(perceptron, train_valid_X, train_valid_y)
    acc_perceptron_once = once_model(perceptron)

    linear_svc = LinearSVC()
    acc_linear_svc = cross_model(linear_svc, train_valid_X, train_valid_y)
    acc_linear_svc_once = once_model(linear_svc)

    sgd = SGDClassifier()
    acc_sgd = cross_model(sgd, train_valid_X, train_valid_y)
    acc_sgd_once = once_model(sgd)

    decision_tree = DecisionTreeClassifier()
    acc_decision_tree = cross_model(decision_tree, train_valid_X, train_valid_y)
    acc_decision_tree_once = once_model(decision_tree)

    random_forest = RandomForestClassifier(n_estimators=100)
    acc_random_forest = cross_model(random_forest, train_valid_X, train_valid_y)
    acc_random_forest_once = once_model(random_forest)

    gbc = GradientBoostingClassifier()
    acc_gbc = cross_model(gbc, train_valid_X, train_valid_y)
    acc_gbc_once = once_model(gbc)

    model_names = ['XGBC', 'Support Vector Machines', 'KNN', 'Logistic Regression',
                   'Random Forest', 'Naive Bayes', 'Perceptron',
                   'Stochastic Gradient Descent', 'Linear SVC',
                   'Decision Tree', 'GradientBoostingClassifier']
    models = pd.DataFrame({
        'Model': model_names,
        'Score': [acc_xgbc, acc_svc, acc_knn, acc_log, acc_random_forest, acc_gaussian,
                  acc_perceptron, acc_sgd, acc_linear_svc, acc_decision_tree, acc_gbc]})
    models_once = pd.DataFrame({
        'Model': model_names,
        'Score': [acc_xgbc_once, acc_svc_once, acc_knn_once, acc_log_once, acc_random_forest_once,
                  acc_gaussian_once, acc_perceptron_once, acc_sgd_once, acc_linear_svc_once,
                  acc_decision_tree_once, acc_gbc_once]})
    models = models.sort_values(by='Score', ascending=False)
    models_once = models_once.sort_values(by='Score', ascending=False)
    return models, models_once

train_valid_X = full[0:891]
train_valid_y = titanic.Survived
test_X = full[891:]
train_X, valid_X, train_y, valid_y = train_test_split(train_valid_X, train_valid_y, train_size=.78, random_state=0)
print(full.shape, train_X.shape, valid_X.shape, train_y.shape, valid_y.shape, test_X.shape)

models_cross, models_once = COMPARE_MODEL(train_valid_X, train_valid_y)

xgbc = XGBClassifier()
xgbc.fit(train_valid_X, train_valid_y)
test_Y_xgbc = xgbc.predict(test_X)
passenger_id = titanic_pre.PassengerId
test = pd.DataFrame({'PassengerId': passenger_id, 'Survived': np.round(test_Y_xgbc).astype('int32')})
print(test.head())
test.to_csv('titanic_pred_gbdc_sc.csv', index=False)

 

