Titanic Survival Prediction in Practice (Part 1: Data Analysis)


1. Preface

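  Although I have always counted as an IT guy, I had basically never touched cutting-edge IT; I have been doing production-side IT all along. On a whim I started studying algorithms, and I have now been at it for over half a year: from Python at the start, to linear regression, logistic regression, SVM, clustering, NLP, and neural networks such as CNN, RNN, and GAN. The sea of knowledge really is vast. Today I am going to start sharing some of what I have learned, and the first topic is of course the most basic one: Titanic survival prediction.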

2. Background

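  Background: The sinking of the Titanic is one of the most famous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons the disaster cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups were more likely to survive than others, such as women, children, and the upper class. In this case study we will use machine learning to predict which passengers survived.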

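  About the data:
    PassengerId: passenger ID
    Survived: whether the passenger survived (0 = died, 1 = survived)
    Pclass: cabin class
    Name: name
    Sex: sex
    Age: age
    SibSp: number of siblings and spouses aboard
    Parch: number of parents and children aboard
    Ticket: ticket number
    Fare: ticket price
    Cabin: cabin
    Embarked: port of embarkation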

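  About the exercise: this is a Kaggle competition problem. I worked with Python 3.7 + Anaconda3. There are three main steps: data analysis, data cleaning, and modeling and prediction.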

3. Dataset and Code

  https://pan.baidu.com/s/1JuCWhOEgvAV6gocicddQ4A       Extraction code: 1t9w

4. Hands-on

  a. Data analysis

    1. Import pandas, numpy, and matplotlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False  # so that the minus sign renders correctly

    2. Load the data

data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')

    3. Look at the data as a whole

data_train.shape  # shape of the training set
data_test.shape  # shape of the test set
data_train.head(4)  # look at the first few rows

data_train.info()  # non-null counts, missing values, and dtypes
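`info()` lists non-null counts, but it can be easier to read the gaps as a sorted table. The following is a minimal sketch of that idea; the `missing_summary` helper and the tiny `sample` frame are made up for illustration and only stand in for `data_train`.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for train.csv (values are illustrative only).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

def missing_summary(df):
    """Return per-column missing counts and ratios, most-missing first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "ratio": counts / len(df)})
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(data_train)` on the real training set should work the same way, assuming it was loaded as shown above.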

    

    4. Analyze each column on its own

    Pclass

data_train.Pclass.value_counts()  # passengers per cabin class

data_train.Pclass.isnull().sum()  # check for missing values

# Plot it: is cabin class related to survival?
fig = plt.figure()
fig.set(alpha=0.65)
ax = fig.add_subplot(3, 3, 1)
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
df_Pclass = pd.DataFrame({"Survived_1": Survived_1, "Survived_0": Survived_0})
df_Pclass.plot(kind='bar', stacked=True)
df_Pclass.plot(kind='bar', stacked=False)

# Takeaway: first-class passengers have a somewhat higher survival rate
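The bar charts compare raw counts; a survival rate per class makes the same point as a number. This is only a sketch on a made-up `sample` frame (not the real data), using a plain `groupby`:

```python
import pandas as pd

# Made-up rows for illustration; the real analysis would use data_train.
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Since Survived is 0/1, its mean within a group is that group's survival rate.
rate_by_class = sample.groupby("Pclass")["Survived"].agg(["count", "mean"])
print(rate_by_class)
```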

    Name

# I don't think there is anything to analyze here

    Sex

data_train.Sex.value_counts()  # count per category

data_train.Sex.isnull().sum()  # check for missing values

# Relationship between sex and survival
fig = plt.figure()
Survived_0 = data_train.Sex[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Sex[data_train.Survived == 1].value_counts()
df_sex = pd.DataFrame({'Survived_0': Survived_0, 'Survived_1': Survived_1})
df_sex.plot(kind='bar', stacked=True)
plt.show()

    Age

# Compare the age distributions of the two groups.
# The KDE is drawn over the ages themselves; a KDE over value_counts()
# would describe the per-age counts rather than the ages.
fig, ax = plt.subplots()
data_train.Age[data_train.Survived == 0].plot(kind='kde', ax=ax, label='Survived_0')
data_train.Age[data_train.Survived == 1].plot(kind='kde', ax=ax, label='Survived_1')
ax.legend()
plt.show()

 

 

 

# Look at it in age bands
def get_age(age):
    if 0 < age <= 8:
        return 0
    elif 8 < age <= 15:
        return 1
    elif 15 < age <= 22:
        return 2
    elif 22 < age <= 30:
        return 3
    elif 30 < age <= 38:
        return 4
    elif 38 < age <= 48:
        return 5
    elif 48 < age <= 58:
        return 6
    elif 58 < age:
        return 7
    else:
        return 8  # missing (NaN) or non-positive ages land here

# Apply once only; a second pass would re-bin the band codes themselves
data_train.Age = data_train.Age.apply(get_age)
fig = plt.figure()
Survived_0 = data_train.Age[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Age[data_train.Survived == 1].value_counts()
df_age = pd.DataFrame({'Survived_0': Survived_0, 'Survived_1': Survived_1})
# df_age.plot(kind='bar', stacked=True)
df_age.plot(kind='bar', stacked=False)
plt.show()

 

 

 

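# The young survive more often, but not always: the band around 28 to 38 also has quite a few survivors, and a finer split shows that young men in their twenties do not fare well either. Looks like this one needs one-hot encoding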
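The same banding can also be written with `pd.cut`, and the one-hot step mentioned above with `pd.get_dummies`. The sketch below assumes the same bin edges as `get_age` and code 8 for missing ages; the `ages` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, including a missing value, standing in for data_train.Age.
ages = pd.Series([0.42, 8, 15.5, 22, 29, 38, 40, 58, 71, np.nan])

# Right-closed intervals (0, 8], (8, 15], ..., (58, inf), matching get_age.
edges = [0, 8, 15, 22, 30, 38, 48, 58, np.inf]
bands = pd.cut(ages, bins=edges, labels=False)

# Anything outside the bins (missing or non-positive) becomes band 8.
bands = bands.fillna(8).astype(int)

# One-hot encode the band codes: one indicator column per band present.
dummies = pd.get_dummies(bands, prefix="Age")
print(bands.tolist())
print(dummies.columns.tolist())
```

One-hot encoding treats each band as its own feature, which fits the observation that survival does not rise or fall steadily with age.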

    SibSp: number of siblings and spouses aboard

data_train.SibSp.value_counts()  # distribution of values

 

 

 

fig = plt.figure()

Survived_0 = data_train.SibSp[data_train.Survived ==0].value_counts()
Survived_1 = data_train.SibSp[data_train.Survived ==1].value_counts()
df_sibsp = pd.DataFrame({"Survived_0":Survived_0,'Survived_1':Survived_1})
df_sibsp.plot(kind='bar',stacked=True)
plt.show()

 

 

 

# Those with none die more often; those with one or two have a higher survival rate; with more than three, nearly all died

    Parch: number of parents and children aboard

data_train.Parch.value_counts()
fig = plt.figure()
Survived_0 = data_train.Parch[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Parch[data_train.Survived ==1].value_counts()
df_parch = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_parch.plot(kind = 'bar',stacked = True)
df_parch.plot(kind = 'bar',stacked = False)

 

 

 

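# Again, those with no parents or children aboard have the higher death rate, yet those with one, two, or three turn out to have mostly died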

    Ticket

# Tickets: so many distinct values that it gives me a headache, so I am dropping this column

    Fare

def get_fare(fare):
    if  fare<=8:
        return 0
    elif 8<fare<=14:
        return 1
    elif 14<fare<=30:
        return 2
    elif 30<fare<=60:
        return 3
    elif 60<fare:
        return 4
data_train.Fare = data_train.Fare.apply(get_fare)
fig = plt.figure()
Survived_0 = data_train.Fare[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Fare[data_train.Survived ==1].value_counts()
df_fare = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_fare.plot(kind = 'bar')

 

 

# From this we can be fairly sure: the more expensive the ticket, the more likely the passenger survived
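As with age, the fare bands can be expressed with `pd.cut`, and the claim can be checked as a survival rate per band. This is a sketch on made-up rows; the edges mirror `get_fare`, and `FareBand` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up fares and outcomes for illustration only.
sample = pd.DataFrame({
    "Fare":     [7.25, 8.05, 13.0, 26.55, 53.1, 71.28, 7.9, 512.33],
    "Survived": [0, 0, 1, 1, 1, 1, 0, 1],
})

# Right-closed bins: <=8, (8, 14], (14, 30], (30, 60], >60, as in get_fare.
edges = [-np.inf, 8, 14, 30, 60, np.inf]
sample["FareBand"] = pd.cut(sample["Fare"], bins=edges, labels=False)

# Mean of the 0/1 Survived column per band is the survival rate of that band.
rate_by_band = sample.groupby("FareBand")["Survived"].mean()
print(rate_by_band)
```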

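    Cabin

data_train.Cabin.isnull().sum()
# Lots of distinct values again, not numeric, hard to bin, and many are missing. Still, we can check whether missing vs. present has anything to do with survival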

 

 

 

data_train.loc[ (data_train.Cabin.notnull()), 'Cabin' ] = "Yes"
data_train.loc[ (data_train.Cabin.isnull()), 'Cabin' ] = "No"
Survived_0 = data_train.Cabin[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Cabin[data_train.Survived ==1].value_counts()
df_fare = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_fare.plot(kind = 'bar',stacked=True)

 

 

 

# Seen this way, there does seem to be a pattern
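To put a number on that pattern without overwriting the `Cabin` column, one option is a separate indicator column plus a normalized cross-tabulation. This is a sketch on made-up rows, and `HasCabin` is a hypothetical column name, not something from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Cabin":    [np.nan, "C85", np.nan, "C123", np.nan, "E46"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Keep the raw Cabin values and record presence in a new column instead.
sample["HasCabin"] = np.where(sample["Cabin"].notnull(), "Yes", "No")

# normalize="index" turns each row into proportions that sum to 1.
table = pd.crosstab(sample["HasCabin"], sample["Survived"], normalize="index")
print(table)
```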

    Embarked

 

data_train.Embarked.value_counts()

 

 

 

Survived_0 = data_train.Embarked[data_train.Survived ==0].value_counts()
Survived_1 = data_train.Embarked[data_train.Survived ==1].value_counts()
df_fare = pd.DataFrame({'Survived_0':Survived_0,'Survived_1':Survived_1})
df_fare.plot(kind = 'bar',stacked=True)
df_fare.plot(kind = 'bar',stacked=False)

 

 

 

# Port C is interesting: possibly more than half survived. The other two show no clear pattern

    

That wraps up the single-column analysis. Next, let's see what looking at columns in combination shows.

 

fig = plt.figure(figsize=(15,20))
# fig.set(alpha=0.2)
plt.subplot(5,3,1)
data_train.Survived.value_counts().plot(kind='bar')
plt.title("Survived(1 is Survived)") # puts a title on our graph
plt.ylabel("counts")

plt.subplot(5,3,2)
data_train.Pclass.value_counts().plot(kind='bar')
plt.title("Cabin class")
plt.ylabel('counts')

plt.subplot(5,3,3)
data_train.Sex.value_counts().plot(kind='bar')
plt.title('male or female')
plt.ylabel('counts')

plt.subplot(5,3,4)
data_train.Age.value_counts().plot(kind='kde')
plt.title('age')
plt.ylabel('counts')

plt.subplot(5,3,5)
plt.scatter(data_train.Survived, data_train.Age)
plt.ylabel("Age")  # y-axis label
plt.grid(True, which='major', axis='y')  # major grid lines along the y axis
plt.title("Survival by age (1 = survived)")

plt.subplot(5,3,6)
data_train.Parch.value_counts().plot(kind='bar')
plt.title("Parents/children aboard")
plt.ylabel("Passengers")

plt.subplot(5,3,7)
plt.scatter(data_train.Survived,data_train.Fare)
plt.title("0: died, 1: survived")
plt.ylabel("Fare")

plt.subplot(5,3,8)
data_train.Fare.value_counts().plot(kind='kde')
plt.title("Fare")
plt.ylabel("Passenger distribution")

plt.subplot(5,3,9)  # ninth panel, so it does not draw over the fare plot
data_train.Embarked.value_counts().plot(kind='bar')
plt.title("Port of embarkation")
plt.ylabel("Passenger distribution")

 

 

 

    First, is cabin class related to sex? (Pclass, Sex)

fig = plt.figure()
# Combine both conditions into one boolean mask instead of chaining two lookups
is_male = data_train.Sex == 'male'
p1_m = data_train.Sex[is_male & (data_train.Pclass == 1)].value_counts()
p2_m = data_train.Sex[is_male & (data_train.Pclass == 2)].value_counts()
p3_m = data_train.Sex[is_male & (data_train.Pclass == 3)].value_counts()
pd.DataFrame({'p1_m':p1_m,'p2_m':p2_m,'p3_m':p3_m}).plot(kind='bar')

is_female = data_train.Sex == 'female'
p1_f = data_train.Sex[is_female & (data_train.Pclass == 1)].value_counts()
p2_f = data_train.Sex[is_female & (data_train.Pclass == 2)].value_counts()
p3_f = data_train.Sex[is_female & (data_train.Pclass == 3)].value_counts()
pd.DataFrame({'p1_f':p1_f,'p2_f':p2_f,'p3_f':p3_f}).plot(kind='bar')

 

 

 

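# Overall, not much of a relationship: there are somewhat more men in the higher classes and the ratio differs slightly, but it does not look significant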

    Relationship between cabin class and fare (Pclass, Fare)

df_pf1 = data_train.Fare[data_train.Pclass==1].value_counts()
df_pf2 = data_train.Fare[data_train.Pclass==2].value_counts()
df_pf3 = data_train.Fare[data_train.Pclass==3].value_counts()
df_pf = pd.DataFrame({'df_pf1':df_pf1,'df_pf2':df_pf2,'df_pf3':df_pf3})
plt.scatter(df_pf1.index,df_pf1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_pf2.index,df_pf2.values,c = 'y',marker='.')
# plt.subplot(1,3,3)
plt.scatter(df_pf3.index,df_pf3.values,c = 'k',marker='.')

 

 

 

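# First class has the high fares; second and third class are low, with second somewhat higher than third. That matches intuition: fare rises with cabin class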
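A per-class summary of the fares states the same relationship in numbers. The sketch below assumes raw, unbinned fares (in the walkthrough above, `Fare` has already been replaced by band codes at this point) and uses made-up rows.

```python
import pandas as pd

# Made-up rows with raw fares, for illustration only.
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Fare":   [71.28, 53.10, 13.0, 26.0, 7.25, 7.93, 8.05],
})

# Minimum, median, and maximum fare within each cabin class.
fare_by_class = sample.groupby("Pclass")["Fare"].agg(["min", "median", "max"])
print(fare_by_class)
```

The median is used rather than the mean because a few very expensive tickets would otherwise dominate the first-class figure.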

    Is cabin class related to the port of embarkation? Maybe people from different places are not equally wealthy (Pclass, Embarked)

fig = plt.figure(figsize=(10,8))
df_pm1 = data_train.Embarked[data_train.Pclass==1].value_counts()
df_pm2 = data_train.Embarked[data_train.Pclass==2].value_counts()
df_pm3 = data_train.Embarked[data_train.Pclass==3].value_counts()
df_pm = pd.DataFrame({'df_pm1':df_pm1,'df_pm2':df_pm2,'df_pm3':df_pm3})
df_pm.plot(kind='bar')

# From this, port C seems to have mostly first-class passengers, Q mostly third class, and S looks fairly ordinary
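Because the ports differ a lot in passenger numbers, the class mix per port is easier to compare as proportions. A sketch with `pd.crosstab` on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S", "S"],
    "Pclass":   [1, 1, 3, 3, 3, 1, 2, 3, 3],
})

# Each row (port) is normalized so its class shares sum to 1.
class_mix = pd.crosstab(sample["Embarked"], sample["Pclass"], normalize="index")
print(class_mix)
```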

    Is cabin class related to age?

 

fig = plt.figure(figsize=(18,8))

df_pa1 = data_train.Age[data_train.Pclass==1].value_counts()
df_pa2 = data_train.Age[data_train.Pclass==2].value_counts()
df_pa3 = data_train.Age[data_train.Pclass==3].value_counts()
df_pa = pd.DataFrame({'df_pa1':df_pa1,'df_pa2':df_pa2,'df_pa3':df_pa3})
# plt.subplot(1,3,1)
plt.scatter(df_pa1.index,df_pa1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_pa2.index,df_pa2.values,c = 'y',marker='.')
# plt.subplot(1,3,3)
plt.scatter(df_pa3.index,df_pa3.values,c = 'k',marker='.')

 

 

 

# An ugly plot, but you can still make out that the black points, i.e. third class, are heavily concentrated among young people aged 18 to 25

    Age and Sex

 

df_as1 = data_train.Age[data_train.Sex=='male'].value_counts()
df_as2 = data_train.Age[data_train.Sex=='female'].value_counts()
# plt.subplot(1,3,1)
plt.scatter(df_as1.index,df_as1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_as2.index,df_as2.values,c = 'k',marker='.')

 

 

 

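# Judging by the distribution, women are not over-represented in any age band, such as the young or the middle-aged. So the unusual survival in the middle-aged band in particular is probably not caused by sex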

    Sex and Fare

 

df_sf1 = data_train.Fare[data_train.Sex=='male'].value_counts()
df_sf2 = data_train.Fare[data_train.Sex=='female'].value_counts()
# plt.subplot(1,3,1)
plt.scatter(df_sf1.index,df_sf1.values,c='r',marker='.')
# plt.subplot(1,3,2)
plt.scatter(df_sf2.index,df_sf2.values,c = 'k',marker='.')

 

 

 

# From this, women seem to have bought the pricier tickets, but as with cabin class it is not very pronounced

    Sex and Embarked

df_sm1 = data_train.Embarked[data_train.Sex=='male'].value_counts()
df_sm2 = data_train.Embarked[data_train.Sex=='female'].value_counts()
pd.DataFrame({"df_sm1":df_sm1,"df_sm2":df_sm2}).plot(kind='bar')

 

 

 

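    Earlier we saw that passengers with few siblings, and with few parents or children, had a better chance of surviving, so let's combine the two into a family feature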

 

data_train['family'] = data_train.SibSp + data_train.Parch
df_f0 = data_train.family[data_train.Survived==0].value_counts()
df_f1 = data_train.family[data_train.Survived==1].value_counts()
pd.DataFrame({'df_f0':df_f0,'df_f1':df_f1}).plot(kind='bar',sharey=True)

 

 

 

# With 1, 2, or 3 family members the survival rate is high, followed by 0. Truly on your own with nobody at all; I'm speechless
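The combined feature can be checked as a rate as well as a count. This sketch rebuilds `family` on a made-up frame and reports group size next to survival rate, since the larger family sizes have very few rows and their rates are noisy.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 4, 0, 1, 5],
    "Parch":    [0, 0, 2, 0, 2, 0, 1, 2],
    "Survived": [0, 1, 1, 0, 0, 1, 1, 0],
})

# Family size aboard, excluding the passenger: siblings/spouses + parents/children.
sample["family"] = sample["SibSp"] + sample["Parch"]

# count = passengers in the group, mean = survival rate of the group.
family_stats = sample.groupby("family")["Survived"].agg(["count", "mean"])
print(family_stats)
```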

Now let's pull the analysis together

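1. Pclass: the higher the cabin class, the more likely a passenger survived. The relationship is linear, so no extra dimensions are needed.
2. Sex: women were more likely to survive.
3. Age: the young survive more easily, probably because of their age. Few of those aged 15-25 survived, possibly because of their tickets: they bought lower-class cabins, whereas the middle-aged had money, bought higher classes, and quite a few of them survived. Looks like this needs extra dimensions.
4. SibSp: one or two gives the highest survival rate, followed by 0. This needs extra dimensions too.
5. Fare: society has always been this way: the rich are more likely to survive. Positively correlated, so it only needs normalization.
6. Cabin: too many missing values; I don't plan to keep it.
7. Embarked: port C has a somewhat higher survival rate. Taken together, that may be because it has many first-class passengers plus a fair share of women. Can it be dropped?
8. family: an extra feature I added. Survival is most likely with 1, 2, or 3 family members, followed by 0, which looks a lot like the results for the two columns it was summed from.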
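As a rough illustration of what the conclusions above could mean in code, the sketch below normalizes `Fare` to the 0-1 range and one-hot encodes the age band and `SibSp`. It is not the cleaning step of this series, only one possible reading of the list; the `sample` frame, the `AgeBand` column name, and the `min_max` helper are made up for illustration.

```python
import pandas as pd

# Made-up rows; AgeBand holds band codes like those produced by get_age.
sample = pd.DataFrame({
    "Pclass":  [1, 3, 2, 3],
    "Sex":     ["female", "male", "female", "male"],
    "AgeBand": [4, 2, 8, 3],
    "SibSp":   [1, 0, 0, 3],
    "Fare":    [71.28, 7.25, 13.00, 21.08],
})

def min_max(series):
    """Scale a numeric series to [0, 1]; a constant series maps to all zeros."""
    span = series.max() - series.min()
    return (series - series.min()) / span if span else series * 0.0

features = sample.copy()
features["Fare"] = min_max(features["Fare"])                  # item 5: normalization
features["Sex"] = (features["Sex"] == "female").astype(int)   # item 2: binary flag
# Items 3 and 4: one indicator column per distinct value.
features = pd.get_dummies(features, columns=["AgeBand", "SibSp"])
print(features.columns.tolist())
```

Pclass is left as a single numeric column here, in line with item 1.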

 

 

 

 
       

