I have been studying machine learning for quite a while, and this is the first time I have actually used it to solve a real problem, exploring step by step. The final result is not very accurate, only reaching 0.78647, but I really learned a lot, so before it slips out of my memory I decided to write it down.
1. When you first get the samples, look at their distribution and summary statistics.
# View the overall information of the data
print(data_train.info())
# View statistics for the numeric columns
print(data_train.describe())
When we encounter missing values, there are several common ways to handle them:
- If the samples with missing values make up a very large share of the total, we may simply drop the feature; including it might just add noise and hurt the final result. Alternatively, treat "has a value" and "has no value" as two classes.
- If a moderate number of samples are missing and the attribute is a non-continuous (e.g. categorical) feature, treat NaN as a new category and add it to the categorical feature.
- If a moderate number of samples are missing and the attribute is a continuous feature, we sometimes pick a step (for Age here, say every 2-3 years), discretize the values, and then add NaN as one more type among the categories (see the sketch after this list).
- In some cases not many values are missing, so we can also try to fit a model on the existing values and fill in the rest.
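A minimal sketch of the second and third options, assuming the data_train DataFrame from above (the "Missing" label, the 3-year bins, and the Age_bin column name are just illustrative choices of mine):
import pandas as pd

# Option 2: treat NaN in a categorical column as a new category of its own
data_train['Embarked'] = data_train['Embarked'].fillna('Missing')

# Option 3: discretize a continuous column with a fixed step, then treat NaN as one more type
# (every 3 years is one bin here; missing ages end up in the string category 'nan')
age_bins = list(range(0, 84, 3))
data_train['Age_bin'] = pd.cut(data_train['Age'], bins=age_bins).astype(str)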
Using a random forest to fill in the missing values:
from sklearn.ensemble import RandomForestRegressor

### Use RandomForestRegressor to fill in the missing Age values
def set_missing_ages(df):
    # Take the existing numeric features and feed them to a Random Forest Regressor
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # Split the passengers into those with a known age and those without
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    # y is the target age
    y = known_age[:, 0]
    # X is the feature matrix
    X = known_age[:, 1:]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill the original missing values with the predictions
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    return df, rfr

def set_Cabin_type(df):
    df.loc[(df.Cabin.notnull()), 'Cabin'] = "Yes"
    df.loc[(df.Cabin.isnull()), 'Cabin'] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
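set_missing_ages also returns the fitted rfr so that the same regressor can fill Age in the test set later. A rough sketch of that step, assuming a data_test DataFrame loaded the same way (filling the single missing Fare with the median is my own assumption):
# The test set also has a missing Fare; fill it before building the feature matrix
data_test.loc[(data_test.Fare.isnull()), 'Fare'] = data_test['Fare'].median()
tmp_df = data_test[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values
# Reuse the regressor fitted on the training data to predict the missing ages
predicted_ages = rfr.predict(null_age[:, 1:])
data_test.loc[(data_test.Age.isnull()), 'Age'] = predicted_ages
data_test = set_Cabin_type(data_test)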
2. Next comes feature engineering. This is the more involved step: choosing which features to use.
There are many feature engineering techniques; they can be found in my feature engineering blog post.
Random forest feature selection: a feature's importance is judged by how much the error rate changes before and after noise is added to that feature. Note that the snippet below actually scores features with univariate SelectKBest (f_classif) and then feeds the best ones to a random forest:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature and transform the p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
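For the noise-injection idea itself, here is a minimal permutation-importance sketch. It assumes the titanic DataFrame and predictors list from above (with the categorical columns already encoded as numbers); the 600-row split and the model settings are just illustrative choices of mine:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Hold out part of the data so we can measure the accuracy change when a feature is noised up
train_part = titanic.iloc[:600]
test_part = titanic.iloc[600:]
rf = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
rf.fit(train_part[predictors], train_part["Survived"])
baseline = rf.score(test_part[predictors], test_part["Survived"])

importances = {}
for col in predictors:
    noisy = test_part[predictors].copy()
    # "Add noise" to one feature by shuffling its values; the accuracy drop measures its importance
    noisy[col] = np.random.permutation(noisy[col].values)
    importances[col] = baseline - rf.score(noisy, test_part["Survived"])

print(sorted(importances.items(), key=lambda kv: -kv[1]))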
Then comes model selection.
No single model performs well on all data, so we have to validate step by step, and different parameters for the same model can also change the result a lot. For this problem I mainly used n-fold cross-validation to estimate each model's accuracy and picked the most accurate one, then plotted curves to visualize the process. Another point worth considering is boosting: combine the outputs of many weak classifiers, and optionally give each weak classifier a weight.
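A minimal sketch of that comparison step, assuming the same preprocessed titanic DataFrame and predictors list as above (the candidate models here are only examples):
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

candidates = {
    "logistic regression": LogisticRegression(random_state=1),
    "random forest": RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4),
    "gradient boosting": GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
}

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
for name, model in candidates.items():
    # 3-fold cross-validated accuracy for each candidate model
    scores = cross_val_score(model, titanic[predictors], titanic["Survived"], cv=kf)
    print(name, scores.mean())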
# Ensemble several algorithms and average their predictions
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ["Pclass", "Sex", "Age", "Fare", "FamilySize", "Title", "Embarked"]],
    [LogisticRegression(random_state=1),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross-validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= 0.5] = 0
    test_predictions[test_predictions > 0.5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == titanic["Survived"]) / float(len(predictions))
print(accuracy)

# When predicting on the actual test set, the gradient boosting classifier generates better
# predictions, so we weight it higher:
#   predictions = (full_predictions[0] * 3 + full_predictions[1] * 1) / 4
I referred to many blog posts and tutorials for this problem:
使用sklearn進行kaggle案例泰坦尼克Titanic船員獲救預測
數據科學工程師面試寶典系列之二---Python機器學習kaggle案例:泰坦尼克號船員獲救預測
My code has been uploaded to GitHub.