I have been studying machine learning for quite a while, and this is the first time I have actually used it to solve a real problem, exploring step by step. The final result is not very accurate, only reaching 0.78647, but I really learned a lot, so before it slips out of my memory I decided to write it down.
1. When you first get the samples, look at their distribution and summary statistics.
# View the overall information of the data
print(data_train.info())
# View statistics for the numeric columns
print(data_train.describe())
When we encounter missing values, there are several common ways to handle them:
- If the samples with missing values make up a very large share of the total, we may simply drop the feature; including it might just add noise and hurt the final result. Alternatively, treat "has a value" and "has no value" as two classes.
- If a moderate number of samples are missing and the attribute is a non-continuous (e.g. categorical) feature, treat NaN as a new category and add it to the categorical feature.
- If a moderate number of samples are missing and the attribute is a continuous feature, we sometimes pick a step (for Age here, say every 2-3 years), discretize the values, and then add NaN as one more type among the categories (see the sketch after this list).
- In some cases not many values are missing, so we can also try to fit a model on the existing values and fill in the rest.
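A minimal sketch of the second and third options, assuming the data_train DataFrame from above (the "Missing" label, the 3-year bins, and the Age_bin column name are just illustrative choices of mine):
import pandas as pd

# Option 2: treat NaN in a categorical column as a new category of its own
data_train['Embarked'] = data_train['Embarked'].fillna('Missing')

# Option 3: discretize a continuous column with a fixed step, then treat NaN as one more type
# (every 3 years is one bin here; missing ages end up in the string category 'nan')
age_bins = list(range(0, 84, 3))
data_train['Age_bin'] = pd.cut(data_train['Age'], bins=age_bins).astype(str)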
Using a random forest to fill in the missing values:
from sklearn.ensemble import RandomForestRegressor

### Use RandomForestRegressor to fill in the missing Age values
def set_missing_ages(df):
    # Take the existing numeric features and feed them to a Random Forest Regressor
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # Split the passengers into those with a known age and those without
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    # y is the target age
    y = known_age[:, 0]
    # X is the feature matrix
    X = known_age[:, 1:]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill the original missing values with the predictions
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    return df, rfr

def set_Cabin_type(df):
    df.loc[(df.Cabin.notnull()), 'Cabin'] = "Yes"
    df.loc[(df.Cabin.isnull()), 'Cabin'] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
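set_missing_ages also returns the fitted rfr so that the same regressor can fill Age in the test set later. A rough sketch of that step, assuming a data_test DataFrame loaded the same way (filling the single missing Fare with the median is my own assumption):
# The test set also has a missing Fare; fill it before building the feature matrix
data_test.loc[(data_test.Fare.isnull()), 'Fare'] = data_test['Fare'].median()
tmp_df = data_test[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values
# Reuse the regressor fitted on the training data to predict the missing ages
predicted_ages = rfr.predict(null_age[:, 1:])
data_test.loc[(data_test.Age.isnull()), 'Age'] = predicted_ages
data_test = set_Cabin_type(data_test)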
2. Next comes feature engineering. This is the more involved step: choosing which features to use.
There are many feature engineering techniques; they can be found in my feature engineering blog post.
Random forest feature selection: a feature's importance is judged by how much the error rate changes before and after noise is added to that feature. Note that the snippet below actually scores features with univariate SelectKBest (f_classif) and then feeds the best ones to a random forest:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature and transform the p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
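For the noise-injection idea itself, here is a minimal permutation-importance sketch. It assumes the titanic DataFrame and predictors list from above (with the categorical columns already encoded as numbers); the 600-row split and the model settings are just illustrative choices of mine:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Hold out part of the data so we can measure the accuracy change when a feature is noised up
train_part = titanic.iloc[:600]
test_part = titanic.iloc[600:]
rf = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
rf.fit(train_part[predictors], train_part["Survived"])
baseline = rf.score(test_part[predictors], test_part["Survived"])

importances = {}
for col in predictors:
    noisy = test_part[predictors].copy()
    # "Add noise" to one feature by shuffling its values; the accuracy drop measures its importance
    noisy[col] = np.random.permutation(noisy[col].values)
    importances[col] = baseline - rf.score(noisy, test_part["Survived"])

print(sorted(importances.items(), key=lambda kv: -kv[1]))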
Then comes model selection.
No single model performs well on all data, so we have to validate step by step, and different parameters for the same model can also change the result a lot. For this problem I mainly used n-fold cross-validation to estimate each model's accuracy and picked the most accurate one, then plotted curves to visualize the process. Another point worth considering is boosting: combine the outputs of many weak classifiers, and optionally give each weak classifier a weight.
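A minimal sketch of that comparison step, assuming the same preprocessed titanic DataFrame and predictors list as above (the candidate models here are only examples):
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

candidates = {
    "logistic regression": LogisticRegression(random_state=1),
    "random forest": RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4),
    "gradient boosting": GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
}

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
for name, model in candidates.items():
    # 3-fold cross-validated accuracy for each candidate model
    scores = cross_val_score(model, titanic[predictors], titanic["Survived"], cv=kf)
    print(name, scores.mean())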
# Ensemble several algorithms and average their predictions
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ["Pclass", "Sex", "Age", "Fare", "FamilySize", "Title", "Embarked"]],
    [LogisticRegression(random_state=1),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross-validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= 0.5] = 0
    test_predictions[test_predictions > 0.5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == titanic["Survived"]) / float(len(predictions))
print(accuracy)

# When predicting on the actual test set, the gradient boosting classifier generates better
# predictions, so we weight it higher:
#   predictions = (full_predictions[0] * 3 + full_predictions[1] * 1) / 4
I referred to many blog posts and tutorials for this problem:
使用sklearn進行kaggle案例泰坦尼克Titanic船員獲救預測
數據科學工程師面試寶典系列之二---Python機器學習kaggle案例:泰坦尼克號船員獲救預測
My code has been uploaded to GitHub.