4. Measuring Feature Importance
The previous step gave a small improvement in accuracy, but the result is still not ideal. The model is close to fully tuned and the existing features have largely been exhausted, so the accuracy has hit a bottleneck. To improve it further we have to go back to the original dataset, because nothing influences a classifier more than the input data itself. The usual approach at this point is to construct new features from the raw data. Here we add two: the number of family members and the length of the name.
# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
Next, extract the title from each name (names contain titles such as Miss, Mrs, Mr, and so on); these titles may also influence the outcome.
import re
# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.
    # Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))
# Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {
"Mr": 1,
"Miss": 2,
"Mrs": 3,
"Master": 4,
"Dr": 5,
"Rev": 6,
"Major": 7,
"Col": 7,
"Mlle": 8,
"Mme": 8,
"Don": 9,
"Lady": 10,
"Countess": 10,
"Jonkheer": 10,
"Sir": 9,
"Capt": 7,
"Ms": 2
}
for k, v in title_mapping.items():
    titles[titles == k] = v
# Verify that we converted everything.
print(pandas.value_counts(titles))
# Add in the title column.
titanic["Title"] = titles
In the output below, the three most common titles account for well over half of the dataset; this feature will no doubt have a considerable influence on the result as well.
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Mlle          2
Col           2
Sir           1
Mme           1
Lady          1
Countess      1
Capt          1
Ms            1
Don           1
Jonkheer      1
Name: Name, dtype: int64

1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
After all these steps we have accumulated quite a few features. We can use feature importance to decide which ones are actually worth keeping, and one advantage of random forests is that they make feature importance easy to measure.
How feature importance is measured: during training, to judge how important one particular feature is among many, we do not train on that feature's real values; instead we replace all of its values with noise. If the resulting accuracy barely changes, the feature was not very important; if the accuracy changes a lot, the feature matters a great deal. Every other feature's importance is measured in the same way.
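The idea described above is essentially permutation importance. The code that follows measures importance with univariate selection (SelectKBest) instead; purely as an illustration of the noise/shuffling idea itself, a minimal sketch using scikit-learn's permutation_importance on a random forest could look like the following (the train/validation split and model settings here are assumptions for illustration, not part of the original pipeline).

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative only: shuffle one column at a time and measure the drop in
# accuracy, which is the same intuition as replacing the feature with noise.
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked",
            "FamilySize", "Title", "NameLength"]
X_train, X_valid, y_train, y_valid = train_test_split(
    titanic[features].astype(float), titanic["Survived"], random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
result = permutation_importance(forest, X_valid, y_valid, n_repeats=10, random_state=1)
for name, drop in sorted(zip(features, result.importances_mean), key=lambda p: -p[1]):
    print(f"{name}: {drop:.3f}")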
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif  # select the best features
import matplotlib.pyplot as plt
predictors = [
"Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize",
"Title", "NameLength"
]
# Perform feature selection.
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50,
                             min_samples_split=8, min_samples_leaf=4)
The result:

[Bar chart of the SelectKBest scores (-log10 of the p-value) for each candidate feature]

The bar chart above shows each feature's importance score. Features such as Age appear to have little influence, which differs considerably from our initial assumption, so the less useful features can simply be dropped and only the useful ones kept.
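The snippet above only defines the reduced random forest without scoring it. A minimal sketch of how it might be evaluated on the four selected features, and how the fitted forest's own feature_importances_ attribute can be read, is shown below (the use of cross_val_score here is an illustration, not part of the original code).

from sklearn.model_selection import cross_val_score

# Score the reduced model with 3-fold cross-validation (illustrative).
scores = cross_val_score(alg, titanic[predictors].astype(float), titanic["Survived"], cv=3)
print(scores.mean())

# The random forest's built-in importance measure, available after fitting.
alg.fit(titanic[predictors].astype(float), titanic["Survived"])
for name, importance in zip(predictors, alg.feature_importances_):
    print(name, round(importance, 3))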
5. Ensemble Methods
We now combine multiple algorithms (ensembling) to raise the accuracy.
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]],
    [LogisticRegression(random_state=1, solver='liblinear'),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]
# Initialize the cross validation folds
kf = KFold(n_splits=3, shuffle=False)  # random_state is only meaningful when shuffle=True
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions with each algorithm on this fold.
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the two classifiers' predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)
# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
# Compute accuracy by comparing to the training data.
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)
The resulting accuracy is:
0.8215488215488216
Next we make predictions on the test dataset. (Note: the test set has no "Survived" column, so we cannot compute the accuracy of these predictions; we can only produce them.)
titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {
"Mr": 1,
"Miss": 2,
"Mrs": 3,
"Master": 4,
"Dr": 5,
"Rev": 6,
"Major": 7,
"Col": 7,
"Mlle": 8,
"Mme": 8,
"Don": 9,
"Lady": 10,
"Countess": 10,
"Jonkheer": 10,
"Sir": 9,
"Capt": 7,
"Ms": 2,
"Dona": 10
}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))
# Now, we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]
Counts of each title in the test set's Name column:
1 240
2 79
3 72
4 21
7 2
6 2
10 1
5 1
Name: Title, dtype: int64
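Before making the final predictions, the test set also needs the same cleaning that the training set received in the earlier sections (numeric Sex and Embarked, no missing Age or Fare), otherwise the classifiers below will fail on missing or string values. If that step has not been carried over, a minimal sketch of the assumed preprocessing is:

# Assumed preprocessing, mirroring the training-set cleaning from earlier sections.
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2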
Finally, predict whether each passenger in the test set would survive.
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     predictors],
    [LogisticRegression(random_state=1, solver='liblinear'),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]
full_predictions = []
for alg, predictors in algorithms:
    # Fit the algorithm using the full training data.
    alg.fit(titanic[predictors], titanic["Survived"])
    # Predict using the test dataset. We have to convert all the columns to floats to avoid an error.
    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:, 1]
    full_predictions.append(predictions)
# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
# Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
predictions[predictions <= .5] = 0
predictions[predictions > .5] = 1
predictions
The result (1 means the passenger is predicted to survive, 0 means not):
array([0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0.,
0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 0.,
0., 1., 1., 0., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0.,
1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 1., 1.,
0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1.,
0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1.,
0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 0.,
0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
1., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0.,
1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1.,
1., 0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0.,
1., 1., 1., 1., 1., 0., 1., 0., 0., 0.])
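If these predictions are to be submitted to Kaggle, they are usually written out as a CSV with PassengerId and Survived columns; a minimal sketch (the file name is arbitrary, and the int cast matches the expected submission format) is:

submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions.astype(int)
})
submission.to_csv("kaggle_submission.csv", index=False)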
6. Summary
To summarize: start from all the features in the dataset and extract as much information as possible that might affect the outcome; then handle missing values, map string data to numbers, try different machine learning algorithms, tune the model parameters, and finally use ensembling to push the accuracy up. Measuring feature importance and filtering the features is part of this process as well.
