Python Learning 08: Handling Missing Values

This article is a repost. 2019-07-15 21:18

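1. Handling missing values

We will learn three approaches to handling missing values, and then compare how effective they are on a real-world dataset.

Introduction to missing values:

There are many ways data can end up with missing values. For example:

    A two-bedroom house has no value for the size of a third bedroom.

    A survey respondent may choose not to share their income.

Most machine learning libraries (including scikit-learn) raise an error if you try to build a model using data with missing values.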
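A minimal sketch of that failure, using made-up toy data and scikit-learn's LinearRegression (chosen here only for illustration; it is not the model used later in this article). Some scikit-learn estimators do accept NaN natively, so the exact behaviour depends on the estimator and the library version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values): one bedroom-size entry is missing.
X = pd.DataFrame({"rooms": [2, 3, 4],
                  "bedroom3_size": [np.nan, 12.0, 15.0]})
y = [300, 450, 600]

try:
    LinearRegression().fit(X, y)
    failed = False
except ValueError as err:
    # LinearRegression rejects input that contains NaN.
    failed = True
    print("Fitting failed:", err)
```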

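2. Three approaches to handling missing values

1) A simple option: drop columns with missing values

The simplest option is to drop every column that contains missing values.

Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach.

As an extreme example, consider a dataset with 10,000 rows in which one important column is missing a single entry. This approach would drop that column entirely!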
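The extreme case can be reproduced with a small sketch. The column names and values below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 10,000 rows, two columns.
df = pd.DataFrame({
    "area": np.arange(10_000, dtype=float),
    "rooms": np.ones(10_000),
})
df.loc[0, "area"] = np.nan  # a single missing entry in an important column

# Columns that contain at least one missing value
cols_with_missing = [col for col in df.columns if df[col].isnull().any()]
reduced = df.drop(columns=cols_with_missing)

print(cols_with_missing)         # ['area']
print(reduced.columns.tolist())  # ['rooms']
```

One missing cell out of 10,000 is enough to lose the whole "area" column, along with its 9,999 valid values.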

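2) A better option: imputation

Imputation fills in the missing values with some number. For instance, we can fill in the mean value of each column.

The imputed value won't be exactly right in most cases, but it usually leads to a more accurate model than dropping the column entirely.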
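As a small sketch of mean imputation, the toy frame below (hypothetical column names and values) is filled with scikit-learn's SimpleImputer using strategy="mean".

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): each column has one missing entry.
df = pd.DataFrame({"bed": [1.0, 2.0, np.nan],
                   "bath": [1.0, np.nan, 3.0]})

imputer = SimpleImputer(strategy="mean")
# fit_transform returns a NumPy array, so the labels are restored by hand.
filled = pd.DataFrame(imputer.fit_transform(df),
                      columns=df.columns, index=df.index)

print(filled)
```

The missing "bed" entry becomes 1.5 (the mean of 1 and 2) and the missing "bath" entry becomes 2.0 (the mean of 1 and 3).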

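3) An extension to imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which were not collected in the dataset).

Or rows with missing values may be unique in some other way. In that case, your model can make better predictions by taking into account which values were originally missing.

In this approach, we impute the missing values as before. In addition, for each column with missing entries in the original dataset, we add a new column that marks where the imputed entries are.

In some cases this meaningfully improves results. In other cases it doesn't help at all.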
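A minimal pandas-only sketch of the idea, on invented toy data: record where values were missing, then fill them with the column mean. The "_was_missing" suffix follows the naming used later in this article.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): only "bed" has a missing entry.
df = pd.DataFrame({"bed": [1.0, 2.0, np.nan],
                   "bath": [1.0, 1.0, 2.0]})

# Fill the gaps with each column's mean.
filled = df.fillna(df.mean())

# Add one indicator column per column that originally had missing values.
for col in [c for c in df.columns if df[c].isnull().any()]:
    filled[col + "_was_missing"] = df[col].isnull()

print(filled)
```

The result has three columns: "bed" (with 1.5 filled in), "bath", and "bed_was_missing", which is True only in the last row.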

3. A worked example

1. Setup

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('E:/data_handle/melb_data.csv')
# Select the target
y = data.Price
# Use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])
# Split the data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

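2. Define a function to measure the quality of each approach

We define a function score_dataset() to compare the different approaches to handling missing values. The function reports the mean absolute error (MAE) of a random forest model.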

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing the different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid,preds)
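To make the metric concrete, here is a small worked example on invented numbers. It computes the MAE by hand and checks it against scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE is the average of the absolute errors: (10 + 10 + 30) / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
library_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, library_mae)
```

A lower MAE means the predictions are closer to the true values on average, so the approach with the smallest score is the best of the three.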

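3. Score for Approach 1 (drop columns with missing values)

Because we are working with both a training set and a validation set, we are careful to drop the same columns from both DataFrames.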

# Get the names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Drop those columns from the training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

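4. Score for Approach 2 (imputation)

Next, we use SimpleImputer to replace missing values with the mean value of each column.

Although it is simple, filling in the mean value generally performs quite well (but this varies by dataset).

Statisticians have experimented with more complex ways to determine imputed values (regression imputation, for example), but the complex strategies typically give no additional benefit once the results are plugged into a sophisticated machine learning model.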

from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
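The code above restores the column names but lets the row index reset to 0..n-1. That is harmless for score_dataset(), since scikit-learn works positionally, but it can matter if the imputed frame is later aligned with y_train by index. A possible variant, sketched on toy data with hypothetical names, keeps both labels:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical) with a shuffled index, as train_test_split produces
X_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]},
                      index=[10, 42, 7])

imputer = SimpleImputer()  # default strategy is the column mean
imputed_demo = pd.DataFrame(imputer.fit_transform(X_demo),
                            columns=X_demo.columns,
                            index=X_demo.index)

print(imputed_demo)
```

Row 42 keeps its label, and its missing "a" value is filled with 2.0, the mean of 1 and 3.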

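5. Score for Approach 3 (an extension to imputation)

Next, we impute the missing values while also keeping track of which values were imputed.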

# Make copies to avoid changing the original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed the column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
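As an alternative to building the indicator columns by hand, SimpleImputer has an add_indicator option that appends them automatically. The sketch below uses a toy array; it is not the code this article ran, only a shorter route to the same kind of feature matrix.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (hypothetical): only the first column has a missing value.
X_demo = np.array([[1.0, 4.0],
                   [np.nan, 5.0],
                   [3.0, 6.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
result = imputer.fit_transform(X_demo)

# Columns: imputed col 0, col 1, then a 0/1 indicator for col 0
print(result)
```

Indicators are added only for columns that had missing values when the imputer was fitted, so the output here has three columns rather than four.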

6. Summary

# Print the shape of the training data (rows, columns)
print(X_train.shape)
# Number of missing values in each column of the training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
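Raw counts are easier to judge next to the share of rows they represent. A small sketch on invented data (hypothetical column names) that reports the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
X_demo = pd.DataFrame({
    "car": [1.0, np.nan, 2.0, 1.0],
    "year": [np.nan, np.nan, 1990.0, 2001.0],
    "rooms": [2, 3, 4, 3],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_share = X_demo.isnull().mean()
report = missing_share[missing_share > 0].sort_values(ascending=False)

print(report)
```

Here "year" is missing in half of the rows and "car" in a quarter, while "rooms" is complete and is left out of the report.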

As is common, imputing missing values (Approaches 2 and 3) yielded better results than simply dropping the columns with missing values (Approach 1).

That concludes this lesson!

