Machine Learning with sklearn (5): Data Processing (2): Handling Missing Values

Reposted article (2021-06-16 23:26)

Source: https://www.cnblogs.com/B-Hanan/articles/12774433.html

1 Univariate imputation

import numpy as np
from sklearn.impute import SimpleImputer

help(SimpleImputer):

class SimpleImputer(_BaseImputer)

Imputation transformer for completing missing values.

Parameters

missing_values : number, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategy : string, default='mean'

The imputation strategy.

If "mean", then replace missing values using the mean along each column. Can only be used with numeric data.
If "median", then replace missing values using the median along each column. Can only be used with numeric data.
If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
If "constant", then replace missing values with fill_value. Can be used with strings or numeric data.

fill_value : string or numerical value, default=None

When strategy == "constant", fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.
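As a rough illustration of the strategy and fill_value parameters described above, the sketch below applies "median" and "constant" to a small array. The arrays, the variable names, and the fill value -1.0 are assumptions made up for this example, not part of the original article.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (an assumption for illustration): one NaN in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# "median": each NaN is replaced by the median of the observed values in its column.
X_median = SimpleImputer(strategy="median").fit_transform(X)
print(X_median)

# "constant": every NaN is replaced by the given fill_value (-1.0 is arbitrary).
X_const = SimpleImputer(strategy="constant", fill_value=-1.0).fit_transform(X)
print(X_const)

# With object (string) data and fill_value left at its default,
# the placeholder string "missing_value" is used.
S = np.array([["a", "x"], [np.nan, "y"]], dtype=object)
S_const = SimpleImputer(strategy="constant").fit_transform(S)
print(S_const)
```

"mean" and "median" only make sense for numeric columns, so for mixed data it is usually simpler to impute numeric and string columns with separate imputers.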

imp=SimpleImputer(missing_values=np.nan,strategy='mean')
imp.fit([[1,2],[np.nan,3],[7,6]])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)
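The fit call above only learns the per-column statistics; nothing is filled in until transform is called. A minimal sketch of that second step follows, assuming a small test array (X_new) that is invented here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Per-column means learned from the observed training values:
# column 0 -> (1 + 7) / 2, column 1 -> (2 + 3 + 6) / 3.
print(imp.statistics_)

# New data (made up for this sketch); its NaNs are replaced by the training means.
X_new = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X_new)
print(X_filled)
```

Because the statistics come from the training data only, the same fitted imputer can be reused on validation or test data without leaking information from them.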

# The SimpleImputer class also supports sparse matrices
import scipy.sparse as sp
X=sp.csc_matrix([[1,2],[0,-1],[8,4]])
imp=SimpleImputer(missing_values=-1,strategy='mean')
imp.fit(X)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=-1, strategy='mean', verbose=0)

X_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
print(imp.transform(X_test))

  (0, 0)    3.0
  (1, 0)    6.0
  (2, 0)    7.0
  (0, 1)    2.0
  (1, 1)    3.0
  (2, 1)    6.0

print(imp.transform(X_test).toarray())
[[3. 2.]
 [6. 3.]
 [7. 6.]]

import pandas as pd
df=pd.DataFrame([['a','x'],
                [np.nan,'y'],
                ['a',np.nan],
                ['b','y']],dtype='category')

df

	0	1
0	a	x
1	NaN	y
2	a	NaN
3	b	y

imp=SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

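2 Multivariate feature imputation

Use the IterativeImputer class, which models each feature with missing values as a function of the other features and uses that estimate for imputation.
It works in an iterated round-robin fashion:
at each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X; a regressor is fit on (X, y) for the rows where y is known and is then used to predict the missing values of y.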
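To make the round-robin idea concrete, here is a simplified sketch written with plain NumPy. It is not the IterativeImputer implementation: it assumes ordinary least squares as the regressor (IterativeImputer's default estimator is different), a fixed number of passes, and no convergence check. The function name round_robin_impute is hypothetical.

```python
import numpy as np

def round_robin_impute(X, n_iter=10):
    """Simplified round-robin imputation with ordinary least squares (a sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Initial guess: fill every NaN with the mean of its column.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue  # nothing to impute in this column
            # Inputs: all other columns (current estimates) plus an intercept term.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([others, np.ones(len(filled))])
            # Fit on rows where column j was actually observed.
            coef, *_ = np.linalg.lstsq(A[~rows_missing], filled[~rows_missing, j], rcond=None)
            # Predict the entries of column j that were missing.
            filled[rows_missing, j] = A[rows_missing] @ coef
    return filled

X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
result = round_robin_impute(X)
print(result)
```

In this toy data the second column is twice the first, so the imputed entries should move toward 1.5 and 14 as the passes repeat.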

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp=IterativeImputer(max_iter=10,random_state=0)
imp.fit([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)

imp.transform([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])

array([[ 1.        ,  2.        ],
       [ 3.        ,  6.        ],
       [ 4.        ,  8.        ],
       [ 1.50004509,  3.        ],
       [ 7.        , 14.00004135]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(imp.transform(X_test))

[[ 1.00007297  2.        ]
 [ 6.         12.00002754]
 [ 2.99996145  6.        ]]

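3 Nearest-neighbors imputation

The KNNImputer class fills in missing values using the k-nearest neighbors approach. By default, a Euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for that feature.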
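The missing-value-aware distance can be sketched by hand. The helper below follows the definition documented for nan_euclidean_distances: only coordinates present in both samples are compared, and the squared distance is rescaled by (total coordinates / present coordinates). The function name nan_euclidean is hypothetical and the sketch handles a single pair of samples only.

```python
import numpy as np

def nan_euclidean(x, y):
    """Euclidean distance over coordinates present in both x and y,
    rescaled by (total coordinates / present coordinates)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~np.isnan(x) & ~np.isnan(y)
    if not present.any():
        return np.nan  # no shared coordinates, so the distance is undefined
    squared = np.sum((x[present] - y[present]) ** 2)
    return np.sqrt(x.size / present.sum() * squared)

# Two shared coordinates out of three: sqrt(3/2 * (2**2 + 2**2))
d01 = nan_euclidean([1, 2, np.nan], [3, 4, 3])
# One shared coordinate out of three: sqrt(3/1 * 4**2)
d02 = nan_euclidean([1, 2, np.nan], [np.nan, 6, 5])
print(d01, d02)
```

The rescaling keeps distances comparable between pairs that share many coordinates and pairs that share only a few.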

from sklearn.impute import KNNImputer

help(KNNImputer):

Imputation for completing missing values using k-Nearest Neighbors.

Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

Parameters:

missing_values : number, string, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed.

n_neighbors : int, default=5

Number of neighboring samples to use for imputation.

weights : {'uniform', 'distance'} or callable, default='uniform'

Weight function used in prediction.

import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
print(X)
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit_transform(X)

[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
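For comparison, the weights parameter can be switched to "distance", so that closer neighbors contribute more to the imputed value. The sketch below reuses the same four-row array; the variable names uniform and weighted are chosen here for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# "uniform": plain average of the 2 nearest neighbors that have the feature.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# "distance": the same neighbors, weighted by the inverse of their distance.
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Compare the imputed value of the first row's third feature.
print(uniform[0, 2], weighted[0, 2])
```

For the first row, the two donor rows are not equally far away, so the distance-weighted estimate is pulled toward the nearer donor's value, while the uniform estimate is their plain mean.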

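4 Marking imputed values

The MissingIndicator transformer converts a dataset into the corresponding binary matrix that indicates where missing values occur. This transformation is useful in combination with imputation: when values are imputed, keeping the information about which values were originally missing can itself be informative.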

from sklearn.impute import MissingIndicator

help(MissingIndicator):

class MissingIndicator(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)

Binary indicators for missing values.

MissingIndicator(missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)
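The error_on_new parameter in the signature above matters when features="missing-only": it controls what happens if transform sees missing values in a column that had none during fit. A small sketch, with both arrays invented for illustration:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy arrays (assumptions): -1 marks a missing value.
X_train = np.array([[-1, 1], [4, 0], [8, 1]])  # only column 0 contains -1
X_new = np.array([[3, -1], [-1, 2]])           # column 1 now contains -1 as well

# error_on_new=True (the default): a newly missing column raises ValueError.
strict = MissingIndicator(missing_values=-1, error_on_new=True).fit(X_train)
try:
    strict.transform(X_new)
    raised = False
except ValueError:
    raised = True
print(raised)

# error_on_new=False: the new column is ignored; only columns seen as
# missing during fit appear in the mask.
lenient = MissingIndicator(missing_values=-1, error_on_new=False).fit(X_train)
mask = lenient.transform(X_new)
print(mask)
```

Keeping the default is the safer choice when an unexpected missing column should be treated as a data-quality problem rather than silently dropped from the mask.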

X = np.array([[-1, -1, 1, 3], [4, -1, 0, -1], [8, -1, 1, 0]])
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)
mask_missing_values_only

array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])

# features_ holds the indices of only those columns that contain missing values
indicator.features_

array([0, 1, 3])

# Setting the features parameter to 'all' returns every feature, whether or not it contains missing values
indicator = MissingIndicator(missing_values=-1, features="all")
mask_all = indicator.fit_transform(X)
mask_all

array([[ True,  True, False, False],
       [False,  True, False,  True],
       [False,  True, False, False]])

indicator.features_
# column indices of the features included in the mask

array([0, 1, 2, 3])
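Since the indicator is most useful together with imputation, one convenient option is the add_indicator parameter that appears in the SimpleImputer repr in section 1. The sketch below is a minimal example under that assumption; the input array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (an assumption): only the first column has a missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# add_indicator=True appends one indicator column for each feature
# that contained missing values during fit.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)
print(X_out)
```

Here the output has three columns: the two imputed features followed by a 0/1 column marking where the first feature was originally missing, so a downstream model can still see which values were filled in.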
