Machine Learning with sklearn (5): Data Processing (2): Handling Missing Values


Source: https://www.cnblogs.com/B-Hanan/articles/12774433.html

 

 

1 Univariate imputation

import numpy as np
from sklearn.impute import SimpleImputer

help(SimpleImputer):

class SimpleImputer(_BaseImputer): Imputation transformer for completing missing values.

Parameters

missing_values : number, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategy : string, default='mean'

The imputation strategy.

  • If "mean", then replace missing values using the mean along each column. Can only be used with numeric data.

  • If "median", then replace missing values using the median along each column. Can only be used with numeric data.

  • If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

  • If "constant", then replace missing values with fill_value. Can be used with strings or numeric data. Use strategy="constant" for fixed-value imputation.

fill_value : string or numerical value, default=None

When strategy == "constant", fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.
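The strategies listed above can be compared side by side. The following is a minimal sketch on a small made-up array (the data and the `results` dictionary are illustrative assumptions, not part of the original article); it only reads the value imputed into the first column, where the result is unambiguous.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (assumed for illustration): column 0 has observed values 1, 7, 7.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [7.0, np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imp.fit_transform(X)

# "constant" replaces every missing entry with fill_value.
const_imp = SimpleImputer(strategy="constant", fill_value=-1.0)
results["constant"] = const_imp.fit_transform(X)

for name, filled in results.items():
    # Value written into the missing cell at row 1, column 0.
    print(name, filled[1, 0])
```

For column 0 the mean of the observed values is 5, while the median and the most frequent value are both 7, so the choice of strategy visibly changes the filled cell.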

imp=SimpleImputer(missing_values=np.nan,strategy='mean')
imp.fit([[1,2],[np.nan,3],[7,6]])
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)
# The SimpleImputer class also supports sparse matrices
import scipy.sparse as sp
X=sp.csc_matrix([[1,2],[0,-1],[8,4]])
imp=SimpleImputer(missing_values=-1,strategy='mean')
imp.fit(X)
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=-1, strategy='mean', verbose=0)
X_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
print(imp.transform(X_test))
  (0, 0)    3.0
  (1, 0)    6.0
  (2, 0)    7.0
  (0, 1)    2.0
  (1, 1)    3.0
  (2, 1)    6.0
print(imp.transform(X_test).toarray())
[[3. 2.]
 [6. 3.]
 [7. 6.]]

import pandas as pd
df=pd.DataFrame([['a','x'],
                [np.nan,'y'],
                ['a',np.nan],
                ['b','y']],dtype='category')

df
     0    1
0    a    x
1  NaN    y
2    a  NaN
3    b    y
imp=SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(df))
[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

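2 Multivariate feature imputation

Use the IterativeImputer class, which models each feature with missing values as a function of the other features and uses that estimate for imputation.
It works in an iterated round-robin fashion:
at each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X.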
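To make the round-robin idea concrete, here is a rough hand-written sketch. It is not how IterativeImputer is implemented: it uses LinearRegression as a stand-in (IterativeImputer defaults to BayesianRidge), runs a fixed number of rounds with no convergence check, and the helper name `round_robin_impute` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def round_robin_impute(X, n_rounds=10):
    """Hypothetical helper: simplified round-robin regression imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: fill each missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue  # nothing to impute in this column
            # Column j is the output y; all other columns are the inputs.
            others = np.delete(filled, j, axis=1)
            model = LinearRegression()
            model.fit(others[~rows_missing], filled[~rows_missing, j])
            filled[rows_missing, j] = model.predict(others[rows_missing])
    return filled

X_demo = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
X_filled = round_robin_impute(X_demo)
print(X_filled)
```

Observed entries are never overwritten; only the cells that were originally missing are re-estimated in each round.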

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp=IterativeImputer(max_iter=10,random_state=0)
imp.fit([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])
IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)
imp.transform([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])
array([[ 1.        ,  2.        ],
       [ 3.        ,  6.        ],
       [ 4.        ,  8.        ],
       [ 1.50004509,  3.        ],
       [ 7.        , 14.00004135]])
X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(imp.transform(X_test))
[[ 1.00007297  2.        ]
 [ 6.         12.00002754]
 [ 2.99996145  6.        ]]

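3 Nearest-neighbor imputation

The KNNImputer class fills in missing values using the k-Nearest Neighbors approach. By default, a Euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for that feature.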
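The neighbor search described above can be reproduced by hand for a single cell. The sketch below is only an illustration under simple assumptions (uniform weights, two neighbors, and the variable names are made up here): it computes the NaN-aware distances, keeps the rows that actually have a value for the target feature, and averages the two nearest ones.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

# Pairwise distances that skip coordinates missing in either sample
# and rescale by the share of coordinates that were usable.
dist = nan_euclidean_distances(X, X)

# Impute X[0, 2] manually: candidate donors are rows with feature 2 observed.
donors = np.flatnonzero(~np.isnan(X[:, 2]))
nearest = donors[np.argsort(dist[0, donors])[:2]]
manual_value = X[nearest, 2].mean()
print(nearest, manual_value)
```

Here rows 1 and 2 are the two closest donors to row 0, so the manual estimate is the mean of 3 and 5.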

from sklearn.impute import KNNImputer 

help(KNNImputer):

Imputation for completing missing values using k-Nearest Neighbors.


Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.


Parameters:

missing_values : number, string, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed.

n_neighbors : int, default=5

Number of neighboring samples to use for imputation.

weights : {'uniform', 'distance'} or callable, default='uniform'

Weight function used in prediction.

import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
print(X)
[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
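The weights parameter changes how the neighbors' values are combined. As a small sketch on the same data (the variable names `uniform` and `weighted` are chosen here for illustration), the two settings can be compared directly:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# 'uniform': plain average of the neighbors' values.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
# 'distance': closer neighbors contribute more (weight is inverse distance).
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform[0, 2], weighted[0, 2])
print(uniform[2, 0], weighted[2, 0])
```

With distance weighting, the value imputed for the first row is pulled toward its closer neighbor, while the third row is unchanged because its two nearest donors are equally distant.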

 

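4 Marking imputed values

The MissingIndicator transformer converts a dataset into a corresponding binary matrix that indicates where missing values occur. This transformation is useful in combination with imputation: when values are imputed, keeping a record of which ones were originally missing can be informative.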

from sklearn.impute import MissingIndicator 

help(MissingIndicator):

class MissingIndicator(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)

Binary indicators for missing values.

MissingIndicator(missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)

X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)
mask_missing_values_only
array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])
# features_ holds only the indices of the columns that contain missing values
indicator.features_
array([0, 1, 3])
# Set features="all" to return an indicator for every feature, whether or not it contains missing values
indicator = MissingIndicator(missing_values=-1, features="all")
mask_all = indicator.fit_transform(X)
mask_all
array([[ True,  True, False, False],
       [False,  True, False,  True],
       [False,  True, False, False]])
# Column indices of the features
indicator.features_
array([0, 1, 2, 3])
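One way to combine imputation with missing-value indicators is the add_indicator parameter that appears in the SimpleImputer repr earlier in this article. The sketch below uses made-up data and is only one possible setup: it appends the indicator columns to the imputed features in a single step.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (assumed for illustration): both columns contain one missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# add_indicator=True stacks MissingIndicator output next to the imputed data.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)

# The first two columns are the imputed features; the remaining columns
# flag which entries were originally missing (1.0 = was missing).
print(X_out)
```

A downstream model can then use the extra columns to learn whether missingness itself carries signal.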

 

