機器學習sklearn(二十): 特征工程(十一)特征編碼(五)類別特征編碼(三)獨熱編碼 OneHotEncoder


另外一種將標稱型特征轉換為能夠被scikit-learn中模型使用的編碼是one-of-K, 又稱為 獨熱碼或dummy encoding。 這種編碼類型已經在類OneHotEncoder中實現。該類把每一個具有n_categories個可能取值的categorical特征變換為長度為n_categories的二進制特征向量,里面只有一個地方是1,其余位置都是0。

繼續我們上面的示例:

>>>
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)  
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

默認情況下,每個特征使用幾維的數值可以從數據集自動推斷。而且也可以在屬性categories_中找到:

>>>
>>> enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]

可以使用參數categories_顯式地指定這一點。我們的數據集中有兩種性別、四種可能的大陸和四種web瀏覽器:

>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that for there are missing categorical values for the 2nd and 3rd
>>> # feature
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None,
       categories=[...], drop=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

如果訓練數據可能缺少分類特性,通常最好指定handle_unknown='ignore',而不是像上面那樣手動設置類別。當指定handle_unknown='ignore',並且在轉換過程中遇到未知類別時,不會產生錯誤,但是為該特性生成的一熱編碼列將全部為零(handle_unknown='ignore'只支持一熱編碼):

如果訓練數據中可能含有缺失的標稱型特征, 通過指定handle_unknown='ignore'比像上面代碼那樣手動設置categories更好。 當handle_unknown='ignore' 被指定並在變換過程中真的碰到了未知的 categories, 則不會拋出任何錯誤,但是由此產生的該特征的one-hot編碼列將會全部變成 0 。(這個參數設置選項 handle_unknown='ignore' 僅僅在 one-hot encoding的時候有效):

>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None, drop=None,
       dtype=<... 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])

還可以使用drop參數將每個列編碼為n_categories-1列,而不是n_categories列。此參數允許用戶為要刪除的每個特征指定類別。這對於避免某些分類器中輸入矩陣的共線性是有用的。例如,當使用非正則化回歸(線性回歸)時,這種功能是有用的,因為共線性會導致協方差矩陣是不可逆的。當這個參數不是None時,handle_unknown必須設置為error:

>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
       [0., 0., 0.]])

標稱型特征有時是用字典來表示的,而不是標量,具體請參閱從字典中加載特征

class sklearn.preprocessing.OneHotEncoder(*categories='auto'drop=Nonesparse=Truedtype=<class 'numpy.float64'>handle_unknown='error')

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters
categories ‘auto’ or a list of array-like, default=’auto’

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.

  • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The used categories can be found in the categories_ attribute.

New in version 0.20.

drop {‘first’, ‘if_binary’} or a array-like of shape (n_features,), default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

  • None : retain all features (the default).

  • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

  • ‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

  • array : drop[i] is the category in feature X[:, i] that should be dropped.

New in version 0.21: The parameter drop was added in 0.21.

Changed in version 0.23: The option drop='if_binary' was added in 0.23.

sparse bool, default=True

Will return sparse matrix if set True else will return an array.

dtype number type, default=float

Desired dtype of output.

handle_unknown {‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

Attributes
categories_ list of arrays

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

drop_idx_ array of shape (n_features,)
  • drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

  • drop_idx_[i] None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn’t binary.

  • drop_idx_ None if all the transformed features will be retained.

Changed in version 0.23: Added the possibility to contain None values.

 

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
  dtype=object)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column for feature only having 2 categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])

Methods

fit(X[, y])

Fit OneHotEncoder to X.

fit_transform(X[, y])

Fit OneHotEncoder to X, then transform X.

get_feature_names([input_features])

Return feature names for output features.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Convert the data back to the original representation.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X using one-hot encoding.

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM