LabelEncoder and OneHotEncoder in Python Data Processing

This article is a repost. 2018-05-29 16:54. Tags: Machine learning / sklearn / Python

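One-hot encoding, also called one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and at any time exactly one bit is active. The main benefits are: 1. it solves the problem that classifiers handle categorical attributes poorly; 2. to some extent it also expands the feature set.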
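The definition above can be sketched in a few lines. This is only an illustration using NumPy; the helper name `one_hot` is hypothetical and not part of any library.

```python
import numpy as np

def one_hot(index, n_states):
    """Return a length-n_states vector with a single 1 at position `index`."""
    vec = np.zeros(n_states, dtype=int)
    vec[index] = 1
    return vec

# Encode N = 4 states: each state gets its own bit, and only one bit is set.
codes = [one_hot(i, 4) for i in range(4)]
for state, code in enumerate(codes):
    print(state, code.tolist())
```

Every vector has length N and sums to 1, which is exactly the "only one bit active" property.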

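The purpose of one-hot encoding a discrete feature is to make distance calculations more reasonable. If a feature is discrete but distances can already be computed sensibly without one-hot encoding, the encoding is unnecessary. After one-hot encoding, each dimension of the encoded feature can be treated as a continuous feature, so it can be normalized in the same way as continuous features, for example to the range [-1, 1] or to zero mean and unit variance.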
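To make the distance argument concrete, here is a small sketch comparing Euclidean distances under integer codes and under one-hot vectors. The three colour categories are made up for illustration and do not come from the article.

```python
import numpy as np

# Hypothetical unordered categorical feature with three values.
ordinal = {"red": 0, "green": 1, "blue": 2}
one_hot = {name: np.eye(3)[code] for name, code in ordinal.items()}

def dist(a, b):
    """Euclidean distance between two scalars or vectors."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    return float(np.linalg.norm(a - b))

# With integer codes, "blue" looks twice as far from "red" as "green" does.
ordinal_red_green = dist(ordinal["red"], ordinal["green"])
ordinal_red_blue = dist(ordinal["red"], ordinal["blue"])

# With one-hot vectors, every pair of distinct categories is equally far apart.
onehot_red_green = dist(one_hot["red"], one_hot["green"])
onehot_red_blue = dist(one_hot["red"], one_hot["blue"])

print(ordinal_red_green, ordinal_red_blue)
print(onehot_red_green, onehot_red_blue)
```

The integer codes impose an artificial ordering, while the one-hot vectors place all categories at the same distance (the square root of 2) from one another.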

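Tree-based methods, such as random forests and bagged or boosted trees, do not require feature normalization. Parametric models and distance-based models do. Tree models also have little need for one-hot encoding: for a decision tree, one-hot encoding essentially just increases the depth of the tree.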
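For models that do need normalization, the two schemes mentioned earlier (zero mean and unit variance, or scaling to [-1, 1]) can be sketched with plain NumPy. The small matrix `X` is invented for illustration.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: each column gets zero mean and unit variance.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling of each column to the range [-1, 1].
col_min, col_max = X.min(axis=0), X.max(axis=0)
scaled = 2 * (X - col_min) / (col_max - col_min) - 1

print(standardized)
print(scaled)
```

Both transforms work column by column, so a one-hot column can be passed through them exactly like a continuous one.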

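The advantage of one-hot encoding is that its values are only 0 and 1, with each category stored along its own orthogonal dimension. The drawback is that the feature space becomes very large when the number of categories is large. In that case PCA can generally be used to reduce the dimensionality, and the combination of one-hot encoding plus PCA is also quite useful in practice. Overall, if the number of categories is not too large, one-hot encoding is worth considering first.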
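A minimal sketch of the one-hot plus PCA combination follows, assuming scikit-learn is installed. The data (one categorical feature with 50 distinct values) and the choice of 10 components are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 200 samples of a single categorical feature
# that takes 50 distinct integer values.
X = (np.arange(200) % 50).reshape(-1, 1)

# One-hot encoding produces one column per category.
one_hot = OneHotEncoder().fit_transform(X).toarray()

# PCA projects the 50 one-hot columns onto 10 components.
reduced = PCA(n_components=10).fit_transform(one_hot)

print(one_hot.shape)  # (200, 50)
print(reduced.shape)  # (200, 10)
```

How many components to keep depends on the data; the value 10 here is not a recommendation.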

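One-hot encoding and data normalization
For non-negative integer categories, use OneHotEncoder.
For strings and mixed types, use LabelEncoder. (This split reflects scikit-learn versions before 0.20; from 0.20 on, OneHotEncoder also accepts string categories directly.)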

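# Simply put, LabelEncoder assigns consecutive integer codes to non-consecutive numbers or to text labels.
# sklearn.preprocessing.LabelEncoder(): normalizes labels, mapping the label values to integers in the range 0 to n_classes - 1.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit([1, 5, 67, 100])            # learn the sorted classes: [1, 5, 67, 100]
le.transform([1, 1, 100, 67, 5])   # replace each value with its index in classes_
# out: array([0, 0, 3, 2, 1], dtype=int64)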

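# OneHotEncoder expands categorical data into one binary column per category:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit([[1], [2], [3], [4]])                   # one feature with four categories
ohe.transform([[2], [3], [1], [4]]).toarray()   # transform returns a sparse matrix; toarray() makes it dense
# out: array([[ 0.,  1.,  0.,  0.],
#             [ 0.,  0.,  1.,  0.],
#             [ 1.,  0.,  0.,  0.],
#             [ 0.,  0.,  0.,  1.]])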
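
By default the encoder raises an error when `transform` sees a category that was not present during `fit`. The sketch below, assuming scikit-learn is installed, shows the `handle_unknown="ignore"` option, which encodes unseen values as an all-zero row instead. The unseen value 5 is an assumption added for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Same fitted categories as above: 1, 2, 3, 4.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit([[1], [2], [3], [4]])

# 2 was seen during fit; 5 was not, so its row is all zeros.
encoded = ohe.transform([[2], [5]]).toarray()
print(encoded.tolist())  # [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

Whether silently zeroing unknown categories is acceptable depends on the application; the default `handle_unknown="error"` is the stricter choice.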

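- From the source code (docstring examples from an older scikit-learn release; the `n_values_` and `feature_indices_` attributes shown here were deprecated in 0.20 and later removed):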

 Examples
    --------
    Given a dataset with three features and four samples, we let the encoder
    find the maximum value per feature and transform the data to a binary
    one-hot encoding.

    >>> from sklearn.preprocessing import OneHotEncoder
    >>> enc = OneHotEncoder()
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], \
[1, 0, 2]])  # doctest: +ELLIPSIS
    OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
           handle_unknown='error', n_values='auto', sparse=True)
    >>> enc.n_values_
    array([2, 3, 4])
    >>> enc.feature_indices_
    array([0, 2, 5, 9])
    >>> enc.transform([[0, 1, 1]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

 Examples
    --------
    `LabelEncoder` can be used to normalize labels.

    >>> from sklearn import preprocessing
    >>> le = preprocessing.LabelEncoder()
    >>> le.fit([1, 2, 2, 6])
    LabelEncoder()
    >>> le.classes_
    array([1, 2, 6])
    >>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
    array([0, 0, 1, 2]...)
    >>> le.inverse_transform([0, 0, 1, 2])
    array([1, 1, 2, 6])

    It can also be used to transform non-numerical labels (as long as they are
    hashable and comparable) to numerical labels.

    >>> le = preprocessing.LabelEncoder()
    >>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
    LabelEncoder()
    >>> list(le.classes_)
    ['amsterdam', 'paris', 'tokyo']
    >>> le.transform(["tokyo", "tokyo", "paris"]) #doctest: +ELLIPSIS
    array([2, 2, 1]...)
    >>> list(le.inverse_transform([2, 2, 1]))
    ['tokyo', 'tokyo', 'paris']

LabelEncoder and OneHotEncoder in feature engineering
The following introduces OneHotEncoder as described in the scikit-learn documentation.

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

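1. One-Hot Encoding

One-hot encoding, also called one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and at any time only one bit is active.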

Consider the following three feature attributes (gender, region and browser, listed under section 3):

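2. How One-Hot Encoding Is Applied

3. Actual Python Code

In real machine learning tasks, features are not always continuous values. Some are categorical, such as gender, which can be "male" or "female". Such features usually need to be converted to numbers, as in the following example:

Gender: ["male", "female"]
Region: ["Europe", "US", "Asia"]
Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]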

 
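For a sample such as ["male", "US", "Internet Explorer"], we need to turn these categorical values into numbers. The most direct approach is to number the values within each feature, which gives [0, 1, 3]. However, features encoded this way should not be fed directly into many machine learning algorithms, because the integer codes imply an order and a distance between categories that do not actually exist.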
         

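For the problem above, gender has two possible values, region has three, and browser has four, so we can one-hot encode the sample ["male", "US", "Internet Explorer"]: "male" maps to [1, 0], "US" maps to [0, 1, 0], and "Internet Explorer" maps to [0, 0, 0, 1]. The complete encoded feature vector is [1, 0, 0, 1, 0, 0, 0, 0, 1]. One consequence is that the data becomes very sparse.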
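
The worked example can be reproduced with a short sketch, assuming scikit-learn 0.20 or later, where `OneHotEncoder` accepts strings and a `categories` argument. The category order is passed explicitly to match the lists in the text; without it, scikit-learn would sort the categories alphabetically and the columns would come out in a different order.

```python
from sklearn.preprocessing import OneHotEncoder

# Category order copied from the feature lists in the text.
categories = [
    ["male", "female"],
    ["Europe", "US", "Asia"],
    ["Firefox", "Chrome", "Safari", "Internet Explorer"],
]
sample = [["male", "US", "Internet Explorer"]]

encoder = OneHotEncoder(categories=categories)
encoded = encoder.fit_transform(sample).toarray()

print(encoded.astype(int).tolist())  # [[1, 0, 0, 1, 0, 0, 0, 0, 1]]
```

The result has 2 + 3 + 4 = 9 columns, of which only three are non-zero, which is the sparsity the text refers to.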
