A Few Ways to Handle Categorical Variables


1. A Brief Introduction

Categorical variables are similar to enums: they take values from a fixed set of categories.

For example: red, white, and blue are values of a color category; small, medium, and large are values of a size category.

These values usually arrive as strings such as "big" or "red", so before modeling we need to do some work: either map them to numeric values we can process, or simply drop them.
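As a minimal sketch of the "map them to numeric values" idea (the `Size` column and its categories here are made up for illustration, not taken from the competition data), a hand-built mapping might look like:

```python
import pandas as pd

# Hypothetical 'Size' column with an inherent order (not from the actual dataset)
df = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})

# Hand-built mapping from category string to integer
size_map = {'small': 0, 'medium': 1, 'large': 2}
df['Size_encoded'] = df['Size'].map(size_map)
print(df['Size_encoded'].tolist())  # [0, 2, 1, 0]
```

The three methods below automate variations of this idea.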

 

2. Three Methods (explained alongside the code)

First, some preprocessing:

 

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)  # our target is SalePrice, so drop rows where it is missing

y = X.SalePrice  # separate the target
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
#Now we have the dataframe without missing values

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

 

 

1) Drop categorical variables

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# exclude=['object'] drops the categorical (object-dtype) columns

Then we can check the mean_absolute_error of each approach on the validation set.
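One sketch of such a check: a `score_dataset` helper (a common pattern in the Kaggle course this walkthrough follows) that fits a random forest and returns the validation MAE. The synthetic arrays below only stand in for `drop_X_train` / `drop_X_valid`, which exist in the notebook but not here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest and return the validation mean_absolute_error."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Tiny synthetic stand-in for drop_X_train / drop_X_valid
rng = np.random.RandomState(0)
X_tr, X_va = rng.rand(80, 3), rng.rand(20, 3)
y_tr, y_va = X_tr.sum(axis=1), X_va.sum(axis=1)
mae = score_dataset(X_tr, X_va, y_tr, y_va)
```

The same helper can then score each of the three encodings for comparison.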

 

2) Label encoding

That is, we map each category to an integer.

 

 

① Pay special attention here: we split the data into train and valid sets at the start. If we naively label-encode every categorical column fitted on the training set alone, we can hit a runtime error, because the valid set may contain categories that never appear in the training set.
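A minimal demonstration of this failure mode (with made-up color categories): a `LabelEncoder` fitted on the training categories raises a `ValueError` when asked to transform a category it has never seen:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['red', 'blue'])        # categories seen in training
try:
    encoder.transform(['green'])    # category that appears only in validation
    raised = False
except ValueError:
    raised = True
print(raised)  # True: the unseen category is rejected
```

This is exactly why the code below first identifies the columns that are safe to encode.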

② In this example the assumption is reasonable, because the categories have a unique ranking. Not every categorical variable has a clear ordering of its values; those that do are called ordinal variables. For tree-based models (such as decision trees and random forests), label encoding of ordinal variables can work well.

# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded: every category that appears
# in the validation set must also appear in the training set
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder 
label_encoder=LabelEncoder() 
for col in good_label_cols:
    label_X_train[col]=label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col]=label_encoder.transform(label_X_valid[col])

 

3) One-Hot Encoding

 

① Notice that we add one new column per category, so a column with many categories expands the table dramatically. For that reason, we typically one-hot encode only columns with relatively low cardinality; high-cardinality columns can either be dropped from the dataset or handled with label encoding instead. A common rule of thumb is a cutoff of 10 categories.

② Unlike label encoding, one-hot encoding does not assume any ordering of the categories. It can therefore work especially well when the categorical data has no clear order (e.g., "red" is neither more nor less than "yellow"). Categorical variables without an intrinsic ordering are called nominal variables.

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# handle_unknown='ignore' zeroes out categories unseen during fit;
# sparse=False returns a dense array (renamed sparse_output=False in scikit-learn 1.2+)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
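As an aside, `pd.get_dummies` plus `DataFrame.align` is a common alternative sketch for the same idea (the toy frames below are hypothetical stand-ins for `X_train` / `X_valid`): `align` keeps the train and valid column sets consistent, with categories unseen in training becoming all-zero columns:

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_valid
X_tr = pd.DataFrame({'Color': ['red', 'blue', 'red'], 'Num': [1, 2, 3]})
X_va = pd.DataFrame({'Color': ['blue', 'green'], 'Num': [4, 5]})

OH_tr = pd.get_dummies(X_tr)
OH_va = pd.get_dummies(X_va)
# Keep only the training columns: 'Color_green' is dropped,
# and 'Color_red' (absent from valid) is filled with zeros
OH_tr, OH_va = OH_tr.align(OH_va, join='left', axis=1, fill_value=0)
print(sorted(OH_tr.columns))  # ['Color_blue', 'Color_red', 'Num']
```

Either route produces frames with matching columns, ready for the same model.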

 

