A Few Ways to Handle Categorical Variables


1. A Brief Introduction

Categorical variables are similar to enums: they take values from a fixed set of categories.

For example: red, white, and blue are values of a color category; small, medium, and large are values of a size category.

These values usually arrive as strings such as "big" or "red", so before modeling we need to do some work: either map them to numeric values we can process, or simply drop them.
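As a minimal sketch of the "map them to numeric values" idea (the `Size` column and its categories here are made up for illustration, not taken from the competition data), a hand-built mapping might look like:

```python
import pandas as pd

# Hypothetical 'Size' column with an inherent order (not from the actual dataset)
df = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})

# Hand-built mapping from category string to integer
size_map = {'small': 0, 'medium': 1, 'large': 2}
df['Size_encoded'] = df['Size'].map(size_map)
print(df['Size_encoded'].tolist())  # [0, 2, 1, 0]
```

The three methods below automate variations of this idea.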

 

2. Three Methods (explained alongside the code)

First, some preprocessing:

 

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)  # our target is SalePrice, so drop rows where it is missing

y = X.SalePrice  # separate the target
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
#Now we have the dataframe without missing values

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

 

 

1) Drop categorical variables

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# exclude=['object'] drops the categorical (object-dtype) columns

Then we can check the mean_absolute_error of each approach on the validation set.
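One sketch of such a check: a `score_dataset` helper (a common pattern in the Kaggle course this walkthrough follows) that fits a random forest and returns the validation MAE. The synthetic arrays below only stand in for `drop_X_train` / `drop_X_valid`, which exist in the notebook but not here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest and return the validation mean_absolute_error."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Tiny synthetic stand-in for drop_X_train / drop_X_valid
rng = np.random.RandomState(0)
X_tr, X_va = rng.rand(80, 3), rng.rand(20, 3)
y_tr, y_va = X_tr.sum(axis=1), X_va.sum(axis=1)
mae = score_dataset(X_tr, X_va, y_tr, y_va)
```

The same helper can then score each of the three encodings for comparison.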

 

2) Label encoding

That is, we map each category to an integer.

 

 

① Pay special attention here: we split the data into train and valid sets at the start. If we naively label-encode every categorical column fitted on the training set alone, we can hit a runtime error, because the valid set may contain categories that never appear in the training set.
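A minimal demonstration of this failure mode (with made-up color categories): a `LabelEncoder` fitted on the training categories raises a `ValueError` when asked to transform a category it has never seen:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['red', 'blue'])        # categories seen in training
try:
    encoder.transform(['green'])    # category that appears only in validation
    raised = False
except ValueError:
    raised = True
print(raised)  # True: the unseen category is rejected
```

This is exactly why the code below first identifies the columns that are safe to encode.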

② In this example the assumption is reasonable, because the categories have a unique ranking. Not every categorical variable has a clear ordering of its values; those that do are called ordinal variables. For tree-based models (such as decision trees and random forests), label encoding of ordinal variables can work well.

# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded: every category that appears
# in the validation set must also appear in the training set
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder 
label_encoder=LabelEncoder() 
for col in good_label_cols:
    label_X_train[col]=label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col]=label_encoder.transform(label_X_valid[col])

 

3) One-Hot Encoding

 

① Notice that we add one new column per category, so a column with many categories expands the table dramatically. For that reason, we typically one-hot encode only columns with relatively low cardinality; high-cardinality columns can either be dropped from the dataset or handled with label encoding instead. A common rule of thumb is a cutoff of 10 categories.

② Unlike label encoding, one-hot encoding does not assume any ordering of the categories. It can therefore work especially well when the categorical data has no clear order (e.g., "red" is neither more nor less than "yellow"). Categorical variables without an intrinsic ordering are called nominal variables.

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# handle_unknown='ignore' zeroes out categories unseen during fit;
# sparse=False returns a dense array (renamed sparse_output=False in scikit-learn 1.2+)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
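As an aside, `pd.get_dummies` plus `DataFrame.align` is a common alternative sketch for the same idea (the toy frames below are hypothetical stand-ins for `X_train` / `X_valid`): `align` keeps the train and valid column sets consistent, with categories unseen in training becoming all-zero columns:

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_valid
X_tr = pd.DataFrame({'Color': ['red', 'blue', 'red'], 'Num': [1, 2, 3]})
X_va = pd.DataFrame({'Color': ['blue', 'green'], 'Num': [4, 5]})

OH_tr = pd.get_dummies(X_tr)
OH_va = pd.get_dummies(X_va)
# Keep only the training columns: 'Color_green' is dropped,
# and 'Color_red' (absent from valid) is filled with zeros
OH_tr, OH_va = OH_tr.align(OH_va, join='left', axis=1, fill_value=0)
print(sorted(OH_tr.columns))  # ['Color_blue', 'Color_red', 'Num']
```

Either route produces frames with matching columns, ready for the same model.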

 

