1、Brief Introduction
Categorical variables are similar to enums: they take on a limited number of possible values.
For example, red/white/blue are values of a color variable, and small/medium/large are values of a size variable.
These values usually arrive as raw strings such as "big" or "red", so before modeling we need to do some work: either map them to values a model can process, or simply drop them.
2、Three Methods (explained alongside the code)
First, the preprocessing:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with a missing target, then separate the target from the predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)  # SalePrice is the target, so rows missing it must go
y = X.SalePrice  # select the target
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Now the DataFrames contain no missing values

# Break off a validation set from the training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8,
                                                      test_size=0.2, random_state=0)
1) Drop categorical variables

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# exclude=['object'] filters out the categorical (string) columns
We can then check the mean_absolute_error of a model trained on the remaining numeric columns.
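As a sketch of that check, a small helper could fit a model and return the validation MAE (the name score_dataset and the random-forest settings are assumptions, not from the original):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest on the training data and return the validation MAE."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Usage with the frames built above:
# print("MAE (drop categorical columns):",
#       score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
```

The same helper can score each of the three approaches, so their MAEs are directly comparable.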
2) Label encoding
That is, map each category to an integer.
① One thing to watch out for: we split the data into train and valid sets at the start. If we naively fit a label encoder on the training categories alone, applying it to the validation set can raise an error, because the validation set may contain categories that never appear in the training set.
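The failure mode can be reproduced on a toy split (the color values here are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(['red', 'blue', 'red'])
valid = pd.Series(['red', 'green'])  # 'green' never appears in train

encoder = LabelEncoder().fit(train)
try:
    encoder.transform(valid)
except ValueError as err:
    print(err)  # transform() rejects labels it has never seen
```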
② In this example, the encoding makes sense when there is a unique ranking of the categories. Not all categorical variables have a clear ordering among their values; those that do are called ordinal variables. For tree-based models (such as decision trees and random forests), label encoding of ordinal variables usually works well.
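To make the ordering point concrete: a hand-written mapping preserves the ranking, whereas LabelEncoder assigns codes alphabetically ('large' < 'medium' < 'small'), which scrambles it. A minimal sketch, with illustrative size values:

```python
import pandas as pd

# An ordinal variable: the categories have a natural order
sizes = pd.Series(['small', 'large', 'medium', 'small'])
order = {'small': 0, 'medium': 1, 'large': 2}  # hand-chosen ranking
encoded = sizes.map(order)
print(list(encoded))  # → [0, 2, 1, 0]
```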

# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols
                   if set(X_train[col]) == set(X_valid[col])]
# A column is safe only when X_valid contains exactly the same categories as X_train

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols) - set(good_label_cols))

from sklearn.preprocessing import LabelEncoder

# Drop the categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply the label encoder
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col] = label_encoder.transform(label_X_valid[col])
3) One-hot encoding
① One-hot encoding adds one new column per category, so a column with many categories blows up the width of the dataset. For that reason we usually one-hot encode only columns with relatively low cardinality; high-cardinality columns can either be dropped from the dataset or label encoded instead. A common cutoff is to keep columns with fewer than 10 categories.
② Unlike label encoding, one-hot encoding does not assume any ordering of the categories. It therefore tends to work especially well when the categorical data has no clear ordering (e.g., "red" is neither more nor less than "yellow"). Categorical variables without an intrinsic ordering are called nominal variables.
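Before the full pipeline, the effect is easy to see with pandas' get_dummies on a toy column (used here just for illustration; the project code below uses scikit-learn's OneHotEncoder):

```python
import pandas as pd

colors = pd.DataFrame({'Color': ['red', 'yellow', 'red', 'blue']})
one_hot = pd.get_dummies(colors['Color'])
print(one_hot)  # one column per category; each row has exactly one 1
```

Each category becomes its own indicator column, so no spurious order between "red" and "yellow" is introduced.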

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols
                        if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))

from sklearn.preprocessing import OneHotEncoder

# Apply the one-hot encoder to each column with categorical data
# (on scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove the categorical columns (they will be replaced by the one-hot columns)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add the one-hot encoded columns to the numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)