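- In practice, work with pandas often involves categorical data, usually displayed as strings: colors (red, green, blue), sizes (large, medium, small), geographic information (country, province), and so on. Such data tends to cause all sorts of problems during processing. Both pandas and scikit-learn can convert categorical data into a suitable numeric format, and this article walks through that conversion, i.e. encoding, with the two packages.
- The data comes from the UCI Machine Learning Repository. The dataset contains many categorical columns; the meaning of each field can be looked up via the dataset link.
- Start by importing the packages we need: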
import numpy as np
import pandas as pd
# Define the names of the data columns
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"]
# Read the CSV file with pandas, treating "?" as a missing value (NaN)
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()
| | symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |

5 rows × 26 columns
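To see what `na_values="?"` does without downloading the file, here is a minimal sketch on a made-up two-row CSV. The sample rows and the shortened column list are assumptions for illustration only, not the real data.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same "?"-for-missing style
raw = "3,?,alfa-romero,two\n2,164,audi,?\n"

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["symboling", "normalized_losses", "make", "num_doors"],
    na_values="?",  # every "?" cell is parsed as NaN
)

# Count the missing values per column
print(sample.isnull().sum())
```

Because the "?" cells become NaN, a column such as normalized_losses can be parsed as a float column instead of falling back to strings.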
df.dtypes
symboling int64
normalized_losses float64
make object
fuel_type object
aspiration object
num_doors object
body_style object
drive_wheels object
engine_location object
wheel_base float64
length float64
width float64
height float64
curb_weight int64
engine_type object
num_cylinders object
engine_size int64
fuel_system object
bore float64
stroke float64
compression_ratio float64
horsepower float64
peak_rpm float64
city_mpg int64
highway_mpg int64
price float64
dtype: object
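# If only the categorical data is of interest, there is no need to carry the whole dataset around: select just the object-typed columns and continue the analysis on those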
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
| 1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi |
| 2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi |
| 3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi |
| 4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi |
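# Before going further, the missing values need handling. isnull().any(axis=1) checks across the columns of each row, so this selects the rows that contain at least one NaN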
obj_df[obj_df.isnull().any(axis=1)]
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 27 | dodge | gas | turbo | NaN | sedan | fwd | front | ohc | four | mpfi |
| 63 | mazda | diesel | std | NaN | sedan | fwd | front | ohc | four | idi |
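Besides listing the affected rows, it can help to count the missing values per column. A minimal sketch on a small stand-in frame (the three rows below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for obj_df
sample = pd.DataFrame({
    "make": ["dodge", "mazda", "audi"],
    "num_doors": [np.nan, np.nan, "four"],
    "body_style": ["sedan", "sedan", "sedan"],
})

# Number of NaN values in each column
missing_per_column = sample.isnull().sum()

# Rows that contain at least one NaN (same idiom as above)
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```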
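# There are many ways to handle missing values: depending on the project, either fill them in or drop the sample. Here the missing values are filled with the column's mode (its most frequent value).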
obj_df.num_doors.value_counts()
four 114
two 89
Name: num_doors, dtype: int64
obj_df=obj_df.fillna({"num_doors":"four"})
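The fill value "four" is hard-coded above after reading it off value_counts(). As a sketch of an alternative, the mode can be computed instead; the short Series below is a made-up stand-in for obj_df["num_doors"].

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for obj_df["num_doors"]
doors = pd.Series(["four", "two", "four", np.nan, "two", "four"], name="num_doors")

# mode() ignores NaN and returns a Series (ties are possible),
# so take its first entry
most_common = doors.mode().iloc[0]

filled = doors.fillna(most_common)
print(most_common)
print(filled.tolist())
```

This avoids editing the code by hand if the data changes, at the price of silently picking one value when several are tied.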
With the missing values handled, there are several ways to encode categorical data:
- Find and replace
- Label encoding
- One hot encoding
- Custom binary encoding
- scikit-learn
- Advanced approaches
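# The pandas documentation for replace is extensive: the method takes many parameters and is correspondingly powerful
# Here replace is used with a mapping dictionary, which makes it easy to clean up exactly the columns that need it, for example: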
cleanup_nums= {
"num_doors":{"four":4,"two":2},
"num_cylinders":{
"four":4,"six":6,"five":5,"eight":8,"two":2,"twelve":12,"three":3
}
}
obj_df.replace(cleanup_nums,inplace=True)
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
| 1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi |
| 2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi |
| 3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi |
| 4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi |
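For a single column, Series.map is a possible alternative to replace. The difference worth knowing: replace leaves values that are not in the dictionary untouched, while map turns them into NaN, which makes typos or unexpected labels easy to spot. The sketch below uses invented sample values and the hypothetical name word_to_num.

```python
import pandas as pd

# Same word-to-number idea as cleanup_nums["num_cylinders"]
word_to_num = {"four": 4, "six": 6, "five": 5, "eight": 8,
               "two": 2, "twelve": 12, "three": 3}

# Hypothetical sample values
cylinders = pd.Series(["four", "six", "five", "four", "twelve"])
as_numbers = cylinders.map(word_to_num)
print(as_numbers.tolist())

# A label missing from the dictionary ("nine") becomes NaN with map
unmapped = pd.Series(["four", "nine"]).map(word_to_num)
print(unmapped.isnull().tolist())
```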
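Label encoding converts a set of unordered values, which have no natural magnitude, into numbers.
- For example, the body_style field has several distinct values that can be converted this way:
- convertible -> 0
- hardtop -> 1
- hatchback -> 2
- sedan -> 3
- wagon -> 4
This works much like a cipher, with each label swapped for a code number. The comparison is an amusing one; it reminds me of a line from a movie: "those two are as close as a pair of thieves."
# The pandas category dtype makes it easy to obtain this encoding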
obj_df["body_style"]=obj_df["body_style"].astype("category")
obj_df.dtypes
make object
fuel_type object
aspiration object
num_doors int64
body_style category
drive_wheels object
engine_location object
engine_type object
num_cylinders int64
fuel_system object
dtype: object
# The corresponding codes can be stored by assigning them to a new column
# This keeps the data tidy and easier to analyze and organize later
obj_df["body_style_code"] = obj_df["body_style"].cat.codes
obj_df.head()
| | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system | body_style_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
| 1 | alfa-romero | gas | std | 2 | convertible | rwd | front | dohc | 4 | mpfi | 0 |
| 2 | alfa-romero | gas | std | 2 | hatchback | rwd | front | ohcv | 6 | mpfi | 2 |
| 3 | audi | gas | std | 4 | sedan | fwd | front | ohc | 4 | mpfi | 3 |
| 4 | audi | gas | std | 4 | sedan | 4wd | front | ohc | 5 | mpfi | 3 |
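Once a column holds codes, it is useful to be able to translate them back. A minimal sketch, using a made-up Series in place of obj_df["body_style"]; the name code_to_label is chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for obj_df["body_style"]
body = pd.Series(
    ["convertible", "hatchback", "sedan", "sedan", "wagon", "hardtop"]
).astype("category")

# Codes follow the order of body.cat.categories (sorted labels by default)
codes = body.cat.codes

# Lookup table from code back to label
code_to_label = dict(enumerate(body.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Note that a missing value in a category column gets the code -1, so it is worth filling or dropping NaN before relying on the codes.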
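One hot encoding
- Label encoding turns wagon into 4 and convertible into 0, which can suggest an ordering that does not exist (is wagon "greater than" convertible?) and may mislead a model. One hot encoding instead turns each category value into its own 0/1 feature. This increases the number of columns, but removes the misleading magnitude comparison that label encoding introduces.
- pandas provides the get_dummies function, which converts the values of the chosen columns into 0/1 indicator columns.
# The resulting DataFrame contains three newly generated columns:
# drive_wheels_4wd
# drive_wheels_fwd
# drive_wheels_rwd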
pd.get_dummies(obj_df,columns=["drive_wheels"]).head()
| | make | fuel_type | aspiration | num_doors | body_style | engine_location | engine_type | num_cylinders | fuel_system | body_style_code | drive_wheels_4wd | drive_wheels_fwd | drive_wheels_rwd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
| 1 | alfa-romero | gas | std | 2 | convertible | front | dohc | 4 | mpfi | 0 | 0 | 0 | 1 |
| 2 | alfa-romero | gas | std | 2 | hatchback | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 |
| 3 | audi | gas | std | 4 | sedan | front | ohc | 4 | mpfi | 3 | 0 | 1 | 0 |
| 4 | audi | gas | std | 4 | sedan | front | ohc | 5 | mpfi | 3 | 1 | 0 | 0 |
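# What makes get_dummies powerful is that it can handle several categorical columns at once, with a matching prefix for each
# The resulting DataFrame still contains all the other data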
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()
| | make | fuel_type | aspiration | num_doors | engine_location | engine_type | num_cylinders | fuel_system | body_style_code | body_convertible | body_hardtop | body_hatchback | body_sedan | body_wagon | drive_4wd | drive_fwd | drive_rwd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | alfa-romero | gas | std | 2 | front | dohc | 4 | mpfi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | alfa-romero | gas | std | 2 | front | ohcv | 6 | mpfi | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | audi | gas | std | 4 | front | ohc | 4 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | audi | gas | std | 4 | front | ohc | 5 | mpfi | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
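Two get_dummies options are worth a quick sketch. The tables above show 0/1 values, but in pandas 2.0 and later the indicator columns are boolean by default, so passing dtype=int is one way to get integers back. drop_first=True drops one indicator per column, since it is implied by the others. The small frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical three-row sample
cars = pd.DataFrame({
    "make": ["alfa-romero", "audi", "audi"],
    "drive_wheels": ["rwd", "fwd", "4wd"],
})

# Integer 0/1 indicator columns, regardless of the pandas default
dummies = pd.get_dummies(cars, columns=["drive_wheels"],
                         prefix="drive", dtype=int)

# Drop the first level (drive_4wd): it is 1 exactly when the rest are all 0
reduced = pd.get_dummies(cars, columns=["drive_wheels"],
                         prefix="drive", dtype=int, drop_first=True)

print(dummies)
print(reduced)
```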
Custom 0/1 encoding
- Sometimes, depending on business needs, label encoding and one hot encoding are combined to binarize a column.
obj_df["engine_type"].value_counts()
ohc 148
ohcf 15
ohcv 13
dohc 12
l 12
rotor 4
dohcv 1
Name: engine_type, dtype: int64
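# To tell whether an engine_type is one of the ohc variants, the column can be binarized
# This also shows how domain knowledge helps solve a problem in the most effective way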
obj_df["engine_type_code"] = np.where(obj_df["engine_type"].str.contains("ohc"),1,0)
obj_df[["make","engine_type","engine_type_code"]].head()
| | make | engine_type | engine_type_code |
|---|---|---|---|
| 0 | alfa-romero | dohc | 1 |
| 1 | alfa-romero | dohc | 1 |
| 2 | alfa-romero | ohcv | 1 |
| 3 | audi | ohc | 1 |
| 4 | audi | ohc | 1 |
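The engine_type column has no missing values in this dataset, but str.contains returns NaN for a missing entry unless told otherwise. A sketch of a variant that makes the handling explicit with na=False, on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample, including one missing value
engine = pd.Series(["dohc", "ohcv", "l", "rotor", np.nan, "ohcf"])

# na=False treats missing entries as "does not contain";
# astype(int) turns True/False into 1/0
is_ohc = engine.str.contains("ohc", na=False).astype(int)

print(is_ohc.tolist())
```

Whether a missing engine type should count as 0 is a modelling decision; the point is only that the choice is written down rather than left implicit.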
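Data conversion in scikit-learn
- The sklearn.preprocessing module provides many convenient data transformations, along with missing-value handling (Imputer; in recent scikit-learn versions this is SimpleImputer in sklearn.impute). LabelEncoder and LabelBinarizer can be imported from it directly. It also covers min-max scaling to the [0, 1] range, Normalizer (L1/L2 normalization, generally not used much), standardization, nonlinear transformations, and polynomial feature generation (PolynomialFeatures), all of which bring each feature onto a comparable range or distribution.
- Link to the official documentation of the sklearn preprocessing module
- Official documentation of the category_encoders package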
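As a rough sketch of the scikit-learn route, LabelEncoder and LabelBinarizer mirror the label and one hot encodings shown earlier with pandas. The sample Series is made up. Also note that the scikit-learn documentation describes these two classes as meant for target labels; for feature columns it points to OrdinalEncoder and OneHotEncoder, so treat this as an illustration rather than a recommended pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical stand-in for obj_df["body_style"]
body = pd.Series(["convertible", "hatchback", "sedan", "sedan", "wagon", "hardtop"])

# Label encoding: classes are sorted, then numbered from 0
le = LabelEncoder()
codes = le.fit_transform(body)
print(list(codes))
print(list(le.classes_))

# One hot style encoding: one 0/1 column per class
lb = LabelBinarizer()
onehot = pd.DataFrame(lb.fit_transform(body), columns=lb.classes_)
print(onehot)

# Codes can be turned back into labels
print(le.inverse_transform([4])[0])
```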
That roughly wraps up data preprocessing and categorical encoding.