The pandas category data type

This article is a repost (see the original). 2018-08-02 15:53, tagged pandas

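Real-world work with pandas frequently involves categorical data, which usually appears as strings: colors (red, green, blue), sizes (large, medium, small), geographic information (country, province), and so on. Handling such data raises all kinds of problems. Both pandas and scikit-learn can convert categorical data into a suitable numeric format, and this post shows how to use the two packages for that conversion, a process known as encoding.
The data comes from the UCI Machine Learning Repository. The dataset contains many categorical columns; the meaning of each column is described on the dataset's page.
Start by importing the packages we need.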

import numpy as np
import pandas as pd

# Names for the data columns
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]
# Read the CSV file with pandas, treating "?" as a missing value (NaN)
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()

	symboling	normalized_losses	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	wheel_base	...	engine_size	fuel_system	bore	stroke	compression_ratio	horsepower	peak_rpm	city_mpg	highway_mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	13495.0
1	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111.0	5000.0	21	27	16500.0
2	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154.0	5000.0	19	26	16500.0
3	2	164.0	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102.0	5500.0	24	30	13950.0
4	2	164.0	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115.0	5500.0	18	22	17450.0

5 rows × 26 columns
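The effect of na_values can be checked without downloading anything. The following is a minimal sketch that assumes two made-up rows and a shortened, hypothetical column list; only the "?" placeholder convention is taken from the real file.

```python
import io

import pandas as pd

# Two made-up rows in the same header-less, comma-separated layout;
# "?" stands in for a missing value, as in the real file.
raw = "3,?,alfa-romero,two\n2,164,audi,?\n"
cols = ["symboling", "normalized_losses", "make", "num_doors"]  # shortened list

sample = pd.read_csv(io.StringIO(raw), header=None, names=cols, na_values="?")

# Each "?" has become NaN, so the numeric column is parsed as float
print(sample.isnull().sum())
print(sample.dtypes)
```

Without na_values="?", normalized_losses would be read as text because of the "?" entries, which is why the placeholder is declared at load time.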

df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

# If only the categorical data matters, there is no need to carry every column along:
# select the object-dtype columns and continue the analysis with those
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system
0	alfa-romero	gas	std	two	convertible	rwd	front	dohc	four	mpfi
1	alfa-romero	gas	std	two	convertible	rwd	front	dohc	four	mpfi
2	alfa-romero	gas	std	two	hatchback	rwd	front	ohcv	six	mpfi
3	audi	gas	std	four	sedan	fwd	front	ohc	four	mpfi
4	audi	gas	std	four	sedan	4wd	front	ohc	five	mpfi

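# Before going further, the missing values need handling.
# any(axis=1) checks across the columns of each row, so this selects every row that contains at least one NaN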
obj_df[obj_df.isnull().any(axis=1)]

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system
27	dodge	gas	turbo	NaN	sedan	fwd	front	ohc	four	mpfi
63	mazda	diesel	std	NaN	sedan	fwd	front	ohc	four	idi
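The axis argument is easy to misread, so here is a small sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset) that contrasts the row-wise and column-wise checks.

```python
import numpy as np
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "make": ["dodge", "audi", "mazda"],
    "num_doors": [np.nan, "four", np.nan],
})

# axis=1: one boolean per row, True if any column in that row is missing
rows_with_nan = toy[toy.isnull().any(axis=1)]

# axis=0 (the default): one boolean per column, True if any row is missing
cols_with_nan = toy.isnull().any(axis=0)

print(rows_with_nan)
print(cols_with_nan)
```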

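# There are many ways to handle missing values; depending on the project you either fill them in or drop the sample.
# Here the missing values are filled with the column's mode (its most frequent value).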
obj_df.num_doors.value_counts()

four    114
two      89
Name: num_doors, dtype: int64

obj_df = obj_df.fillna({"num_doors": "four"})
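The fill value above is typed in by hand after reading the counts. As a hedged alternative, the mode can be looked up programmatically; the sketch below uses a made-up four-row column rather than the real data.

```python
import numpy as np
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({"num_doors": ["four", "two", "four", np.nan]})

# mode() ignores NaN and returns a Series (ties yield several values),
# so take the first entry
most_common = toy["num_doors"].mode()[0]

toy = toy.fillna({"num_doors": most_common})
print(toy)
```

This avoids hard-coding "four", at the cost of silently picking the first value when two categories tie.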

With the missing values handled, there are several ways to encode the categorical data:

Find and replace
Label encoding
One-hot encoding
Custom binary encoding
scikit-learn
Advanced approaches

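# The documentation for pandas' replace is extensive; the method takes many parameters and is correspondingly powerful.
# Here replace is given a nested mapping dictionary, which makes it convenient to clean up only the columns that need it, for example: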
cleanup_nums= {
    "num_doors":{"four":4,"two":2},
    "num_cylinders":{
        "four":4,"six":6,"five":5,"eight":8,"two":2,"twelve":12,"three":3
    }
}
obj_df.replace(cleanup_nums,inplace=True)
obj_df.head()

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system
0	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi
1	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi
2	alfa-romero	gas	std	2	hatchback	rwd	front	ohcv	6	mpfi
3	audi	gas	std	4	sedan	fwd	front	ohc	4	mpfi
4	audi	gas	std	4	sedan	4wd	front	ohc	5	mpfi
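For a single column, Series.map with a plain dictionary is a common alternative to replace. The sketch below is an illustration on made-up values, not the approach used in the article; the names toy, word_to_int and with_typo are hypothetical.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({"num_cylinders": ["four", "six", "four", "twelve"]})
word_to_int = {"four": 4, "six": 6, "twelve": 12}

# map() looks every value up in the dictionary
toy["num_cylinders"] = toy["num_cylinders"].map(word_to_int)
print(toy)

# A value missing from the dictionary becomes NaN instead of passing through
with_typo = pd.Series(["four", "ten"]).map(word_to_int)
print(with_typo)
```

The design difference is in how unmapped values behave: replace leaves them unchanged, while map turns them into NaN, which can make an incomplete mapping easier to spot.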

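Label encoding converts a set of unordered values, which have no natural size comparison, into numbers.

For example, the body_style column holds several distinct values, which this method converts as follows:
convertible -> 0
hardtop -> 1
hatchback -> 2
sedan -> 3
wagon -> 4

This works much like a substitution code. I find the comparison amusing; it reminds me of a line from a film I once watched, that the two of them were as close as a pair of thieves.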

# With the pandas category dtype, these codes are easy to obtain
obj_df["body_style"] = obj_df["body_style"].astype("category")
obj_df.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object

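# Assign the codes to a new column so they are stored next to the original labels
# This keeps the data tidy and convenient for later analysis and cleanup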
obj_df["body_style_code"] = obj_df["body_style"].cat.codes
obj_df.head()

	make	fuel_type	aspiration	num_doors	body_style	drive_wheels	engine_location	engine_type	num_cylinders	fuel_system	body_style_code
0	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi	0
1	alfa-romero	gas	std	2	convertible	rwd	front	dohc	4	mpfi	0
2	alfa-romero	gas	std	2	hatchback	rwd	front	ohcv	6	mpfi	2
3	audi	gas	std	4	sedan	fwd	front	ohc	4	mpfi	3
4	audi	gas	std	4	sedan	4wd	front	ohc	5	mpfi	3
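To see which label each code stands for, the categories can be enumerated. This is a small sketch on a made-up Series with only three of the body styles, so the codes differ from the five-category column above.

```python
import pandas as pd

# Illustrative data only: three distinct body styles
styles = pd.Series(["sedan", "convertible", "wagon", "sedan"]).astype("category")

# Categories built from strings are sorted, and a code is the position
# of the label within that sorted list
code_to_label = dict(enumerate(styles.cat.categories))
codes = styles.cat.codes

print(code_to_label)
print(codes.tolist())
```

Keeping such a lookup around makes it possible to translate the codes back to labels after modeling. Missing values, if there are any, receive the code -1.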

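One-hot encoding

Label encoding turns wagon into 4 and convertible into 0, which invites a size comparison between the two that does not exist and can be misleading. One-hot encoding instead converts each category value into its own 0/1 column. This increases the number of columns, but it removes the spurious ordering that label encoding introduces.
pandas provides the get_dummies function, which converts the values of the chosen columns into 0/1 indicator columns.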

# The new DataFrame contains three newly generated columns:
# drive_wheels_4wd
# drive_wheels_fwd
# drive_wheels_rwd
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

	make	fuel_type	aspiration	num_doors	body_style	engine_location	engine_type	num_cylinders	fuel_system	body_style_code	drive_wheels_4wd	drive_wheels_fwd	drive_wheels_rwd
0	alfa-romero	gas	std	2	convertible	front	dohc	4	mpfi	0	0	0	1
1	alfa-romero	gas	std	2	convertible	front	dohc	4	mpfi	0	0	0	1
2	alfa-romero	gas	std	2	hatchback	front	ohcv	6	mpfi	2	0	0	1
3	audi	gas	std	4	sedan	front	ohc	4	mpfi	3	0	1	0
4	audi	gas	std	4	sedan	front	ohc	5	mpfi	3	1	0	0

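# get_dummies is powerful because it can encode several categorical columns at once, with a prefix chosen for each
# The resulting DataFrame still contains all of the other columns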
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

	make	fuel_type	aspiration	num_doors	engine_location	engine_type	num_cylinders	fuel_system	body_style_code	body_convertible	body_hardtop	body_hatchback	body_sedan	body_wagon	drive_4wd	drive_fwd	drive_rwd
0	alfa-romero	gas	std	2	front	dohc	4	mpfi	0	1	0	0	0	0	0	0	1
1	alfa-romero	gas	std	2	front	dohc	4	mpfi	0	1	0	0	0	0	0	0	1
2	alfa-romero	gas	std	2	front	ohcv	6	mpfi	2	0	0	1	0	0	0	0	1
3	audi	gas	std	4	front	ohc	4	mpfi	3	0	0	0	1	0	0	1	0
4	audi	gas	std	4	front	ohc	5	mpfi	3	0	0	0	1	0	1	0	0
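Two optional parameters of get_dummies are worth knowing. The sketch below uses a made-up four-row column: dtype=int requests integer 0/1 output (recent pandas versions return booleans by default), and drop_first=True drops one indicator per column, since it is implied by the others.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})

# One indicator column per category, as integers
dummies = pd.get_dummies(toy, columns=["drive_wheels"], prefix="drive", dtype=int)

# drop_first removes the first category (4wd); a row of all zeros then means 4wd
reduced = pd.get_dummies(toy, columns=["drive_wheels"], prefix="drive",
                         dtype=int, drop_first=True)

print(dummies)
print(reduced)
```

Whether to drop a column is a modeling choice: linear models with an intercept usually benefit from it, while tree-based models are typically indifferent.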

Custom binary encoding

Sometimes, depending on business needs, label encoding and one-hot encoding are combined to binarize a column.

obj_df["engine_type"].value_counts()

ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

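# Sometimes all that matters is whether engine_type uses OHC (overhead cam) technology; binarizing the column captures exactly that
# This also shows how domain knowledge helps solve a problem in the most effective way
obj_df["engine_type_code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)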
obj_df[["make","engine_type","engine_type_code"]].head()

	make	engine_type	engine_type_code
0	alfa-romero	dohc	1
1	alfa-romero	dohc	1
2	alfa-romero	ohcv	1
3	audi	ohc	1
4	audi	ohc	1
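The first five rows all happen to be OHC engines, so the 0 case never appears in the output above. A short sketch on a made-up Series shows both outcomes; the engine labels are taken from the value counts, the Series itself is illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data only, using labels that occur in engine_type
engines = pd.Series(["ohc", "dohc", "l", "rotor", "ohcv"])

# Plain substring test: True wherever "ohc" occurs anywhere in the label
is_ohc = engines.str.contains("ohc", regex=False)

# np.where picks 1 where the condition holds and 0 elsewhere
flags = np.where(is_ohc, 1, 0)
print(flags)
```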

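Data transformation in scikit-learn

The sklearn.preprocessing module provides many convenient tools for transforming data, along with missing-value imputation (Imputer in older releases; current releases provide SimpleImputer in sklearn.impute). LabelEncoder and LabelBinarizer can be imported from it directly. It also covers min-max scaling to the range 0 to 1 (MinMaxScaler), standardization (StandardScaler), Normalizer for L1 or L2 normalization (not used very often), non-linear transformations, and polynomial feature generation (PolynomialFeatures). The scalers bring every feature into the same range or distribution.
For details, see the official documentation of the sklearn.preprocessing module and of the category_encoders package.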
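As a rough illustration of the two encoders named above, the sketch below applies LabelEncoder and LabelBinarizer to a made-up list of body styles. It assumes scikit-learn is installed; the variable names are hypothetical.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Illustrative data only
body_styles = ["sedan", "convertible", "wagon", "sedan"]

# LabelEncoder: one integer per label, numbered in sorted label order
le = LabelEncoder()
codes = le.fit_transform(body_styles)
print(le.classes_, codes)

# LabelBinarizer: one 0/1 column per label, comparable to get_dummies
lb = LabelBinarizer()
onehot = lb.fit_transform(body_styles)
print(onehot)
```

Both objects remember the fitted labels (classes_), so the same encoding can be applied to new data with transform and undone with inverse_transform. The scikit-learn documentation describes LabelEncoder as intended for target labels; for feature columns, OrdinalEncoder and OneHotEncoder from the same module are the documented choices.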

That covers the basics of data preprocessing and categorical encoding.
