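Experiment Objectives

Preprocess the dataset so that it can be used for subsequent machine learning. This covers handling missing values in several ways, converting variables to numeric types, filling missing values with a machine learning model, and shuffling and persisting the data.

Experiment Requirements

Handle the missing values in the dataset
Convert the non-numeric variables in the dataset
Standardize the dataset
Save the preprocessed dataset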

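Experiment Procedure

Variable Overview

Bank client information:

1 - age: age (numeric)
2 - job: type of job: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'
3 - marital: marital status: 'divorced', 'married', 'single', 'unknown'. Note: 'divorced' also covers widowed
4 - education: education level: 'basic.4y' (4 years of basic education), 'basic.6y' (6 years), 'basic.9y' (9 years), 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown'
5 - default: has credit in default? ('no', 'yes', 'unknown')
6 - housing: has a housing loan? ('no', 'yes', 'unknown')
7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')
Contact-related information:
8 - contact: contact type: 'cellular' (mobile phone) or 'telephone' (landline)
9 - month: month of the last contact in the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: duration of the last contact, in seconds. Important note: this attribute strongly affects the output target (for example, if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call has ended y is obviously known. This input should therefore only be included for benchmarking and should be discarded if a realistic predictive model is intended. (The call duration is not known at prediction time.)
Other attributes:
12 - campaign: number of contacts made to this client during this campaign (numeric, includes the last contact)
13 - pdays: number of days that have passed since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14 - previous: number of contacts made to this client before this campaign (numeric)
15 - poutcome: outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')
Social and economic context attributes:
16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
17 - cons.price.idx: consumer price index, monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
20 - nr.employed: number of employees, quarterly indicator (numeric)
Output variable (target):
21 - y: did the client make a deposit (that is, was the marketing successful)? (binary: 'yes', 'no')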

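Data Preprocessing

1. Loading the data

Load the data and inspect it with head().
To simplify later processing, store the column names of the categorical and numeric variables in separate lists: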
```
numberVar=['age',...] categoryVar = [ ...]
```


import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")  # silence library warnings in the notebook output
%matplotlib inline

# The file is semicolon-separated
df = pd.read_csv("bank-additional-full.csv", sep=';')
df.shape

(41188, 21)

numberVar=['age','duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
categoryVar=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']
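
The two lists above are written out by hand. As a rough sketch, similar lists can be derived from the column dtypes with select_dtypes. The toy frame and the names number_cols and category_cols below are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy frame standing in for the bank data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "job": ["housemaid", "services", "admin."],
    "euribor3m": [4.857, 4.857, 4.857],
    "y": ["no", "no", "yes"],
})

# Numeric columns (int/float) versus everything else (text columns).
number_cols = toy.select_dtypes(include="number").columns.tolist()
category_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(number_cols)
print(category_cols)
```

A hand-written list is still useful when a numeric-looking column should be treated as categorical, so this is a starting point rather than a replacement.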

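2. Handling missing values

The dataset has 20 input features, divided into numeric and categorical variables. The earlier data summary shows that the numeric variables (int64 and float64) contain no NaN values, while the non-numeric variables may contain the value unknown. This section requires you to:

Check the proportion of missing values for each variable
Classify the variables that have missing values into high, medium, and low missing rates

2.1 Checking for missing values

Calling df.isnull().any() reports no feature with missing values (NaN).
In this dataset, however, missing values are encoded in other ways. Most categorical features use unknown to mark a missing value, while poutcome uses nonexistent; among the numeric variables only pdays has missing values, encoded as the number 999. This step requires printing the missing-value percentage of every variable that has missing values.

Check the missing-value percentage of all categorical variables (plus pdays). Unlike the Demo, three values (unknown, nonexistent, 999) all count as missing: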

 
cols = categoryVar + ['pdays']
total = df.shape[0]
for col in cols:
    v = df[col].value_counts().to_dict()
    if 'unknown' in v:
        unCount = v['unknown']
    elif 'nonexistent' in v:
        unCount = v['nonexistent']
    elif 999 in v:  # pdays is numeric, so the key is the integer 999, not the string '999'
        unCount = v[999]
    else:
        continue
    print("%-10s: %5.1f%%" % (col, unCount / total * 100))

job       :   0.8%
marital   :   0.2%
education :   4.2%
default   :  20.9%
housing   :   2.4%
loan      :   2.4%
poutcome  :  86.3%
pdays     :  96.3%
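
The same check can be written more compactly by pairing each column with its own missing-value marker. This is only a sketch on made-up data; the sentinels dictionary and the toy frame are assumptions introduced for illustration:

```python
import pandas as pd

# Toy data: each column uses a different marker for "missing".
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "admin."],
    "poutcome": ["nonexistent", "failure", "nonexistent", "success"],
    "pdays": [999, 3, 999, 999],
})

# Column name -> the value that marks a missing entry in that column.
sentinels = {"job": "unknown", "poutcome": "nonexistent", "pdays": 999}

# The mean of a boolean mask is the fraction of True values.
missing_pct = pd.Series(
    {col: (toy[col] == marker).mean() * 100 for col, marker in sentinels.items()}
)
print(missing_pct.round(1))
```

Keeping the marker per column avoids the type pitfall of comparing a numeric column against a string.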

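2.2 Handling variables with a high missing rate

Visualize pdays with a histogram and analyze it: roughly what range do the non-missing pdays values fall in?
Use a crosstab of pdays and poutcome to see how the values of the two variables relate, and draw further conclusions from the data

Plot a histogram of the non-missing part of pdays:

dfPdays = df.loc[df.pdays != 999, 'pdays']

Plot the histogram from dfPdays and, together with .value_counts(), analyze the time range in which most marketing intervals fall.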

# Plot a histogram of the non-missing pdays values
dfPdays = df.loc[df.pdays != 999, 'pdays']
plt.hist(dfPdays, bins=30, rwidth=0.8)

(array([ 15.,  26.,  61., 439., 118.,  46., 412.,  60.,  18.,   0.,  64.,
         52.,  28.,  58.,  36.,  20.,  24.,  11.,   8.,   0.,   7.,   3.,
          1.,   2.,   3.,   0.,   0.,   1.,   1.,   1.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ,
         9.9, 10.8, 11.7, 12.6, 13.5, 14.4, 15.3, 16.2, 17.1, 18. , 18.9,
        19.8, 20.7, 21.6, 22.5, 23.4, 24.3, 25.2, 26.1, 27. ]),
 <a list of 30 Patch objects>)

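Although pdays and poutcome have many missing values, the non-missing records are still informative. According to the earlier heat map analysis, the correlations of pdays (-0.31) and poutcome (-0.13) with the marketing result are stronger than those of many other variables. Despite the high missing rate, these columns are therefore not deleted and are kept as they are.

Use a crosstab to examine the relationship between pdays and poutcome. To make the table easier to read, round pdays down to a multiple of 5 so that it becomes a time interval (similar to the way ages are grouped).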

 
pdaysDf = df['pdays'].apply(lambda x: int(x / 5) * 5)  # round down to a multiple of 5
pd.crosstab(pdaysDf, df['poutcome'])  # display the crosstab

poutcome	failure	nonexistent	success
pdays
0	6	0	653
5	74	0	526
10	36	0	158
15	22	0	31
20	3	0	3
25	1	0	2
995	4110	35563	0
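
For non-negative integers, the binning step can also be done with vectorized floor division instead of apply. A minimal sketch on made-up values (the toy frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "pdays": [3, 6, 12, 999, 999],
    "poutcome": ["success", "failure", "success", "nonexistent", "failure"],
})

# Floor division rounds each value down to the start of its 5-day bin,
# so 999 lands in the bin labelled 995, as in the table above.
pdays_bin = toy["pdays"] // 5 * 5
table = pd.crosstab(pdays_bin, toy["poutcome"])
print(table)
```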

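2.3 Analyzing and handling missing values in default (credit default)

default: 20.9% of the values are missing, so the missing values are analyzed and repaired.
Requirements:

What does the value distribution of default suggest?
Describe the characteristics of the customer group whose credit default record is missing. (Take the customer-information variables one by one and visualize each against default.)
Explain the final treatment of default: why are the unknown records merged with the yes records?

Before repairing default, look at the values it takes (using value_counts()).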

df['default'].value_counts()

no         32588
unknown     8597
yes            3
Name: default, dtype: int64

Define the following function. The first parameter is the DataFrame and the second is the column to compare against default.

def defaultAsso(dataset, col):
    # Percentage of each default value within every category of col
    tab = pd.crosstab(dataset['default'], dataset[col]).apply(lambda x: x / x.sum() * 100)
    tab_pct = tab.transpose()
    x = tab_pct.index.values
    plt.figure(figsize=(14, 3))
    plt.plot(x, tab_pct['unknown'], color='green', label='unknown')
    plt.plot(x, tab_pct['yes'], color='blue', label='yes')
    plt.plot(x, tab_pct['no'], color='red', label='no')
    plt.legend()
    plt.xlabel(col)
    plt.ylabel('rate')
    plt.show()

defaultAsso(df,'job')

defaultAsso(df,'education')

defaultAsso(df,'marital')
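
The column-wise percentages that defaultAsso computes with apply can also be obtained from crosstab's normalize argument. A small sketch on made-up data (the toy frame is an assumption, and no plotting is done here):

```python
import pandas as pd

toy = pd.DataFrame({
    "default": ["no", "unknown", "no", "no", "unknown", "no"],
    "marital": ["married", "married", "single", "single", "single", "married"],
})

# normalize="columns" divides every column by its own total,
# giving the share of each default value within each marital status.
pct = pd.crosstab(toy["default"], toy["marital"], normalize="columns") * 100
print(pct.round(1))
```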

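Age has to be converted into age groups first:

def get_age_group(age):
    if age < 30:
        return 2
    elif age > 60:
        return 6
    else:
        return age // 10

df['ageGroup'] = df['age'].apply(get_age_group)
defaultAsso(df, 'ageGroup')  # compare default against the age group
# NOTE: drop() returns a new DataFrame and the result is not assigned here, so
# ageGroup stays in df (it still appears in the later df.info() output).
# Assign the result (df = df.drop('ageGroup', axis=1)) to really remove the column.
df.drop('ageGroup', axis=1)
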

	age	job	marital	education	default	housing	loan	contact	month	day_of_week	...	campaign	pdays	previous	poutcome	emp.var.rate	cons.price.idx	cons.conf.idx	euribor3m	nr.employed	y
0	56	housemaid	married	basic.4y	no	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
1	57	services	married	high.school	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
2	37	services	married	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
3	40	admin.	married	basic.6y	no	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
4	56	services	married	high.school	no	no	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

41188 rows × 21 columns
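
As an alternative to the hand-written get_age_group function, the same grouping can be sketched with pd.cut. The bin edges below are chosen to reproduce that function's behaviour; the sample ages are made up:

```python
import numpy as np
import pandas as pd

ages = pd.Series([17, 29, 30, 45, 59, 60, 61, 98])

# Left-closed bins: [0,30) -> 2, [30,40) -> 3, [40,50) -> 4,
# [50,60) -> 5, [60,inf) -> 6
age_group = pd.cut(
    ages,
    bins=[0, 30, 40, 50, 60, np.inf],
    right=False,
    labels=[2, 3, 4, 5, 6],
).astype(int)

print(age_group.tolist())
```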

Based on the analysis above, merge the unknown records of default with the yes records (use map to map unknown and yes to the same value), then check the result with value_counts().

df['default'] = df['default'].map({'unknown': 1, 'yes': 1, 'no': 0})
df['default'].value_counts()

0    32588
1     8600
Name: default, dtype: int64

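2.4 Handling variables with a very small missing rate

2.4.1 Deleting records with missing values

job and marital have only a few missing values, fewer than one percent of the records, so delete the records whose job or marital value is unknown.
After deleting, call value_counts() to confirm that the missing values are really gone. Taking the deletion for job as an example:

df.drop(df[df.job == 'unknown'].index, inplace=True, axis=0)
df.job.value_counts()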

admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
Name: job, dtype: int64

 
df.drop(df[df.marital == 'unknown'].index, inplace=True, axis=0)
df.marital.value_counts()

married     24694
single      11494
divorced     4599
Name: marital, dtype: int64
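
The two deletions can also be expressed as a single boolean filter instead of dropping by index. A minimal sketch on made-up rows (the toy frame and the names mask and clean are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services"],
    "marital": ["married", "single", "unknown"],
})

# Keep only rows where neither column holds the 'unknown' marker.
mask = (toy["job"] != "unknown") & (toy["marital"] != "unknown")
clean = toy[mask]

print(clean)
```

Filtering leaves the original frame untouched, which makes it easier to compare row counts before and after.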

pd.crosstab(df['job'],df['marital'])

marital	divorced	married	single
job
admin.	1280	5253	3875
blue-collar	728	6687	1825
entrepreneur	179	1071	203
housemaid	161	777	119
management	331	2089	501
retired	348	1274	93
self-employed	133	904	379
services	532	2294	1137
student	9	41	824
technician	774	3670	2287
unemployed	124	634	251

df['housing'].value_counts()

yes        21376
no         18427
unknown      984
Name: housing, dtype: int64

df['loan'].value_counts()

no         33620
yes         6183
unknown      984
Name: loan, dtype: int64

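2.4.2 Handling related missing values

According to the heat map, loan is most closely related to housing, followed by education, so a crosstab is used to examine the relationship between housing and loan.
Delete the records where housing is missing
Call value_counts() on housing and on loan to check that the missing values have been removed

pd.crosstab(df['housing'], df['loan'])
df.drop(df[df.housing == 'unknown'].index, inplace=True, axis=0)
df['housing'].value_counts()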

yes    21376
no     18427
Name: housing, dtype: int64

df['loan'].value_counts()

no     33620
yes     6183
Name: loan, dtype: int64

pd.crosstab(df['housing'],df['loan'])

loan	no	yes
housing
no	15897	2530
yes	17723	3653
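
The outputs above show that removing the unknown housing records also removed every unknown loan record. One way to confirm that two columns are missing on exactly the same rows is to compare their masks. The sketch below uses made-up data and hypothetical variable names:

```python
import pandas as pd

toy = pd.DataFrame({
    "housing": ["yes", "unknown", "no", "unknown"],
    "loan": ["no", "unknown", "yes", "unknown"],
})

housing_missing = toy["housing"] == "unknown"
loan_missing = toy["loan"] == "unknown"

# True only if the two columns are unknown on exactly the same rows.
same_rows = bool((housing_missing == loan_missing).all())
print(same_rows)
```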

pd.crosstab(df['job'],df['loan'])

loan	no	yes
job
admin.	8472	1709
blue-collar	7636	1365
entrepreneur	1212	205
housemaid	874	154
management	2411	439
retired	1431	240
self-employed	1182	194
services	3263	599
student	709	142
technician	5596	988
unemployed	834	148

pd.crosstab(df['housing'],df['marital'])

marital	divorced	married	single
housing
no	2086	11273	5068
yes	2392	12837	6147

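The only missing values left are those in education. Because about 1.5k records are affected, deleting them directly is not advisable; a random forest is used instead to fill in the missing values. This is done in one pass after all variables have been converted to numbers.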

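3. Converting categorical variables to numbers

Encoding categorical variables: for categorical variables to take part in model computation, they must be converted to numbers, that is, encoded. The categorical variables that have not yet been encoded (education, job, default, contact, housing, and loan) therefore need to be converted into numeric variables.
Categorical variables can be further divided into binary, ordinal, and nominal (unordered) variables, and each kind is encoded differently.

3.1 Variables with only two values

Binary encoding: in this dataset y, default, contact, housing, and loan each take only two values, so they are binary variables and can be encoded as 0 and 1. default was already converted to 0 and 1 in an earlier step.
Requirements:

Use map to map the values of y, contact, housing, and loan to 0 and 1

Use df[['y','default','contact','housing','loan']].head() to check that these variables were converted correctly: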

df['y'].value_counts()

no     35316
yes     4487
Name: y, dtype: int64

 
df['y'] = df['y'].map({'no': 0, 'yes': 1})
df['contact'] = df['contact'].map({'cellular': 0, 'telephone': 1})
df['housing'] = df['housing'].map({'no': 0, 'yes': 1})
df['loan'] = df['loan'].map({'no': 0, 'yes': 1})
df.y.value_counts()  # check the target variable; no missing values found


0    35316
1     4487
Name: y, dtype: int64

df[['y','default','contact','housing','loan']].head()

	y	default	contact	housing	loan
0	0	0	1	0	0
1	0	1	1	0	0
2	0	0	1	1	0
3	0	0	1	0	0
4	0	0	1	0	1

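3.2 Encoding ordinal variables

Looking at the values of education, it can be treated as an ordinal variable ranked by level of schooling, from lowest to highest: "illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school", "professional.course", "university.degree". These are encoded in that order as 1, 2, 3, and so on. Because of the missing values, however, unknown cannot be ranked. For convenience it is set to 0 here and corrected later.

After the conversion, call value_counts() to check that education was converted correctly.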

 
values = ["unknown", "illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school", "professional.course", "university.degree"]
# Map each level to its position in the list: unknown -> 0, illiterate -> 1, ..., university.degree -> 7
dict_levels = {v: i for i, v in enumerate(values)}
df['education'] = df['education'].map(dict_levels).astype(int)
df['education'].value_counts()


7    11821
5     9244
4     5856
6     5100
2     4002
3     2204
0     1558
1       18
Name: education, dtype: int64
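
An ordered pandas Categorical gives the same codes while making the ordering explicit. This is a sketch on made-up values, not the notebook's own method; it relies on the fact that values outside the category list receive code -1:

```python
import pandas as pd

order = ["illiterate", "basic.4y", "basic.6y", "basic.9y",
         "high.school", "professional.course", "university.degree"]

edu = pd.Series(["basic.4y", "unknown", "university.degree", "illiterate"])

# 'unknown' is not in the category list, so it becomes missing with code -1.
# Adding 1 shifts the codes to 1..7 for known levels and 0 for unknown.
cat = pd.Categorical(edu, categories=order, ordered=True)
codes = pd.Series(cat.codes) + 1

print(codes.tolist())
```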

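3.3 Converting nominal variables to dummy variables

From the variable descriptions above, job, marital, poutcome, month, and day_of_week can be treated as nominal (unordered) variables. Note that although month and day_of_week are ordered in time, they are unordered with respect to the target variable. Nominal variables can be encoded with one-hot encoding.
One-hot encoding, also called one-of-N encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time.
With one-hot encoding, each value of the variable becomes its own 0/1 column.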
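
As a small illustration on made-up data, a column with three distinct values turns into three indicator columns, and every row has exactly one of them set:

```python
import pandas as pd

toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# One new column per distinct value, named <column>_<value>.
dummies = pd.get_dummies(toy, columns=["marital"])

print(dummies)
```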

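Requirements:

Convert the nominal variables in this dataset (job, marital, poutcome, month, day_of_week) to dummy variables (one-hot encoding)
Call df.info() to see how the variables changed after the conversion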

 
df = pd.get_dummies(df, columns=['job', 'marital', 'poutcome', 'month', 'day_of_week'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39803 entries, 0 to 41187
Data columns (total 49 columns):
age                     39803 non-null int64
education               39803 non-null int64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null int64
campaign                39803 non-null int64
pdays                   39803 non-null int64
previous                39803 non-null int64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(5), int64(12), uint8(32)
memory usage: 6.7 MB

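4. Filling missing values with a random forest

The missing values of education are predicted with machine learning: the values of the other variables are used to predict the most likely value of each missing entry.
Steps:

Split the dataset into a training set and a test set. Records with a known education go into the training set; records with a missing education go into the test set. education is the prediction target (note that this differs from the dataset's overall target, which is whether the marketing succeeded)
Train a machine learning model on the training set and apply the result to the test set

Parameters:

trainX: training-set input variables
trainY: training-set target values
testX: test-set input variables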
 
from sklearn.ensemble import RandomForestClassifier

def train_predict_unknown(trainX, trainY, testX):
    # Fit a random forest on the known records and predict the missing ones
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainX, trainY)
    test_predictY = forest.predict(testX).astype(int)
    # Keep testX's index so the predictions line up with the test rows
    return pd.Series(test_predictY, index=testX.index)


 
# Records with a known education form the training set; records whose education is unknown (equal to 0) form the test set
test_data = df[df['education'] == 0].copy()   # .copy() avoids SettingWithCopyWarning when education is filled in later
train_data = df[df['education'] != 0].copy()
# education is the target; split the training set into a target and an input DataFrame
trainY = train_data['education'].astype(int)   # target: the education column
trainX = train_data.drop('education', axis=1)  # inputs: every other column
testX = test_data.drop('education', axis=1)    # inputs of the records to predict


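Predict the missing values of education with the machine learning algorithm:

test_data['education'] = train_predict_unknown(trainX, trainY, testX)

Use value_counts to look at the education values in test_data and check that every missing value has been filled in: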

test_data['education'].value_counts()

7    446
5    383
2    261
4    256
6    165
3     47
Name: education, dtype: int64
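
The whole imputation flow can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: the toy columns, the rule that ties education to age, and the fixed random_state. It is not the notebook's data or result:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "housing": rng.integers(0, 2, n),
})
# Synthetic rule (demo assumption): the label depends only on age.
toy["education"] = np.where(toy["age"] < 40, 7, 2)
# Hide the label for the first 20 rows; 0 marks "missing" as in the text.
toy.loc[toy.index[:20], "education"] = 0

known = toy[toy["education"] != 0]
unknown = toy[toy["education"] == 0]

# Train on rows with a known label, predict the hidden ones.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(known.drop(columns="education"), known["education"])
toy.loc[unknown.index, "education"] = model.predict(unknown.drop(columns="education"))

print(toy["education"].value_counts())
```

Writing the predictions back through the index of the unknown rows keeps the original row order, so no concatenation is needed in this variant.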

Merge the test set and the training set into a single table:

df = pd.concat([train_data, test_data])
df.shape

(39803, 49)

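Check that the merged education values lie between 1 and 7 (the missing-value marker 0 no longer appears), and inspect the whole table with df.head()

df['education'].value_counts()  # check the merged table rather than train_data alone
df.head()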

	age	education	default	housing	loan	contact	duration	campaign	pdays	...	month_may	day_of_week_mon
0	56	2	0	0	0	1	261	1	999	...	1	1
1	57	5	1	0	0	1	149	1	999	...	1	1
2	37	5	0	1	0	1	226	1	999	...	1	1
3	40	3	0	0	0	1	151	1	999	...	1	1
4	56	5	0	0	1	1	307	1	999	...	1	1

5 rows × 49 columns

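5. Standardizing the numeric variables

Not every algorithm needs standardized numeric variables. Some are sensitive to feature scale, such as logistic regression, support vector machines, and neural networks, whereas random forests and decision trees do not need standardization. To keep the choice of algorithm open later, all numeric variables are standardized here.
In this example every numeric variable is standardized; education, being an ordinal encoding, is standardized as well.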

 
from sklearn.preprocessing import StandardScaler

def scaleColumns(data, cols_to_scale):
    # Standardize each listed column independently to zero mean and unit variance
    scaler = StandardScaler()
    data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
    return data

df = scaleColumns(df, numberVar + ['education'])
df.head()

	age	education	default	housing	loan	contact	duration	campaign	pdays	previous	...	month_may	day_of_week_mon
0	1.539987	-1.925742	0	0	0	1	0.009489	-0.566762	0.194855	-0.349299	...	1	1
1	1.636117	-0.096859	1	0	0	1	-0.422339	-0.566762	0.194855	-0.349299	...	1	1
2	-0.286490	-0.096859	0	1	0	1	-0.125457	-0.566762	0.194855	-0.349299	...	1	1
3	0.001901	-1.316115	0	0	0	1	-0.414628	-0.566762	0.194855	-0.349299	...	1	1
4	1.539987	-0.096859	0	0	1	1	0.186846	-0.566762	0.194855	-0.349299	...	1	1

5 rows × 49 columns
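
Standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below checks on made-up numbers that StandardScaler matches the manual formula when the population standard deviation (ddof=0, NumPy's default) is used:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual z-score with the population standard deviation (ddof=0).
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
```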

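6. Feature selection

Raw data sometimes has very high dimensionality. The higher the dimension, the sparser the data along each feature, which is disastrous for most machine learning algorithms (the curse of dimensionality). When effective features cannot be picked by hand, algorithms such as PCA are needed to reduce the dimensionality so that the data can be used by statistical learning algorithms. If a small set of good features can be selected, however, dimensionality reduction such as PCA is largely unnecessary. In this experiment the features are already fairly representative and not too numerous, so no dimensionality reduction should be needed.
As noted earlier, duration (the length of the last call with the customer) is only known once the call has ended. The purpose of the marketing model is to reduce staff workload, so predicting whether a customer should be contacted after the call has already been made has no value. This variable should therefore not be an input of the prediction model.

Delete the duration column
Use shape and info to check the final number of variables and records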

 
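# NOTE: drop() returns a new DataFrame; the result is discarded here, so duration stays in df
# (see the df.info() output). Use df = df.drop(['duration'], axis=1) to actually remove it.
df.drop(['duration'], axis=1)
df.info()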

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39803 entries, 0 to 41175
Data columns (total 49 columns):
age                     39803 non-null float64
education               39803 non-null float64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null float64
campaign                39803 non-null float64
pdays                   39803 non-null float64
previous                39803 non-null float64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(11), int64(6), uint8(32)
memory usage: 6.7 MB
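
The listing above still contains duration and ageGroup because DataFrame.drop returns a new object unless its result is assigned (or inplace=True is passed). A minimal sketch of the difference, on a made-up two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({"age": [30, 40], "duration": [261, 149]})

# Result discarded: toy itself is unchanged.
toy.drop(["duration"], axis=1)
still_there = "duration" in toy.columns

# Result assigned back: the column is really gone.
toy = toy.drop(["duration"], axis=1)
gone = "duration" not in toy.columns

print(still_there, gone)
```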

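7. Saving the preprocessed data

Save the preprocessed data so that later machine learning steps can use it directly without repeating the preprocessing.
Requirements:

The samples in the original dataset are ordered by time, so they need to be shuffled into an unordered dataset to avoid overfitting during training.
Persist the dataset (save it as a .csv file); index=False means the index is not saved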

 
from sklearn.utils import shuffle

df = shuffle(df)  # randomize the row order
df.to_csv('bank-preprocess.csv', index=False)
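
For a reproducible shuffle, pandas' own sample method with a fixed seed is one option. The sketch below uses made-up data and an in-memory buffer in place of a file; the seed value 42 is an arbitrary assumption:

```python
import io
import pandas as pd

toy = pd.DataFrame({"age": [1, 2, 3, 4, 5], "y": [0, 0, 1, 0, 1]})

# frac=1 samples every row once, i.e. a full shuffle; the seed makes it repeatable.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Round trip through CSV without the index, as in the step above.
buffer = io.StringIO()
shuffled.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.equals(shuffled))
```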

Big Data Practice (3): Data Preprocessing of the Portuguese Bank Dataset