Big Data in Practice (3): Data Preprocessing for the Portuguese Bank Marketing Dataset


Experiment Objectives

Preprocess the dataset so that it can be used for subsequent machine learning. This includes handling missing values in several ways, converting variables to numeric types, filling missing values with a machine learning model, and shuffling and persisting the data.

Experiment Requirements

  1. Handle the missing values in the dataset
  2. Convert the non-numeric variables in the dataset
  3. Standardize the dataset
  4. Save the preprocessed dataset

Experiment Process


 

Variable Description

Bank client information:

  • 1 - age: age (numeric)
  • 2 - job: type of job: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'
  • 3 - marital: marital status: 'divorced', 'married', 'single', 'unknown'. Note: 'divorced' also includes widowed
  • 4 - education: education level: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown'
  • 5 - default: has credit in default? ('no', 'yes', 'unknown')
  • 6 - housing: has a housing loan? ('no', 'yes', 'unknown')
  • 7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')

    Contact-related information:

  • 8 - contact: contact communication type: 'cellular', 'telephone'
  • 9 - month: month of the last contact within the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  • 10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
  • 11 - duration: duration of the last contact, in seconds. Important note: this attribute strongly affects the target (e.g., if duration = 0 then y = 'no'). However, the duration is not known before the call is made, and once the call ends y is obviously known. This input should therefore only be kept for benchmark purposes; for a realistic predictive model it should be discarded (at prediction time the call duration is unknown).

    Other attributes:

  • 12 - campaign: number of contacts performed during this campaign for this client (numeric, includes the last contact)
  • 13 - pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
  • 14 - previous: number of contacts performed before this campaign for this client (numeric)
  • 15 - poutcome: outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')

    Social and economic context attributes:

  • 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
  • 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
  • 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  • 19 - euribor3m: 3-month Euribor rate - daily indicator (numeric)
  • 20 - nr.employed: number of employees - quarterly indicator (numeric)

    Output variable (target):

  • 21 - y: did the client subscribe to a deposit (i.e., was the marketing successful)? (binary: 'yes', 'no')
 

Data Preprocessing

 

1. Data Loading

  • Load the data and inspect it with head()
  • To make later processing easier, store the column names of the numeric and categorical variables in two separate lists:
    numberVar=['age',...] categoryVar = [ ...]

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv("bank-additional-full.csv",sep=';')
df.shape
(41188, 21)
numberVar=['age','duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
categoryVar=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']

 

2. Handling Missing Values

The dataset has 20 input features, split into numeric and categorical variables. From the earlier data inspection, the numeric variables (int64 and float64) have no missing values; the non-numeric variables may contain 'unknown' values. This section requires:

  1. Check the proportion of missing values in each variable
  2. Classify the variables that have missing values into high, medium, and low missing rates

2.1 Checking for Missing Values

  • The dataset has 20 input features, split into numeric and categorical variables.
  • df.isnull().any() shows no feature containing missing values in the NaN sense (see the quick check below).
  • In this dataset, however, missing values appear in other forms. Most categorical features use 'unknown' to mark a missing value, while poutcome uses 'nonexistent'; among the numeric variables only pdays has missing values (encoded as the number 999). This step requires printing the missing-value percentage of every categorical variable that has missing values.

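The NaN check mentioned above can be run directly on the loaded DataFrame (a minimal sketch; every column is expected to come back False):

# NaN check: True would indicate a column containing NaN values
print(df.isnull().any())
print(df.isnull().any().any())   # single overall flag, expected to be False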
Check the missing-value proportion of all categorical variables (plus the pdays variable). Unlike the demo, three values (unknown, nonexistent, 999) are all counted as missing:

 
cols = categoryVar + ['pdays']
total = df.shape[0]
for col in cols:
    v = df[col].value_counts().to_dict()
    if 'unknown' in v.keys():
        unCount = v['unknown']
    elif 'nonexistent' in v.keys():
        unCount = v['nonexistent']
    elif '999' in v.keys():
        # note: pdays stores 999 as an integer, so the string '999' never matches here;
        # pdays is therefore silently skipped in the printout below (compare with the int 999 to include it)
        unCount = v['999']
    else:
        continue
    print ("%-10s: %5.1f%%"%(col,unCount/total*100))
job       :   0.8%
marital   :   0.2%
education :   4.2%
default   :  20.9%
housing   :   2.4%
loan      :   2.4%
poutcome  :  86.3%
 

2.2 Handling Variables with a High Proportion of Missing Values


  1. Visualize the pdays variable with a histogram and analyze it: roughly what value range do the non-missing pdays fall into?
  2. Use a crosstab of pdays and poutcome to examine how the two variables relate, and draw a further conclusion from the data

Extract the non-missing part of pdays and visualize it as a histogram:

dfPdays=df.loc[df.pdays != 999, 'pdays']

Using dfPdays, plot a histogram and, together with the .value_counts() method, analyze what time range most of the previous-contact gaps fall into (see the sketch after the histogram output below).

# Plot a histogram of the non-missing pdays values
dfPdays = df.loc[df.pdays!=999,'pdays']
plt.hist(dfPdays,bins=30,rwidth=0.8)
(array([ 15.,  26.,  61., 439., 118.,  46., 412.,  60.,  18.,   0.,  64.,
         52.,  28.,  58.,  36.,  20.,  24.,  11.,   8.,   0.,   7.,   3.,
          1.,   2.,   3.,   0.,   0.,   1.,   1.,   1.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ,
         9.9, 10.8, 11.7, 12.6, 13.5, 14.4, 15.3, 16.2, 17.1, 18. , 18.9,
        19.8, 20.7, 21.6, 22.5, 23.4, 24.3, 25.2, 26.1, 27. ]),
 <a list of 30 Patch objects>)
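To back the histogram up with numbers, the same series can be summarized directly (a minimal sketch using the dfPdays series defined above); the bin counts above suggest that most non-missing pdays fall within roughly 0-15 days, with peaks around 3 and 6 days:

# Summarize the non-missing pdays values
print(dfPdays.describe())                 # min / median / max of the contact gap in days
print(dfPdays.value_counts().head(10))    # most frequent gap lengths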
 
 

Although these two variables have many missing values, the non-missing records are still informative. According to the earlier heatmap analysis, pdays (-0.31) and poutcome (-0.13) are more strongly correlated with the marketing outcome than most other variables, so despite the large number of missing values these columns are kept as they are rather than dropped.

Use a crosstab to examine the relationship between pdays and poutcome. To make it easier to read, first bin pdays into 5-day buckets (the same idea as age groups):

pdaysDf = df['pdays'].apply(lambda x: int(x /5 )*5)   # bin pdays into 5-day buckets
pd.crosstab(pdaysDf,df['poutcome'])  # display the crosstab
poutcome  failure  nonexistent  success
pdays
0               6            0      653
5              74            0      526
10             36            0      158
15             22            0       31
20              3            0        3
25              1            0        2
995          4110        35563        0
The crosstab shows that pdays = 999 occurs exactly when poutcome = 'nonexistent': both encodings mark clients who were never contacted in a previous campaign, so the two "missing" indicators are consistent with each other.

2.3 Analyzing and Handling Missing Values in default (credit default)

default: missing values account for 20.9%, so analyze and repair them.
Requirements:

  1. What does the distribution of default values suggest?
  2. Describe the characteristics of the client group whose credit-default record is missing (take each variable from the client information one by one and visualize it against default)
  3. Explain why the final treatment of default merges the unknown records with the yes records


Before repairing default, first look at the distribution of its values (using value_counts()):

df['default'].value_counts()
no         32588
unknown     8597
yes            3
Name: default, dtype: int64
 

Define the following function; the first parameter is the DataFrame and the second is the column to compare against default:

def defaultAsso(dataset, col):
    # For each category of `col`, plot the percentage of records whose default is unknown / yes / no
    tab = pd.crosstab(dataset['default'],dataset[col]).apply(lambda x: x/x.sum() * 100)
    tab_pct = tab.transpose()
    x = tab_pct.index.values
    plt.figure(figsize=(14,3))
    plt.plot(x, tab_pct['unknown'],color='green', label='unknown')
    plt.plot(x, tab_pct['yes'],color='blue', label='yes')
    plt.plot(x, tab_pct['no'],color='red', label='no')
    plt.legend() 
    plt.xlabel(col)
    plt.ylabel('rate')
    plt.show()

defaultAsso(df,'job')
defaultAsso(df,'education')
defaultAsso(df,'marital')
 

Age needs to be converted into age groups first:

def get_age_group(age):
    if age <30:
        return 2
    elif age>60:
        return 6
    else:
        return age//10
df['ageGroup'] =df['age'].apply(lambda x:get_age_group(x))  # check below that the age-group values are correct
defaultAsso(df,'ageGroup')  # compare default against the age groups
df.drop('ageGroup',axis=1)  # intended to delete the new age-group column; note that without inplace=True (or reassignment) this only returns a copy, so ageGroup remains in df (it still shows up in df.info() later)
  age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

41188 rows × 21 columns
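The comment in the cell above asks to verify the age-group values; a quick count of each group does that (a minimal sketch using the ageGroup column created above):

# Verify the age groups: only the values 2 (under 30), 3, 4, 5 and 6 (over 60) should appear
print(df['ageGroup'].value_counts().sort_index())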

 

Based on the above analysis, the data processing step merges the unknown and yes records of the default variable (using the map method to map unknown and yes to the same value). Since only 3 records are 'yes', keeping them as a separate class would be pointless, while merging them with 'unknown' keeps a single group of clients not confirmed to be default-free. Then use value_counts() to check the result of the conversion.

df['default']=df['default'].map({'unknown':1 ,'yes':1,'no':0})
df['default'].value_counts()
0    32588
1     8600
Name: default, dtype: int64

2.4 Handling Variables with Very Few Missing Values



2.4.1 Deleting Records with Missing Values


  • job and marital have only a small number of missing values, less than one percent of the records, so the records whose job or marital value is unknown are simply deleted
  • After deleting, call value_counts() to confirm that the missing values are really gone. Taking job as an example:
df.drop(df[df.job == 'unknown'].index,inplace = True,axis=0)
df.job.value_counts()
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
Name: job, dtype: int64
df.drop(df[df.marital == 'unknown'].index,inplace = True,axis=0)
df.marital.value_counts()
married     24694
single      11494
divorced     4599
Name: marital, dtype: int64
pd.crosstab(df['job'],df['marital'])
 
marital        divorced  married  single
job
admin.             1280     5253    3875
blue-collar         728     6687    1825
entrepreneur        179     1071     203
housemaid           161      777     119
management          331     2089     501
retired             348     1274      93
self-employed       133      904     379
services            532     2294    1137
student               9       41     824
technician          774     3670    2287
unemployed          124      634     251
 
df['housing'].value_counts()
yes        21376
no         18427
unknown      984
Name: housing, dtype: int64
df['loan'].value_counts()
no         33620
yes         6183
unknown      984
Name: loan, dtype: int64
 

2.4.2 Handling Correlated Missing Values


  • According to the heatmap, apart from housing, the variable most closely related to loan is education; here, first use a crosstab to examine the relationship between housing and loan.
  • Delete the records where housing is missing (housing and loan are missing on the same 984 records, so this also removes the missing loan values)
  • Call value_counts() on housing and loan to confirm that the missing values have been removed
pd.crosstab(df['housing'],df['loan'])
df.drop(df[df.housing == 'unknown'].index,inplace = True,axis=0)
df['housing'].value_counts()
 
yes    21376
no     18427
Name: housing, dtype: int64
df['loan'].value_counts()
no     33620
yes     6183
Name: loan, dtype: int64
pd.crosstab(df['housing'],df['loan'])
 
loan       no   yes
housing
no      15897  2530
yes     17723  3653
pd.crosstab(df['job'],df['loan'])
loan             no   yes
job
admin.         8472  1709
blue-collar    7636  1365
entrepreneur   1212   205
housemaid       874   154
management     2411   439
retired        1431   240
self-employed  1182   194
services       3263   599
student         709   142
technician     5596   988
unemployed      834   148
pd.crosstab(df['housing'],df['marital'])
marital  divorced  married  single
housing
no           2086    11273    5068
yes          2392    12837    6147
 

This leaves only the missing values in education. Since about 1.5k records are affected, deleting them outright is not appropriate; instead, a random forest will be used to fill them in. This is handled in one step after all variables have been converted to numeric form.

3. Converting Categorical Variables to Numeric

Encoding categorical variables: for categorical variables to take part in model computation, they must be converted to numbers, i.e., encoded. The categorical variables that have not yet been encoded (education, job, default, contact, housing, loan) therefore need to be further converted into numeric variables.
Categorical variables can be divided into binary, ordinal, and nominal (unordered) categorical variables, and the encoding method differs by kind.

3.1 Variables with Only Two Values

Encoding binary variables: in this dataset the variables y, default, contact, housing and loan each take only two values, i.e., they are binary variables, and can be encoded as 0/1. default was already converted to the numbers 0 and 1 in an earlier step.
Requirements:

  1. Use the map method to map the values of y, contact, housing and loan to the numbers 0 and 1
  2. Use df[['y','default','contact','housing','loan']].head() to confirm that these variables have been converted correctly:
 
df['y'].value_counts()
no     35316
yes     4487
Name: y, dtype: int64
df['y'] = df['y'].map({'no':0, 'yes':1})
df['contact']=df['contact'].map({'cellular':0,'telephone':1})
df['housing'] = df['housing'].map({"no":0, "yes":1})
df['loan'] = df['loan'].map({"no":0, "yes":1})
df.y.value_counts()  # check the target variable; no missing values found
0    35316
1     4487
Name: y, dtype: int64
df[['y','default','contact','housing','loan']].head()
   y  default  contact  housing  loan
0  0        0        1        0     0
1  0        1        1        0     0
2  0        0        1        1     0
3  0        0        1        0     0
4  0        0        1        0     1
 

3.2 Encoding Ordinal Categorical Variables

Looking at the values of education, the variable can be treated as ordinal according to the level of schooling, ordered as "illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school", "professional.course", "university.degree", and encoded 1, 2, 3, ... from lowest to highest. Because of the missing values, however, unknown cannot be placed in this order; for convenience it is temporarily encoded as 0 and corrected later.

After the conversion, call value_counts() to check that education was converted correctly.

values = ["unknown","illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school",  "professional.course", "university.degree"]
levels = range(0,len(values))
dict_levels = dict(zip(values, levels))
for v in values:
    df.loc[df['education'] == v, 'education'] = dict_levels[v]
df['education'].value_counts()
7    11821
5     9244
4     5856
6     5100
2     4002
3     2204
0     1558
1       18
Name: education, dtype: int64
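For reference, the same ordinal encoding can be written in one line with Series.map; this is an alternative to the loop above, not an additional step (a sketch assuming the original string-valued education column and the dict_levels mapping defined above):

# Alternative to the loop above: one-line ordinal encoding (do not run both)
df['education'] = df['education'].map(dict_levels)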
 

3.3 Converting Nominal Categorical Variables to Dummy Variables

From the variable description above, job, marital, poutcome, month and day_of_week can be treated as nominal (unordered) categorical variables. Note that although month and day_of_week are ordered in the time sense, they are unordered with respect to the target variable. Nominal categorical variables can be encoded with one-hot encoding.
One-hot encoding: also called one-of-N encoding, it uses an N-bit register to encode N states, each state getting its own bit, with exactly one bit set at any time.
How the one-hot conversion works:
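As a small illustration (a sketch with made-up values, not taken from the dataset), pd.get_dummies turns each category into its own indicator column:

# Toy illustration of one-hot encoding on a hypothetical column
toy = pd.DataFrame({'contact_type': ['cellular', 'telephone', 'cellular']})
print(pd.get_dummies(toy, columns=['contact_type']))
# two new columns, contact_type_cellular and contact_type_telephone;
# each row has exactly one of them set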

Requirements

  1. Convert the nominal categorical variables in this dataset (job, marital, poutcome, month, day_of_week) into dummy variables (one-hot encoding)
  2. Call df.info() to see how the variables have changed after the conversion
df = pd.get_dummies(df, columns = ['job','marital','poutcome','month','day_of_week'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39803 entries, 0 to 41187
Data columns (total 49 columns):
age                     39803 non-null int64
education               39803 non-null int64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null int64
campaign                39803 non-null int64
pdays                   39803 non-null int64
previous                39803 non-null int64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(5), int64(12), uint8(32)
memory usage: 6.7 MB
 

4. Filling Missing Values with a Random Forest

For the missing values of education, a machine learning model is used to predict them: the idea is to predict the most likely value of education from the values of the other variables.
Steps:

  1. Split the dataset into a training set and a test set: records without missing education go into the training set, records with missing education go into the test set, and education is the prediction target (note that this differs from the dataset's usual target, marketing success)
  2. Train the model on the training set and apply the result to the test set

Parameters:

  • trainX: training-set input variables
  • trainY: training-set target values
  • testX: test-set input variables
from sklearn.ensemble import RandomForestClassifier
def train_predict_unknown(trainX, trainY, testX):
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainX, trainY)
    test_predictY = forest.predict(testX).astype(int)
    return pd.DataFrame(test_predictY,index=testX.index)
# Records with a known education value form the training set; records with unknown education (encoded 0) form the test set
test_data = df[df['education'] == 0]   # records with education == 0 become the test set
train_data = df[df['education'] != 0]  # records with education != 0 become the training set
# Use education as the target variable and split the training set into target and input DataFrames
trainY = train_data['education']                 # put the education column into trainY
trainX = train_data.drop('education', axis=1)    # drop the education column from train_data
testX = test_data.drop('education', axis=1)      # drop the education column from test_data

Use the machine learning model to predict the missing education values:

test_data['education'] = train_predict_unknown(trainX, trainY, testX)  # note: test_data is a slice of df, so this assignment may trigger a SettingWithCopyWarning (suppressed by the filter above)

Use value_counts to inspect the education variable of test_data and check that all missing values have been filled in:

test_data['education'].value_counts()
7    446
5    383
2    261
4    256
6    165
3     47
Name: education, dtype: int64
 

Merge the test set and the training set back into a single table:

df = pd.concat([train_data, test_data])
df.shape
(39803, 49)
 

Check that after merging, the education values all lie between 1 and 7 (the missing-value code 0 no longer appears), and use df.head() to look at the whole table.

df['education'].value_counts()  # checked on the merged df so that the filled-in test records are included
df.head()
  age education default housing loan contact duration campaign pdays previous ... month_mar month_may month_nov month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed
0 56 2 0 0 0 1 261 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
1 57 5 1 0 0 1 149 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
2 37 5 0 1 0 1 226 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
3 40 3 0 0 0 1 151 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
4 56 5 0 0 1 1 307 1 999 0 ... 0 1 0 0 0 0 1 0 0 0

5 rows × 49 columns

 

5. Standardizing the Numeric Variables

Not every algorithm requires standardized numeric variables. Some algorithms are sensitive to the scale of the variables, e.g. logistic regression, support vector machines and neural networks, while random forests and decision trees do not need standardization. To keep the later choice of algorithm open, all numeric variables are standardized here.
In this case all numeric variables need to be standardized; since education is an ordinal numeric variable, it is standardized as well.

from sklearn.preprocessing import StandardScaler
def scaleColumns(data, cols_to_scale):
    # Standardize each listed column to zero mean and unit variance
    scaler = StandardScaler()
    idx = data.index.values
    for col in cols_to_scale:
        x = scaler.fit_transform(pd.DataFrame(data[col]))
        data[col] = pd.DataFrame(x,columns=['col'],index=idx)
    return data
df = scaleColumns(df,numberVar+['education'])
df.head()
  age education default housing loan contact duration campaign pdays previous ... month_mar month_may month_nov month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed
0 1.539987 -1.925742 0 0 0 1 0.009489 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
1 1.636117 -0.096859 1 0 0 1 -0.422339 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
2 -0.286490 -0.096859 0 1 0 1 -0.125457 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
3 0.001901 -1.316115 0 0 0 1 -0.414628 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
4 1.539987 -0.096859 0 0 1 1 0.186846 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0

5 rows × 49 columns
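For reference, the same standardization can also be done in a single call instead of column by column (a sketch assuming the df, numberVar list and StandardScaler import used above); this is an alternative to scaleColumns, not an additional step:

# One-shot alternative to scaleColumns (do not run both)
cols = numberVar + ['education']
df[cols] = StandardScaler().fit_transform(df[cols])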

 

6. Feature Selection


In some situations the raw data has a very high dimensionality; the higher the dimension, the sparser the data is along each feature axis, which is disastrous for most machine learning algorithms (the curse of dimensionality). When effective features cannot be hand-picked, algorithms such as PCA are needed to reduce the dimensionality so that the data can be used by statistical learning methods. If a small set of informative features can be selected, however, dimensionality reduction is largely unnecessary. In this experiment the features in the dataset are already fairly representative and not too numerous, so no dimensionality reduction should be needed.
As discussed earlier, duration (the length of the last call with the client) is only known once the call has ended. The point of the campaign model is to reduce the staff's workload, and predicting whether a client should be contacted after the call has already taken place is of no value. This variable should therefore not be used as an input to the predictive model.

  1. Drop the duration column
  2. Use the shape and info methods to inspect the final number of variables and records in the dataset
df.drop(['duration'],axis=1)  # note: without inplace=True (or reassignment) the column is not actually removed, which is why duration still appears in df.info() below
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39803 entries, 0 to 41175
Data columns (total 49 columns):
age                     39803 non-null float64
education               39803 non-null float64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null float64
campaign                39803 non-null float64
pdays                   39803 non-null float64
previous                39803 non-null float64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(11), int64(6), uint8(32)
memory usage: 6.7 MB

7. Saving the Preprocessed Data

Save the preprocessed data so that later machine learning steps can use it directly without repeating the preprocessing.
Requirements:

  1. The samples in the original dataset are ordered by time, so shuffle them into a random order to keep this ordering from biasing later training and evaluation
  2. Persist the dataset (save it as a .csv file); index=False means the index is not saved
from sklearn.utils import shuffle
df = shuffle(df)
df.to_csv('bank-preprocess.csv',index=False)
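As a quick sanity check, the saved file can be reloaded and its shape compared with the in-memory DataFrame (a minimal sketch):

# Reload the persisted file and confirm the shape matches df.shape
df_check = pd.read_csv('bank-preprocess.csv')
print(df_check.shape)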

