Big Data Practice (1): A Simple Look at and Exploration of the Portuguese Bank Dataset

This article is a repost (2020-04-14 17:30).

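Experiment

Objective: take a simple first look at the Portuguese bank dataset and explore it

Time to complete: 1 hour (experiment), 0.5 hour (lab report)

Requirements:

Inspect the basic properties of the data.
Examine the values taken by every categorical variable and visualize them.
Examine the distribution of every numeric variable and visualize it.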

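Project background

This project builds on telephone marketing data from a Portuguese bank and uses data analysis and machine learning to optimize the marketing strategy and improve marketing efficiency. Data analysis helps us understand customer needs; machine learning then predicts, from customer attributes and the socio-economic conditions at the time, whether a customer is likely to buy a savings product.

The dataset covers the direct telephone marketing campaigns of a Portuguese banking institution from May 2008 to November 2010, which aimed to promote term deposits among existing customers. It is publicly available in the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#. This is real business data with a little over 40,000 records, each having 21 attributes (20 inputs plus the target). The main task is classification: predicting whether a customer will accept the offer (if so, the y column is 'yes').
The repository contains an older and a newer version of the dataset; this example uses bank-additional-full.csv.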

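Variable descriptions

Bank client information:

1 - age: age (numeric)
2 - job: type of job. Administrative ('admin.'), blue-collar ('blue-collar'), entrepreneur ('entrepreneur'), housemaid ('housemaid'), management ('management'), retired ('retired'), self-employed ('self-employed'), services ('services'), student ('student'), technician ('technician'), unemployed ('unemployed'), unknown ('unknown')
3 - marital: marital status. Divorced ('divorced'), married ('married'), single ('single'), unknown ('unknown'). Note: 'divorced' also covers widowed
4 - education: education level. Basic 4 years ('basic.4y'), basic 6 years ('basic.6y'), basic 9 years ('basic.9y'), high school ('high.school'), illiterate ('illiterate'), professional course ('professional.course'), university degree ('university.degree'), unknown ('unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has a housing loan? (categorical: 'no','yes','unknown')
7 - loan: has a personal loan? (categorical: 'no','yes','unknown')
Contact-related information:
8 - contact: contact communication type. Mobile phone ('cellular') or landline ('telephone')
9 - month: month of the last contact in the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: day of the week of the last contact (categorical: 'mon','tue','wed','thu','fri')
11 - duration: duration of the last contact, in seconds (numeric). Important: this attribute strongly affects the output target (for example, if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore be included only for benchmarking and should be discarded if the goal is a realistic predictive model (the call duration is unknown at prediction time).
Other attributes:
12 - campaign: number of contacts made to this client during this campaign (numeric, includes the last contact)
13 - pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14 - previous: number of contacts made to this client before this campaign (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
Social and economic context attributes:
16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
17 - cons.price.idx: consumer price index, monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
20 - nr.employed: number of employees, quarterly indicator (numeric)
Output variable (target):
21 - y: has the client subscribed to a term deposit, i.e. was the marketing successful? (binary: 'yes','no')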
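
The notes on duration and pdays above translate into two small preprocessing steps. The following is a minimal sketch on a few made-up rows, not the original author's code: the `toy` frame, the `contacted_before` column, and the `features` name are all illustrative assumptions.

```python
import pandas as pd

# Toy rows invented for illustration; the real data comes from bank-additional-full.csv.
toy = pd.DataFrame({
    "duration": [261, 0, 149],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})

# Record whether the client was contacted in an earlier campaign (hypothetical helper column).
toy["contacted_before"] = toy["pdays"] != 999

# Turn the 999 sentinel into a missing value so it does not distort numeric summaries.
toy["pdays"] = toy["pdays"].where(toy["contacted_before"])

# Drop duration (unknown before the call) and the target to get realistic model inputs.
features = toy.drop(columns=["duration", "y"])
print(features)
```

Keeping a separate flag preserves the "never contacted" information after the sentinel is replaced; whether that is the right encoding depends on the model used later.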

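Main goal: improve the effectiveness of the bank's telephone marketing campaigns

The project gives the bank a finer-grained understanding of its customer base, predicts how customers respond to telephone marketing, and builds target customer profiles for future marketing plans.
By analyzing customer characteristics together with the socio-economic conditions at the time, the bank can predict customers' saving behavior and identify which types of customers are more likely to make a term deposit. It can then focus its marketing effort on those customers. This lets the bank win deposits more efficiently and can also raise customer satisfaction by reducing unwanted advertising to customers who are unlikely to be interested.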

1. Import libraries

import numpy as np
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
# IPython/Jupyter magic: render figures inline (remove when running as a plain script)
%matplotlib inline

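2. Overall inspection of the data

2.1 Inspect the basic properties of the data

After loading the data, start with a basic inspection: the number of samples (rows), the number of variables (columns), and which variables are numeric versus categorical (in pandas, categorical columns read from a CSV usually appear with the object dtype).

Use the shape attribute to check the number of rows and columns, and confirm that it matches the sample size and variable count described above.
Use the info() method to list all variable names and their types, and identify which ones are numeric.
Use the head() method to get a rough view of the data.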

 
df = pd.read_csv("bank-additional-full.csv", sep=';')
df.shape

df.head()

(41188, 21)

	age	job	marital	education	default	housing	loan	contact	month	day_of_week	...	campaign	pdays	poutcome	emp.var.rate	cons.price.idx	cons.conf.idx	euribor3m	nr.employed	y
0	56	housemaid	married	basic.4y	no	no	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
1	57	services	married	high.school	unknown	no	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
2	37	services	married	high.school	no	yes	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
3	40	admin.	married	basic.6y	no	no	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
4	56	services	married	high.school	no	no	yes	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no

5 rows × 21 columns

 
df = pd.read_csv("bank-additional-full.csv", sep=';')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB

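2.2 Record the categorical and numeric variables

Based on the info() output, separate the categorical variables from the numeric ones and record them.

At this point, decide which variables are numeric and which are categorical, and store them in two lists as the basis for the rest of the experiment. From the output above, there are 10 numeric variables (5 int64 and 5 float64) and 11 categorical variables (object).
Because the dataset has many features, and variables of the same kind are processed in similar ways, placing them in separate lists makes it easy to handle each kind uniformly in a loop later on.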

numberVar=['age','duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
categoryVar=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']
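
As a cross-check on the hand-written lists, the same split can be derived from the dtypes. This is a minimal sketch on a made-up frame, assuming that every non-numeric column is categorical (which holds for this dataset according to the info() output); `toy`, `number_cols`, and `category_cols` are illustrative names.

```python
import pandas as pd

# Toy frame invented for illustration, mimicking a few columns of the real data.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "euribor3m": [4.857, 4.857, 4.857],
    "y": ["no", "no", "no"],
})

# Numeric columns (integers and floats) versus everything else.
number_cols = toy.select_dtypes(include="number").columns.tolist()
category_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(number_cols)
print(category_cols)
```

On the full dataset the two resulting lists should match numberVar and categoryVar; a mismatch would point to a typo in the manual lists.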

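3. Examine the values of the categorical variables

Analyze each of the categorical variables listed above in turn.

3.1 Inspect the values of the categorical variables

Use the value_counts() method to view the values of any one variable and how often each occurs, and check whether they agree with the data description above.
Use the unique() method to quickly list the values of all categorical variables. This must be done with a for loop; the reference output is shown below: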

 
var = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']
for i in var:
    print(i, " :  ", df[i].unique())
                 

job  :   ['housemaid' 'services' 'admin.' 'blue-collar' 'technician' 'retired'
 'management' 'unemployed' 'self-employed' 'unknown' 'entrepreneur'
 'student']
marital  :   ['married' 'single' 'divorced' 'unknown']
education  :   ['basic.4y' 'high.school' 'basic.6y' 'basic.9y' 'professional.course'
 'unknown' 'university.degree' 'illiterate']
default  :   ['no' 'unknown' 'yes']
housing  :   ['no' 'yes' 'unknown']
loan  :   ['no' 'yes' 'unknown']
contact  :   ['telephone' 'cellular']
month  :   ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'mar' 'apr' 'sep']
day_of_week  :   ['mon' 'tue' 'wed' 'thu' 'fri']
poutcome  :   ['nonexistent' 'failure' 'success']
y  :   ['no' 'yes']
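
The text also mentions value_counts(), which the loop above does not use. Here is a minimal sketch on made-up values, assuming a single categorical column; `toy`, `counts`, and `shares` are illustrative names.

```python
import pandas as pd

# Toy column invented for illustration; on the real data use df['marital'].
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced", "married", "unknown"],
})

# Absolute frequency of each category, most common first.
counts = toy["marital"].value_counts()

# Relative frequency (fractions that sum to 1).
shares = toy["marital"].value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

Unlike unique(), this shows how rare a category is, which matters for values such as 'unknown'.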

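3.2 Use bar charts to view the distribution of categorical values

Draw a bar chart of the value distribution for every categorical variable.
Use a loop to draw the charts for all categorical variables.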

 
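for j in categoryVar:
    counts = df[j].value_counts()  # category frequencies, most common first
    plt.figure(figsize=(14, 4))
    # Pass x and y by keyword; recent seaborn releases no longer accept them positionally
    sns.barplot(x=counts.index, y=counts.values)
    plt.title(j)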
                 

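4. Examine the values of the numeric variables

4.1 Quick overview of the numeric distributions

The describe() method gives a quick summary of the distribution of the selected numeric variables in the dataset.

mean: the mean
std: the standard deviation
min, max: the minimum and maximum
25%: the first quartile (Q1), the value at the 25% position when all values in the sample are sorted in ascending order
50%: the median, the value at the 50% position when all values in the sample are sorted in ascending order
75%: the third quartile (Q3), the value at the 75% position when all values in the sample are sorted in ascending order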

df[['duration']].describe()

	duration
count	41188.000000
mean	258.285010
std	259.279249
min	0.000000
25%	102.000000
50%	180.000000
75%	319.000000
max	4918.000000

df['age'].isnull().sum()
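
The null check above can be extended to every column at once. Since missing categorical values in this dataset are coded as the label 'unknown' rather than as nulls, it also helps to count that label. A minimal sketch on made-up rows; `toy`, `null_counts`, and `unknown_counts` are illustrative names.

```python
import pandas as pd

# Toy rows invented for illustration; on the real data use df instead of toy.
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "default": ["no", "unknown", "no", "no"],
    "housing": ["no", "no", "yes", "unknown"],
})

# True nulls per column (all zero here, as in the info() output above).
null_counts = toy.isnull().sum()

# Occurrences of the 'unknown' label in the non-numeric columns.
text_cols = toy.select_dtypes(exclude="number")
unknown_counts = (text_cols == "unknown").sum()

print(null_counts)
print(unknown_counts)
```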

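4.2 Visualizing the numeric distributions

Further visualization helps us examine and analyze how the numeric variables are distributed.

Box plot

A box plot shows the spread of the data at a glance. A short box means the values are concentrated; a long box means they are more dispersed. Points drawn beyond the upper and lower whiskers indicate whether the variable has many outliers.

Take the age distribution as an example, read together with the describe() output for age (obtained in the same way as for duration above):

The line in the middle of the box marks the median, not the mean. Running describe() on age gives a mean age of about 40.
The first quartile (lower quartile) and the third quartile (upper quartile) correspond to the bottom and the top of the box respectively; from describe() they are 32 and 47. The middle half of the customers fall in this age range. The gap between the two quartiles is small, so ages are fairly concentrated.
The plot shows many outliers beyond the upper whisker (older customers).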
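
The outliers in a box plot follow from the quartiles. The sketch below assumes the common 1.5 x IQR whisker rule (the matplotlib and seaborn default) and uses made-up ages, so the numbers are illustrative only; `ages`, `lower`, `upper`, and `outliers` are hypothetical names.

```python
import pandas as pd

# Toy ages invented for illustration; on the real data use df['age'].
ages = pd.Series([25, 30, 32, 35, 38, 41, 47, 50, 58, 95])

q1 = ages.quantile(0.25)   # bottom of the box
q3 = ages.quantile(0.75)   # top of the box
iqr = q3 - q1              # height of the box

# Values outside these fences are drawn as individual outlier points.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(q1, q3, iqr)
print(outliers.tolist())
```

Applying the same computation to df['age'] gives the number of customers the box plot flags as unusually old.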

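Histogram

A histogram shows how the data are distributed across value ranges.

In the age histogram, the number of customers drops sharply after age 60, which is probably related to 60 being a common retirement age.
Customers are heavily concentrated in the 30-40 age range.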

for i in numberVar:
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(13, 4))
    # Left panel: box plot; passing the column as y draws the box vertically
    sns.boxplot(y=i, data=df, ax=ax1)
    ax1.set_ylabel(i, fontsize=15)
    ax1.set_title(i, fontsize=15)
    # Right panel: histogram with 30 bins
    ax2.hist(df[i], bins=30)
    ax2.set_title(i, fontsize=15)
    plt.tight_layout()
    plt.show()