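Background
To prevent customer churn, the marketing department of a telecom company collected customer data that is already labeled with churn status. The goal is to analyze the churned customers and identify which customers are likely to churn.
Understanding the data
Collecting the data
This dataset describes whether telecom customers churned, along with related information. It contains 7,043 records and 21 fields, described below: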
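- customerID: customer ID.
- gender: gender (Female or Male).
- SeniorCitizen: whether the customer is a senior citizen (1 = yes, 0 = no).
- Partner: whether the customer has a partner (Yes or No).
- Dependents: whether the customer has dependents (Yes or No).
- tenure: months with the company (0-72).
- PhoneService: whether phone service is enabled (Yes or No).
- MultipleLines: whether multiple lines are enabled (Yes, No, or No phone service).
- InternetService: type of internet service (No, DSL, or Fiber optic).
- OnlineSecurity: whether online security is enabled (Yes, No, or No internet service).
- OnlineBackup: whether online backup is enabled (Yes, No, or No internet service).
- DeviceProtection: whether device protection is enabled (Yes, No, or No internet service).
- TechSupport: whether technical support is enabled (Yes, No, or No internet service).
- StreamingTV: whether streaming TV is enabled (Yes, No, or No internet service).
- StreamingMovies: whether streaming movies is enabled (Yes, No, or No internet service).
- Contract: contract type (month-to-month, one year, or two year).
- PaperlessBilling: whether paperless billing is enabled (Yes or No).
- PaymentMethod: payment method (bank transfer, credit card, electronic check, or mailed check).
- MonthlyCharges: monthly charges.
- TotalCharges: total charges.
- Churn: whether the customer churned (Yes or No).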
Importing the data
```python
import pandas as pd
df = pd.read_csv(r"D:\PycharmProjects\ku_pandas\WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head(5)  # show the first n rows; with no argument, df.head() shows the first 5 rows
```
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |

5 rows × 21 columns
## Inspecting the data
```python
df.shape  # dimensions of the data as (rows, columns)
```
```code
(7043, 21)
```
```python
df.dtypes  # data type of each column
```
```code
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object
```
```python
df.columns  # all column names
```
```code
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
```
```python
df.columns.tolist()  # convert the column index to a Python list with tolist()
```
```code
['customerID',
'gender',
'SeniorCitizen',
'Partner',
'Dependents',
'tenure',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'MonthlyCharges',
'TotalCharges',
'Churn']
```
```python
type(df.columns.tolist())
```
```code
list
```
```python
df.columns.values  # column names as a NumPy array
```
```code
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
'TotalCharges', 'Churn'], dtype=object)
```
```python
df.isnull().sum().values.sum()  # total number of missing (NaN) values across all columns
```
```code
0
```
```python
df.nunique()  # number of distinct values in each column
```
```code
customerID 7043
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 73
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1585
TotalCharges 6531
Churn 2
dtype: int64
```
```python
df.describe()  # summary statistics for the numeric columns
```
| | SeniorCitizen | tenure | MonthlyCharges |
---|---|---|---
count | 7043.000000 | 7043.000000 | 7043.000000
mean | 0.162147 | 32.371149 | 64.761692
std | 0.368612 | 24.559481 | 30.090047
min | 0.000000 | 0.000000 | 18.250000
25% | 0.000000 | 9.000000 | 35.500000
50% | 0.000000 | 29.000000 | 70.350000
75% | 0.000000 | 55.000000 | 89.850000
max | 1.000000 | 72.000000 | 118.750000
```python
df.info()  # index, column data types, and memory usage
```
```code
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
```
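Data cleaning and normalization
1. Simplifying attribute values
- In InternetService, replace DSL and Fiber optic with Yes
- In MultipleLines, replace No phone service with No
- In SeniorCitizen, change 1 to Yes and 0 to No
- In Churn, relabel No as 非流失客戶 (retained customer) and Yes as 流失客戶 (churned customer)
- Convert TotalCharges to a numeric type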
```python
# Replace "No internet service" with "No" in the add-on service columns
replace_list = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for i in replace_list:
    df[i] = df[i].str.replace('No internet service', 'No')
# In InternetService, replace "Fiber optic" and "DSL" with "Yes"
df['InternetService'] = df['InternetService'].str.replace('Fiber optic', 'Yes')
df['InternetService'] = df['InternetService'].str.replace('DSL', 'Yes')
# In MultipleLines, replace "No phone service" with "No"
df['MultipleLines'] = df['MultipleLines'].str.replace('No phone service', 'No')
# In SeniorCitizen, change 1 to Yes and 0 to No
df.SeniorCitizen = df.SeniorCitizen.replace({0: 'No', 1: 'Yes'})
# In Churn, relabel No as '非流失客戶' (retained customer) and Yes as '流失客戶' (churned customer)
df.Churn = df.Churn.replace({'No': '非流失客戶', 'Yes': '流失客戶'})
# Convert TotalCharges to a numeric type; errors="coerce" sets unparseable values to NaN
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="coerce")
df.TotalCharges.dtypes
```
```code
dtype('float64')
```
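- The str.replace() and replace() functions
s.str.replace(old, new) replaces matching substrings inside each string of a Series, whereas s.replace(1, 'one') replaces every value equal to 1 with 'one'.
- The pd.to_numeric() function
pd.to_numeric(arg, errors='coerce', downcast=None) converts the argument to a numeric type.
- arg: list, tuple, 1-d array, or Series
- errors="coerce" means that values that cannot be parsed are set to NaN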
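The difference between the two replacement methods, and the effect of errors="coerce", can be seen on a few made-up values. This is only a minimal sketch: the Series below are illustrative and are not rows from the dataset. It also shows that a blank string is not counted by isnull() until it has been coerced to NaN, which is one plausible reason a missing-value check on raw text columns can report 0.

```python
import pandas as pd

# Illustrative values only (not rows from the dataset).
services = pd.Series(["Yes", "No", "No internet service"])
charges = pd.Series(["29.85", " ", "108.15"])

# Series.str.replace works on substrings inside each string.
simplified = services.str.replace("No internet service", "No", regex=False)

# Series.replace swaps whole values that match exactly.
flags = pd.Series([0, 1, 0]).replace({0: "No", 1: "Yes"})

# A blank string is an ordinary object value, so isnull() does not flag it.
missing_before = int(charges.isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(charges, errors="coerce")
missing_after = int(numeric.isnull().sum())

print(simplified.tolist())
print(flags.tolist())
print(missing_before, missing_after)
```

Checking isnull() again after the conversion is a cheap way to find out how many values were silently coerced.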
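2. Binning continuous numeric data
Start with tenure (months with the company). To choose the bin edges, we need to know the column's minimum and maximum values.
```python
df.tenure.describe()
```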
```code
count 7043.000000
mean 32.371149
std 24.559481
min 0.000000
25% 9.000000
50% 29.000000
75% 55.000000
max 72.000000
Name: tenure, dtype: float64
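```
```python
# Bin tenure into half-year groups (年 = years).
# With right=True the first bin is (0, 6], so rows with tenure == 0 get NaN.
bins_t = [0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72]
level_t = ['0.5年', '1年', '1.5年', '2年', '2.5年', '3年', '3.5年', '4年', '4.5年', '5年', '5.5年', '6年']
df['tenure_group'] = pd.cut(df.tenure, bins=bins_t, labels=level_t, right=True)
df.head(5)
```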
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | No | Yes | No | 1 | No | No | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 非流失客戶 | 0.5年 |
| 1 | 5575-GNVDE | Male | No | No | No | 34 | Yes | No | Yes | Yes | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | 非流失客戶 | 3年 |
| 2 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 流失客戶 | 0.5年 |
| 3 | 7795-CFOCW | Male | No | No | No | 45 | No | No | Yes | Yes | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 非流失客戶 | 4年 |
| 4 | 9237-HQITU | Female | No | No | No | 2 | Yes | No | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 流失客戶 | 0.5年 |

5 rows × 22 columns
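* **The pd.cut() function**
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3,
include_lowest=False)
The signature above lists seven parameters. pd.cut sorts values into discrete intervals: when bins is an integer, the range from the minimum to the maximum of x is split into that many equal-width intervals; when bins is a sequence, its values are used as the interval edges.
* x : the one-dimensional array to bin.
* bins : the number of bins (an integer) or a sequence of bin edges. Values of x that fall outside the edges become NaN.
* right : boolean; whether each interval is closed on the right. With True, the intervals are closed on the right, for example (0, 6].
* labels : array or False, default None; labels for the resulting bins. Its length must equal the number of bins. With labels=False, integer bin codes are returned instead of labels.
* retbins : boolean, optional; if True, the bin edges are returned along with the result.
* precision : integer; the precision at which the bin labels are stored and displayed.
* include_lowest : boolean; whether the first interval includes its left edge.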
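The parameters described above are easiest to compare on a handful of values that sit exactly on the bin edges. The following is a minimal sketch; the values, edges, and labels are hypothetical and chosen only to make the edge behavior visible.

```python
import pandas as pd

# Illustrative tenure values (months), chosen to hit the bin edges.
tenure = pd.Series([0, 1, 6, 7, 72])
edges = [0, 6, 12, 72]
labels = ["0-6", "7-12", "13-72"]  # hypothetical labels for this sketch

# right=True gives (0, 6], (6, 12], (12, 72]; the value 0 falls outside.
default_bins = pd.cut(tenure, bins=edges, labels=labels, right=True)

# include_lowest=True closes the first interval on the left, so 0 is kept.
inclusive_bins = pd.cut(tenure, bins=edges, labels=labels,
                        right=True, include_lowest=True)

# An integer bins value asks for equal-width intervals; labels=False returns
# integer bin codes, and retbins=True also returns the edges that were used.
codes, used_edges = pd.cut(tenure, bins=3, labels=False, retbins=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
print(codes.tolist(), used_edges)
```

Whether a value of exactly 0 should be kept in the first bin is a modeling choice; the sketch only shows how the two settings differ.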
```python
df.MonthlyCharges.describe()
```
```code
count 7043.000000
mean 64.761692
std 30.090047
min 18.250000
25% 35.500000
50% 70.350000
75% 89.850000
max 118.750000
Name: MonthlyCharges, dtype: float64
```
```python
# Bin the monthly charges; each label is the upper edge of its interval
bins_M = [0, 20, 40, 60, 80, 100, 120]
level_M = ['20', '40', '60', '80', '100', '120']
df['MonthlyCharges_group'] = pd.cut(df.MonthlyCharges, bins=bins_M, labels=level_M, right=True)
df.head(5)
```
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group | MonthlyCharges_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | No | Yes | No | 1 | No | No | Yes | No | ... | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 非流失客戶 | 0.5年 | 40 |
| 1 | 5575-GNVDE | Male | No | No | No | 34 | Yes | No | Yes | Yes | ... | No | No | One year | No | Mailed check | 56.95 | 1889.50 | 非流失客戶 | 3年 | 60 |
| 2 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | Yes | Yes | ... | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 流失客戶 | 0.5年 | 60 |
| 3 | 7795-CFOCW | Male | No | No | No | 45 | No | No | Yes | Yes | ... | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 非流失客戶 | 4年 | 60 |
| 4 | 9237-HQITU | Female | No | No | No | 2 | Yes | No | Yes | No | ... | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 流失客戶 | 0.5年 | 80 |

5 rows × 23 columns
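Both the numeric conversion with errors="coerce" and the binning with pd.cut can leave NaN behind, so it is worth counting NaN per column before removing any rows. The sketch below uses a made-up three-row frame, and its labels "0.5y" and "6y" are hypothetical; it is not the article's data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "tenure": [0, 5, 40],
    "TotalCharges": [np.nan, 100.0, 2000.0],
})
# tenure == 0 falls outside (0, 6] and therefore gets NaN.
toy["tenure_group"] = pd.cut(toy["tenure"], bins=[0, 6, 72], labels=["0.5y", "6y"])

# Count NaN per column to see where rows would be lost.
nan_per_column = toy.isnull().sum()

# Rows that contain at least one NaN.
rows_with_nan = toy[toy.isnull().any(axis=1)]

# dropna() removes every row that has a NaN in any column.
cleaned = toy.dropna()

print(nan_per_column.to_dict())
print(len(toy), "->", len(cleaned))
```

Looking at rows_with_nan first makes it clear whether the same rows are affected in several columns or whether dropping them removes more data than expected.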
```python
df.dropna(inplace=True)  # only a few rows have missing values, so drop them
df.isnull().sum()
```
```code
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
tenure_group 0
MonthlyCharges_group 0
dtype: int64
```
```python
df.Churn.value_counts()
```
```code
非流失客戶 5163
流失客戶 1869
Name: Churn, dtype: int64
```
Data visualization
Computing the overall churn rate
```python
# Visualize the data and compute the overall churn rate
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df_Churn = df[df['Churn'] == '流失客戶']  # rows for churned customers
Rate_Churn = df_Churn.shape[0] / df['Churn'].shape[0]
print('Overall churn rate = {:.2%}'.format(Rate_Churn))
```
```code
Overall churn rate = 26.58%
```
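- Note: in a Jupyter notebook, do not leave out %matplotlib inline. It is an IPython magic command that makes figures render inline in the notebook, and it is not valid in a plain Python script.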
```python
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
fig = plt.figure(num=1, figsize=(5, 5))
plt.pie(df['Churn'].value_counts(), autopct="%.2f%%", colors=['grey', 'lightcoral'])
plt.title('Proportion of Customer Churn')
# Label order follows value_counts(): retained customers first, then churned
plt.legend(labels=['非流失客戶', '流失客戶'], loc='best')
```
```code
<matplotlib.legend.Legend at 0x2191a4cc8c8>
```
- Note: to display Chinese text, set plt.rcParams['font.sans-serif'] = ['SimHei'], or use another installed font that includes Chinese glyphs.
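Question 1: What characterizes churned customers?
Group the metrics into three categories: customer profile metrics, product metrics, and billing metrics.
- Customer profile metrics
  - Demographics: 'gender', 'SeniorCitizen', 'Partner', 'Dependents'
  - Customer activity: 'tenure'
- Product metrics
  - Phone services: 'PhoneService', 'MultipleLines'
  - Internet services: 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
- Billing metrics
  - Revenue metrics: 'MonthlyCharges', 'TotalCharges'
  - Revenue-related metrics: 'Contract', 'PaperlessBilling', 'PaymentMethod'
The overall churn rate serves as the baseline against which the churn rate of each dimension is compared in the later analysis.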
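Comparing each dimension against the overall churn rate could look roughly like the sketch below. It is an assumption about how the comparison might be done, not the article's actual follow-up code; the six rows are made up, and only the column names and the Churn labels mirror the ones used above.

```python
import pandas as pd

# Toy data; the Churn labels mirror the ones used in this article.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["流失客戶", "非流失客戶", "非流失客戶",
              "非流失客戶", "流失客戶", "流失客戶"],
})

# True for churned customers; the mean of a boolean Series is a rate.
is_churned = toy["Churn"] == "流失客戶"
overall_rate = is_churned.mean()

# Churn rate within each category of one dimension.
rate_by_contract = is_churned.groupby(toy["Contract"]).mean()

# Positive values mark groups that churn more than the overall baseline.
diff_from_overall = rate_by_contract - overall_rate

print(f"overall: {overall_rate:.2%}")
print(diff_from_overall.sort_values(ascending=False))
```

The same pattern can be repeated for any categorical column, including the binned tenure_group and MonthlyCharges_group columns.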
To be continued
Reference article: Telecom Customer Churn Data Analysis