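Background
To prevent customer churn, the marketing department of a telecom company collected user data that has already been labeled for churn. The goal is to analyze the churned users and identify which users are likely to churn.
Understanding the data
Collecting the data
The dataset describes whether each telecom user churned, along with related attributes. It contains 7,043 records and 21 fields, described below: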
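- customerID: user ID
- gender: gender (Female or Male)
- SeniorCitizen: whether the user is a senior citizen (1 = yes, 0 = no)
- Partner: whether the user has a partner (Yes or No)
- Dependents: whether the user has dependents (Yes or No)
- tenure: time with the company (0-72 months)
- PhoneService: whether phone service is enabled (Yes or No)
- MultipleLines: whether multiple lines are enabled (Yes, No, or No phone service)
- InternetService: internet service type (No, DSL, or Fiber optic)
- OnlineSecurity: whether online security is enabled (Yes, No, or No internet service)
- OnlineBackup: whether online backup is enabled (Yes, No, or No internet service)
- DeviceProtection: whether device protection is enabled (Yes, No, or No internet service)
- TechSupport: whether tech support is enabled (Yes, No, or No internet service)
- StreamingTV: whether streaming TV is enabled (Yes, No, or No internet service)
- StreamingMovies: whether streaming movies is enabled (Yes, No, or No internet service)
- Contract: contract type (Month-to-month, One year, or Two year)
- PaperlessBilling: whether paperless billing is enabled (Yes or No)
- PaymentMethod: payment method (bank transfer, credit card, electronic check, mailed check)
- MonthlyCharges: monthly charges
- TotalCharges: total charges
- Churn: whether the user churned (Yes or No)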
Importing the data
```python
import pandas as pd

df = pd.read_csv(r"D:\PycharmProjects\ku_pandas\WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head(5)  # show the first n rows; with no argument, head() returns the first 5 rows
```
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |

5 rows × 21 columns
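
As a quick check of how `head()` behaves, the sketch below builds a small toy frame (the column name `value` is hypothetical and not part of the dataset) and compares the default call with an explicit row count.

```python
import pandas as pd

# Toy frame with 8 rows; the column name is made up for illustration.
toy = pd.DataFrame({"value": range(8)})

default_head = toy.head()   # no argument: the first 5 rows
first_three = toy.head(3)   # explicit n: the first 3 rows

print(len(default_head), len(first_three))
```

Calling `head()` without an argument therefore never prints the whole frame unless it has five rows or fewer.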
## Inspecting the data
```python
df.shape  # dimensions of the data as (rows, columns)
```
```
(7043, 21)
```
```python
df.dtypes  # data type of each column
```
```
customerID          object
gender              object
SeniorCitizen       int64
Partner             object
Dependents          object
tenure              int64
PhoneService        object
MultipleLines       object
InternetService     object
OnlineSecurity      object
OnlineBackup        object
DeviceProtection    object
TechSupport         object
StreamingTV         object
StreamingMovies     object
Contract            object
PaperlessBilling    object
PaymentMethod       object
MonthlyCharges      float64
TotalCharges        object
Churn               object
dtype: object
```
```python
df.columns  # all column names
```
```
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
```
```python
df.columns.tolist()  # convert the column index to a Python list with tolist()
```
```
['customerID',
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges',
 'Churn']
```
```python
type(df.columns.tolist())
```
```
list
```
```python
df.columns.values  # column names as a NumPy array
```
```
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)
```
```python
df.isnull().sum().values.sum()  # total number of missing (NaN) values across all columns
```
```
0
```
```python
df.nunique()  # number of distinct values in each column
```
```
customerID          7043
gender              2
SeniorCitizen       2
Partner             2
Dependents          2
tenure              73
PhoneService        2
MultipleLines       3
InternetService     3
OnlineSecurity      3
OnlineBackup        3
DeviceProtection    3
TechSupport         3
StreamingTV         3
StreamingMovies     3
Contract            3
PaperlessBilling    2
PaymentMethod       4
MonthlyCharges      1585
TotalCharges        6531
Churn               2
dtype: int64
```
```python
df.describe()  # summary statistics for the numeric columns
```

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.000000 | 7043.000000 | 7043.000000 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.000000 | 0.000000 | 18.250000 |
| 25% | 0.000000 | 9.000000 | 35.500000 |
| 50% | 0.000000 | 29.000000 | 70.350000 |
| 75% | 0.000000 | 55.000000 | 89.850000 |
| max | 1.000000 | 72.000000 | 118.750000 |
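
The summary covers only three columns because TotalCharges is stored as `object`, so `describe()` skips it. Likewise, `isnull()` does not count blank strings as missing. The sketch below uses a toy column (the values are illustrative, not taken from the file) to show one way to count blank entries in a string column. That the real TotalCharges column contains such blanks is an assumption here, suggested by the later cleaning step that coerces unparseable values to NaN.

```python
import pandas as pd

# Toy stand-in for an object-typed charges column; values are illustrative.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", ""]})

# isnull() only sees strings here, so it reports no missing values.
n_null = int(toy["TotalCharges"].isnull().sum())

# Count entries that are empty once surrounding whitespace is stripped.
n_blank = int(toy["TotalCharges"].str.strip().eq("").sum())

print(n_null, n_blank)
```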
```python
df.info()  # index, column dtypes, non-null counts, and memory usage
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
```
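Data cleaning and standardization
1. Simplify attribute values
- In InternetService, replace DSL and Fiber optic with Yes
- In OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies, replace No internet service with No
- In MultipleLines, replace No phone service with No
- In SeniorCitizen, change 1 to Yes and 0 to No
- In Churn, change Yes to 流失客户 (churned customer) and No to 非流失客户 (retained customer)
- Convert TotalCharges to a numeric type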
```python
# In the six add-on service columns, replace 'No internet service' with 'No'
replace_list = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for col in replace_list:
    df[col] = df[col].str.replace('No internet service', 'No', regex=False)
# In InternetService, replace 'Fiber optic' and 'DSL' with 'Yes'
df['InternetService'] = df['InternetService'].str.replace('Fiber optic', 'Yes', regex=False)
df['InternetService'] = df['InternetService'].str.replace('DSL', 'Yes', regex=False)
# In MultipleLines, replace 'No phone service' with 'No'
df['MultipleLines'] = df['MultipleLines'].str.replace('No phone service', 'No', regex=False)
# In SeniorCitizen, change 1 to 'Yes' and 0 to 'No'
df.SeniorCitizen = df.SeniorCitizen.replace({0: 'No', 1: 'Yes'})
# In Churn, change 'No' to '非流失客户' (retained customer) and 'Yes' to '流失客户' (churned customer)
df.Churn = df.Churn.replace({'No': '非流失客户', 'Yes': '流失客户'})
# Convert TotalCharges to numeric; errors="coerce" sets values that cannot be parsed to NaN
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="coerce")
df.TotalCharges.dtypes
```
```
dtype('float64')
```
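- Series.str.replace() and Series.replace()
  s.str.replace(old, new) substitutes a substring inside each string value, whereas s.replace(1, 'one') replaces every value equal to 1 with 'one'.
- pd.to_numeric()
  pd.to_numeric(arg, errors='raise', downcast=None) converts its argument to a numeric type.
  - arg: list, tuple, 1-d array, or Series
  - errors='coerce' sets values that cannot be parsed to NaN instead of raising an error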
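
The difference between these calls is easier to see on toy data. The sketch below uses small made-up Series (not rows from the dataset) to contrast substring replacement, whole-value replacement, and numeric conversion with coercion.

```python
import pandas as pd

s = pd.Series(["No internet service", "Yes", "No"])

# Series.str.replace substitutes a substring inside each string.
by_substring = s.str.replace("No internet service", "No", regex=False)

# Series.replace swaps whole values according to a mapping.
flags = pd.Series([0, 1, 0]).replace({0: "No", 1: "Yes"})

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
charges = pd.to_numeric(pd.Series(["29.85", " ", "1889.5"]), errors="coerce")

print(by_substring.tolist())
print(flags.tolist())
print(int(charges.isna().sum()))
```

After coercion, the blank entry shows up as NaN and can be handled with `dropna()` or `fillna()`.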
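2. Bin the continuous numeric columns
Start with tenure (time with the company). Choosing the bin edges requires knowing the column's minimum and maximum values.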
```python
df.tenure.describe()
```
```
count    7043.000000
mean       32.371149
std        24.559481
min         0.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64
```
```python
# Bin tenure into half-year groups (labels are in years; 年 = years)
# With right=True the intervals are (0, 6], (6, 12], ...; tenure == 0 falls outside them and becomes NaN
bins_t = [0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72]
level_t = ['0.5年', '1年', '1.5年', '2年', '2.5年', '3年', '3.5年', '4年', '4.5年', '5年', '5.5年', '6年']
df['tenure_group'] = pd.cut(df.tenure, bins=bins_t, labels=level_t, right=True)
df.head(5)
```

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | No | Yes | No | 1 | No | No | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 非流失客户 | 0.5年 |
| 1 | 5575-GNVDE | Male | No | No | No | 34 | Yes | No | Yes | Yes | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | 非流失客户 | 3年 |
| 2 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 流失客户 | 0.5年 |
| 3 | 7795-CFOCW | Male | No | No | No | 45 | No | No | Yes | Yes | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 非流失客户 | 4年 |
| 4 | 9237-HQITU | Female | No | No | No | 2 | Yes | No | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 流失客户 | 0.5年 |

5 rows × 22 columns
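* **The pd.cut() function**
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3,
include_lowest=False)
pd.cut assigns each value to a discrete interval (bin). The signature shown here lists seven parameters:
* x : the one-dimensional array to bin
* bins : either an integer, giving the number of equal-width bins over the range of x, or a sequence of bin edges (values that fall outside the edges become NaN)
* right : boolean; whether each bin includes its right edge. With True the intervals are closed on the right
* labels : array or False, default None; labels for the resulting bins, whose length must equal the number of bins. With False, only integer bin indicators are returned
* retbins : boolean, optional; whether to also return the bin edges. With True they are returned
* precision : integer; the number of decimal places used when storing and displaying the bin labels
* include_lowest : boolean; whether the first interval also includes its left edge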
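
The interaction between `right` and `include_lowest` matters for the tenure column, whose minimum is 0. The sketch below uses a toy Series and simplified edges and labels (both hypothetical, chosen only for illustration) to show that a value equal to the lowest edge is left unbinned unless `include_lowest=True` is passed.

```python
import pandas as pd

tenure = pd.Series([0, 1, 6, 7, 72])
edges = [0, 6, 12, 72]
labels = ["0-6", "7-12", "13-72"]  # illustrative labels, not the article's

# Default: intervals are (0, 6], (6, 12], (12, 72]; the value 0 falls outside.
default_bins = pd.cut(tenure, bins=edges, labels=labels, right=True)

# include_lowest=True closes the first interval on the left: [0, 6].
with_lowest = pd.cut(tenure, bins=edges, labels=labels, right=True,
                     include_lowest=True)

print(default_bins.isna().tolist())
print(with_lowest.tolist())
```

Whether to keep or drop the zero-tenure rows is an analysis decision; this only shows where the NaN values in the binned column come from.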
```python
df.MonthlyCharges.describe()
```
```
count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64
```
```python
# Bin monthly charges into groups of width 20; each label is the upper edge of its bin
bins_M = [0, 20, 40, 60, 80, 100, 120]
level_M = ['20', '40', '60', '80', '100', '120']
df['MonthlyCharges_group'] = pd.cut(df.MonthlyCharges, bins=bins_M, labels=level_M, right=True)
df.head(5)
```

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group | MonthlyCharges_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | No | Yes | No | 1 | No | No | Yes | No | ... | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 非流失客户 | 0.5年 | 40 |
| 1 | 5575-GNVDE | Male | No | No | No | 34 | Yes | No | Yes | Yes | ... | No | No | One year | No | Mailed check | 56.95 | 1889.50 | 非流失客户 | 3年 | 60 |
| 2 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | Yes | Yes | ... | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 流失客户 | 0.5年 | 60 |
| 3 | 7795-CFOCW | Male | No | No | No | 45 | No | No | Yes | Yes | ... | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 非流失客户 | 4年 | 60 |
| 4 | 9237-HQITU | Female | No | No | No | 2 | Yes | No | Yes | No | ... | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 流失客户 | 0.5年 | 80 |

5 rows × 23 columns
```python
# Only a few rows contain NaN (introduced by the TotalCharges conversion and the tenure binning), so drop them
df.dropna(inplace=True)
df.isnull().sum()
```
```
customerID              0
gender                  0
SeniorCitizen           0
Partner                 0
Dependents              0
tenure                  0
PhoneService            0
MultipleLines           0
InternetService         0
OnlineSecurity          0
OnlineBackup            0
DeviceProtection        0
TechSupport             0
StreamingTV             0
StreamingMovies         0
Contract                0
PaperlessBilling        0
PaymentMethod           0
MonthlyCharges          0
TotalCharges            0
Churn                   0
tenure_group            0
MonthlyCharges_group    0
dtype: int64
```
```python
df.Churn.value_counts()
```
```
非流失客户    5163
流失客户     1869
Name: Churn, dtype: int64
```
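
The two class counts above already determine the overall churn rate. As a sketch, the block below recomputes it from those counts and then shows, on a four-row toy Series (made up for illustration), how `value_counts(normalize=True)` or a boolean mean gives the same kind of proportion directly.

```python
import pandas as pd

# Overall churn rate from the counts reported above: 1869 churned, 5163 retained.
overall = 1869 / (5163 + 1869)
print(f"{overall:.2%}")

# Toy labels: '流失客户' = churned customer, '非流失客户' = retained customer.
churn = pd.Series(["流失客户", "非流失客户", "非流失客户", "非流失客户"])

shares = churn.value_counts(normalize=True)   # proportion of each class
churn_rate = (churn == "流失客户").mean()      # mean of a boolean mask
print(shares)
print(f"{churn_rate:.2%}")
```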
Data visualization
Computing the overall churn rate
```python
# Data visualization: compute the overall churn rate
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df_Churn = df[df['Churn'] == '流失客户']  # rows for churned customers
Rate_Churn = df_Churn.shape[0] / df.shape[0]
# The printed text reads "Computed overall churn rate = ..."
print('经计算,整体流失率={:.2%}'.format(Rate_Churn))
```
```
经计算,整体流失率=26.58%
```
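- Note: %matplotlib inline is an IPython magic command that makes figures render inline in a Jupyter notebook. It is not valid syntax in a plain Python script.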
```python
plt.rcParams['font.sans-serif'] = ['SimHei']  # use a font that can render the Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly with this font
%matplotlib inline
fig = plt.figure(num=1, figsize=(5, 5))
plt.pie(df['Churn'].value_counts(), autopct="%.2f%%", colors=['grey', 'lightcoral'])
plt.title('Proportion of Customer Churn')
# Legend order follows value_counts(): retained customers first, then churned customers
plt.legend(labels=['非流失客户', '流失客户'], loc='best')
```
```
<matplotlib.legend.Legend at 0x2191a4cc8c8>
```
- Note: to display Chinese text in the figure, set plt.rcParams['font.sans-serif'] to a font that contains Chinese glyphs, such as ['SimHei'].
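Question 1: What are the characteristics of churned users?
The fields are grouped into three kinds of indicators: user profile, products consumed, and billing information.
- User profile indicators
  - Demographics: 'gender', 'SeniorCitizen', 'Partner', 'Dependents'
  - User activity: 'tenure'
- Product indicators
  - Phone services: 'PhoneService', 'MultipleLines'
  - Internet services: 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
- Billing information indicators
  - Revenue: 'MonthlyCharges', 'TotalCharges'
  - Revenue-related: 'Contract', 'PaperlessBilling', 'PaymentMethod'
The overall churn rate serves as the baseline against which the churn rate of each dimension is compared in the analysis that follows.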
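
One possible way to make that comparison, sketched on a toy frame whose values are made up for illustration (the grouping by Contract is just an example dimension, not the article's stated next step): compute the churn rate within each category and subtract the overall rate.

```python
import pandas as pd

# Toy data; '流失客户' = churned customer, '非流失客户' = retained customer.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["流失客户", "非流失客户", "非流失客户", "非流失客户"],
})

is_churn = toy["Churn"].eq("流失客户")
overall_rate = is_churn.mean()

# Churn rate per category, then the gap to the overall baseline.
rate_by_contract = is_churn.groupby(toy["Contract"]).mean()
gap = rate_by_contract - overall_rate

print(gap)
```

Categories with a positive gap churn more than the baseline, and those with a negative gap churn less.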
To be continued.
Reference article: 电信客户流失数据分析 (Telecom Customer Churn Data Analysis)