Telecom Customer Churn Analysis with a Kaggle Dataset


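Analysis Background

To prevent customer churn, the marketing department of a telecom company collected customer data that is already labeled with churn status. The task is to analyze the churned customers and identify which customers are likely to churn.

Understanding the Data

Collecting the Data

This dataset describes whether telecom customers churned, together with related information about them. It contains 7,043 records and 21 fields, described below: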

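  • customerID: customer ID.
  • gender: gender (Female or Male).
  • SeniorCitizen: whether the customer is a senior citizen (1 = yes, 0 = no).
  • Partner: whether the customer has a partner (Yes or No).
  • Dependents: whether the customer has dependents (Yes or No).
  • tenure: number of months the customer has stayed with the company (0-72).
  • PhoneService: whether the customer has phone service (Yes or No).
  • MultipleLines: whether the customer has multiple lines (Yes, No, or No phone service).
  • InternetService: type of internet service (No, DSL, or Fiber optic).
  • OnlineSecurity: whether the customer has online security (Yes, No, or No internet service).
  • OnlineBackup: whether the customer has online backup (Yes, No, or No internet service).
  • DeviceProtection: whether the customer has device protection (Yes, No, or No internet service).
  • TechSupport: whether the customer has tech support (Yes, No, or No internet service).
  • StreamingTV: whether the customer has streaming TV (Yes, No, or No internet service).
  • StreamingMovies: whether the customer has streaming movies (Yes, No, or No internet service).
  • Contract: contract type (month-to-month, one year, or two year).
  • PaperlessBilling: whether the customer uses paperless billing (Yes or No).
  • PaymentMethod: payment method (bank transfer, credit card, electronic check, or mailed check).
  • MonthlyCharges: monthly charges.
  • TotalCharges: total charges.
  • Churn: whether the customer churned (Yes or No).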

Importing the Data

```python
import pandas as pd

df = pd.read_csv(r"D:\PycharmProjects\ku_pandas\WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head(5)  # show the first n rows; with no argument, df.head() shows the first 5 rows
```

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |

5 rows × 21 columns

## Inspecting the Data

```python
df.shape  # show the dimensions of the data as (rows, columns)
```

```
(7043, 21)
```

```python
df.dtypes  # show the data type of each column
```

```
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object
```

```python
df.columns  # show all column names
```

```
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
```

```python
df.columns.tolist()  # convert the column index to a plain Python list
```

```
['customerID',
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges',
 'Churn']
```

```python
type(df.columns.tolist())
```

```
list
```

```python
df.columns.values  # get the column names as a NumPy array
```

```
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)
```

```python
df.isnull().sum().values.sum()  # count the missing (NaN) values across all columns
```

```
0
```

```python
df.nunique()  # count the distinct values in each column
```

```
customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64
```

```python
df.describe()  # summary statistics for the numeric columns
```

|  SeniorCitizen  |  tenure  |  MonthlyCharges  
---|---|---|---  
count  |  7043.000000  |  7043.000000  |  7043.000000  
mean  |  0.162147  |  32.371149  |  64.761692  
std  |  0.368612  |  24.559481  |  30.090047  
min  |  0.000000  |  0.000000  |  18.250000  
25%  |  0.000000  |  9.000000  |  35.500000  
50%  |  0.000000  |  29.000000  |  70.350000  
75%  |  0.000000  |  55.000000  |  89.850000  
max  |  1.000000  |  72.000000  |  118.750000

```python
df.info()  # show the index, column data types, and memory usage
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
```
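A missing-value count of zero can be misleading when a column that should be numeric, such as TotalCharges here, is stored as `object`. The sketch below uses a made-up toy frame and assumes the hidden gaps are whitespace-only strings; the source only shows that the column was parsed as text, so treat that as an assumption.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200"],  # " " is a blank string, not NaN
})

# isnull() only detects real NaN/None values, so the blank string is not counted.
n_null = int(toy["TotalCharges"].isnull().sum())

# Counting whitespace-only strings reveals the hidden missing value.
n_blank = int(toy["TotalCharges"].str.strip().eq("").sum())

print(n_null, n_blank)
```

Running the same two counts on the real `df['TotalCharges']` would show whether blanks explain the `object` dtype.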

Data Cleaning and Standardization

1. Simplifying attribute values

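  • In InternetService, replace DSL and Fiber optic with Yes
  • In the six internet add-on columns (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies), replace No internet service with No
  • In MultipleLines, replace No phone service with No
  • In SeniorCitizen, change 1 to Yes and 0 to No
  • In Churn, change Yes to 流失客戶 (churned customer) and No to 非流失客戶 (non-churned customer)
  • Convert TotalCharges to a numeric type

```python
# In the six internet add-on columns, replace 'No internet service' with 'No'
replace_list = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for col in replace_list:
    df[col] = df[col].str.replace('No internet service', 'No')
# In InternetService, replace 'Fiber optic' and 'DSL' with 'Yes'
df['InternetService'] = df['InternetService'].str.replace('Fiber optic', 'Yes')
df['InternetService'] = df['InternetService'].str.replace('DSL', 'Yes')
# In MultipleLines, replace 'No phone service' with 'No'
df['MultipleLines'] = df['MultipleLines'].str.replace('No phone service', 'No')
# In SeniorCitizen, change 1 to 'Yes' and 0 to 'No'
df.SeniorCitizen = df.SeniorCitizen.replace({0: 'No', 1: 'Yes'})
# In Churn, change 'No' to '非流失客戶' (non-churned customer) and 'Yes' to '流失客戶' (churned customer)
df.Churn = df.Churn.replace({'No': '非流失客戶', 'Yes': '流失客戶'})
# Convert TotalCharges to numeric; errors="coerce" sets values that cannot be parsed to NaN
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="coerce")
df.TotalCharges.dtypes
```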

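```
dtype('float64')
```

  • str.replace() and replace()
    Series.str.replace(pat, repl) replaces matching substrings inside each string, whereas s.replace(1, 'one') replaces every value equal to 1 with 'one'
  • pd.to_numeric()
    pd.to_numeric(arg, errors='coerce', downcast=None) converts the argument to a numeric type
    • arg: list, tuple, 1-d array, or Series
    • errors='coerce' means that values which cannot be parsed are set to NaN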
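The difference between these three calls is easiest to see on small inputs. The following is a minimal sketch on made-up Series (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# str.replace works on substrings inside each string value.
services = pd.Series(["No internet service", "Yes", "No"])
cleaned = services.str.replace("No internet service", "No", regex=False)

# Series.replace swaps whole values, here through a mapping.
flags = pd.Series([0, 1, 0])
mapped = flags.replace({0: "No", 1: "Yes"})

# errors="coerce" turns anything that cannot be parsed (the blank) into NaN.
charges = pd.Series(["29.85", " ", "1889.5"])
numeric = pd.to_numeric(charges, errors="coerce")

print(cleaned.tolist())
print(mapped.tolist())
print(numeric.isna().tolist())
```

Passing `regex=False` makes the literal-string intent explicit, which avoids surprises if the pattern ever contains regex metacharacters.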

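2. Binning continuous numeric data

Start with tenure (months with the company). Binning requires knowing the minimum and maximum of the column so that the bin edges can be chosen.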

```python
df.tenure.describe()
```

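```
count    7043.000000
mean       32.371149
std        24.559481
min         0.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64
```

```python
# Bin tenure into half-year groups (labels: 年 = years)
bins_t = [0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72]
level_t = ['0.5年', '1年', '1.5年', '2年', '2.5年', '3年', '3.5年', '4年', '4.5年', '5年', '5.5年', '6年']
# With right=True the first interval is (0, 6], so tenure == 0 falls outside every bin
# and becomes NaN; pass include_lowest=True to keep such rows.
df['tenure_group'] = pd.cut(df.tenure, bins=bins_t, labels=level_t, right=True)
df.head(5)
```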

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | No | Yes | No | 1 | No | No | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 非流失客戶 | 0.5年 |
| 1 | 5575-GNVDE | Male | No | No | No | 34 | Yes | No | Yes | Yes | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | 非流失客戶 | 3年 |
| 2 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 流失客戶 | 0.5年 |
| 3 | 7795-CFOCW | Male | No | No | No | 45 | No | No | Yes | Yes | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 非流失客戶 | 4年 |
| 4 | 9237-HQITU | Female | No | No | No | 2 | Yes | No | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 流失客戶 | 0.5年 |

5 rows × 22 columns

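  * **pd.cut()**   
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3,
include_lowest=False)  
pd.cut sorts values into discrete intervals (bins). When bins is an integer, the range of x is split into that many equal-width intervals. The parameters shown are:

    * x : the one-dimensional array to bin 
    * bins : the number of bins (an integer) or a sequence of bin edges; values that fall outside the edges become NaN 
    * right : boolean; whether each interval is closed on the right. With True the intervals are right-closed, e.g. (0, 6] 
    * labels : array or False, default None; labels for the resulting bins, whose length must equal the number of bins. With False only the integer bin codes are returned 
    * retbins : boolean, optional; whether to also return the bin edges 
    * precision : integer; the number of decimal places used when displaying the bin edges 
    * include_lowest : boolean; whether the first interval is also closed on the left 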
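The effect of `right` and `include_lowest` shows up at the bin edges. Here is a small sketch using invented values and hypothetical edges and labels (chosen for illustration, not the ones used in this analysis):

```python
import pandas as pd

tenure = pd.Series([0, 1, 6, 7, 72])
edges = [0, 6, 12, 72]              # hypothetical edges for the demo
labels = ["0-6", "7-12", "13-72"]   # hypothetical labels for the demo

# right=True gives (0, 6], (6, 12], (12, 72]; the value 0 lies outside all of them.
default_bins = pd.cut(tenure, bins=edges, labels=labels, right=True)

# include_lowest=True closes the first interval on the left: [0, 6].
closed_bins = pd.cut(tenure, bins=edges, labels=labels, right=True, include_lowest=True)

print(default_bins.isna().tolist())
print(closed_bins.astype(str).tolist())
```

In the first result only the value 0 is NaN; in the second it lands in the first bin together with 1 and 6.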

```python
df.MonthlyCharges.describe()
```

```
count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64
```

```python
# Bin the monthly charges; each label is the upper edge of its bin
bins_M = [0, 20, 40, 60, 80, 100, 120]
level_M = ['20', '40', '60', '80', '100', '120']
df['MonthlyCharges_group'] = pd.cut(df.MonthlyCharges, bins=bins_M, labels=level_M, right=True)
df.head(5)
```

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group | MonthlyCharges_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | No | Yes | No | 1 | No | No | Yes | No | ... | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 非流失客戶 | 0.5年 | 40 |
| 1 | 5575-GNVDE | Male | No | No | No | 34 | Yes | No | Yes | Yes | ... | No | No | One year | No | Mailed check | 56.95 | 1889.50 | 非流失客戶 | 3年 | 60 |
| 2 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | Yes | Yes | ... | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 流失客戶 | 0.5年 | 60 |
| 3 | 7795-CFOCW | Male | No | No | No | 45 | No | No | Yes | Yes | ... | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 非流失客戶 | 4年 | 60 |
| 4 | 9237-HQITU | Female | No | No | No | 2 | Yes | No | Yes | No | ... | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 流失客戶 | 0.5年 | 80 |

5 rows × 23 columns

```python
df.dropna(inplace=True)  # only a few values are missing, so drop those rows
df.isnull().sum()
```

```
customerID              0
gender                  0
SeniorCitizen           0
Partner                 0
Dependents              0
tenure                  0
PhoneService            0
MultipleLines           0
InternetService         0
OnlineSecurity          0
OnlineBackup            0
DeviceProtection        0
TechSupport             0
StreamingTV             0
StreamingMovies         0
Contract                0
PaperlessBilling        0
PaymentMethod           0
MonthlyCharges          0
TotalCharges            0
Churn                   0
tenure_group            0
MonthlyCharges_group    0
dtype: int64
```

```python
df.Churn.value_counts()
```

```
非流失客戶    5163
流失客戶     1869
Name: Churn, dtype: int64
```
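The two counts sum to 7,032, which is 11 fewer than the original 7,043 rows, so `dropna` removed 11 rows. A quick way to keep track of what `dropna` discards is to record the row count and the per-column NaN counts before dropping. This is a sketch on an invented toy frame, not the real data:

```python
import pandas as pd

# Toy frame with one missing TotalCharges value (values are made up).
toy = pd.DataFrame({
    "tenure": [0, 5, 40],
    "TotalCharges": [float("nan"), 100.0, 2000.0],
})

rows_before = len(toy)
missing_per_column = toy.isnull().sum()  # which columns hold the NaN values

cleaned = toy.dropna()                   # returns a copy; the original is unchanged
rows_dropped = rows_before - len(cleaned)

print(missing_per_column.to_dict(), rows_dropped)
```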

Data Visualization

Computing the overall churn rate

```python
# Visualize the data and compute the overall churn rate
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df_Churn = df[df['Churn'] == '流失客戶']  # '流失客戶' = churned customer
Rate_Churn = df_Churn.shape[0] / df['Churn'].shape[0]
print('Overall churn rate = {:.2%}'.format(Rate_Churn))
```

```
Overall churn rate = 26.58%
```

  • Note: `%matplotlib inline` is an IPython magic that renders plots inside the notebook; it is not valid in a plain Python script.

```python
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
fig = plt.figure(num=1, figsize=(5, 5))
plt.pie(df['Churn'].value_counts(), autopct="%.2f%%", colors=['grey', 'lightcoral'])
plt.title('Proportion of Customer Churn')
# Legend order follows value_counts(): non-churned first, then churned
plt.legend(labels=['非流失客戶', '流失客戶'], loc='best')
```

```
<matplotlib.legend.Legend at 0x2191a4cc8c8>
```

[Figure: pie chart "Proportion of Customer Churn"]

  • Note: to display the Chinese labels, set plt.rcParams['font.sans-serif']=['SimHei'] (this requires the SimHei font to be installed)

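Question 1: What are the characteristics of churned customers?


Group the metrics into three categories: customer profile metrics, product metrics, and billing metrics.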

  1. Customer profile metrics
  • Demographics: 'gender', 'SeniorCitizen', 'Partner', 'Dependents'
  • Customer activity: 'tenure'
  2. Product metrics
  • Phone services: 'PhoneService', 'MultipleLines'
  • Internet services: 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
  3. Billing metrics
  • Revenue: 'MonthlyCharges', 'TotalCharges'
  • Revenue-related: 'Contract', 'PaperlessBilling', 'PaymentMethod'

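The overall churn rate serves as the baseline against which the churn rate of each dimension is compared in the analysis that follows.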
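One way such a comparison could be set up is to compute the churn rate per category and subtract the overall rate. The sketch below is an assumption about how the later analysis might proceed, not the author's code; it uses an invented toy frame with the original Yes/No labels.

```python
import pandas as pd

# Toy data (made up); the real analysis would use the cleaned df instead.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

churned = toy["Churn"].eq("Yes")   # boolean flag per customer
overall_rate = churned.mean()      # baseline: share of churned customers

# Churn rate within each contract type, and its gap to the baseline.
rate_by_contract = churned.groupby(toy["Contract"]).mean()
gap_to_baseline = rate_by_contract - overall_rate

print(round(overall_rate, 4))
print(gap_to_baseline.round(4).to_dict())
```

Categories with a positive gap churn more than average and are candidates for closer inspection.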


To be continued.
Reference article: 電信客戶流失數據分析 (Telecom Customer Churn Data Analysis)
