Cohort Analysis Using Python


Cohort Analysis是將某一個時期內的用戶划分為一個cohort,並將多個cohort進行時間上的某個屬性的比較的一種分析方法。Cohort Analysis在有些場景下非常有用。比如一個網站或App,在某個連續的4周里陸續更新或新增了一個功能或設計,想要知道這些功能和設計上的改動對用戶的影響,就可以將每周的新注冊作為一個cohort,觀察這4個cohort在接下來的一段實際里的行為數據,就可以很清楚地觀察到4個改動的影響。

最近要做Cohort Analysis,數據都在數據庫里,就直接想用Python接數據庫直接分析了。Google到一篇講得很清楚的文章,學着用Python實現一遍。示例數據從這里下載。

1. 讀取數據

據說 from ... import ... 的方式在性能上有弊端,一般都推薦直接 import。示例數據是典型的購物數據,按客戶的第一次消費時間將客戶分為不同的cohort。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
df = pd.read_excel('relay-foods.xlsx', sheet='Purchase Data')
df.head()
OrderId OrderDate UserId TotalCharges CommonId PupId PickupDate
0 262 2009-01-11 47 50.67 TRQKD 2 2009-01-12
1 278 2009-01-20 47 26.60 4HH2S 3 2009-01-20
2 294 2009-02-03 47 38.71 3TRDC 2 2009-02-04
3 301 2009-02-06 47 53.38 NGAZJ 2 2009-02-09
4 302 2009-02-06 47 14.28 FFYHD 2 2009-02-09

2. 確定 OrderDate 的月份,根據 OrderDate 分群

分別創建兩個字段 OrderPeriodCohortGroupOrderPeriod是購買日期的月份,CohortGroup是根據購買日期對 UserId 的分群。

df['OrderPeriod'] = df.OrderDate.map(lambda x: x.strftime('%Y-%m'))

df = df.assign(OrderPeriod = df.OrderDate.map(lambda x: x.strftime('%Y-%m'))) \
       .set_index('UserId')

# goupby(level=0): level 是index的level, 對於multiIndex, 可用level=0或1指定根據那個index來group
df = df.assign(CohortGroup = df.groupby(level=0).OrderDate.min().apply(lambda x: x.strftime('%Y-%m'))) \
       .reset_index()

df.head()
UserId OrderId OrderDate TotalCharges CommonId PupId PickupDate OrderPeriod CohortGroup
0 47 262 2009-01-11 50.67 TRQKD 2 2009-01-12 2009-01 2009-01
1 47 278 2009-01-20 26.60 4HH2S 3 2009-01-20 2009-01 2009-01
2 47 294 2009-02-03 38.71 3TRDC 2 2009-02-04 2009-02 2009-01
3 47 301 2009-02-06 53.38 NGAZJ 2 2009-02-09 2009-02 2009-01
4 47 302 2009-02-06 14.28 FFYHD 2 2009-02-09 2009-02 2009-01

3. 計算每個CohortGroup在各個OrderPeriod的用戶量

# pd.Series.nunique --> Return number of unique elements in the object.
cohorts = df.groupby(['CohortGroup', 'OrderPeriod']) \
            .agg({'UserId': pd.Series.nunique,
                 'OrderId': pd.Series.nunique,
                 'TotalCharges': 'sum'})
cohorts.rename(columns={'UserId': 'TotalUsers', 'OrderId': 'TotalOrders'}, inplace=True)
cohorts.head()
TotalCharges TotalOrders TotalUsers
CohortGroup OrderPeriod
2009-01 2009-01 1850.255 30 22
2009-02 1351.065 25 8
2009-03 1357.360 26 10
2009-04 1604.500 28 9
2009-05 1575.625 26 10

4. 標記每個CohortGroup的Cohort時期

比如,對於2009-01 cohort,其第一個時期是 2009-01,第二到第五個時期為 2009-02,...,2009-05。這里需要將每個CohortGroup的OrderPeriod對應到其第1,2,...個時期。

def cohort_period(df):
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.head()
TotalCharges TotalOrders TotalUsers CohortPeriod
CohortGroup OrderPeriod
2009-01 2009-01 1850.255 30 22 1
2009-02 1351.065 25 8 2
2009-03 1357.360 26 10 3
2009-04 1604.500 28 9 4
2009-05 1575.625 26 10 5

上面 level=0 實際上就是對 group by CohortGroup,然后對每個group結果 apply cohort_periodgroupby 后的結果是這樣的:

[(k, v) for k, v in cohorts.head(5).groupby(level=0)]
[('2009-01',
                           TotalCharges  TotalOrders  TotalUsers  CohortPeriod
  CohortGroup OrderPeriod                                                     
  2009-01     2009-01          1850.255           30          22             1
              2009-02          1351.065           25           8             2
              2009-03          1357.360           26          10             3
              2009-04          1604.500           28           9             4
              2009-05          1575.625           26          10             5)]

5. 確定分群后的結果正確

x = df[(df.CohortGroup=='2009-01') & (df.OrderPeriod=='2009-01')]
y = cohorts.ix[('2009-01', '2009-01')]

assert(x.UserId.nunique()==y.TotalUsers)
assert(x.OrderId.nunique()==y.TotalOrders)
assert(x.TotalCharges.sum()==y.TotalCharges)

x = df[(df.CohortGroup=='2009-03') & (df.OrderPeriod=='2009-05')]
y = cohorts.ix[('2009-03', '2009-05')]

assert(x.UserId.nunique()==y.TotalUsers)
assert(x.OrderId.nunique()==y.TotalOrders)
assert(x.TotalCharges.sum()==y.TotalCharges)

每個CohortGroup的留存

6. 計算每個CohortGroup在第一個CohortPeriod的用戶數量

cohorts = cohorts.reset_index() \
                 .set_index(['CohortGroup', 'CohortPeriod'])

cohort_group_size = cohorts.TotalUsers.groupby(level=0).first()
cohort_group_size
CohortGroup
2009-01     22
2009-02     15
2009-03     13
2009-04     39
2009-05     50
2009-06     32
2009-07     50
2009-08     31
2009-09     37
2009-10     54
2009-11    130
2009-12     65
2010-01     95
2010-02    100
2010-03     24
Name: TotalUsers, dtype: int64

7. 計算每個CohortPeriod的留存率

user_retention = cohorts.TotalUsers.unstack(0).divide(cohort_group_size, axis=1)
user_retention.head()
CohortGroup 2009-01 2009-02 2009-03 2009-04 2009-05 2009-06 2009-07 2009-08 2009-09 2009-10 2009-11 2009-12 2010-01 2010-02 2010-03
CohortPeriod
1 1.000000 1.000000 1.000000 1.000000 1.00 1.00000 1.00 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00 1.0
2 0.363636 0.200000 0.307692 0.333333 0.26 0.46875 0.46 0.354839 0.405405 0.314815 0.246154 0.261538 0.526316 0.19 NaN
3 0.454545 0.333333 0.384615 0.256410 0.24 0.28125 0.26 0.290323 0.378378 0.222222 0.200000 0.276923 0.273684 NaN NaN
4 0.409091 0.066667 0.307692 0.333333 0.10 0.18750 0.20 0.225806 0.216216 0.240741 0.223077 0.107692 NaN NaN NaN
5 0.454545 0.266667 0.076923 0.153846 0.08 0.21875 0.22 0.193548 0.351351 0.240741 0.100000 NaN NaN NaN NaN

留存率曲線

user_retention[['2009-01', '2009-05', '2009-08']] \
  .plot(figsize=(11, 6), color=['#4285f4', '#EA4335', '#A60628'])
plt.title("Cohorts: User Retention")
plt.xticks(np.arange(1, len(user_retention)+1, 1))
plt.xlim(1, len(user_retention))
plt.ylabel('% of Cohort Purchasing', fontsize=16)

留存率熱力圖

import seaborn as sns
sns.set(style='white')

plt.figure(figsize=(16, 8))
plt.title('Cohorts: User Retetion', fontsize=14)
sns.heatmap(user_retention.T, 
            mask=user_retention.T.isnull(), 
            annot=True, fmt='.0%')


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM