Python Study Notes: Binning Data with pd.cut and pd.qcut

Reposted article · 2021-11-01 16:56 · Python

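In machine learning, continuous data is often binned: a range of continuous values is split into several segments, and each segment is treated as one category.

This conversion from continuous values to discrete values is called binning.

For example, ages can be divided into five labels (categories): under 18, 18-30, 30-45, 45-60, and over 60.

Both cut and qcut in pandas perform binning. The difference is:

cut: splits by value; the intervals are equal-width when bins is an integer, or follow the given edges when bins is a list
qcut: splits by the data distribution; the bins are equal-frequency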
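
To make the equal-width versus equal-frequency distinction concrete, here is a minimal sketch on a made-up, right-skewed sample (the `values` array is hypothetical and not part of the original notes). It asks both functions for their bin edges via `retbins=True` and counts how many values land in each bin.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample, used only for illustration.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 200])

# Equal-width: the edges are evenly spaced over the value range.
width_bins, width_edges = pd.cut(values, 4, retbins=True)
# Equal-frequency: the edges sit at quantiles of the data.
freq_bins, freq_edges = pd.qcut(values, 4, retbins=True)

# value_counts() on a Categorical reports counts in bin order.
width_counts = width_bins.value_counts().tolist()
freq_counts = freq_bins.value_counts().tolist()

print(width_edges, width_counts)
print(freq_edges, freq_counts)
```

With one large outlier, the equal-width bins put almost every value in the first bin, while the quantile-based bins stay roughly balanced.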

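I. The pd.cut function

1. Syntax

pandas.cut(x,   # 1-D array-like to be binned
          bins, # int: number of equal-width bins; sequence: the bin edges
          right=True, # whether each interval includes its right edge; default True
          labels=None, # bin labels; the count must match the number of bins
          retbins=False, # whether to also return the bin edges
          precision=3, # precision used to store and display the bin labels
          include_lowest=False, # whether the first interval also includes its left edge
          duplicates='raise') # what to do when the bin edges are not unique
                   # raise: raise an error  drop: drop the duplicated edges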
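
As a quick illustration of the `right` parameter, the sketch below bins two values that sit exactly on an edge. The `edges` list is a hypothetical set of age boundaries, and `labels=False` is used so that the result is just the integer position of each bin.

```python
import pandas as pd

edges = [0, 18, 60, 100]   # hypothetical age boundaries
on_edge = [18, 60]         # values that sit exactly on an edge

# right=True (default): intervals are (0, 18], (18, 60], (60, 100]
codes_right = pd.cut(on_edge, bins=edges, labels=False)
# right=False: intervals are [0, 18), [18, 60), [60, 100)
codes_left = pd.cut(on_edge, bins=edges, right=False, labels=False)

print(codes_right)  # 18 falls in bin 0, 60 in bin 1
print(codes_left)   # 18 falls in bin 1, 60 in bin 2
```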

2. Hands-on

Build a test set

import pandas as pd
import numpy as np

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32])

Split into 5 equal-width intervals

# split into 5 equal-width intervals
pd.cut(ages, 5)
'''
[(0.901, 20.8], (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], ..., (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], (20.8, 40.6]]
Length: 16
Categories (5, interval[float64]): [(0.901, 20.8] < (20.8, 40.6] < (40.6, 60.4] < (60.4, 80.2] <
                                    (80.2, 100.0]]
'''

pd.cut(ages, 5).value_counts()
'''
(0.901, 20.8]    6
(20.8, 40.6]     5
(40.6, 60.4]     1
(60.4, 80.2]     2
(80.2, 100.0]    2
dtype: int64
'''

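With an integer bins, the range is widened slightly so that the minimum is not lost: with the default right-closed intervals, the left edge is lowered (here from 1 to 0.901), while the right edge stays at the maximum, 100.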
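
The edges in the output above can be reproduced by hand. The following is a sketch of the observed behaviour rather than a statement about pandas internals: it assumes the edges are evenly spaced between the minimum and maximum, with the left edge then lowered by 0.1% of the range. (The pandas documentation describes this as extending the range by 0.1% on each side.)

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])

# Ask pd.cut for the edges it actually used.
_, edges = pd.cut(ages, 5, retbins=True)

# Rebuild them by hand: 6 evenly spaced edges, left edge lowered by 0.1% of the range.
span = ages.max() - ages.min()                      # 100 - 1 = 99
expected = np.linspace(ages.min(), ages.max(), 6)
expected[0] -= 0.001 * span                         # 1 - 0.099 = 0.901

print(edges)
print(np.allclose(edges, expected))
```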

Split evenly and specify labels

pd.cut(ages, 5, labels=['infant', 'youth', 'middle-aged', 'mature', 'senior'])
'''
['infant', 'infant', 'infant', 'youth', 'youth', ..., 'infant', 'infant', 'youth', 'youth', 'youth']
Length: 16
Categories (5, object): ['infant' < 'youth' < 'middle-aged' < 'mature' < 'senior']
'''

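Split on explicitly specified edges

pd.cut(ages, 
       bins=[0,5,20,30,50,100],
       labels=['infant', 'youth', 'middle-aged', 'mature', 'senior'])
'''
['infant', 'infant', 'youth', 'mature', 'mature', ..., 'youth', 'youth', 'middle-aged', 'middle-aged', 'mature']
Length: 16
Categories (5, object): ['infant' < 'youth' < 'middle-aged' < 'mature' < 'senior']
'''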

Return the bin edges as well (set retbins=True)

pd.cut(ages, 
       bins=[0,5,20,30,50,100],
       labels=['infant', 'youth', 'middle-aged', 'mature', 'senior'],
       retbins=True)
'''
(['infant', 'infant', 'youth', 'mature', 'mature', ..., 'youth', 'youth', 'middle-aged', 'middle-aged', 'mature']
 Length: 16
 Categories (5, object): ['infant' < 'youth' < 'middle-aged' < 'mature' < 'senior'],
 array([  0,   5,  20,  30,  50, 100]))
'''
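
One common reason to ask for the edges is to reuse them on other data, so that new values are binned exactly like the original ones. The sketch below assumes two made-up arrays, `train` and `new`; neither appears in the original notes.

```python
import numpy as np
import pandas as pd

train = np.array([0, 10, 20, 30])   # hypothetical values used to derive the bins
new = np.array([5, 15, 25, 40])     # hypothetical unseen values

# Derive 3 equal-width bins from the first array and keep the edges.
_, edges = pd.cut(train, 3, retbins=True)

# Apply the very same edges to the second array.
codes = pd.cut(new, bins=edges, labels=False)

print(edges)
print(codes)  # 40 lies beyond the last edge, so its code is NaN
```

Values outside the stored edges come back as NaN, so it is worth checking for missing codes after reusing bins this way.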

Return only the index of the bin each value falls into (set labels=False)

pd.cut(ages,
       bins=[0,5,20,30,50,100],
       labels=False)
# array([0, 0, 1, 3, 3, 1, 4, 4, 4, 4, 4, 1, 1, 2, 2, 3], dtype=int64)

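By default every interval is right-closed: it includes its upper edge and excludes its lower edge.

When bins is an integer, the leftmost edge is set to the minimum minus 0.1% of the range (max - min); for ages that is 1 - 0.001 × 99 = 0.901.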
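
A practical consequence of right-closed intervals shows up when the edges are given explicitly: a value equal to the lowest edge belongs to no bin and becomes NaN, as does any value beyond the last edge. The sketch below uses a hypothetical `sample` list to show this, and how `include_lowest=True` recovers the lowest edge.

```python
import numpy as np
import pandas as pd

bins = [0, 5, 20, 30, 50, 100]
sample = [0, 5, 100, 150]   # hypothetical values: lowest edge, inner edge, top edge, out of range

# Default: the first interval is (0, 5], so 0 is not covered.
default_codes = pd.cut(sample, bins=bins, labels=False)
# include_lowest=True makes the first interval include 0.
lowest_codes = pd.cut(sample, bins=bins, labels=False, include_lowest=True)

print(np.isnan(default_codes))  # True marks values that fell outside every bin
print(np.isnan(lowest_codes))
```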

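II. The pd.qcut function

1. Syntax

pd.qcut splits by the number of observations, trying to put the same number of values into each group.

pd.qcut(
    x, # 1-D array or Series
    q, # int: number of quantiles; list: quantile edges, e.g. [0, .25, .5, .75, 1.]
    labels=None, # bin labels
    retbins: bool = False, # whether to also return the bin edges
    precision: int = 3, # precision of the bin labels
    duplicates: str = "raise", # 'raise' or 'drop' when the edges are not unique
)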
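
The `duplicates` parameter matters most for qcut, because heavily tied data can produce identical quantiles. The following sketch uses a made-up Series, `tied`, in which half the values are zero, so the 0% and 25% quantiles coincide.

```python
import pandas as pd

tied = pd.Series([0, 0, 0, 0, 1, 2, 3, 4])   # hypothetical data with many ties

# With the default duplicates='raise', repeated edges are an error.
try:
    pd.qcut(tied, 4)
    raised = False
except ValueError:
    raised = True

# duplicates='drop' merges the repeated edges, leaving fewer bins than requested.
binned = pd.qcut(tied, 4, duplicates='drop')
n_bins = binned.cat.categories.size

print(raised, n_bins)
```

Note that after dropping edges there are fewer bins than q, so a labels list sized for q bins would no longer fit.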

2. Hands-on

Simple binning by count

import numpy as np
import pandas as pd

factors = np.random.randn(9)  # no seed is set, so your values will differ from the output below
pd.qcut(factors, 3)
'''
[(-0.272, 0.33], (0.33, 1.116], (0.33, 1.116], (0.33, 1.116], (-0.272, 0.33], (-1.101, -0.272], (-1.101, -0.272], (-0.272, 0.33], (-1.101, -0.272]]
Categories (3, interval[float64]): [(-1.101, -0.272] < (-0.272, 0.33] < (0.33, 1.116]]
'''
pd.qcut(factors, 3).value_counts() # evenly split
'''
(-1.101, -0.272]    3
(-0.272, 0.33]      3
(0.33, 1.116]       3
dtype: int64
'''

Add labels

pd.qcut(factors,
        3,
        labels=['a','b','c'])
'''
['b', 'c', 'c', 'c', 'b', 'a', 'a', 'b', 'a']
Categories (3, object): ['a' < 'b' < 'c']
'''
# return the index of the group each value belongs to
pd.qcut(factors, 3, labels=False)
# array([1, 2, 2, 2, 1, 0, 0, 1, 0], dtype=int64)

Return the bin edges

pd.qcut(factors, 3, retbins=True)
'''
([(-0.272, 0.33], (0.33, 1.116], (0.33, 1.116], (0.33, 1.116], (-0.272, 0.33], (-1.101, -0.272], (-1.101, -0.272], (-0.272, 0.33], (-1.101, -0.272]]
 Categories (3, interval[float64]): [(-1.101, -0.272] < (-0.272, 0.33] < (0.33, 1.116]],
 array([-1.09994407, -0.27157169,  0.32984035,  1.11614022]))
'''
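
Besides an integer, q also accepts a list of quantiles, which gives bins of deliberately unequal size. A small sketch, assuming a made-up `scores` array and hypothetical tier names:

```python
import numpy as np
import pandas as pd

scores = np.arange(1, 11)   # hypothetical scores 1..10

# Bottom 50%, next 40%, top 10%.
tiers = pd.qcut(scores, q=[0, 0.5, 0.9, 1.0], labels=['low', 'mid', 'top'])

# Counts are reported in bin order: low, mid, top.
counts = tiers.value_counts().tolist()
print(counts)
```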

III. Putting it together

1. An example

import pandas as pd
import numpy as np

df = pd.DataFrame([x**2 for x in range(11)],
                  columns=['number'])

# split the value range into 4 equal-width intervals
df['cut_group'] = pd.cut(df['number'], 4)


# split into four groups with, as nearly as possible, the same number of rows each
df['qcut_group'] = pd.qcut(df['number'], 4)
df['qcut_group'].value_counts()
'''
(-0.001, 6.5]    3
(6.5, 25.0]      3
(56.5, 100.0]    3
(25.0, 56.5]     2
'''
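
The notes show the counts for qcut_group only. For comparison, this sketch rebuilds the same DataFrame and lists the counts of both groupings in bin order; the `summary` frame is just a convenience introduced here.

```python
import pandas as pd

df = pd.DataFrame([x**2 for x in range(11)], columns=['number'])
df['cut_group'] = pd.cut(df['number'], 4)
df['qcut_group'] = pd.qcut(df['number'], 4)

# sort_index() orders the counts by bin rather than by frequency.
cut_counts = df['cut_group'].value_counts().sort_index().tolist()
qcut_counts = df['qcut_group'].value_counts().sort_index().tolist()

summary = pd.DataFrame({'cut': cut_counts, 'qcut': qcut_counts})
print(summary)
```

Because the squares are sparse at the high end, the equal-width bins are lopsided while the quantile bins stay close to even.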

2. Day-to-day usage

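# df_copy is assumed to be an existing DataFrame with a numeric cz_fee column
# the edges must be increasing, so this assumes min - 10 < 30 and max + 10 > 200
cz_fee_cut = [min(df_copy.cz_fee)-10, 30, 50, 100, 150, 200, max(df_copy.cz_fee)+10]
df_copy["cz_fee"] = pd.cut(df_copy['cz_fee'], 
                           bins=cz_fee_cut, 
                           right=False,  # left-closed intervals, matching the labels
                           labels=['<30','[30,50)','[50,100)','[100,150)','[150,200)','≥200'])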
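
The snippet above depends on a df_copy that is not shown. Below is a self-contained variant of the same idea. It is a sketch under two assumptions that are not in the original: the cz_fee values are made up, and the outer edges are replaced with -inf and +inf so that the bins stay valid whatever the data's minimum and maximum are. The result goes into a new, hypothetical column `cz_fee_bin` rather than overwriting the original one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_copy; the fee values are made up.
df_copy = pd.DataFrame({'cz_fee': [5, 30, 49.9, 50, 120, 150, 200, 999]})

# Open-ended outer bins instead of min - 10 / max + 10 (an assumption, not the original approach).
edges = [-np.inf, 30, 50, 100, 150, 200, np.inf]
labels = ['<30', '[30,50)', '[50,100)', '[100,150)', '[150,200)', '≥200']

# right=False gives left-closed intervals, matching the label text.
df_copy['cz_fee_bin'] = pd.cut(df_copy['cz_fee'], bins=edges, right=False, labels=labels)

print(df_copy)
```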

