Feature Engineering: Chi-Square Binning

This article is a repost (2019-03-17 16:24). Category: machine learning.

1. Definition

Binning discretizes a continuous variable, or merges a discrete variable with many levels into one with fewer levels.

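2. Uses of binning

Discrete features are easy to add and remove, which supports rapid model iteration;

Inner products of sparse vectors are fast to compute, and the results are easy to store and to scale;

Discretized features are robust to outliers. Take a feature that is 1 when age > 30 and 0 otherwise: without discretization, an anomalous record such as "age 300" would disturb the model considerably;

Logistic regression is a generalized linear model with limited expressive power. Once a single variable is discretized into N indicator variables, each one gets its own weight, which effectively introduces non-linearity and lets the model fit the data better;

After discretization, features can be crossed, turning M+N variables into M*N variables, which introduces further non-linearity and expressive power;

Discretized features make the model more stable. If user age is discretized with 20-30 as one interval, a user does not become a completely different person just by turning one year older. Samples close to an interval boundary behave in exactly the opposite way, so choosing the boundaries takes care;

Discretization simplifies the logistic regression model and lowers the risk of overfitting;

Missing values can be fed into the model as a category of their own;

All variables are mapped onto a similar scale.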
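The outlier point above can be illustrated with a minimal sketch. The 30-year threshold and the "age 300" record come from the text; the toy data and variable names are made up for illustration.

```python
import pandas as pd

# Toy ages, including one obviously bad record (300 years).
age = pd.Series([22, 28, 35, 41, 300], name="age")

# Binary discretization from the text: 1 if age > 30, else 0.
age_gt_30 = (age > 30).astype(int)

# The raw feature is dominated by the outlier; the binned one is not.
print(age.max())           # 300
print(age_gt_30.tolist())  # [0, 0, 1, 1, 1]
```

After discretization the outlier is simply one more "older than 30" record.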

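3. Binning methods

Binning methods are either unsupervised or supervised.

　　1) Common unsupervised methods are equal-frequency binning, equal-width binning and cluster-based binning.
　　2) The main supervised methods are Best-KS binning and chi-square binning. My project relied mostly on chi-square binning, so the rest of this article summarizes that method.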
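For comparison, the two simplest unsupervised methods can be sketched with pandas. This is only an illustration on made-up data: `pd.cut` is used for equal-width bins and `pd.qcut` for equal-frequency bins, and the number of bins is an arbitrary choice.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: 3 intervals of the same length over [min, max].
equal_width = pd.cut(values, bins=3, labels=False)

# Equal-frequency: 2 intervals holding the same number of rows.
equal_freq = pd.qcut(values, q=2, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_freq.tolist())   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Neither method looks at the target, which is why one extreme value can leave an equal-width bin empty. Supervised methods such as chi-square binning use the class labels to decide where to cut.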

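4. How chi-square binning works

Chi-square binning is a bottom-up (merge-based) discretization method.

It relies on the chi-square test: the adjacent intervals with the smallest chi-square value are merged, and this repeats until a stopping criterion is met.
Basic idea: for an accurate discretization, the relative class frequencies should be fully consistent within an interval.
So if two adjacent intervals have very similar class distributions, they can be merged;
otherwise they should stay separate. A low chi-square value indicates similar class distributions.
Binning steps:
　　1) Sort the instances by the variable and put each distinct value in its own interval.
　　2) Compute the chi-square value of every pair of adjacent intervals.
　　3) Merge the pair with the smallest chi-square value.
　　4) Repeat steps 2 and 3 until the stopping criterion is met.

Note that the instances must be sorted at initialization; merging is then done on the sorted order.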
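The merge procedure described above can be sketched as follows. This is a minimal illustration, not the code used later in this article: the names `pair_chi2` and `chimerge_sketch` are hypothetical, the target is assumed to be coded 0/1, and the only stopping criterion is a maximum number of bins.

```python
import pandas as pd

def pair_chi2(a, b):
    """Chi-square of the 2 x 2 table formed by two adjacent intervals.

    a and b are (good_count, bad_count) tuples.
    """
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in (0, 1):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # skip empty cells to avoid dividing by zero
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2

def chimerge_sketch(df, col, target, max_bins):
    """Merge adjacent intervals of df[col] until max_bins remain."""
    # One starting interval per distinct value, in sorted order.
    grouped = df.groupby(col)[target].agg(["count", "sum"]).sort_index()
    bins = [[v] for v in grouped.index.tolist()]
    counts = [(int(n - b), int(b))
              for n, b in zip(grouped["count"], grouped["sum"])]
    while len(bins) > max_bins:
        # Score every pair of adjacent intervals and merge the most similar.
        scores = [pair_chi2(counts[i], counts[i + 1])
                  for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        bins[i] = bins[i] + bins.pop(i + 1)
        nxt = counts.pop(i + 1)
        counts[i] = (counts[i][0] + nxt[0], counts[i][1] + nxt[1])
    return bins

demo = pd.DataFrame({
    "x": [1, 1, 2, 2, 3, 3, 4, 4],
    "y": [0, 0, 0, 0, 1, 1, 1, 1],
})
print(chimerge_sketch(demo, "x", "y", max_bins=2))  # [[1, 2], [3, 4]]
```

Values 1 and 2 have identical class distributions (chi-square 0), as do 3 and 4, so those pairs are merged first and the boundary between 2 and 3 survives.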

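Choosing the chi-square threshold:
　　The threshold is obtained from the significance level and the degrees of freedom.
　　The degrees of freedom equal the number of classes minus 1. For example, with 3 classes there are 2 degrees of freedom, and at 90% confidence (10% significance level) the chi-square value is 4.6.

Meaning of the threshold
　　When the class and the attribute are independent, there is a 90% probability that the computed chi-square value is below 4.6.
　　A chi-square value above the threshold of 4.6 therefore indicates that the attribute and the class are not independent, so the intervals should not be merged. A larger threshold causes more merges, leaving fewer and wider intervals.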
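The two threshold values quoted in this article (4.6 for 2 degrees of freedom at 90%, and 3.841 for 1 degree of freedom at 95%) can be reproduced with the standard library alone. The sketch below covers only those two cases, using closed forms of the chi-square distribution; the function name `chi2_threshold` is hypothetical. For other degrees of freedom a statistics library such as `scipy.stats.chi2.ppf` would be the usual choice.

```python
import math
from statistics import NormalDist

def chi2_threshold(confidence, dof):
    """Chi-square critical value for dof = 1 or 2."""
    if dof == 1:
        # With 1 degree of freedom the variable is a squared standard normal.
        z = NormalDist().inv_cdf((1 + confidence) / 2)
        return z * z
    if dof == 2:
        # With 2 degrees of freedom the CDF is 1 - exp(-x / 2).
        return -2 * math.log(1 - confidence)
    raise ValueError("this sketch only covers 1 or 2 degrees of freedom")

print(round(chi2_threshold(0.90, 2), 3))  # 4.605
print(round(chi2_threshold(0.95, 1), 3))  # 3.841
```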

　　
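Notes:
1. The ChiMerge algorithm recommends a confidence level of 0.90, 0.95 or 0.99, with the maximum number of intervals between 10 and 15.
2. The chi-square threshold can also be dropped in favor of a minimum or maximum number of intervals, that is, lower and upper limits on how many intervals are allowed.
3. A categorical variable has to be ordered in some way before it can be binned.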

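5. Evaluating the bins

Once binning is done, the result needs to be evaluated. In scorecard models, the most common approach is to compute the WOE and IV values.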

https://www.cnblogs.com/wqbin/p/10547628.html
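As a small worked example, WOE and IV can be computed by hand for three bins. The counts below are invented; the sign convention, WOE = ln(good share / bad share), follows the `CalcWOE` function given later in this article.

```python
import numpy as np
import pandas as pd

# Good and bad counts per bin (made-up numbers).
bins = pd.DataFrame({"good": [80, 60, 60], "bad": [10, 20, 70]},
                    index=["bin0", "bin1", "bin2"])

# Share of all goods / all bads that falls into each bin.
good_pcnt = bins["good"] / bins["good"].sum()
bad_pcnt = bins["bad"] / bins["bad"].sum()

bins["WOE"] = np.log(good_pcnt / bad_pcnt)
bins["IV"] = (good_pcnt - bad_pcnt) * bins["WOE"]

print(bins)
print("total IV:", bins["IV"].sum())
```

With this convention a bin with relatively more good records has a positive WOE, and every bin contributes a non-negative amount to the IV.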

6. Binning

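import numpy as np
import pandas as pd

def Chi2(df, total_col, bad_col, overallRate):
    '''
    Compute a chi-square value.
    :df DataFrame
    :total_col column holding the total count for each value
    :bad_col column holding the bad count for each value
    :overallRate overall share of bad records
    :return chi-square value (only the bad-count term is summed)
    '''
    df2 = df.copy()
    # Expected bad count for each value if it followed the overall bad rate
    df2['expected'] = df2[total_col] * overallRate
    combined = zip(df2['expected'], df2[bad_col])
    # Skip values whose expected count is 0 to avoid dividing by zero
    chi = [(e - b) ** 2 / e for e, b in combined if e > 0]
    return sum(chi)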

# Chi-square binning driven by a chi-square threshold. Drawback: the number of bins is hard to control.
def ChiMerge_MinChisq(df, col, target, confidenceVal=3.841):
    '''
    Bin using a chi-square threshold as the stopping condition.
    :df DataFrame
    :col feature to bin
    :target target column, coded 0/1
    :confidenceVal threshold; with 1 degree of freedom and 0.95 confidence it is 3.841
    :return the bins, as a list of lists of original values
    One problem: the number of bins is not limited, so the result can end up too fine-grained.
    '''
    # Distinct values of the feature, sorted
    colLevels = sorted(set(df[col]))

    # count gives the number of rows per value
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})

    # sum gives the number of bad rows per value.
    # target must be 0/1 with bad = 1 and good = 0, otherwise this sum is meaningless.
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    # Combine into one frame with columns col, total, bad
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)

    # Total number of rows, number of bad rows, overall bad rate
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B * 1.0 / N

    # Start with one interval per distinct value
    groupIntervals = [[i] for i in colLevels]

    while len(groupIntervals) > 1:
        # Note: each interval is scored on its own against the overall bad rate
        chisqList = []
        for interval in groupIntervals:
            df2 = regroup.loc[regroup[col].isin(interval)]
            chisqList.append(Chi2(df2, 'total', 'bad', overallRate))

        min_position = chisqList.index(min(chisqList))
        if min(chisqList) >= confidenceVal:
            break

        # Merge the lowest-scoring interval with its lower-scoring neighbor
        if min_position == 0:
            combinedPosition = 1
        elif min_position == len(groupIntervals) - 1:
            combinedPosition = min_position - 1
        elif chisqList[min_position - 1] <= chisqList[min_position + 1]:
            combinedPosition = min_position - 1
        else:
            combinedPosition = min_position + 1
        groupIntervals[min_position] = groupIntervals[min_position] + groupIntervals[combinedPosition]
        del groupIntervals[combinedPosition]
    return groupIntervals

# Binning with a maximum number of bins
def ChiMerge_MaxInterval_Original(df, col, target, max_interval=5):
    '''
    :df DataFrame
    :col feature to bin
    :target target column, coded 0/1
    :max_interval maximum number of bins
    :return cut-off points (the upper bound of every bin except the last)
    '''
    colLevels = sorted(set(df[col]))
    N_distinct = len(colLevels)
    if N_distinct <= max_interval:
        # Nothing to merge: every distinct value stays in its own bin
        print("The number of distinct values does not exceed the number of bins")
        return colLevels[:-1]

    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B * 1.0 / N
    # Start with one interval per distinct value
    groupIntervals = [[i] for i in colLevels]
    while len(groupIntervals) > max_interval:
        # Note: each interval is scored on its own against the overall bad rate
        chisqList = []
        for interval in groupIntervals:
            df2 = regroup.loc[regroup[col].isin(interval)]
            chisqList.append(Chi2(df2, 'total', 'bad', overallRate))
        min_position = chisqList.index(min(chisqList))
        if min_position == 0:
            combinedPosition = 1
        elif min_position == len(groupIntervals) - 1:
            combinedPosition = min_position - 1
        elif chisqList[min_position - 1] <= chisqList[min_position + 1]:
            combinedPosition = min_position - 1
        else:
            combinedPosition = min_position + 1
        # Merge the two bins
        groupIntervals[min_position] = groupIntervals[min_position] + groupIntervals[combinedPosition]
        del groupIntervals[combinedPosition]
    groupIntervals = [sorted(i) for i in groupIntervals]
    print(groupIntervals)
    # The largest value of each bin except the last is a cut-off point
    cutOffPoints = [i[-1] for i in groupIntervals[:-1]]
    return cutOffPoints

# Compute WOE and IV
def CalcWOE(df, col, target):
    '''
    :df DataFrame
    :col column that has already been binned; the WOE of each bin and the total IV are computed for it
    :target target column, coded 0/1
    :return WOE per bin, IV per bin and the total IV
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    regroup['good'] = regroup['total'] - regroup['bad']
    G = N - B
    # Share of all bad (good) rows that falls into each bin
    regroup['bad_pcnt'] = regroup['bad'] / B
    regroup['good_pcnt'] = regroup['good'] / G
    # A bin with zero good or zero bad rows yields an infinite WOE
    regroup['WOE'] = np.log(regroup['good_pcnt'] / regroup['bad_pcnt'])
    WOE_dict = regroup[[col, 'WOE']].set_index(col).to_dict(orient='index')
    IV = (regroup['good_pcnt'] - regroup['bad_pcnt']) * regroup['WOE']
    IV_SUM = sum(IV)
    return {'WOE': WOE_dict, 'IV_sum': IV_SUM, 'IV': IV}

# After binning, check whether the bad rate is monotone across bins. If it is not, keep merging adjacent bins until it is.
def BadRateMonotone(df, sortByVar, target):
    # df[sortByVar] has already been binned
    df2 = df.sort_values(by=[sortByVar])
    total = df2.groupby([sortByVar])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df2.groupby([sortByVar])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    combined = zip(regroup['total'], regroup['bad'])
    badRate = [b * 1.0 / t for t, b in combined]
    # True where the bad rate rises from one bin to the next
    badRateMonotone = [badRate[i] < badRate[i + 1] for i in range(len(badRate) - 1)]
    # Monotone when all comparisons agree (a single bin counts as monotone)
    return len(set(badRateMonotone)) <= 1

# Check the largest bin: if it holds more than 90% of all rows, drop the variable
def MaximumBinPcnt(df, col):
    N = df.shape[0]
    total = df.groupby([col])[col].count()
    pcnt = total * 1.0 / N
    # Share of rows in the largest bin
    return max(pcnt)

# For categorical data, replace each value with its bad rate to obtain a continuous variable, then bin that.
# The household-registration area code in our data is an example of this kind of variable.
# When there are only a few categories, binning is in principle unnecessary.
def BadRateEncoding(df, col, target):
    '''
    :df DataFrame
    :col feature column to encode as a bad rate
    :target target column, coded 0/1
    :return the encoded column and the value-to-bad-rate mapping
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup['bad'] * 1.0 / regroup['total']
    br_dict = regroup[[col, 'bad_rate']].set_index([col]).to_dict(orient='index')
    badRateEncoding = df[col].map(lambda x: br_dict[x]['bad_rate'])
    return {'encoding': badRateEncoding, 'br_rate': br_dict}


7. Automated binning

In a real project we want to bin every continuous variable that needs it automatically, which calls for an automated binning script:

class Woe_IV:


    def __init__(self, df, colList, target):
        '''
        :param df: the DataFrame to bin
        :param colList: the columns to bin, given as a list of dicts,
         e.g. colList=[
              {
                'col':'openning_room_num_n3',
                'bandCol':'openning_room_num_n3_band',
                'bandNum':6,
                'toCsvPath':'/home/liuweitang/yellow_model/data/mk/my.txt'
              },

         ]
        :param target: target column, coded 0/1; 1 means bad, 0 means good
        '''
        self.df = df
        self.colList = colList
        self.target = target

    def to_band(self):
        for colParam in self.colList:
            col, bandCol, bandNum = colParam['col'], colParam['bandCol'], colParam['bandNum']
            # Cut-off points of the bins, e.g. a list of length 5 such as [0,4,13,45,78]
            # or of length 6 such as [0,2,4,56,67,89]
            cutOffPoints = ChiMerge_MaxInterval_Original(self.df, col, self.target, bandNum)
            print(cutOffPoints)

            indexValue = 0
            value_band = []
            # bandNum-1 cut-offs: everything up to the first cut-off is a bin of its own
            if len(cutOffPoints) == bandNum - 1:
                print('len-1 type')
                for j in range(len(cutOffPoints)):
                    if j == 0:
                        self.df.loc[self.df[col] <= cutOffPoints[j], bandCol] = indexValue
                        indexValue += 1
                        value_band.append('0-' + str(cutOffPoints[j]))
                    if j > 0:
                        self.df.loc[(self.df[col] > cutOffPoints[j - 1]) & (self.df[col] <= cutOffPoints[j]), bandCol] = indexValue
                        indexValue += 1
                        value_band.append(str(cutOffPoints[j - 1] + 1) + "-" + str(cutOffPoints[j]))
                    if j == len(cutOffPoints) - 1:
                        self.df.loc[self.df[col] > cutOffPoints[j], bandCol] = indexValue
                        value_band.append(str(cutOffPoints[j] + 1) + "-")

            # bandNum cut-offs: split directly on the cut-offs
            # (rows at or below the first cut-off are left without a band)
            if len(cutOffPoints) == bandNum:
                print('len type')
                for j in range(len(cutOffPoints)):
                    if j > 0:
                        self.df.loc[(self.df[col] > cutOffPoints[j - 1]) & (self.df[col] <= cutOffPoints[j]), bandCol] = indexValue
                        value_band.append(str(cutOffPoints[j - 1] + 1) + "-" + str(cutOffPoints[j]))
                        indexValue += 1
                    if j == len(cutOffPoints) - 1:
                        self.df.loc[self.df[col] > cutOffPoints[j], bandCol] = indexValue
                        value_band.append(str(cutOffPoints[j] + 1) + "-")

            # Store the band index as an integer (the original discarded this result)
            self.df[bandCol] = self.df[bandCol].astype(int)
            # Binning is done; now check monotonicity
            isMonotone = BadRateMonotone(self.df, bandCol, self.target)

            # If the bad rate is not monotone, report it and move on to the next feature
            if not isMonotone:
                print(col + ' band error, reason is not monotone')
                continue

            # Monotonicity holds, so compute WOE and IV
            woe_IV = CalcWOE(self.df, bandCol, self.target)
            woe = woe_IV['WOE']
            woe_result = [woe[k]['WOE'] for k in range(len(woe))]

            iv = woe_IV['IV']
            iv_result = [iv[k] for k in range(len(iv))]

            # Good and bad counts per band, indexed by (band, target)
            good_bad_count = self.df.groupby([bandCol, self.target])[self.target].count()
            good_count = []
            bad_count = []
            for k in range(bandNum):
                good_count.append(good_bad_count[k][0])
                bad_count.append(good_bad_count[k][1])

            print(value_band)
            print(good_count)
            print(bad_count)
            print(woe_result)
            print(iv_result)
            # Collect the WOE and IV values in a DataFrame and export it to CSV
            # There is actually one more issue here, namely
            woe_iv_df = pd.DataFrame({
                'IV': iv_result,
                'WOE': woe_result,
                'bad': bad_count,
                'good': good_count,
                bandCol: value_band
            })

            woe_iv_df.to_csv(colParam['toCsvPath'])
            print(col + ' band finished')


Using it is straightforward:

openning_data=pd.read_csv('***',sep='$')
colList=[
    {
        'col':'openning_room_0_6_num_n3',
        'bandCol':'openning_room_0_6_num_n3_band',
        'bandNum':5,
        'toCsvPath':'/home/liuweitang/yellow_model/eda/band_result/openning_room_0_6_num_n3_band.csv'
    },
    {
        'col':'openning_room_6_12_num_n3',
        'bandCol':'openning_room_6_12_num_n3_band',
        'bandNum':5,
        'toCsvPath':'/home/liuweitang/yellow_model/eda/band_result/openning_room_6_12_num_n3_band.csv'
    }
]
band2=Woe_IV(openning_data,colList,'label')
band2.to_band()


8. Post-processing after binning

After binning, the bins need to be encoded:

dummy
one-hot
label-encode
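The three encodings can be sketched with pandas on a toy binned column. The column name `age_band` and the bin labels are made up for illustration, and "dummy" is taken here to mean one-hot encoding with one column dropped.

```python
import pandas as pd

# A binned feature with three ordered bins.
band = pd.Series(
    pd.Categorical(["low", "high", "mid", "high", "low"],
                   categories=["low", "mid", "high"], ordered=True),
    name="age_band",
)

# One-hot: one indicator column per bin.
one_hot = pd.get_dummies(band, prefix="age_band")

# Dummy: the same, with the first column dropped to avoid perfect collinearity.
dummy = pd.get_dummies(band, prefix="age_band", drop_first=True)

# Label encoding: one integer per bin, following the bin order.
label = band.cat.codes

print(one_hot.shape, dummy.shape)  # (5, 3) (5, 2)
print(label.tolist())              # [0, 2, 1, 2, 0]
```

Label encoding keeps the ordering of the bins in a single column, while one-hot and dummy encoding give each bin its own weight in a linear model.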

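9. Caveats

One thing to watch for: after binning, the bad/good ratio in some bins can be extremely unbalanced, and in the extreme case the number of bad or good records in a bin is 0. The later WOE calculation then produces inf, which is not a usable value. This indicates that the binning is too fine and the number of bins should be reduced further.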
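One way to act on this is to merge any bin with a zero good or bad count into a neighboring bin before computing WOE. The helper below is a minimal sketch with a hypothetical name (`merge_zero_bins`); it merges into the previous bin where one exists, which is only one of several reasonable choices.

```python
import pandas as pd

def merge_zero_bins(counts):
    """Merge bins whose good or bad count is zero into a neighbor.

    counts: DataFrame with 'good' and 'bad' columns, one row per bin,
    in bin order. Returns a list of (good, bad) tuples.
    """
    merged = [(int(g), int(b)) for g, b in zip(counts["good"], counts["bad"])]
    i = 0
    while i < len(merged) and len(merged) > 1:
        good, bad = merged[i]
        if good == 0 or bad == 0:
            # Previous neighbor if there is one, otherwise the next bin.
            j = i - 1 if i > 0 else i + 1
            merged[min(i, j)] = (merged[j][0] + good, merged[j][1] + bad)
            del merged[max(i, j)]
            i = 0  # re-check from the start after every merge
        else:
            i += 1
    return merged

counts = pd.DataFrame({"good": [50, 30, 20], "bad": [5, 0, 10]})
print(merge_zero_bins(counts))  # [(80, 5), (20, 10)]
```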
