The difference between the transform and apply functions in pandas


There are two major differences between the transform and apply groupby methods.

  • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function
  • The custom function passed to apply can return a scalar, a Series or a DataFrame (or a numpy array or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
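As a rough illustration of the first point, the sketch below records the type of object each method hands to the custom function. The DataFrame and the helper names (record_apply, record_transform) are made up for this example. One caveat, based on the pandas documentation for transform: pandas may also probe the function once with the whole group as a DataFrame to test an internal fast path, so the transform list can contain more than just Series.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Florida", "Florida", "Texas", "Texas"],
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
})

seen_by_apply = []      # type names received by the function given to apply
seen_by_transform = []  # type names received by the function given to transform


def record_apply(group):
    # Hypothetical helper: note what apply passed in, then return a scalar.
    seen_by_apply.append(type(group).__name__)
    return group["a"].sum()


def record_transform(col):
    # Hypothetical helper: note what transform passed in, return it unchanged.
    seen_by_transform.append(type(col).__name__)
    return col


grouped = df.groupby("state")[["a", "b"]]
grouped.apply(record_apply)
grouped.transform(record_transform)

print("apply received:", sorted(set(seen_by_apply)))
print("transform received:", sorted(set(seen_by_transform)))
```

apply only ever receives whole groups, which is why a function that combines two columns works with apply but not with transform.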

Source: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#

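The transform function:

  1. It only transforms one Series at a time. A function that computes column 'a' minus column 'b' raises an exception.

  2. It must return a one-dimensional sequence with the same number of rows as the group.

  3. Returning a single scalar also works, for example .transform(sum).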

 

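The apply function:

  1. Unlike transform, which only works on one Series at a time, apply works on the entire DataFrame.

  2. apply implicitly passes all the columns of each group to the custom function.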
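To make the notes on apply concrete, here is a minimal sketch of two return shapes. The data and the variable names are assumptions chosen for illustration. A scalar per group produces a Series with one entry per group, and a Series per group produces a DataFrame with one row per group.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Florida", "Florida", "Texas", "Texas"],
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
})
grouped = df.groupby("state")[["a", "b"]]

# The function sees both columns of the group, so it can combine them.
# Returning a scalar gives one value per group.
total = grouped.apply(lambda g: (g["a"] * g["b"]).sum())
print(total)

# Returning a Series (indexed by column name) gives one row per group.
spread = grouped.apply(lambda g: g.max() - g.min())
print(spread)
```

In both cases the result is indexed by group, not by the original rows.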

Example:

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({'state': ['Florida', 'Florida', 'Texas', 'Texas'],
                         'a': [4, 5, 1, 3],
                         'b': [6, 10, 3, 11]})
    print(data)
    # Output from the original post (recent pandas versions keep the dict order, so state is printed first):
    #    a   b    state
    # 0  4   6  Florida
    # 1  5  10  Florida
    # 2  1   3    Texas
    # 3  3  11    Texas

    def sub_two(X):
        # X is the whole group as a DataFrame, so both columns are available
        return X['a'] - X['b']

    data1 = data.groupby(data['state']).apply(sub_two)  # using transform here raises an error
    print(data1)
    # state
    # Florida  0   -2
    #          1   -5
    # Texas    2   -2
    #          3   -8
    # dtype: int64

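A function that returns a single scalar can be used with transform:

    def group_sum(x):
        return x.sum()

    data3 = data.groupby(data['state']).transform(group_sum)  # returns as many rows as the original data
    print(data3)
    #    a   b
    # 0  9  16
    # 1  9  16
    # 2  4  14
    # 3  4  14

    # With apply, however:
    data4 = data.groupby(data['state']).apply(group_sum)
    print(data4)
    #          a   b           state
    # state
    # Florida  9  16  FloridaFlorida
    # Texas    4  14      TexasTexas

As the output shows, transform and apply produce results of different shapes: transform returns as many rows as the original data, while apply aggregates each group into a single row.

In this case, the apply output conveys the information more clearly.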

The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised.
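The length requirement can be checked with a small sketch. The data and the wrong_length helper are assumptions made up for this example. On the pandas versions I am aware of, the failure surfaces as a ValueError, though the exact message varies between versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": ["Florida", "Florida", "Texas", "Texas"],
    "a": [4, 5, 1, 3],
})
by_state = df.groupby("state")["a"]

# Same length as each group (two rows in, two rows out): accepted.
ok = by_state.transform(lambda s: s - s.mean())
print(ok)


def wrong_length(s):
    # Hypothetical helper: returns three values for a two-row group.
    return np.array([1, 2, 3])


try:
    by_state.transform(wrong_length)
    raised = False
except ValueError as exc:
    raised = True
    print("transform failed:", exc)
```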

Example 2:

    np.random.seed(666)
    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})
    print(df)
    #      A      B         C         D
    # 0  foo    one  0.824188  0.640573
    # 1  bar    one  0.479966 -0.786443
    # 2  foo    two  1.173468  0.608870
    # 3  bar  three  0.909048 -0.931012
    # 4  foo    two -0.571721  0.978222
    # 5  bar    two -0.109497 -0.736918
    # 6  foo    one  0.019028 -0.298733
    # 7  foo  three -0.943761 -0.460587

    def zscore(x):
        # A z-score divides by the standard deviation (the original post divided by x.var())
        return (x - x.mean()) / x.std()

    # Select the numeric columns explicitly; older pandas versions skipped the string column B automatically
    print(df.groupby('A')[['C', 'D']].transform(zscore))
    # Written this way, apply produces the same values (recent pandas may add the group key to the index)
    print(df.groupby('A')[['C', 'D']].apply(zscore))
    # df.groupby('A').apply(zscore) raises an error, because apply works on the whole DataFrame, including column B

    df['sum_c'] = df.groupby('A')['C'].transform('sum')  # group by column A, then sum column C within each group
    df = df.sort_values('A')
    print(df)
    #      A      B         C         D     sum_c
    # 1  bar    one  0.479966 -0.786443  1.279517
    # 3  bar  three  0.909048 -0.931012  1.279517
    # 5  bar    two -0.109497 -0.736918  1.279517
    # 0  foo    one  0.824188  0.640573  0.501202
    # 2  foo    two  1.173468  0.608870  0.501202
    # 4  foo    two -0.571721  0.978222  0.501202
    # 6  foo    one  0.019028 -0.298733  0.501202
    # 7  foo  three -0.943761 -0.460587  0.501202

    print(df.groupby('A')['C'].apply(lambda s: s.sum()))
    # A
    # bar    1.279517
    # foo    0.501202
    # Name: C, dtype: float64

The function passed to transform must return a number, a row, or something with the same shape as the argument. If it returns a number, that number is assigned to every element in the group; if it returns a row, the row is broadcast to all the rows in the group.
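One common use of this broadcasting is to line a per-group statistic up with the original rows. The sketch below uses made-up data and assumed column names (share_of_state, group_range) to compute each row's share of its group total.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Florida", "Florida", "Texas", "Texas"],
    "a": [4, 5, 1, 3],
})

# The group sum is a single number per group; transform repeats it
# for every row of that group, so the result aligns with df.
group_total = df.groupby("state")["a"].transform("sum")
df["share_of_state"] = df["a"] / group_total

# A custom function that returns a scalar is broadcast the same way.
group_range = df.groupby("state")["a"].transform(lambda s: s.max() - s.min())

print(df.assign(range_a=group_range))
```

Because the result has the same index as the original frame, it can be assigned straight back as a new column without a merge.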



 

Reference: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#

