The difference between the transform and apply functions in pandas


There are two major differences between the transform and apply groupby methods.

  • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function
  • The custom function passed to apply can return a scalar, a Series or a DataFrame (or a NumPy array, or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) of the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
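
As a minimal sketch of these two differences, reusing the small state/a/b table from the example further down: the function given to apply can combine columns of a group and return one scalar per group, while the function given to transform is written for a single column and its result stays aligned with the original rows. The variable names per_group and centered are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Florida", "Florida", "Texas", "Texas"],
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
})
grouped = df.groupby("state")[["a", "b"]]

# apply: the function sees both columns of a group at once and may
# return a scalar, which gives one value per group.
per_group = grouped.apply(lambda g: (g["a"] - g["b"]).sum())

# transform: the function is written for one column at a time, and the
# result keeps the original row index.
centered = grouped.transform(lambda s: s - s.mean())

print(per_group)
print(centered)
```

Here per_group has one entry per state, whereas centered has the same four rows as df.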

Source: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#

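The transform function:

  1. It works on only one Series (column) at a time; a function that subtracts column 'b' from column 'a' raises an exception.

  2. It must return a one-dimensional sequence with the same number of rows as the group.

  3. A function that returns a single scalar also works, e.g. .transform(sum); the scalar is repeated for every row of the group.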
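
A quick sketch of point 1, assuming the same state/a/b table as in the example below: the function is handed a single column, so looking up 'a' and 'b' inside it fails. The exact exception type and message depend on the pandas version, so the sketch catches a generic Exception rather than naming one.

```python
import pandas as pd

data = pd.DataFrame({
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
    "state": ["Florida", "Florida", "Texas", "Texas"],
})

def sub_two(x):
    # Written for a DataFrame: it needs both columns at once.
    return x["a"] - x["b"]

try:
    data.groupby("state")[["a", "b"]].transform(sub_two)
    failed = False
except Exception as exc:  # exact type varies with the pandas version
    failed = True
    print(type(exc).__name__, exc)

print("transform failed:", failed)
```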

 

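The apply function:

  1. Unlike transform, which works on one Series at a time, apply works on the entire DataFrame of each group.

  2. apply implicitly passes all the columns of each group to the custom function as a DataFrame.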

Example:

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({'a': [4, 5, 1, 3],
                         'b': [6, 10, 3, 11],
                         'state': ['Florida', 'Florida', 'Texas', 'Texas']})
    print(data)
    #    a   b    state
    # 0  4   6  Florida
    # 1  5  10  Florida
    # 2  1   3    Texas
    # 3  3  11    Texas

    def sub_two(X):
        # Needs both columns, so it only works when X is a DataFrame.
        return X['a'] - X['b']

    # Using transform here instead of apply raises an error.
    data1 = data.groupby('state')[['a', 'b']].apply(sub_two)
    print(data1)
    # state
    # Florida  0   -2
    #          1   -5
    # Texas    2   -2
    #          3   -8
    # dtype: int64

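transform can also be used when the function returns a single scalar:

As the code below shows, transform and apply produce output of different shapes: transform returns as many rows as the original data, whereas apply aggregates to one row per group.

In this case, the apply output conveys the information more clearly.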

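    def group_sum(x):
        return x.sum()

    # transform returns as many rows as the original data.
    data3 = data.groupby('state').transform(group_sum)
    print(data3)
    #    a   b
    # 0  9  16
    # 1  9  16
    # 2  4  14
    # 3  4  14

    # apply, by contrast, returns one row per group.
    # (Output from the original post; newer pandas versions may leave
    # out the state column.)
    data4 = data.groupby('state').apply(group_sum)
    print(data4)
    #          a   b           state
    # state
    # Florida  9  16  FloridaFlorida
    # Texas    4  14      TexasTexas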

The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised.
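
A small sketch of that failure, assuming the same state/a/b table: the hypothetical function return_three always returns three values although each group has only two rows. The exception type and wording differ between pandas versions, so the sketch only records that an error occurred.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
    "state": ["Florida", "Florida", "Texas", "Texas"],
})

def return_three(s):
    # Always three values, although each group has only two rows.
    return np.array([1, 2, 3])

try:
    data.groupby("state")["a"].transform(return_three)
    raised = False
except Exception as exc:  # exact type varies with the pandas version
    raised = True
    print(type(exc).__name__, exc)

print("length mismatch rejected:", raised)
```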

Example 2:

    np.random.seed(666)
    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})
    print(df)
    #      A      B         C         D
    # 0  foo    one  0.824188  0.640573
    # 1  bar    one  0.479966 -0.786443
    # 2  foo    two  1.173468  0.608870
    # 3  bar  three  0.909048 -0.931012
    # 4  foo    two -0.571721  0.978222
    # 5  bar    two -0.109497 -0.736918
    # 6  foo    one  0.019028 -0.298733
    # 7  foo  three -0.943761 -0.460587

    def zscore(x):
        return (x - x.mean()) / x.std()

    # Select the numeric columns C and D explicitly. Older pandas versions
    # picked them automatically; newer ones may raise on the string column B.
    print(df.groupby('A')[['C', 'D']].transform(zscore))
    # In this form apply gives the same values (depending on the pandas
    # version, apply may add the group key to the index).
    print(df.groupby('A')[['C', 'D']].apply(zscore))
    # df.groupby('A').apply(zscore) raises an error, because apply works
    # on the whole DataFrame, including the string column B.

    # Group by column A, then compute the sum of column C within each group.
    df['sum_c'] = df.groupby('A')['C'].transform('sum')
    df = df.sort_values('A')
    print(df)
    #      A      B         C         D     sum_c
    # 1  bar    one  0.479966 -0.786443  1.279517
    # 3  bar  three  0.909048 -0.931012  1.279517
    # 5  bar    two -0.109497 -0.736918  1.279517
    # 0  foo    one  0.824188  0.640573  0.501202
    # 2  foo    two  1.173468  0.608870  0.501202
    # 4  foo    two -0.571721  0.978222  0.501202
    # 6  foo    one  0.019028 -0.298733  0.501202
    # 7  foo  three -0.943761 -0.460587  0.501202

    print(df.groupby('A')['C'].apply(lambda s: s.sum()))
    # A
    # bar    1.279517
    # foo    0.501202
    # Name: C, dtype: float64

The function passed to transform must return a number, a row, or something with the same shape as its argument. If it returns a number, that number is assigned to every element in the group; if it returns a row, the row is broadcast to all the rows in the group.
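
The scalar case can be sketched as follows, again assuming the small state/a/b table; the names spread and a_share are made up for this illustration. Because the broadcast result lines up with the original rows, it can be combined directly with an existing column.

```python
import pandas as pd

data = pd.DataFrame({
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
    "state": ["Florida", "Florida", "Texas", "Texas"],
})

# The lambda returns one number per group; transform repeats it for
# every row of that group.
spread = data.groupby("state")["a"].transform(lambda s: s.max() - s.min())
print(spread.tolist())

# The broadcast group total has the same index as data, so each row
# can be divided by the total of its own group.
data["a_share"] = data["a"] / data.groupby("state")["a"].transform("sum")
print(data)
```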


 

Reference: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#

