DataFrame transformations (2): usage and differences of apply(), transform() and agg()

Reposted article, 2020-04-30 00:47. Category: Python data analysis

Usage

transform usage

pandas.Series.transform

Call func on self producing a Series with transformed values.

Produced Series will have same axis length as self.

Parameters
func : function, str, list or dict
Function to use for transforming the data. If a function, must either work when passed a Series or when passed to Series.apply.

Accepted combinations are:

function

string function name

list of functions and/or function names, e.g. [np.exp, 'sqrt']

dict of axis labels -> functions, function names or list of such.

axis : {0 or 'index'}
Parameter needed for compatibility with DataFrame.

*args
Positional arguments to pass to func.

**kwargs
Keyword arguments to pass to func.

Returns
Series
A Series that must have the same length as self.

Raises
ValueError : If the returned Series has a different length than self.

Series.transform(self, func, axis=0, *args, **kwargs)
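
The accepted forms of func listed above can be sketched as follows. This is a minimal illustration, not part of the original post: the Series s and the variable names are made up for the example, and the output shown assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A function: the result keeps the length and index of s.
plus_one = s.transform(lambda x: x + 1)

# A string function name.
roots = s.transform("sqrt")

# A list of functions and/or names: one output column per entry.
both = s.transform([np.exp, "sqrt"])

print(plus_one.tolist())   # [2.0, 5.0, 10.0]
print(roots.tolist())      # [1.0, 2.0, 3.0]
print(both.shape)          # (3, 2)
```

In every case the number of rows equals len(s), which is the property that separates transform from agg.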

agg usage

pandas.Series.agg

Series.agg(self, func, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.

New in version 0.20.0.

Parameters
func : function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

axis : {0 or 'index'}
Parameter needed for compatibility with DataFrame.

*args
Positional arguments to pass to func.

**kwargs
Keyword arguments to pass to func.

Returns
scalar, Series or DataFrame
The return can be:

scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
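
The return types listed above can be checked with a small sketch. The data and variable names are invented for illustration. Note that a Series aggregated with a list of functions gives back a Series indexed by function name, a case the list above does not spell out.

```python
import pandas as pd

# Hypothetical sample data for illustration.
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

single = s.agg("min")             # scalar
several = s.agg(["min", "max"])   # Series indexed by function name
col_sums = df.agg("sum")          # Series: one value per column
table = df.agg(["sum", "min"])    # DataFrame: one row per function

print(single)               # 1
print(several.tolist())     # [1, 4]
print(col_sums.tolist())    # [6, 15]
print(table.shape)          # (2, 2)
```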


pandas.DataFrame.agg

DataFrame.agg(self, func, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.

Parameters
func : function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

function
string function name
list of functions and/or function names, e.g. [np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.

axis : {0 or 'index', 1 or 'columns'}, default 0
If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

*args
Positional arguments to pass to func.

**kwargs
Keyword arguments to pass to func.

Returns
scalar, Series or DataFrame
The return can be:

scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.

The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d, axis=0).
agg is an alias for aggregate. Use the alias.
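
A short sketch of the axis behaviour described above, next to NumPy's flattening default. The frame and names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

per_column = df.agg("mean")           # axis=0: one value per column
per_row = df.agg("mean", axis=1)      # axis=1: one value per row
flattened = np.mean(df.to_numpy())    # NumPy default: mean of all six values

# dict of column label -> function name
by_column = df.agg({"A": "sum", "B": "max"})

print(per_column.tolist())   # [2.0, 5.0]
print(per_row.tolist())      # [2.5, 3.5, 4.5]
print(flattened)             # 3.5
print(by_column.tolist())    # [6, 6]
```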


Example:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4


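Similarities and differences

How apply() compares with transform() and agg():

Similarities:

pandas.core.groupby.GroupBy
pandas.DataFrame
pandas.Series

Objects of all three classes can call the methods above.

Differences: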

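1. apply() accepts any custom function, from a simple sum to a complex function that computes differences between features (columns). On grouped data, agg() cannot do the latter: it also accepts custom functions, but each one receives a single column at a time and should reduce it to one value.

2. agg() and transform() can look up methods such as 'sum', 'max', 'min' and 'count' by string name, e.g. agg('sum'). Depending on the pandas version, apply() may not accept a string name, so pass it a custom function that calls the method on the column instead, e.g. apply(lambda x: x.sum()).

3. transform() cannot take a custom function that combines several features, because transform works on each column (feature) separately.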
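
The three differences above can be sketched on a small grouped frame. The column names key, x and y are invented for this example and do not come from the original post.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})
g = df.groupby("key")

# apply(): the function receives each group as a DataFrame,
# so it can combine several columns.
gap = g[["x", "y"]].apply(lambda grp: (grp["y"] - grp["x"]).sum())

# agg(): a string name, or a custom function that sees one column per group
# and reduces it to a single value.
total = g["x"].agg("sum")
spread = g["x"].agg(lambda col: col.max() - col.min())

# transform(): works per column and returns one value per original row.
share = g["x"].transform(lambda col: col / col.sum())

print(gap.tolist())      # [27, 63]
print(total.tolist())    # [3, 7]
print(spread.tolist())   # [1, 1]
print(len(share))        # 4
```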

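Performance comparison

Measure how long each combination of methods takes on the same simple task.

The data comes from a recent Kaggle competition: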

import pandas as pd

# Correct data types for "sell_prices.csv"
priceDTypes = {"store_id": "category", 
               "item_id": "category", 
               "wm_yr_wk": "int16",
               "sell_price":"float32"}

# Read csv file
prices = pd.read_csv("./sell_prices.csv", 
                     dtype = priceDTypes)

prices.head()

len(prices)

2.1 transform() + custom function

prices.groupby(['store_id','item_id'])['sell_price'].transform(lambda x:x.min())
prices.groupby(['store_id','item_id'])['sell_price'].transform(lambda x:x.max())
prices.groupby(['store_id','item_id'])['sell_price'].transform(lambda x:x.sum())
prices.groupby(['store_id','item_id'])['sell_price'].transform(lambda x:x.count())
len(prices.groupby(['store_id','item_id'])['sell_price'].transform(lambda x:x.mean()))

2.2 transform() + built-in method (by name)

prices.groupby(['store_id','item_id'])['sell_price'].transform('min')
prices.groupby(['store_id','item_id'])['sell_price'].transform('max')
prices.groupby(['store_id','item_id'])['sell_price'].transform('sum')
prices.groupby(['store_id','item_id'])['sell_price'].transform('count')
len(prices.groupby(['store_id','item_id'])['sell_price'].transform('mean'))

2.3 apply() + custom function

prices.groupby(['store_id','item_id'])['sell_price'].apply(lambda x:x.min())
prices.groupby(['store_id','item_id'])['sell_price'].apply(lambda x:x.max())
prices.groupby(['store_id','item_id'])['sell_price'].apply(lambda x:x.sum())
prices.groupby(['store_id','item_id'])['sell_price'].apply(lambda x:x.count())
len(prices.groupby(['store_id','item_id'])['sell_price'].apply(lambda x:x.mean()))

2.4 agg() + custom function

prices.groupby(['store_id','item_id'])['sell_price'].agg(lambda x:x.min())
prices.groupby(['store_id','item_id'])['sell_price'].agg(lambda x:x.max())
prices.groupby(['store_id','item_id'])['sell_price'].agg(lambda x:x.sum())
prices.groupby(['store_id','item_id'])['sell_price'].agg(lambda x:x.count())
len(prices.groupby(['store_id','item_id'])['sell_price'].agg(lambda x:x.mean()))

2.5 agg() + built-in method (by name)

prices.groupby(['store_id','item_id'])['sell_price'].agg('min')
prices.groupby(['store_id','item_id'])['sell_price'].agg('max')
prices.groupby(['store_id','item_id'])['sell_price'].agg('sum')
prices.groupby(['store_id','item_id'])['sell_price'].agg('count')
len(prices.groupby(['store_id','item_id'])['sell_price'].agg('mean'))

2.6 Conclusion

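agg() with a built-in method is the fastest, followed by transform() with a built-in method. transform() with a custom function is the slowest combination and should be avoided!

Python's own stats module is also very slow when computing on pandas structures and should be avoided as well!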
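
One way to reproduce this kind of comparison without the Kaggle file is sketched below. It runs on randomly generated data, so the frame, its size and the timing harness are all assumptions, and the ranking on another machine or pandas version may differ from the conclusion above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for sell_prices.csv (assumption: not the real data).
rng = np.random.default_rng(0)
n = 20_000
prices = pd.DataFrame({
    "store_id": rng.integers(0, 5, n),
    "item_id": rng.integers(0, 100, n),
    "sell_price": rng.random(n).astype("float32"),
})
grouped = prices.groupby(["store_id", "item_id"])["sell_price"]

# The five combinations from sections 2.1 to 2.5, using min as the task.
candidates = {
    "transform(lambda)": lambda: grouped.transform(lambda x: x.min()),
    "transform('min')": lambda: grouped.transform("min"),
    "apply(lambda)": lambda: grouped.apply(lambda x: x.min()),
    "agg(lambda)": lambda: grouped.agg(lambda x: x.min()),
    "agg('min')": lambda: grouped.agg("min"),
}

# Best of three single runs for each combination.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {seconds:.4f} s")
```

Taking the minimum over a few repeats reduces noise from other processes; for a serious benchmark, run on the real data with more repeats.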

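Differences in the converted result

Assigning the result of agg() on grouped data directly to a column of the original df raises an error.

Assigning the result of apply() on grouped data directly to a column of the original df raises an error.

Assigning the result of transform() on grouped data directly to a column of the original df does not raise an error, because transform() returns one value per original row, with the same index as df, whereas agg() (and apply() with a reducing function) return one value per group, indexed by the group keys.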
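
The shape difference behind this behaviour can be sketched as follows. The frame and column names are made up for the example; the merge at the end is one possible way to bring an agg() result back onto the original rows, not something the original post prescribes.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "store": ["s1", "s1", "s2", "s2"],
    "item": ["i1", "i1", "i1", "i2"],
    "price": [2.0, 3.0, 5.0, 7.0],
})
grouped = df.groupby(["store", "item"])["price"]

per_group = grouped.agg("min")       # one row per group, indexed by (store, item)
per_row = grouped.transform("min")   # one row per original row, same index as df

# The transform() result lines up with df and can be assigned directly.
df["min_price"] = per_row

# The agg() result has to be joined back on the group keys instead.
merged = df.merge(
    per_group.rename("min_price_agg").reset_index(),
    on=["store", "item"],
    how="left",
)

print(len(per_group), len(per_row))      # 3 4
print(df["min_price"].tolist())          # [2.0, 2.0, 5.0, 7.0]
print(merged["min_price_agg"].tolist())  # [2.0, 2.0, 5.0, 7.0]
```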
