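Grouping a dataset and applying a function to each group is an important step in data analysis.
The GroupBy technique
Data in a pandas object is split into groups according to one or more keys that you provide. The split is performed along a particular axis of the object. A function is then applied to each group and produces a new value, and finally the results of all those function calls are combined into a single result object.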
>>> import numpy as np
>>> from pandas import *
>>> df=DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> df
      data1     data2 key1 key2
0 -1.413818 -0.865514    a  one
1 -1.001804  0.309597    a  two
2  0.357458 -0.387695    b  one
3  0.674294 -0.977009    b  two
4 -0.090150  2.444888    a  one
>>> grouped=df['data1'].groupby(df['key1'])
>>> grouped
<pandas.core.groupby.SeriesGroupBy object at 0x04005770>
# This creates a GroupBy object. Nothing has been computed yet; call a method on it to run the computation.
>>> grouped.mean()
key1
a   -0.835257
b    0.515876
Name: data1, dtype: float64
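# Column names can also be passed directly as group keys. Numeric columns are aggregated and non-numeric columns (key2 in the first call) are excluded from the result. This is the behavior of the pandas version used here; pandas 2.0 and later raise a TypeError instead unless numeric_only=True is passed to mean().
>>> df.groupby('key1').mean()
         data1     data2
key1
a    -0.835257  0.629657
b     0.515876 -0.682352
>>> df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2
a    one  -0.751984  0.789687
     two  -1.001804  0.309597
b    one   0.357458 -0.387695
     two   0.674294 -0.977009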
Whatever you plan to do with groupby, you will probably use its size method, which returns a Series containing the size of each group:
>>> df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
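The split, apply, and combine steps described in this section can be written out by hand to see what groupby does in one call. This is only a sketch on a small frame with made-up, deterministic values (not the random data used in the session), and the variable names are illustrative.

```python
import pandas as pd

# Small deterministic frame (illustrative values, not the random data above).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Split: collect the rows that belong to each key.
pieces = {key: df.loc[df["key1"] == key, "data1"] for key in df["key1"].unique()}

# Apply: reduce every piece to a single value.
means = {key: piece.mean() for key, piece in pieces.items()}

# Combine: assemble the per-group values into one result object.
manual = pd.Series(means, name="data1").sort_index()

# groupby performs the same three steps in one call.
grouped = df.groupby("key1")["data1"].mean()
```

Both manual and grouped hold one mean per key; the groupby version also labels the index with the key name.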
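1. Iterating over groups
A GroupBy object supports iteration. It yields a sequence of 2-tuples, each made up of the group name and the corresponding chunk of data.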
>>> for name,group in df.groupby('key1'):
...     print(name)
...     print(group)
...
a
      data1     data2 key1 key2
0 -1.413818 -0.865514    a  one
1 -1.001804  0.309597    a  two
4 -0.090150  2.444888    a  one
b
      data1     data2 key1 key2
2  0.357458 -0.387695    b  one
3  0.674294 -0.977009    b  two
With multiple keys, the first element of each tuple is itself a tuple of key values.
>>> for (k1,k2),group in df.groupby(['key1','key2']):
...     print(k1,k2)
...     print(group)
...
a one
      data1     data2 key1 key2
0 -1.413818 -0.865514    a  one
4 -0.090150  2.444888    a  one
a two
      data1     data2 key1 key2
1 -1.001804  0.309597    a  two
b one
      data1     data2 key1 key2
2  0.357458 -0.387695    b  one
b two
      data1     data2 key1 key2
3  0.674294 -0.977009    b  two
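Because iteration yields (name, group) pairs, a common follow-up is to collect the pieces into a dict so that any group can be looked up by name. A minimal sketch with made-up values; pieces and row_counts are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Each iteration yields (group name, sub-DataFrame), so a dict
# comprehension gives direct access to any group by name.
pieces = {name: group for name, group in df.groupby("key1")}

# Number of rows in every group.
row_counts = {name: len(group) for name, group in pieces.items()}
```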
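By default groupby groups along axis=0, but you can group along another axis as well (note that recent pandas releases deprecate the axis argument of groupby).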
>>> df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
>>> grouped=df.groupby(df.dtypes,axis=1)
>>> dict(list(grouped))
{dtype('O'):   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one, dtype('float64'):       data1     data2
0 -1.413818 -0.865514
1 -1.001804  0.309597
2  0.357458 -0.387695
3  0.674294 -0.977009
4 -0.090150  2.444888}
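2. Selecting a column or a subset of columns
Indexing a GroupBy object created from a DataFrame with a column name, or with a list of column names, selects just those columns for aggregation.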
>>> df
      data1     data2 key1 key2
0 -1.413818 -0.865514    a  one
1 -1.001804  0.309597    a  two
2  0.357458 -0.387695    b  one
3  0.674294 -0.977009    b  two
4 -0.090150  2.444888    a  one
>>> df.groupby('key1')['data1']
<pandas.core.groupby.SeriesGroupBy object at 0x04005FB0>
>>> df.groupby('key1')['data1'].mean()
key1
a   -0.835257
b    0.515876
Especially with large datasets, you may only need to aggregate a few of the columns.
>>> df.groupby(['key1','key2'])[['data2']].mean()  # note the list form [['data2']]: it returns a DataFrame, while a scalar column name returns a Series
             data2
key1 key2
a    one   0.789687
     two   0.309597
b    one  -0.387695
     two  -0.977009
>>> df.groupby(['key1','key2'])['data2'].mean()
key1  key2
a     one     0.789687
      two     0.309597
b     one    -0.387695
      two    -0.977009
Name: data2, dtype: float64
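The difference between the scalar and the list form of column selection is easy to check by looking at the result types. A short sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

as_series = df.groupby("key1")["data2"].mean()    # scalar name -> Series
as_frame = df.groupby("key1")[["data2"]].mean()   # list of names -> DataFrame
```

The numbers are identical; only the container differs, which matters when the result is merged or concatenated later.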
3. Grouping with dicts and Series
Grouping information can also exist in forms other than arrays.
>>> people=DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['joe','steve','wes','jim','travis'])
>>> people
               a         b         c         d         e
joe    -1.136829 -0.549897  1.382399 -1.457968 -1.975316
steve   0.633057  0.905028  0.615449 -1.307026 -0.150066
wes     0.715308 -1.546033  1.090450 -0.699447  0.308514
jim     0.127834  0.134140  0.218690  0.298301  0.722678
travis  1.561881  0.283804  0.017650  1.231204 -1.732033
>>> people.iloc[2:3,[1,2]]=np.nan  # positional equivalent of the original people.ix[2:3,['b','c']]; .ix was removed in pandas 1.0
>>> people
               a         b         c         d         e
joe    -1.136829 -0.549897  1.382399 -1.457968 -1.975316
steve   0.633057  0.905028  0.615449 -1.307026 -0.150066
wes     0.715308       NaN       NaN -0.699447  0.308514
jim     0.127834  0.134140  0.218690  0.298301  0.722678
travis  1.561881  0.283804  0.017650  1.231204 -1.732033
>>> mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red'}
>>> by_column=people.groupby(mapping,axis=1)
>>> by_column.sum()
            blue       red
joe    -0.075569 -3.662042
steve  -0.691577  1.388018
wes    -0.699447  1.023822
jim     0.516991  0.984652
travis  1.248854  0.113652
A Series works the same way; it can be viewed as a fixed-size mapping.
>>> map_series=Series(mapping)
>>> map_series
a     red
b     red
c    blue
d    blue
e     red
dtype: object
>>> people.groupby(map_series,axis=1).sum()
            blue       red
joe    -0.075569 -3.662042
steve  -0.691577  1.388018
wes    -0.699447  1.023822
jim     0.516991  0.984652
travis  1.248854  0.113652
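Recent pandas releases deprecate grouping with axis=1, as far as I know, so the column mapping may need another route there. One option, sketched here on a small made-up frame, is to transpose, group the rows through the mapping, and transpose back. This is a workaround under that assumption, not the method used in the session above.

```python
import pandas as pd

people = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["joe", "steve"],
    columns=["a", "b", "c"],
)
mapping = {"a": "red", "b": "red", "c": "blue"}

# Transpose so the columns become rows, group those rows through the
# mapping, then transpose back to the original orientation.
by_color = people.T.groupby(mapping).sum().T
```

The transpose is cheap for a uniformly typed frame like this one; with mixed column types it would upcast everything to object, so it is best applied to numeric columns only.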
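4. Grouping with functions
Compared with a dict or a Series, a Python function is a more flexible and abstract way to define a group mapping. Any function passed as a group key is called once per index value, and its return values are used as the group names.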
# Group by the length of each person's name
>>> people.groupby(len).sum()
          a         b         c         d         e
3 -0.293687 -0.415757  1.601089 -1.859114 -0.944124
5  0.633057  0.905028  0.615449 -1.307026 -0.150066
6  1.561881  0.283804  0.017650  1.231204 -1.732033
Mixing functions with lists and dicts is not a problem either, because everything is converted to arrays internally.
>>> keyliat=['one','one','one','two','two']
>>> people.groupby([len,keyliat]).min()
              a         b         c         d         e
3 one -1.136829 -0.549897  1.382399 -1.457968 -1.975316
  two  0.127834  0.134140  0.218690  0.298301  0.722678
5 one  0.633057  0.905028  0.615449 -1.307026 -0.150066
6 two  1.561881  0.283804  0.017650  1.231204 -1.732033
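To make the "called once per index value" behavior concrete, here is a small sketch with made-up integer scores, so the group totals can be checked by hand. The lambda that groups by first letter is an illustrative addition.

```python
import pandas as pd

scores = pd.Series(
    [1, 2, 3, 4, 5],
    index=["joe", "steve", "wes", "jim", "travis"],
)

# len is called on each index label; the returned lengths are the group names.
by_length = scores.groupby(len).sum()

# Same idea with a custom function: group by the first letter of the name.
by_initial = scores.groupby(lambda name: name[0]).sum()
```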
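5. Grouping by index level
The most convenient thing about hierarchically indexed datasets is that they can be aggregated by index level. To do so, pass the level number or name through the level keyword.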
>>> import numpy as np
>>> columns=MultiIndex.from_arrays([['us','us','us','jp','jp'],[1,3,5,1,3]],names=['cty','tennor'])  # not defined in the original; reconstructed from the output below
>>> hief_df=DataFrame(np.random.randn(4,5),columns=columns)
>>> hief_df
cty           us                            jp
tennor         1         3         5         1         3
0      -0.185892 -0.517436 -0.040285  1.274849  0.015439
1      -1.757972 -0.650451  0.863938  0.467745 -0.288524
2       1.512232 -0.494746 -0.119517  1.047349 -0.627444
3      -0.656453  0.858041  1.218276  1.138983  0.997657
>>> hief_df.groupby(level='cty',axis=1).count()
cty  jp  us
0     2   3
1     2   3
2     2   3
3     2   3
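The level keyword works the same way on a row index. A minimal sketch with a two-level row index and made-up values, showing that the level can be given either by name or by position:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["us", "us", "us", "jp", "jp"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
values = pd.Series([10, 20, 30, 40, 50], index=index)

by_name = values.groupby(level="cty").sum()   # level given by name
by_number = values.groupby(level=0).sum()     # same level given by position
```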
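Data aggregation
Aggregation generally means any data transformation that produces scalar values from arrays. Common aggregations have built-in statistical methods that are fast to call, and you can also define your own.
To use your own aggregation function, pass it to the aggregate or agg method.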
>>> df=DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> df
      data1     data2 key1 key2
0 -1.299938 -1.269616    a  one
1 -0.279184 -0.037004    a  two
2 -0.851559 -0.527337    b  one
3  1.140124  0.882907    b  two
4  0.406030 -0.365484    a  one
>>> grouped=df.groupby('key1')
>>> def peak_to_peak(arr):
...     return arr.max()-arr.min()
...
>>> grouped.agg(peak_to_peak)
         data1     data2
key1
a     1.705968  1.232612
b     1.991683  1.410243
Methods such as describe also work, although strictly speaking they are not aggregations.
>>> grouped.describe()
               data1     data2
key1
a    count  3.000000  3.000000
     mean  -0.391031 -0.557368
     std    0.858466  0.638316
     min   -1.299938 -1.269616
     25%   -0.789561 -0.817550
     50%   -0.279184 -0.365484
     75%    0.063423 -0.201244
     max    0.406030 -0.037004
b    count  2.000000  2.000000
     mean   0.144282  0.177785
     std    1.408332  0.997193
     min   -0.851559 -0.527337
     25%   -0.353638 -0.174776
     50%    0.144282  0.177785
     75%    0.642203  0.530346
     max    1.140124  0.882907
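A custom aggregation is easier to verify on fixed numbers. The sketch below reuses the peak_to_peak idea on made-up values and selects the numeric column explicitly, which avoids relying on how a given pandas version treats string columns.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

def peak_to_peak(arr):
    """Range of a group: largest value minus smallest value."""
    return arr.max() - arr.min()

# Select the numeric column first so the function never sees string data.
ranges = df.groupby("key1")["data1"].agg(peak_to_peak)
```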
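1. Column-wise and multiple function application
As shown above, aggregating a Series or the columns of a DataFrame means either calling aggregate with a custom function or calling a method such as mean or std directly.
When you want to apply different aggregation functions to different columns, or several functions at once, consider the following example (tips is a DataFrame of restaurant tipping records that is assumed to be loaded already; the loading step is not shown):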
>>> tips['tip_pct']=tips['tip']/tips['total_bill']
>>> tips[:6]
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
5       25.29  4.71    Male     No  Sun  Dinner     4  0.186240
>>> grouped=tips.groupby(['sex','smoker'])
>>> grouped_pct=grouped['tip_pct']
# A function name can be passed as a string
>>> grouped_pct.agg('mean')
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
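If you pass a list of functions or function names, the columns of the resulting DataFrame are named after those functions. You do not have to accept the default names: pass a list of (name, function) tuples instead, which is treated as an ordered mapping.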
>>> grouped_pct.agg(['mean','std'])
                   mean       std
sex    smoker
Female No      0.156921  0.036421
       Yes     0.182150  0.071595
Male   No      0.160669  0.041849
       Yes     0.152771  0.090588
>>> grouped_pct.agg([('foo','mean'),('bar',np.std)])
                    foo       bar
sex    smoker
Female No      0.156921  0.036421
       Yes     0.182150  0.071595
Male   No      0.160669  0.041849
       Yes     0.152771  0.090588
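With a DataFrame you can also specify a list of functions to apply to every column, or different functions for different columns. The result then has hierarchical columns.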
>>> functions=['count','mean','max']
>>> result=grouped[['tip_pct','total_bill']].agg(functions)  # a list of column names; the bare tuple form grouped['tip_pct','total_bill'] is no longer accepted by recent pandas
>>> result
              tip_pct                     total_bill
                count      mean       max      count       mean    max
sex    smoker
Female No          54  0.156921  0.252672         54  18.105185  35.83
       Yes         33  0.182150  0.416667         33  17.977879  44.30
Male   No          97  0.160669  0.291990         97  19.791237  48.33
       Yes         60  0.152771  0.710345         60  22.284500  50.81
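Now suppose you want to apply different functions to different columns. The way to do it is to pass agg a dict that maps column names to functions.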
>>> grouped.agg({'tip':np.max,'size':'sum'})
                tip  size
sex    smoker
Female No       5.2   140
       Yes      6.5    74
Male   No       9.0   263
       Yes     10.0   150
>>> grouped.agg({'tip_pct':['min','max','mean'],'size':'sum'})
                tip_pct                     size
                    min       max      mean  sum
sex    smoker
Female No      0.056797  0.252672  0.156921  140
       Yes     0.056433  0.416667  0.182150   74
Male   No      0.071804  0.291990  0.160669  263
       Yes     0.035638  0.710345  0.152771  150
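The dict form can be tried without the tips data. The sketch below uses a tiny made-up table (bills is a hypothetical stand-in, not the real dataset) so every aggregated value can be checked by hand.

```python
import pandas as pd

bills = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0],
    "size": [2, 4, 3, 1],
})

# A dict maps each column to its own aggregation (or list of aggregations).
result = bills.groupby("smoker").agg({"tip": ["min", "max"], "size": "sum"})
```

Because one column received a list of functions, the result has two-level columns such as ('tip', 'max') and ('size', 'sum').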
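2. Returning aggregated data without a group index
By default, aggregated data comes back with an index built from the unique group key combinations. You can turn this off by passing as_index=False to groupby.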
>>> tips.groupby(['sex','smoker'],as_index=False).mean()
      sex smoker  total_bill       tip      size   tip_pct
0  Female     No   18.105185  2.773519  2.592593  0.156921
1  Female    Yes   17.977879  2.931515  2.242424  0.182150
2    Male     No   19.791237  3.113402  2.711340  0.160669
3    Male    Yes   22.284500  3.051167  2.500000  0.152771
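The same flat layout can also be obtained by aggregating normally and then calling reset_index. A small sketch on a made-up table (bills is a hypothetical stand-in for tips):

```python
import pandas as pd

bills = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip": [1.0, 3.0, 2.0, 5.0],
})

flat = bills.groupby("smoker", as_index=False).mean()
# Equivalent: aggregate with the default index, then move it into a column.
also_flat = bills.groupby("smoker").mean().reset_index()
```

Passing as_index=False simply skips the intermediate indexed result.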
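Group-wise operations and transformations
Aggregation is only one kind of group operation; it is a special case of data transformation. This section introduces the transform and apply methods, which can perform many other kinds of group operations.
The following adds columns holding each group's mean to a DataFrame, by aggregating first and then merging: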
>>> df
      data1     data2 key1 key2
0 -1.359405 -0.567306    a  one
1 -0.298647 -1.078614    a  two
2  0.355256  0.693866    b  one
3 -1.452335 -0.666225    b  two
4  1.036177  1.811104    a  one
>>> k1_means=df.groupby('key1').mean()
>>> k2_means=df.groupby('key1').mean().add_prefix('mean_')
>>> k1_means
         data1     data2
key1
a    -0.207292  0.055061
b    -0.548539  0.013821
>>> k2_means
      mean_data1  mean_data2
key1
a      -0.207292    0.055061
b      -0.548539    0.013821
>>> merge(df,k2_means,left_on='key1',right_index=True)
      data1     data2 key1 key2  mean_data1  mean_data2
0 -1.359405 -0.567306    a  one   -0.207292    0.055061
1 -0.298647 -1.078614    a  two   -0.207292    0.055061
4  1.036177  1.811104    a  one   -0.207292    0.055061
2  0.355256  0.693866    b  one   -0.548539    0.013821
3 -1.452335 -0.666225    b  two   -0.548539    0.013821
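The same thing can be done with the transform method of the GroupBy object. Compare the two results below: transform applies a function to each group and places the results at the positions of the original rows, whereas mean returns one row per group.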
>>> df.groupby('key2').transform(np.mean)
      data1     data2
0  0.010676  0.645888
1 -0.875491 -0.872420
2  0.010676  0.645888
3 -0.875491 -0.872420
4  0.010676  0.645888
>>> df.groupby('key2').mean()
         data1     data2
key2
one   0.010676  0.645888
two  -0.875491 -0.872420
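Since transform returns a result aligned with the original rows, the aggregate-then-merge pattern can be replaced by a direct column assignment. A sketch on made-up values; the column names mean_data1 and demeaned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# transform returns a result aligned with the original rows, so it can be
# assigned straight back as a new column without a merge.
df["mean_data1"] = df.groupby("key1")["data1"].transform("mean")

# Typical follow-up: deviation of each row from its group mean.
df["demeaned"] = df["data1"] - df["mean_data1"]
```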
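1. apply: general split-apply-combine
The most general GroupBy method is apply. It splits the object being processed into pieces, calls the passed function on each piece, and then tries to concatenate the pieces back together.
When you call a method such as describe on a GroupBy object, it is really a shortcut for:
>>> f=lambda x:x.describe()
>>> grouped.apply(f)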
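As an illustration of apply with a function that returns more than one value per group, the sketch below picks the two largest values in each group. The helper top and its parameter n are hypothetical names, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

def top(values, n=2):
    """Return the n largest values of one group (hypothetical helper)."""
    return values.sort_values(ascending=False).head(n)

# Each group's result is a Series; apply glues them together and prefixes
# the group key to the original index.
top_two = df.groupby("key1")["data1"].apply(top, n=2)
```

Extra keyword arguments given to apply (n=2 here) are forwarded to the function.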
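2. Quantile and bucket analysis
pandas has tools such as cut and qcut that slice data into buckets, using either bins you specify or sample quantiles. Combined with groupby, they make bucket or quantile analysis of a dataset straightforward.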
>>> frame=DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
>>> factor=cut(frame.data1,4)
>>> factor[:10]
0     (-1.35, 0.107]
1     (0.107, 1.563]
2     (-1.35, 0.107]
3    (-2.812, -1.35]
4     (0.107, 1.563]
5     (0.107, 1.563]
6     (-1.35, 0.107]
7     (-1.35, 0.107]
8     (-1.35, 0.107]
9      (1.563, 3.02]
Name: data1, dtype: category
Categories (4, object): [(-2.812, -1.35] < (-1.35, 0.107] < (0.107, 1.563] < (1.563, 3.02]]
The factor object returned by cut can be passed directly to groupby. cut splits the data into buckets of equal length:
>>> def get_stats(group):
...     return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
...
>>> grouped=frame.data2.groupby(factor)
>>> grouped.apply(get_stats).unstack()
                 count       max      mean       min
data1
(-2.812, -1.35]     79  2.791474  0.023155 -2.577103
(-1.35, 0.107]     433  2.942033  0.066771 -2.812077
(0.107, 1.563]     437  2.391669  0.022582 -2.654376
(1.563, 3.02]       51  2.652038  0.406708 -2.387372
To get buckets of equal size (the same number of observations in each), use qcut:
>>> grouping=qcut(frame.data1,10,labels=False)
>>> grouped=frame.data2.groupby(grouping)
>>> grouped.apply(get_stats).unstack()
   count       max      mean       min
0    100  2.791474  0.025400 -2.577103
1    100  2.536797 -0.094773 -2.046163
2    100  2.942033  0.243372 -1.671060
3    100  2.566991  0.059096 -2.252417
4    100  2.589560  0.053143 -2.812077
5    100  1.743871 -0.041336 -2.448941
6    100  2.295631  0.157645 -2.264740
7    100  2.391669 -0.012642 -2.076873
8    100  2.164782  0.026390 -2.654376
9    100  2.652038  0.197221 -2.387372
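The difference between equal-length and equal-size buckets shows up clearly when the data contains an outlier. A sketch on ten made-up values, one of them far from the rest:

```python
import pandas as pd

values = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])

# cut: 2 bins of equal width over the value range; the outlier sits alone.
width_bins = pd.cut(values, 2, labels=False)
# qcut: 2 bins holding equal numbers of observations.
quantile_bins = pd.qcut(values, 2, labels=False)

width_counts = values.groupby(width_bins).size()
quantile_counts = values.groupby(quantile_bins).size()
```

With cut, nine values land in the lower bin and only the outlier in the upper one; with qcut each bin receives five values.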
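3. Filling missing values with group-specific values
When cleaning up missing data, sometimes you drop it with dropna, and sometimes you want to fill the NA values with a fixed value or with a value derived from the dataset itself. fillna is the tool for the latter.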
>>> from pandas import *
>>> s=Series(np.random.randn(6))
>>> s[::2]=np.nan
>>> s
0         NaN
1    0.730366
2         NaN
3    1.072793
4         NaN
5   -0.720886
dtype: float64
>>> s.fillna(s.mean())
0    0.360758
1    0.730366
2    0.360758
3    1.072793
4    0.360758
5   -0.720886
dtype: float64
Suppose the fill value needs to differ from group to group. Group the data and use apply with a function that calls fillna on each chunk.
>>> state=['ohio','new york','vermont','florida','oregen','nevada','california','idaho']
>>> group_key=['east']*4+['west']*4
>>> group_key
['east', 'east', 'east', 'east', 'west', 'west', 'west', 'west']
>>> data=Series(np.random.randn(8),index=state)
>>> data[['vermont','nevada','idaho']]=np.nan
>>> data
ohio         -1.032728
new york     -1.162002
vermont            NaN
florida      -0.571487
oregen       -0.997641
nevada             NaN
california    1.149481
idaho              NaN
dtype: float64
>>> data.groupby(group_key).mean()
east   -0.922072
west    0.075920
dtype: float64
# Fill the NA values with the group means
>>> fill_mean=lambda g:g.fillna(g.mean())
>>> data.groupby(group_key).apply(fill_mean)
ohio         -1.032728
new york     -1.162002
vermont      -0.922072
florida      -0.571487
oregen       -0.997641
nevada        0.075920
california    1.149481
idaho         0.075920
dtype: float64
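An alternative to apply for this task, offered here as a sketch rather than as the approach used above, is to broadcast the group means with transform and pass them to fillna. The values are made up so the fills can be checked by hand.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    [1.0, np.nan, 3.0, 10.0, np.nan, 20.0],
    index=["ohio", "vermont", "florida", "nevada", "idaho", "california"],
)
group_key = ["east"] * 3 + ["west"] * 3

# transform("mean") broadcasts each group's mean (NaN values are skipped)
# back to the original index, giving a per-row fill value.
group_means = data.groupby(group_key).transform("mean")
filled = data.fillna(group_means)
```

The result keeps the original index order, because nothing is split apart and concatenated again.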
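4. Group weighted average and correlation
Under the split-apply-combine paradigm, operations between the columns of a DataFrame or between two Series, such as a group weighted average, become routine.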
>>> df=DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
>>> df
  category      data   weights
0        a -1.196080  0.247188
1        a -1.695342  0.914525
2        a  1.521977  0.483654
3        a  0.814892  0.267910
4        b -0.507479  0.204920
5        b -0.696985  0.097827
6        b -0.748492  0.105464
7        b  0.837663  0.404254
>>> grouped=df.groupby('category')
>>> get_wavg=lambda g:np.average(g['data'],weights=g['weights'])
>>> grouped.apply(get_wavg)
category
a   -0.466038
b    0.107713
dtype: float64
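The heading also mentions correlation, which follows the same pattern: apply a function that relates two columns within each group. A sketch with made-up data in which group a is perfectly positively correlated and group b perfectly negatively correlated; the column names x and y are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

# For every category, correlate column x with column y.
corr_by_group = df.groupby("category")[["x", "y"]].apply(
    lambda g: g["x"].corr(g["y"])
)
```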
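5. Group-wise linear regression
You can use groupby for more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value.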
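The notes give no example for this case, so here is a minimal sketch under stated assumptions: it fits a straight line per group with numpy.polyfit (chosen for brevity; any regression routine would do), and the helper name regress and the data are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0, 4.0, 3.0, 2.0],
})

def regress(g):
    """Fit y = slope * x + intercept for one group (hypothetical helper)."""
    slope, intercept = np.polyfit(g["x"], g["y"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})

# Returning a Series per group makes apply build one row per group.
coefs = df.groupby("group")[["x", "y"]].apply(regress)
```

Group a lies exactly on y = 2x + 1 and group b on y = -x + 4, so the fitted coefficients should recover those values up to floating-point error.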
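Pivot tables and cross-tabulation
In pandas, pivot tables can be built with groupby combined with reshape operations. DataFrame also has a pivot_table method, and there is a top-level pandas.pivot_table function as well.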
>>> tips.pivot_table(index=['sex','smoker'])
                   size       tip  total_bill
sex    smoker
Female No      2.592593  2.773519   18.105185
       Yes     2.242424  2.931515   17.977879
Male   No      2.711340  3.113402   19.791237
       Yes     2.500000  3.051167   22.284500
>>> tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker')
                 size
smoker             No       Yes
sex    day
Female Fri   2.500000  2.000000
       Sat   2.307692  2.200000
       Sun   3.071429  2.500000
       Thur  2.480000  2.428571
Male   Fri   2.000000  2.125000
       Sat   2.656250  2.629630
       Sun   2.883721  2.600000
       Thur  2.500000  2.300000
To use a different aggregation function, pass it to the aggfunc argument.
>>> tips.pivot_table('size',index=['sex','smoker'],columns='day',aggfunc=len)
day            Fri  Sat  Sun  Thur
sex    smoker
Female No        2   13   14    25
       Yes       7   15    4     7
Male   No        2   32   43    20
       Yes       8   27   15    10
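The statement that pivot tables can be built from groupby plus a reshape can be checked directly. The sketch below uses a tiny made-up table (bills is a hypothetical stand-in for tips) and compares pivot_table with groupby followed by unstack.

```python
import pandas as pd

bills = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "smoker": ["No", "Yes", "No", "No"],
    "tip": [1.0, 2.0, 3.0, 5.0],
})

pivoted = bills.pivot_table("tip", index="sex", columns="smoker", aggfunc="mean")
# Equivalent construction with groupby followed by unstack.
reshaped = bills.groupby(["sex", "smoker"])["tip"].mean().unstack()
```

Combinations that never occur in the data (a male smoker here) show up as NaN in both results.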
A cross-tabulation (crosstab) is a special kind of pivot table that computes group frequencies.
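>>> crosstab([tips.time,tips.day],tips.smoker,margins=True)  # cross-tabulate the given row keys against the column key; margins=True adds the All subtotals (crosstab comes from the earlier from pandas import *)
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244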
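A frequency table of this kind is equivalent to counting group sizes and reshaping. A sketch on a small made-up table (tips_like is a hypothetical stand-in for tips):

```python
import pandas as pd

tips_like = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
})

counts = pd.crosstab(tips_like["day"], tips_like["smoker"])

# The same frequency table built from groupby plus a reshape.
via_groupby = tips_like.groupby(["day", "smoker"]).size().unstack(fill_value=0)
```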