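5 Hierarchical Indexing
Hierarchical indexing is an important pandas feature. It lets you work with higher-dimensional data in a lower-dimensional form by putting two or more index levels on a single axis.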
In [185]: data = Series(np.random.randn(10),index=[list('aaabbbccdd'),[1,2,3,1,2,3,2,3,2,3]])
In [186]: data
Out[186]:
a 1 0.458553
2 0.077532
3 -1.561180
b 1 2.498391
2 0.243617
3 -0.818542
c 2 -1.222213
3 -0.797079
d 2 1.131352
3 -1.292136
dtype: float64
Inspect the index, then select by outer label, by a slice of outer labels, and by inner label:
In [187]: data.index
Out[187]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 1, 2, 1, 2]])
In [188]: data['b']
Out[188]:
1 2.498391
2 0.243617
3 -0.818542
dtype: float64
In [189]: data['b':'c']
Out[189]:
b 1 2.498391
2 0.243617
3 -0.818542
c 2 -1.222213
3 -0.797079
dtype: float64
In [190]: data[:,2] # select by inner-level label
Out[190]:
a 0.077532
b 0.243617
c -1.222213
d 1.131352
dtype: float64
In [191]: data.unstack() # unstack rearranges the data into a DataFrame
Out[191]:
1 2 3
a 0.458553 0.077532 -1.561180
b 2.498391 0.243617 -0.818542
c NaN -1.222213 -0.797079
d NaN 1.131352 -1.292136
In [192]: data.unstack().stack() # stack is the inverse operation
Out[192]:
a 1 0.458553
2 0.077532
3 -1.561180
b 1 2.498391
2 0.243617
3 -0.818542
c 2 -1.222213
3 -0.797079
d 2 1.131352
3 -1.292136
dtype: float64
Either axis of a DataFrame can have a hierarchical index.
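The selections and the unstack/stack round trip shown above can be condensed into a minimal, reproducible sketch. It assumes fixed values (np.arange) in place of the random data used in the session, and it uses .loc for the inner-level selection. The explicit dropna() is an assumption added so the round trip behaves the same whether or not the installed pandas version drops the NaN cells in stack().

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible.
data = pd.Series(
    np.arange(10, dtype=float),
    index=[list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 2, 3, 2, 3]],
)

outer_b = data["b"]        # all rows under the outer label 'b'
inner_2 = data.loc[:, 2]   # rows whose inner label is 2, for every outer label

wide = data.unstack()      # inner level becomes the columns; missing pairs are NaN

# stack() is the inverse; dropna() removes the (c, 1) and (d, 1) cells
# in case the installed pandas version keeps them.
round_trip = wide.stack().dropna()
```

Because 'c' and 'd' have no inner label 1, the wide table has two NaN cells, and those are the only entries that need special care on the way back.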
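5.1 Reordering Levels
You can rearrange the order of the levels on an axis. swaplevel interchanges two levels and returns a new object; the data itself is unchanged. sortlevel, used below, sorts the rows by the values of a single level.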
In [193]: frame = DataFrame(np.random.randn(4,3),index=[list('aabb'),[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
In [194]: frame
Out[194]:
Ohio Colorado
Green Red Green
a 1 0.368997 0.670430 1.056365
2 -0.352259 -0.656101 0.018544
b 1 -0.574535 -0.531988 0.295466
2 -0.973587 0.225511 -0.250887
In [198]: frame.index.names = ['key1','key2']
In [199]: frame.columns.names = ['state','color']
In [200]: frame
Out[200]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0.368997 0.670430 1.056365
2 -0.352259 -0.656101 0.018544
b 1 -0.574535 -0.531988 0.295466
2 -0.973587 0.225511 -0.250887
In [201]: frame.swaplevel('key1','key2')
Out[201]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0.368997 0.670430 1.056365
2 a -0.352259 -0.656101 0.018544
1 b -0.574535 -0.531988 0.295466
2 b -0.973587 0.225511 -0.250887
In [202]: frame.sortlevel(1)
Out[202]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0.368997 0.670430 1.056365
b 1 -0.574535 -0.531988 0.295466
a 2 -0.352259 -0.656101 0.018544
b 2 -0.973587 0.225511 -0.250887
In [203]: frame.swaplevel(0,1)
Out[203]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0.368997 0.670430 1.056365
2 a -0.352259 -0.656101 0.018544
1 b -0.574535 -0.531988 0.295466
2 b -0.973587 0.225511 -0.250887
In [204]: frame.swaplevel(0,1).sortlevel(0)
Out[204]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0.368997 0.670430 1.056365
b -0.574535 -0.531988 0.295466
2 a -0.352259 -0.656101 0.018544
b -0.973587 0.225511 -0.250887
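The session above uses sortlevel, which was deprecated and later removed from pandas. A minimal sketch of the same reordering, assuming a recent pandas release and fixed values in place of the random data, uses sort_index(level=...) instead:

```python
import numpy as np
import pandas as pd

# Same shape and labels as the frame in the session, with fixed values.
frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Counterpart of frame.sortlevel(1): sort rows by the 'key2' level.
by_key2 = frame.sort_index(level="key2")

# Counterpart of frame.swaplevel(0, 1).sortlevel(0).
swapped = frame.swaplevel("key1", "key2").sort_index(level=0)
```

Both calls return new objects, so frame keeps its original row order.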
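5.2 Summary Statistics by Level
Many descriptive and summary statistics methods on DataFrame and Series have a level option, which specifies the level to aggregate by on a given axis.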
In [205]: frame
Out[205]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0.368997 0.670430 1.056365
2 -0.352259 -0.656101 0.018544
b 1 -0.574535 -0.531988 0.295466
2 -0.973587 0.225511 -0.250887
In [207]: frame.sum(level='key2')
Out[207]:
state Ohio Colorado
color Green Red Green
key2
1 -0.205538 0.138443 1.351831
2 -1.325846 -0.430590 -0.232343
In [209]: frame.sum(level='color',axis=1)
Out[209]:
color Green Red
key1 key2
a 1 1.425362 0.670430
2 -0.333715 -0.656101
b 1 -0.279069 -0.531988
2 -1.224474 0.225511
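Newer pandas releases no longer accept the level argument on sum and similar methods. A hedged sketch of the equivalent aggregation, assuming a recent pandas release and fixed values in place of the random data, goes through groupby(level=...). For the column-wise case it transposes, groups the rows, and transposes back, which is one way to avoid grouping along axis=1.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Counterpart of frame.sum(level='key2'): sum rows that share a key2 value.
by_key2 = frame.groupby(level="key2").sum()

# Counterpart of frame.sum(level='color', axis=1): sum columns that share a color.
by_color = frame.T.groupby(level="color").sum().T
```

by_key2 keeps the two-level columns, while by_color keeps the two-level row index and has one column per color.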
5.3 Using a DataFrame's Columns
You often need to use one or more DataFrame columns as the index, or to move the index back into the DataFrame's columns.
In [210]: df = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one']*7,'d':[0,1,2,0,1,2,3]})
In [211]: df
Out[211]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 one 0
4 4 3 one 1
5 5 2 one 2
6 6 1 one 3
In [212]: df2 = df.set_index(['c','d']) # by default, the columns moved into the index are dropped from the DataFrame
In [213]: df2
Out[213]:
a b
c d
one 0 0 7
1 1 6
2 2 5
0 3 4
1 4 3
2 5 2
3 6 1
In [215]: df2 = df.set_index(['c','d'],drop=False) # drop=False keeps the two columns
In [216]: df2
Out[216]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
0 3 4 one 0
1 4 3 one 1
2 5 2 one 2
3 6 1 one 3
reset_index does the opposite: it moves the index levels back into the DataFrame's columns.
In [217]: df2 = df.set_index(['c','d'])
In [218]: df2
Out[218]:
a b
c d
one 0 0 7
1 1 6
2 2 5
0 3 4
1 4 3
2 5 2
3 6 1
In [219]: df2.reset_index()
Out[219]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 one 0 3 4
4 one 1 4 3
5 one 2 5 2
6 one 3 6 1
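The set_index / reset_index round trip above can be written as a short self-contained sketch. The variable names indexed, kept, and restored are illustrative and do not appear in the original session.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one"] * 7,
    "d": [0, 1, 2, 0, 1, 2, 3],
})

# Move columns 'c' and 'd' into a two-level row index (they leave the columns).
indexed = df.set_index(["c", "d"])

# drop=False builds the same index but also keeps 'c' and 'd' as columns.
kept = df.set_index(["c", "d"], drop=False)

# reset_index turns the index levels back into ordinary columns,
# placed before the remaining columns.
restored = indexed.reset_index()
```

Note that the column order changes on the way back: the former index levels come first.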
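6 Other Topics
6.1 Integer Indexes
Consider this example: with an integer index, it is hard to tell whether a lookup is meant to be by position or by label.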
In [220]: ser = Series(np.arange(3))
In [221]: ser
Out[221]:
0 0
1 1
2 2
dtype: int64
In [222]: ser[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
...
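A Series with a letter (non-integer) index does not have this problem.
If you need reliable position-based indexing regardless of the index type, you can use:
- Series: iget_value
- DataFrame: irow and icol
Newer versions changed this: use iloc for position-based access in all of these cases.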
In [231]: ser3 = Series(np.arange(3),index=[-5,1,3])
In [232]: ser3.iget_value(2)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: iget_value(i) is deprecated. Please use .iloc[i] or .iat[i]
#!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[232]: 2
In [236]: ser3.iloc[2]
Out[236]: 2
In [237]: ser3.iat[2]
Out[237]: 2
In [239]: frame = DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])
In [241]: frame
Out[241]:
0 1
2 0 1
0 2 3
1 4 5
In [242]: frame.irow(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: irow(i) is deprecated. Please use .iloc[i]
#!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[242]:
0 2
1 3
Name: 0, dtype: int64
In [243]: frame.icol(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: icol(i) is deprecated. Please use .iloc[:,i]
#!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[243]:
2 1
0 3
1 5
Name: 1, dtype: int64
In [245]: frame.iloc[1] # select a row by position
Out[245]:
0 2
1 3
Name: 0, dtype: int64
In [246]: frame.iloc[:,1] # select a column by position
Out[246]:
2 1
0 3
1 5
Name: 1, dtype: int64
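To make the position-versus-label distinction explicit, the following minimal sketch reuses the integer-indexed objects from the session and contrasts loc (labels) with iloc and iat (positions). It assumes a pandas version where these accessors are available.

```python
import numpy as np
import pandas as pd

ser3 = pd.Series(np.arange(3), index=[-5, 1, 3])
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

last_by_position = ser3.iloc[-1]   # last element, whatever its label is
first_by_label = ser3.loc[-5]      # element whose label is -5
scalar = ser3.iat[2]               # position-based scalar access

row_by_label = frame.loc[1]        # row labelled 1 (the third row)
row_by_position = frame.iloc[1]    # second row (labelled 0)
col_by_position = frame.iloc[:, 1] # second column
```

With the index [2, 0, 1], loc[1] and iloc[1] return different rows, which is exactly the ambiguity that plain integer indexing cannot resolve.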
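6.2 Panel Data
The Panel data structure can be thought of as a three-dimensional DataFrame.
Each item in a Panel is a DataFrame.
A Panel can equally be represented as a stacked (hierarchically indexed) DataFrame. Panel has since been removed from pandas, so the session below only runs on the older release it was recorded with.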
In [247]: import pandas.io.data as web
/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py:35: FutureWarning:
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
FutureWarning)
In [248]: web
Out[248]: <module 'pandas.io.data' from '/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py'>
In [249]: pdata = pd.Panel(dict((stk ,web.get_data_yahoo(stk,'1/1/2009','6/1/2012')) for stk in ['AAPL','GOOG','MSFT','DELL']))
In [250]: pdata
Out[250]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close
In [252]: pdata = pdata.swapaxes('items','minor')
In [253]: pdata['Adj Close']
Out[253]:
AAPL DELL GOOG MSFT
Date
2009-01-02 11.808505 10.39902 160.499779 16.501303
...
2012-05-30 75.362333 12.14992 293.821674 25.878448
2012-05-31 75.174961 11.92743 290.140354 25.746145
2012-06-01 72.996726 11.67592 285.205295 25.093451
[868 rows x 4 columns]
In [256]: pdata.ix[:,'6/1/2012',:] # ix extends to three dimensions
Out[256]:
Open High Low Close Volume Adj Close
AAPL 569.159996 572.650009 560.520012 560.989983 130246900.0 72.996726
DELL 12.150000 12.300000 12.045000 12.070000 19397600.0 11.675920
GOOG 571.790972 572.650996 568.350996 570.981000 6138700.0 285.205295
MSFT 28.760000 28.959999 28.440001 28.450001 56634300.0 25.093451
In [260]: pdata.ix[:,'5/30/2012':,:].to_frame()
Out[260]:
Open High Low Close Volume \
Date minor
2012-05-30 AAPL 569.199997 579.989990 566.559990 579.169998 132357400.0
DELL 12.590000 12.700000 12.460000 12.560000 19787800.0
GOOG 588.161028 591.901014 583.530999 588.230992 3827600.0
MSFT 29.350000 29.480000 29.120001 29.340000 41585500.0
2012-05-31 AAPL 580.740021 581.499985 571.460022 577.730019 122918600.0
DELL 12.530000 12.540000 12.330000 12.330000 19955600.0
GOOG 588.720982 590.001032 579.001013 580.860990 5958800.0
MSFT 29.299999 29.420000 28.940001 29.190001 39134000.0
2012-06-01 AAPL 569.159996 572.650009 560.520012 560.989983 130246900.0
DELL 12.150000 12.300000 12.045000 12.070000 19397600.0
GOOG 571.790972 572.650996 568.350996 570.981000 6138700.0
MSFT 28.760000 28.959999 28.440001 28.450001 56634300.0
Adj Close
Date minor
2012-05-30 AAPL 75.362333
DELL 12.149920
GOOG 293.821674
MSFT 25.878448
2012-05-31 AAPL 75.174961
DELL 11.927430
GOOG 290.140354
MSFT 25.746145
2012-06-01 AAPL 72.996726
DELL 11.675920
GOOG 285.205295
MSFT 25.093451
# A Panel can be converted to a DataFrame
In [261]: stacked = pdata.ix[:,'5/30/2012':,:].to_frame()
In [262]: stacked.to_panel() # convert back to a Panel
Out[262]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
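Since Panel, .ix, and pandas.io.data are not available in recent pandas releases, the stacked representation described above is the form that still works. The sketch below is an illustration under stated assumptions: the prices are made-up numbers rather than Yahoo data, only two tickers and two price columns are used, and the level name ticker is hypothetical (the session's to_frame output calls it minor).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2012-05-30", periods=3, freq="D", name="Date")

# Made-up prices, NOT real market data: one small DataFrame per ticker.
frames = {
    ticker: pd.DataFrame(
        {"Open": np.arange(3.0) + 10 * i, "Close": np.arange(3.0) + 10 * i + 0.5},
        index=dates,
    )
    for i, ticker in enumerate(["AAPL", "DELL"])
}

# Stack the per-ticker frames into one DataFrame indexed by (Date, ticker),
# the same layout that Panel.to_frame() produced.
stacked = (
    pd.concat(frames, names=["ticker", "Date"])
    .swaplevel("ticker", "Date")
    .sort_index()
)

# One price field for all tickers, similar to pdata['Adj Close'] after swapaxes.
close_wide = stacked["Close"].unstack("ticker")

# All fields for one date, similar to pdata.ix[:, '6/1/2012', :].
one_day = stacked.xs(pd.Timestamp("2012-06-01"), level="Date")
```

Selecting along a "third dimension" becomes a selection on one index level (xs) or a reshaping of one column (unstack), so no separate three-dimensional container is needed.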