Pandas Beginner Study Notes 4


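5 Hierarchical Indexing

Hierarchical indexing is an important pandas feature. It lets you work with higher-dimensional data in a lower-dimensional form by putting two or more index levels on a single axis.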

In [185]: data = Series(np.random.randn(10),index=[list('aaabbbccdd'),[1,2,3,1,2,3,2,3,2,3]])

In [186]: data
Out[186]:
a  1    0.458553
   2    0.077532
   3   -1.561180
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
d  2    1.131352
   3   -1.292136
dtype: float64

Inspect the index.

In [187]: data.index
Out[187]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 1, 2, 1, 2]])
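
As a small sketch of how to read the MultiIndex repr above: the levels hold the unique values of each index level, and the integer arrays say which level value each row uses. The attribute name `codes` is an assumption about the installed version; the `labels` in the output above is what older pandas releases called the same arrays.

```python
import pandas as pd

# Same two-level index as in the session above, built explicitly
index = pd.MultiIndex.from_arrays(
    [list('aaabbbccdd'), [1, 2, 3, 1, 2, 3, 2, 3, 2, 3]]
)

# Unique values of each level
outer_levels = list(index.levels[0])
inner_levels = [int(v) for v in index.levels[1]]

# Per-row positions into the level arrays (called `labels` in old pandas)
outer_codes = [int(c) for c in index.codes[0]]
inner_codes = [int(c) for c in index.codes[1]]

# Rebuild the label of the first row from levels + codes
first_label = (outer_levels[outer_codes[0]], inner_levels[inner_codes[0]])

print(outer_levels, inner_levels)
print(outer_codes)
print(inner_codes)
print(first_label)
```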

In [188]: data['b']
Out[188]:
1    2.498391
2    0.243617
3   -0.818542
dtype: float64

In [189]: data['b':'c']
Out[189]:
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
dtype: float64

In [190]: data[:,2]  # select by the inner index level
Out[190]:
a    0.077532
b    0.243617
c   -1.222213
d    1.131352
dtype: float64


In [191]: data.unstack()  # unstack rearranges the data into a DataFrame
Out[191]:
          1         2         3
a  0.458553  0.077532 -1.561180
b  2.498391  0.243617 -0.818542
c       NaN -1.222213 -0.797079
d       NaN  1.131352 -1.292136

In [192]: data.unstack().stack()  # stack is the inverse operation
Out[192]:
a  1    0.458553
   2    0.077532
   3   -1.561180
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
d  2    1.131352
   3   -1.292136
dtype: float64
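
The selection and reshaping steps above can be repeated with deterministic values so the results are easy to check by hand. This is a minimal sketch: np.arange replaces the random numbers, and .loc is used for the inner-level selection as an assumption about what works most reliably across pandas versions.

```python
import numpy as np
import pandas as pd

# Values 0..9 instead of random numbers, same two-level index as above
data = pd.Series(
    np.arange(10.0),
    index=[list('aaabbbccdd'), [1, 2, 3, 1, 2, 3, 2, 3, 2, 3]],
)

# Partial indexing on the outer level returns the inner-level Series
b_part = data['b']

# Select inner label 2 across all outer labels
inner_2 = data.loc[:, 2]

# Move the inner level into columns; missing combinations become NaN
wide = data.unstack()

# stack goes back to a Series; dropna removes the NaN placeholders
# in case the installed version of stack keeps them
roundtrip = wide.stack().dropna()

print(b_part)
print(inner_2)
print(wide)
```

Here (c, 1) and (d, 1) do not exist in the original Series, so unstack fills those two cells with NaN.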

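In a DataFrame, either axis can have a hierarchical index.

5.1 Reordering and Sorting Levels

You can rearrange the order of the levels on an axis. swaplevel interchanges two levels and returns a new object; the data itself is unchanged. sortlevel sorts the data by the values of a single level.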

In [193]: frame = DataFrame(np.random.randn(4,3),index=[list('aabb'),[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
     ...:

In [194]: frame
Out[194]:
         Ohio            Colorado
        Green       Red     Green
a 1  0.368997  0.670430  1.056365
  2 -0.352259 -0.656101  0.018544
b 1 -0.574535 -0.531988  0.295466
  2 -0.973587  0.225511 -0.250887

In [198]: frame.index.names = ['key1','key2']

In [199]: frame.columns.names = ['state','color']

In [200]: frame
Out[200]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
     2    -0.352259 -0.656101  0.018544
b    1    -0.574535 -0.531988  0.295466
     2    -0.973587  0.225511 -0.250887

In [201]: frame.swaplevel('key1','key2')
Out[201]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
2    a    -0.352259 -0.656101  0.018544
1    b    -0.574535 -0.531988  0.295466
2    b    -0.973587  0.225511 -0.250887

In [202]: frame.sortlevel(1)
Out[202]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
b    1    -0.574535 -0.531988  0.295466
a    2    -0.352259 -0.656101  0.018544
b    2    -0.973587  0.225511 -0.250887

In [203]: frame.swaplevel(0,1)
Out[203]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
2    a    -0.352259 -0.656101  0.018544
1    b    -0.574535 -0.531988  0.295466
2    b    -0.973587  0.225511 -0.250887

In [204]: frame.swaplevel(0,1).sortlevel(0)
Out[204]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
     b    -0.574535 -0.531988  0.295466
2    a    -0.352259 -0.656101  0.018544
     b    -0.973587  0.225511 -0.250887
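
sortlevel, used in the session above, was deprecated in later pandas releases; sort_index(level=...) gives the same ordering. The sketch below repeats the steps that way, with np.arange values in place of the random numbers (an assumption made so the results can be checked by hand).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list('aabb'), [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swap the two row levels; row order and data stay the same
swapped = frame.swaplevel('key1', 'key2')

# Replacement for frame.sortlevel(1): sort rows by the key2 level
by_key2 = frame.sort_index(level='key2')

# Replacement for frame.swaplevel(0, 1).sortlevel(0)
swapped_sorted = frame.swaplevel(0, 1).sort_index(level=0)

print(swapped)
print(by_key2)
print(swapped_sorted)
```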

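5.2 Summary Statistics by Level

Many summary and descriptive statistics methods on DataFrame and Series have a level option, which names the level of an axis to aggregate by.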


In [205]: frame
Out[205]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
     2    -0.352259 -0.656101  0.018544
b    1    -0.574535 -0.531988  0.295466
     2    -0.973587  0.225511 -0.250887


In [207]: frame.sum(level='key2')
Out[207]:
state      Ohio            Colorado
color     Green       Red     Green
key2                               
1     -0.205538  0.138443  1.351831
2     -1.325846 -0.430590 -0.232343

In [209]: frame.sum(level='color',axis=1)
Out[209]:
color         Green       Red
key1 key2                    
a    1     1.425362  0.670430
     2    -0.333715 -0.656101
b    1    -0.279069 -0.531988
     2    -1.224474  0.225511
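
Recent pandas releases (2.0 and later) no longer accept the level argument on sum, so the calls above only run on older versions. A hedged equivalent uses groupby(level=...); for the column axis, this sketch transposes, groups, and transposes back, which is one of several possible spellings. The np.arange values are placeholders chosen so the sums are easy to verify.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list('aabb'), [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Equivalent of frame.sum(level='key2'): sum rows that share a key2 value
by_key2 = frame.groupby(level='key2').sum()

# Equivalent of frame.sum(level='color', axis=1):
# group the columns by color via a transpose round trip
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```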

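5.3 Using a DataFrame's Columns

You often want to use one or more DataFrame columns as the row index, or conversely to move the row index into the DataFrame's columns.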

In [210]: df = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one']*7,'d':[0,1,2,0,1,2,3]})

In [211]: df
Out[211]:
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  one  0
4  4  3  one  1
5  5  2  one  2
6  6  1  one  3

In [212]: df2 = df.set_index(['c','d'])  # by default, the columns moved into the index are dropped from the DataFrame

In [213]: df2
Out[213]:
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
    0  3  4
    1  4  3
    2  5  2
    3  6  1

In [215]: df2 = df.set_index(['c','d'],drop=False)  # drop=False keeps the two columns as well

In [216]: df2
Out[216]:
       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
    0  3  4  one  0
    1  4  3  one  1
    2  5  2  one  2
    3  6  1  one  3

reset_index does the opposite of set_index: it moves the index levels back into the DataFrame's columns.

In [217]: df2 = df.set_index(['c','d'])

In [218]: df2
Out[218]:
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
    0  3  4
    1  4  3
    2  5  2
    3  6  1

In [219]: df2.reset_index()
Out[219]:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  one  0  3  4
4  one  1  4  3
5  one  2  5  2
6  one  3  6  1
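
The round trip between columns and index can be written as a short self-contained script. This is only a sketch of the steps shown above; the variable names indexed, kept, and restored are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one'] * 7,
    'd': [0, 1, 2, 0, 1, 2, 3],
})

# Move columns c and d into a two-level row index (they leave the columns)
indexed = df.set_index(['c', 'd'])

# drop=False copies them into the index but keeps them as columns too
kept = df.set_index(['c', 'd'], drop=False)

# reset_index moves the index levels back, placing them first
restored = indexed.reset_index()

print(list(indexed.columns))
print(list(kept.columns))
print(list(restored.columns))
```

Note that the restored columns come back in the order c, d, a, b, so the column order differs from the original df even though the data is the same.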

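6 Other Topics

6.1 Integer Indexing

Start with an example. When the index itself holds integers, it is hard to tell whether ser[-1] is meant as a position or as a label; pandas treats it as a label and raises KeyError because no label -1 exists.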

In [220]: ser = Series(np.arange(3))

In [221]: ser
Out[221]:
0    0
1    1
2    2
dtype: int64

In [222]: ser[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...

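A Series with a non-integer index (for example, letters) does not have this ambiguity.

If you need reliable position-based indexing regardless of the index type, you can use:

  • Series: iget_value
  • DataFrame: irow and icol

Newer versions have changed this: those methods are deprecated, and iloc is used for position-based access.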

In [231]: ser3 = Series(np.arange(3),index=[-5,1,3])

In [232]: ser3.iget_value(2)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: iget_value(i) is deprecated. Please use .iloc[i] or .iat[i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[232]: 2

In [236]: ser3.iloc[2]
Out[236]: 2

In [237]: ser3.iat[2]
Out[237]: 2
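
The difference between label-based and position-based access on an integer index can be made explicit with loc and iloc. This is a minimal sketch using the same ser3 as above; the letters Series and the label_2_found flag are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

ser3 = pd.Series(np.arange(3), index=[-5, 1, 3])

# loc always interprets the key as an index label
by_label = ser3.loc[3]

# iloc and iat always interpret the key as a position
by_position = ser3.iloc[2]
last_item = ser3.iloc[-1]
first_item = ser3.iat[0]

# Label 2 does not exist, even though position 2 does
try:
    ser3.loc[2]
    label_2_found = True
except KeyError:
    label_2_found = False

# With a string index, position and label can never be confused
letters = pd.Series(np.arange(3), index=list('abc'))
last_letter_value = letters.iloc[-1]

print(by_label, by_position, last_item, first_item)
print(label_2_found, last_letter_value)
```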

In [239]: frame = DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])

In [241]: frame
Out[241]:
   0  1
2  0  1
0  2  3
1  4  5

In [242]: frame.irow(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: irow(i) is deprecated. Please use .iloc[i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[242]:
0    2
1    3
Name: 0, dtype: int64

In [243]: frame.icol(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: icol(i) is deprecated. Please use .iloc[:,i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[243]:
2    1
0    3
1    5
Name: 1, dtype: int64

In [245]: frame.iloc[1]  # select a row by position
Out[245]:
0    2
1    3
Name: 0, dtype: int64

In [246]: frame.iloc[:,1]  # select a column by position
Out[246]:
2    1
0    3
1    5
Name: 1, dtype: int64
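
For a DataFrame whose row labels are integers in a shuffled order, loc and iloc give different rows for the same number. The sketch below contrasts the two on the same frame as above; the variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

# Position 1 is the second row, whose label happens to be 0
row_by_position = frame.iloc[1]

# Label 1 is the third row
row_by_label = frame.loc[1]

# Second column by position, all rows
col_by_position = frame.iloc[:, 1]

print(row_by_position.tolist(), row_by_position.name)
print(row_by_label.tolist(), row_by_label.name)
print(col_by_position.tolist())
```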

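6.2 Panel Data

The Panel data structure can be thought of as a three-dimensional analogue of DataFrame.
Each item in a Panel is a DataFrame.
A Panel can equally be represented as a "stacked" (hierarchically indexed) DataFrame.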

In [247]: import pandas.io.data as web
/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py:35: FutureWarning:
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
  FutureWarning)

In [248]: web
Out[248]: <module 'pandas.io.data' from '/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py'>

In [249]: pdata = pd.Panel(dict((stk ,web.get_data_yahoo(stk,'1/1/2009','6/1/2012')) for stk in ['AAPL','GOOG','MSFT','DELL']))

In [250]: pdata
Out[250]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close

In [252]: pdata = pdata.swapaxes('items','minor')

In [253]: pdata['Adj Close']
Out[253]:
                 AAPL      DELL        GOOG       MSFT
Date                                                  
2009-01-02  11.808505  10.39902  160.499779  16.501303
...
2012-05-30  75.362333  12.14992  293.821674  25.878448
2012-05-31  75.174961  11.92743  290.140354  25.746145
2012-06-01  72.996726  11.67592  285.205295  25.093451

[868 rows x 4 columns]

In [256]: pdata.ix[:,'6/1/2012',:]  # ix extends to three dimensions
Out[256]:
            Open        High         Low       Close       Volume   Adj Close
AAPL  569.159996  572.650009  560.520012  560.989983  130246900.0   72.996726
DELL   12.150000   12.300000   12.045000   12.070000   19397600.0   11.675920
GOOG  571.790972  572.650996  568.350996  570.981000    6138700.0  285.205295
MSFT   28.760000   28.959999   28.440001   28.450001   56634300.0   25.093451

In [260]: pdata.ix[:,'5/30/2012':,:].to_frame()
Out[260]:
                        Open        High         Low       Close       Volume  \
Date       minor                                                                
2012-05-30 AAPL   569.199997  579.989990  566.559990  579.169998  132357400.0   
           DELL    12.590000   12.700000   12.460000   12.560000   19787800.0   
           GOOG   588.161028  591.901014  583.530999  588.230992    3827600.0   
           MSFT    29.350000   29.480000   29.120001   29.340000   41585500.0   
2012-05-31 AAPL   580.740021  581.499985  571.460022  577.730019  122918600.0   
           DELL    12.530000   12.540000   12.330000   12.330000   19955600.0   
           GOOG   588.720982  590.001032  579.001013  580.860990    5958800.0   
           MSFT    29.299999   29.420000   28.940001   29.190001   39134000.0   
2012-06-01 AAPL   569.159996  572.650009  560.520012  560.989983  130246900.0   
           DELL    12.150000   12.300000   12.045000   12.070000   19397600.0   
           GOOG   571.790972  572.650996  568.350996  570.981000    6138700.0   
           MSFT    28.760000   28.959999   28.440001   28.450001   56634300.0   

                   Adj Close  
Date       minor              
2012-05-30 AAPL    75.362333  
           DELL    12.149920  
           GOOG   293.821674  
           MSFT    25.878448  
2012-05-31 AAPL    75.174961  
           DELL    11.927430  
           GOOG   290.140354  
           MSFT    25.746145  
2012-06-01 AAPL    72.996726  
           DELL    11.675920  
           GOOG   285.205295  
           MSFT    25.093451  

# A Panel can be converted to a stacked DataFrame
In [261]: stacked = pdata.ix[:,'5/30/2012':,:].to_frame()

In [262]: stacked.to_panel()  # convert back to a Panel
Out[262]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
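
Panel was deprecated and later removed from pandas (it is gone as of 0.25), and .ix was removed as well, so the session above only runs on old versions. The stacked-DataFrame representation still works, and the following is a hedged sketch of it. The tickers AAA and BBB and all numbers are made-up placeholders, not real quotes, and no data is downloaded.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2012-05-30', periods=3, freq='D')
fields = ['Open', 'Close']

# One small DataFrame per (hypothetical) ticker, like the items of a Panel
per_ticker = {
    'AAA': pd.DataFrame(np.arange(6.0).reshape(3, 2), index=dates, columns=fields),
    'BBB': pd.DataFrame(np.arange(6.0).reshape(3, 2) + 100, index=dates, columns=fields),
}

# Concatenating a dict makes the keys the outer index level
stacked = pd.concat(per_ticker)
stacked.index.names = ['ticker', 'Date']

# Put Date first, as in the to_frame() output above
stacked = stacked.swaplevel('ticker', 'Date').sort_index()

# Roughly what pdata.ix[:, '6/1/2012', :] returned: one date, all tickers
one_day = stacked.xs(pd.Timestamp('2012-06-01'), level='Date')

# Roughly what pdata['Adj Close'] returned: one field, dates x tickers
close_wide = stacked['Close'].unstack('ticker')

print(stacked)
print(one_day)
print(close_wide)
```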

