pandas: Hierarchical Indexing


In many applications, data may be spread across a number of files or datasets, or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine and rearrange data.

import numpy as np 
import pandas as pd

Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher-dimensional data in a lower-dimensional form. Let's start with a simple example: create a Series with a list of lists (or arrays) as the index:

data = pd.Series(np.random.randn(9),
                index=['a,a,a,b,b,c,c,d,d'.split(','),
                      [1,2,3,1,3,1,2,2,3]])

data
a  1    0.874880
   2    1.424326
   3   -2.028509
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
d  2   -1.008064
   3    0.988234
dtype: float64

What you're seeing is a prettified view of a Series with a MultiIndex as its index. The "gaps" in the index display mean "use the label directly above":

data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
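The repr above comes from an older pandas release. As far as I know, from pandas 0.24 onward the integer positions are exposed as `codes` rather than `labels`. A minimal sketch of inspecting the pieces of a MultiIndex, using fixed values (an assumption made here so the result is reproducible) instead of random numbers:

```python
import pandas as pd

# Fixed values (an illustrative assumption) instead of np.random.randn
data = pd.Series(range(9),
                 index=[list('aaabbccdd'),
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = data.index
print(idx.nlevels)            # number of index levels
print(list(idx.levels[0]))    # unique labels of the outer level
print(list(idx.levels[1]))    # unique labels of the inner level
print(list(idx.codes[0]))     # for each row, the position of its outer label in levels[0]
```

Each row's label is stored as a pair of small integers pointing into `levels`, which is why the display can leave the repeated outer labels blank.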

With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

data['b']  # 1 3
1   -1.081833
3   -0.072116
dtype: float64
data['b':'c']  # 1 3 1 2
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
dtype: float64
data.loc[['b', 'd']]  # loc selects by label; iloc selects by integer position
b  1   -1.081833
   3   -0.072116
d  2   -1.008064
   3    0.988234
dtype: float64

Selection is even possible from an inner level:

data.loc[:, 2]
a    1.424326
c   -1.246831
d   -1.008064
dtype: float64
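The selections above can be reproduced with deterministic data. This is a small sketch, assuming integer values 0 to 8 in place of the random numbers; the variable names are only illustrative:

```python
import pandas as pd

data = pd.Series(range(9),
                 index=[list('aaabbccdd'),
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

outer_b = data['b']                  # outer-level key: result is indexed by the inner labels
inner_2 = data.loc[:, 2]             # inner-level key: result is indexed by the outer labels
same_inner_2 = data.xs(2, level=1)   # cross-section on level 1, same selection as above
single = data.loc[('b', 3)]          # a full (outer, inner) key returns a scalar

print(outer_b)
print(inner_2)
print(single)
```

`xs` is handy when the level to select on is not the first one, because it takes the level by number or name instead of relying on slice position.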

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method:

data.unstack()
          1         2         3
a  0.874880  1.424326 -2.028509
b -1.081833       NaN -0.072116
c  0.575918 -1.246831       NaN
d       NaN -1.008064  0.988234

The inverse operation of unstack is stack:

data.unstack().stack()  # round-trips back to the original Series
a  1    0.874880
   2    1.424326
   3   -2.028509
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
d  2   -1.008064
   3    0.988234
dtype: float64
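A short sketch of the round trip with fixed values (an assumption, so the result can be checked). Whether `stack` drops the NaN cells by default has, as far as I know, varied across pandas versions, so the example drops them explicitly:

```python
import pandas as pd

data = pd.Series(range(9),
                 index=[list('aaabbccdd'),
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

wide = data.unstack()                    # inner level becomes the columns
n_missing = int(wide.isna().sum().sum()) # (outer, inner) pairs absent from the Series
print(wide.shape, n_missing)

# Drop the NaN cells explicitly so the result does not depend on stack's default
restored = wide.stack().dropna().astype(int)
print(list(restored.index) == list(data.index))
```

Note that unstacking converts the values to float to make room for NaN, which is why the sketch casts back to int before comparing.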

stack and unstack will be explored in more detail later in this chapter.

With a DataFrame, either axis can have a hierarchical index:

frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index=[['a','a','b','b'], [1,2,1,2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                             ['Green', 'Red', 'Green']]
                    )

frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:

frame.index.names = ['key1', 'key2']

frame.columns.names = ['state', 'color']

frame  # both the row and the column index levels now carry names
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

Be careful to distinguish the index names 'state' and 'color' from the row labels.

With partial column indexing you can similarly select groups of columns:

frame['Ohio']
color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

tmp = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                 ['Green', 'Red', 'Green']],
                                names=['state', 'color'])
tmp
MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])
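To illustrate the "reused" part, here is a minimal sketch that builds the column index once and attaches it to two frames. The second frame, `other`, is hypothetical and exists only for the illustration:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=cols)

# A second, hypothetical frame sharing the same column index
other = pd.DataFrame(np.zeros((2, 3)), columns=cols)

print(list(frame.columns.names))
print(frame.columns.equals(other.columns))
print(list(frame['Ohio'].columns))   # partial column indexing still works
```

Passing the level names at construction time avoids the separate `frame.columns.names = ...` assignment.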

Reordering and Sorting Levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame.swaplevel('key1', 'key2')  # swap the two row index levels
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

sort_index, on the other hand, sorts the data using only the values in a single level. When swapping levels, it's not uncommon to also use sort_index so that the result is lexicographically sorted by the indicated level:

frame.sort_index(level=1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

frame.sort_index(level=0)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

# Swap the index levels first, then sort by the new outer level
frame.swaplevel(0, 1).sort_index(level=0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
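The swap-then-sort pattern can be checked on the same frame. This is a sketch with the level names passed at construction time (a small departure from the assignment used earlier):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']),
)

swapped = frame.swaplevel('key1', 'key2')      # levels exchanged, row order unchanged
sorted_swapped = swapped.sort_index(level=0)   # rows reordered by the new outer level

print(list(swapped.index))
print(list(sorted_swapped.index))
```

swaplevel only relabels the rows; it is sort_index that actually moves them.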

Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling sort_index(level=0) or sort_index().
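One way to check whether an index is already in that sorted state is the `is_monotonic_increasing` property. The sketch below is only an illustration of the check, not a benchmark, and uses a frame with plain default columns for brevity:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
)

swapped = frame.swaplevel('key1', 'key2')
print(swapped.index.is_monotonic_increasing)    # False: (2, 'a') comes before (1, 'b')

resorted = swapped.sort_index()
print(resorted.index.is_monotonic_increasing)   # True
```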

Summary Statistics by Level

Many descriptive and summary statistics on DataFrame and Series have a level option with which you can specify the level you want to aggregate by on a particular axis. Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame.sum(level='key2')
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16

frame.sum(level='color', axis=1)
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10

Under the hood, this utilizes pandas's groupby machinery, which will be discussed in more detail later in the book.
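The `level` argument of `sum` was, as far as I know, deprecated in pandas 1.3 and removed in pandas 2.0, so the calls above only run on older releases. A sketch of the same aggregations written directly with groupby; the transpose for the column case is a workaround chosen here, not something the original text prescribes:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']),
)

# Row-wise: sum the rows that share the same key2 label
by_key2 = frame.groupby(level='key2').sum()

# Column-wise: transpose, group the (former) columns by color, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

Both results match the tables shown for the `level=` calls.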

Indexing with a DataFrame's Columns

It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns. Here's an example DataFrame:

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': "one,one,one,two,two,two,two".split(','),
    'd': [0, 1, 2, 0, 1, 2, 3]
})

frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

DataFrame's set_index method will create a new DataFrame using one or more of its columns as the index:

# Use columns c and d as the index; by default they are removed from the columns
frame2 = frame.set_index(['c', 'd'])

frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

By default the columns are removed from the DataFrame, though you can leave them in:

frame.set_index(['c', 'd'], drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

# Move the hierarchical index levels back into ordinary columns
frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
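A compact sketch of the round trip between columns and index levels. The name `restored` is introduced only for this illustration:

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': "one,one,one,two,two,two,two".split(','),
    'd': [0, 1, 2, 0, 1, 2, 3],
})

frame2 = frame.set_index(['c', 'd'])   # c and d become index levels
restored = frame2.reset_index()        # and go back to being columns

print(list(frame2.index.names))
print(list(restored.columns))          # the former index levels come first

# Same data as the original once the column order is aligned
print(restored[list(frame.columns)].values.tolist() == frame.values.tolist())
```

The only visible difference after the round trip is the column order, since reset_index places the former index levels at the front.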

