Hierarchical Indexing in pandas

Reposted article, 2019-11-25 00:27. Tags: hierarchical indexing / reset_index / data analysis / level / Pandas / unstack

In many applications, data may be spread across a number of files or datasets, or be arranged in a form that is not easy to analyze. This chapter focuses on tools that help combine and rearrange data.

import numpy as np 
import pandas as pd

Hierarchical indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher-dimensional data in a lower-dimensional form. Let's start with a simple example: create a Series with a list of lists (or arrays) as the index:

data = pd.Series(np.random.randn(9),
                index=['a,a,a,b,b,c,c,d,d'.split(','),
                      [1,2,3,1,3,1,2,2,3]])

data

a  1    0.874880
   2    1.424326
   3   -2.028509
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
d  2   -1.008064
   3    0.988234
dtype: float64

What you're seeing is a prettified view of a Series with a MultiIndex as its index. The 'gaps' in the index display mean "use the label directly above":

data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
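
To make the structure of the index more concrete, here is a small sketch that inspects it directly. It uses fixed values instead of random ones so the result is reproducible. Note that the output above comes from an older pandas version that calls the integer positions labels; recent versions call them codes.

```python
import pandas as pd

# Same index as above, with fixed values so the result is reproducible
data = pd.Series(
    range(9),
    index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

idx = data.index
print(idx.nlevels)           # number of index levels
print(list(idx.levels[0]))   # unique labels of the outer level
print(list(idx.levels[1]))   # unique labels of the inner level
print(idx[4])                # the full key of the fifth row, as a tuple
```

Each row is identified by a tuple such as ('b', 3); the levels only store the unique labels once.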

With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

data['b']  # rows under outer label 'b' (inner labels 1 and 3)

1   -1.081833
3   -0.072116
dtype: float64

data['b':'c']  # label slice over the outer level; both ends are included

b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
dtype: float64

data.loc[['b', 'd']]  # loc selects by label; iloc selects by integer position

b  1   -1.081833
   3   -0.072116
d  2   -1.008064
   3    0.988234
dtype: float64

Selection is even possible from an inner level:

data.loc[:, 2]
a    1.424326
c   -1.246831
d   -1.008064
dtype: float64
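
As a sketch of the same idea with reproducible numbers, the inner-level selection can also be written with the xs method, and a fully specified key returns a single value. The values 10 to 90 are made up for illustration.

```python
import pandas as pd

# Illustrative values (not the random data used above)
data = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90],
    index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

by_loc = data.loc[:, 2]        # every outer label, inner label 2
by_xs = data.xs(2, level=1)    # same selection written as a cross-section
single = data.loc[('b', 3)]    # a complete key gives back one scalar

print(by_loc)
print(single)
```

Only the outer labels that actually have an inner label 2 (a, c and d) appear in the result.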

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, you could rearrange the data into a DataFrame using its unstack method:

data.unstack()

	1	2	3
a	0.874880	1.424326	-2.028509
b	-1.081833	NaN	-0.072116
c	0.575918	-1.246831	NaN
d	NaN	-1.008064	0.988234

The inverse operation of unstack is stack:

data.unstack().stack()  # round trip: gives back the original Series

a  1    0.874880
   2    1.424326
   3   -2.028509
b  1   -1.081833
   3   -0.072116
c  1    0.575918
   2   -1.246831
d  2   -1.008064
   3    0.988234
dtype: float64
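
The round trip can be checked explicitly. The following is a minimal sketch with fixed values; the explicit dropna() call is my addition, because whether stack itself discards the NaN cells introduced by unstack has differed between pandas versions.

```python
import pandas as pd

# Fixed float values so the comparison is exact
data = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    index=[list('aaabbccdd'), [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

wide = data.unstack()                      # inner level becomes the columns
n_missing = int(wide.isna().sum().sum())   # cells with no original value

# Drop the filler NaN cells explicitly instead of relying on stack's default
restored = wide.stack().dropna()

print(wide.shape, n_missing)
print(restored.tolist() == data.tolist())
```

The 4 x 3 table has three missing cells because only nine of the twelve possible (outer, inner) combinations exist in the original Series.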

stack and unstack will be explored in more detail later in this chapter.

With a DataFrame, either axis can have a hierarchical index:

frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index=[['a','a','b','b'], [1,2,1,2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                             ['Green', 'Red', 'Green']]
                    )

frame

		Ohio		Colorado
		Green	Red	Green
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:

frame.index.names = ['key1', 'key2']

frame.columns.names = ['state', 'color']

# the level names now appear in the output for both rows and columns
frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

Be careful to distinguish the index names 'state' and 'color' from the row labels.

With partial column indexing you can similarly select groups of columns:

frame['Ohio']

	color	Green	Red
key1	key2
a	1	0	1
a	2	3	4
b	1	6	7
b	2	9	10
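
Beyond selecting a whole group, a single column can be picked with a complete tuple key, and all columns sharing an inner label can be picked with xs. This is a sketch that rebuilds the same frame so it runs on its own; the variable names ohio_red and green are just for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from above, including the level names
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']),
)

ohio_red = frame[('Ohio', 'Red')]                  # one column, as a Series
green = frame.xs('Green', axis=1, level='color')   # every Green column

print(ohio_red.tolist())
print(list(green.columns))
```

By default xs drops the level it selected on, so the remaining columns are labeled by state only.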

A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this.

tmp = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])

tmp

MultiIndex(levels=[['Colorado', 'Ohio'], ['Green', 'Red']],
           labels=[[1, 1, 0], [0, 1, 0]],
           names=['state', 'color'])
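
from_arrays is not the only constructor. As a brief sketch, the same index can be built from a list of tuples, while from_product builds every combination of the given labels and therefore yields a different (larger) index here.

```python
import pandas as pd

arrays = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]
names = ['state', 'color']

from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)
from_tuples = pd.MultiIndex.from_tuples(
    [('Ohio', 'Green'), ('Ohio', 'Red'), ('Colorado', 'Green')],
    names=names)

# from_product builds every combination, so it has four entries, not three
full = pd.MultiIndex.from_product(
    [['Ohio', 'Colorado'], ['Green', 'Red']], names=names)

print(from_arrays.equals(from_tuples))
print(len(full))
```

from_product is convenient when the index should cover a complete grid; from_arrays and from_tuples fit data where only some combinations occur.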

Reordering and sorting levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.swaplevel('key1', 'key2')  # swap the two row index levels

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

sort_index, on the other hand, sorts the data using only the values in a single level. When swapping levels, it's not uncommon to also use sort_index so that the result is lexicographically sorted by the indicated level:

frame.sort_index(level=1)

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
b	1	6	7	8
a	2	3	4	5
b	2	9	10	11

# sort by the outer level (key1); the frame is already in this order
frame.sort_index(level=0)

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

# swap the levels first, then sort by the new outer level (level 0)
frame.swaplevel(0, 1).sort_index(level=0)

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
1	b	6	7	8
2	a	3	4	5
2	b	9	10	11

Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling sort_index().
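
Whether an index is in this sorted state can be checked directly. The sketch below uses the is_monotonic_increasing property of the index; the single-level column names x, y and z are placeholders to keep the example short.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=['x', 'y', 'z'],   # placeholder column names for this sketch
)

swapped = frame.swaplevel('key1', 'key2')
sorted_frame = swapped.sort_index()   # sorts by key2, then by key1

print(swapped.index.is_monotonic_increasing)       # False
print(sorted_frame.index.is_monotonic_increasing)  # True
```

Swapping the levels leaves the rows in their old order, so the index is no longer sorted until sort_index is called.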

Summary statistics by level

Many descriptive and summary statistics on DataFrame and Series have a level option with which you can specify the level you want to aggregate by on a particular axis. Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

frame
frame.sum(level='key2')

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

state	Ohio		Colorado
color	Green	Red	Green
key2
1	6	8	10
2	12	14	16

frame.sum(level='color', axis=1)

	color	Green	Red
key1	key2
a	1	2	1
a	2	8	4
b	1	14	7
b	2	20	10

Under the hood, this utilizes pandas's groupby machinery, which will be discussed in more detail later in the book.
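
Since the level option is built on groupby, the same aggregations can be written with groupby directly. This matters in practice: the level argument of sum was deprecated in pandas 1.3 and removed in pandas 2.0, so the sketch below is the form that should work on current versions. Transposing for the column-wise case is one possible approach, chosen here because it avoids the axis argument of groupby.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']),
)

# Row-wise: add up the rows that share the same key2 label
by_key2 = frame.groupby(level='key2').sum()

# Column-wise: transpose, group the former columns by color, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

Both results match the tables shown above for frame.sum(level='key2') and frame.sum(level='color', axis=1).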

Using DataFrame columns as the row index

It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns. Here's an example DataFrame:

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': "one,one,one,two,two,two,two".split(','),
    'd':[0, 1, 2, 0, 1, 2, 3]
})

frame

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3

DataFrame's set_index method will create a new DataFrame using one or more of its columns as the index:

# use columns c and d as the index; they are removed from the columns
frame2 = frame.set_index(['c', 'd'])

frame2

		a	b
c	d
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

By default the columns are removed from the DataFrame, though you can leave them in:

frame.set_index(['c', 'd'], drop=False)

		a	b	c	d
c	d
one	0	0	7	one	0
	1	1	6	one	1
	2	2	5	one	2
two	0	3	4	two	0
	1	4	3	two	1
	2	5	2	two	2
	3	6	1	two	3

reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

# move the index levels c and d back into ordinary columns
frame2.reset_index()

	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1
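
As a quick check that the two methods undo each other, the following sketch rebuilds the frame, round-trips it, and also moves back only one of the two levels using the level argument of reset_index. The names indexed, restored and only_d are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(['c', 'd'])
restored = indexed.reset_index()

# reset_index puts the former index levels first, so reorder the columns
# before comparing with the original frame
same = restored[list(frame.columns)].equals(frame)

# Move only level d back; c stays as the (now single-level) index
only_d = indexed.reset_index(level='d')

print(same)
print(list(only_d.columns), only_d.index.name)
```

The data survives the round trip unchanged; only the column order differs, because the restored index levels are placed in front.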

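# Aside (not pandas-specific): a mutable default argument is created once,
# when the function is defined, and is shared by every call that omits it.
def f(x, l=[]):
    for i in range(x):
        l.append(i * i)
    print(l)


f(2)             # appends to the shared default list
f(3, [3, 2, 1])  # appends to the list passed in
f(3)             # reuses the shared default, which still holds [0, 1]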

[0, 1]
[3, 2, 1, 0, 1, 4]
[0, 1, 0, 1, 4]
