import numpy as np
import pandas as pd
This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.
In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we'll focus on the most important features, leaving the less common (that is, more esoteric) things for you to explore on your own.
Reindexing
An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index, introducing missing values (NaN) for any index values that were not already present:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3
0 blue
2 purple
4 yellow
dtype: object
"ffill 向前填充 - forward-fill"
obj3.reindex(range(6), method='ffill')
'ffill 向前填充 - forward-fill'
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California']
)
frame
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
"重新index, 不能匹配的則NaN"
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
'重新index, 不能匹配的則NaN'
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
The columns can be reindexed with the columns keyword:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
See Table 5-3 for more about the arguments to reindex.
As we'll explore in more detail, you can reindex more succinctly by label-indexing with loc, and many users prefer to use it exclusively:
"loc[[row labels], [column labels]]"
frame.loc[['a', 'b', 'c', 'd'], states]
'loc[[row labels], [column labels]]'
FutureWarning: Passing list-likes to .loc or [] with any missing label will raise a KeyError in the future; you can use .reindex() as an alternative. See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Texas | Utah | California | |
---|---|---|---|
a | 1.0 | NaN | 2.0 |
b | NaN | NaN | NaN |
c | 4.0 | NaN | 5.0 |
d | 7.0 | NaN | 8.0 |
reindex() arguments:
- index: new sequence to use as the index
- method: fill method; 'ffill' fills forward, 'bfill' fills backward (see the sketch after this list)
- fill_value: substitute value to use when introducing missing data
- limit: when forward- or backfilling, the maximum number of consecutive elements to fill
- level: match a simple Index on a level of a MultiIndex; otherwise select a subset
- copy: if True, always copy the underlying data even if the new index is equivalent to the old one
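As a quick sketch of fill_value, method, and limit together (reusing a small Series like obj3 above; the 'missing' placeholder string is just an arbitrary choice for illustration):
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# Fill the gaps introduced by the new index with a constant instead of NaN
obj3.reindex(range(6), fill_value='missing')
# Forward-fill, but fill at most one consecutive missing position
obj3.reindex(range(6), method='ffill', limit=1)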
Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0
b 1
c 2
d 3
e 4
dtype: int32
"drop('行索引') 直接刪除, 連帶這行數據"
new_obj = obj.drop('c')
new_obj
"drop('行索引') 直接刪除, 連帶這行數據"
a 0
b 1
d 3
e 4
dtype: int32
"刪除多個索引, 用列表組織起來 - 非原地"
obj.drop(['d', 'c'])
'刪除多個索引, 用列表組織起來 - 非原地'
a 0
b 1
e 4
dtype: int32
obj
a 0
b 1
c 2
d 3
e 4
dtype: int32
With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Calling drop with a sequence of labels will drop values from the row labels (axis 0), removing those rows and their data:
'drop([row_label1, row_label2]) drops those rows - not in place'
data.drop(['Colorado', 'Ohio'])
'drop([row_label1, row_label2]) drops those rows - not in place'
one | two | three | four | |
---|---|---|---|---|
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
".drop()方式是非原地的, del方式是原地的"
data
'.drop()方式是非原地的, del方式是原地的'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
You can drop values from the columns by passing axis=1 or axis='columns'.
"To drop columns, pass axis=1 or axis='columns'"
data.drop(['two', 'four'], axis='columns')
"To drop columns, pass axis=1 or axis='columns'"
one | three | |
---|---|---|
Ohio | 0 | 2 |
Colorado | 4 | 6 |
Utah | 8 | 10 |
New York | 12 | 14 |
"drop()不論刪除行還是列, 默認都是非原地的,可以指定"
data
'drop()不論刪除行還是列, 默認都是非原地的,可以指定'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in place without returning a new object; pass inplace=True (many methods accept it):
"Drop columns two and three in place"
data.drop(['two', 'three'], axis='columns', inplace=True)
"The original data has now been modified"
data
'Drop columns two and three in place'
'The original data has now been modified'
one | four | |
---|---|---|
Ohio | 0 | 3 |
Colorado | 4 | 7 |
Utah | 8 | 11 |
New York | 12 | 15 |
Indexing, Selection, and Filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c','d'])
obj
a 0
b 1
c 2
d 3
dtype: int32
"obj[row_index_name]"
obj['b']
"根據多個行索引名稱來選取"
obj[['b', 'd']]
'obj[row_index_name]'
1
'根據多個行索引名稱來選取'
b 1
d 3
dtype: int32
"根據索引值來選取, 下標從0哦"
"obj[1]"
"取第二行元素"
obj[1]
"連續: 取1到3行元素, 前閉后開的, 跟列表一樣"
obj[0:3]
"離散: 取索第1行, 第4行"
obj[[0, 3]]
'根據索引值來選取, 下標從0哦'
'obj[1]'
'取第二行元素'
1
'連續: 取1到3行元素, 前閉后開的, 跟列表一樣'
a 0
b 1
c 2
dtype: int32
'Non-contiguous: take the 1st and 4th elements'
a 0
d 3
dtype: int32
"bool 索引, 取值小於2的"
obj[ obj < 2]
'bool 索引, 取值小於2的'
a 0
b 1
dtype: int32
Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
"Label slices are inclusive of the endpoint"
obj['b':'c']
'Label slices are inclusive of the endpoint'
b 1
c 2
dtype: int32
Setting using these methods modifies the corresponding section of the Series:
"直接賦值修改是原地的"
obj['b':'c'] = 5
obj
'直接賦值修改是原地的'
a 0
b 5
c 5
d 3
dtype: int32
Indexing into a DataFrame retrieves one or more columns either with a single value or a sequence:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
"df[列名] 選取單個列"
data['two']
"df[[col1, col2]] 選取多個列"
data[['three', 'four']]
'df[列名] 選取單個列'
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
'df[[col1, col2]] selects multiple columns'
three | four | |
---|---|---|
Ohio | 2 | 3 |
Colorado | 6 | 7 |
Utah | 10 | 11 |
New York | 14 | 15 |
Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:
"df[:2] slices rows by integer position; here it takes the first two rows"
data[:2]
"Boolean row selection: rows where column three is greater than 5"
data[data['three'] > 5]
'df[:2] slices rows by integer position; here it takes the first two rows'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
'Boolean row selection: rows where column three is greater than 5'
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.
(In other words, df[...] selects columns when given a column name or a list of names, but selects rows when given a slice; this is easy to mix up.)
Another use case is indexing with a boolean DataFrame, such as one produced by a scalar comparison:
data < 5
one | two | three | four | |
---|---|---|---|---|
Ohio | True | True | True | True |
Colorado | True | False | False | False |
Utah | False | False | False | False |
New York | False | False | False | False |
"檢索整個DF的值, 將小於5的值 原地替換為 0"
data[data < 5] = 0
data
'檢索整個DF的值, 將小於5的值 原地替換為 0'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.
Selection with loc and iloc
For DataFrame label-indexing on the rows, I introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).
As a preliminary example, let's select a single row and multiple columns by label:
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
"df.loc[[行索引], [列索引]]"
data.loc[['Colorado'], ['two', 'three']]
'df.loc[[行索引], [列索引]]'
two | three | |
---|---|---|
Colorado | 5 | 6 |
We'll then perform some similar selections with integers using iloc:
"Select the 3rd row and the 4th, 1st, and 2nd columns: 11, 8, 9"
data.iloc[2, [3,0,1]]
"A one-dimensional selection gives a Series; a two-dimensional selection gives a DataFrame"
data.iloc[[2,3], [3,0,1]]
'Select the 3rd row and the 4th, 1st, and 2nd columns: 11, 8, 9'
four 11
one 8
two 9
Name: Utah, dtype: int32
'A one-dimensional selection gives a Series; a two-dimensional selection gives a DataFrame'
four | one | two | |
---|---|---|---|
Utah | 11 | 8 | 9 |
New York | 15 | 12 | 13 |
Both indexing functions work with slices in addition to single labels or lists of labels:
"Both also accept slices"
data.loc[:'Utah', 'two']
'Both also accept slices'
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
"取所有行, 1-4列, three列值大於5的行"
data.iloc[:, :3][data.three > 5]
'取所有行, 1-4列, three列值大於5的行'
one | two | three | |
---|---|---|---|
Colorado | 0 | 5 | 6 |
Utah | 8 | 9 | 10 |
New York | 12 | 13 | 14 |
So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 5-4 provides a short summary of many of them. As you'll see later, there are a number of additional options for working with hierarchical indexes.
When originally designing pandas, I felt that having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. I made the design trade-off to push all of the fancy indexing behavior (both labels and integers) into the ix operator. In practice, this led to many edge cases in data with integer axis labels, so the pandas team decided to create the loc and iloc operators to deal with strictly label-based and integer-based indexing, respectively.
The ix indexing operator still exists, but it is deprecated. I do not recommend using it.
Indexing options with DataFrame
- df[col_name] / df[[col1, col2]]: select a single column or multiple columns
- df.loc[row_label]: select a single row or multiple rows by label
- df.loc[:, col_label]: select a single column or multiple columns by label
- df.loc[row_label, col_label]: select rows and columns by label
- df.iloc[where]: select rows by integer position
- df.iloc[:, where]: select columns by integer position
- df.iloc[where_i, where_j]: select rows and columns by integer position
- df.at[label_i, label_j]: select a single scalar value by row and column label (see the sketch after this list)
- df.iat[i, j]: select a single scalar value by row and column integer position
- reindex method: select either rows or columns by labels
- get_value, set_value methods: select a single value by row and column label
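A minimal sketch of at and iat, assuming the data DataFrame from the examples above (where the value at row 'Utah', column 'two' is 9):
# Fast scalar lookups: at uses labels, iat uses integer positions
data.at['Utah', 'two']   # -> 9
data.iat[2, 1]           # -> 9, the same element selected by position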
Integer Indexes
Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:
ser = pd.Series(np.arange(3))
ser
0 0
1 1
2 2
dtype: int32
"習慣以為是Python"
ser[-1]
'習慣以為是Python'
In this case, pandas could "fall back" on integer indexing, but it is difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based or position-based indexing) is difficult:
ser
0 0
1 1
2 2
dtype: int32
On the other hand, with a non-integer index, there is no potential for ambiguity:
ser2 = pd.Series(np.arange(3), index=['a', 'b', 'c'])
ser2
a 0
b 1
c 2
dtype: int32
"這樣就不會引起歧義了, 數值就是行呀"
ser2[-1]
'這樣就不會引起歧義了, 數值就是行呀'
2
To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):
(Good practice: use df[col] to select columns, loc for label-based selection, and iloc for integer-based selection, rather than mixing them.)
'Select the first row; the end point is exclusive'
ser[:1]
'Select rows with labels 0 and 1; with loc the end point is inclusive'
ser.loc[:1]
ser.iloc[:1]
'Select the first row; the end point is exclusive'
0 0
dtype: int32
'Select rows with labels 0 and 1; with loc the end point is inclusive'
0 0
1 1
dtype: int32
0 0
dtype: int32
Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let's look at an example.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
index=['a', 'c', 'e', 'f', 'g'])
s1
s2
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
Adding these together yields:
s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
The internal data alignment introduces missing values in the label locations that don't overlap. Missing values will then propagate in further arithmetic computations.
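As a small sketch of that propagation, reusing s1 and s2 from above: positions that became NaN in the sum stay NaN under further arithmetic.
result = s1 + s2   # NaN at labels 'd', 'f', 'g'
result * 2         # the NaN positions remain NaN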
In the case of DataFrame, alignment is performed on both the rows and the columns:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4,3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
b | c | d | |
---|---|---|---|
Ohio | 0 | 1 | 2 |
Texas | 3 | 4 | 5 |
Colorado | 6 | 7 | 8 |
b | d | e | |
---|---|---|---|
Utah | 0 | 1 | 2 |
Ohio | 3 | 4 | 5 |
Texas | 6 | 7 | 8 |
Oregon | 9 | 10 | 11 |
Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
"The data is automatically aligned"
df1 + df2
'The data is automatically aligned'
b | c | d | e | |
---|---|---|---|---|
Colorado | NaN | NaN | NaN | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Oregon | NaN | NaN | NaN | NaN |
Texas | 9.0 | NaN | 12.0 | NaN |
Utah | NaN | NaN | NaN | NaN |
Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
A | |
---|---|
0 | 1 |
1 | 2 |
B | |
---|---|
0 | 3 |
1 | 4 |
"沒有相同的行列標簽, 節后是標簽保留了, 值都為NaN"
df1 + df2
'沒有相同的行列標簽, 節后是標簽保留了, 值都為NaN'
A | B | |
---|---|---|
0 | NaN | NaN |
1 | NaN | NaN |
Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
"第2行, b列的元素, 修改為NaN"
df2.loc[1, 'b'] = np.nan
df1
df2
'第2行, b列的元素, 修改為NaN'
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 |
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
Adding these together results in NA values in the locations that don't overlap:
"Values at matching labels are combined; non-overlapping locations become NaN"
df1 + df2
'Values at matching labels are combined; non-overlapping locations become NaN'
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
Using the add method on df1, I pass df2 and an argument to fill_value, which substitutes 0 for any location that is missing in one of the objects before the operation:
df1.add(df2, fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
See Table 5-5 for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has its arguments flipped. So these two statements are equivalent:
1 / df1
"Writing the operator directly and calling the reversed method give the same result"
df1.rdiv(1)
a | b | c | d | |
---|---|---|---|---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
'Writing the operator directly and calling the reversed method give the same result'
a | b | c | d | |
---|---|---|---|---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
'When reindexing, labels that have no match can be filled via fill_value, as with column e here'
df1.reindex(columns=df2.columns, fill_value=0)
'When reindexing, labels that have no match can be filled via fill_value, as with column e here'
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 | 0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 | 0 |
Flexible arithmetic methods (see the sketch after this list):
- add, radd: methods for addition (+)
- sub, rsub: methods for subtraction (-)
- div, rdiv: methods for division (/)
- floordiv, rfloordiv: methods for floor division (//)
- mul, rmul: methods for multiplication (*)
- pow, rpow: methods for exponentiation (**)
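A brief sketch of how the r-prefixed counterparts flip their arguments, reusing df1 and df2 from above (the scalar operands are arbitrary choices for illustration):
df1.sub(1)                  # equivalent to df1 - 1
df1.rsub(1)                 # equivalent to 1 - df1
df1.add(df2, fill_value=0)  # the flexible methods also accept fill_value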
Operations between DataFrame and Series
As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
'Take the first row'
arr[0]
'Broadcasting: arr[0] is subtracted from every row because the shapes match'
arr - arr[0]
'Take the first row'
array([0, 1, 2, 3])
'Broadcasting: arr[0] is subtracted from every row because the shapes match'
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])
When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
"選取第一行"
series = frame.iloc[0]
series
'選取第一行'
b 0
d 1
e 2
Name: Utah, dtype: int32
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows:
"Matching is on the columns; broadcasting goes down the rows"
frame - series
'Matching is on the columns; broadcasting goes down the rows'
b | d | e | |
---|---|---|---|
Utah | 0 | 0 | 0 |
Ohio | 3 | 3 | 3 |
Texas | 6 | 6 | 6 |
Oregon | 9 | 9 | 9 |
If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
"Labels with no match become NaN"
frame + series2
'Labels with no match become NaN'
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
series3 = frame['d']
frame
series3
b | d | e | |
---|---|---|---|
Utah | 0 | 1 | 2 |
Ohio | 3 | 4 | 5 |
Texas | 6 | 7 | 8 |
Oregon | 9 | 10 | 11 |
Utah 1
Ohio 4
Texas 7
Oregon 10
Name: d, dtype: int32
"指定廣播的軸0, 每一了行都被廣播"
frame.sub(series3, axis='index')
'指定廣播的軸0, 每一了行都被廣播'
b | d | e | |
---|---|---|---|
Utah | -1 | 0 | 1 |
Ohio | -1 | 0 | 1 |
Texas | -1 | 0 | 1 |
Oregon | -1 | 0 | 1 |
The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:
# randn draws samples from the standard normal distribution
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | 0.872481 | -0.026409 | 1.130246 |
Ohio | -0.793998 | -1.394605 | -0.224205 |
Texas | -0.120480 | 0.243161 | 1.627977 |
Oregon | -2.734813 | -2.009582 | 0.905505 |
np.abs(frame)
b | d | e | |
---|---|---|---|
Utah | 0.872481 | 0.026409 | 1.130246 |
Ohio | 0.793998 | 1.394605 | 0.224205 |
Texas | 0.120480 | 0.243161 | 1.627977 |
Oregon | 2.734813 | 2.009582 | 0.905505 |
Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this:
"自定義一個求極差的函數, 作為參數傳給DF"
f = lambda x: x.max() - x.min()
"默認方向映射每列"
frame.apply(f)
'自定義一個求極差的函數, 作為參數傳給DF'
'默認方向映射每列'
b 3.607294
d 2.252743
e 1.852182
dtype: float64
Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series having the columns of frame as its index.
If you pass axis='columns' to apply, the function will be invoked once per row instead:
"Default: one result per column; with axis=1 (the columns direction), one result per row"
"axis=1 means the function operates across the columns, i.e., once per row"
frame.apply(f, axis=1)
"Default: one result per column; with axis=1 (the columns direction), one result per row"
'axis=1 means the function operates across the columns, i.e., once per row'
Utah 1.156654
Ohio 1.170400
Texas 1.748457
Oregon 3.640318
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
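For example, column sums are available directly as a method, so the two calls below give the same result (a small sketch using the frame from above):
frame.sum()          # built-in column-wise reduction
frame.apply(np.sum)  # equivalent, but apply is unnecessary here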
The function passed to apply need not return a scalar value; it can also return a Series with multiple values:
def f(x):
    "Return the minimum and maximum of each column as a Series"
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
"Pass the function object to apply"
frame.apply(f)
'Pass the function object to apply'
b | d | e | |
---|---|---|---|
min | -2.734813 | -2.009582 | -0.224205 |
max | 0.872481 | 0.243161 | 1.627977 |
Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string (say, to two decimal places) from each floating-point value in frame. You can do this with applymap:
format = lambda x: '%.2f' % x
"applymap() 映射所有元素, 而apply是有軸方向的"
frame.applymap(format)
'applymap() 映射所有元素, 而apply是有軸方向的'
b | d | e | |
---|---|---|---|
Utah | 0.87 | -0.03 | 1.13 |
Ohio | -0.79 | -1.39 | -0.22 |
Texas | -0.12 | 0.24 | 1.63 |
Oregon | -2.73 | -2.01 | 0.91 |
The reason for the name applymap is that Series has a map method for applying an element-wise function:
"Series.map applies a function element-wise"
frame['e'].map(format)
'Series.map applies a function element-wise'
Utah 1.13
Ohio -0.22
Texas 1.63
Oregon 0.91
Name: e, dtype: object
Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
"sort_index() 按索引排序"
obj.sort_index()
'sort_index() 按索引排序'
a 1
b 2
c 3
d 0
dtype: int64
With a DataFrame, you can sort by index on either axis:
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame
"按索引排序, 默認axis=0, 下方向"
frame.sort_index()
"按列索引排序, axis=1, 右方向"
frame.sort_index(axis=1)
d | a | b | c | |
---|---|---|---|---|
three | 0 | 1 | 2 | 3 |
one | 4 | 5 | 6 | 7 |
'Sort by the row index (default axis=0)'
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
'Sort by the column index with axis=1'
a | b | c | d | |
---|---|---|---|---|
three | 1 | 2 | 3 | 0 |
one | 5 | 6 | 7 | 4 |
The data is sorted in ascending order by default, but can be sorted in descending order, too:
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 0 | 3 | 2 | 1 |
one | 4 | 7 | 6 | 5 |
To sort a Series by its values, use its sort_values method.
obj = pd.Series([4, 7, -3, 2])
"sort_values()按值排序"
obj.sort_values()
'sort_values()按值排序'
2 -3
3 2
0 4
1 7
dtype: int64
Any missing values are sorted to the end of the Series by default:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
"Missing values are sorted to the end"
obj.sort_values()
'Missing values are sorted to the end'
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
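If you would rather have the missing values first, sort_values accepts an na_position option (a quick sketch; the default is 'last'):
obj.sort_values(na_position='first')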
When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:
frame = pd.DataFrame({
'b':[4,7,3,-2],
'a':[0,1,0, 1]
})
frame
b | a | |
---|---|---|
0 | 4 | 0 |
1 | 7 | 1 |
2 | 3 | 0 |
3 | -2 | 1 |
"by='column_name' 按某個字段進行排序"
frame.sort_values(by='b')
"by='column_name' 按某個字段進行排序"
b | a | |
---|---|---|
3 | -2 | 1 |
2 | 3 | 0 |
0 | 4 | 0 |
1 | 7 | 1 |
"To sort by multiple columns, pass a list of names"
frame.sort_values(by=['a', 'b'])
'To sort by multiple columns, pass a list of names'
b | a | |
---|---|---|
2 | 3 | 0 |
0 | 4 | 0 |
3 | -2 | 1 |
1 | 7 | 1 |
Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default, rank breaks ties by assigning each group the mean rank. For example, in the Series below the two 7s would occupy positions 5 and 6 in sorted order, so each receives the rank 5.5:
obj = pd.Series([7, -5, 7, 4, 2, 8, 4])
obj.rank()
"Ranks can also be assigned according to the order in which they're observed in the data"
obj.rank(method='first')
0 5.5
1 1.0
2 5.5
3 3.5
4 2.0
5 7.0
6 3.5
dtype: float64
"Ranks can also be assigned according to the order in which they're observed in the data"
0 5.0
1 1.0
2 6.0
3 3.0
4 2.0
5 7.0
6 4.0
dtype: float64
Here, instead of using the average rank 5.5 for the entries 0 and 2, they instead have been set to 5 and 6 because label 0 precedes label 2 in the data.
You can rank in descending order, too:
# Assign values the maximum rank in the group
obj.rank(ascending=False, method='max')
0 3.0
1 7.0
2 3.0
3 5.0
4 6.0
5 1.0
6 5.0
dtype: float64
See Table 5-6 for a list of the tie-breaking methods available.
DataFrame can compute ranks over the rows or the columns:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
b | a | c | |
---|---|---|---|
0 | 4.3 | 0 | -2.0 |
1 | 7.0 | 1 | 5.0 |
2 | -3.0 | 0 | 8.0 |
3 | 2.0 | 1 | -2.5 |
frame.rank(axis=1)
b | a | c | |
---|---|---|---|
0 | 3.0 | 2.0 | 1.0 |
1 | 3.0 | 1.0 | 2.0 |
2 | 1.0 | 2.0 | 3.0 |
3 | 3.0 | 2.0 | 1.0 |
Tie-breaking methods for rank (see the sketch after this list):
- average: default; assign the average rank to each entry in the equal group
- max: use the maximum rank for the whole group
- min: use the minimum rank for the whole group
- first: assign ranks in the order the values appear in the data
- dense: like min, but ranks always increase by 1 between groups rather than by the number of equal elements in a group
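To see how the tie-breaking methods differ, compare a couple of them on the same Series used above (a quick sketch; note that 'dense' leaves no gap after a tie):
obj = pd.Series([7, -5, 7, 4, 2, 8, 4])
obj.rank(method='min')    # the tied 7s both get rank 5.0, and 8 gets 7.0
obj.rank(method='dense')  # the tied 7s get 4.0, and 8 gets 5.0 (no gap)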
Axis Indexes with Duplicate Labels
Up until now all of the examples we've looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it's not mandatory. Let's consider a small Series with duplicate indices:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
The index's is_unique property can tell you whether its labels are unique or not:
"The is_unique property tells you whether the index labels are unique"
obj.index.is_unique
'The is_unique property tells you whether the index labels are unique'
False
Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:
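A short sketch with the obj Series above: the duplicated label returns a Series, while the unique label returns a scalar.
obj['a']   # duplicated label -> a Series with two entries (0 and 1)
obj['c']   # unique label -> the scalar 4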
df = pd.DataFrame(np.random.randn(4,3), index=['a', 'a', 'b', 'b'])
df
0 | 1 | 2 | |
---|---|---|---|
a | -1.160530 | -0.226480 | 0.608358 |
a | -1.052758 | -0.783890 | 0.920109 |
b | -0.520996 | -0.706842 | 0.459379 |
b | 0.813595 | 1.052030 | 0.263111 |
"選取行索引為 b 的行, 重復, 則返回df"
df.loc['b']
'選取行索引為 b 的行, 重復, 則返回df'
0 | 1 | 2 | |
---|---|---|---|
b | -0.520996 | -0.706842 | 0.459379 |
b | 0.813595 | 1.052030 | 0.263111 |