import numpy as np
import pandas as pd
This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.
In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we'll focus on the most important features, leaving the less common (that is, more esoteric) things for you to explore on your own.
Reindexing
An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index, introducing missing values (NaN) for any index values that were not already present:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3
0 blue
2 purple
4 yellow
dtype: object
"ffill 向前填充 - forward-fill"
obj3.reindex(range(6), method='ffill')
'ffill 向前填充 - forward-fill'
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California']
)
frame
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
"重新index, 不能匹配的則NaN"
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
'重新index, 不能匹配的則NaN'
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
The columns can be reindexed with the columns keyword:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
See Table 5-3 for more about the arguments to reindex.
As we'll explore in more detail, you can reindex more succinctly by label-indexing with loc, and many users prefer to use it exclusively:
"loc[[row labels], [column labels]]"
frame.loc[['a', 'b', 'c', 'd'], states]
'loc[[row labels], [column labels]]'
FutureWarning: Passing list-likes to .loc or [] with any missing label will raise a KeyError in the future; you can use .reindex() as an alternative. See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Texas | Utah | California | |
---|---|---|---|
a | 1.0 | NaN | 2.0 |
b | NaN | NaN | NaN |
c | 4.0 | NaN | 5.0 |
d | 7.0 | NaN | 8.0 |
reindex() arguments:
- index: new sequence to use as the index
- method: fill method; 'ffill' fills forward, 'bfill' fills backward (see the sketch after this list)
- fill_value: substitute value to use when introducing missing data
- limit: when forward- or backfilling, the maximum number of consecutive elements to fill
- level: match a simple Index on a level of a MultiIndex; otherwise select a subset
- copy: if True, always copy the underlying data even if the new index is equivalent to the old one
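As a quick sketch of fill_value, method, and limit together (reusing a small Series like obj3 above; the 'missing' placeholder string is just an arbitrary choice for illustration):
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# Fill the gaps introduced by the new index with a constant instead of NaN
obj3.reindex(range(6), fill_value='missing')
# Forward-fill, but fill at most one consecutive missing position
obj3.reindex(range(6), method='ffill', limit=1)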
Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:
obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0
b 1
c 2
d 3
e 4
dtype: int32
"drop('行索引') 直接刪除, 連帶這行數據"
new_obj = obj.drop('c')
new_obj
"drop('行索引') 直接刪除, 連帶這行數據"
a 0
b 1
d 3
e 4
dtype: int32
"刪除多個索引, 用列表組織起來 - 非原地"
obj.drop(['d', 'c'])
'刪除多個索引, 用列表組織起來 - 非原地'
a 0
b 1
e 4
dtype: int32
obj
a 0
b 1
c 2
d 3
e 4
dtype: int32
With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Calling drop with a sequence of labels will drop values from the row labels (axis 0), removing those rows and their data:
'drop([row_label1, row_label2]) drops those rows - not in place'
data.drop(['Colorado', 'Ohio'])
'drop([row_label1, row_label2]) drops those rows - not in place'
one | two | three | four | |
---|---|---|---|---|
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
".drop()方式是非原地的, del方式是原地的"
data
'.drop()方式是非原地的, del方式是原地的'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
You can drop values from the columns by passing axis=1 or axis='columns'.
"To drop columns, pass axis=1 or axis='columns'"
data.drop(['two', 'four'], axis='columns')
"To drop columns, pass axis=1 or axis='columns'"
one | three | |
---|---|---|
Ohio | 0 | 2 |
Colorado | 4 | 6 |
Utah | 8 | 10 |
New York | 12 | 14 |
"drop()不論刪除行還是列, 默認都是非原地的,可以指定"
data
'drop()不論刪除行還是列, 默認都是非原地的,可以指定'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in place without returning a new object; pass inplace=True (many methods accept it):
"Drop columns two and three in place"
data.drop(['two', 'three'], axis='columns', inplace=True)
"The original data has now been modified"
data
'Drop columns two and three in place'
'The original data has now been modified'
one | four | |
---|---|---|
Ohio | 0 | 3 |
Colorado | 4 | 7 |
Utah | 8 | 11 |
New York | 12 | 15 |
Indexing, Selection, and Filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c','d'])
obj
a 0
b 1
c 2
d 3
dtype: int32
"obj[row_index_name]"
obj['b']
"根據多個行索引名稱來選取"
obj[['b', 'd']]
'obj[row_index_name]'
1
'根據多個行索引名稱來選取'
b 1
d 3
dtype: int32
"根據索引值來選取, 下標從0哦"
"obj[1]"
"取第二行元素"
obj[1]
"連續: 取1到3行元素, 前閉后開的, 跟列表一樣"
obj[0:3]
"離散: 取索第1行, 第4行"
obj[[0, 3]]
'根據索引值來選取, 下標從0哦'
'obj[1]'
'取第二行元素'
1
'連續: 取1到3行元素, 前閉后開的, 跟列表一樣'
a 0
b 1
c 2
dtype: int32
'Non-contiguous: take the 1st and 4th elements'
a 0
d 3
dtype: int32
"bool 索引, 取值小於2的"
obj[ obj < 2]
'bool 索引, 取值小於2的'
a 0
b 1
dtype: int32
Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
"Label slices are inclusive of the endpoint"
obj['b':'c']
'Label slices are inclusive of the endpoint'
b 1
c 2
dtype: int32
Setting using these methods modifies the corresponding section of the Series:
"直接賦值修改是原地的"
obj['b':'c'] = 5
obj
'直接賦值修改是原地的'
a 0
b 5
c 5
d 3
dtype: int32
Indexing into a DataFrame retrieves one or more columns either with a single value or a sequence:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
"df[列名] 選取單個列"
data['two']
"df[[col1, col2]] 選取多個列"
data[['three', 'four']]
'df[列名] 選取單個列'
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
'df[[col1, col2]] selects multiple columns'
three | four | |
---|---|---|
Ohio | 2 | 3 |
Colorado | 6 | 7 |
Utah | 10 | 11 |
New York | 14 | 15 |
Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:
"df[:2] slices rows by integer position; here it takes the first two rows"
data[:2]
"Boolean row selection: rows where column three is greater than 5"
data[data['three'] > 5]
'df[:2] slices rows by integer position; here it takes the first two rows'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
'Boolean row selection: rows where column three is greater than 5'
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.
(In other words, df[...] selects columns when given a column name or a list of names, but selects rows when given a slice; this is easy to mix up.)
Another use case is indexing with a boolean DataFrame, such as one produced by a scalar comparison:
data < 5
one | two | three | four | |
---|---|---|---|---|
Ohio | True | True | True | True |
Colorado | True | False | False | False |
Utah | False | False | False | False |
New York | False | False | False | False |
"檢索整個DF的值, 將小於5的值 原地替換為 0"
data[data < 5] = 0
data
'檢索整個DF的值, 將小於5的值 原地替換為 0'
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.
Selection with loc and iloc
For DataFrame label-indexing on the rows, I introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).
As a preliminary example, let's select a single row and multiple columns by label:
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
"df.loc[[行索引], [列索引]]"
data.loc[['Colorado'], ['two', 'three']]
'df.loc[[行索引], [列索引]]'
two | three | |
---|---|---|
Colorado | 5 | 6 |
We'll then perform some similar selections with integers using iloc:
"Select the 3rd row and the 4th, 1st, and 2nd columns: 11, 8, 9"
data.iloc[2, [3,0,1]]
"A one-dimensional selection gives a Series; a two-dimensional selection gives a DataFrame"
data.iloc[[2,3], [3,0,1]]
'Select the 3rd row and the 4th, 1st, and 2nd columns: 11, 8, 9'
four 11
one 8
two 9
Name: Utah, dtype: int32
'A one-dimensional selection gives a Series; a two-dimensional selection gives a DataFrame'
four | one | two | |
---|---|---|---|
Utah | 11 | 8 | 9 |
New York | 15 | 12 | 13 |
Both indexing functions work with slices in addition to single labels or lists of labels:
"Both also accept slices"
data.loc[:'Utah', 'two']
'Both also accept slices'
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
"取所有行, 1-4列, three列值大於5的行"
data.iloc[:, :3][data.three > 5]
'取所有行, 1-4列, three列值大於5的行'
one | two | three | |
---|---|---|---|
Colorado | 0 | 5 | 6 |
Utah | 8 | 9 | 10 |
New York | 12 | 13 | 14 |
So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 5-4 provides a short summary of many of them. As you'll see later, there are a number of additional options for working with hierarchical indexes.
When originally designing pandas, I felt that having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. I made the design trade-off to push all of the fancy indexing behavior (both labels and integers) into the ix operator. In practice, this led to many edge cases in data with integer axis labels, so the pandas team decided to create the loc and iloc operators to deal with strictly label-based and integer-based indexing, respectively.
The ix indexing operator still exists, but it is deprecated. I do not recommend using it.
Indexing options with DataFrame
- df[col_name] / df[[col1, col2]]: select a single column or multiple columns
- df.loc[row_label]: select a single row or multiple rows by label
- df.loc[:, col_label]: select a single column or multiple columns by label
- df.loc[row_label, col_label]: select rows and columns by label
- df.iloc[where]: select rows by integer position
- df.iloc[:, where]: select columns by integer position
- df.iloc[where_i, where_j]: select rows and columns by integer position
- df.at[label_i, label_j]: select a single scalar value by row and column label (see the sketch after this list)
- df.iat[i, j]: select a single scalar value by row and column integer position
- reindex method: select either rows or columns by labels
- get_value, set_value methods: select a single value by row and column label
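A minimal sketch of at and iat, assuming the data DataFrame from the examples above (where the value at row 'Utah', column 'two' is 9):
# Fast scalar lookups: at uses labels, iat uses integer positions
data.at['Utah', 'two']   # -> 9
data.iat[2, 1]           # -> 9, the same element selected by position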
Integer Indexes
Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:
ser = pd.Series(np.arange(3))
ser
0 0
1 1
2 2
dtype: int32
"習慣以為是Python"
ser[-1]
'習慣以為是Python'
In this case, pandas could "fall back" on integer indexing, but it is difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based or position-based indexing) is difficult:
ser
0 0
1 1
2 2
dtype: int32
On the other hand, with a non-integer index, there is no potential for ambiguity:
ser2 = pd.Series(np.arange(3), index=['a', 'b', 'c'])
ser2
a 0
b 1
c 2
dtype: int32
"這樣就不會引起歧義了, 數值就是行呀"
ser2[-1]
'這樣就不會引起歧義了, 數值就是行呀'
2
To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):
(Good practice: use df[col] to select columns, loc for label-based selection, and iloc for integer-based selection, rather than mixing them.)
'Select the first row; the end point is exclusive'
ser[:1]
'Select rows with labels 0 and 1; with loc the end point is inclusive'
ser.loc[:1]
ser.iloc[:1]
'Select the first row; the end point is exclusive'
0 0
dtype: int32
'Select rows with labels 0 and 1; with loc the end point is inclusive'
0 0
1 1
dtype: int32
0 0
dtype: int32
Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let's look at an example.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
index=['a', 'c', 'e', 'f', 'g'])
s1
s2
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
Adding these together yields:
s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
The internal data alignment introduces missing values in the label locations that don't overlap. Missing values will then propagate in further arithmetic computations.
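As a small sketch of that propagation, reusing s1 and s2 from above: positions that became NaN in the sum stay NaN under further arithmetic.
result = s1 + s2   # NaN at labels 'd', 'f', 'g'
result * 2         # the NaN positions remain NaN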
In the case of DataFrame, alignment is performed on both the rows and the columns:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4,3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
b | c | d | |
---|---|---|---|
Ohio | 0 | 1 | 2 |
Texas | 3 | 4 | 5 |
Colorado | 6 | 7 | 8 |
b | d | e | |
---|---|---|---|
Utah | 0 | 1 | 2 |
Ohio | 3 | 4 | 5 |
Texas | 6 | 7 | 8 |
Oregon | 9 | 10 | 11 |
Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
"The data is automatically aligned"
df1 + df2
'The data is automatically aligned'
b | c | d | e | |
---|---|---|---|---|
Colorado | NaN | NaN | NaN | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Oregon | NaN | NaN | NaN | NaN |
Texas | 9.0 | NaN | 12.0 | NaN |
Utah | NaN | NaN | NaN | NaN |
Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
A | |
---|---|
0 | 1 |
1 | 2 |
B | |
---|---|
0 | 3 |
1 | 4 |
"沒有相同的行列標簽, 節后是標簽保留了, 值都為NaN"
df1 + df2
'沒有相同的行列標簽, 節后是標簽保留了, 值都為NaN'
A | B | |
---|---|---|
0 | NaN | NaN |
1 | NaN | NaN |
Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
"第2行, b列的元素, 修改為NaN"
df2.loc[1, 'b'] = np.nan
df1
df2
'第2行, b列的元素, 修改為NaN'
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 |
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
Adding these together results in NA values in the locations that don't overlap:
"Values at matching labels are combined; non-overlapping locations become NaN"
df1 + df2
'Values at matching labels are combined; non-overlapping locations become NaN'
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
Using the add method on df1, I pass df2 and an argument to fill_value, which substitutes 0 for any location that is missing in one of the objects before the operation:
df1.add(df2, fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
See Table 5-5 for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has its arguments flipped. So these two statements are equivalent:
1 / df1
"Writing the operator directly and calling the reversed method give the same result"
df1.rdiv(1)
a | b | c | d | |
---|---|---|---|---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
'Writing the operator directly and calling the reversed method give the same result'
a | b | c | d | |
---|---|---|---|---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
'When reindexing, labels that have no match can be filled via fill_value, as with column e here'
df1.reindex(columns=df2.columns, fill_value=0)
'When reindexing, labels that have no match can be filled via fill_value, as with column e here'
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 | 0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 | 0 |
Flexible arithmetic methods (see the sketch after this list):
- add, radd: methods for addition (+)
- sub, rsub: methods for subtraction (-)
- div, rdiv: methods for division (/)
- floordiv, rfloordiv: methods for floor division (//)
- mul, rmul: methods for multiplication (*)
- pow, rpow: methods for exponentiation (**)
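A brief sketch of how the r-prefixed counterparts flip their arguments, reusing df1 and df2 from above (the scalar operands are arbitrary choices for illustration):
df1.sub(1)                  # equivalent to df1 - 1
df1.rsub(1)                 # equivalent to 1 - df1
df1.add(df2, fill_value=0)  # the flexible methods also accept fill_value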
Operations between DataFrame and Series
As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
'Take the first row'
arr[0]
'Broadcasting: arr[0] is subtracted from every row because the shapes match'
arr - arr[0]
'Take the first row'
array([0, 1, 2, 3])
'Broadcasting: arr[0] is subtracted from every row because the shapes match'
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])
When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
"選取第一行"
series = frame.iloc[0]
series
'選取第一行'
b 0
d 1
e 2
Name: Utah, dtype: int32
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows:
"Matching is on the columns; broadcasting goes down the rows"
frame - series
'Matching is on the columns; broadcasting goes down the rows'
b | d | e | |
---|---|---|---|
Utah | 0 | 0 | 0 |
Ohio | 3 | 3 | 3 |
Texas | 6 | 6 | 6 |
Oregon | 9 | 9 | 9 |
If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
"Labels with no match become NaN"
frame + series2
'Labels with no match become NaN'
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
series3 = frame['d']
frame
series3
b | d | e | |
---|---|---|---|
Utah | 0 | 1 | 2 |
Ohio | 3 | 4 | 5 |
Texas | 6 | 7 | 8 |
Oregon | 9 | 10 | 11 |
Utah 1
Ohio 4
Texas 7
Oregon 10
Name: d, dtype: int32
"指定廣播的軸0, 每一了行都被廣播"
frame.sub(series3, axis='index')
'指定廣播的軸0, 每一了行都被廣播'
b | d | e | |
---|---|---|---|
Utah | -1 | 0 | 1 |
Ohio | -1 | 0 | 1 |
Texas | -1 | 0 | 1 |
Oregon | -1 | 0 | 1 |
The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:
# randn draws samples from the standard normal distribution
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | 0.872481 | -0.026409 | 1.130246 |
Ohio | -0.793998 | -1.394605 | -0.224205 |
Texas | -0.120480 | 0.243161 | 1.627977 |
Oregon | -2.734813 | -2.009582 | 0.905505 |
np.abs(frame)
b | d | e | |
---|---|---|---|
Utah | 0.872481 | 0.026409 | 1.130246 |
Ohio | 0.793998 | 1.394605 | 0.224205 |
Texas | 0.120480 | 0.243161 | 1.627977 |
Oregon | 2.734813 | 2.009582 | 0.905505 |
Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this:
"自定義一個求極差的函數, 作為參數傳給DF"
f = lambda x: x.max() - x.min()
"默認方向映射每列"
frame.apply(f)
'自定義一個求極差的函數, 作為參數傳給DF'
'默認方向映射每列'
b 3.607294
d 2.252743
e 1.852182
dtype: float64
Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column of frame. The result is a Series having the columns of frame as its index.
If you pass axis='columns' to apply, the function will be invoked once per row instead:
"Default: one result per column; with axis=1 (the columns direction), one result per row"
"axis=1 means the function operates across the columns, i.e., once per row"
frame.apply(f, axis=1)
"Default: one result per column; with axis=1 (the columns direction), one result per row"
'axis=1 means the function operates across the columns, i.e., once per row'
Utah 1.156654
Ohio 1.170400
Texas 1.748457
Oregon 3.640318
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
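For example, column sums are available directly as a method, so the two calls below give the same result (a small sketch using the frame from above):
frame.sum()          # built-in column-wise reduction
frame.apply(np.sum)  # equivalent, but apply is unnecessary here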
The function passed to apply need not return a scalar value; it can also return a Series with multiple values:
def f(x):
    "Return the minimum and maximum of each column as a Series"
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
"Pass the function object to apply"
frame.apply(f)
'Pass the function object to apply'
b | d | e | |
---|---|---|---|
min | -2.734813 | -2.009582 | -0.224205 |
max | 0.872481 | 0.243161 | 1.627977 |
Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string (say, to two decimal places) from each floating-point value in frame. You can do this with applymap:
format = lambda x: '%.2f' % x
"applymap() 映射所有元素, 而apply是有軸方向的"
frame.applymap(format)
'applymap() 映射所有元素, 而apply是有軸方向的'
b | d | e | |
---|---|---|---|
Utah | 0.87 | -0.03 | 1.13 |
Ohio | -0.79 | -1.39 | -0.22 |
Texas | -0.12 | 0.24 | 1.63 |
Oregon | -2.73 | -2.01 | 0.91 |
The reason for the name applymap is that Series has a map method for applying an element-wise function:
"Series.map applies a function element-wise"
frame['e'].map(format)
'Series.map applies a function element-wise'
Utah 1.13
Ohio -0.22
Texas 1.63
Oregon 0.91
Name: e, dtype: object
Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
"sort_index() 按索引排序"
obj.sort_index()
'sort_index() 按索引排序'
a 1
b 2
c 3
d 0
dtype: int64
With a DataFrame, you can sort by index on either axis:
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame
"按索引排序, 默認axis=0, 下方向"
frame.sort_index()
"按列索引排序, axis=1, 右方向"
frame.sort_index(axis=1)
d | a | b | c | |
---|---|---|---|---|
three | 0 | 1 | 2 | 3 |
one | 4 | 5 | 6 | 7 |
'Sort by the row index (default axis=0)'
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
'Sort by the column index with axis=1'
a | b | c | d | |
---|---|---|---|---|
three | 1 | 2 | 3 | 0 |
one | 5 | 6 | 7 | 4 |
The data is sorted in ascending order by default, but can be sorted in descending order, too:
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 0 | 3 | 2 | 1 |
one | 4 | 7 | 6 | 5 |
To sort a Series by its values, use its sort_values method.
obj = pd.Series([4, 7, -3, 2])
"sort_values()按值排序"
obj.sort_values()
'sort_values()按值排序'
2 -3
3 2
0 4
1 7
dtype: int64
Any missing values are sorted to the end of the Series by default:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
"Missing values are sorted to the end"
obj.sort_values()
'Missing values are sorted to the end'
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
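If you would rather have the missing values first, sort_values accepts an na_position option (a quick sketch; the default is 'last'):
obj.sort_values(na_position='first')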
When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:
frame = pd.DataFrame({
'b':[4,7,3,-2],
'a':[0,1,0, 1]
})
frame
b | a | |
---|---|---|
0 | 4 | 0 |
1 | 7 | 1 |
2 | 3 | 0 |
3 | -2 | 1 |
"by='column_name' 按某個字段進行排序"
frame.sort_values(by='b')
"by='column_name' 按某個字段進行排序"
b | a | |
---|---|---|
3 | -2 | 1 |
2 | 3 | 0 |
0 | 4 | 0 |
1 | 7 | 1 |
"To sort by multiple columns, pass a list of names"
frame.sort_values(by=['a', 'b'])
'To sort by multiple columns, pass a list of names'
b | a | |
---|---|---|
2 | 3 | 0 |
0 | 4 | 0 |
3 | -2 | 1 |
1 | 7 | 1 |
Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default, rank breaks ties by assigning each group the mean rank. For example, in the Series below the two 7s would occupy positions 5 and 6 in sorted order, so each receives the rank 5.5:
obj = pd.Series([7, -5, 7, 4, 2, 8, 4])
obj.rank()
"Ranks can also be assigned according to the order in which they're observed in the data"
obj.rank(method='first')
0 5.5
1 1.0
2 5.5
3 3.5
4 2.0
5 7.0
6 3.5
dtype: float64
"Ranks can also be assigned according to the order in which they're observed in the data"
0 5.0
1 1.0
2 6.0
3 3.0
4 2.0
5 7.0
6 4.0
dtype: float64
Here, instead of using the average rank 5.5 for the entries 0 and 2, they instead have been set to 5 and 6 because label 0 precedes label 2 in the data.
You can rank in descending order, too:
# Assign values the maximum rank in the group
obj.rank(ascending=False, method='max')
0 3.0
1 7.0
2 3.0
3 5.0
4 6.0
5 1.0
6 5.0
dtype: float64
See Table 5-6 for a list of the tie-breaking methods available.
DataFrame can compute ranks over the rows or the columns:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
b | a | c | |
---|---|---|---|
0 | 4.3 | 0 | -2.0 |
1 | 7.0 | 1 | 5.0 |
2 | -3.0 | 0 | 8.0 |
3 | 2.0 | 1 | -2.5 |
frame.rank(axis=1)
b | a | c | |
---|---|---|---|
0 | 3.0 | 2.0 | 1.0 |
1 | 3.0 | 1.0 | 2.0 |
2 | 1.0 | 2.0 | 3.0 |
3 | 3.0 | 2.0 | 1.0 |
Tie-breaking methods for rank (see the sketch after this list):
- average: default; assign the average rank to each entry in the equal group
- max: use the maximum rank for the whole group
- min: use the minimum rank for the whole group
- first: assign ranks in the order the values appear in the data
- dense: like min, but ranks always increase by 1 between groups rather than by the number of equal elements in a group
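To see how the tie-breaking methods differ, compare a couple of them on the same Series used above (a quick sketch; note that 'dense' leaves no gap after a tie):
obj = pd.Series([7, -5, 7, 4, 2, 8, 4])
obj.rank(method='min')    # the tied 7s both get rank 5.0, and 8 gets 7.0
obj.rank(method='dense')  # the tied 7s get 4.0, and 8 gets 5.0 (no gap)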
Axis Indexes with Duplicate Labels
Up until now all of the examples we've looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it's not mandatory. Let's consider a small Series with duplicate indices:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
The index's is_unique property can tell you whether its labels are unique or not:
"The is_unique property tells you whether the index labels are unique"
obj.index.is_unique
'The is_unique property tells you whether the index labels are unique'
False
Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:
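A short sketch with the obj Series above: the duplicated label returns a Series, while the unique label returns a scalar.
obj['a']   # duplicated label -> a Series with two entries (0 and 1)
obj['c']   # unique label -> the scalar 4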
df = pd.DataFrame(np.random.randn(4,3), index=['a', 'a', 'b', 'b'])
df
0 | 1 | 2 | |
---|---|---|---|
a | -1.160530 | -0.226480 | 0.608358 |
a | -1.052758 | -0.783890 | 0.920109 |
b | -0.520996 | -0.706842 | 0.459379 |
b | 0.813595 | 1.052030 | 0.263111 |
"選取行索引為 b 的行, 重復, 則返回df"
df.loc['b']
'選取行索引為 b 的行, 重復, 則返回df'
0 | 1 | 2 | |
---|---|---|---|
b | -0.520996 | -0.706842 | 0.459379 |
b | 0.813595 | 1.052030 | 0.263111 |