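This article walks through a set of basic Pandas operations so that you can get productive with the library quickly.
Some familiarity with NumPy helps; if you are not yet comfortable with NumPy, read the companion article on basic NumPy operations first.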
Pandas data structures
Pandas has two core data structures: Series and DataFrame.
Series
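A Series is a one-dimensional array-like object that holds a sequence of values together with an associated index.
import numpy as np
import pandas as pd

obj = pd.Series([6, 66, 666, 6666])
obj
0       6
1      66
2     666
3    6666
dtype: int64
When no index is given, the default index runs from 0 to N - 1. The values and the index of a Series can be accessed separately:
obj.values
obj.index  # similar to range(4)
array([   6,   66,  666, 6666])

RangeIndex(start=0, stop=4, step=1)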
The index can also be specified as labels:
obj2 = pd.Series([6, 66, 666, 6666], index=['d', 'b', 'a', 'c'])
obj2
obj2.index
d       6
b      66
a     666
c    6666
dtype: int64

Index(['d', 'b', 'a', 'c'], dtype='object')
Values can then be read and written by label, which feels like a cross between a NumPy array and a dict.
obj2['a']
obj2['d'] = 66666
obj2[['c', 'a', 'd']]
666

c     6666
a      666
d    66666
dtype: int64
A Series supports many NumPy-like operations, and the index labels are preserved in the result:
obj2[obj2 > 100]
obj2 / 2
np.sqrt(obj2)
d    66666
a      666
c     6666
dtype: int64

d    33333.0
b       33.0
a      333.0
c     3333.0
dtype: float64

d    258.197599
b      8.124038
a     25.806976
c     81.645576
dtype: float64
Checking whether a label exists in the index:
'b' in obj2
'e' in obj2
True

False
A dict can be passed directly to Series to create a Series object:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
states = ['Texas', 'California', 'Ohio', 'Oregon']
obj4 = pd.Series(sdata, index=states)  # specifies the labels and their order
obj4
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Texas         71000.0
California        NaN
Ohio          35000.0
Oregon        16000.0
dtype: float64
isnull and notnull detect missing values; both are available as top-level Pandas functions and as Series methods:
pd.isnull(obj4)
pd.notnull(obj4)
obj4.isnull()
obj4.notnull()
Texas         False
California     True
Ohio          False
Oregon        False
dtype: bool

Texas          True
California    False
Ohio           True
Oregon         True
dtype: bool

Texas         False
California     True
Ohio          False
Oregon        False
dtype: bool

Texas          True
California    False
Ohio           True
Oregon         True
dtype: bool
In arithmetic, Series are aligned by index label, much like a join in a database:
obj3
obj4
obj3 + obj4
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Texas         71000.0
California        NaN
Ohio          35000.0
Oregon        16000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
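The alignment behaviour can be checked in isolation. The sketch below rebuilds the two Series from the example (the variable names `total`, `union_labels` and `missing` are only illustrative) and confirms that the result is indexed by the union of both indexes:

```python
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
obj4 = pd.Series({'Texas': 71000.0, 'California': float('nan'),
                  'Ohio': 35000.0, 'Oregon': 16000.0})

# The result is indexed by the union of both indexes (outer-join behaviour).
total = obj3 + obj4
union_labels = obj3.index.union(obj4.index)

# A label that is absent from, or NaN in, either operand yields NaN.
missing = total[total.isnull()].index.tolist()
```

Here `missing` holds 'California' (NaN in obj4, absent from obj3) and 'Utah' (absent from obj4).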
Both a Series and its index have a name attribute:
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
Texas         71000.0
California        NaN
Ohio          35000.0
Oregon        16000.0
Name: population, dtype: float64
The index of a Series can be replaced in place by assignment:
obj
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
0       6
1      66
2     666
3    6666
dtype: int64

Bob         6
Steve      66
Jeff      666
Ryan     6666
dtype: int64
DataFrame
A DataFrame represents a rectangular table of data containing an ordered collection of columns, each of which can have a different data type.
A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same row index.
A common way to create one is from a dict of equal-length lists (or NumPy arrays):
# every value is a list of length 6
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [2.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)
frame
    state  year  pop
0    Ohio  2000  2.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
The head method selects the first few rows (five by default):
frame.head()
frame.head(2)
    state  year  pop
0    Ohio  2000  2.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

  state  year  pop
0  Ohio  2000  2.5
1  Ohio  2001  1.7
If a sequence of columns is specified, the columns are arranged in that order:
pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  2.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
The row index can be specified in the same way. A requested column that is not in the data ('debt' here) is filled with NaN:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2
frame2.columns
frame2.index
       year   state  pop debt
one    2000    Ohio  2.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
A column of a DataFrame can be retrieved by key or by attribute; the result is a Series:
frame2['state']
frame2.year
type(frame2.year)
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

pandas.core.series.Series
The loc attribute retrieves a row by label:
frame2.loc['three']
type(frame2.loc['three'])
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

pandas.core.series.Series
Assigning to the 'debt' column, first with a scalar and then with an array of matching length:
frame2['debt'] = 16.5
frame2
frame2['debt'] = np.arange(6.)
frame2
       year   state  pop  debt
one    2000    Ohio  2.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

       year   state  pop  debt
one    2000    Ohio  2.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0
A Series can also be assigned to a DataFrame column; its labels are aligned to the DataFrame's index and missing labels become NaN:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
       year   state  pop  debt
one    2000    Ohio  2.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
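Assigning to a column that does not exist creates a new column:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  2.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
Deleting a column:
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In the pandas versions this article was written against, a column taken from a DataFrame is a view on the underlying data rather than a copy, so in-place modifications to it are reflected in the DataFrame. Releases with Copy-on-Write enabled behave differently, so call the copy method explicitly whenever you need an independent copy.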
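A minimal sketch of the explicit copy, using a small hypothetical frame rather than the one above. Because view-versus-copy behaviour depends on the pandas version, the example only demonstrates the part that holds everywhere: a copy is independent of its source.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# An explicit copy owns its data, whatever the pandas version.
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

# The original column is untouched by the change to the copy.
original_first = frame['pop'].iloc[0]
```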
Passing a nested dict: the outer keys become the columns and the inner keys become the row index:
pop = {
    'Nevada': {
        2001: 2.4,
        2002: 2.9,
    },
    'Ohio': {
        2000: 1.5,
        2001: 1.7,
        2002: 3.6,
    }
}
frame3 = pd.DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
Transposing a DataFrame:
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The values attribute returns the data of the DataFrame as an ndarray:
frame3.values
array([[nan, 1.5],
[2.4, 1.7],
[2.9, 3.6]])
Index objects
Pandas Index objects hold the axis labels and other metadata (such as the axis name).
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index.name = 'alpha'
index[1:]
Index(['a', 'b', 'c'], dtype='object')

Index(['b', 'c'], dtype='object', name='alpha')
Index objects are immutable, so their elements cannot be modified:
index[1] = 'd'  # raises TypeError
An Index behaves like a fixed-size set:
frame3
frame3.columns
frame3.index
# set-like membership tests
'Nevada' in frame3.columns
2000 in frame3.index
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

Index(['Nevada', 'Ohio'], dtype='object')

Int64Index([2000, 2001, 2002], dtype='int64')

True

True
Unlike a set, however, an Index can contain duplicate labels:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
set_labels = set(['foo', 'foo', 'bar', 'bar'])
set_labels
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
{'bar', 'foo'}
Some commonly used Index methods and properties:
- append
- difference
- intersection
- union
- isin
- delete
- drop
- insert
- is_monotonic (spelled is_monotonic_increasing in newer pandas releases)
- is_unique
- unique
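A short sketch of several of the methods listed above, using two hypothetical indexes `left` and `right`. Every call returns a new object, which is consistent with Index being immutable:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)      # labels present in both indexes
either = left.union(right)           # labels present in at least one index
only_left = left.difference(right)   # labels in left that are not in right
mask = left.isin(['a', 'c'])         # boolean NumPy array, one entry per label
longer = left.insert(1, 'x')         # new Index with 'x' at position 1
shorter = left.drop('a')             # new Index without 'a'
```

After these calls `left` still holds its original three labels.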
Essential Pandas functionality
Reindexing
reindex creates a new object whose data is rearranged to conform to a new index; labels that were not present before get NaN.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
The ffill method forward-fills values, which fills in the gaps that reindexing introduces:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')
0      blue
2    purple
4    yellow
dtype: object

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
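Forward filling is only one way to handle the labels introduced by reindex. The sketch below, which assumes the same small Series, compares it with backward filling and with a constant placeholder passed through the fill_value parameter (the placeholder string 'none' is arbitrary):

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = obj3.reindex(range(6), method='ffill')      # carry the previous value forward
backward = obj3.reindex(range(6), method='bfill')     # take the next value instead
constant = obj3.reindex(range(6), fill_value='none')  # fixed placeholder for new labels
```

With backward filling, label 5 stays missing because there is no later value to pull from.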
Both the row index and the column index can be reindexed:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
frame.reindex(['a', 'b', 'c', 'd'])
frame.reindex(columns=['Texas', 'Utah', 'California'])
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Dropping entries from an axis
The drop method removes rows or columns and returns a new object:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
new_obj = obj.drop('c')
new_obj
obj.drop(['d', 'c'])
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
e    4.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data.drop(['Colorado', 'Ohio'])
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
Dropping in place, which modifies the object instead of returning a new one:
obj
obj.drop('c', inplace=True)
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
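Indexing, selection, and filtering
Indexing a Series works much like indexing a NumPy array, except that index labels can be used in addition to integer positions.
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b']
obj.iloc[1]  # positional access; plain obj[1] is deprecated in newer pandas
obj[2:4]
obj[['b', 'a', 'd']]
obj.iloc[[1, 3]]
obj[obj < 2]
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

1.0

1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
d    3.0
dtype: float64

b    1.0
d    3.0
dtype: float64

a    0.0
b    1.0
dtype: float64
Slicing a Series with labels differs from ordinary Python slicing: the end label is included (Python slices include the start but exclude the stop).
obj['b':'c']
b    1.0
c    2.0
dtype: float64
Assigning through a slice:
obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64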
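The difference between label slices and positional slices is easy to verify side by side. A small sketch, using loc and iloc to make the intent of each slice explicit:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]    # positional slice: stop position 2 is excluded
```

The label slice returns two rows while the positional slice returns one, even though both start at 'b'.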
Selecting one or more columns of a DataFrame (a single label gives a Series, a list of labels gives a DataFrame):
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data['two']
type(data['two'])
data[['three', 'one']]
type(data[['three', 'one']])
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

pandas.core.series.Series

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

pandas.core.frame.DataFrame
Selecting rows of a DataFrame with a slice or a boolean array:
data[:2]
data[data['three'] > 5]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Syntactically a DataFrame behaves much like a two-dimensional NumPy array, so it can also be indexed with a boolean DataFrame:
data < 5
data[data < 5] = 0
data
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Selection with loc and iloc
loc selects rows and columns of a DataFrame by label; iloc selects them by integer position.
data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]
four 11
one 8
two 9
Name: Utah, dtype: int64
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
four one two
Colorado 7 0 5
Utah 11 8 9
Slices work with both as well:
data.loc[:'Utah', 'two']
data.iloc[:, :3][data.three > 5]
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
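Common indexing options for a DataFrame:
- df[val]: select columns
- df.loc[val]: select rows by label
- df.loc[:, val]: select columns by label
- df.loc[val1, val2]: select rows and columns by label
- df.iloc[where]: select rows by position
- df.iloc[:, where]: select columns by position
- df.iloc[where_i, where_j]: select rows and columns by position
- df.at[label_i, label_j]: select a single scalar by label
- df.iat[i, j]: select a single scalar by position
- reindex: select rows or columns by label
- get_value, set_value: select a single scalar (removed in later pandas releases; prefer at and iat)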
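The options in the list above can be compared on one small frame. This sketch rebuilds the 4 x 4 example frame (without the earlier in-place assignment) and selects the same data in several ways:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

col = data['two']                                    # one column -> Series
row = data.loc['Utah']                               # one row by label -> Series
block = data.loc[['Ohio', 'Utah'], ['one', 'four']]  # rows and columns by label
same_block = data.iloc[[0, 2], [0, 3]]               # the same selection by position
scalar_by_label = data.at['Utah', 'three']           # single value by label
scalar_by_pos = data.iat[2, 2]                       # single value by position
```

at and iat only accept a single row/column pair, which is why they are the natural choice for scalar access.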
Arithmetic and data alignment
Alignment in Series arithmetic was shown earlier; for a DataFrame, alignment happens on both the rows and the columns:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
df1 + df2
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
When the two DataFrames share row labels but no column labels, every entry of the result is NaN:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
df1 - df2
   A
0  1
1  2

   B
0  3
1  4

    A   B
0 NaN NaN
1 NaN NaN
Arithmetic methods with fill values
Arithmetic between objects with different indexes produces missing values wherever a label exists in only one operand:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
df2
df1 + df2
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
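The fill_value parameter of the arithmetic methods substitutes a value for entries that are missing in one of the operands:
df1.add(df2, fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
Common arithmetic methods (each accepts fill_value; the r-prefixed variants swap the operands):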
- add, radd
- sub, rsub
- div, rdiv
- floordiv, rfloordiv
- mul, rmul
- pow, rpow
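The r-prefixed methods in the list above reverse the order of the operands. A minimal sketch with a hypothetical 2 x 2 frame shows that `df.rdiv(1)` computes the same thing as `1 / df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list('ab'))

via_operator = 1 / df     # scalar on the left-hand side of the division
via_method = df.rdiv(1)   # same computation through the reversed method
```

The method form is mainly useful when an extra argument such as fill_value or axis is needed.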
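Operations between DataFrame and Series
First, consider subtracting a one-dimensional array from a two-dimensional array in NumPy; the subtraction is broadcast across every row.
arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]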
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
array([0., 1., 2., 3.])
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
series
frame - series
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
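If a label is found in only one of the two objects, the result is reindexed to the union of the labels, as in an outer join:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
To match on the row index instead and broadcast across the columns, use one of the arithmetic methods introduced earlier and pass axis=0.
series3 = frame['d']
frame
series3
frame.sub(series3, axis=0)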
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
apply and map
NumPy universal functions (element-wise array functions) also work on Pandas objects:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)
b d e
Utah -0.590458 -0.352861 0.820549
Ohio -1.708280 0.174739 -1.081811
Texas 0.857712 0.246972 0.532208
Oregon 0.812756 1.260538 0.818304
b d e
Utah 0.590458 0.352861 0.820549
Ohio 1.708280 0.174739 1.081811
Texas 0.857712 0.246972 0.532208
Oregon 0.812756 1.260538 0.818304
Another frequent operation is calling a function on each column or each row, which is what apply does:
f = lambda x: x.max() - x.min()
frame.apply(f)
b    2.565993
d    1.613398
e    1.902360
dtype: float64
frame.apply(f, axis=1)
Utah      1.411007
Ohio      1.883019
Texas     0.610741
Oregon    0.447782
dtype: float64
The function passed to apply does not have to return a scalar; it can also return a Series:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)
            b         d         e
min -1.708280 -0.352861 -1.081811
max  0.857712  1.260538  0.820549
Element-wise Python functions can be used as well:
frame
format = lambda x: '%.2f' % x
frame.applymap(format)  # newer pandas releases provide DataFrame.map for the same purpose
b d e
Utah -0.590458 -0.352861 0.820549
Ohio -1.708280 0.174739 -1.081811
Texas 0.857712 0.246972 0.532208
Oregon 0.812756 1.260538 0.818304
b d e
Utah -0.59 -0.35 0.82
Ohio -1.71 0.17 -1.08
Texas 0.86 0.25 0.53
Oregon 0.81 1.26 0.82
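On a DataFrame, element-wise application goes through applymap. The name was chosen because Series already has a map method for applying a function element by element.
frame['e'].map(format)
Utah       0.82
Ohio      -1.08
Texas      0.53
Oregon     0.82
Name: e, dtype: object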
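Since the name of the element-wise DataFrame method differs between pandas versions, one version-independent alternative is to combine apply with Series.map. The sketch below uses a hypothetical two-row frame and a named helper, `two_decimals`, instead of shadowing the built-in `format`:

```python
import pandas as pd

frame = pd.DataFrame({'b': [-0.590458, 0.857712], 'e': [0.820549, 0.532208]},
                     index=['Utah', 'Texas'])

def two_decimals(x):
    """Format one number with two decimal places."""
    return '%.2f' % x

# Series.map is element-wise, so mapping each column formats every cell.
formatted = frame.apply(lambda column: column.map(two_decimals))
```

The result is a DataFrame of strings with the same index and columns as the input.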
Sorting and ranking
Sorting by the row index:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj
obj.sort_index()
d    0
a    1
b    2
c    3
dtype: int64

a    1
b    2
c    3
d    0
dtype: int64
For a DataFrame, sort_index sorts by the row index by default and by the column index when axis=1:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame
frame.sort_index()
frame.sort_index(axis=1)
d a b c
three 0 1 2 3
one 4 5 6 7
d a b c
one 4 5 6 7
three 0 1 2 3
a b c d
three 1 2 3 0
one 5 6 7 4
Sorting in descending order:
frame.sort_index(axis=1, ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
To sort a Series or a DataFrame by its values, use sort_values; missing values are placed at the end by default:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
frame.sort_values(by='b')
frame.sort_values(by=['a', 'b'])
b a
0 4 0
1 7 1
2 -3 0
3 2 1
b a
2 -3 0
3 2 1
0 4 0
1 7 1
b a
2 -3 0
0 4 0
3 2 1
1 7 1
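Ranking assigns each value its rank, starting from 1; by default, tied values all receive the mean of the ranks they span.
obj = pd.Series([7, -5, 7, 4, 2, 4, 0, 4, 4])
obj.rank()
obj.rank(method='first')  # break ties by order of appearance instead of averaging
0    8.5
1    1.0
2    8.5
3    5.5
4    3.0
5    5.5
6    2.0
7    5.5
8    5.5
dtype: float64

0    8.0
1    1.0
2    9.0
3    4.0
4    3.0
5    5.0
6    2.0
7    6.0
8    7.0
dtype: float64
Ranking in descending order, giving each tied group the largest rank in the group:
obj.rank(ascending=False, method='max')
0    2.0
1    9.0
2    2.0
3    6.0
4    7.0
5    6.0
6    8.0
7    6.0
8    6.0
dtype: float64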
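Besides 'average', 'first' and 'max', the method parameter of rank also accepts 'min' and 'dense'. A brief sketch on the same Series shows how the two treat ties:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 4, 0, 4, 4])

by_min = obj.rank(method='min')      # each tied value gets the lowest rank of its group
by_dense = obj.rank(method='dense')  # like 'min', but ranks rise by 1 between groups
```

With 'min' the two 7s share rank 8 (ranks 8 and 9 are tied), while with 'dense' they share rank 5 because only five distinct values exist.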
With axis=1, ranks are computed across the columns within each row:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis=1)
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
Axis indexes with duplicate labels
Duplicate index labels on a Series: selecting a repeated label returns a Series, while a unique label returns a scalar.
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
obj.index.is_unique
obj['a']
obj['c']
a    0
a    1
b    2
b    3
c    4
dtype: int64

False

a    0
a    1
dtype: int64

4
Duplicate index labels on a DataFrame:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']
0 1 2
a -0.975030 2.041130 1.022168
a 0.321428 2.124496 0.037530
b 0.343309 -0.386692 -0.577290
b 0.002090 -0.890841 1.759072
0 1 2
b 0.343309 -0.386692 -0.577290
b 0.002090 -0.890841 1.759072
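Because the return type of a label lookup depends on whether the label is repeated, it can help to detect or remove duplicates first. The sketch below shows two possible strategies (keeping the first occurrence, or aggregating per label); which one is appropriate depends on the data:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj.index.duplicated()            # True for each repeat after the first occurrence
first_only = obj[~repeated]                  # keep the first value for each label
per_label_sum = obj.groupby(level=0).sum()   # or aggregate the values sharing a label
```

Both results have a unique index, so a label lookup on them always returns a scalar.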
Summarizing and computing descriptive statistics
Summing with sum (missing values are skipped unless skipna=False is passed):
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
df.sum()
df.sum(axis=1)
df.mean(axis=1, skipna=False)  # do not skip NaN
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

one    9.25
two   -5.80
dtype: float64

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
idxmax and idxmin return the index label at which the maximum or minimum value occurs:
df.idxmax()
df.idxmin(axis=1)
one    b
two    d
dtype: object

a    one
b    two
c    NaN
d    two
dtype: object
Cumulative sum:
df.cumsum()
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
The describe method produces several summary statistics in one call:
df.describe()
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
# non-numeric data
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
obj.describe()
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

count     16
unique     3
top        a
freq       8
dtype: object
Some commonly used statistical methods:
- count
- describe
- min, max
- argmin, argmax
- idxmin, idxmax
- quantile
- sum
- mean
- median
- mad (mean absolute deviation; no longer available in recent pandas releases)
- prod
- var
- std
- skew
- kurt
- cumsum
- cummin, cummax
- cumprod
- diff
- pct_change
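A few of the methods in the list above, applied to a small hypothetical Series. The mean absolute deviation is computed by hand here, on the assumption that a built-in mad method may not exist in the installed pandas version:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

median = s.quantile(0.5)     # linear interpolation between the two middle values
steps = s.diff()             # difference from the previous element (first is NaN)
growth = s.pct_change()      # relative change from the previous element (first is NaN)
running_max = s.cummax()     # largest value seen so far

# Mean absolute deviation around the mean, written out explicitly.
mad = (s - s.mean()).abs().mean()
```

Because each value doubles, every entry of `growth` after the first is 1.0, that is, a 100% increase.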
Correlation and covariance
First install a package for reading remote data sets:
pip install pandas-datareader -i https://pypi.douban.com/simple
Download a data set of stock quotes:
import pandas_datareader.data as web

# Requires network access; the Yahoo data source may not always be available.
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})
Compute the percentage change of the prices:
returns = price.pct_change()
returns.tail()
                AAPL       IBM      MSFT      GOOG
Date
2019-08-06  0.018930 -0.000213  0.018758  0.015300
2019-08-07  0.010355 -0.011511  0.004380  0.003453
2019-08-08  0.022056  0.018983  0.026685  0.026244
2019-08-09 -0.008240 -0.028337 -0.008496 -0.013936
2019-08-12 -0.002537 -0.014765 -0.013942 -0.011195
The corr method of a Series computes the correlation between two Series over their overlapping, non-missing values, and cov computes the covariance.
returns['MSFT'].corr(returns['IBM'])
returns.MSFT.cov(returns['IBM'])
0.48863990166304594

8.714318020797283e-05
The corr and cov methods of a DataFrame return the full correlation and covariance matrices:
returns.corr()
returns.cov()
AAPL IBM MSFT GOOG
AAPL 1.000000 0.381659 0.453727 0.459663
IBM 0.381659 1.000000 0.488640 0.402751
MSFT 0.453727 0.488640 1.000000 0.535898
GOOG 0.459663 0.402751 0.535898 1.000000
AAPL IBM MSFT GOOG
AAPL 0.000266 0.000077 0.000107 0.000117
IBM 0.000077 0.000152 0.000087 0.000077
MSFT 0.000107 0.000087 0.000209 0.000120
GOOG 0.000117 0.000077 0.000120 0.000242
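The corrwith method of a DataFrame computes correlations with another Series or DataFrame. Passing a Series gives the correlation of each column with that Series; passing a DataFrame gives the correlations of the matching column names:
returns.corrwith(returns.IBM)
returns.corrwith(volume)
AAPL    0.381659
IBM     1.000000
MSFT    0.488640
GOOG    0.402751
dtype: float64

AAPL   -0.061924
IBM    -0.151708
MSFT   -0.089946
GOOG   -0.018591
dtype: float64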
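The same calls can be tried without network access. The sketch below substitutes synthetic prices for the downloaded quotes; the tickers 'AAA', 'BBB' and 'CCC', the seed, and the return distribution are all made up for illustration, so the numbers carry no financial meaning:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2019-01-01', periods=250, freq='B')

# Hypothetical prices: cumulative products of small random daily returns.
price = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.01, size=(250, 3)), axis=0),
    index=dates, columns=['AAA', 'BBB', 'CCC'])

returns = price.pct_change()
pairwise = returns['AAA'].corr(returns['BBB'])   # a single number
matrix = returns.corr()                          # 3 x 3 correlation matrix
against_one = returns.corrwith(returns['AAA'])   # one value per column
```

The pairwise value equals the corresponding entry of the matrix, and the correlation of a column with itself is 1.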
Unique values, value counts, and membership
The unique method returns an array of the unique values in a Series:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj.unique()
array(['c', 'a', 'd', 'b'], dtype=object)
The value_counts method returns the frequency of each value:
obj.value_counts()
pd.value_counts(obj.values, sort=False)  # top-level equivalent, here without sorting
a    3
c    3
b    2
d    1
dtype: int64

c    3
b    2
d    1
a    3
dtype: int64
isin checks whether each value in a Series belongs to a given set, which is useful for filtering:
obj
mask = obj.isin(['b', 'c'])
mask
obj[mask]
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

0    c
5    b
6    b
7    c
8    c
dtype: object
Index.get_indexer converts an array of possibly repeated values into integer positions within an array of distinct values (handy for turning labels into numeric codes):
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2])
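One detail worth knowing when using get_indexer for label encoding is how it reports values that are not in the index. A small sketch with a hypothetical unknown label 'z':

```python
import pandas as pd

categories = pd.Index(['c', 'b', 'a'])

# 'z' does not occur in categories, so its position is reported as -1.
codes = categories.get_indexer(['c', 'a', 'z'])
```

Checking for -1 in the result is therefore a quick way to detect labels that were not anticipated.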
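Computing value counts for several columns at once (a histogram per column); values that do not occur in a column are filled with 0:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
data.apply(pd.value_counts).fillna(0)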
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
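The per-column counts can also be written with the Series.value_counts method instead of the top-level function, which may be the safer spelling if the top-level function is deprecated in the installed pandas version. A sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each column separately; apply aligns the counts on the union of values,
# and fillna(0) replaces the gaps for values a column never contains.
counts = data.apply(lambda column: column.value_counts()).fillna(0)
```

The row labels of `counts` are the distinct values found anywhere in the frame, and each cell is how often that value occurs in that column.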
References
- Python for Data Analysis, 2nd Edition, by Wes McKinney
