reindex: re-indexing
A pandas object has an important method, reindex, which creates a new object conformed to a new index.
Take a Series as an example:

>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> series_obj = Series([4.5,1.3,5,-5.5],index=('a','b','c','d'))
>>> series_obj
a    4.5
b    1.3
c    5.0
d   -5.5
dtype: float64
>>> obj2 = series_obj.reindex(['a','b','c','e','f'])
>>> obj2
a    4.5
b    1.3
c    5.0
e    NaN
f    NaN
dtype: float64
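Labels that are missing from the original index get NaN by default; pass fill_value to fill them with another value while reindexing.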

>>> obj3 = series_obj.reindex(['a','b','c','e','f'],fill_value=0)
>>> obj3
a    4.5
b    1.3
c    5.0
e    0.0
f    0.0
dtype: float64
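For ordered data such as time series, reindexing may need to fill in values for the new labels. The method parameter of reindex provides this.
The available options for method are:
ffill or pad: forward fill (carry the previous value forward)
bfill or backfill: backward fill (carry the next value backward)
A row with no preceding value (for forward fill) or no following value (for backward fill) is filled with NaN.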

>>> obj4 = Series(['red','blue','green'],index=[0,2,4])
>>> obj4
0      red
2     blue
4    green
dtype: object
>>> obj4.reindex(range(6),method='ffill')
0      red
1      red
2     blue
3     blue
4    green
5    green
dtype: object
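The two fill directions are easiest to compare side by side. The following is a minimal sketch that reuses the data from obj4 above; the variable names (colors, forward, backward) are only illustrative.

```python
import pandas as pd

# Same data as obj4 above: values exist only at labels 0, 2 and 4.
colors = pd.Series(['red', 'blue', 'green'], index=[0, 2, 4])

# Forward fill: each new label takes the last known value before it.
forward = colors.reindex(range(6), method='ffill')

# Backward fill: each new label takes the next known value after it.
# Label 5 has no following value, so it is left as NaN.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

Note that method-based filling expects the original index to be sorted (monotonic), which is the case here.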
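Reindexing a DataFrame
When only a sequence is passed, the rows are reindexed by default. Keyword arguments can be used to specify the row index (index) and the column index (columns) explicitly.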

>>> frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','b','c'],columns = ['Ohio','Texas',"Cali"])
>>> frame2 = frame.reindex(['a','b','c','d'])
>>> frame2
   Ohio  Texas  Cali
a   0.0    1.0   2.0
b   3.0    4.0   5.0
c   6.0    7.0   8.0
d   NaN    NaN   NaN

>>> frame3 = frame.reindex(columns = ['Ohio','Texas','Cali','Wile'],index=['a','b','c','d'],fill_value=4)
>>> frame3
   Ohio  Texas  Cali  Wile
a     0      1     2     4
b     3      4     5     4
c     6      7     8     4
d     4      4     4     4
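When the rows and columns of a DataFrame are reindexed at the same time, interpolation (the method argument) is applied only along the rows, at least in the pandas version used here.
With the label-indexing feature of ix, reindexing can be written more concisely. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the following only runs on older versions.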

>>> frame5 = frame.ix[['a','b','c','d'], ['Ohio','Texas','Cali','Wile']]
>>> frame5
   Ohio  Texas  Cali  Wile
a   0.0    1.0   2.0   NaN
b   3.0    4.0   5.0   NaN
c   6.0    7.0   8.0   NaN
d   NaN    NaN   NaN   NaN
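On current pandas versions, where ix no longer exists, the same result can be obtained with reindex itself. The sketch below assumes a recent pandas release. The subset variable is only illustrative, and it shows that loc selects labels that already exist rather than creating new ones (in pandas 1.0 and later, loc raises KeyError if any listed label is missing).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'Cali'])

# reindex accepts labels that do not exist yet and fills them with NaN.
frame5 = frame.reindex(index=['a', 'b', 'c', 'd'],
                       columns=['Ohio', 'Texas', 'Cali', 'Wile'])
print(frame5)

# loc, by contrast, only selects labels that are already present.
subset = frame.loc[['a', 'c'], ['Ohio', 'Cali']]
print(subset)
```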
drop: removing entries from a given axis

>>> obj = Series(np.arange(5),index=['a','b','c','d','e'])
>>> obj
a    0
b    1
c    2
d    3
e    4
dtype: int32
>>> new_obj = obj.drop('b')
>>> new_obj
a    0
c    2
d    3
e    4
dtype: int32
>>> new_obj2 = obj.drop(['b','c'])
>>> new_obj2
a    0
d    3
e    4
dtype: int32

# DataFrame
>>> frame = DataFrame(np.arange(16).reshape((4,4)),index=['a','b','c','d'],columns=['one','two','three','four'])
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> new_frame = frame.drop('a')
>>> new_frame
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> new_frame2 = frame.drop(['two','four'],axis = 1)
>>> new_frame2
   one  three
a    0      2
b    4      6
c    8     10
d   12     14
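Two details of drop are worth seeing explicitly: it returns a new object and leaves the original untouched, and columns can be named with the columns keyword instead of axis=1. The following is a small sketch; the variable names are illustrative, and the columns keyword assumes pandas 0.21 or later.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

# axis=0 (rows) is the default, so this removes the row labelled 'a'.
without_row = frame.drop('a')

# Two equivalent ways to remove columns.
without_cols = frame.drop(['two', 'four'], axis=1)
same_cols = frame.drop(columns=['two', 'four'])

# The original frame still has all four rows and columns.
print(frame.shape)
print(without_cols)
```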
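Indexing, selection, and filtering
A Series can be indexed either by position, like a NumPy array, or by the labels of its custom index.
>>> obj
a    0
b    1
c    2
d    3
e    4
dtype: int32
>>> obj['a']
0
>>> obj[1]
1
Note: slicing with labels is closed on the right, that is, the end label is included. Slicing with integer positions excludes the end.
>>> obj['a':'c']
a    0
b    1
c    2
dtype: int32
>>> obj[3:4]
d    3
dtype: int32
>>> obj[2:3]
c    2
dtype: int32
>>> obj[[3,1]]
d    3
b    1
dtype: int32
>>> obj[['a','c']]
a    0
c    2
dtype: int32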
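The difference between the two slicing rules can be made explicit with loc (labels) and iloc (positions). This is a minimal sketch, assuming a pandas version that provides both accessors; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the end label 'c' is included.
by_label = obj.loc['a':'c']

# Position-based slice: the end position 2 is excluded, as in NumPy.
by_position = obj.iloc[0:2]

print(by_label)
print(by_position)
```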
Modifying values through indexing
>>> obj[['b','d']] *=2
>>> obj
a    0
b    2
c    2
d    6
e    4
dtype: int32
Indexing a DataFrame:
Direct indexing with a label (or a list of labels) can only retrieve columns.
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> frame['a']
KeyError: 'a'
>>> frame['one']
a     0
b     4
c     8
d    12
Name: one, dtype: int32
>>> frame[['one','four']]
   one  four
a    0     3
b    4     7
c    8    11
d   12    15
Indexing with a slice or a boolean array selects rows.
>>> frame[1:3]   # half-open interval: the end position is excluded
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
>>> frame[frame['three'] > 8]
   one  two  three  four
c    8    9     10    11
d   12   13     14    15
The DataFrame indexing field ix
>>> frame.ix['a']   # index rows by label
one      0
two      1
three    2
four     3
Name: a, dtype: int32
>>> frame.ix[['b','d']]
   one  two  three  four
b    4    5      6     7
d   12   13     14    15
>>> frame.ix[1]   # also indexes rows, here by position
one      4
two      5
three    6
four     7
Name: b, dtype: int32
>>> frame.ix[1:3]
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
>>> frame.ix[1:2,[2,3,1]]
   three  four  two
b      6     7    5
>>> frame.ix[1:3,[2,3,1]]
   three  four  two
b      6     7    5
c     10    11    9
>>> frame.ix[['b','d'],['one','three']]
   one  three
b    4      6
d   12     14
>>> frame.ix[['b','d'],[3,1,2]]
   four  two  three
b     7    5      6
d    15   13     14
>>> frame.ix[:,[2,3,1]]   # select all rows
   three  four  two
a      2     3    1
b      6     7    5
c     10    11    9
d     14    15   13
>>> frame.ix[frame.three >5,:3]
   one  two  three
b    4    5      6
c    8    9     10
d   12   13     14
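Because ix was removed in pandas 1.0, the selections above need loc (labels and boolean masks) or iloc (integer positions) on current versions. The following sketch shows one possible translation of several of the ix calls; the variable names are illustrative, and chaining loc with iloc is just one way to mix labels and positions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

row_a = frame.loc['a']                 # frame.ix['a']: row by label
row_1 = frame.iloc[1]                  # frame.ix[1]: row by position
block = frame.iloc[1:3, [2, 3, 1]]     # frame.ix[1:3, [2, 3, 1]]

# Row labels combined with column positions: select rows first, then columns.
mixed = frame.loc[['b', 'd']].iloc[:, [3, 1, 2]]

# Boolean row mask combined with the first three columns by position.
filtered = frame.loc[frame['three'] > 5].iloc[:, :3]

print(block)
print(mixed)
print(filtered)
```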
Arithmetic and data alignment
>>> s1 = Series([1.3,4.5,6.6,3.4],index=['a','b','c','d'])
>>> s2 = Series([1,2,3,4,5,6,7],index=['a','b','c','d','e','f','g'])
>>> s1+s2
a    2.3
b    6.5
c    9.6
d    7.4
e    NaN
f    NaN
g    NaN
dtype: float64
# Missing values are introduced where the indexes do not overlap.
# The same applies to DataFrame.
Filling missing values in arithmetic methods
>>> df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))
>>> df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde'))
>>> df1+df2   # ordinary arithmetic produces missing values
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
# With the arithmetic methods, the missing values can be filled.
>>> df1.add(df2,fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
The arithmetic methods are:
add: addition
sub: subtraction
div: division
mul: multiplication
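Each of these methods accepts the same fill_value argument as add. The sketch below reuses df1 and df2 from the example above; the choice of fill_value=1 for multiplication is only an illustration of picking a neutral element, not something the methods require.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20).reshape((4, 5)), columns=list('abcde'))

# A cell that exists in only one frame is replaced by fill_value
# in the other frame before the operation is applied.
total = df1.add(df2, fill_value=0)
diff = df1.sub(df2, fill_value=0)
product = df1.mul(df2, fill_value=1)

# The plain operator leaves NaN wherever the two frames do not overlap.
missing = (df1 + df2).isna().sum().sum()

print(total)
print(diff)
print(product)
print(missing)
```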
Operations between DataFrame and Series
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> series = frame.ix[0]
>>> series
one      0
two      1
three    2
four     3
Name: a, dtype: int32
>>> frame - series
   one  two  three  four
a    0    0      0     0
b    4    4      4     4
c    8    8      8     8
d   12   12     12    12
An operation between the two matches the index of the Series against the columns of the DataFrame and then broadcasts down the rows.
If an index value is not found in either the DataFrame's columns or the Series' index, the two objects are reindexed to form the union of both.
>>> series2 = Series(range(3),index = ['two','four','five'])
>>> frame +series2
   five  four  one  three   two
a   NaN   4.0  NaN    NaN   1.0
b   NaN   8.0  NaN    NaN   5.0
c   NaN  12.0  NaN    NaN   9.0
d   NaN  16.0  NaN    NaN  13.0
To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods with axis=0.
>>> series3 = frame['two']
>>> frame.sub(series3,axis = 0)
   one  two  three  four
a   -1    0      1     2
b   -1    0      1     2
c   -1    0      1     2
d   -1    0      1     2
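To see why the axis argument matters, compare the plain operator with the method form on the same column. This is a minimal sketch using iloc in place of ix; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

# Default behaviour: the Series index is matched against the columns.
first_row = frame.iloc[0]
by_columns = frame - first_row

# A column has the row labels as its index, so the plain operator finds
# no matching column names and produces only NaN over the union of labels.
col_two = frame['two']
misaligned = frame - col_two

# axis=0 matches the Series index against the row labels instead.
by_rows = frame.sub(col_two, axis=0)

print(by_columns)
print(misaligned)
print(by_rows)
```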