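2 Basic Functionality
This section covers only the basics; more advanced features are left to explore as the need arises.
2.1 Reindexing
reindex is an important pandas method; it returns a new object whose data is conformed to a new index. The examples assume import numpy as np and from pandas import Series, DataFrame. For example: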
In [101]: obj = Series([4,7,-5,3.4],index=['c','a','b','d'])
In [102]: obj
Out[102]:
c 4.0
a 7.0
b -5.0
d 3.4
dtype: float64
In [103]: obj2 = obj.reindex(['a','b','c','d','e'])
In [104]: obj2
Out[104]:
a 7.0
b -5.0
c 4.0
d 3.4
e NaN
dtype: float64
# The value used for missing entries can be customized
In [105]: obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[105]:
a 7.0
b -5.0
c 4.0
d 3.4
e 0.0 # missing value filled with 0
dtype: float64
Interpolation options for the method argument of reindex:
Argument | Description |
---|---|
ffill or pad | Fill values forward |
bfill or backfill | Fill values backward |
In [106]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])
# Forward fill
In [107]: obj3.reindex(range(6),method='ffill')
Out[107]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
# Backward fill
In [109]: obj3.reindex(range(6),method='bfill')
Out[109]:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
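As a minimal sketch of the filling options above, the following repeats the forward and backward fill using the `pd.` namespace and additionally tries the `limit` argument of reindex, which the text does not use; it caps how many consecutive missing labels get filled. The variable names are illustrative only.

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill: each new label takes the value of the closest earlier label.
forward = colors.reindex(range(6), method="ffill")

# Backward fill: each new label takes the value of the closest later label;
# label 5 has no later label, so it stays missing.
backward = colors.reindex(range(6), method="bfill")

# limit=1 fills at most one consecutive gap, so labels 6 and 7 stay missing.
limited = colors.reindex(range(8), method="ffill", limit=1)

print(forward)
print(backward)
print(limited)
```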
For a DataFrame, reindex can change the rows, the columns, or both.
In [111]: frame = DataFrame(np.arange(9).reshape(3,3), index=['a','b','c'],columns=['Ohio','Texas','California'])
In [112]: frame
Out[112]:
Ohio Texas California
a 0 1 2
b 3 4 5
c 6 7 8
In [113]: frame2 = frame.reindex(['a','b','c','d']) # rows are reindexed by default
In [115]: frame2
Out[115]:
Ohio Texas California
a 0.0 1.0 2.0
b 3.0 4.0 5.0
c 6.0 7.0 8.0
d NaN NaN NaN
In [116]: states = ['Texas','Utah','California']
In [117]: frame.reindex(columns=states) # reindex the columns
Out[117]:
Texas Utah California
a 1 NaN 2
b 4 NaN 5
c 7 NaN 8
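# Reindex rows and columns in one call, with interpolation.
# In the pandas version used here, filling is applied only along axis 0, i.e. row-wise.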
In [118]: frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
Out[118]:
Texas Utah California
a 1 NaN 2
b 4 NaN 5
c 7 NaN 8
d 7 NaN 8
# ix is more concise.
In [119]: frame.ix[['a','b','c','d'],states]
Out[119]:
Texas Utah California
a 1.0 NaN 2.0
b 4.0 NaN 5.0
c 7.0 NaN 8.0
d NaN NaN NaN
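The ix indexer used above was deprecated in pandas 0.20 and removed in 1.0, so the following is a hedged sketch of how the same selection might be written with reindex alone. Splitting the fill into a separate row-only step is my own choice, made because the row labels are sorted while the column labels are not; it is not the author's method.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "b", "c"],
    columns=["Ohio", "Texas", "California"],
)
states = ["Texas", "Utah", "California"]

# Conform rows and columns in one call; labels not present become NaN.
result = frame.reindex(index=["a", "b", "c", "d"], columns=states)

# Forward-fill along the (sorted) row index first, then pick the columns.
filled = frame.reindex(index=["a", "b", "c", "d"], method="ffill").reindex(columns=states)

print(result)
print(filled)
```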
[Table: arguments of the reindex function]
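2.2 Dropping Entries from an Axis
To drop entries, pass a single index label or a list of labels. The drop method returns a new object with the specified entries removed.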
In [120]: obj = Series(np.arange(5.),index=['a','b','c','d','e'])
In [121]: new_obj = obj.drop('c')
In [122]: new_obj
Out[122]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [124]: obj.drop(['d','c'])
Out[124]:
a 0.0
b 1.0
e 4.0
dtype: float64
For a DataFrame, index values can be deleted from either axis.
In [125]: data = DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado',
...: 'Utah','New York'],columns=['one','two','three','four'])
In [126]: data
Out[126]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [127]: data.drop(['Colorado','Ohio'])
Out[127]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [128]: data.drop(['two'],axis=1)
Out[128]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
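A small sketch of the same drops using the `index=` and `columns=` keywords, which in the pandas versions I am familiar with (0.21 and later) can replace the `axis` argument. It also confirms that drop leaves the original object untouched. The names `without_rows` and `without_cols` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop row labels; equivalent to data.drop([...], axis=0).
without_rows = data.drop(index=["Colorado", "Ohio"])

# Drop a column label; equivalent to data.drop([...], axis=1).
without_cols = data.drop(columns=["two"])

# drop returns new objects, so `data` still has all 4 rows and 4 columns.
print(data.shape, without_rows.shape, without_cols.shape)
```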
2.3 Indexing, Selection, and Filtering
In [129]: obj = Series(np.arange(4.),index=['a','b','c','d'])
In [130]: obj['a'] # index by label
Out[130]: 0.0
In [131]: obj[0] # index by integer position
Out[131]: 0.0
In [132]: obj[1]
Out[132]: 1.0
In [133]: obj
Out[133]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
In [134]: obj[1:2] # slice by position
Out[134]:
b 1.0
dtype: float64
In [135]: obj[1:3]
Out[135]:
b 1.0
c 2.0
dtype: float64
In [136]: obj[obj<2] # filter with a boolean condition
Out[136]:
a 0.0
b 1.0
dtype: float64
In [137]: obj['b':'c'] # slice by label; note that both endpoints are included
Out[137]:
b 1.0
c 2.0
dtype: float64
In [138]: obj['b':'c'] = 100 # assignment
In [139]: obj
Out[139]:
a 0.0
b 100.0
c 100.0
d 3.0
dtype: float64
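The plain `[]` operator above mixes label and position lookups. As a hedged sketch, the explicit `loc` (label) and `iloc` (position) accessors express the same selections without ambiguity; as far as I know, recent pandas releases warn when an integer key such as `obj[0]` is treated as a position on a label index, so this form may be the safer one.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj.loc["a"]         # label lookup
by_position = obj.iloc[0]       # position lookup
pos_slice = obj.iloc[1:3]       # positional slice, end excluded
label_slice = obj.loc["b":"c"]  # label slice, both endpoints included

# Assignment through a label slice modifies obj in place.
obj.loc["b":"c"] = 100
print(obj)
```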
For a DataFrame, indexing retrieves one or more columns.
Using column names selects columns.
Using an integer slice or boolean values selects rows.
In [140]: data
Out[140]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [141]: data['two'] # select the column 'two'
Out[141]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
In [142]: data[['two','one']] # select columns in the requested order
Out[142]:
two one
Ohio 1 0
Colorado 5 4
Utah 9 8
New York 13 12
In [143]: data[:2] # first two rows; an integer slice selects rows
Out[143]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [144]: data[data['three']>5] # rows where column 'three' is greater than 5
Out[144]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Syntactically, a DataFrame behaves much like an ndarray here.
In [146]: data < 5
Out[146]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [147]: data[data<5] = 0
In [148]: data
Out[148]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
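The boolean assignment above modifies data in place. A non-mutating alternative, sketched here as my own addition rather than something from the text, is `DataFrame.where`, which keeps values where the condition holds and substitutes the second argument elsewhere.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Keep values >= 5, replace everything else with 0; `data` is unchanged.
masked = data.where(data >= 5, 0)

print(masked)
```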
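The indexing field ix (deprecated in pandas 0.20 and removed in 1.0, where loc and iloc take its place):
It selects a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels.
In addition, the ix notation is concise.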
In [150]: data.ix['Colorado',['two','three']]
Out[150]:
two 5
three 6
Name: Colorado, dtype: int32
In [151]: data.ix[['Colorado','Utah'],[3,0,1]]
Out[151]:
four one two
Colorado 7 0 5
Utah 11 8 9
In [152]: data.ix[2]
Out[152]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
In [153]: data.ix[:'Utah','two']
Out[153]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
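For readers on a pandas version without ix, the selections above can be approximated with `loc` (labels) and `iloc` (positions). This is a sketch under that assumption; the frame is rebuilt from scratch here, so the values below 5 have not been zeroed as they were in the session above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# One row label, a list of column labels.
row_subset = data.loc["Colorado", ["two", "three"]]

# Row labels combined with column positions: chain loc and iloc.
mixed = data.loc[["Colorado", "Utah"]].iloc[:, [3, 0, 1]]

# A single row by position.
third_row = data.iloc[2]

# Label slice on rows (endpoint included) and one column label.
up_to_utah = data.loc[:"Utah", "two"]

print(row_subset, mixed, third_row, up_to_utah, sep="\n\n")
```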
[Table: indexing options for DataFrame]
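2.4 Arithmetic and Data Alignment
When objects with different indexes are combined arithmetically, the index of the result is the union of the two indexes; wherever a label is missing from either operand, the result is NaN.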
In [4]: s1 = Series([-2,-3,5,-1],index=list('abcd'))
In [5]: s2 = Series([9,2,5,1,5],index=list('badef'))
In [6]: s1 + s2
Out[6]:
a 0.0
b 6.0
c NaN
d 4.0
e NaN
f NaN
dtype: float64
The same holds for DataFrame, where alignment happens on both the rows and the columns.
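A minimal sketch of alignment on both axes, using two small frames invented for illustration: only the cell whose row label and column label exist in both operands gets a value.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(4).reshape(2, 2), index=["x", "y"], columns=["p", "q"])
right = pd.DataFrame(np.ones((2, 2)), index=["y", "z"], columns=["q", "r"])

# Result index is the union of row labels, result columns the union of column labels.
total = left + right

print(total)
```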
Filling values in arithmetic methods
In [7]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
In [8]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
In [9]: df1
Out[9]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [10]: df2
Out[10]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [11]: df1 + df2 # no fill value
Out[11]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
In [12]: df1.add(df2, fill_value=0) # fill with 0
Out[12]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [13]: df1.reindex(columns=df2.columns, method='ffill')
Out[13]:
a b c d e
0 0.0 1.0 2.0 3.0 NaN
1 4.0 5.0 6.0 7.0 NaN
2 8.0 9.0 10.0 11.0 NaN
In [14]: df1.reindex(columns=df2.columns, fill_value=0) # a fill value can also be given when reindexing
Out[14]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
The available arithmetic methods are:
- add: addition
- sub: subtraction
- div: division
- mul: multiplication
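The methods listed above all accept `fill_value`. The sketch below, with two toy frames of my own, shows sub and mul; note that a cell missing from both operands stays NaN even when a fill value is given.

```python
import pandas as pd

df1 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=["a", "b"])
df2 = pd.DataFrame([[10.0, 20.0, 30.0]], columns=["a", "b", "c"])

# Missing cells are treated as 0 before subtracting.
difference = df1.sub(df2, fill_value=0)

# Missing cells are treated as 1 before multiplying.
product = df1.mul(df2, fill_value=1)

# Row 1, column "c" exists in neither frame, so it remains NaN.
print(difference)
print(product)
```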
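Operations between DataFrame and Series
These use broadcasting: by default the index of the Series is matched against the columns of the DataFrame, and the operation is then applied to every row.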
In [15]: frame = DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index
...: =['Utah','Ohio','Texas','Oregon'])
In [16]: series = frame.ix[0] # get the first row
In [17]: frame
Out[17]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [18]: series
Out[18]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
In [19]: frame - series # automatically broadcast to the other rows
Out[19]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
In [20]: series2 = Series(np.arange(3),index=list('bef'))
In [21]: series2
Out[21]:
b 0
e 1
f 2
dtype: int64
In [22]: frame + series2 # columns missing from either operand become NaN
Out[22]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
In [23]: series3 = frame['d'] # select a column
In [24]: frame.sub(series3, axis=0) # subtract column-wise; axis=0 matches on the row index
Out[24]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
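One common use of this broadcasting is centering data. The following sketch, which is my own illustration rather than part of the session above, subtracts column means with the default alignment and row means with `axis=0`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# frame.mean() is indexed by column label, so the default alignment works.
col_centered = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so match on the rows with axis=0.
row_centered = frame.sub(frame.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```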
2.5 Function Application and Mapping
NumPy universal functions (ufuncs) also work on pandas Series and DataFrame objects.
In [31]: np.abs(frame)
Out[31]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [32]: np.max(frame)
Out[32]:
b 9.0
d 10.0
e 11.0
dtype: float64
DataFrame has an apply method that accepts a custom function and calls it on each column (or on each row with axis=1).
In [33]: f = lambda x: np.max(x) - np.min(x)
In [34]: frame.apply(f)
Out[34]:
b 9.0
d 9.0
e 9.0
dtype: float64
In [35]: frame
Out[35]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [36]: f = lambda x : np.exp2(x)
In [37]: frame.apply(f)
Out[37]:
b d e
Utah 1.0 2.0 4.0
Ohio 8.0 16.0 32.0
Texas 64.0 128.0 256.0
Oregon 512.0 1024.0 2048.0
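Many common operations are already implemented as DataFrame methods, so a custom function with apply is often unnecessary. The function passed to apply may also return a Series: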
In [38]: f = lambda x: Series([np.max(x),np.min(x)],index=['max','min'])
In [39]: frame.apply(f)
Out[39]:
b d e
max 9.0 10.0 11.0
min 0.0 1.0 2.0
# If f is an element-wise function, use applymap
In [40]: f = lambda x : '%.2f' % x
In [41]: frame.applymap(f)
Out[41]:
b d e
Utah 0.00 1.00 2.00
Ohio 3.00 4.00 5.00
Texas 6.00 7.00 8.00
Oregon 9.00 10.00 11.00
# Likewise, a Series uses map, the counterpart of DataFrame's applymap.
In [43]: series
Out[43]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
In [44]: series.map(f)
Out[44]:
b 0.00
d 1.00
e 2.00
Name: Utah, dtype: object
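The sketch below gathers the three styles in one place: apply per column, apply per row with `axis=1`, and element-wise formatting. Because applymap has been deprecated in favor of `DataFrame.map` since pandas 2.1, the element-wise step is written here as apply over columns combined with `Series.map`, which should run on old and new versions alike; that workaround is my own choice.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def spread(values):
    """Range (max minus min) of a Series."""
    return values.max() - values.min()

col_spread = frame.apply(spread)          # one result per column
row_spread = frame.apply(spread, axis=1)  # one result per row

# Element-wise formatting: map each column's values to 2-decimal strings.
formatted = frame.apply(lambda col: col.map(lambda v: f"{v:.2f}"))

print(col_spread, row_spread, formatted, sep="\n\n")
```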
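2.6 Sorting and Ranking
Sorting
Sorting can be done with:
- sort_index: sorts by index labels
- sort_values (formerly order): sorts by value; for a DataFrame, name the column(s) with the by argument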
In [45]: obj = Series(range(4),index=list('dbca'))
In [46]: obj
Out[46]:
d 0
b 1
c 2
a 3
dtype: int64
In [47]: obj.sort_index()
Out[47]:
a 3
b 1
c 2
d 0
dtype: int64
In [50]: frame
Out[50]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [51]: frame.sort_index()
Out[51]:
b d e
Ohio 3.0 4.0 5.0
Oregon 9.0 10.0 11.0
Texas 6.0 7.0 8.0
Utah 0.0 1.0 2.0
In [52]: frame.sort_index(axis=0)
Out[52]:
b d e
Ohio 3.0 4.0 5.0
Oregon 9.0 10.0 11.0
Texas 6.0 7.0 8.0
Utah 0.0 1.0 2.0
In [53]: frame.sort_index(axis=1)
Out[53]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [54]: frame.sort_index(axis=1, ascending=False) # descending order
Out[54]:
e d b
Utah 2.0 1.0 0.0
Ohio 5.0 4.0 3.0
Texas 8.0 7.0 6.0
Oregon 11.0 10.0 9.0
Sorting by value:
In [55]: s1 = Series([3,-2,-7,4])
In [56]: s1.order()
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: order is deprecated, use sort_values(...)
Out[56]:
2 -7
1 -2
0 3
3 4
dtype: int64
In [58]: frame.sort_index(by='b')
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
Out[58]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [59]: frame.sort_values(by='b')
Out[59]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
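Since order and the by argument of sort_index are deprecated, here is a short sketch using sort_values only. The two-column frame is made up for illustration so that sorting by several keys has a visible effect.

```python
import pandas as pd

s = pd.Series([3, -2, -7, 4])
ascending = s.sort_values()                 # smallest first
descending = s.sort_values(ascending=False) # largest first

df = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by column "a" first, then by "b" within equal values of "a".
by_two_keys = df.sort_values(by=["a", "b"])

print(ascending, descending, by_two_keys, sep="\n\n")
```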
Ranking
By default, the rank method breaks ties by assigning each tied value the average of the ranks the group would occupy:
In [60]: s1 = Series([7,-5,7,4,2,0,4])
In [61]: s1.rank() # the values at index 0 and 2 are both 7 and would take ranks 6 and 7, so each gets the average, 6.5
Out[61]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
There are, of course, other ways to break such ties.
In [62]: s1.rank(method='first') # ties ranked in the order the values appear in the data
Out[62]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
In [63]: s1.rank(ascending=False, method='max') # descending; tied values get the maximum rank of their group
Out[63]:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
For a DataFrame, rank can be computed over the rows or over the columns by using the axis argument.
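A brief sketch of two further tie-breaking methods, `min` and `dense`, and of ranking across the columns of a DataFrame with `axis=1`. The frame is a toy example of my own.

```python
import pandas as pd

s1 = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': every member of a tied group gets the lowest rank of the group.
min_rank = s1.rank(method="min")

# 'dense': like 'min', but ranks increase by 1 between groups, with no gaps.
dense_rank = s1.rank(method="dense")

df = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1], "c": [-2, 5, 8, -2.5]})

# axis=1 ranks the values within each row instead of within each column.
row_ranks = df.rank(axis=1)

print(min_rank, dense_rank, row_ranks, sep="\n\n")
```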
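2.7 Axis Indexes with Duplicate Values
In all the examples so far the index labels have been unique, and many pandas functions (such as reindex) require unique labels.
Uniqueness is not mandatory, however.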
In [64]: obj = Series(range(5),index=list('aabbc'))
In [65]: obj
Out[65]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [67]: obj.index.is_unique
Out[67]: False
In [68]: obj['a']
Out[68]:
a 0
a 1
dtype: int64
In [69]: obj['c']
Out[69]: 4
The same is true for DataFrame.
In [70]: df =DataFrame(np.random.randn(4,3),index=list('aabb'))
In [79]: df.ix['a']
Out[79]:
0 1 2
a 1.099692 -0.491098 0.625690
a -0.816857 1.025018 0.558494
In [80]: df.reindex(['b','a']) # a DataFrame with duplicate index labels cannot be reindexed
...
ValueError: cannot reindex from a duplicate axis
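One possible way around this error, sketched here as an assumption about what the reader may want rather than a recommendation from the text, is to keep only the first row for each label with `Index.duplicated` and then reindex the de-duplicated frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=list("aabb"))

# Reindexing an axis with duplicate labels raises ValueError.
try:
    df.reindex(["b", "a"])
    raised = False
except ValueError:
    raised = True

# Keep the first occurrence of each label, then reindex safely.
deduped = df[~df.index.duplicated(keep="first")]
reordered = deduped.reindex(["b", "a"])

print(raised)
print(reordered)
```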
To be continued...