DataFrame arithmetic
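- When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes. This behaves like an automatic outer join on the index labels in a database: positions that do not overlap are filled with NA values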
In [4]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ...:                    index=['Ohio', 'Texas', 'Colorado'])
In [5]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ...:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [6]: df1
Out[6]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
In [7]: df2
Out[7]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [8]: df1 + df2
Out[8]:
            b    c     d    e
Colorado  NaN  NaN   NaN  NaN
Ohio      3.0  NaN   6.0  NaN
Oregon    NaN  NaN   NaN  NaN
Texas     9.0  NaN  12.0  NaN
Utah      NaN  NaN   NaN  NaN
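When NaN at the non-overlapping positions is not wanted, the add method accepts a fill_value argument. The following is a minimal sketch that rebuilds the same df1 and df2; using 0 as the fill value is an assumption made for illustration, not something the notes above prescribe.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Plain "+" aligns on both axes and leaves NaN wherever either side is missing.
plain_sum = df1 + df2

# add(..., fill_value=0) treats a value missing from ONE frame as 0 before adding.
# Positions missing from BOTH frames (for example Utah / 'c') are still NaN.
filled_sum = df1.add(df2, fill_value=0)

print(plain_sum)
print(filled_sum)
```

The other arithmetic methects are not needed here; the same fill_value idea applies to sub, mul and div, which are the method forms of -, * and /.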
- To sort lexicographically by row or column index, use sort_index, which returns a new, sorted object. Note that sort_index sorts the index labels, not the values
In [15]: obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...:                    columns=['d', 'a', 'b', 'c'])
In [16]: obj
Out[16]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In [17]: obj.sort_index()
Out[17]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [18]: obj.sort_index(axis=1)
Out[18]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
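Both calls above sort in ascending order, which is the default. As a small sketch on the same obj, the ascending parameter of sort_index reverses the direction on either axis:

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(8).reshape((2, 4)),
                   index=['three', 'one'],
                   columns=['d', 'a', 'b', 'c'])

# Row labels in descending lexicographic order: 'three' comes before 'one'.
rows_desc = obj.sort_index(ascending=False)

# Column labels in descending order: d, c, b, a.
cols_desc = obj.sort_index(axis=1, ascending=False)

# sort_index returns a new object; obj itself keeps its original order.
print(rows_desc)
print(cols_desc)
```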
- To sort by values instead, use sort_values and pass by to name the column to sort on
In [20]: obj.sort_values(by='b')
Out[20]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In [21]: obj.sort_values(by='b', ascending=False)
Out[21]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
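The by argument also accepts a list of columns, in which case later columns break ties in earlier ones. A minimal sketch follows; the frame below is hypothetical data (not from the notes above), chosen so that column 'a' contains ties.

```python
import pandas as pd

# Hypothetical data with ties in column 'a' so that the second key matters.
frame = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [4, 7, -3, 2]})

# Sort by 'a' first, then break ties with 'b' (both ascending).
by_a_then_b = frame.sort_values(by=['a', 'b'])

# Each key can have its own direction: 'a' ascending, 'b' descending.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(by_a_then_b)
print(mixed)
```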
Intersection, union, and complement of DataFrames
In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
   ...:         'year': [2000, 2001, 2002, 2001, 2002],
   ...:         'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [4]: data
Out[4]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [5]: df1 = pd.DataFrame(data)
In [6]: df1
Out[6]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
In [11]: data1 = {'state': ['Ohio1', 'Ohio1', 'Ohio', 'Nevada1', 'Nevada1'],
    ...:          'year': [2000, 2021, 2022, 2021, 2002],
    ...:          'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [12]: df2 = pd.DataFrame(data1)
In [14]: df2
Out[14]:
     state  year  pop
0    Ohio1  2000  1.5
1    Ohio1  2021  1.7
2     Ohio  2022  3.6
3  Nevada1  2021  2.4
4  Nevada1  2002  3.9
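# With no on argument, merge joins on all shared columns (state, year, pop).
# For df1 and df2 as shown, no row matches on all three (row 2 differs in year: 2002 vs 2022), so the inner merge is empty
In [22]: pd.merge(df1, df2, how='inner')
Out[22]:
Empty DataFrame
Columns: [state, year, pop]
Index: []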
In [16]: pd.merge(df1, df2, how='outer')
Out[16]:
     state  year  pop
0     Ohio  2000  1.5
1     Ohio  2001  1.7
2     Ohio  2002  3.6
3   Nevada  2001  2.4
4   Nevada  2002  3.9
5    Ohio1  2000  1.5
6    Ohio1  2021  1.7
7     Ohio  2022  3.6
8  Nevada1  2021  2.4
9  Nevada1  2002  3.9
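The merges above can be read as set operations on rows: how='inner' behaves like an intersection and how='outer' like a union. The complement is not shown in the session; one possible way to get it is the indicator parameter of pd.merge. The sketch below uses two small hypothetical frames, left and right (names and data are assumptions for illustration), that share exactly one row.

```python
import pandas as pd

# Hypothetical frames that share exactly one row: ('Ohio', 2002).
left = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                     'year': [2001, 2002, 2001]})
right = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2002, 2002]})

# With no "on" argument, merge joins on all columns the frames have in common.
intersection = pd.merge(left, right, how='inner')
union = pd.merge(left, right, how='outer')

# indicator=True adds a "_merge" column: 'left_only', 'right_only' or 'both'.
tagged = pd.merge(left, right, how='outer', indicator=True)

# Complement (difference): rows of left that do not appear in right.
left_minus_right = (tagged[tagged['_merge'] == 'left_only']
                    .drop(columns='_merge')
                    .reset_index(drop=True))

print(intersection)
print(union)
print(left_minus_right)
```

merge does not deduplicate its inputs, so this only matches true set semantics when neither frame contains repeated rows; calling drop_duplicates on each frame first is one way to guarantee that.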
Boolean indexing on a DataFrame
In [23]: df1
Out[23]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
In [24]: df1[df1['pop'] > 2]
Out[24]:
    state  year  pop
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
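# Conditions can be combined with the element-wise logical operators (& for and, | for or, ~ for not); wrap each condition in parentheses, because & binds more tightly than the comparison operators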
In [27]: df1[(df1['pop'] > 2) & (df1['year'] > 2001)]
Out[27]:
    state  year  pop
2    Ohio  2002  3.6
4  Nevada  2002  3.9
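The session above only combines conditions with &. As a minimal sketch on the same df1, the other common mask operations look like this; the particular conditions are chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                    'year': [2000, 2001, 2002, 2001, 2002],
                    'pop': [1.5, 1.7, 3.6, 2.4, 3.9]})

# "|" is element-wise OR: rows from Nevada, or rows with pop above 3.
either = df1[(df1['state'] == 'Nevada') | (df1['pop'] > 3)]

# "~" negates a mask: every row that is NOT from Ohio.
not_ohio = df1[~(df1['state'] == 'Ohio')]

# isin tests membership in a list of values.
selected_years = df1[df1['year'].isin([2000, 2002])]

print(either)
print(not_ohio)
print(selected_years)
```

Python's own and, or and not keywords do not work here, because they ask for a single truth value for the whole Series, which pandas refuses to give.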