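DataFrame arithmetic
- When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes. For a DataFrame this alignment applies to both the row index and the columns. It behaves like an automatic outer join on the index labels in a database: any position that is not present in both objects gets an NA value.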
In [1]: import numpy as np
In [2]: import pandas as pd
In [4]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ...:                    index=['Ohio', 'Texas', 'Colorado'])
In [5]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ...:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [6]: df1
Out[6]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
In [7]: df2
Out[7]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [8]: df1 + df2
Out[8]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
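The NaN cells above come purely from label alignment. As a minimal sketch (rebuilding the same `df1` and `df2`), the `add` method with `fill_value=0` treats a value that is missing from only one frame as 0 before adding. The variable names `total` and `filled` are just for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Plain addition: a label present in only one frame produces NaN.
total = df1 + df2

# add() with fill_value=0 substitutes 0 for a value missing from ONE frame.
# A cell missing from BOTH frames (e.g. Colorado/'e') still comes out as NaN.
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```

With `fill_value`, only cells that exist in neither frame remain NaN, which is often what is wanted when the two frames hold partial counts of the same quantity.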
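- To sort lexicographically by the row or column index, use sort_index, which returns a new, sorted object. Note that sort_index sorts by the index labels, not by the data values.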
In [15]: obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...:                    columns=['d', 'a', 'b', 'c'])
In [16]: obj
Out[16]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In [17]: obj.sort_index()
Out[17]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [18]: obj.sort_index(axis=1)
Out[18]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
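As a small sketch of two details of sort_index on the same `obj`: `ascending=False` reverses the order on either axis, and the original object is left unchanged because a new object is returned. The names `by_rows_desc` and `by_cols_desc` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                   columns=['d', 'a', 'b', 'c'])

# Sort the row labels in descending lexicographic order.
by_rows_desc = obj.sort_index(ascending=False)

# Sort the column labels in descending order.
by_cols_desc = obj.sort_index(axis=1, ascending=False)

print(by_rows_desc)
print(by_cols_desc)
print(obj)  # still in its original order
```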
- To sort by the data values instead, use sort_values. For a DataFrame, pass by to name the column (or columns) to sort on.
In [20]: obj.sort_values(by='b')
Out[20]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In [21]: obj.sort_values(by='b', ascending=False)
Out[21]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
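sort_values also accepts a list of columns in `by`, with a matching list for `ascending`. The sketch below uses a small hypothetical frame (`frame` is not from the notes above) so that ties in the first column make the second sort key visible.

```python
import pandas as pd

# Hypothetical data: column 'b' contains ties, so 'a' decides the order within each tie.
frame = pd.DataFrame({'b': [2, 1, 2, 1], 'a': [0, 3, 1, 2]})

# Sort by 'b' ascending, then by 'a' descending within equal 'b' values.
result = frame.sort_values(by=['b', 'a'], ascending=[True, False])

print(result)
```

The original row labels travel with the rows, so the index of `result` shows how the rows were reordered.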
DataFrame intersection, union, and complement
In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
   ...:         'year': [2000, 2001, 2002, 2001, 2002],
   ...:         'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [4]: data
Out[4]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [5]: df1 = pd.DataFrame(data)
In [6]: df1
Out[6]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
In [11]: data1 = {'state': ['Ohio1', 'Ohio1', 'Ohio', 'Nevada1', 'Nevada1'],
    ...:          'year': [2000, 2021, 2022, 2021, 2002],
    ...:          'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [12]: df2 = pd.DataFrame(data1)
In [14]: df2
Out[14]:
     state  year  pop
0    Ohio1  2000  1.5
1    Ohio1  2021  1.7
2     Ohio  2022  3.6
3  Nevada1  2021  2.4
4  Nevada1  2002  3.9
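# With no `on` argument, merge joins on every column the two frames share (state, year, pop).
# No row of df2 as shown above equals a row of df1 on all three columns
# (df2 has Ohio/2022/3.6, df1 has Ohio/2002/3.6), so the intersection is empty.
In [22]: pd.merge(df1, df2, how='inner')
Out[22]:
Empty DataFrame
Columns: [state, year, pop]
Index: []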
In [16]: pd.merge(df1, df2, how='outer')
Out[16]:
     state  year  pop
0     Ohio  2000  1.5
1     Ohio  2001  1.7
2     Ohio  2002  3.6
3   Nevada  2001  2.4
4   Nevada  2002  3.9
5    Ohio1  2000  1.5
6    Ohio1  2021  1.7
7     Ohio  2022  3.6
8  Nevada1  2021  2.4
9  Nevada1  2002  3.9
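The heading mentions a complement, but the session only shows inner (intersection) and outer (union) merges. One possible way to get the rows that appear in only one frame is an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. This is a sketch on hypothetical frames (`left` and `right` are illustration data, not the `df1`/`df2` above), chosen so that exactly one row is shared.

```python
import pandas as pd

# Hypothetical frames sharing exactly one row: (Ohio, 2002, 3.6).
left = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                     'year': [2001, 2002, 2002],
                     'pop': [1.7, 3.6, 3.9]})
right = pd.DataFrame({'state': ['Ohio', 'Utah'],
                      'year': [2002, 2002],
                      'pop': [3.6, 2.9]})

# Union of rows, with a '_merge' column recording where each row came from.
both = pd.merge(left, right, how='outer', indicator=True)

# Intersection: rows found in both frames.
intersection = both[both['_merge'] == 'both'].drop(columns='_merge')

# Complement of `right` within `left`: rows found only in the left frame.
only_left = both[both['_merge'] == 'left_only'].drop(columns='_merge')

print(both)
print(intersection)
print(only_left)
```

The row order of an outer merge can differ between pandas versions, so it is safer not to rely on it and to sort explicitly if the order matters.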
Boolean indexing on a DataFrame
In [23]: df1
Out[23]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
In [24]: df1[df1['pop'] > 2]
Out[24]:
    state  year  pop
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
# Conditions can be combined with logical operators; wrap each condition in parentheses
In [27]: df1[(df1['pop'] > 2) & (df1['year'] > 2001)]
Out[27]:
    state  year  pop
2    Ohio  2002  3.6
4  Nevada  2002  3.9
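Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~` (not). A short sketch on the same `df1` data follows; the names `either` and `not_ohio` are illustrative. Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operator forms are used.

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                    'year': [2000, 2001, 2002, 2001, 2002],
                    'pop': [1.5, 1.7, 3.6, 2.4, 3.9]})

# OR: rows where pop is above 3, or the year is 2000.
either = df1[(df1['pop'] > 3) | (df1['year'] == 2000)]

# NOT: rows whose state is not Ohio.
not_ohio = df1[~(df1['state'] == 'Ohio')]

print(either)
print(not_ohio)
```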