Some Common DataFrame Operations


DataFrame Arithmetic

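  • When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes. This behaves like an automatic outer join on the index labels in a database: every position that does not overlap is filled with NaN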
In [1]: import numpy as np

In [2]: import pandas as pd

In [4]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ...:                    index=['Ohio', 'Texas', 'Colorado'])

In [5]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ...:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [6]: df1
Out[6]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [7]: df2
Out[7]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [8]: df1 + df2
Out[8]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
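
If NaN in every non-overlapping cell is not what you want, the add method accepts a fill_value argument. The sketch below rebuilds the two frames from the example above and treats a value missing from one side as 0 before adding. A cell that is missing from both frames still comes out as NaN. The variable names plain and filled are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Plain addition: a label missing from either frame yields NaN.
plain = df1 + df2

# add() with fill_value: a value missing from ONE frame is treated as 0
# before adding; a cell missing from BOTH frames is still NaN.
filled = df1.add(df2, fill_value=0)

print(filled)
```

For example, ('Colorado', 'b') exists only in df1, so it keeps its value in filled, while ('Colorado', 'e') exists in neither frame and stays NaN.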
  • To sort lexicographically by row or column labels, use sort_index, which returns a new, sorted object. Note that sort_index sorts the index labels, not the contents
In [15]: obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...:                    columns=['d', 'a', 'b', 'c'])

In [16]: obj
Out[16]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [17]: obj.sort_index()
Out[17]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [18]: obj.sort_index(axis=1)
Out[18]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
  • To sort by content instead, use sort_values; for a DataFrame, the by argument names the column (or columns) to sort on
In [20]: obj.sort_values(by='b')
Out[20]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [21]: obj.sort_values(by='b', ascending=False)
Out[21]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
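
Both methods also take an ascending argument, and sort_values accepts a list of columns. The sketch below reuses obj from the example above and adds a small hypothetical frame (frame, with tied values in column 'a') to show how ties are broken by a second column. The names by_label_desc, cols_desc, frame, and multi are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(8).reshape((2, 4)),
                   index=['three', 'one'],
                   columns=['d', 'a', 'b', 'c'])

# Sort the row labels in descending order.
by_label_desc = obj.sort_index(ascending=False)

# Sort the column labels in descending order.
cols_desc = obj.sort_index(axis=1, ascending=False)

# Hypothetical frame with ties in column 'a' to show multi-column sorting.
frame = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [4, 7, -3, 2]})

# Sort by 'a' ascending first, then break ties by 'b' descending.
multi = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(multi)
```

Passing a list to ascending lets each column in by have its own direction; the original row labels travel with the rows, so the sorted frame can still be traced back to the input.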

Intersection, Union, and Complement of DataFrames

In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year':
   ...:  [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [4]: data
Out[4]: 
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [5]: df1 = pd.DataFrame(data)

In [6]: df1
Out[6]: 
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9

In [11]: data1 = {'state': ['Ohio1', 'Ohio1', 'Ohio', 'Nevada1', 'Nevada1'],
    ...:          'year': [2000, 2021, 2022, 2021, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [12]: df2 = pd.DataFrame(data1)
  
In [14]: df2
Out[14]: 
     state  year  pop
0    Ohio1  2000  1.5
1    Ohio1  2021  1.7
2     Ohio  2022  3.6
3  Nevada1  2021  2.4
4  Nevada1  2002  3.9
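  • Intersection: an inner merge keeps only the rows that match on every shared column (state, year, pop)
# With df2 as defined above, df1 has (Ohio, 2002, 3.6) but df2 has (Ohio, 2022, 3.6), so no row agrees on all three columns and the result is empty
In [22]: pd.merge(df1, df2, how='inner')
Out[22]: 
Empty DataFrame
Columns: [state, year, pop]
Index: []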
  • Union: an outer merge keeps every row that appears in either frame
In [16]: pd.merge(df1, df2, how='outer')
Out[16]: 
     state  year  pop
0     Ohio  2000  1.5
1     Ohio  2001  1.7
2     Ohio  2002  3.6
3   Nevada  2001  2.4
4   Nevada  2002  3.9
5    Ohio1  2000  1.5
6    Ohio1  2021  1.7
7     Ohio  2022  3.6
8  Nevada1  2021  2.4
9  Nevada1  2002  3.9
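
The examples above cover intersection and union. For the complement (rows that appear in one frame but not in the other), one common approach is to merge with indicator=True and filter on the added _merge column. The sketch below uses two small hypothetical frames, left and right, that share exactly one row; the names tagged, only_left, and both are illustrative. The row order of an outer merge can differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

# Hypothetical data: the two frames share exactly one row (Ohio, 2002, 3.6).
left = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                     'year': [2001, 2002, 2001],
                     'pop': [1.7, 3.6, 2.4]})
right = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2002, 2002],
                      'pop': [3.6, 3.9]})

# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only', or 'both'.
tagged = pd.merge(left, right, how='outer', indicator=True)

# Rows of `left` that do not appear in `right` (the difference left - right).
only_left = (tagged[tagged['_merge'] == 'left_only']
             .drop(columns='_merge')
             .reset_index(drop=True))

# Rows present in both frames (the same rows how='inner' would return).
both = (tagged[tagged['_merge'] == 'both']
        .drop(columns='_merge')
        .reset_index(drop=True))

print(only_left)
print(both)
```

Filtering on 'right_only' instead gives the difference in the other direction.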

Boolean Indexing on a DataFrame

  • Select the rows whose value in a given column is greater than 2
In [23]: df1
Out[23]: 
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9

In [24]: df1[df1['pop'] > 2]
Out[24]: 
    state  year  pop
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9

# Conditions can be combined with logical operators (& for "and"); wrap each condition in parentheses
In [27]: df1[(df1['pop'] > 2) & (df1['year'] > 2001)]
Out[27]: 
    state  year  pop
2    Ohio  2002  3.6
4  Nevada  2002  3.9
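
Besides &, a boolean mask can be built with | (or), ~ (not), and isin. The sketch below rebuilds df1 from the data above and shows each form, plus .loc for picking columns with the same mask. The variable names either, not_ohio, some_years, and big_pop_states are illustrative only.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
df1 = pd.DataFrame(data)

# OR: rows where pop > 3 or the year is 2000.
either = df1[(df1['pop'] > 3) | (df1['year'] == 2000)]

# NOT: rows whose state is not Ohio.
not_ohio = df1[~(df1['state'] == 'Ohio')]

# isin: rows whose year is one of several values.
some_years = df1[df1['year'].isin([2000, 2002])]

# .loc takes the same kind of mask and can select columns at the same time.
big_pop_states = df1.loc[df1['pop'] > 2, ['state', 'pop']]

print(either)
print(big_pop_states)
```

The parentheses around each comparison are required because & and | bind more tightly than > and == in Python; the keywords and / or do not work element-wise on a Series.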

