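Data held in pandas objects can be combined in several built-in ways:
pandas.merge connects rows of different DataFrames based on one or more keys.
pandas.concat stacks multiple objects together along an axis.
The instance method combine_first splices overlapping data together, filling missing values in one object with values from another.
Database-style DataFrame merges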
In [51]: df1 = DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
In [53]: df2 = DataFrame({'key':['a','b','d'],'data2':range(3)})
In [54]: df1
Out[54]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
In [55]: df2
Out[55]:
data2 key
0 0 a
1 1 b
2 2 d
In [56]: pd.merge(df1,df2)
Out[56]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
When no join key is specified, merge uses the overlapping column names as the keys. It is better, though, to name the key explicitly:
In [57]: pd.merge(df1,df2,on = 'key')
If the key columns have different names in the two objects, specify them separately:
In [58]: df4 = DataFrame({'key-r':['a','b','d'],'data2':range(3)})
In [59]: df3 = DataFrame({'key-l':['b','b','a','c','a','a','b'],'data1':range(7)})
In [63]: pd.merge(df3,df4,left_on = 'key-l',right_on = 'key-r')
Out[63]:
data1 key-l data2 key-r
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
By default, merge performs an 'inner' join; the other options are 'left', 'right' and 'outer':
In [64]: pd.merge(df1,df2,how='outer')
Out[64]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0
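The four join types can be compared side by side. The following is a minimal sketch that rebuilds df1 and df2 from the session above; the indicator=True argument is an addition not used in the original text, and it adds a _merge column recording whether each row came from the left object, the right object, or both.

```python
import pandas as pd

# Same data as df1 and df2 in the session above.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# 'left' keeps every row of df1, 'right' keeps every key of df2,
# 'outer' keeps the union of the keys from both sides.
left_join = pd.merge(df1, df2, on='key', how='left')
right_join = pd.merge(df1, df2, on='key', how='right')
outer_join = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print(len(left_join), len(right_join), len(outer_join))
print(outer_join['_merge'].value_counts())
```

Rows whose key exists on only one side ('c' in df1, 'd' in df2) are the ones that receive NaN in the columns coming from the other object.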
To merge on multiple keys, pass a list of column names.
In [65]: left = DataFrame({'key1':['a','a','b'],'key2':['one','two','one'],'lrow':[1,2,3]})
In [66]: right = DataFrame({'key1':['a','a','b','b'],'key2':['one','one','one','two'],'rrow':[4,5,6,7]})
In [67]: pd.merge(left,right,on = ['key1','key2'])
Out[67]:
key1 key2 lrow rrow
0 a one 1 4
1 a one 1 5
2 b one 3 6
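If the two objects have overlapping column names that are not used as join keys, you can specify suffixes to tell the duplicated columns apart: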
In [68]: pd.merge(left,right,on = ['key1'])
Out[68]:
key1 key2_x lrow key2_y rrow
0 a one 1 one 4
1 a one 1 one 5
2 a two 2 one 4
3 a two 2 one 5
4 b one 3 one 6
5 b one 3 two 7
# specify the suffixes
In [71]: pd.merge(left,right,on = ['key1'],suffixes=['_left','_right'])
Out[71]:
key1 key2_left lrow key2_right rrow
0 a one 1 one 4
1 a one 1 one 5
2 a two 2 one 4
3 a two 2 one 5
4 b one 3 one 6
5 b one 3 two 7
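Merging on index
Sometimes the join key of a DataFrame is found in its index. In that case, pass left_index = True or right_index = True (or both) to merge so that the index is used as the join key.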
In [73]: left = DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
In [75]: right = DataFrame({'group_val':[3.5,7]},index=['a','b'])
In [76]: pd.merge(left,right,left_on = 'key',right_index= True)
Out[76]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
With a hierarchical index, pass the matching columns as a list, e.g. left_on = ['key1','key2'], together with right_index = True.
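A minimal sketch of the hierarchical case follows. The frames lefth and righth and their column names are made up for illustration and do not come from the original session; the point is only that the list given to left_on must line up, level by level, with the MultiIndex on the right.

```python
import pandas as pd

# Hypothetical data: the left frame holds the two key parts as ordinary columns.
lefth = pd.DataFrame({'key1': ['x', 'x', 'y'],
                      'key2': [2000, 2001, 2001],
                      'data': [0, 1, 2]})

# The right frame holds the same two key parts as a two-level index.
righth = pd.DataFrame(
    {'event': [10, 20, 30]},
    index=pd.MultiIndex.from_tuples([('x', 2000), ('x', 2001), ('y', 2000)]))

# The columns in left_on are matched against the index levels in order.
merged = pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
print(merged)
```

Because this is an inner join, the left row ('y', 2001) and the right entry ('y', 2000) have no partner and are dropped.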
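Combining data in pandas
Concatenating along an axis
Another kind of data combination is referred to as concatenation, binding, or stacking.
NumPy has a concatenate function that does this with raw NumPy arrays: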
>>> arr = np.arange(12).reshape(3,4)
>>> arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> np.concatenate([arr,arr])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> np.concatenate([arr,arr],axis =1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
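The pandas concat function provides the same kind of concatenation for pandas objects, and also addresses the following questions:
1. If the objects are indexed differently on the other axes, should the result use the intersection or the union of those indexes?
2. Do the concatenated pieces need to be identifiable in the resulting object?
3. Does the index on the concatenation axis matter, i.e. does it need to be preserved?
Taking Series as an example: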
>>> s1 = Series(['0','1'],index = ['a','b'])
>>> s2 = Series([2,3,4],index = ['c','d','e'])
>>> s3 = Series({'f':5,'g':6})
>>> s1
a 0
b 1
dtype: object
>>> s3
f 5
g 6
dtype: int64
>>> pd.concat((s1,s2,s3))
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: object
By default concat works along axis=0 and produces a new Series. If you pass axis=1, the result is a DataFrame instead.
>>> pd.concat((s1,s2,s3),axis=1)
0 1 2
a 0 NaN NaN
b 1 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
In the cases above there is no overlap on the other axis.
What happens if there is overlap on the other axis?
>>> s4 = pd.concat([s1 * 5, s3])
>>> s4
a 00000
b 11111
f 5
g 6
dtype: object
>>> pd.concat((s1,s4),axis =1)
0 1
a 0 00000
b 1 11111
f NaN 5
g NaN 6
The default is to take the union. To take the intersection instead, pass join = 'inner':
>>> pd.concat((s1,s4),axis =1,join='inner')
0 1
a 0 00000
b 1 11111
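In older pandas versions you can use join_axes to specify the index to use on the other axes (the parameter was removed in pandas 1.0).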
>>> pd.concat((s1,s4),axis =1,join_axes = (['a','c','b','e'],))
0 1
a 0 00000
c NaN NaN
b 1 11111
e NaN NaN
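On current pandas versions the same result can be obtained by concatenating first and then reindexing. This is a sketch rather than the original author's method; s4 is rebuilt here directly from the values printed above.

```python
import pandas as pd

s1 = pd.Series(['0', '1'], index=['a', 'b'])
# Values copied from the printed s4 above (strings for a and b, integers for f and g).
s4 = pd.Series(['00000', '11111', 5, 6], index=['a', 'b', 'f', 'g'])

# Concatenate with the default outer join, then select and order the rows.
# Labels missing from the concatenated result ('c', 'e') become all-NaN rows.
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result)
```

Labels such as 'f' and 'g' that are not listed in the reindex call are dropped, which matches what join_axes did.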
To create a hierarchical index on the concatenation axis, use the keys parameter.
>>> result = pd.concat([s1,s1,s3],keys = ['one','two','three'])
>>> result
one a 0
b 1
two a 0
b 1
three f 5
g 6
dtype: object
If you also pass axis=1, the result is a DataFrame and the keys become the column headers.
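A short sketch of the axis=1 case, reusing s1 and s3 as defined earlier in the session:

```python
import pandas as pd

s1 = pd.Series(['0', '1'], index=['a', 'b'])
s3 = pd.Series({'f': 5, 'g': 6})

# With axis=1 each Series becomes one column, labelled by the matching key.
frame = pd.concat([s1, s1, s3], axis=1, keys=['one', 'two', 'three'])
print(frame)
```

The row index is the union of the input indexes, so positions where a Series has no value are filled with NaN.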
The same applies to DataFrame objects, except that a hierarchical index is created, by default on the rows.
In [12]: df1 = DataFrame(np.arange(6).reshape(3,2),index = ['a','b','c'],columns = ['one','two'])
In [13]: df2 =DataFrame(np.arange(4).reshape(2,2),index= ['a','c'],columns = ['three','four'])
In [14]: df1
Out[14]:
one two
a 0 1
b 2 3
c 4 5
In [15]: df2
Out[15]:
three four
a 0 1
c 2 3
In [16]: pd.concat([df1,df2],keys = ['item1','item2'])
Out[16]:
four one three two
item1 a NaN 0.0 NaN 1.0
b NaN 2.0 NaN 3.0
c NaN 4.0 NaN 5.0
item2 a 1.0 NaN 0.0 NaN
c 3.0 NaN 2.0 NaN
In [17]: pd.concat([df1,df2],keys = ['item1','item2'],axis = 1)
Out[17]:
item1 item2
one two three four
a 0 1 0.0 1.0
b 2 3 NaN NaN
c 4 5 2.0 3.0
If you pass a dict instead of a list, the dict's keys are used as the value of the keys option:
In [21]: pd.concat({'level1':df1,'level2':df2},axis =1)
Out[21]:
level1 level2
one two three four
a 0 1 0.0 1.0
b 2 3 NaN NaN
c 4 5 2.0 3.0
In addition, two further parameters control how the hierarchical index is created: names and levels.
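As a small illustration of names, the sketch below rebuilds df1 and df2 from the session above and labels the two column levels. The level names 'upper' and 'lower' are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# names assigns a name to each level of the hierarchical index that keys creates.
result = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                   names=['upper', 'lower'])
print(result.columns.names)
print(result)
```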
If the index on the concatenation axis is not relevant to the analysis, pass ignore_index = True to discard it.
In [25]: df1 = DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
In [26]: df2 = DataFrame(np.random.randn(2,3),columns = ['b','d','a'])
In [27]: pd.concat([df1,df2])
Out[27]:
a b c d
0 -0.859642 -0.637505 -0.830966 0.378526
1 -0.614132 1.692308 0.600534 -0.879568
2 -0.004405 -1.271789 -0.825860 0.528685
0 -1.219479 -0.626997 NaN -2.449499
1 0.366940 1.175266 NaN 0.434918
In [28]: pd.concat([df1,df2],ignore_index = True)
Out[28]:
a b c d
0 -0.859642 -0.637505 -0.830966 0.378526
1 -0.614132 1.692308 0.600534 -0.879568
2 -0.004405 -1.271789 -0.825860 0.528685
3 -1.219479 -0.626997 NaN -2.449499
4 0.366940 1.175266 NaN 0.434918
Combining overlapping data:
Consider two datasets whose indexes overlap in full or in part. To combine them, use the combine_first method of Series and DataFrame.
In [32]: a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index = ['f','e','d','c','b','a'])
In [33]: b =Series(np.arange((len(a)),dtype = np.float64),index = ['f','e','d','c','b','a'])
In [34]: b[:-2]
Out[34]:
f 0.0
e 1.0
d 2.0
c 3.0
dtype: float64
In [35]: a[2:]
Out[35]:
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
In [36]: b[:-2].combine_first(a[2:])
Out[36]:
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
In [37]: b
Out[37]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
dtype: float64
In [38]: b.combine_first(a)
Out[38]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
dtype: float64
In [39]: a.combine_first(b)
Out[39]:
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a 5.0
dtype: float64
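The combine_first method "patches" the missing data of the calling object: missing values in the caller are replaced with the non-NA values from the object passed in.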
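The same patching works column by column on DataFrames. The following sketch uses made-up frames df_a and df_b (not part of the original session) to show that values already present in the caller are kept, gaps are filled from the argument, and the result covers the union of both indexes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df_a has gaps, df_b has one extra row (index 4).
df_a = pd.DataFrame({'x': [1.0, np.nan, 5.0, np.nan],
                     'y': [np.nan, 2.0, np.nan, 6.0]})
df_b = pd.DataFrame({'x': [5.0, 4.0, np.nan, 3.0, 7.0],
                     'y': [np.nan, 3.0, 4.0, 6.0, 8.0]})

# Values in df_a win wherever they exist; NaN positions are taken from df_b.
patched = df_a.combine_first(df_b)
print(patched)
```

A position stays NaN only when it is missing in both objects, as with y at index 0 here.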
