Merging datasets in Python with merge and concat


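Data wrangling: merging, cleaning, filtering

pandas and the Python standard library provide a full set of high-level, flexible, and efficient core functions and algorithms for wrangling data into the form you need.

This post mainly covers:

Merging datasets: pd.merge(), pd.concat(), and related methods, which work much like join operations in SQL or other relational databases.

Merging datasets

1) The merge function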

merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

  

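  Parameter      Description

  left           The left DataFrame in the merge

  right          The right DataFrame in the merge

  how            Join type: 'inner' (default, intersection of keys), 'outer' (union of keys), 'left' (every key from the left, matching rows from the right), 'right' (every key from the right, matching rows from the left)

  on             Column name(s) to join on; must exist in both DataFrames. If not specified, and no other join key is given, the intersection of the column names in left and right is used as the join key

  left_on        Column(s) in the left DataFrame to use as the join key

  right_on       Column(s) in the right DataFrame to use as the join key

  left_index     Use the left DataFrame's row index as its join key

  right_index    Use the right DataFrame's row index as its join key

  sort           Sort the merged data by the join keys. Default is False, as in the signature above; leaving it off usually gives better performance on large datasets

  suffixes       Tuple of strings appended to overlapping column names, default ('_x', '_y'). For example, if both DataFrames have a 'data' column, the result contains 'data_x' and 'data_y'

  copy           If set to False, avoids copying data into the result in some special cases. By default the data is always copied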
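To make the how, suffixes, and indicator parameters more concrete, here is a minimal sketch. The two small tables (left, right) and their column names are made up for illustration and are not the data used elsewhere in this post.

```python
import pandas as pd

# Hypothetical toy tables that share a "key" column and a "data" column
left = pd.DataFrame({"key": ["a", "b", "c"], "data": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "data": [20, 30, 40]})

# Inner join (the default): only keys present in both tables survive
inner = pd.merge(left, right, on="key")

# Outer join: the union of keys; indicator=True adds a "_merge" column
# saying whether a row came from the left table, the right table, or both
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

# Custom suffixes for the overlapping "data" column
custom = pd.merge(left, right, on="key", suffixes=("_left", "_right"))

print(inner)
print(outer)
print(custom.columns.tolist())
```

The "_merge" column is handy for checking which keys failed to match before you decide between an inner and an outer join.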

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', 'guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

df
Out[46]: 
     id       date        city category  age   price
0  1001 2013-01-02    Beijing     100-A   23  1200.0
1  1002 2013-01-03          SH    100-B   44     NaN
2  1003 2013-01-04  guangzhou     110-A   54  2133.0
3  1004 2013-01-05    Shenzhen    110-C   32  5433.0
4  1005 2013-01-06    shanghai    210-A   34     NaN
5  1006 2013-01-07    BEIJING     130-F   32  4432.0

df1=pd.DataFrame({"acct_no":[1001,1002,1003,1004,1005,1006,1007,1008], 
"gender":['male','female','male','female','male','female','male','female'],
"pay":['Y','N','Y','Y','N','Y','N','Y',],
"m-point":[10,12,20,40,40,40,30,20]})

df1
Out[48]: 
   acct_no  gender pay  m-point
0     1001    male   Y       10
1     1002  female   N       12
2     1003    male   Y       20
3     1004  female   Y       40
4     1005    male   N       40
5     1006  female   Y       40
6     1007    male   N       30
7     1008  female   Y       20

data1 = pd.merge(df,df1,left_on="id",right_on="acct_no")

data1 
Out[50]: 
     id       date        city category   ...    acct_no  gender  pay m-point
0  1001 2013-01-02    Beijing     100-A   ...       1001    male    Y      10
1  1002 2013-01-03          SH    100-B   ...       1002  female    N      12
2  1003 2013-01-04  guangzhou     110-A   ...       1003    male    Y      20
3  1004 2013-01-05    Shenzhen    110-C   ...       1004  female    Y      40
4  1005 2013-01-06    shanghai    210-A   ...       1005    male    N      40
5  1006 2013-01-07    BEIJING     130-F   ...       1006  female    Y      40

[6 rows x 10 columns]
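The merged result above carries both id and acct_no, which hold the same values. As a rough sketch of two alternatives, the snippet below uses cut-down copies of df and df1 (fewer rows and columns, chosen here for brevity) to show a join against the index via right_index, and a right join that keeps accounts missing from df.

```python
import pandas as pd

# Cut-down copies of the post's df and df1 (fewer rows and columns)
df = pd.DataFrame({"id": [1001, 1002, 1003],
                   "city": ["Beijing", "SH", "guangzhou"]})
df1 = pd.DataFrame({"acct_no": [1001, 1002, 1003, 1004],
                    "gender": ["male", "female", "male", "female"]})

# Move acct_no into the index, then join df's "id" column against that index;
# the result has no separate acct_no column
by_index = pd.merge(df, df1.set_index("acct_no"), left_on="id", right_index=True)

# A right join keeps every row of df1; account 1004 has no match in df,
# so its columns from df are filled with NaN
right_join = pd.merge(df, df1, left_on="id", right_on="acct_no", how="right")

print(by_index)
print(right_join)
```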

2) The concat function

Here is another way to combine data: pandas has a concat function, the counterpart of NumPy's concatenate function.

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

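  Parameter          Description

  objs               List or dict of the pandas objects to concatenate; the only required argument

  axis=0             Axis to concatenate along: 0 stacks rows (vertical), 1 places objects side by side (horizontal). Default is 0

  join               How to handle the indexes on the other axis: 'inner' (intersection) or 'outer' (union). Default is 'outer'

  join_axes          Specific indexes to use for the other n-1 axes instead of performing intersection/union logic. Deprecated in pandas 0.25 and removed in pandas 1.0

  keys               Values associated with the objects being concatenated, forming a hierarchical index along the concatenation axis (as the outer level). Can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if levels is set to multi-level arrays)

  levels             Specific indexes to use as the levels of the hierarchical index, if keys is passed

  names              Names for the levels of the hierarchical index, if keys or levels is passed

  verify_integrity   Check the new axis of the result for duplicates and raise an exception if any are found. Default is False, which allows duplicates

  ignore_index       Do not preserve the index along the concatenation axis; produce a new index range(total_length) instead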
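The keys, names, and verify_integrity parameters are easier to follow with a small case. The following is a minimal sketch; the Series s1 and s2 and the labels "first", "second", "source", and "label" are invented for the illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["a", "c"])

# keys tags each input and becomes the outer level of a hierarchical index;
# names labels the two index levels
stacked = pd.concat([s1, s2], keys=["first", "second"], names=["source", "label"])

# With axis=1 the keys become the column names instead
side_by_side = pd.concat([s1, s2], axis=1, keys=["first", "second"])

# verify_integrity=True raises ValueError here because label "a"
# would appear twice in the result
try:
    pd.concat([s1, s2], verify_integrity=True)
except ValueError as exc:
    print("duplicate labels rejected:", exc)

print(stacked)
print(side_by_side)
```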

Example:

pd.concat([df,df1])  # defaults: union of columns (join='outer'), stacked vertically (axis=0)
Out[57]: 
   acct_no   age category        city   ...        id m-point  pay   price
0      NaN  23.0    100-A    Beijing    ...    1001.0     NaN  NaN  1200.0
1      NaN  44.0    100-B          SH   ...    1002.0     NaN  NaN     NaN
2      NaN  54.0    110-A  guangzhou    ...    1003.0     NaN  NaN  2133.0
3      NaN  32.0    110-C    Shenzhen   ...    1004.0     NaN  NaN  5433.0
4      NaN  34.0    210-A    shanghai   ...    1005.0     NaN  NaN     NaN
5      NaN  32.0    130-F    BEIJING    ...    1006.0     NaN  NaN  4432.0
0   1001.0   NaN      NaN         NaN   ...       NaN    10.0    Y     NaN
1   1002.0   NaN      NaN         NaN   ...       NaN    12.0    N     NaN
2   1003.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN
3   1004.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
4   1005.0   NaN      NaN         NaN   ...       NaN    40.0    N     NaN
5   1006.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
6   1007.0   NaN      NaN         NaN   ...       NaN    30.0    N     NaN
7   1008.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN

[14 rows x 10 columns]

pd.concat([df,df1],ignore_index = True)  # vertical union; the original row labels are discarded and a new 0..n-1 index is generated
Out[58]: 
    acct_no   age category        city   ...        id m-point  pay   price
0       NaN  23.0    100-A    Beijing    ...    1001.0     NaN  NaN  1200.0
1       NaN  44.0    100-B          SH   ...    1002.0     NaN  NaN     NaN
2       NaN  54.0    110-A  guangzhou    ...    1003.0     NaN  NaN  2133.0
3       NaN  32.0    110-C    Shenzhen   ...    1004.0     NaN  NaN  5433.0
4       NaN  34.0    210-A    shanghai   ...    1005.0     NaN  NaN     NaN
5       NaN  32.0    130-F    BEIJING    ...    1006.0     NaN  NaN  4432.0
6    1001.0   NaN      NaN         NaN   ...       NaN    10.0    Y     NaN
7    1002.0   NaN      NaN         NaN   ...       NaN    12.0    N     NaN
8    1003.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN
9    1004.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
10   1005.0   NaN      NaN         NaN   ...       NaN    40.0    N     NaN
11   1006.0   NaN      NaN         NaN   ...       NaN    40.0    Y     NaN
12   1007.0   NaN      NaN         NaN   ...       NaN    30.0    N     NaN
13   1008.0   NaN      NaN         NaN   ...       NaN    20.0    Y     NaN

[14 rows x 10 columns]

pd.concat([df,df1],axis = 1,join = 'inner')  # side by side, keeping only index labels present in both; note this can fail when an input has duplicate index labels
Out[59]: 
     id       date        city category   ...    acct_no  gender  pay m-point
0  1001 2013-01-02    Beijing     100-A   ...       1001    male    Y      10
1  1002 2013-01-03          SH    100-B   ...       1002  female    N      12
2  1003 2013-01-04  guangzhou     110-A   ...       1003    male    Y      20
3  1004 2013-01-05    Shenzhen    110-C   ...       1004  female    Y      40
4  1005 2013-01-06    shanghai    210-A   ...       1005    male    N      40
5  1006 2013-01-07    BEIJING     130-F   ...       1006  female    Y      40

[6 rows x 10 columns]

pd.concat([df,df1],axis = 1,join = 'outer')  # side by side, keeping the union of index labels; note this can fail when an input has duplicate index labels
Out[60]: 
       id       date        city category   ...    acct_no  gender  pay m-point
0  1001.0 2013-01-02    Beijing     100-A   ...       1001    male    Y      10
1  1002.0 2013-01-03          SH    100-B   ...       1002  female    N      12
2  1003.0 2013-01-04  guangzhou     110-A   ...       1003    male    Y      20
3  1004.0 2013-01-05    Shenzhen    110-C   ...       1004  female    Y      40
4  1005.0 2013-01-06    shanghai    210-A   ...       1005    male    N      40
5  1006.0 2013-01-07    BEIJING     130-F   ...       1006  female    Y      40
6     NaN        NaT         NaN      NaN   ...       1007    male    N      30
7     NaN        NaT         NaN      NaN   ...       1008  female    Y      20

[8 rows x 10 columns]
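The comments above warn that horizontal concatenation breaks down when an input has duplicate index labels. The sketch below tries to reproduce that with two made-up frames (x, y, and the columns v, w are assumptions for the example) and shows one possible workaround: resetting both indexes so rows pair up by position. The exact exception type depends on the pandas version, so the code catches a generic Exception.

```python
import pandas as pd

x = pd.DataFrame({"v": [1, 2, 3]}, index=[0, 0, 1])      # label 0 appears twice
y = pd.DataFrame({"w": [10, 20, 30]}, index=[0, 1, 2])

# Aligning on a non-unique index is ambiguous, so pandas is expected to refuse
try:
    pd.concat([x, y], axis=1)
    print("axis=1 concat succeeded")
except Exception as exc:  # exception type varies across pandas versions
    print("axis=1 concat failed:", type(exc).__name__)

# Workaround: give both frames a fresh 0..n-1 index so rows pair up by position
aligned = pd.concat([x.reset_index(drop=True), y.reset_index(drop=True)], axis=1)
print(aligned)
```

Resetting the index only makes sense when the row order of the two tables already corresponds; otherwise a merge on a real key column is the safer choice.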

3) The combine_first function (filling missing values where indexes overlap)

# Fully overlapping indexes
a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index = ['a','b','c','d','e','f'])

a
Out[62]: 
a    NaN
b    2.5
c    NaN
d    3.5
e    4.5
f    NaN
dtype: float64

b = pd.Series(np.arange(len(a)),index = ['a','b','c','d','e','f'])

b
Out[64]: 
a    0
b    1
c    2
d    3
e    4
f    5
dtype: int32

a.combine_first(b)  # the missing values in a are filled from b
Out[65]: 
a    0.0
b    2.5
c    2.0
d    3.5
e    4.5
f    5.0
dtype: float64

# Partially overlapping indexes
a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index = ['g','b','c','d','e','f'])

a
Out[67]: 
g    NaN
b    2.5
c    NaN
d    3.5
e    4.5
f    NaN
dtype: float64

a.combine_first(b)  # partially overlapping: the result covers the union of both indexes, and 'g' stays NaN because b has no value for it
Out[68]: 
a    0.0
b    2.5
c    2.0
d    3.5
e    4.5
f    5.0
g    NaN
dtype: float64
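combine_first works on DataFrames as well as Series, patching column by column. The following is a small sketch with two invented frames (left, right, and the columns a, b are assumptions for the example).

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, np.nan, 5.0],
                     "b": [np.nan, 2.0, np.nan]})
right = pd.DataFrame({"a": [5.0, 4.0, 3.0, 2.0],
                      "b": [np.nan, 3.0, 4.0, 6.0]})

# Values from left win wherever they are present; gaps in left are filled
# from right, and the result covers the union of both indexes
patched = left.combine_first(right)
print(patched)
```

Row 3 exists only in right, so it is taken over entirely, while b in row 0 stays NaN because neither frame has a value there.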

