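Data Wrangling: Merging, Cleaning, Filtering
pandas and the Python standard library provide a set of high-level, flexible, and efficient core functions and algorithms for wrangling data into the shape you need.
This post mainly covers:
Merging datasets: pd.merge(), pd.concat() and related methods, which work much like join operations in SQL and other relational databases.
Merging datasets
1) The merge function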
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
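Parameter    Description
left         The left DataFrame to merge
right        The right DataFrame to merge
how          Join type: 'inner' (default, intersection of keys), 'outer' (union of keys), 'left' (all rows from the left, matching rows from the right), or 'right' (all rows from the right, matching rows from the left)
on           Column name(s) to join on; they must exist in both DataFrames. If not specified, the intersection of the column names in left and right is used as the join key
left_on      Column(s) in the left DataFrame to use as the join key
right_on     Column(s) in the right DataFrame to use as the join key
left_index   Use the row index of the left DataFrame as its join key
right_index  Use the row index of the right DataFrame as its join key
sort         Sort the merged data by the join keys; defaults to False, as in the signature above. Sorting costs time on large datasets, so leave it off unless ordered output is needed
suffixes     Tuple of strings appended to overlapping column names; defaults to ('_x', '_y'). For example, if both DataFrames have a 'data' column, the result contains 'data_x' and 'data_y'
copy         Set to False to avoid copying data into the result in certain special cases. By default the data is always copied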
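To make the suffixes and right_index parameters concrete, here is a minimal sketch. The frames left and right and their column names are made up for illustration; they are not part of the original post.

```python
import pandas as pd

# Hypothetical toy frames: both carry a non-key column named "data".
left = pd.DataFrame({"key": ["a", "b", "c"], "data": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "data": [10, 20, 40]})

# Inner join on the shared "key" column. The overlapping "data" columns
# are disambiguated with the given suffixes instead of '_x' / '_y'.
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(merged)

# Join a column of the left frame against the row index of the right frame.
right_indexed = right.set_index("key")
by_index = pd.merge(left, right_indexed, left_on="key", right_index=True,
                    suffixes=("_left", "_right"))
print(by_index)
```

Only the keys present on both sides ("a" and "b") survive, because how defaults to 'inner'.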
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', 'guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])
df
Out[46]:
     id       date        city category  age   price
0  1001 2013-01-02    Beijing     100-A   23  1200.0
1  1002 2013-01-03          SH    100-B   44     NaN
2  1003 2013-01-04  guangzhou     110-A   54  2133.0
3  1004 2013-01-05    Shenzhen    110-C   32  5433.0
4  1005 2013-01-06    shanghai    210-A   34     NaN
5  1006 2013-01-07    BEIJING     130-F   32  4432.0

df1 = pd.DataFrame({"acct_no": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})
df1
Out[48]:
   acct_no  gender pay  m-point
0     1001    male   Y       10
1     1002  female   N       12
2     1003    male   Y       20
3     1004  female   Y       40
4     1005    male   N       40
5     1006  female   Y       40
6     1007    male   N       30
7     1008  female   Y       20

# The key columns have different names, so use left_on / right_on
data1 = pd.merge(df, df1, left_on="id", right_on="acct_no")
data1
Out[50]:
     id       date        city category  ...  acct_no  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A  ...     1001    male   Y       10
1  1002 2013-01-03          SH    100-B  ...     1002  female   N       12
2  1003 2013-01-04  guangzhou     110-A  ...     1003    male   Y       20
3  1004 2013-01-05    Shenzhen    110-C  ...     1004  female   Y       40
4  1005 2013-01-06    shanghai    210-A  ...     1005    male   N       40
5  1006 2013-01-07    BEIJING     130-F  ...     1006  female   Y       40

[6 rows x 10 columns]
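The example above uses the default inner join. As a rough sketch of how the how parameter changes the result, the snippet below merges two small frames with each join type and then uses indicator=True (listed in the merge signature) to show where each row came from. The frames customers and accounts are hypothetical stand-ins, smaller than df and df1.

```python
import pandas as pd

# Hypothetical toy data: ids 2 and 3 appear on both sides,
# id 1 only on the left, id 4 only on the right.
customers = pd.DataFrame({"id": [1, 2, 3],
                          "city": ["Beijing", "Shanghai", "Guangzhou"]})
accounts = pd.DataFrame({"acct_no": [2, 3, 4],
                         "pay": ["Y", "N", "Y"]})

# Count the rows produced by each join type.
row_counts = {}
for how in ("inner", "outer", "left", "right"):
    result = pd.merge(customers, accounts,
                      left_on="id", right_on="acct_no", how=how)
    row_counts[how] = len(result)
print(row_counts)

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both".
outer = pd.merge(customers, accounts, left_on="id", right_on="acct_no",
                 how="outer", indicator=True)
print(outer)
```

The inner join keeps only the two shared ids, the outer join keeps all four, and the left and right joins each keep the three rows of their own side.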
2) The concat function
Here is another way to combine data: pandas provides concat, the counterpart of NumPy's concatenate function.
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
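Parameter         Description
objs              The list or dict of objects to concatenate; the items must be pandas objects. This is the only required argument
axis              The axis to concatenate along: 0 stacks the objects vertically, 1 places them side by side. Defaults to 0
join              Whether the index on the other axis is combined by intersection ('inner') or union ('outer'); defaults to 'outer'
join_axes         Specific indexes to use for the other n-1 axes instead of performing intersection/union logic (deprecated in pandas 0.25 and removed in 1.0)
keys              Values to associate with the objects being concatenated, forming a hierarchical index (the outer level) along the concatenation axis. Can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if levels is set to multi-level arrays)
levels            Specific index values to use for the levels of the hierarchical index, when keys is set
names             Names for the hierarchical levels that are created, when keys or levels is set
verify_integrity  Check the new axis of the result for duplicates and raise an exception if any are found. Defaults to False, which allows duplicates
ignore_index      Do not preserve the index along the concatenation axis; produce a new index range(total_length) instead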
Example:
pd.concat([df, df1])  # default: union of columns, stacked vertically
Out[57]:
   acct_no   age category        city  ...      id  m-point  pay   price
0      NaN  23.0    100-A    Beijing   ...  1001.0      NaN  NaN  1200.0
1      NaN  44.0    100-B          SH  ...  1002.0      NaN  NaN     NaN
2      NaN  54.0    110-A  guangzhou   ...  1003.0      NaN  NaN  2133.0
3      NaN  32.0    110-C    Shenzhen  ...  1004.0      NaN  NaN  5433.0
4      NaN  34.0    210-A    shanghai  ...  1005.0      NaN  NaN     NaN
5      NaN  32.0    130-F    BEIJING   ...  1006.0      NaN  NaN  4432.0
0   1001.0   NaN      NaN         NaN  ...     NaN     10.0    Y     NaN
1   1002.0   NaN      NaN         NaN  ...     NaN     12.0    N     NaN
2   1003.0   NaN      NaN         NaN  ...     NaN     20.0    Y     NaN
3   1004.0   NaN      NaN         NaN  ...     NaN     40.0    Y     NaN
4   1005.0   NaN      NaN         NaN  ...     NaN     40.0    N     NaN
5   1006.0   NaN      NaN         NaN  ...     NaN     40.0    Y     NaN
6   1007.0   NaN      NaN         NaN  ...     NaN     30.0    N     NaN
7   1008.0   NaN      NaN         NaN  ...     NaN     20.0    Y     NaN

[14 rows x 10 columns]

pd.concat([df, df1], ignore_index=True)  # vertical union; the original row labels are discarded and a fresh 0..n-1 index is generated
Out[58]:
    acct_no   age category        city  ...      id  m-point  pay   price
0       NaN  23.0    100-A    Beijing   ...  1001.0      NaN  NaN  1200.0
1       NaN  44.0    100-B          SH  ...  1002.0      NaN  NaN     NaN
2       NaN  54.0    110-A  guangzhou   ...  1003.0      NaN  NaN  2133.0
3       NaN  32.0    110-C    Shenzhen  ...  1004.0      NaN  NaN  5433.0
4       NaN  34.0    210-A    shanghai  ...  1005.0      NaN  NaN     NaN
5       NaN  32.0    130-F    BEIJING   ...  1006.0      NaN  NaN  4432.0
6    1001.0   NaN      NaN         NaN  ...     NaN     10.0    Y     NaN
7    1002.0   NaN      NaN         NaN  ...     NaN     12.0    N     NaN
8    1003.0   NaN      NaN         NaN  ...     NaN     20.0    Y     NaN
9    1004.0   NaN      NaN         NaN  ...     NaN     40.0    Y     NaN
10   1005.0   NaN      NaN         NaN  ...     NaN     40.0    N     NaN
11   1006.0   NaN      NaN         NaN  ...     NaN     40.0    Y     NaN
12   1007.0   NaN      NaN         NaN  ...     NaN     30.0    N     NaN
13   1008.0   NaN      NaN         NaN  ...     NaN     20.0    Y     NaN

[14 rows x 10 columns]

pd.concat([df, df1], axis=1, join='inner')  # side by side, intersection of row labels; note that this fails when the input objects contain duplicate index labels
Out[59]:
     id       date        city category  ...  acct_no  gender pay  m-point
0  1001 2013-01-02    Beijing     100-A  ...     1001    male   Y       10
1  1002 2013-01-03          SH    100-B  ...     1002  female   N       12
2  1003 2013-01-04  guangzhou     110-A  ...     1003    male   Y       20
3  1004 2013-01-05    Shenzhen    110-C  ...     1004  female   Y       40
4  1005 2013-01-06    shanghai    210-A  ...     1005    male   N       40
5  1006 2013-01-07    BEIJING     130-F  ...     1006  female   Y       40

[6 rows x 10 columns]

pd.concat([df, df1], axis=1, join='outer')  # side by side, union of row labels; note that this fails when the input objects contain duplicate index labels
Out[60]:
       id       date        city category  ...  acct_no  gender pay  m-point
0  1001.0 2013-01-02    Beijing     100-A  ...     1001    male   Y       10
1  1002.0 2013-01-03          SH    100-B  ...     1002  female   N       12
2  1003.0 2013-01-04  guangzhou     110-A  ...     1003    male   Y       20
3  1004.0 2013-01-05    Shenzhen    110-C  ...     1004  female   Y       40
4  1005.0 2013-01-06    shanghai    210-A  ...     1005    male   N       40
5  1006.0 2013-01-07    BEIJING     130-F  ...     1006  female   Y       40
6     NaN        NaT         NaN      NaN  ...     1007    male   N       30
7     NaN        NaT         NaN      NaN  ...     1008  female   Y       20

[8 rows x 10 columns]
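The keys, names, and verify_integrity parameters from the table are not exercised in the examples above. The following is a minimal sketch of what they do, using three small Series (s1, s2, s3) invented for this illustration.

```python
import pandas as pd

# Hypothetical toy Series; s3 reuses the label "a" from s1.
s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["c", "d"])
s3 = pd.Series([4, 5], index=["a", "e"])

# keys adds an outer index level that records which input each row came
# from; names labels the two resulting index levels.
stacked = pd.concat([s1, s2], keys=["first", "second"],
                    names=["source", "label"])
print(stacked)
print(stacked.loc["second"])  # selects the rows that came from s2

# Along axis=1 the keys become the column names instead.
wide = pd.concat([s1, s2], axis=1, keys=["first", "second"])
print(wide)

# verify_integrity=True raises ValueError when the concatenated
# index would contain duplicate labels (here "a").
try:
    pd.concat([s1, s3], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
print(duplicate_detected)
```

Using keys is often a convenient alternative to ignore_index=True when the origin of each row still matters after concatenation.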
3) The combine_first function (filling missing values from an object with overlapping index labels)
# All index labels overlap
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['a', 'b', 'c', 'd', 'e', 'f'])
a
Out[62]:
a    NaN
b    2.5
c    NaN
d    3.5
e    4.5
f    NaN
dtype: float64

b = pd.Series(np.arange(len(a)), index=['a', 'b', 'c', 'd', 'e', 'f'])
b
Out[64]:
a    0
b    1
c    2
d    3
e    4
f    5
dtype: int32

a.combine_first(b)  # the missing values in a are filled from b
Out[65]:
a    0.0
b    2.5
c    2.0
d    3.5
e    4.5
f    5.0
dtype: float64

# Only some index labels overlap
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['g', 'b', 'c', 'd', 'e', 'f'])
a
Out[67]:
g    NaN
b    2.5
c    NaN
d    3.5
e    4.5
f    NaN
dtype: float64

a.combine_first(b)  # the result covers the union of both indexes; 'g' has no counterpart in b and stays NaN
Out[68]:
a    0.0
b    2.5
c    2.0
d    3.5
e    4.5
f    5.0
g    NaN
dtype: float64
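combine_first is also available on DataFrames, where it patches missing values column by column. The sketch below assumes two made-up frames, primary and backup, that are not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: "primary" has gaps, "backup" has one extra row.
primary = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                        "y": [np.nan, 5.0, np.nan]},
                       index=[0, 1, 2])
backup = pd.DataFrame({"x": [10.0, 20.0, 30.0, 40.0],
                       "y": [50.0, 60.0, np.nan, 80.0]},
                      index=[0, 1, 2, 3])

# Values from primary win wherever they are present; NaNs are filled
# from backup, and the result covers the union of both indexes.
patched = primary.combine_first(backup)
print(patched)
```

A cell stays NaN only when it is missing in both frames, as with row 2 of column y here.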