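When processing data I often use inner join, left join, right join and similar joins in a database, and pandas in Python offers several ways to combine tables too. I keep mixing them up, so I am writing them down here.
I read a lot of blog posts, but they always left things unclear. So I decided to just run help() and read the official documentation instead.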
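1. pd.merge(): the data gets wider (it mainly grows horizontally, because the columns of the left and right tables are joined together; depending on the join type, the result may also end up with fewer or more rows)
Syntax: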
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
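Parameters
----------
left : DataFrame (a two-dimensional table)
right : DataFrame or named Series (the data you want to join to it)
how : the join type, one of {'left', 'right', 'outer', 'inner'}, default 'inner'.
on : label or list (with the default None, pandas joins on the columns whose names appear in both tables; you can also name the join keys yourself, e.g. on=['name', 'weight'], which requires those column names to exist in both tables. If the left and right tables use different column names for the same thing, use the next two parameters instead)
left_on : label or list, or array-like (the join columns of the left table, e.g. left_on=['name', 'id'])
right_on : label or list, or array-like (the join columns of the right table, e.g. right_on=['姓名', '學號'])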
left_index : bool, default False (use the row index of the left table as the join key; if it is a MultiIndex, there are several keys)
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index : bool, default False (use the row index of the right table as the join key; same as above)
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
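sort : bool, default False (sort the joined table by the join keys; I usually leave this off, because sorting a large result is extra work that your machine may struggle with)
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).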
suffixes : tuple of (str, str), default ('_x', '_y') (after a join the left and right tables often contribute columns with the same name; this parameter sets the suffix added to such columns, so the output shows which table each one came from)
Suffix to apply to overlapping column names in the left and right
side, respectively. To raise an exception on overlapping columns use
(False, False).
copy : bool, default True (by default the merge works on a copy rather than on the original data)
If False, avoid copy if possible.
indicator : bool or str, default False (optional; adds a column to the result showing where each row came from: left_only if the key exists only in the left table, right_only if only in the right table, and both if it exists in both)
If True, adds a column to output DataFrame called "_merge" with
information on the source of each row.
If string, column with information on source of each row will be added to
output DataFrame, and column will be named value of string.
Information column is Categorical-type and takes on a value of "left_only"
for observations whose merge key only appears in 'left' DataFrame,
"right_only" for observations whose merge key only appears in 'right'
DataFrame, and "both" if the observation's merge key is found in both.
validate : str, optional (rarely used, so I skip it)
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both left and right datasets.
* "one_to_many" or "1:m": check if merge keys are unique in left dataset.
* "many_to_one" or "m:1": check if merge keys are unique in right dataset.
* "many_to_many" or "m:m": allowed, but does not result in checks.
.. versionadded:: 0.21.0
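To make how, on, left_on and right_on more concrete, here is a minimal sketch. The two tables and the column name student_name are made up for illustration and are not from the pandas documentation.

```python
import pandas as pd

# Hypothetical sample tables, invented for this example.
left = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "weight": [50, 60, 70]})
right = pd.DataFrame({"name": ["Bob", "Cid", "Dee"], "height": [170, 180, 160]})

# Both tables share the key column "name", so `on` is enough.
inner = pd.merge(left, right, how="inner", on="name")     # keys found in both tables
left_join = pd.merge(left, right, how="left", on="name")  # every key of the left table
outer = pd.merge(left, right, how="outer", on="name")     # union of the keys

# The result always gains columns; the row count depends on `how`.
print(inner.shape, left_join.shape, outer.shape)

# The key columns have different names: use left_on / right_on.
right_renamed = right.rename(columns={"name": "student_name"})
mixed = pd.merge(left, right_renamed, how="inner",
                 left_on="name", right_on="student_name")
print(mixed.columns.tolist())  # both key columns are kept in the result
```

With left_on and right_on both key columns survive in the output, so you may want to drop one of them afterwards.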
Returns
DataFrame (the result is a two-dimensional table object)
A DataFrame of the two merged objects.
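The suffixes, indicator and validate parameters are easier to understand by looking at the returned DataFrame. The sketch below uses invented id/score tables; catching ValueError relies on the merge validation error being a ValueError subclass.

```python
import pandas as pd

# Hypothetical tables that share both the key "id" and a non-key column "score".
left = pd.DataFrame({"id": [1, 2, 3], "score": [90, 80, 70]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [85, 75, 65]})

merged = pd.merge(
    left, right,
    how="outer", on="id",
    suffixes=("_left", "_right"),  # rename the overlapping "score" columns
    indicator=True,                # add a "_merge" column with the row source
    validate="one_to_one",         # fail if "id" is duplicated on either side
)
print(merged)

# validate raises when the keys are not unique as promised.
dup = pd.DataFrame({"id": [2, 2], "score": [1, 2]})
validation_failed = False
try:
    pd.merge(left, dup, on="id", validate="one_to_one")
except ValueError as exc:  # the merge validation error subclasses ValueError
    validation_failed = True
    print("validate failed:", exc)
```

Filtering on the _merge column afterwards is a handy way to find keys that exist in only one of the two tables.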
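2. pd.concat(): this also combines tables, but only the inner and outer modes exist here, and outer is the default. When I need to stitch together the operating data of several shops, I usually reach for concat. Some of the tables have slightly different headers, but that hardly matters: I just concatenate them and the job is done.
Syntax: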
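The shop scenario above might look like the following sketch. The shop tables and their columns are invented for illustration; the point is that the default outer join keeps every column and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical exports from two shops; shop_b has an extra "refunds" column.
shop_a = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "sales": [100, 120]})
shop_b = pd.DataFrame({"date": ["2020-01-01"], "sales": [90], "refunds": [5]})

# join='outer' is the default: the union of the columns is kept,
# and rows from shop_a get NaN in the missing "refunds" column.
combined = pd.concat([shop_a, shop_b], sort=False)
print(combined)
```

Note that the original row labels are kept, so the index of the result repeats (0, 1, 0) unless ignore_index=True is passed.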
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Concatenate pandas objects along a particular axis with optional set logic
along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.
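The "layer of hierarchical indexing" mentioned above can be sketched with the keys argument. The shop tables and the level names shop and row are assumptions made for this example.

```python
import pandas as pd

# Two hypothetical tables whose row labels overlap (both use 0 and 1).
shop_a = pd.DataFrame({"sales": [100, 120]})
shop_b = pd.DataFrame({"sales": [90, 95]})

# keys adds an outer index level that records which input each row came from;
# names labels the two levels of the resulting MultiIndex.
labeled = pd.concat([shop_a, shop_b], keys=["A", "B"], names=["shop", "row"])
print(labeled)

# The outer level can then be used to pull one source table back out.
print(labeled.loc["A"])
```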
Parameters
----------
objs : the tables to concatenate; wrap several tables in a list []
axis : {0/'index', 1/'columns'}, default 0 (by default the data gets taller; with axis=1 it gets wider, growing horizontally)
join : {'inner', 'outer'}, default 'outer' (the join type; the default is outer)
How to handle indexes on other axis (or axes).
join_axes : list of Index objects
.. deprecated:: 0.25.0
Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic. Use .reindex() before or after concatenation as a replacement.
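ignore_index : bool, default False (by default the result keeps the original index labels of each input; set True to discard them and renumber the result 0, 1, ..., n - 1)
keys : sequence, default None (adds an extra outer level of labels to the rows or columns of the result, e.g. keys=['A', 'B'], so each row or column is marked A or B and you can tell which table it came from; handy for pivoting)
If multiple levels passed, should contain tuples. Construct
hierarchical index using the passed keys as the outermost level.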
levels : list of sequences, default None
Specific levels (unique values) to use for constructing a
MultiIndex. Otherwise they will be inferred from the keys.
names : list, default None
Names for the levels in the resulting hierarchical index.
verify_integrity : bool, default False
Check whether the new concatenated axis contains duplicates. This can
be very expensive relative to the actual data concatenation.
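sort : bool, default None (whether to sort the non-concatenation axis when it is not already aligned. In the version shown here, the default None still sorts for join='outer' and emits a warning; from pandas 1.0 on the default is False, i.e. no sorting)
copy : bool, default True (whether to work on a copy; operating on the original table is risky, in case something goes wrong)
If False, do not copy data unnecessarily.
Returns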
object, type of objs
When concatenating all Series along the index (axis=0), a Series is returned.
When objs contains at least one DataFrame, a DataFrame is returned.
When concatenating along the columns (axis=1), a DataFrame is returned.
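A short sketch of the return types and of the axis, join, ignore_index and verify_integrity parameters described above. All of the sample Series and DataFrames are invented for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2], name="x")
s2 = pd.Series([3, 4], name="y")

# All Series along axis=0: the result is a Series, renumbered by ignore_index.
stacked = pd.concat([s1, s2], ignore_index=True)

# Along axis=1 the result is a DataFrame, one column per Series.
side_by_side = pd.concat([s1, s2], axis=1)

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# join='inner' keeps only the columns present in every input (here: "b").
common = pd.concat([df1, df2], join="inner", ignore_index=True)

# verify_integrity=True raises if the concatenated axis has duplicate labels;
# df1 and df2 both use the row labels 0 and 1.
duplicates_found = False
try:
    pd.concat([df1, df2], sort=False, verify_integrity=True)
except ValueError as exc:
    duplicates_found = True
    print("duplicate index labels:", exc)

print(type(stacked).__name__, type(side_by_side).__name__)
print(common)
```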