Combining data in pandas

pandas offers two kinds of combining operations:

Concatenation: pd.concat, DataFrame.append
Merging: pd.merge, DataFrame.join

0. Review: concatenation in NumPy

============================================

Exercise 12:

Create two 3*3 matrices and concatenate them along each of the two dimensions

============================================

In [19]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

In [20]:

nd = np.random.randint(0,10,size=(3,3))
nd

Out[20]:

array([[6, 3, 4],
       [3, 9, 8],
       [8, 7, 8]])

In [24]:

np.concatenate((nd,nd),axis=0)  # axis=0 concatenates along the rows (stacks vertically)

Out[24]:

array([[6, 3, 4],
       [3, 9, 8],
       [8, 7, 8],
       [6, 3, 4],
       [3, 9, 8],
       [8, 7, 8]])

In [25]:

np.concatenate([nd,nd],axis=1)  # axis=1 concatenates along the columns; a tuple () or a list [] works equally well

Out[25]:

array([[6, 3, 4, 6, 3, 4],
       [3, 9, 8, 3, 9, 8],
       [8, 7, 8, 8, 7, 8]])
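
A quick way to check which direction each axis value grows is to look at the resulting shapes. The sketch below uses a fixed array (an assumption made so the result is reproducible, instead of the random array used in this notebook).

```python
import numpy as np

# A fixed 3x3 array (illustrative values, not the random ones above)
a = np.arange(9).reshape(3, 3)

rows = np.concatenate([a, a], axis=0)  # stacks vertically: more rows
cols = np.concatenate([a, a], axis=1)  # stacks horizontally: more columns

print(rows.shape)  # (6, 3)
print(cols.shape)  # (3, 6)
```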

To make the examples easier to follow, we first define a function that builds a DataFrame:

In [26]:

def make_df(inds,cols):
    # the dict keys become the column names
    data = {key:[key+str(i) for i in inds] for key in cols}
    return DataFrame(data,index=inds,columns=cols)

In [28]:

make_df([1,2],list('AB'))

Out[28]:

	A	B
1	A1	B1
2	A2	B2

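1. Concatenating with pd.concat()

pandas provides pd.concat, which works like np.concatenate but takes more parameters (the signature below is from the pandas version used in this notebook; join_axes was removed in pandas 1.0):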

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

1) Simple concatenation

Like np.concatenate, it adds rows by default (axis=0)

In [29]:

df1 = make_df([0,1],list('AB'))
df2 = make_df([2,3],list('AB'))

In [30]:

display(df1,df2)

	A	B
0	A0	B0
1	A1	B1

	A	B
2	A2	B2
3	A3	B3

Set axis to change the direction of concatenation

In [31]:

pd.concat([df1,df2])

Out[31]:

	A	B
0	A0	B0
1	A1	B1
2	A2	B2
3	A3	B3

In [32]:

pd.concat((df1,df2),axis = 1)

Out[32]:

	A	B	A	B
0	A0	B0	NaN	NaN
1	A1	B1	NaN	NaN
2	NaN	NaN	A2	B2
3	NaN	NaN	A3	B3

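Note that index labels may repeat after concatenation

Alternatively, pass ignore_index=True to discard the existing labels and renumber them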

In [34]:

pd.concat((df1,df2),axis=1,ignore_index=True)

Out[34]:

	0	1	2	3
0	A0	B0	NaN	NaN
1	A1	B1	NaN	NaN
2	NaN	NaN	A2	B2
3	NaN	NaN	A3	B3
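
The effect of repeated labels is easiest to see when both frames share the same row labels. The following is a minimal sketch, assuming two small frames built with the same make_df helper (redefined here so the block runs on its own). It also shows verify_integrity=True, a pd.concat option that rejects overlapping labels instead of silently keeping them.

```python
import pandas as pd

def make_df(inds, cols):
    """Same helper as in this tutorial: each cell is column name + row label."""
    data = {key: [key + str(i) for i in inds] for key in cols}
    return pd.DataFrame(data, index=inds, columns=cols)

a = make_df([0, 1], list('AB'))
b = make_df([0, 1], list('AB'))  # deliberately reuses the row labels 0 and 1

dup = pd.concat([a, b])                       # labels are kept, so they repeat
fresh = pd.concat([a, b], ignore_index=True)  # labels are renumbered from 0

try:
    pd.concat([a, b], verify_integrity=True)
    raised = False
except ValueError:
    raised = True  # overlapping labels are reported as an error

print(list(dup.index))    # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
```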

Or use keys to build a multi-level index

concat([x,y],keys=['x','y'])

In [13]:

pd.concat([df1,df2],keys=['x','y'])

Out[13]:

		A	B
x	0	A0	B0
x	1	A1	B1
y	2	A2	B2
y	3	A3	B3
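
Once keys has added an outer index level, that level can be used to get each original frame back. This is a small sketch, assuming df1 and df2 carry the row labels 0, 1 and 2, 3 as in the simple-concatenation cells; make_df is redefined so the block is self-contained.

```python
import pandas as pd

def make_df(inds, cols):
    """Same helper as in this tutorial: each cell is column name + row label."""
    data = {key: [key + str(i) for i in inds] for key in cols}
    return pd.DataFrame(data, index=inds, columns=cols)

df1 = make_df([0, 1], list('AB'))
df2 = make_df([2, 3], list('AB'))

both = pd.concat([df1, df2], keys=['x', 'y'])

part = both.loc['y']            # the rows that came from df2
cell = both.loc[('x', 1), 'B']  # outer label, inner label, then column

print(both.index.nlevels)  # 2
print(cell)                # B1
```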

In [ ]:

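# pd is the module: import pandas as pd

# df1 and df2 are concrete instances
# concat is a module-level function; it belongs one level up, to pandas, which is also where DataFrame comes from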

============================================

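Exercise 13:

Think about where concatenation is useful.
Using what you learned yesterday, build a midterm score table ddd for Zhang San and Li Si
Suppose a new exam subject, "Computer Science", is added. How would you add it?
How would you add the scores of a new student, Wang Laowu?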

============================================

In [ ]:

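2) Mismatched concatenation

"Mismatched" means that the labels on the other axis do not line up: the column labels differ when concatenating vertically, or the row labels differ when concatenating horizontally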

In [38]:

df1 = make_df([1,2],list('AB'))
df2 = make_df([2,4],list('BC'))
display(df1,df2)

	A	B
1	A1	B1
2	A2	B2

	B	C
2	B2	C2
4	B4	C4

There are three ways to join:

Outer join: fill the gaps with NaN (the default)

In [39]:

pd.concat([df1,df2])

C:\Users\BLX\AppData\Roaming\Python\Python37\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.

Out[39]:

	A	B	C
1	A1	B1	NaN
2	A2	B2	NaN
2	NaN	B2	C2
4	NaN	B4	C4

Inner join: keep only the labels that match

In [41]:

# keep only the rows whose labels appear in both frames
pd.concat((df1,df2),join = 'inner',axis = 1)

Out[41]:

	A	B	B	C
2	A2	B2	B2	C2
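
The same join parameter also applies when concatenating vertically, where it decides which columns survive. A minimal sketch follows, assuming the df1 and df2 of this subsection (columns A, B and B, C); sort=False is passed only to keep the column order of appearance and avoid the sorting warning.

```python
import pandas as pd

def make_df(inds, cols):
    """Same helper as in this tutorial: each cell is column name + row label."""
    data = {key: [key + str(i) for i in inds] for key in cols}
    return pd.DataFrame(data, index=inds, columns=cols)

df1 = make_df([1, 2], list('AB'))
df2 = make_df([2, 4], list('BC'))

outer = pd.concat([df1, df2], join='outer', sort=False)  # union of columns
inner = pd.concat([df1, df2], join='inner')              # shared columns only

print(list(outer.columns))  # ['A', 'B', 'C']
print(list(inner.columns))  # ['B']
```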

Joining on a specified axis: join_axes (available in the pandas version used here; removed in pandas 1.0)

In [42]:

df2.columns

Out[42]:

Index(['B', 'C'], dtype='object')

In [43]:

# join_axes uses one DataFrame's column index as the column index of the result
pd.concat([df1,df2],join_axes=[df2.columns])

Out[43]:

	B	C
1	B1	NaN
2	B2	NaN
2	B2	C2
4	B4	C4
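
pd.concat no longer accepts join_axes in pandas 1.0 and later. One way to get the same table on current versions, offered here as a sketch and not as the author's method, is to concatenate first and then reindex the columns to those of the chosen frame.

```python
import pandas as pd

def make_df(inds, cols):
    """Same helper as in this tutorial: each cell is column name + row label."""
    data = {key: [key + str(i) for i in inds] for key in cols}
    return pd.DataFrame(data, index=inds, columns=cols)

df1 = make_df([1, 2], list('AB'))
df2 = make_df([2, 4], list('BC'))

# Concatenate with the default outer join, then keep only df2's columns
result = pd.concat([df1, df2], sort=False).reindex(columns=df2.columns)

print(list(result.columns))  # ['B', 'C']
```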

============================================

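Exercise 14:

Suppose the final-exam score table ddd2 has no scores for Zhang San, only for Li Si, Wang Laowu and Zhao Xiaoliu. Concatenate it in several different ways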

============================================

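3) Appending with append()

Concatenating at the end is so common that DataFrame has a dedicated method, append, for it. (append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions pd.concat does the same job.)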

In [44]:

display(df1,df2)

	A	B
1	A1	B1
2	A2	B2

	B	C
2	B2	C2
4	B4	C4

In [49]:

# append is a DataFrame method; concat is a function of the pandas module
# pd.concat((df1,df2))
df1.append(df2)

Out[49]:

	A	B	C
1	A1	B1	NaN
2	A2	B2	NaN
2	NaN	B2	C2
4	NaN	B4	C4
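
On pandas 2.0 and later, where DataFrame.append is no longer available, the same result can be obtained with pd.concat. The sketch below reuses this subsection's df1 and df2; the extra row new_row is a hypothetical record added only for illustration.

```python
import pandas as pd

def make_df(inds, cols):
    """Same helper as in this tutorial: each cell is column name + row label."""
    data = {key: [key + str(i) for i in inds] for key in cols}
    return pd.DataFrame(data, index=inds, columns=cols)

df1 = make_df([1, 2], list('AB'))
df2 = make_df([2, 4], list('BC'))

# Same rows as df1.append(df2) in older pandas
stacked = pd.concat([df1, df2], sort=False)

# To add a single record, wrap it in a one-row DataFrame first
new_row = pd.DataFrame([{'A': 'A9', 'B': 'B9'}], index=[9])  # hypothetical row
grown = pd.concat([df1, new_row])

print(stacked.shape)      # (4, 3)
print(list(grown.index))  # [1, 2, 9]
```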

============================================

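Exercise 15:

Create a final-exam score table ddd3 containing only Zhang San, Li Si and Wang Laowu, and use append() to concatenate it with the midterm table ddd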

============================================

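2. Merging with pd.merge()

merge differs from concat in that it combines rows by matching the values in a shared row or column (the key)

By default, pd.merge() uses the columns whose names appear in both frames as the key.

Note that the values in the key column do not need to be in the same order in the two frames

1) One-to-one merge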

In [54]:

# merge combines the rows whose key values match
df1 = DataFrame({'employee':['Po','Sara','Danis'],
                 'group':['sail','couting','marketing']})
df2 = DataFrame({'employee':['Po','Sara','Bush'],
                 'work_time':[2,3,1]})
display(df1,df2)

	employee	group
0	Po	sail
1	Sara	couting
2	Danis	marketing

	employee	work_time
0	Po	2
1	Sara	3
2	Bush	1

In [55]:

pd.merge(df1,df2)

Out[55]:

	employee	group	work_time
0	Po	sail	2
1	Sara	couting	3

In [56]:

df1.merge(df2)

Out[56]:

	employee	group	work_time
0	Po	sail	2
1	Sara	couting	3

2) Many-to-one merge

In [57]:

df1 = DataFrame({'employee':['Po','Sara','Danis'],
                 'group':['sail','couting','marketing']})
df2 = DataFrame({'employee':['Po','Po','Bush'],
                 'work_time':[2,3,1]})
display(df1,df2)

	employee	group
0	Po	sail
1	Sara	couting
2	Danis	marketing

	employee	work_time
0	Po	2
1	Po	3
2	Bush	1

In [58]:

pd.merge(df1,df2)

Out[58]:

	employee	group	work_time
0	Po	sail	2
1	Po	sail	3

3) Many-to-many merge

In [61]:

df1 = DataFrame({'employee':['Po','Po','Danis'],
                 'group':['sail','couting','marketing']})
df2 = DataFrame({'employee':['Po','Po','Bush'],
                 'work_time':[2,3,1]})
display(df1,df2)

	employee	group
0	Po	sail
1	Po	couting
2	Danis	marketing

	employee	work_time
0	Po	2
1	Po	3
2	Bush	1

In [62]:

pd.merge(df1,df2)

Out[62]:

	employee	group	work_time
0	Po	sail	2
1	Po	sail	3
2	Po	couting	2
3	Po	couting	3
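
Because a many-to-many merge multiplies rows (2 x 2 = 4 rows for 'Po' above), it can be worth stating the expected relationship explicitly. A minimal sketch using the validate parameter of pd.merge, which raises pandas.errors.MergeError when the keys do not have the declared cardinality:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({'employee': ['Po', 'Po', 'Danis'],
                     'group': ['sail', 'couting', 'marketing']})
right = pd.DataFrame({'employee': ['Po', 'Po', 'Bush'],
                      'work_time': [2, 3, 1]})

# many_to_many performs no uniqueness check, so this always passes
many = pd.merge(left, right, validate='many_to_many')

try:
    # 'Po' appears twice on each side, so the keys are not unique
    pd.merge(left, right, validate='one_to_one')
    rejected = False
except MergeError:
    rejected = True

print(len(many))  # 4
print(rejected)   # True
```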

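4) Specifying the key

Use on= to state explicitly which column is the key; this is needed when the two frames share more than one column name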

In [66]:

df3 = DataFrame({'employee':['Po','Summer','Flower'],
                 'group':['sail','marketing','serch'],
                 'salary':[12000,10000,8000]})
df4 = DataFrame({'employee':['Po','Winter','Flower'],
                 'group':['marketing','marketing','serch'],
                 'work_time':[2,1,5]})
display(df3,df4)

	employee	group	salary
0	Po	sail	12000
1	Summer	marketing	10000
2	Flower	serch	8000

	employee	group	work_time
0	Po	marketing	2
1	Winter	marketing	1
2	Flower	serch	5

In [67]:

pd.merge(df3,df4)

Out[67]:

	employee	group	salary	work_time
0	Flower	serch	8000	5

In [70]:

pd.merge(df3,df4,on='employee')

Out[70]:

	employee	group_x	salary	group_y	work_time
0	Po	sail	12000	marketing	2
1	Flower	serch	8000	serch	5

In [73]:

pd.merge(df3,df4,on='group',suffixes=['_A','_B'])

Out[73]:

	employee_A	group	salary	employee_B	work_time
0	Summer	marketing	10000	Po	2
1	Summer	marketing	10000	Winter	1
2	Flower	serch	8000	Flower	5
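
The three results above follow from one rule: with no on= argument, every shared column name is part of the key. The sketch below, which rebuilds df3 and df4 so it runs on its own, checks that the default merge matches an explicit on=['employee', 'group'].

```python
import pandas as pd

df3 = pd.DataFrame({'employee': ['Po', 'Summer', 'Flower'],
                    'group': ['sail', 'marketing', 'serch'],
                    'salary': [12000, 10000, 8000]})
df4 = pd.DataFrame({'employee': ['Po', 'Winter', 'Flower'],
                    'group': ['marketing', 'marketing', 'serch'],
                    'work_time': [2, 1, 5]})

default = pd.merge(df3, df4)                            # key = all shared columns
explicit = pd.merge(df3, df4, on=['employee', 'group']) # the same key, spelled out
by_emp = pd.merge(df3, df4, on='employee')              # group is now a plain column

print(default.equals(explicit))  # True
print(list(by_emp.columns))
```

With on='employee', the non-key column group exists on both sides, so pandas keeps both copies and distinguishes them with the default suffixes _x and _y.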

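Use left_on and right_on to name the key column on each side; this is needed when the key columns have different names in the two frames
The first argument is the left frame and the second is the right frame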

In [79]:

df3 = DataFrame({'employer':['Po','Summer','Flower'],
                 'Team':['sail','marketing','serch'],
                 'salary':[12000,10000,8000]})
df4 = DataFrame({'employee':['Po','Winter','Flower'],
                 'group':['marketing','marketing','serch'],
                 'work_time':[2,1,5]})
display(df3,df4)

	employer	Team	salary
0	Po	sail	12000
1	Summer	marketing	10000
2	Flower	serch	8000

	employee	group	work_time
0	Po	marketing	2
1	Winter	marketing	1
2	Flower	serch	5

In [81]:

pd.merge(df3,df4,left_on='employer',right_on='employee')

Out[81]:

	employer	Team	salary	employee	group	work_time
0	Po	sail	12000	Po	marketing	2
1	Flower	serch	8000	Flower	serch	5

In [82]:

pd.merge(df3,df4,left_on='Team',right_on='group')

Out[82]:

	employer	Team	salary	employee	group	work_time
0	Summer	marketing	10000	Po	marketing	2
1	Summer	marketing	10000	Winter	marketing	1
2	Flower	serch	8000	Flower	serch	5
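
When the key of one frame lives in its index instead of a column, right_index=True (or left_index=True) takes the place of right_on (or left_on). A small sketch, assuming we first move df4's employee column into the index with set_index:

```python
import pandas as pd

df3 = pd.DataFrame({'employer': ['Po', 'Summer', 'Flower'],
                    'Team': ['sail', 'marketing', 'serch'],
                    'salary': [12000, 10000, 8000]})
df4 = pd.DataFrame({'employee': ['Po', 'Winter', 'Flower'],
                    'group': ['marketing', 'marketing', 'serch'],
                    'work_time': [2, 1, 5]})

right = df4.set_index('employee')  # the key is now the row index

# Match the employer column on the left against the index on the right
merged = pd.merge(df3, right, left_on='employer', right_index=True)

print(merged['employer'].tolist())   # ['Po', 'Flower']
print(merged['work_time'].tolist())  # [2, 5]
```

A side effect of this form is that the result has no separate employee column, since the right-hand key was never a column.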

============================================

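Exercise 16:

Suppose there are two score tables: ddd for Zhang San, Li Si and Wang Laowu, and ddd4 for Zhang San and Zhao Xiaoliu. How would you merge them?
What if Zhang San's name was mistyped in ddd4 as Zhang Shisan?
Practice the many-to-one and many-to-many cases on your own
Study left_index and right_index on your own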

============================================

5) Inner and outer merges

Inner merge: keep only the keys present in both frames (the default)

In [85]:

df1 = DataFrame({'age':[18,22,33],'height':[175,169,180]})
df2 = DataFrame({'age':[18,23,31],'weight':[65,70,80]})

In [86]:

pd.merge(df1,df2)

Out[86]:

	age	height	weight
0	18	175	65

In [87]:

df1.merge(df2,how='inner')

Out[87]:

	age	height	weight
0	18	175	65

Outer merge, how='outer': keep every key and fill the gaps with NaN

In [88]:

df1.merge(df2,how = 'outer')

Out[88]:

	age	height	weight
0	18	175.0	65.0
1	22	169.0	NaN
2	33	180.0	NaN
3	23	NaN	70.0
4	31	NaN	80.0

Left and right merges: how='left', how='right'

In [89]:

df1.merge(df2,how = 'left')  # keep every key of the left frame

Out[89]:

	age	height	weight
0	18	175	65.0
1	22	169	NaN
2	33	180	NaN

In [90]:

pd.merge(df1,df2,how='right')  # keep every key of the right frame

Out[90]:

	age	height	weight
0	18	175.0	65
1	23	NaN	70
2	31	NaN	80
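
To see which side each row of an outer merge came from, pd.merge accepts indicator=True, which adds a _merge column with the values left_only, right_only or both. A minimal sketch with the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'age': [18, 22, 33], 'height': [175, 169, 180]})
df2 = pd.DataFrame({'age': [18, 23, 31], 'weight': [65, 70, 80]})

tagged = pd.merge(df1, df2, how='outer', indicator=True)

# Count how many rows fall into each category
counts = tagged['_merge'].value_counts().to_dict()

# Keys that exist only in the left frame
left_only = tagged.loc[tagged['_merge'] == 'left_only', 'age'].tolist()

print(counts['both'], counts['left_only'], counts['right_only'])  # 1 2 2
print(sorted(left_only))  # [22, 33]
```

Filtering on _merge is also a convenient way to find the keys an inner merge would have dropped.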

============================================

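Exercise 17:

If only the Chinese, math and English scores of Zhang San and Zhao Xiaoliu are available, how would you merge?
Considering the use case, merge ddd and ddd4 in several ways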

============================================

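6) Resolving column conflicts

When columns conflict, that is, when the two frames share more than one column name, use on= to choose which column is the key and suffixes= to rename the remaining conflicting columns

suffixes= lets you choose the suffixes yourself (the defaults are _x and _y)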

In [91]:

display(df3,df4)

	employer	Team	salary
0	Po	sail	12000
1	Summer	marketing	10000
2	Flower	serch	8000

	employee	group	work_time
0	Po	marketing	2
1	Winter	marketing	1
2	Flower	serch	5

In [93]:

df3.columns = ['employee','group','salary']
display(df3)

	employee	group	salary
0	Po	sail	12000
1	Summer	marketing	10000
2	Flower	serch	8000

In [94]:

pd.merge(df3,df4,on='employee',suffixes=['_李','_王'])

Out[94]:

	employee	group_李	salary	group_王	work_time
0	Po	sail	12000	marketing	2
1	Flower	serch	8000	serch	5

============================================

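Exercise 18:

Suppose two students are both named Li Si, and ddd5 and ddd6 are each score tables for Zhang San and Li Si. How would you merge them?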

============================================

Homework

3. Case study: analyzing US state population data

First load the files and look at a sample of the data

In [62]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

In [63]:

# read the data with pandas
pop = pd.read_csv('../../data/state-population.csv')
areas = pd.read_csv('../../data/state-areas.csv')
abb = pd.read_csv('../../data/state-abbrevs.csv')

In [64]:

pop.shape

Out[64]:

(2544, 4)

In [65]:

pop.head()

Out[65]:

	state/region	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0
2	AL	under18	2010	1130966.0
3	AL	total	2010	4785570.0
4	AL	under18	2011	1125763.0

In [70]:

areas.shape

Out[70]:

(52, 2)

In [69]:

abb.shape

Out[69]:

(51, 2)

Merge the pop and abb DataFrames, using the state/region column of pop and the abbreviation column of abb as the keys.

Use an outer merge so that no information is lost.

In [71]:

pop.head()

Out[71]:

	state/region	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0
2	AL	under18	2010	1130966.0
3	AL	total	2010	4785570.0
4	AL	under18	2011	1125763.0

In [72]:

abb.head()

Out[72]:

	state	abbreviation
0	Alabama	AL
1	Alaska	AK
2	Arizona	AZ
3	Arkansas	AR
4	California	CA

In [73]:

display(pop.shape,abb.shape)

(2544, 4)

(51, 2)

In [78]:

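# In this case how='left' and how='outer' give the same result: pop has far more rows than abb, and every key in abb also occurs in pop
# In general left is the weaker choice: it would drop a key found only in abb (a hypothetical extra entry such as Hebei), while outer keeps it
pop_m = pop.merge(abb,left_on='state/region',right_on='abbreviation',how = 'outer')
pop_m.shape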

Out[78]:

(2544, 6)

Drop the abbreviation column (axis=1)

In [79]:

pop_m.head()

Out[79]:

	state/region	ages	year	population	state	abbreviation
0	AL	under18	2012	1117489.0	Alabama	AL
1	AL	total	2012	4817528.0	Alabama	AL
2	AL	under18	2010	1130966.0	Alabama	AL
3	AL	total	2010	4785570.0	Alabama	AL
4	AL	under18	2011	1125763.0	Alabama	AL

In [83]:

pop_m.drop('abbreviation',axis = 1,inplace=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-15dcfc478d0b> in <module>()
----> 1 pop_m.drop('abbreviation',axis = 1,inplace=True)

/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
   2159                 new_axis = axis.drop(labels, level=level, errors=errors)
   2160             else:
-> 2161                 new_axis = axis.drop(labels, errors=errors)
   2162             dropped = self.reindex(**{axis_name: new_axis})
   2163             try:

/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   3622             if errors != 'ignore':
   3623                 raise ValueError('labels %s not contained in axis' %
-> 3624                                  labels[mask])
   3625             indexer = indexer[~mask]
   3626             return self.delete(indexer)

ValueError: labels ['abbreviation'] not contained in axis

(This error appears because the cell was run a second time: inplace=True had already removed the column on the first run, as the next output shows.)

In [82]:

pop_m.head()

Out[82]:

	state/region	ages	year	population	state
0	AL	under18	2012	1117489.0	Alabama
1	AL	total	2012	4817528.0	Alabama
2	AL	under18	2010	1130966.0	Alabama
3	AL	total	2010	4785570.0	Alabama
4	AL	under18	2011	1125763.0	Alabama

Find the columns that contain missing data.

Use .isnull().any(): as soon as a column contains a single missing value, it shows True.

In [88]:

pop_m.isnull().any()

Out[88]:

state/region    False
ages            False
year            False
population       True
state            True
dtype: bool
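
.isnull().any() only says whether a column has gaps. A related pattern, sketched here on a hypothetical three-row frame (the values are illustrative, not taken from the dataset), is .isnull().sum(), which counts the gaps per column, and .any(axis=1), which flags the affected rows.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing cells (illustrative values only)
df = pd.DataFrame({'state': ['Alabama', None, 'Alaska'],
                   'population': [100.0, np.nan, np.nan]})

has_nan = df.isnull().any()             # one boolean per column
n_missing = df.isnull().sum()           # number of missing cells per column
bad_rows = df[df.isnull().any(axis=1)]  # rows with at least one missing cell

print(n_missing['state'], n_missing['population'])  # 1 2
print(list(bad_rows.index))                         # [1, 2]
```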

In [ ]:

# the population and state columns both contain missing data

Inspect the missing data

In [92]:

# select the rows that contain at least one missing value
pop_m.loc[pop_m.isnull().any(axis = 1)]

Out[92]:

	state/region	ages	year	population	state
2448	PR	under18	1990	NaN	NaN
2449	PR	total	1990	NaN	NaN
2450	PR	total	1991	NaN	NaN
2451	PR	under18	1991	NaN	NaN
2452	PR	total	1993	NaN	NaN
2453	PR	under18	1993	NaN	NaN
2454	PR	under18	1992	NaN	NaN
2455	PR	total	1992	NaN	NaN
2456	PR	under18	1994	NaN	NaN
2457	PR	total	1994	NaN	NaN
2458	PR	total	1995	NaN	NaN
2459	PR	under18	1995	NaN	NaN
2460	PR	under18	1996	NaN	NaN
2461	PR	total	1996	NaN	NaN
2462	PR	under18	1998	NaN	NaN
2463	PR	total	1998	NaN	NaN
2464	PR	total	1997	NaN	NaN
2465	PR	under18	1997	NaN	NaN
2466	PR	total	1999	NaN	NaN
2467	PR	under18	1999	NaN	NaN
2468	PR	total	2000	3810605.0	NaN
2469	PR	under18	2000	1089063.0	NaN
2470	PR	total	2001	3818774.0	NaN
2471	PR	under18	2001	1077566.0	NaN
2472	PR	total	2002	3823701.0	NaN
2473	PR	under18	2002	1065051.0	NaN
2474	PR	total	2004	3826878.0	NaN
2475	PR	under18	2004	1035919.0	NaN
2476	PR	total	2003	3826095.0	NaN
2477	PR	under18	2003	1050615.0	NaN
...	...	...	...	...	...
2514	USA	under18	1999	71946051.0	NaN
2515	USA	total	2000	282162411.0	NaN
2516	USA	under18	2000	72376189.0	NaN
2517	USA	total	1999	279040181.0	NaN
2518	USA	total	2001	284968955.0	NaN
2519	USA	under18	2001	72671175.0	NaN
2520	USA	total	2002	287625193.0	NaN
2521	USA	under18	2002	72936457.0	NaN
2522	USA	total	2003	290107933.0	NaN
2523	USA	under18	2003	73100758.0	NaN
2524	USA	total	2004	292805298.0	NaN
2525	USA	under18	2004	73297735.0	NaN
2526	USA	total	2005	295516599.0	NaN
2527	USA	under18	2005	73523669.0	NaN
2528	USA	total	2006	298379912.0	NaN
2529	USA	under18	2006	73757714.0	NaN
2530	USA	total	2007	301231207.0	NaN
2531	USA	under18	2007	74019405.0	NaN
2532	USA	total	2008	304093966.0	NaN
2533	USA	under18	2008	74104602.0	NaN
2534	USA	under18	2013	73585872.0	NaN
2535	USA	total	2013	316128839.0	NaN
2536	USA	total	2009	306771529.0	NaN
2537	USA	under18	2009	74134167.0	NaN
2538	USA	under18	2010	74119556.0	NaN
2539	USA	total	2010	309326295.0	NaN
2540	USA	under18	2011	73902222.0	NaN
2541	USA	total	2011	311582564.0	NaN
2542	USA	under18	2012	73708179.0	NaN
2543	USA	total	2012	313873685.0	NaN

96 rows × 5 columns

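Rows are selected according to whether they contain missing data: a row is shown when its mask value is True

Find which state/region values have NaN in the state column, using unique() to list the distinct values

In [94]:

condition = pop_m['state'].isnull()
pop_m['state/region'][condition].unique()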

Out[94]:

array(['PR', 'USA'], dtype=object)

In [95]:

areas

Out[95]:

	state	area (sq. mi)
0	Alabama	52423
1	Alaska	656425
2	Arizona	114006
3	Arkansas	53182
4	California	163707
5	Colorado	104100
6	Connecticut	5544
7	Delaware	1954
8	Florida	65758
9	Georgia	59441
10	Hawaii	10932
11	Idaho	83574
12	Illinois	57918
13	Indiana	36420
14	Iowa	56276
15	Kansas	82282
16	Kentucky	40411
17	Louisiana	51843
18	Maine	35387
19	Maryland	12407
20	Massachusetts	10555
21	Michigan	96810
22	Minnesota	86943
23	Mississippi	48434
24	Missouri	69709
25	Montana	147046
26	Nebraska	77358
27	Nevada	110567
28	New Hampshire	9351
29	New Jersey	8722
30	New Mexico	121593
31	New York	54475
32	North Carolina	53821
33	North Dakota	70704
34	Ohio	44828
35	Oklahoma	69903
36	Oregon	98386
37	Pennsylvania	46058
38	Rhode Island	1545
39	South Carolina	32007
40	South Dakota	77121
41	Tennessee	42146
42	Texas	268601
43	Utah	84904
44	Vermont	9615
45	Virginia	42769
46	Washington	71303
47	West Virginia	24231
48	Wisconsin	65503
49	Wyoming	97818
50	District of Columbia	68
51	Puerto Rico	3515

In [ ]:

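Only two state/region codes have no state name

Fill in the correct state value for each of these state/region codes, which removes every NaN from the state column.

Remember this way of clearing missing values!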

In [96]:

# Puerto Rico

condition = pop_m['state/region'] == 'PR'
condition

Out[96]:

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514    False
2515    False
2516    False
2517    False
2518    False
2519    False
2520    False
2521    False
2522    False
2523    False
2524    False
2525    False
2526    False
2527    False
2528    False
2529    False
2530    False
2531    False
2532    False
2533    False
2534    False
2535    False
2536    False
2537    False
2538    False
2539    False
2540    False
2541    False
2542    False
2543    False
Name: state/region, Length: 2544, dtype: bool

In [97]:

pop_m['state'][condition] = 'Puerto Rico'

/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

In [99]:

condition = pop_m['state/region'] == 'USA'
pop_m['state'][condition] = 'United States'

/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
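
The chained form pop_m['state'][condition] = ... used in this walkthrough is what triggers SettingWithCopyWarning. A single .loc call with a row mask and a column label performs the same fill without that ambiguity. The sketch below uses a hypothetical three-row stand-in for pop_m.

```python
import pandas as pd

# Toy stand-in for pop_m (illustrative rows only)
df = pd.DataFrame({'state/region': ['AL', 'PR', 'USA'],
                   'state': ['Alabama', None, None]})

# One indexing operation: rows selected by mask, column selected by label
df.loc[df['state/region'] == 'PR', 'state'] = 'Puerto Rico'
df.loc[df['state/region'] == 'USA', 'state'] = 'United States'

print(df['state'].tolist())  # ['Alabama', 'Puerto Rico', 'United States']
```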

In [100]:

# the fill applied above took effect
pop_m.isnull().any()

Out[100]:

state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

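Merge in the per-state area data, areas, on the shared state column. The cell below uses an outer merge; a left merge would give the same rows here.

Think about why an outer (or left) merge is used here and not the default inner merge.

In [102]:

pop.head()
# After merging with abb, the population DataFrame has the full state names,
# so it can now be merged with the areas DataFrame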

Out[102]:

	state/region	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0
2	AL	under18	2010	1130966.0
3	AL	total	2010	4785570.0
4	AL	under18	2011	1125763.0

In [103]:

pop_areas_m = pop_m.merge(areas,how = 'outer')

Again, look for columns that contain missing data

In [105]:

pop_areas_m.shape

Out[105]:

(2544, 6)

In [109]:

areas

Out[109]:

	state	area (sq. mi)
0	Alabama	52423
1	Alaska	656425
2	Arizona	114006
3	Arkansas	53182
4	California	163707
5	Colorado	104100
6	Connecticut	5544
7	Delaware	1954
8	Florida	65758
9	Georgia	59441
10	Hawaii	10932
11	Idaho	83574
12	Illinois	57918
13	Indiana	36420
14	Iowa	56276
15	Kansas	82282
16	Kentucky	40411
17	Louisiana	51843
18	Maine	35387
19	Maryland	12407
20	Massachusetts	10555
21	Michigan	96810
22	Minnesota	86943
23	Mississippi	48434
24	Missouri	69709
25	Montana	147046
26	Nebraska	77358
27	Nevada	110567
28	New Hampshire	9351
29	New Jersey	8722
30	New Mexico	121593
31	New York	54475
32	North Carolina	53821
33	North Dakota	70704
34	Ohio	44828
35	Oklahoma	69903
36	Oregon	98386
37	Pennsylvania	46058
38	Rhode Island	1545
39	South Carolina	32007
40	South Dakota	77121
41	Tennessee	42146
42	Texas	268601
43	Utah	84904
44	Vermont	9615
45	Virginia	42769
46	Washington	71303
47	West Virginia	24231
48	Wisconsin	65503
49	Wyoming	97818
50	District of Columbia	68
51	Puerto Rico	3515

In [106]:

pop_areas_m.isnull().any()

Out[106]:

state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool

The area (sq. mi) column has missing data. To find the affected rows, identify which state/region has no area

In [110]:

cond = pop_areas_m['area (sq. mi)'].isnull()
cond

Out[110]:

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Name: area (sq. mi), Length: 2544, dtype: bool

In [111]:

pop_areas_m['state/region'][cond]

Out[111]:

2496    USA
2497    USA
2498    USA
2499    USA
2500    USA
2501    USA
2502    USA
2503    USA
2504    USA
2505    USA
2506    USA
2507    USA
2508    USA
2509    USA
2510    USA
2511    USA
2512    USA
2513    USA
2514    USA
2515    USA
2516    USA
2517    USA
2518    USA
2519    USA
2520    USA
2521    USA
2522    USA
2523    USA
2524    USA
2525    USA
2526    USA
2527    USA
2528    USA
2529    USA
2530    USA
2531    USA
2532    USA
2533    USA
2534    USA
2535    USA
2536    USA
2537    USA
2538    USA
2539    USA
2540    USA
2541    USA
2542    USA
2543    USA
Name: state/region, dtype: object

Drop the rows that contain missing data

In [112]:

pop_areas_m.shape

Out[112]:

(2544, 6)

In [114]:

pop_areas_r = pop_areas_m.dropna()

In [115]:

pop_areas_r.shape

Out[115]:

(2476, 6)
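
dropna() with no arguments removes a row if any of its columns is missing, which is why the row count falls here. When only certain columns matter, the subset parameter narrows the check. A small sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values): row 1 lacks population, row 2 lacks area
df = pd.DataFrame({'state': ['A', 'B', 'C'],
                   'population': [1.0, np.nan, 3.0],
                   'area': [10.0, 20.0, np.nan]})

all_clean = df.dropna()                       # drop rows with any NaN
pop_clean = df.dropna(subset=['population'])  # only require population

print(list(all_clean.index))  # [0]
print(list(pop_clean.index))  # [0, 2]
```

Both calls return a new DataFrame and leave df unchanged, matching the way pop_areas_r is assigned above.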

Check whether any data is still missing

In [116]:

pop_areas_r.isnull().any()

Out[116]:

state/region     False
ages             False
year             False
population       False
state            False
area (sq. mi)    False
dtype: bool

Find the total-population data for 2010, using df.query(query string)

In [117]:

pop_areas_r.head()

Out[117]:

	state/region	ages	year	population	state	area (sq. mi)
0	AL	under18	2012	1117489.0	Alabama	52423.0
1	AL	total	2012	4817528.0	Alabama	52423.0
2	AL	under18	2010	1130966.0	Alabama	52423.0
3	AL	total	2010	4785570.0	Alabama	52423.0
4	AL	under18	2011	1125763.0	Alabama	52423.0

In [120]:

t_2010 = pop_areas_r.query("ages == 'total' and year == 2010")

In [121]:

t_2010.shape

Out[121]:

(52, 6)
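
The query string is shorthand for a boolean mask. The sketch below, on a hypothetical four-row frame, shows the two forms selecting the same rows; note that the mask form needs & and parentheses where the query string uses and.

```python
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({'ages': ['total', 'under18', 'total', 'total'],
                   'year': [2010, 2010, 2011, 2010],
                   'population': [100.0, 30.0, 110.0, 200.0]})

by_query = df.query("ages == 'total' and year == 2010")
by_mask = df[(df['ages'] == 'total') & (df['year'] == 2010)]

print(by_query.equals(by_mask))  # True
print(list(by_query.index))      # [0, 3]
```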

In [122]:

t_2010

Out[122]:

	state/region	ages	year	population	state	area (sq. mi)
3	AL	total	2010	4785570.0	Alabama	52423.0
91	AK	total	2010	713868.0	Alaska	656425.0
101	AZ	total	2010	6408790.0	Arizona	114006.0
189	AR	total	2010	2922280.0	Arkansas	53182.0
197	CA	total	2010	37333601.0	California	163707.0
283	CO	total	2010	5048196.0	Colorado	104100.0
293	CT	total	2010	3579210.0	Connecticut	5544.0
379	DE	total	2010	899711.0	Delaware	1954.0
389	DC	total	2010	605125.0	District of Columbia	68.0
475	FL	total	2010	18846054.0	Florida	65758.0
485	GA	total	2010	9713248.0	Georgia	59441.0
570	HI	total	2010	1363731.0	Hawaii	10932.0
581	ID	total	2010	1570718.0	Idaho	83574.0
666	IL	total	2010	12839695.0	Illinois	57918.0
677	IN	total	2010	6489965.0	Indiana	36420.0
762	IA	total	2010	3050314.0	Iowa	56276.0
773	KS	total	2010	2858910.0	Kansas	82282.0
858	KY	total	2010	4347698.0	Kentucky	40411.0
869	LA	total	2010	4545392.0	Louisiana	51843.0
954	ME	total	2010	1327366.0	Maine	35387.0
965	MD	total	2010	5787193.0	Maryland	12407.0
1050	MA	total	2010	6563263.0	Massachusetts	10555.0
1061	MI	total	2010	9876149.0	Michigan	96810.0
1146	MN	total	2010	5310337.0	Minnesota	86943.0
1157	MS	total	2010	2970047.0	Mississippi	48434.0
1242	MO	total	2010	5996063.0	Missouri	69709.0
1253	MT	total	2010	990527.0	Montana	147046.0
1338	NE	total	2010	1829838.0	Nebraska	77358.0
1349	NV	total	2010	2703230.0	Nevada	110567.0
1434	NH	total	2010	1316614.0	New Hampshire	9351.0
1445	NJ	total	2010	8802707.0	New Jersey	8722.0
1530	NM	total	2010	2064982.0	New Mexico	121593.0
1541	NY	total	2010	19398228.0	New York	54475.0
1626	NC	total	2010	9559533.0	North Carolina	53821.0
1637	ND	total	2010	674344.0	North Dakota	70704.0
1722	OH	total	2010	11545435.0	Ohio	44828.0
1733	OK	total	2010	3759263.0	Oklahoma	69903.0
1818	OR	total	2010	3837208.0	Oregon	98386.0
1829	PA	total	2010	12710472.0	Pennsylvania	46058.0
1914	RI	total	2010	1052669.0	Rhode Island	1545.0
1925	SC	total	2010	4636361.0	South Carolina	32007.0
2010	SD	total	2010	816211.0	South Dakota	77121.0
2021	TN	total	2010	6356683.0	Tennessee	42146.0
2106	TX	total	2010	25245178.0	Texas	268601.0
2117	UT	total	2010	2774424.0	Utah	84904.0
2202	VT	total	2010	625793.0	Vermont	9615.0
2213	VA	total	2010	8024417.0	Virginia	42769.0
2298	WA	total	2010	6742256.0	Washington	71303.0
2309	WV	total	2010	1854146.0	West Virginia	24231.0
2394	WI	total	2010	5689060.0	Wisconsin	65503.0
2405	WY	total	2010	564222.0	Wyoming	97818.0
2490	PR	total	2010	3721208.0	Puerto Rico	3515.0

Process the query result by making the state column the new row index, with set_index

In [124]:

t_2010.set_index('state',inplace=True)

In [126]:

t_2010

Out[126]:

	state/region	ages	year	population	area (sq. mi)
state
Alabama	AL	total	2010	4785570.0	52423.0
Alaska	AK	total	2010	713868.0	656425.0
Arizona	AZ	total	2010	6408790.0	114006.0
Arkansas	AR	total	2010	2922280.0	53182.0
California	CA	total	2010	37333601.0	163707.0
Colorado	CO	total	2010	5048196.0	104100.0
Connecticut	CT	total	2010	3579210.0	5544.0
Delaware	DE	total	2010	899711.0	1954.0
District of Columbia	DC	total	2010	605125.0	68.0
Florida	FL	total	2010	18846054.0	65758.0
Georgia	GA	total	2010	9713248.0	59441.0
Hawaii	HI	total	2010	1363731.0	10932.0
Idaho	ID	total	2010	1570718.0	83574.0
Illinois	IL	total	2010	12839695.0	57918.0
Indiana	IN	total	2010	6489965.0	36420.0
Iowa	IA	total	2010	3050314.0	56276.0
Kansas	KS	total	2010	2858910.0	82282.0
Kentucky	KY	total	2010	4347698.0	40411.0
Louisiana	LA	total	2010	4545392.0	51843.0
Maine	ME	total	2010	1327366.0	35387.0
Maryland	MD	total	2010	5787193.0	12407.0
Massachusetts	MA	total	2010	6563263.0	10555.0
Michigan	MI	total	2010	9876149.0	96810.0
Minnesota	MN	total	2010	5310337.0	86943.0
Mississippi	MS	total	2010	2970047.0	48434.0
Missouri	MO	total	2010	5996063.0	69709.0
Montana	MT	total	2010	990527.0	147046.0
Nebraska	NE	total	2010	1829838.0	77358.0
Nevada	NV	total	2010	2703230.0	110567.0
New Hampshire	NH	total	2010	1316614.0	9351.0
New Jersey	NJ	total	2010	8802707.0	8722.0
New Mexico	NM	total	2010	2064982.0	121593.0
New York	NY	total	2010	19398228.0	54475.0
North Carolina	NC	total	2010	9559533.0	53821.0
North Dakota	ND	total	2010	674344.0	70704.0
Ohio	OH	total	2010	11545435.0	44828.0
Oklahoma	OK	total	2010	3759263.0	69903.0
Oregon	OR	total	2010	3837208.0	98386.0
Pennsylvania	PA	total	2010	12710472.0	46058.0
Rhode Island	RI	total	2010	1052669.0	1545.0
South Carolina	SC	total	2010	4636361.0	32007.0
South Dakota	SD	total	2010	816211.0	77121.0
Tennessee	TN	total	2010	6356683.0	42146.0
Texas	TX	total	2010	25245178.0	268601.0
Utah	UT	total	2010	2774424.0	84904.0
Vermont	VT	total	2010	625793.0	9615.0
Virginia	VA	total	2010	8024417.0	42769.0
Washington	WA	total	2010	6742256.0	71303.0
West Virginia	WV	total	2010	1854146.0	24231.0
Wisconsin	WI	total	2010	5689060.0	65503.0
Wyoming	WY	total	2010	564222.0	97818.0
Puerto Rico	PR	total	2010	3721208.0	3515.0

Compute the population density. Note that this is Series / Series, and the result is again a Series.

In [127]:

pop_density = t_2010['population']/t_2010["area (sq. mi)"]
pop_density

Out[127]:

state
Alabama                   91.287603
Alaska                     1.087509
Arizona                   56.214497
Arkansas                  54.948667
California               228.051342
Colorado                  48.493718
Connecticut              645.600649
Delaware                 460.445752
District of Columbia    8898.897059
Florida                  286.597129
Georgia                  163.409902
Hawaii                   124.746707
Idaho                     18.794338
Illinois                 221.687472
Indiana                  178.197831
Iowa                      54.202751
Kansas                    34.745266
Kentucky                 107.586994
Louisiana                 87.676099
Maine                     37.509990
Maryland                 466.445797
Massachusetts            621.815538
Michigan                 102.015794
Minnesota                 61.078373
Mississippi               61.321530
Missouri                  86.015622
Montana                    6.736171
Nebraska                  23.654153
Nevada                    24.448796
New Hampshire            140.799273
New Jersey              1009.253268
New Mexico                16.982737
New York                 356.094135
North Carolina           177.617157
North Dakota               9.537565
Ohio                     257.549634
Oklahoma                  53.778278
Oregon                    39.001565
Pennsylvania             275.966651
Rhode Island             681.339159
South Carolina           144.854594
South Dakota              10.583512
Tennessee                150.825298
Texas                     93.987655
Utah                      32.677188
Vermont                   65.085075
Virginia                 187.622273
Washington                94.557817
West Virginia             76.519582
Wisconsin                 86.851900
Wyoming                    5.768079
Puerto Rico             1058.665149
dtype: float64

Sort with sort_values(), then find the five states with the highest population density

In [128]:

type(pop_density)

Out[128]:

pandas.core.series.Series

In [130]:

pop_density.sort_values(inplace=True)

Find the five states with the lowest population density

In [131]:

pop_density[:5]

Out[131]:

state
Alaska           1.087509
Wyoming          5.768079
Montana          6.736171
North Dakota     9.537565
South Dakota    10.583512
dtype: float64

In [132]:

pop_density.tail()

Out[132]:

state
Connecticut              645.600649
Rhode Island             681.339159
New Jersey              1009.253268
Puerto Rico             1058.665149
District of Columbia    8898.897059
dtype: float64
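
Sorting in place and slicing works, but Series also offers nlargest and nsmallest, which return the extremes directly without modifying the original. A sketch on four densities rounded from the output above (the selection of states is only for illustration):

```python
import pandas as pd

# Four densities rounded from the results above
density = pd.Series({'Alaska': 1.09, 'Wyoming': 5.77,
                     'New Jersey': 1009.25, 'Texas': 93.99})

ordered = density.sort_values(ascending=False)  # highest first, returns a copy
top2 = density.nlargest(2)
bottom2 = density.nsmallest(2)

print(list(top2.index))     # ['New Jersey', 'Texas']
print(list(bottom2.index))  # ['Alaska', 'Wyoming']
```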

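Key points:

Index consistently with .loc[]
Use .isnull().any() to find the columns that contain NaN
Use .unique() to determine which keys in a column are the ones we need
Prefer outer or left merges in general, for one reason: it is better to have NaN in a column than to discard the information in the other columns

Review: how Series/DataFrame arithmetic differs from ndarray arithmetic

Series and DataFrame operations align operands by index label instead of broadcasting by position; where a label has no counterpart the result is NaN, unless a method such as add is given fill_value to stand in for the missing value
ndarray operations broadcast, computing by repeating existing values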
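
The difference can be seen side by side in a short sketch; the Series and arrays below are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

plain = s1 + s2                    # 'a' has no partner in s2, so it becomes NaN
filled = s1.add(s2, fill_value=0)  # the missing partner is treated as 0

# ndarray: the 1-D array is repeated for each row of the 2-D array
arr = np.array([[1, 2, 3], [4, 5, 6]]) + np.array([10, 20, 30])

print(filled.tolist())  # [1.0, 12.0, 23.0]
print(arr.tolist())     # [[11, 22, 33], [14, 25, 36]]
```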
