python學習——pandas的拼接操作


pandas的拼接操作

 

pandas的拼接分為兩種:

  • 級聯:pd.concat, pd.append
  • 合並:pd.merge, pd.join
 

0. 回顧numpy的級聯

 

============================================

練習12:

  1. 生成2個3*3的矩陣,對其分別進行兩個維度上的級聯

============================================

In [19]:
import numpy as np import pandas as pd from pandas import Series,DataFrame 
In [20]:
nd = np.random.randint(0,10,size=(3,3)) nd 
Out[20]:
array([[6, 3, 4],
       [3, 9, 8],
       [8, 7, 8]])
In [24]:
np.concatenate((nd,nd),axis=0)#0代表行間操作 
Out[24]:
array([[6, 3, 4],
       [3, 9, 8],
       [8, 7, 8],
       [6, 3, 4],
       [3, 9, 8],
       [8, 7, 8]])
In [25]:
np.concatenate([nd,nd],axis=1)#1代表列間操作,()huo[]效果一樣 
Out[25]:
array([[6, 3, 4, 6, 3, 4],
       [3, 9, 8, 3, 9, 8],
       [8, 7, 8, 8, 7, 8]])
 

為方便講解,我們首先定義一個生成DataFrame的函數:

In [26]:
def make_df(inds,cols): #字典的key作為列名進行展示 data = {key:[key+str(i) for i in inds]for key in cols} return DataFrame(data,index=inds,columns=cols) 
In [28]:
make_df([1,2],list('AB')) 
Out[28]:
  A B
1 A1 B1
2 A2 B2
 

1. 使用pd.concat()級聯

 

pandas使用pd.concat函數,與np.concatenate函數類似,只是多了一些參數:

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
 

1) 簡單級聯

 

和np.concatenate一樣,優先增加行數(默認axis=0)

In [29]:
df1 = make_df([0,1],list('AB')) df2 = make_df([2,3],list('AB')) 
In [30]:
display(df1,df2) 
 
  A B
0 A0 B0
1 A1 B1
 
  A B
2 A2 B2
3 A3 B3
 

可以通過設置axis來改變級聯方向

In [31]:
pd.concat([df1,df2]) 
Out[31]:
  A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
In [32]:
pd.concat((df1,df2),axis = 1) 
Out[32]:
  A B A B
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
2 NaN NaN A2 B2
3 NaN NaN A3 B3
 

注意index在級聯時可以重復

 

也可以選擇忽略ignore_index,重新索引

In [34]:
pd.concat((df1,df2),axis=1,ignore_index=True) 
Out[34]:
  0 1 2 3
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
2 NaN NaN A2 B2
3 NaN NaN A3 B3
 

或者使用多層索引 keys

concat([x,y],keys=['x','y'])

In [13]:
pd.concat([df1,df2],keys=['x','y']) 
Out[13]:
    A B
x 0 A1 B1
1 A2 B2
y 0 A3 B3
1 A4 B4
In [ ]:
#pd 模塊 import pandas as pd

#df1,df2 具體的實例 #級聯的方法,屬於上一級,DataFrame來自pandas 
 

============================================

練習13:

  1. 想一想級聯的應用場景?

  2. 使用昨天的知識,建立一個期中考試張三、李四的成績表ddd

  3. 假設新增考試學科"計算機",如何實現?

  4. 新增王老五同學的成績,如何實現?

============================================

In [ ]:
 
 

2) 不匹配級聯

 

不匹配指的是級聯的維度的索引不一致。例如縱向級聯時列索引不一致,橫向級聯時行索引不一致

In [38]:
df1 = make_df([1,2],list('AB')) df2 = make_df([2,4],list('BC')) display(df1,df2) 
 
  A B
1 A1 B1
2 A2 B2
 
  B C
2 B2 C2
4 B4 C4
 

有3種連接方式:

  • 外連接:補NaN(默認模式)
In [39]:
pd.concat([df1,df2]) 
 
C:\Users\BLX\AppData\Roaming\Python\Python37\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.
Out[39]:
  A B C
1 A1 B1 NaN
2 A2 B2 NaN
2 NaN B2 C2
4 NaN B4 C4
 
  • 內連接:只連接匹配的項
In [41]:
#合並顯示共有數據
pd.concat((df1,df2),join = 'inner',axis = 1) 
Out[41]:
  A B B C
2 A2 B2 B2 C2
 
  • 連接指定軸 join_axes
In [42]:
df2.columns 
Out[42]:
Index(['B', 'C'], dtype='object')
In [43]:
#join_axex以某個DataFrame的列索引為新的列索引值
pd.concat([df1,df2],join_axes=[df2.columns]) 
Out[43]:
  B C
1 B1 NaN
2 B2 NaN
2 B2 C2
4 B4 C4
 

============================================

練習14:

假設【期末】考試ddd2的成績沒有張三的,只有李四、王老五、趙小六的,使用多種方法級聯

============================================

 

3) 使用append()函數添加

 

由於在后面級聯的使用非常普遍,因此有一個函數append專門用於在后面添加

In [44]:
display(df1,df2) 
 
  A B
1 A1 B1
2 A2 B2
 
  B C
2 B2 C2
4 B4 C4
In [49]:
#append函數屬於DataFrame,concat這函數屬於pandas模塊
#pd.concat((df1,df2)) df1.append(df2) 
Out[49]:
  A B C
1 A1 B1 NaN
2 A2 B2 NaN
2 NaN B2 C2
4 NaN B4 C4
 

============================================

練習15:

新建一個只有張三李四王老五的期末考試成績單ddd3,使用append()與期中考試成績表ddd級聯

============================================

 

2. 使用pd.merge()合並

 

merge與concat的區別在於,merge需要依據某一共同的行或列來進行合並

使用pd.merge()合並時,會自動根據兩者相同column名稱的那一列,作為key來進行合並。

注意每一列元素的順序不要求一致

 

1) 一對一合並

In [54]:
#merge根據相同的元素進行合並的
df1 = DataFrame({'employee':['Po','Sara','Danis'], 'group':['sail','couting','marketing']}) df2 = DataFrame({'employee':['Po','Sara','Bush'], 'work_time':[2,3,1]}) display(df1,df2) 
 
  employee group
0 Po sail
1 Sara couting
2 Danis marketing
 
  employee work_time
0 Po 2
1 Sara 3
2 Bush 1
In [55]:
pd.merge(df1,df2) 
Out[55]:
  employee group work_time
0 Po sail 2
1 Sara couting 3
In [56]:
df1.merge(df2) 
Out[56]:
  employee group work_time
0 Po sail 2
1 Sara couting 3
 

2) 多對一合並

In [57]:
df1 = DataFrame({'employee':['Po','Sara','Danis'], 'group':['sail','couting','marketing']}) df2 = DataFrame({'employee':['Po','Po','Bush'], 'work_time':[2,3,1]}) display(df1,df2) 
 
  employee group
0 Po sail
1 Sara couting
2 Danis marketing
 
  employee work_time
0 Po 2
1 Po 3
2 Bush 1
In [58]:
pd.merge(df1,df2) 
Out[58]:
  employee group work_time
0 Po sail 2
1 Po sail 3
 

3) 多對多合並

In [61]:
df1 = DataFrame({'employee':['Po','Po','Danis'], 'group':['sail','couting','marketing']}) df2 = DataFrame({'employee':['Po','Po','Bush'], 'work_time':[2,3,1]}) display(df1,df2) 
 
  employee group
0 Po sail
1 Po couting
2 Danis marketing
 
  employee work_time
0 Po 2
1 Po 3
2 Bush 1
In [62]:
pd.merge(df1,df2) 
Out[62]:
  employee group work_time
0 Po sail 2
1 Po sail 3
2 Po couting 2
3 Po couting 3
 

4) key的規范化

 
  • 使用on=顯式指定哪一列為key,當有多個key相同時使用
In [66]:
df3 = DataFrame({'employee':['Po','Summer','Flower'], 'group':['sail','marketing','serch'], 'salary':[12000,10000,8000]}) df4 = DataFrame({'employee':['Po','Winter','Flower'], 'group':['marketing','marketing','serch'], 'work_time':[2,1,5]}) display(df3,df4) 
 
  employee group salary
0 Po sail 12000
1 Summer marketing 10000
2 Flower serch 8000
 
  employee group work_time
0 Po marketing 2
1 Winter marketing 1
2 Flower serch 5
In [67]:
pd.merge(df3,df4) 
Out[67]:
  employee group salary work_time
0 Flower serch 8000 5
In [70]:
pd.merge(df3,df4,on='employee') 
Out[70]:
  employee group_x salary group_y work_time
0 Po sail 12000 marketing 2
1 Flower serch 8000 serch 5
In [73]:
pd.merge(df3,df4,on='group',suffixes=['_A','_B']) 
Out[73]:
  employee_A group salary employee_B work_time
0 Summer marketing 10000 Po 2
1 Summer marketing 10000 Winter 1
2 Flower serch 8000 Flower 5
 
  • 使用left_on和right_on指定左右兩邊的列作為key,當左右兩邊的key都不想等時使用
  • 參數1為左,參數2為右
In [79]:
df3 = DataFrame({'employer':['Po','Summer','Flower'], 'Team':['sail','marketing','serch'], 'salary':[12000,10000,8000]}) df4 = DataFrame({'employee':['Po','Winter','Flower'], 'group':['marketing','marketing','serch'], 'work_time':[2,1,5]}) display(df3,df4) 
 
  employer Team salary
0 Po sail 12000
1 Summer marketing 10000
2 Flower serch 8000
 
  employee group work_time
0 Po marketing 2
1 Winter marketing 1
2 Flower serch 5
In [81]:
pd.merge(df3,df4,left_on='employer',right_on='employee') 
Out[81]:
  employer Team salary employee group work_time
0 Po sail 12000 Po marketing 2
1 Flower serch 8000 Flower serch 5
In [82]:
pd.merge(df3,df4,left_on='Team',right_on='group') 
Out[82]:
  employer Team salary employee group work_time
0 Summer marketing 10000 Po marketing 2
1 Summer marketing 10000 Winter marketing 1
2 Flower serch 8000 Flower serch 5
 

============================================

練習16:

  1. 假設有兩份成績單,除了ddd是張三李四王老五之外,還有ddd4是張三和趙小六的成績單,如何合並?

  2. 如果ddd4中張三的名字被打錯了,成為了張十三,怎么辦?

  3. 自行練習多對一,多對多的情況

  4. 自學left_index,right_index

============================================

 

5) 內合並與外合並

 
  • 內合並:只保留兩者都有的key(默認模式)
In [85]:
df1 = DataFrame({'age':[18,22,33],'height':[175,169,180]}) df2 = DataFrame({'age':[18,23,31],'weight':[65,70,80]}) 
In [86]:
pd.merge(df1,df2) 
Out[86]:
  age height weight
0 18 175 65
In [87]:
df1.merge(df2,how='inner') 
Out[87]:
  age height weight
0 18 175 65
 
  • 外合並 how='outer':補NaN
In [88]:
df1.merge(df2,how = 'outer') 
Out[88]:
  age height weight
0 18 175.0 65.0
1 22 169.0 NaN
2 33 180.0 NaN
3 23 NaN 70.0
4 31 NaN 80.0
 
  • 左合並、右合並:how='left',how='right',
In [89]:
df1.merge(df2,how = 'left')#保留左側 
Out[89]:
  age height weight
0 18 175 65.0
1 22 169 NaN
2 33 180 NaN
In [90]:
pd.merge(df1,df2,how='right')#保留右側 
Out[90]:
  age height weight
0 18 175.0 65
1 23 NaN 70
2 31 NaN 80
 

============================================

練習17:

  1. 如果只有張三趙小六語數英三個科目的成績,如何合並?

  2. 考慮應用情景,使用多種方式合並ddd與ddd4

============================================

 

6) 列沖突的解決

 

當列沖突時,即有多個列名稱相同時,需要使用on=來指定哪一個列作為key,配合suffixes指定沖突列名

 

可以使用suffixes=自己指定后綴

In [91]:
display(df3,df4) 
 
  employer Team salary
0 Po sail 12000
1 Summer marketing 10000
2 Flower serch 8000
 
  employee group work_time
0 Po marketing 2
1 Winter marketing 1
2 Flower serch 5
In [93]:
df3.columns = ['employee','group','salary'] display(df3) 
 
  employee group salary
0 Po sail 12000
1 Summer marketing 10000
2 Flower serch 8000
In [94]:
pd.merge(df3,df4,on='employee',suffixes=['_李','_王']) 
Out[94]:
  employee group_李 salary group_王 work_time
0 Po sail 12000 marketing 2
1 Flower serch 8000 serch 5
 

============================================

練習18:

假設有兩個同學都叫李四,ddd5、ddd6都是張三和李四的成績表,如何合並?

============================================

 

作業

3. 案例分析:美國各州人口數據分析

 

首先導入文件,並查看數據樣本

In [62]:
import numpy as np import pandas as pd from pandas import Series,DataFrame 
In [63]:
#使用pandas讀取數據
pop = pd.read_csv('../../data/state-population.csv') areas = pd.read_csv('../../data/state-areas.csv') abb = pd.read_csv('../../data/state-abbrevs.csv') 
In [64]:
pop.shape 
Out[64]:
(2544, 4)
In [65]:
pop.head() 
Out[65]:
  state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
In [70]:
areas.shape 
Out[70]:
(52, 2)
In [69]:
abb.shape 
Out[69]:
(51, 2)
 

合並pop與abbrevs兩個DataFrame,分別依據state/region列和abbreviation列來合並。

為了保留所有信息,使用外合並。

In [71]:
pop.head() 
Out[71]:
  state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
In [72]:
abb.head() 
Out[72]:
  state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
In [73]:
display(pop.shape,abb.shape) 
 
(2544, 4)
 
(51, 2)
In [78]:
#此時的場景 left == outer left數據大於abb
#left效果比outer差一些 #abb 河北 pop_m = pop.merge(abb,left_on='state/region',right_on='abbreviation',how = 'outer') pop_m.shape 
Out[78]:
(2544, 6)
 

去除abbreviation的那一列(axis=1)

In [79]:
pop_m.head() 
Out[79]:
  state/region ages year population state abbreviation
0 AL under18 2012 1117489.0 Alabama AL
1 AL total 2012 4817528.0 Alabama AL
2 AL under18 2010 1130966.0 Alabama AL
3 AL total 2010 4785570.0 Alabama AL
4 AL under18 2011 1125763.0 Alabama AL
In [83]:
pop_m.drop('abbreviation',axis = 1,inplace=True) 
 
---------------------------------------------------------------------------
ValueError Traceback (most recent call last) <ipython-input-83-15dcfc478d0b> in <module>() ----> 1 pop_m.drop('abbreviation',axis = 1,inplace=True) /usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)  2159 new_axis = axis.drop(labels, level=level, errors=errors)  2160 else: -> 2161 new_axis = axis.drop(labels, errors=errors)  2162 dropped = self.reindex(**{axis_name: new_axis})  2163 try: /usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors)  3622 if errors != 'ignore':  3623 raise ValueError('labels %s not contained in axis' % -> 3624 labels[mask])  3625 indexer = indexer[~mask]  3626 return self.delete(indexer) ValueError: labels ['abbreviation'] not contained in axis
In [82]:
pop_m.head() 
Out[82]:
  state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
 

查看存在缺失數據的列。

使用.isnull().any(),只有某一列存在一個缺失數據,就會顯示True。

In [88]:
pop_m.isnull().any() 
Out[88]:
state/region    False
ages            False
year            False
population       True
state            True
dtype: bool
In [ ]:
#population 和 state這兩列有數據缺失的情況
 

查看缺失數據

In [92]:
#為空的行索引
pop_m.loc[pop_m.isnull().any(axis = 1)] 
Out[92]:
  state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
2453 PR under18 1993 NaN NaN
2454 PR under18 1992 NaN NaN
2455 PR total 1992 NaN NaN
2456 PR under18 1994 NaN NaN
2457 PR total 1994 NaN NaN
2458 PR total 1995 NaN NaN
2459 PR under18 1995 NaN NaN
2460 PR under18 1996 NaN NaN
2461 PR total 1996 NaN NaN
2462 PR under18 1998 NaN NaN
2463 PR total 1998 NaN NaN
2464 PR total 1997 NaN NaN
2465 PR under18 1997 NaN NaN
2466 PR total 1999 NaN NaN
2467 PR under18 1999 NaN NaN
2468 PR total 2000 3810605.0 NaN
2469 PR under18 2000 1089063.0 NaN
2470 PR total 2001 3818774.0 NaN
2471 PR under18 2001 1077566.0 NaN
2472 PR total 2002 3823701.0 NaN
2473 PR under18 2002 1065051.0 NaN
2474 PR total 2004 3826878.0 NaN
2475 PR under18 2004 1035919.0 NaN
2476 PR total 2003 3826095.0 NaN
2477 PR under18 2003 1050615.0 NaN
... ... ... ... ... ...
2514 USA under18 1999 71946051.0 NaN
2515 USA total 2000 282162411.0 NaN
2516 USA under18 2000 72376189.0 NaN
2517 USA total 1999 279040181.0 NaN
2518 USA total 2001 284968955.0 NaN
2519 USA under18 2001 72671175.0 NaN
2520 USA total 2002 287625193.0 NaN
2521 USA under18 2002 72936457.0 NaN
2522 USA total 2003 290107933.0 NaN
2523 USA under18 2003 73100758.0 NaN
2524 USA total 2004 292805298.0 NaN
2525 USA under18 2004 73297735.0 NaN
2526 USA total 2005 295516599.0 NaN
2527 USA under18 2005 73523669.0 NaN
2528 USA total 2006 298379912.0 NaN
2529 USA under18 2006 73757714.0 NaN
2530 USA total 2007 301231207.0 NaN
2531 USA under18 2007 74019405.0 NaN
2532 USA total 2008 304093966.0 NaN
2533 USA under18 2008 74104602.0 NaN
2534 USA under18 2013 73585872.0 NaN
2535 USA total 2013 316128839.0 NaN
2536 USA total 2009 306771529.0 NaN
2537 USA under18 2009 74134167.0 NaN
2538 USA under18 2010 74119556.0 NaN
2539 USA total 2010 309326295.0 NaN
2540 USA under18 2011 73902222.0 NaN
2541 USA total 2011 311582564.0 NaN
2542 USA under18 2012 73708179.0 NaN
2543 USA total 2012 313873685.0 NaN

96 rows × 5 columns

 

根據數據是否缺失情況顯示數據,如果缺失為True,那么顯示

 

找到有哪些state/region使得state的值為NaN,使用unique()查看非重復值

In [94]:
condition = pop_m['state'].isnull() pop_m['state/region'][condition].unique() 
Out[94]:
array(['PR', 'USA'], dtype=object)
In [95]:
areas
Out[95]:
  state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
5 Colorado 104100
6 Connecticut 5544
7 Delaware 1954
8 Florida 65758
9 Georgia 59441
10 Hawaii 10932
11 Idaho 83574
12 Illinois 57918
13 Indiana 36420
14 Iowa 56276
15 Kansas 82282
16 Kentucky 40411
17 Louisiana 51843
18 Maine 35387
19 Maryland 12407
20 Massachusetts 10555
21 Michigan 96810
22 Minnesota 86943
23 Mississippi 48434
24 Missouri 69709
25 Montana 147046
26 Nebraska 77358
27 Nevada 110567
28 New Hampshire 9351
29 New Jersey 8722
30 New Mexico 121593
31 New York 54475
32 North Carolina 53821
33 North Dakota 70704
34 Ohio 44828
35 Oklahoma 69903
36 Oregon 98386
37 Pennsylvania 46058
38 Rhode Island 1545
39 South Carolina 32007
40 South Dakota 77121
41 Tennessee 42146
42 Texas 268601
43 Utah 84904
44 Vermont 9615
45 Virginia 42769
46 Washington 71303
47 West Virginia 24231
48 Wisconsin 65503
49 Wyoming 97818
50 District of Columbia 68
51 Puerto Rico 3515
In [ ]:
只有兩個州對應的州名為空 
 

為找到的這些state/region的state項補上正確的值,從而去除掉state這一列的所有NaN!

記住這樣清除缺失數據NaN的方法!

In [96]:
#Puerto Rico 

conditon = pop_m['state/region'] == 'PR' condition 
Out[96]:
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Name: state, Length: 2544, dtype: bool
In [97]:
pop_m['state'][condition] = 'Puerto Rico' 
 
/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [99]:
condition = pop_m['state/region'] == 'USA' pop_m['state'][condition] = 'United State' 
 
/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [100]:
#剛才的填補操作,起作用了
pop_m.isnull().any() 
Out[100]:
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool
 

合並各州面積數據areas,使用左合並。

思考一下為什么使用外合並?

In [102]:
pop.head() #人口的DataFrame和abb合並,有了州名全程 #可以和areas DataFrame進行合並 
Out[102]:
  state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
In [103]:
pop_areas_m = pop_m.merge(areas,how = 'outer') 
 

繼續尋找存在缺失數據的列

In [105]:
pop_areas_m.shape 
Out[105]:
(2544, 6)
In [109]:
areas
Out[109]:
  state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
5 Colorado 104100
6 Connecticut 5544
7 Delaware 1954
8 Florida 65758
9 Georgia 59441
10 Hawaii 10932
11 Idaho 83574
12 Illinois 57918
13 Indiana 36420
14 Iowa 56276
15 Kansas 82282
16 Kentucky 40411
17 Louisiana 51843
18 Maine 35387
19 Maryland 12407
20 Massachusetts 10555
21 Michigan 96810
22 Minnesota 86943
23 Mississippi 48434
24 Missouri 69709
25 Montana 147046
26 Nebraska 77358
27 Nevada 110567
28 New Hampshire 9351
29 New Jersey 8722
30 New Mexico 121593
31 New York 54475
32 North Carolina 53821
33 North Dakota 70704
34 Ohio 44828
35 Oklahoma 69903
36 Oregon 98386
37 Pennsylvania 46058
38 Rhode Island 1545
39 South Carolina 32007
40 South Dakota 77121
41 Tennessee 42146
42 Texas 268601
43 Utah 84904
44 Vermont 9615
45 Virginia 42769
46 Washington 71303
47 West Virginia 24231
48 Wisconsin 65503
49 Wyoming 97818
50 District of Columbia 68
51 Puerto Rico 3515
In [106]:
pop_areas_m.isnull().any() 
Out[106]:
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool
 

我們會發現area(sq.mi)這一列有缺失數據,為了找出是哪一行,我們需要找出是哪個state沒有數據

In [110]:
cond = pop_areas_m['area (sq. mi)'].isnull() cond 
Out[110]:
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Name: area (sq. mi), Length: 2544, dtype: bool
In [111]:
pop_areas_m['state/region'][cond] 
Out[111]:
2496    USA
2497    USA
2498    USA
2499    USA
2500    USA
2501    USA
2502    USA
2503    USA
2504    USA
2505    USA
2506    USA
2507    USA
2508    USA
2509    USA
2510    USA
2511    USA
2512    USA
2513    USA
2514    USA
2515    USA
2516    USA
2517    USA
2518    USA
2519    USA
2520    USA
2521    USA
2522    USA
2523    USA
2524    USA
2525    USA
2526    USA
2527    USA
2528    USA
2529    USA
2530    USA
2531    USA
2532    USA
2533    USA
2534    USA
2535    USA
2536    USA
2537    USA
2538    USA
2539    USA
2540    USA
2541    USA
2542    USA
2543    USA
Name: state/region, dtype: object
 

去除含有缺失數據的行

In [112]:
pop_areas_m.shape 
Out[112]:
(2544, 6)
In [114]:
pop_areas_r = pop_areas_m.dropna() 
In [115]:
pop_areas_r.shape 
Out[115]:
(2476, 6)
 

查看數據是否缺失

In [116]:
pop_areas_r.isnull().any() 
Out[116]:
state/region     False
ages             False
year             False
population       False
state            False
area (sq. mi)    False
dtype: bool
 

找出2010年的全民人口數據,df.query(查詢語句)

In [117]:
pop_areas_r.head() 
Out[117]:
  state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
In [120]:
t_2010 = pop_areas_r.query("ages == 'total' and year == 2010") 
In [121]:
t_2010.shape 
Out[121]:
(52, 6)
In [122]:
t_2010
Out[122]:
  state/region ages year population state area (sq. mi)
3 AL total 2010 4785570.0 Alabama 52423.0
91 AK total 2010 713868.0 Alaska 656425.0
101 AZ total 2010 6408790.0 Arizona 114006.0
189 AR total 2010 2922280.0 Arkansas 53182.0
197 CA total 2010 37333601.0 California 163707.0
283 CO total 2010 5048196.0 Colorado 104100.0
293 CT total 2010 3579210.0 Connecticut 5544.0
379 DE total 2010 899711.0 Delaware 1954.0
389 DC total 2010 605125.0 District of Columbia 68.0
475 FL total 2010 18846054.0 Florida 65758.0
485 GA total 2010 9713248.0 Georgia 59441.0
570 HI total 2010 1363731.0 Hawaii 10932.0
581 ID total 2010 1570718.0 Idaho 83574.0
666 IL total 2010 12839695.0 Illinois 57918.0
677 IN total 2010 6489965.0 Indiana 36420.0
762 IA total 2010 3050314.0 Iowa 56276.0
773 KS total 2010 2858910.0 Kansas 82282.0
858 KY total 2010 4347698.0 Kentucky 40411.0
869 LA total 2010 4545392.0 Louisiana 51843.0
954 ME total 2010 1327366.0 Maine 35387.0
965 MD total 2010 5787193.0 Maryland 12407.0
1050 MA total 2010 6563263.0 Massachusetts 10555.0
1061 MI total 2010 9876149.0 Michigan 96810.0
1146 MN total 2010 5310337.0 Minnesota 86943.0
1157 MS total 2010 2970047.0 Mississippi 48434.0
1242 MO total 2010 5996063.0 Missouri 69709.0
1253 MT total 2010 990527.0 Montana 147046.0
1338 NE total 2010 1829838.0 Nebraska 77358.0
1349 NV total 2010 2703230.0 Nevada 110567.0
1434 NH total 2010 1316614.0 New Hampshire 9351.0
1445 NJ total 2010 8802707.0 New Jersey 8722.0
1530 NM total 2010 2064982.0 New Mexico 121593.0
1541 NY total 2010 19398228.0 New York 54475.0
1626 NC total 2010 9559533.0 North Carolina 53821.0
1637 ND total 2010 674344.0 North Dakota 70704.0
1722 OH total 2010 11545435.0 Ohio 44828.0
1733 OK total 2010 3759263.0 Oklahoma 69903.0
1818 OR total 2010 3837208.0 Oregon 98386.0
1829 PA total 2010 12710472.0 Pennsylvania 46058.0
1914 RI total 2010 1052669.0 Rhode Island 1545.0
1925 SC total 2010 4636361.0 South Carolina 32007.0
2010 SD total 2010 816211.0 South Dakota 77121.0
2021 TN total 2010 6356683.0 Tennessee 42146.0
2106 TX total 2010 25245178.0 Texas 268601.0
2117 UT total 2010 2774424.0 Utah 84904.0
2202 VT total 2010 625793.0 Vermont 9615.0
2213 VA total 2010 8024417.0 Virginia 42769.0
2298 WA total 2010 6742256.0 Washington 71303.0
2309 WV total 2010 1854146.0 West Virginia 24231.0
2394 WI total 2010 5689060.0 Wisconsin 65503.0
2405 WY total 2010 564222.0 Wyoming 97818.0
2490 PR total 2010 3721208.0 Puerto Rico 3515.0
 

對查詢結果進行處理,以state列作為新的行索引:set_index

In [124]:
t_2010.set_index('state',inplace=True) 
In [126]:
t_2010
Out[126]:
  state/region ages year population area (sq. mi)
state          
Alabama AL total 2010 4785570.0 52423.0
Alaska AK total 2010 713868.0 656425.0
Arizona AZ total 2010 6408790.0 114006.0
Arkansas AR total 2010 2922280.0 53182.0
California CA total 2010 37333601.0 163707.0
Colorado CO total 2010 5048196.0 104100.0
Connecticut CT total 2010 3579210.0 5544.0
Delaware DE total 2010 899711.0 1954.0
District of Columbia DC total 2010 605125.0 68.0
Florida FL total 2010 18846054.0 65758.0
Georgia GA total 2010 9713248.0 59441.0
Hawaii HI total 2010 1363731.0 10932.0
Idaho ID total 2010 1570718.0 83574.0
Illinois IL total 2010 12839695.0 57918.0
Indiana IN total 2010 6489965.0 36420.0
Iowa IA total 2010 3050314.0 56276.0
Kansas KS total 2010 2858910.0 82282.0
Kentucky KY total 2010 4347698.0 40411.0
Louisiana LA total 2010 4545392.0 51843.0
Maine ME total 2010 1327366.0 35387.0
Maryland MD total 2010 5787193.0 12407.0
Massachusetts MA total 2010 6563263.0 10555.0
Michigan MI total 2010 9876149.0 96810.0
Minnesota MN total 2010 5310337.0 86943.0
Mississippi MS total 2010 2970047.0 48434.0
Missouri MO total 2010 5996063.0 69709.0
Montana MT total 2010 990527.0 147046.0
Nebraska NE total 2010 1829838.0 77358.0
Nevada NV total 2010 2703230.0 110567.0
New Hampshire NH total 2010 1316614.0 9351.0
New Jersey NJ total 2010 8802707.0 8722.0
New Mexico NM total 2010 2064982.0 121593.0
New York NY total 2010 19398228.0 54475.0
North Carolina NC total 2010 9559533.0 53821.0
North Dakota ND total 2010 674344.0 70704.0
Ohio OH total 2010 11545435.0 44828.0
Oklahoma OK total 2010 3759263.0 69903.0
Oregon OR total 2010 3837208.0 98386.0
Pennsylvania PA total 2010 12710472.0 46058.0
Rhode Island RI total 2010 1052669.0 1545.0
South Carolina SC total 2010 4636361.0 32007.0
South Dakota SD total 2010 816211.0 77121.0
Tennessee TN total 2010 6356683.0 42146.0
Texas TX total 2010 25245178.0 268601.0
Utah UT total 2010 2774424.0 84904.0
Vermont VT total 2010 625793.0 9615.0
Virginia VA total 2010 8024417.0 42769.0
Washington WA total 2010 6742256.0 71303.0
West Virginia WV total 2010 1854146.0 24231.0
Wisconsin WI total 2010 5689060.0 65503.0
Wyoming WY total 2010 564222.0 97818.0
Puerto Rico PR total 2010 3721208.0 3515.0
 

計算人口密度。注意是Series/Series,其結果還是一個Series。

In [127]:
pop_density = t_2010['population']/t_2010["area (sq. mi)"] pop_density 
Out[127]:
state
Alabama                   91.287603
Alaska                     1.087509
Arizona                   56.214497
Arkansas                  54.948667
California               228.051342
Colorado                  48.493718
Connecticut              645.600649
Delaware                 460.445752
District of Columbia    8898.897059
Florida                  286.597129
Georgia                  163.409902
Hawaii                   124.746707
Idaho                     18.794338
Illinois                 221.687472
Indiana                  178.197831
Iowa                      54.202751
Kansas                    34.745266
Kentucky                 107.586994
Louisiana                 87.676099
Maine                     37.509990
Maryland                 466.445797
Massachusetts            621.815538
Michigan                 102.015794
Minnesota                 61.078373
Mississippi               61.321530
Missouri                  86.015622
Montana                    6.736171
Nebraska                  23.654153
Nevada                    24.448796
New Hampshire            140.799273
New Jersey              1009.253268
New Mexico                16.982737
New York                 356.094135
North Carolina           177.617157
North Dakota               9.537565
Ohio                     257.549634
Oklahoma                  53.778278
Oregon                    39.001565
Pennsylvania             275.966651
Rhode Island             681.339159
South Carolina           144.854594
South Dakota              10.583512
Tennessee                150.825298
Texas                     93.987655
Utah                      32.677188
Vermont                   65.085075
Virginia                 187.622273
Washington                94.557817
West Virginia             76.519582
Wisconsin                 86.851900
Wyoming                    5.768079
Puerto Rico             1058.665149
dtype: float64
 

排序,並找出人口密度最高的五個州sort_values()

In [128]:
type(pop_density) 
Out[128]:
pandas.core.series.Series
In [130]:
pop_density.sort_values(inplace=True) 
 

找出人口密度最低的五個州

In [131]:
pop_density[:5] 
Out[131]:
state
Alaska           1.087509
Wyoming          5.768079
Montana          6.736171
North Dakota     9.537565
South Dakota    10.583512
dtype: float64
In [132]:
pop_density.tail() 
Out[132]:
state
Connecticut              645.600649
Rhode Island             681.339159
New Jersey              1009.253268
Puerto Rico             1058.665149
District of Columbia    8898.897059
dtype: float64
 

要點總結:

  • 統一用loc()索引
  • 善於使用.isnull().any()找到存在NaN的列
  • 善於使用.unique()確定該列中哪些key是我們需要的
  • 一般使用外合並、左合並,目的只有一個:寧願該列是NaN也不要丟棄其他列的信息
 

回顧:Series/DataFrame運算與ndarray運算的區別

 
  • Series與DataFrame沒有廣播,如果對應index沒有值,則記為NaN;或者使用add的fill_value來補缺失值
  • ndarray有廣播,通過重復已有值來計算


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM