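1. Pandas Overview
- Pandas is a Python data analysis package, created to handle data analysis tasks.
- Pandas brings together a large set of library routines and standard data models, and provides the tools needed to work with datasets efficiently.
- Pandas offers many functions and methods for processing data quickly and conveniently.
- Pandas is built on NumPy and adds dictionary-like access by label, which makes NumPy-centric applications simpler.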
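To make the last point concrete, here is a minimal sketch of how a Series sits on top of a NumPy array while also allowing dictionary-style lookup. The variable names are illustrative only and do not come from the original text.

```python
import numpy as np
import pandas as pd

# A Series wraps a NumPy array and adds an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

values = s.to_numpy()        # the underlying data as a NumPy array
by_label = s["b"]            # dictionary-style lookup by label
doubled = np.multiply(s, 2)  # NumPy functions accept a Series and keep its labels

print(type(values).__name__, by_label)
print(doubled)
```

The labels survive the NumPy call, which is the main convenience pandas adds over a bare array.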
2. Installing Pandas
3. Importing Pandas
4. Pandas Data Structures
4.1 Series
import numpy as np
import pandas as pd
s=pd.Series([1,2,3,np.nan,5,6])
print(s)
----------Running the program above returns----------
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
dtype: float64
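4.2 DataFrame
A DataFrame is a tabular data structure that contains an ordered set of columns, and each column can hold a different value type. A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series.
import numpy as np
import pandas as pd
dates=pd.date_range('2019-08-01',periods=6)
# Name the result df; assigning it to pd would shadow the pandas module
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['A','B','C','D'])
print('A table with 6 rows and 4 columns:')
print(df)
print('\n')
print('Column B (the second column):')
print(df['B'])
print('\n')
----------Running the program above returns----------
A table with 6 rows and 4 columns: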
A B C D
2019-08-01 0.796050 -0.383286 -1.465294 -0.272321
2019-08-02 -1.431981 -0.875381 1.371449 0.321703
2019-08-03 -1.497636 1.258925 -1.374210 -0.765626
2019-08-04 2.518305 0.125094 2.647512 -0.024748
2019-08-05 -0.319238 0.395384 -0.582052 -0.396132
2019-08-06 -0.519434 1.873216 1.685524 -1.493000
Column B (the second column):
2019-08-01 -0.383286
2019-08-02 -0.875381
2019-08-03 1.258925
2019-08-04 0.125094
2019-08-05 0.395384
2019-08-06 1.873216
Freq: D, Name: B, dtype: float64
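The description of a DataFrame as a dictionary of Series can be sketched directly. The example below is an illustration under assumed names (`price`, `qty`, `table`), not code from the original article.

```python
import pandas as pd

# Two Series that share the same index become two columns.
price = pd.Series([1.5, 2.0, 3.25], index=["x", "y", "z"])
qty = pd.Series([10, 20, 30], index=["x", "y", "z"])
table = pd.DataFrame({"price": price, "qty": qty})

col = table["qty"]          # selecting a column returns a Series
print(table.dtypes)         # each column keeps its own dtype
print(type(col).__name__)
```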
-------------------------------------------
import numpy as np
import pandas as pd
from datetime import datetime as dt
print('Create a DataFrame from a dictionary:')
df_1=pd.DataFrame({'A':1.0,
                   'B':pd.Timestamp(2019,8,19),
                   'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D':np.array([3]*4,dtype='int32'),
                   'E':pd.Categorical(['test','train','test','train']),
                   'F':'foo'})
print(df_1)
print('\n')
print('Data type of each column:')
print(df_1.dtypes)
print('\n')
print('Row index:')
print(df_1.index)
print('\n')
print('Column labels:')
print(df_1.columns)
print('\n')
print('All values:')
print(df_1.values)
print('\n')
print('Summary statistics for numeric columns:')
print(df_1.describe())
print('\n')
print('Transposed data:')
print(df_1.T)
print('\n')
print('Sort by column labels in descending order:')
# axis=1 sorts the column labels; ascending=False reverses them (F ... A)
print(df_1.sort_index(axis=1,ascending=False))
print('\n')
print('Sort by the values of a column:')
print(df_1.sort_values('E'))
print('\n')
----------Running the program above returns----------
Create a DataFrame from a dictionary:
A B C D E F
0 1.0 2019-08-19 1.0 3 test foo
1 1.0 2019-08-19 1.0 3 train foo
2 1.0 2019-08-19 1.0 3 test foo
3 1.0 2019-08-19 1.0 3 train foo
Data type of each column:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Row index:
Int64Index([0, 1, 2, 3], dtype='int64')
Column labels:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
All values:
[[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2019-08-19 00:00:00') 1.0 3 'train' 'foo']]
Summary statistics for numeric columns:
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
Transposed data:
0 1 2 3
A 1 1 1 1
B 2019-08-19 00:00:00 2019-08-19 00:00:00 2019-08-19 00:00:00 2019-08-19 00:00:00
C 1 1 1 1
D 3 3 3 3
E test train test train
F foo foo foo foo
Sort by column labels in descending order:
F E D C B A
0 foo test 3 1.0 2019-08-19 1.0
1 foo train 3 1.0 2019-08-19 1.0
2 foo test 3 1.0 2019-08-19 1.0
3 foo train 3 1.0 2019-08-19 1.0
Sort by the values of a column:
A B C D E F
0 1.0 2019-08-19 1.0 3 test foo
2 1.0 2019-08-19 1.0 3 test foo
1 1.0 2019-08-19 1.0 3 train foo
3 1.0 2019-08-19 1.0 3 train foo
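The two kinds of sorting shown above are easy to confuse, so here is a small sketch that contrasts them on a tiny frame. The frame and its values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"b": [3, 1, 2], "a": [9, 8, 7]}, index=[2, 0, 1])

by_columns = frame.sort_index(axis=1)                     # reorder column labels: a, b
by_rows_desc = frame.sort_index(axis=0, ascending=False)  # reorder row labels: 2, 1, 0
by_values = frame.sort_values("b")                        # reorder rows by the values in column b

print(by_columns)
print(by_rows_desc)
print(by_values)
```

`sort_index` only looks at labels, while `sort_values` looks at the data in the named column.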
5. Selecting Data in Pandas
import numpy as np
import pandas as pd
dates=pd.date_range('2019-08-01',periods=6)
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['A','B','C','D'])
print('Data with 6 rows and 4 columns:')
print(df)
print('Column B:')
print(df['B'])
----------Running the program above returns----------
Data with 6 rows and 4 columns:
A B C D
2019-08-01 -0.856790 -1.968381 -0.590032 -0.511943
2019-08-02 0.032420 0.750065 -1.168060 -1.571403
2019-08-03 0.962793 -2.377613 1.447871 -1.515988
2019-08-04 1.078565 1.780728 -0.060782 1.393749
2019-08-05 -1.785669 1.161425 0.440988 1.233997
2019-08-06 -0.740927 -0.877388 -0.868203 1.395331
Column B:
2019-08-01 -1.968381
2019-08-02 0.750065
2019-08-03 -2.377613
2019-08-04 1.780728
2019-08-05 1.161425
2019-08-06 -0.877388
Freq: D, Name: B, dtype: float64
Slice selection
print('Slice selection:')
print(df[0:3],df['20190801':'20190804'])
----------Running the program above returns----------
Slice selection:
A B C D
2019-08-01 -0.456445 -1.641900 0.878254 -0.265412
2019-08-02 0.223910 -1.524222 0.428250 0.410542
2019-08-03 -1.248945 0.649155 -1.039407 0.138473
A B C D
2019-08-01 -0.456445 -1.641900 0.878254 -0.265412
2019-08-02 0.223910 -1.524222 0.428250 0.410542
2019-08-03 -1.248945 0.649155 -1.039407 0.138473
2019-08-04 -1.135849 1.404054 -0.771489 -0.685064
Selecting data by row label with loc
print('Select data by row label:')
print(df.loc['2019-08-01',['A','B']])
----------Running the program above returns----------
Select data by row label:
A -0.495304
B -0.083505
Name: 2019-08-01 00:00:00, dtype: float64
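Selecting data by integer position with iloc
import numpy as np
import pandas as pd
# iloc positions are zero-based, so [3,1] is the fourth row, second column
print('Value at row position 3, column position 1:')
print(df.iloc[3,1])
print('\n')
print('Slice selection:')
print(df.iloc[3:5,0:2])
print('\n')
print('Non-contiguous selection:')
print(df.iloc[[1,2,4],[0,2]])
----------Running the program above returns----------
Value at row position 3, column position 1: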
1.2355112660049548
Slice selection:
A B
2019-08-04 -0.943150 1.235511
2019-08-05 -0.245097 -1.272304
Non-contiguous selection:
A C
2019-08-02 -0.212743 -0.584698
2019-08-03 0.012863 -0.896789
2019-08-05 -0.245097 2.646507
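One difference between loc and iloc that the examples above do not spell out is how slices end: label slices include the end label, while position slices exclude the end position. The sketch below illustrates this on a frame built the same way as in this section; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-08-01", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = frame.loc["2019-08-02":"2019-08-04", "A":"C"]  # both end labels are included
by_position = frame.iloc[1:4, 0:3]                        # end positions are excluded

print(by_label)
print(by_label.equals(by_position))
```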
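Mixed position and label selection (formerly ix)
import numpy as np
import pandas as pd
# df.ix was removed in pandas 1.0; take rows by position with iloc, then columns by label
print(df.iloc[:3][['A','C']])
----------Running the program above returns----------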
A C
2019-08-01 1.591064 1.272731
2019-08-02 1.820216 0.657560
2019-08-03 0.358265 -1.197687
Boolean filtering
import numpy as np
import pandas as pd
print('Filter by condition:')
print(df[df.A>0])
----------Running the program above returns----------
Filter by condition:
A B C D
2019-08-01 1.098786 0.261861 1.430775 -1.161001
2019-08-05 0.527853 -0.612058 -0.906565 1.279515
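Conditions can also be combined. The following sketch, on a small made-up frame, shows the element-wise operators `&`, `|`, and `~`; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1, 2, 5, -3], "B": [10, 20, 30, 40]})

both = frame[(frame["A"] > 0) & (frame["B"] < 30)]     # AND: both conditions hold
either = frame[(frame["A"] > 4) | (frame["B"] == 10)]  # OR: at least one holds
negated = frame[~(frame["A"] > 0)]                     # NOT: condition does not hold

print(both)
print(either)
print(negated)
```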
6. Setting Data in Pandas
Setting values with loc and iloc
import numpy as np
import pandas as pd
dates=pd.date_range('2019-08-01',periods=6)
df=pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
print('Data with 6 rows and 4 columns:')
print(df)
print('\n')
print('Data after the change:')
df.iloc[2,2]=999
df.loc['2019-08-01','D']=999
print(df)
print('\n')
----------Running the program above returns----------
Data with 6 rows and 4 columns:
A B C D
2019-08-01 0 1 2 3
2019-08-02 4 5 6 7
2019-08-03 8 9 10 11
2019-08-04 12 13 14 15
2019-08-05 16 17 18 19
2019-08-06 20 21 22 23
Data after the change:
A B C D
2019-08-01 0 1 2 999
2019-08-02 4 5 6 7
2019-08-03 8 9 999 11
2019-08-04 12 13 14 15
2019-08-05 16 17 18 19
2019-08-06 20 21 22 23
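Setting values by condition
import numpy as np
import pandas as pd
print('Set values by condition:')
# Every column of each row where A>0 is overwritten
df[df.A>0]=999
print(df)
----------Running the program above returns----------
Set values by condition: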
A B C D
2019-08-01 0 1 2 999
2019-08-02 999 999 999 999
2019-08-03 999 999 999 999
2019-08-04 999 999 999 999
2019-08-05 999 999 999 999
2019-08-06 999 999 999 999
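The assignment above overwrites whole rows. If only one column should change where the condition holds, one common approach is to pass the condition and the column label to loc together. This is a sketch on a small illustrative frame, not part of the original example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# Rows are chosen by the condition on A; only column B is written.
frame.loc[frame["A"] > 0, "B"] = 999

print(frame)
```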
Setting a whole row or column
import numpy as np
import pandas as pd
print('Set a whole row or column:')
df['C']=np.nan
print(df)
----------Running the program above returns----------
Set a whole row or column:
A B C D
2019-08-01 0 1 NaN 999
2019-08-02 999 999 NaN 999
2019-08-03 999 999 NaN 999
2019-08-04 999 999 NaN 999
2019-08-05 999 999 NaN 999
2019-08-06 999 999 NaN 999
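Adding data
import numpy as np
import pandas as pd
print('Add data:')
# The new Series starts on 2019-08-03, so it is aligned to df by date
df['E']=pd.Series([1,2,3,4,5,6],index=pd.date_range('2019-08-03',periods=6))
print(df)
----------Running the program above returns----------
Add data: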
A B C D E
2019-08-01 0 1 NaN 999 NaN
2019-08-02 999 999 NaN 999 NaN
2019-08-03 999 999 NaN 999 1.0
2019-08-04 999 999 NaN 999 2.0
2019-08-05 999 999 NaN 999 3.0
2019-08-06 999 999 NaN 999 4.0
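The NaN values in column E come from index alignment: a Series assigned as a column is matched to the frame by label, not by position. A minimal sketch with made-up labels:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
extra = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Aligned on labels: x has no match and becomes NaN, w has no row and is dropped.
frame["E"] = extra

print(frame)
```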
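7. Handling Missing Data in Pandas
Handling NaN values in the data
import numpy as np
import pandas as pd
dates=pd.date_range('2019-08-01',periods=6)
df=pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
# Cast B and C to float first so they can hold NaN without relying on silent upcasting
df=df.astype({'B':'float64','C':'float64'})
df.iloc[0,1]=np.nan
df.iloc[1,2]=np.nan
print('Data with 6 rows and 4 columns:')
print(df)
print('\n')
print('Use dropna() to drop rows or columns that contain NaN:')
# axis=0 drops rows; how='any' drops a row if any value in it is NaN
print(df.dropna(axis=0,how='any'))
print('\n')
print('Use fillna() to replace NaN values:')
print(df.fillna(value=0))
print('\n')
print('Use isnull() to check whether data is missing:')
print(df.isnull())
----------Running the program above returns----------
Data with 6 rows and 4 columns: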
A B C D
2019-08-01 0 NaN 2.0 3
2019-08-02 4 5.0 NaN 7
2019-08-03 8 9.0 10.0 11
2019-08-04 12 13.0 14.0 15
2019-08-05 16 17.0 18.0 19
2019-08-06 20 21.0 22.0 23
Use dropna() to drop rows or columns that contain NaN:
A B C D
2019-08-03 8 9.0 10.0 11
2019-08-04 12 13.0 14.0 15
2019-08-05 16 17.0 18.0 19
2019-08-06 20 21.0 22.0 23
Use fillna() to replace NaN values:
A B C D
2019-08-01 0 0.0 2.0 3
2019-08-02 4 5.0 0.0 7
2019-08-03 8 9.0 10.0 11
2019-08-04 12 13.0 14.0 15
2019-08-05 16 17.0 18.0 19
2019-08-06 20 21.0 22.0 23
Use isnull() to check whether data is missing:
A B C D
2019-08-01 False True False False
2019-08-02 False False True False
2019-08-03 False False False False
2019-08-04 False False False False
2019-08-05 False False False False
2019-08-06 False False False False
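A few variations on the same three functions are sketched below: a different fill value per column, dropping only rows that are entirely NaN, and counting missing values. The frame is a small made-up example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                      "B": [np.nan, np.nan, 6.0]})

has_missing = frame.isnull().any().any()          # True if any cell is NaN
counts = frame.isnull().sum()                     # number of NaN values per column
per_column = frame.fillna({"A": 0.0, "B": -1.0})  # a different fill value per column
drop_all_nan = frame.dropna(axis=0, how="all")    # drop rows where every value is NaN

print(counts)
print(per_column)
print(drop_all_nan)
```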
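8. Importing and Exporting Data in Pandas
pandas can read and write data in formats such as csv, excel, json, html, and pickle; see the official documentation for details.
import numpy as np
import pandas as pd
print('Read a csv file:')
data=pd.read_csv('test2.csv')
print(data)
print('Save the data as a pickle file:')
# to_pickle writes the file and returns None, which is what gets printed
print(data.to_pickle('test3.pickle'))
----------Running the program above returns----------
Read a csv file: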
A B C D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
Save the data as a pickle file:
None
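The example above depends on a file named test2.csv being present. For a version that runs anywhere, the sketch below round-trips a frame through CSV text held in memory; using `io.StringIO` here is a choice made for this illustration, not something the original does.

```python
import io
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

csv_text = frame.to_csv(index=False)           # with no path given, to_csv returns a string
restored = pd.read_csv(io.StringIO(csv_text))  # read the same text back from memory

print(csv_text)
print(restored.equals(frame))
```

Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column.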
9. Combining Data in Pandas
Concatenation direction with axis
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2=pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3=pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])
# axis=0 stacks the frames vertically; ignore_index=True renumbers the rows
res=pd.concat([df1,df2,df3],axis=0,ignore_index=True)
print(res)
----------Running the program above returns----------
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
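To see what the two arguments contribute, the sketch below compares the result with and without `ignore_index`, and with `axis=1`. The frames are smaller illustrative versions of the ones above.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.zeros((2, 2)), columns=["a", "b"])
df_b = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])

kept = pd.concat([df_a, df_b], axis=0)                           # row labels repeat: 0, 1, 0, 1
renumbered = pd.concat([df_a, df_b], axis=0, ignore_index=True)  # row labels: 0, 1, 2, 3
side_by_side = pd.concat([df_a, df_b], axis=1)                   # columns: a, b, a, b

print(kept)
print(renumbered)
print(side_by_side)
```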
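Join modes with join
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.ones((3,4))*0,columns=['A','B','C','D'],index=[1,2,3])
df2=pd.DataFrame(np.ones((3,4))*1,columns=['B','C','D','E'],index=[2,3,4])
print('First DataFrame:')
print(df1)
print('\n')
print('Second DataFrame:')
print(df2)
print('\n')
print('Outer join on the row index (like a full outer join):')
res=pd.concat([df1,df2],axis=1,join='outer')
print(res)
print('\n')
print('Inner join on the row index (like an inner join):')
res2=pd.concat([df1,df2],axis=1,join='inner')
print(res2)
print('\n')
print('Align to the index of df1 (like a left join):')
# Reindex to df1.index so that only the rows of df1 are kept
res3=pd.concat([df1,df2],axis=1).reindex(df1.index)
print(res3)
----------Running the program above returns----------
First DataFrame: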
A B C D
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
Second DataFrame:
B C D E
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
Outer join on the row index (like a full outer join):
A B C D B C D E
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
Inner join on the row index (like an inner join):
A B C D B C D E
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
Align to the index of df1 (like a left join):
A B C D B C D E
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
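Appending data (formerly append)
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2=pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3=pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])
s1=pd.Series([1,2,3,4],index=['a','b','c','d'])
print('Append df2 below df1 and reset the index')
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
res=pd.concat([df1,df2],ignore_index=True)
print(res)
print('\n')
print('Append s1 below df1 and reset the index')
# Turn the Series into a one-row DataFrame before concatenating
res2=pd.concat([df1,s1.to_frame().T],ignore_index=True)
print(res2)
----------Running the program above returns----------
Append df2 below df1 and reset the index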
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
Append s1 below df1 and reset the index
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0
10. Merging Data with merge
Merging on a single key
import pandas as pd
left=pd.DataFrame({'key':['k0','k1','k2','k3'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})
print('First DataFrame:')
print(left)
print('\n')
right=pd.DataFrame({'key':['k0','k1','k2','k3'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']})
print('Second DataFrame:')
print(right)
print('\n')
print('Merge on key:')
res=pd.merge(left,right,on='key')
print(res)
----------Running the program above returns----------
First DataFrame:
key A B
0 k0 A0 B0
1 k1 A1 B1
2 k2 A2 B2
3 k3 A3 B3
Second DataFrame:
key C D
0 k0 C0 D0
1 k1 C1 D1
2 k2 C2 D2
3 k3 C3 D3
Merge on key:
key A B C D
0 k0 A0 B0 C0 D0
1 k1 A1 B1 C1 D1
2 k2 A2 B2 C2 D2
3 k3 A3 B3 C3 D3
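When the key column has a different name in each frame, `on` cannot be used; `left_on` and `right_on` name the column on each side instead. The frames and column names below are hypothetical and only serve to illustrate this.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": ["k0", "k1", "k0"]})
customers = pd.DataFrame({"cust_key": ["k0", "k1"], "name": ["Ann", "Bob"]})

# Match orders.cust against customers.cust_key (inner join by default).
merged = pd.merge(orders, customers, left_on="cust", right_on="cust_key")

print(merged.sort_values("order_id"))
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.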
Merging on two keys
import pandas as pd
left=pd.DataFrame({'key1':['k0','k1','k2','k3'],
                   'key2':['k0','k1','k0','k1'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})
print('First DataFrame:')
print(left)
print('\n')
right=pd.DataFrame({
    'key1':['k0','k1','k2','k3'],
    'key2':['k0','k0','k0','k0'],
    'C':['C0','C1','C2','C3'],
    'D':['D0','D1','D2','D3']})
print('Second DataFrame:')
print(right)
print('\n')
print('Inner merge')
res=pd.merge(left,right,on=['key1','key2'],how='inner')
print(res)
print('\n')
print('Outer merge')
res2=pd.merge(left,right,on=['key1','key2'],how='outer')
print(res2)
print('\n')
print('Left merge')
res3=pd.merge(left,right,on=['key1','key2'],how='left')
print(res3)
print('\n')
print('Right merge')
res4=pd.merge(left,right,on=['key1','key2'],how='right')
print(res4)
----------Running the program above returns----------
First DataFrame:
key1 key2 A B
0 k0 k0 A0 B0
1 k1 k1 A1 B1
2 k2 k0 A2 B2
3 k3 k1 A3 B3
Second DataFrame:
key1 key2 C D
0 k0 k0 C0 D0
1 k1 k0 C1 D1
2 k2 k0 C2 D2
3 k3 k0 C3 D3
Inner merge
key1 key2 A B C D
0 k0 k0 A0 B0 C0 D0
1 k2 k0 A2 B2 C2 D2
Outer merge
key1 key2 A B C D
0 k0 k0 A0 B0 C0 D0
1 k1 k1 A1 B1 NaN NaN
2 k2 k0 A2 B2 C2 D2
3 k3 k1 A3 B3 NaN NaN
4 k1 k0 NaN NaN C1 D1
5 k3 k0 NaN NaN C3 D3
Left merge
key1 key2 A B C D
0 k0 k0 A0 B0 C0 D0
1 k1 k1 A1 B1 NaN NaN
2 k2 k0 A2 B2 C2 D2
3 k3 k1 A3 B3 NaN NaN
Right merge
key1 key2 A B C D
0 k0 k0 A0 B0 C0 D0
1 k2 k0 A2 B2 C2 D2
2 k1 k0 NaN NaN C1 D1
3 k3 k0 NaN NaN C3 D3
Merging with indicator
import pandas as pd
df1=pd.DataFrame({'col1':[0,1],'col_left':['a','b']})
df2=pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print('First DataFrame:')
print(df1)
print('\n')
print('Second DataFrame:')
print(df2)
print('\n')
print('Merge on col1 with indicator=True to show how each row was matched:')
res=pd.merge(df1,df2,on='col1',how='outer',indicator=True)
print(res)
print('\n')
----------Running the program above returns----------
First DataFrame:
col1 col_left
0 0 a
1 1 b
Second DataFrame:
col1 col_right
0 1 2
1 2 2
2 2 2
Merge on col1 with indicator=True to show how each row was matched:
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
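The `_merge` column is useful beyond display. One possible use, sketched here with the same data, is to keep only rows that have no match on the other side and to count rows by origin. The variable names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
right = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

merged = pd.merge(left, right, on="col1", how="outer", indicator=True)
left_only = merged[merged["_merge"] == "left_only"]  # rows with no match in right
counts = merged["_merge"].value_counts()             # number of rows of each kind

print(left_only)
print(counts)
```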
Merging on the index
import numpy as np
import pandas as pd
left=pd.DataFrame({'A':['A0','A1','A2'],
                   'B':['B0','B1','B2']},
                  index=['k0','k1','k2'])
right=pd.DataFrame({'C':['C0','C1','C2'],
                    'D':['D0','D1','D2']},
                   index=['k0','k2','k3'])
print('First DataFrame:')
print(left)
print('\n')
print('Second DataFrame:')
print(right)
print('\n')
print('Merge on the index with an outer join')
res=pd.merge(left,right,left_index=True,right_index=True,how='outer')
print(res)
print('\n')
print('Merge on the index with an inner join')
res2=pd.merge(left,right,left_index=True,right_index=True,how='inner')
print(res2)
print('\n')
----------Running the program above returns----------
First DataFrame:
A B
k0 A0 B0
k1 A1 B1
k2 A2 B2
Second DataFrame:
C D
k0 C0 D0
k2 C1 D1
k3 C2 D2
Merge on the index with an outer join
A B C D
k0 A0 B0 C0 D0
k1 A1 B1 NaN NaN
k2 A2 B2 C1 D1
k3 NaN NaN C2 D2
Merge on the index with an inner join
A B C D
k0 A0 B0 C0 D0
k2 A2 B2 C1 D1
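For index-based merges, `DataFrame.join` is a shorter spelling that should give the same rows. The sketch below compares the two on reduced versions of the frames above; it is an illustration rather than the method used in the original.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["k0", "k1", "k2"])
right = pd.DataFrame({"C": ["C0", "C1", "C2"]}, index=["k0", "k2", "k3"])

via_merge = pd.merge(left, right, left_index=True, right_index=True, how="inner")
via_join = left.join(right, how="inner")  # joins on the index by default

print(via_join)
print(via_merge.sort_index().equals(via_join.sort_index()))
```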