Getting to know pandas: a Python data analysis library
Installation: pip install pandas
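pandas is a tool built on top of NumPy, created to solve data analysis tasks. It incorporates a number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets, along with a large set of functions and methods for processing data quickly and conveniently. It is one of the key factors that make Python a powerful and efficient data analysis environment. Its main features: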
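- A fast, efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: automatic label-based alignment in computations, and easy manipulation of messy data into an orderly form;
- Flexible reshaping and pivoting of datasets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large datasets;
- Columns can be inserted into and deleted from data structures, so their size is mutable;
- Aggregating or transforming data with a powerful group-by engine, allowing split-apply-combine operations on datasets;
- High-performance merging and joining of datasets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regression, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C;
- Python with pandas is used in a wide range of academic and commercial fields, including finance, neuroscience, economics, statistics, advertising, web analytics, and more.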
pandas Chinese-language site: https://www.pypandas.cn
Data structures: Series and DataFrame
One-dimensional array: Series
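A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). The axis labels are collectively referred to as the index.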
import numpy as np
import pandas as pd

ar = np.random.rand(5)
s = pd.Series(ar)
print(s)
print(s.index)   # .index shows the index (labels) of the Series
print(s.values)  # .values shows the values of the Series
---------------------------------------------------
0    0.119383
1    0.247409
2    0.248272
3    0.410680
4    0.439547
dtype: float64
RangeIndex(start=0, stop=5, step=1)
[0.11938319 0.24740862 0.24827207 0.41068032 0.43954667]
Three ways to create a Series:
From a dict
dit = {'a': 1, 'b': 2, 'c': 3, 'f': 6}
s = pd.Series(dit)  # dict keys become the index
print(s)
---------------------------------
a    1
b    2
c    3
f    6
dtype: int64
From an array
ar = np.random.rand(5) * 100
# np.str has been removed from recent NumPy releases; the built-in str does the same job
s = pd.Series(ar, index=list('abcde'), dtype=str)
print(s)
--------------------------------------------
a    31.644744342854725
b     6.783679074968873
c     6.753556225037693
d     43.71090526035562
e     65.35205915903558
dtype: object
From a scalar
s = pd.Series(100, index=range(10))  # the scalar is repeated for every index label
print(s)
-------------------------------
0    100
1    100
2    100
3    100
4    100
5    100
6    100
7    100
8    100
9    100
dtype: int64
The name attribute
ar = np.random.rand(2)
s = pd.Series(ar)
s1 = pd.Series(ar, name='test')
print(s, type(s))
print(s1, type(s1))
s2 = s1.rename("abcd")  # rename returns a new Series; s1 keeps its name
print(s2, type(s2))
print(s1, type(s1))
---------------------------
0    0.820561
1    0.330791
dtype: float64 <class 'pandas.core.series.Series'>
0    0.820561
1    0.330791
Name: test, dtype: float64 <class 'pandas.core.series.Series'>
0    0.820561
1    0.330791
Name: abcd, dtype: float64 <class 'pandas.core.series.Series'>
0    0.820561
1    0.330791
Name: test, dtype: float64 <class 'pandas.core.series.Series'>
Indexing
ar = np.random.rand(5)
s = pd.Series(ar)
print(s[0])  # index by position-like integer label
print(s[2])
s1 = pd.Series(ar, index=list('abcde'))
print(s1)
print(s1['a'])                # label indexing
print(s1[0:3], s1.iloc[4])    # slice indexing; .iloc for an explicit position on a string index

# Boolean indexing
ar = np.random.rand(5) * 100
s2 = pd.Series(ar)
s2[6] = None  # add a missing value under a new label
print(s2)
bs1 = s2 > 50
bs2 = s2.isnull()
bs3 = s2.notnull()
print(bs1)
print(bs2)
print(bs3)
print(s2[s2 > 50])
print(s2[bs3])
------------------------
0.61815875542277
0.019856009429792598
a    0.618159
b    0.823132
c    0.019856
d    0.737151
e    0.840799
dtype: float64
0.61815875542277
a    0.618159
b    0.823132
c    0.019856
dtype: float64 0.8407993638916321
0    9.29894
1    84.7848
2    24.4915
3    59.9761
4    91.5569
6       None
dtype: object
0    False
1     True
2    False
3     True
4     True
6    False
dtype: bool
0    False
1    False
2    False
3    False
4    False
6     True
dtype: bool
0     True
1     True
2     True
3     True
4     True
6    False
dtype: bool
1    84.7848
3    59.9761
4    91.5569
dtype: object
0    9.29894
1    84.7848
2    24.4915
3    59.9761
4    91.5569
dtype: object
Other attributes and operations
ar = np.random.randint(100, size=10)
s = pd.Series(ar, index=list('abcdefgjkl'))
print(s)
print(s.head())  # first 5 entries
print(s.tail())  # last 5 entries
s[['a', 'e', 'f']] = 100     # modify (pass a list of labels, not a tuple)
s.drop('b', inplace=True)    # delete
s['o'] = 500                 # add
print("++++", s)

# Reindexing: labels missing from the original become NaN
s1 = pd.Series(np.random.rand(5), index=list('abcde'))
s2 = s1.reindex(['b', 'c', 'd', 'e', 'f'])
print(s2)

# Alignment: arithmetic matches on labels, not positions
d = pd.Series(np.random.rand(3), index=['Tom', 'Marry', 'Jam'])
d2 = pd.Series(np.random.rand(3), index=['Tom', 'Lucy', 'Jam'])
print(d)
print(d2)
print(d2 + d)
--------------------------------
a    75
b    45
c    86
d     0
e    29
f     8
g    41
j    51
k    30
l    58
dtype: int32
a    75
b    45
c    86
d     0
e    29
dtype: int32
f     8
g    41
j    51
k    30
l    58
dtype: int32
++++ a    100
c     86
d      0
e    100
f    100
g     41
j     51
k     30
l     58
o    500
dtype: int64
b    0.962842
c    0.061086
d    0.135772
e    0.845562
f         NaN
dtype: float64
Tom      0.828716
Marry    0.383809
Jam      0.600144
dtype: float64
Tom     0.048050
Lucy    0.379492
Jam     0.072854
dtype: float64
Jam      0.672999
Lucy          NaN
Marry         NaN
Tom      0.876766
dtype: float64
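The alignment output above has NaN for every label that exists on only one side. A minimal sketch of how that can be avoided, using hypothetical fixed values instead of random ones and Series.add with fill_value:

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up example values)
d = pd.Series([1.0, 2.0, 3.0], index=['Tom', 'Marry', 'Jam'])
d2 = pd.Series([10.0, 20.0, 30.0], index=['Tom', 'Lucy', 'Jam'])

# Plain + aligns on labels; a label present on only one side gives NaN
plain = d + d2

# Series.add with fill_value=0 treats the missing side as 0 instead
filled = d.add(d2, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in for a missing label depends on the data; for sums it usually is, for averages usually not.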
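Two-dimensional array: DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input: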
- A dict of 1D ndarrays, lists, dicts, or Series
- A 2-D numpy.ndarray
- A structured or record ndarray
- A Series
- Another DataFrame
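Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching the passed index.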
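A small sketch of that last point, with made-up values: passing index and columns to the constructor keeps only the matching labels and fills the rest with NaN.

```python
import pandas as pd

data = {
    'one': pd.Series([1.0, 2.0], index=['a', 'b']),
    'two': pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c']),
}

# Only labels 'b', 'c', 'd' survive: 'a' is discarded, 'd' is all NaN.
# columns= also fixes the column order.
fr = pd.DataFrame(data, index=['b', 'c', 'd'], columns=['two', 'one'])
print(fr)
```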
Five ways to create a DataFrame:
From a dict of lists or arrays
# From a dict of lists or arrays
data = {
    'name': ['Jack', 'Mary', 'Tom'],
    'age': [14, 15, 17],
    'gender': ['M', 'W', 'M']
}
fr = pd.DataFrame(data)
print(fr)
print(type(fr))
print(fr.index, 'type is:', type(fr.index))      # row labels
print(fr.values, 'type is:', type(fr.values))    # values
print(fr.columns, 'type is:', type(fr.columns))  # column labels
-----------------------------------------------------------------
   name  age gender
0  Jack   14      M
1  Mary   15      W
2   Tom   17      M
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1) type is: <class 'pandas.core.indexes.range.RangeIndex'>
[['Jack' 14 'M']
 ['Mary' 15 'W']
 ['Tom' 17 'M']] type is: <class 'numpy.ndarray'>
Index(['name', 'age', 'gender'], dtype='object') type is: <class 'pandas.core.indexes.base.Index'>
From a dict of Series
# From a dict of Series
data1 = {
    'one': pd.Series(np.random.rand(2)),
    'two': pd.Series(np.random.rand(3)),
}
print(data1)
data2 = {
    'one': pd.Series(np.random.rand(2), index=['a', 'b']),
    'two': pd.Series(np.random.rand(3), index=['a', 'b', 'c']),
}
print(data2)
fr1 = pd.DataFrame(data1)
fr2 = pd.DataFrame(data2)
print(fr1)
print(fr2)
-----------------------------
{'one': 0    0.432652
1    0.552177
dtype: float64, 'two': 0    0.946339
1    0.326405
2    0.352883
dtype: float64}
{'one': a    0.353147
b    0.176789
dtype: float64, 'two': a    0.121450
b    0.371344
c    0.240906
dtype: float64}
        one       two
0  0.432652  0.946339
1  0.552177  0.326405
2       NaN  0.352883
        one       two
a  0.353147  0.121450
b  0.176789  0.371344
c       NaN  0.240906
From a two-dimensional array (common)
# From a two-dimensional array (common)
ar = np.random.rand(9).reshape(3, 3)
print(ar)
fr3 = pd.DataFrame(ar)
fr4 = pd.DataFrame(ar, index=['a', 'b', 'c'], columns=['s', 'h', 'j'])
print(fr3)
print(fr4)
------------------------
[[0.80857571 0.31437002 0.00130739]
 [0.24521627 0.04577992 0.19544072]
 [0.23923237 0.26033495 0.17534313]]
          0         1         2
0  0.808576  0.314370  0.001307
1  0.245216  0.045780  0.195441
2  0.239232  0.260335  0.175343
          s         h         j
a  0.808576  0.314370  0.001307
b  0.245216  0.045780  0.195441
c  0.239232  0.260335  0.175343
From a list of dicts
data3 = [{'one': 1, 'two': 2, 'three': 3}, {'four': 4, 'five': 5, 'six': 6}]
fr5 = pd.DataFrame(data3)  # each dict becomes one row; missing keys become NaN
print(fr5)
---------------------
   one  two  three  four  five  six
0  1.0  2.0    3.0   NaN   NaN  NaN
1  NaN  NaN    NaN   4.0   5.0  6.0
From a dict of dicts
data4 = {
    'Tom': {'art': 67, 'english': 98, 'china': 76},
    'Mary': {'art': 45, 'english': 78, 'china': 70},
    'Lucy': {'art': 58, 'english': 79},
}
fr6 = pd.DataFrame(data4)  # outer keys become columns, inner keys become the row index
print(fr6)
---------------------------
         Tom  Mary  Lucy
art       67    45  58.0
english   98    78  79.0
china     76    70   NaN
Indexing
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                  index=['one', 'two', 'three'], columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                   columns=['a', 'b', 'c', 'd'])
print(df)
print(df2)

# Selecting columns
data = df['a']
data1 = df[['a', 'b']]
print('data', data)
print('data1', data1)

# Selecting rows
data3 = df.loc['one']               # select a row by index label
data4 = df.loc[['three', 'one']]
print('data3', data3)
print('data4', data4)
data5 = df.iloc[-1]                 # select a row by integer position
print('data5', data5)

print("Single-label indexing\n")    # "\n", not "/n", for a newline
print(df.loc['one'])
print(df2.loc[1])
print("Multi-label indexing\n")
print(df.loc[['two', 'one']])
print(df2.loc[[2, 1]])
print("Slice indexing\n")
print(df.loc['one':'two'])          # label slices include both endpoints
print(df2.loc[1:2])
---------------------------
               a          b          c          d
one    79.285201  73.277718  12.225063  18.830074
two     2.400540  49.604940  80.337070  47.133134
three  17.399693  92.839253  90.041425  75.505320
           a          b          c          d
0  47.065633  21.284022  30.118641  85.652279
1  12.201863  48.841603  23.367143  32.276774
2  77.422617  55.812583  56.130735  64.983035
data one      79.285201
two       2.400540
three    17.399693
Name: a, dtype: float64
data1                a          b
one    79.285201  73.277718
two     2.400540  49.604940
three  17.399693  92.839253
data3 a    79.285201
b    73.277718
c    12.225063
d    18.830074
Name: one, dtype: float64
data4                a          b          c          d
three  17.399693  92.839253  90.041425  75.505320
one    79.285201  73.277718  12.225063  18.830074
data5 a    17.399693
b    92.839253
c    90.041425
d    75.505320
Name: three, dtype: float64
Single-label indexing

a    79.285201
b    73.277718
c    12.225063
d    18.830074
Name: one, dtype: float64
a    12.201863
b    48.841603
c    23.367143
d    32.276774
Name: 1, dtype: float64
Multi-label indexing

             a          b          c          d
two   2.400540  49.604940  80.337070  47.133134
one  79.285201  73.277718  12.225063  18.830074
           a          b          c          d
2  77.422617  55.812583  56.130735  64.983035
1  12.201863  48.841603  23.367143  32.276774
Slice indexing

             a          b          c          d
one  79.285201  73.277718  12.225063  18.830074
two   2.400540  49.604940  80.337070  47.133134
           a          b          c          d
1  12.201863  48.841603  23.367143  32.276774
2  77.422617  55.812583  56.130735  64.983035
Boolean indexing
df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                  index=['one', 'two', 'three'], columns=['a', 'b', 'c', 'd'])
b1 = df < 20
print(b1, type(b1))
print(df[b1])   # no column selected: every value is tested, non-matching cells become NaN
b2 = df["a"] < 20
print(b2, type(b2))
print(df[b2])   # test on a single column: keeps whole rows where the condition holds
b3 = df[["a", 'b']] < 20
print(b3, type(b3))
print(df[b3])   # test on several columns: columns outside the mask become NaN
b4 = df.loc[["one", 'two']] < 50
print(b4, type(b4))
print(df[b4])   # test on several rows: rows outside the mask become NaN
-------------------------------------
           a      b      c      d
one     True   True  False  False
two    False  False  False  False
three   True  False  False  False <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one    12.319044  16.517952 NaN NaN
two          NaN        NaN NaN NaN
three   8.939486        NaN NaN NaN
one       True
two      False
three     True
Name: a, dtype: bool <class 'pandas.core.series.Series'>
               a          b          c          d
one    12.319044  16.517952  97.270662  76.200591
three   8.939486  38.428862  25.783585  30.355222
           a      b
one     True   True
two    False  False
three   True  False <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one    12.319044  16.517952 NaN NaN
two          NaN        NaN NaN NaN
three   8.939486        NaN NaN NaN
Combined (chained) indexing
df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                  index=['one', 'two', 'three'], columns=['a', 'b', 'c', 'd'])
print(df['a'].loc['one'])                # column first, then a row label
print(df['a'].loc[['one', 'two']])
print(df[['b', 'c', 'd']].iloc[1:2])     # several columns, then a positional row slice
print(df[df < 50][['a', 'b']])           # boolean mask, then column selection
--------------------------------
31.995689334678335
one    31.995689
two     6.516284
Name: a, dtype: float64
             b          c          d
two  19.048351  31.111981  60.956516
               a          b
one    31.995689  38.992923
two     6.516284  19.048351
three        NaN  31.623816
Other attributes and operations
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10).reshape(5, 2), columns=['a', 'b'])
print(df)
print(df.T)         # transpose
print(df.head(2))   # first 2 rows
print(df.tail(2))   # last 2 rows
df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4), columns=['a', 'b', 'c', 'd'])
print(df2)
df2['c'] = 100      # modify a column
df2['e'] = 10       # add a column
print(df2)
print(df.drop(0, inplace=True))     # drop a row; inplace=True changes df itself and returns None
print(df.drop(['a'], axis=1))       # drop a column; without inplace a new object is returned and df is unchanged
# Sorting
print(df.sort_values(['a'], ascending=True))    # ascending
print(df.sort_values(['a'], ascending=False))   # descending
-----------------------------------------------------------------
          a         b
0  0.940085  0.181402
1  0.536894  0.488670
2  0.217216  0.854319
3  0.478155  0.066919
4  0.467400  0.194862
          0         1         2         3         4
a  0.940085  0.536894  0.217216  0.478155  0.467400
b  0.181402  0.488670  0.854319  0.066919  0.194862
          a         b
0  0.940085  0.181402
1  0.536894  0.488670
          a         b
3  0.478155  0.066919
4  0.467400  0.194862
          a         b         c         d
0  0.849237  0.284547  0.353720  0.470520
1  0.294418  0.909727  0.375445  0.975046
2  0.588561  0.386173  0.703177  0.341634
3  0.180870  0.831200  0.392450  0.036837
          a         b    c         d   e
0  0.849237  0.284547  100  0.470520  10
1  0.294418  0.909727  100  0.975046  10
2  0.588561  0.386173  100  0.341634  10
3  0.180870  0.831200  100  0.036837  10
None
          b
1  0.488670
2  0.854319
3  0.066919
4  0.194862
          a         b
2  0.217216  0.854319
4  0.467400  0.194862
3  0.478155  0.066919
1  0.536894  0.488670
          a         b
1  0.536894  0.488670
3  0.478155  0.066919
4  0.467400  0.194862
2  0.217216  0.854319
Time module
datetime.datetime()
import datetime

t1 = datetime.datetime.now()
t2 = datetime.datetime(2016, 2, 5)
t3 = datetime.datetime(2016, 2, 5, 12, 30, 34)
print(t1, type(t1))
print(t2, type(t2))
print(t3, type(t3))
---------------------------
2019-09-17 12:25:54.780962 <class 'datetime.datetime'>
2016-02-05 00:00:00 <class 'datetime.datetime'>
2016-02-05 12:30:34 <class 'datetime.datetime'>
datetime.timedelta()
import datetime

t1 = datetime.datetime(2016, 4, 6)
t2 = datetime.timedelta(10, 200)   # positional arguments are (days, seconds)
print(t1 + t2)
---------------------------
2016-04-16 00:03:20
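The same arithmetic written with keyword arguments, plus the reverse direction: subtracting two datetimes gives a timedelta back. A minimal sketch using only the standard library; the variable names are illustrative.

```python
import datetime

start = datetime.datetime(2016, 4, 6)
delta = datetime.timedelta(days=10, seconds=200)  # keywords make the units explicit

end = start + delta
print(end)                      # 2016-04-16 00:03:20

# datetime - datetime gives a timedelta
elapsed = end - start
print(elapsed.days)             # 10
print(elapsed.total_seconds())  # 864200.0
```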
Converting date formats
from dateutil.parser import parse

date = "2015 2 20"
date1 = "2015-3-25"
date2 = "2016/3/8"
print(parse(date))    # parse() recognizes many common date layouts
print(parse(date1))
print(parse(date2))
----------------------------
2015-02-20 00:00:00
2015-03-25 00:00:00
2016-03-08 00:00:00
pd.Timestamp(): timestamps
import datetime
import pandas as pd

date1 = "2017-05-01 12:25:12"
date2 = datetime.datetime(2017, 5, 6, 14, 15, 23)
t1 = pd.Timestamp(date1)   # timestamp from a string
t2 = pd.Timestamp(date2)   # timestamp from a datetime object
print(t1, type(t1))
print(t2, type(t2))
print(date2, type(date2))
---------------------------------
2017-05-01 12:25:12 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-05-06 14:15:23 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-05-06 14:15:23 <class 'datetime.datetime'>
pd.to_datetime()
date1 = "2017-05-01 12:25:12"
date2 = datetime.datetime(2017, 5, 6, 14, 15, 23)
t1 = pd.to_datetime(date1)   # a single value gives a Timestamp
t2 = pd.to_datetime(date2)
print(t1, type(t1))
print(t2, type(t2))

# Several dates are converted to a pandas DatetimeIndex
ls_date = ['2017-01-01', '2017-01-02', '2017-01-03']
t3 = pd.to_datetime(ls_date)
print(t3, type(t3))

# When the data contains values that are not dates
date3 = ['2017-01-01', '2017-01-02', '2017-01-03', 'hello', '2018-01-05']
t4 = pd.to_datetime(date3, errors='ignore')   # returns the input unconverted (a plain Index here); errors='ignore' is deprecated in pandas 2.2+
print(t4, type(t4))
t5 = pd.to_datetime(date3, errors='coerce')   # unparseable values become NaT; the result is a DatetimeIndex
print(t5, type(t5))
------------------------------------------------
2017-05-01 12:25:12 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-05-06 14:15:23 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Index(['2017-01-01', '2017-01-02', '2017-01-03', 'hello', '2018-01-05'], dtype='object') <class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', 'NaT', '2018-01-05'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
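A common follow-up to errors='coerce' is to drop the entries that failed to parse. A minimal sketch with a hypothetical input list; passing an explicit format is optional, but it keeps the parsing rule unambiguous.

```python
import pandas as pd

raw = ['2017-01-01', '2017-01-02', 'hello', '2018-01-05']

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# notna() gives a boolean mask that filters the NaT entries out
clean = parsed[parsed.notna()]

print(parsed)
print(clean)
```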
pd.date_range(start, end, periods, freq)
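'''
date_range(start, end, periods, freq)
    start:   start time
    end:     end time
    periods: number of periods to generate
    freq:    frequency, calendar day by default
pd.date_range() defaults to calendar-day frequency
pd.bdate_range() defaults to business-day frequency
'''
date = pd.date_range('2014-01-01', '2014-02-01')
date1 = pd.date_range(start='2014-01-01', periods=10)
date2 = pd.date_range(end='2014-01-01', periods=10)
date3 = pd.date_range('2014-01-01', '2014-01-02', freq="H")
print(date)
print(date1)
print(date2)
print(date3)

# normalize=True snaps the start/end times to midnight
date4 = pd.date_range('2019-05-01 12:25:00', periods=2, name='hello', normalize=True)
print(date4)

# Endpoints: both included by default.
# inclusive= replaces the older closed= argument (pandas 1.4+; closed= was removed in 2.0)
print(pd.date_range('20190101', '20190105'))                      # both ends closed
print(pd.date_range('20190101', '20190105', inclusive='right'))   # left open, right closed
print(pd.date_range('20190101', '20190105', inclusive='left'))    # left closed, right open
print(pd.bdate_range('20190101', '20190107'))                     # default frequency is business day

# Frequency aliases.
# Note: pandas 2.2+ renames several of the aliases below; the newer spelling is given in parentheses.
print(pd.date_range('20190101', '20190110'))                         # calendar day (default)
print(pd.date_range('20190101', '20190110', freq='B'))               # every business day
print(pd.date_range('20190101', '20190110', freq='H'))               # hourly ('h')
print(pd.date_range('20190101 12:00', '20190110 12:20', freq='T'))   # every minute ('min')
# print(pd.date_range('20190101 12:00:00', '20190110 12:20:01', freq='S'))  # every second ('s')
# print(pd.date_range('20190101 12:00:00', '20190110 12:20:01', freq='L'))  # every millisecond ('ms')
# print(pd.date_range('20190101 12:00:00', '20190110 12:20:01', freq='U'))  # every microsecond ('us')

# Weekday abbreviations: MON/TUE/WED/THU/FRI/SAT/SUN
print(pd.date_range('20190101', '20190210', freq='W-MON'))      # weekly on the given weekday: every Monday
print(pd.date_range('20190101', '20190210', freq='WOM-2MON'))   # n-th weekday of each month: the second Monday

# Month abbreviations: JAN/FEB/MAR/APR/MAY/JUN/JUL/AUG/SEP/OCT/NOV/DEC
print(pd.date_range('2018', '2020', freq='M'))        # last calendar day of each month ('ME')
print(pd.date_range('2018', '2020', freq='Q-DEC'))    # last calendar day of each quarter, for a year ending in the given month ('QE-DEC')
print(pd.date_range('2018', '2020', freq='A-DEC'))    # last calendar day of the given month each year ('YE-DEC')
print(pd.date_range('2018', '2020', freq='BM'))       # last business day of each month ('BME')
print(pd.date_range('2018', '2020', freq='BQ-DEC'))   # last business day of each quarter ('BQE-DEC')
print(pd.date_range('2018', '2020', freq='BA-DEC'))   # last business day of the given month each year ('BYE-DEC')
print(pd.date_range('2018', '2020', freq='MS'))       # first calendar day of each month
print(pd.date_range('2018', '2020', freq='QS-DEC'))   # first calendar day of each quarter, for a year ending in the given month
print(pd.date_range('2018', '2020', freq='AS-DEC'))   # first calendar day of the given month each year ('YS-DEC')
print(pd.date_range('2018', '2020', freq='BMS'))      # first business day of each month
print(pd.date_range('2018', '2020', freq='BQS-DEC'))  # first business day of each quarter
print(pd.date_range('2018', '2020', freq='BAS-DEC'))  # first business day of the given month each year ('BYS-DEC')

# Compound frequencies
print(pd.date_range('20180701', '20180801', freq='7D'))        # every 7 days
print(pd.date_range('20180701', '20180801', freq='2H30min'))   # every 2 hours 30 minutes ('2h30min')
print(pd.date_range('2018', '2019', freq='2MS'))               # every 2 months, on the first calendar day of the month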
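As a compact check of how start, end, periods and freq interact, the sketch below builds the same range in two ways and then generates the Mondays of one month. It sticks to aliases ('3D', 'W-MON') that, as far as I know, are unchanged across pandas versions; the dates are arbitrary.

```python
import pandas as pd

# Same range, specified as start + end or as start + number of periods
by_end = pd.date_range(start='2019-01-01', end='2019-01-10', freq='3D')
by_periods = pd.date_range(start='2019-01-01', periods=4, freq='3D')
print(by_end)
print(by_end.equals(by_periods))

# All Mondays in January 2019 (1 January 2019 was a Tuesday)
mondays = pd.date_range('2019-01-01', '2019-01-31', freq='W-MON')
print(mondays)
```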
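asfreq(): frequency conversion
date = pd.Series(
    np.random.rand(4),
    index=pd.date_range('20180101', '20180104')
)
print(date)
# '4H' is spelled '4h' in pandas 2.2+
print(date.asfreq('4H'))                   # newly created time points are NaN
print(date.asfreq('4H', method='ffill'))   # fill with the previous value
print(date.asfreq('4H', method='bfill'))   # fill with the next value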
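shift(): leading and lagging data
date = pd.Series(
    np.random.rand(4),
    index=pd.date_range('20180101', '20180104')
)
print(date)
print(date.shift(2))    # values move 2 positions later (lag); the first 2 entries become NaN
print(date.shift(-2))   # values move 2 positions earlier (lead); the last 2 entries become NaN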
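A typical use of shift() is to compare each value with the previous one. A minimal sketch with made-up numbers; the result matches what Series.diff() would compute.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 11.0],
              index=pd.date_range('2018-01-01', periods=4))

lagged = s.shift(1)    # yesterday's value placed on today's date
change = s - lagged    # day-over-day change; the first entry has no predecessor

print(lagged)
print(change)
```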
Period(): time periods
# Creating periods
date = pd.Period('2017', freq='M')
print(date)
date2 = pd.period_range('2017', '2018', freq='M')
print(date2, type(date2))

# Converting between timestamps and periods
t = pd.date_range('20180101', periods=10, freq='M')   # month end; spelled 'ME' for date_range in pandas 2.2+
t2 = pd.period_range('2018', '2019', freq='M')
ts = pd.Series(np.random.rand(len(t)), index=t)
print(ts)
print(ts.to_period())       # timestamps to periods
ts2 = pd.Series(np.random.rand(len(t2)), index=t2)
print(ts2)
print(ts2.to_timestamp())   # periods to timestamps
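The difference between a Timestamp (a point in time) and a Period (a span of time) shows up in a few attributes. A small sketch with arbitrary dates:

```python
import pandas as pd

p = pd.Period('2017-03', freq='M')   # the whole month of March 2017

print(p.start_time)   # first instant of the span
print(p.end_time)     # last instant of the span
print(p + 1)          # arithmetic moves in units of the frequency: April 2017

# Any timestamp inside March 2017 maps to the same monthly period
ts = pd.Timestamp('2017-03-15 08:30')
same = ts.to_period('M') == p
print(same)
```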
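Indexing and slicing
date = pd.DataFrame(
    np.random.rand(30).reshape(10, 3) * 100,
    # inclusive= replaces the older closed= argument (pandas 1.4+); '12H' is spelled '12h' in pandas 2.2+
    index=pd.date_range('20170101', '20170106', freq='12H', inclusive='left'),
    columns=['value1', 'value2', 'value3']
)
print(date)
print(date[:4])                           # first 4 rows
print(date.loc["20170104"].iloc[1])       # the row for 2017-01-04 12:00:00; use .loc for partial-date row selection
print(date.loc["20170104":'20170105'])    # slice by date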
resample(): resampling
date = pd.Series(
    np.arange(1, 13),
    index=pd.date_range('20170101', periods=12)
)
print(date)
ts = date.resample("5D")          # a Resampler object; nothing is computed yet
ts2 = date.resample("5D").sum()   # sum
print(ts, type(ts))
print(ts2, type(ts2))
print(date.resample("5D").mean())     # mean
print(date.resample("5D").max())      # maximum
print(date.resample("5D").min())      # minimum
print(date.resample("5D").median())   # median
print(date.resample("5D").first())    # first value
print(date.resample("5D").last())     # last value
print(date.resample("5D").ohlc())     # open/high/low/close, as used in finance

# Downsampling: closed= picks which bin edge is inclusive, label= picks which edge names the bin
print(date.resample("5D", closed='left').sum())
print(date.resample("5D", closed='right').sum())
print(date.resample("5D", label='left').sum())
print(date.resample("5D", label='right').sum())

# Upsampling and filling ('12H' is spelled '12h' in pandas 2.2+)
date2 = pd.DataFrame(
    np.arange(15).reshape(5, 3),
    index=pd.date_range('20170101', periods=5),
    columns=['a', 'b', 'c']
)
print(date2)
print(date2.resample("12H").asfreq())   # new rows are NaN
print(date2.resample("12H").ffill())    # fill with the previous row
print(date2.resample("12H").bfill())    # fill with the next row
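The effect of closed and label is easier to see on a tiny series. The sketch below uses six daily values and 2-day bins; the expected groupings in the comments follow from which bin edge is inclusive.

```python
import pandas as pd

s = pd.Series(range(1, 7), index=pd.date_range('2017-01-01', periods=6))

# Default for '2D': bins are closed on the left and labeled by their left edge
# [Jan 1, Jan 3) -> 1+2, [Jan 3, Jan 5) -> 3+4, [Jan 5, Jan 7) -> 5+6
left = s.resample('2D').sum()

# label='right' keeps the same bins but names each one by its right edge
right_label = s.resample('2D', label='right').sum()

# closed='right' moves the inclusive edge: (Dec 30, Jan 1] -> 1, (Jan 1, Jan 3] -> 2+3, ...
right_closed = s.resample('2D', closed='right').sum()

print(left)
print(right_label)
print(right_closed)
```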
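General methods
Numerical computation and basic statistics
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'key1': [4, 5, 6, np.nan, 7],
    'key2': [1, 2, np.nan, 9, 7],
    'key3': [2, 4, 5, 'j', 'k']
})
print(df1)
# key3 mixes numbers and strings, so restrict the sums to numeric columns
print(df1.sum(numeric_only=True))
print(df1.sum(axis=1, numeric_only=True))        # axis=1 sums across each row; the default is 0 (down each column)
print(df1.sum(skipna=True, numeric_only=True))   # skipna=True (the default) ignores NaN; with skipna=False any NaN makes the result NaN

df = pd.DataFrame({
    'key1': np.arange(1, 11),
    'key2': np.random.rand(10) * 100
})
print(df)
print(df.mean(), 'mean')
print(df.count(), 'number of non-NaN values per column')
print(df.min(), 'minimum')
print(df.max(), 'maximum')
print(df.quantile(q=0.5), 'quantile; q sets the position')
print(df.median(), 'median')
print(df.std(), 'standard deviation')   # use df.var() for the variance
print(df.skew(), 'sample skewness')
print(df.kurt(), 'sample kurtosis')

df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()    # cumulative sum
print(df)
df['key2_p'] = df['key2'].cumprod()
df['key1_p'] = df['key1'].cumprod()   # cumulative product
print(df)

s = pd.Series(list('aabcdfgfgtf'))
print(s)
print(s.unique())                   # unique values
print(s.value_counts(sort=True))    # how often each value occurs
print(s.isin(['a', 'o']))           # element-wise: is each value of the Series in the given list
print(df.isin([1, 4]))              # element-wise: is each value of the DataFrame in the given list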
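Two details from the block above that are easy to get wrong are the effect of skipna and the difference between std() and var(). A minimal sketch with fixed values so the numbers can be checked by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [4, 5, 6, np.nan, 7],
    'key2': [1, 2, np.nan, 9, 7],
})

col_sum = df.sum()              # NaN is skipped: key1 -> 22.0, key2 -> 19.0
strict = df.sum(skipna=False)   # any NaN in a column makes its sum NaN

# std() is the sample standard deviation (ddof=1); var() is its square
std = df['key1'].std()
var = df['key1'].var()

print(col_sum)
print(strict)
print(std, var)
```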
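Text data
s = pd.Series(['A', 'c', 'D', 'bbhello', 'b', np.nan])
df = pd.DataFrame({'key1': list('abcde'), 'key2': ['abc', 'AS', np.nan, 4, 'fa']})
print(df)
print(s)
print(s.str.count('b'))          # count occurrences of 'b' in each entry
print(s.str.upper())             # upper case
print(s.str.lower())             # lower case
print(s.str.len())               # length
print(s.str.startswith('a'))     # does the entry start with 'a'
print(s.str.endswith('a'))       # does the entry end with 'a'
print(s.str.strip())             # strip surrounding whitespace
print(s.str.replace('b', 'B'))   # replace; a pattern and a replacement are required
print(s.str.split(','))          # split on a separator
print(s.str[0])                  # character indexing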
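The .str methods can be chained, and missing values pass through as NaN rather than raising. A short sketch with made-up strings:

```python
import numpy as np
import pandas as pd

s = pd.Series([' Apple', 'banana ', 'a,b,c', np.nan])

stripped = s.str.strip()                            # remove surrounding whitespace
starts = stripped.str.lower().str.startswith('a')   # chain: lower-case, then test the prefix
parts = s.str.split(',')                            # each entry becomes a list
replaced = s.str.replace(',', '-', regex=False)     # literal (non-regex) replacement

print(stripped)
print(starts)
print(parts)
print(replaced)
```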
Merging
pd.merge(
    left,
    right,
    how="inner",             # "inner": intersection of keys; "outer": union of keys
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)
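The notes list the merge parameters without running them, so here is a minimal sketch of how and indicator on two small, made-up frames that share a key column:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [20, 30, 40]})

# inner: only keys present in both frames ('b', 'c')
inner = pd.merge(left, right, on='key', how='inner')

# outer: all keys from either frame; indicator=True adds a _merge column
# saying where each row came from (left_only / right_only / both)
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(inner)
print(outer)
```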
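Concatenating and patching
'''
pd.concat(
    objs,
    axis=0,                  # axis=0 stacks rows on rows; axis=1 places columns side by side
    join="outer",            # "outer": union; "inner": intersection
    join_axes=None,          # index to join on (removed in pandas 1.0; use reindex instead)
    ignore_index=False,
    keys=None,               # sequence, default None
    levels=None,
    names=None,
    verify_integrity=False,
    sort=None,
    copy=True,
)
'''
df1 = pd.DataFrame([[np.nan, 3, 5], [-4, 6, np.nan], [np.nan, 4, np.nan]])
df2 = pd.DataFrame([[-2, np.nan, 5], [5, 8, 19]], index=[1, 2])
print(df1)
print(df2)
print(df1.combine_first(df2))   # NaN in df1 is filled from df2
df1.update(df2)                 # df2 overwrites df1 in place at matching index positions (NaN in df2 is skipped)
print(df1)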
# df1 = pd.DataFrame(np.random.rand(8).reshape(4,2),index=['a','b','c','d'],columns=['values1','values2'])
# df2 = pd.DataFrame(np.random.rand(8).reshape(4,2),index=['e','f','g','h'],columns=['values1','values2'])
# print(df1)
# print(df2)
# print(pd.concat([df1,df2]))
df1 = pd.DataFrame(np.random.rand(8).reshape(4, 2), index=['a', 'b', 'c', 'd'], columns=['values1', 'values2'])
df2 = pd.DataFrame(np.arange(8).reshape(4, 2), index=['a', 'b', 'c', 'd'], columns=['values1', 'values2'])
df1.loc[['a', 'b'], 'values1'] = np.nan   # set with .loc in one step instead of chained indexing
print(df1)
print(df2)
print(df1.combine_first(df2))
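Since the concat call above is only shown commented out, here is a small runnable sketch of the axis, join and keys parameters. The frames and column names are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'values1': [1, 2], 'values2': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'values1': [5, 6], 'values3': [7, 8]}, index=['b', 'c'])

# axis=0 (default): stack rows; join='outer' keeps the union of columns
rows_outer = pd.concat([df1, df2], sort=False)

# join='inner' keeps only the columns both frames share
rows_inner = pd.concat([df1, df2], join='inner')

# axis=1: place frames side by side, aligned on the row index
cols = pd.concat([df1, df2], axis=1)

# keys= adds an outer index level recording which input each row came from
keyed = pd.concat([df1, df2], keys=['first', 'second'])

print(rows_outer)
print(rows_inner)
print(cols)
print(keyed)
```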
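Deduplication and replacement
# Deduplication
s = pd.Series([1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 56])
print(s)
print(s.duplicated())       # True for every repeat after the first occurrence
print(s[~s.duplicated()])   # keep only first occurrences
s_r = s.drop_duplicates()
print(s_r)

# Replacement
df = pd.Series(list('abcdeaade'))
print(df)
print(df.replace('a', 1))                    # single value
print(df.replace(['a', 'b'], 1))             # several values with one replacement
print(df.replace({'a': 123, 'd': 234}))      # dict of value -> replacement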
Grouping data
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'two', 'three', 'one', 'two', 'one'],
    'C': np.arange(1, 7),
    'D': np.arange(8, 14)
})
print(df)
a = df.groupby('A').mean(numeric_only=True)   # column B holds strings, so restrict to numeric columns
b = df.groupby(['A', 'B']).mean()
c = df.groupby('A')['D'].mean()               # group by A, mean of D
print(a, type(a))
print(b, type(b))
print(c, type(c))

# A groupby object is iterable
df1 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 2, 3, 4]})
print(df1)
print(list(df1.groupby('X')))      # list of (name, group) tuples
print(list(df1.groupby('X'))[0])   # one tuple
for n, g in df1.groupby('X'):
    print(n)
    print(g)
    print('++++++++++++')
print(df1.groupby('X').get_group('A'))   # extract a single group

# Grouping along the other axis (axis=1 in groupby is deprecated in recent pandas versions)
df = pd.DataFrame({
    'key1': ['a', 'b'],
    'key2': ['one', 'two'],
    'C': np.arange(1, 3),
    'D': np.arange(8, 10)
})
print(df)
print(df.dtypes)
for n, g in df.groupby(df.dtypes, axis=1):
    print(n)
    print(g)

# Grouping with a dict or a Series
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['a', 'b', 'c', 'd'])
date = {'a': 'one', 'b': 'two', 'c': 'one', 'd': 'two', 'e': 'three'}
by = df.groupby(date, axis=1)
print(by.sum())
s = pd.Series(date)
s_b = s.groupby(s).count()
print(s_b)

# Grouping with a function (applied to each index label)
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['a', 'b', 'c', 'd'], index=['abc', 'bcd', 'bb', 'a'])
s = df.groupby(len).sum()
print(s)

# Several aggregations at once: agg()
df = pd.DataFrame({
    'A': [1, 2, 1, 2],
    'B': np.arange(8, 12),
    'C': np.arange(1, 5),
    'D': np.arange(8, 12)
})
print(df)
print(df.groupby('A').agg(['mean', 'sum']))
# Renaming via a dict on a single column is no longer supported; use named aggregation
print(df.groupby('A')['B'].agg(result1='mean', result2='sum'))
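Named aggregation also works across several columns, and transform() is the companion that returns a result aligned with the original rows. A sketch under made-up data; the output column names c_mean, d_sum and C_centered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1, 2, 3, 4],
    'D': [10, 20, 30, 40],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby('A').agg(c_mean=('C', 'mean'), d_sum=('D', 'sum'))
print(summary)

# transform returns one value per original row, so it can be assigned back
df['C_centered'] = df['C'] - df.groupby('A')['C'].transform('mean')
print(df)
```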
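Reading files
pd.read_table(
    obj,                 # file path
    delimiter=',',       # character used to split fields
    header=0,            # row number to use as column names, default 0
    index_col=1          # column to use as the row index; otherwise an automatic 0, 1, ... index
)
pd.read_csv(
    obj,
    engine='python',     # parser engine: 'python' or 'c'
    encoding='utf8',     # character encoding of the file
)
pd.read_excel(
    obj,
    sheet_name=None,     # None returns all sheets; a list such as [0, 1] returns several; the default is the first sheet (older versions spelled this sheetname)
    header=0,            # row number to use as column names, default 0
    index_col=1          # column to use as the row index; otherwise an automatic 0, 1, ... index
)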
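To try these parameters without a file on disk, read_csv also accepts a file-like object. The sketch below feeds it an in-memory string through io.StringIO; the CSV content is made up.

```python
import io
import pandas as pd

csv_text = "id,name,age\n1,Jack,14\n2,Mary,15\n3,Tom,17\n"

# header=0: the first line holds the column names
# index_col=0: use the first column ('id') as the row index
df = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)
print(df)

# A different separator is passed with sep (delimiter is an alias)
semi = pd.read_csv(io.StringIO("a;b\n1;2\n"), sep=';')
print(semi)
```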