python數據分析實戰---Pandas


pandas的認識 :一個python的數據分析庫

安裝方式:pip  install pandas

pandas 是基於NumPy 的一種工具,該工具是為了解決數據分析任務而創建的。Pandas 納入了大量庫和一些標准的數據模型,提供了高效地操作大型數據集所需的工具。pandas提供了大量能使我們快速便捷地處理數據的函數和方法。你很快就會發現,它是使Python成為強大而高效的數據分析環境的重要因素之一.

  • 一個快速、高效的DataFrame對象,用於數據操作和綜合索引;
  • 用於在內存數據結構和不同格式之間讀寫數據的工具:CSV和文本文件、Microsoft Excel、SQL數據庫和快速HDF 5格式;
  • 智能數據對齊和丟失數據的綜合處理:在計算中獲得基於標簽的自動對齊,並輕松地將凌亂的數據操作為有序的形式;
  • 數據集的靈活調整和旋轉;
  • 基於智能標簽的切片、花式索引和大型數據集的子集;
  • 可以從數據結構中插入和刪除列,以實現大小可變;
  • 通過在強大的引擎中聚合或轉換數據,允許對數據集進行拆分應用組合操作;
  • 數據集的高性能合並和連接;
  • 層次軸索引提供了在低維數據結構中處理高維數據的直觀方法;
  • 時間序列-功能:日期范圍生成和頻率轉換、移動窗口統計、移動窗口線性回歸、日期轉換和滯后。甚至在不丟失數據的情況下創建特定領域的時間偏移和加入時間序列;
  • 對性能進行了高度優化,用Cython或C編寫了關鍵代碼路徑。
  • Python與Pandas在廣泛的學術和商業領域中使用,包括金融,神經科學,經濟學,統計學,廣告,網絡分析,等等。

pandas中文網 https://www.pypandas.cn

數據結構Series和Dataframe

一維數組Series

 Series是一維標記的數組,能夠保存任何數據類型(整數,字符串,浮點數,Python對象等)。軸標簽統稱為索引。

ar = np.random.rand(5)
s = pd.Series(ar)
print(s)
print(s.index)   #index查看series的值
print(s.values)  #values查看series的values
---------------------------------------------------
0    0.119383
1    0.247409
2    0.248272
3    0.410680
4    0.439547
dtype: float64
RangeIndex(start=0, stop=5, step=1)
[0.11938319 0.24740862 0.24827207 0.41068032 0.43954667]

創建series的三種方法:

字典創建
dit = {'a':1,'b':2,'c':3,'f':6}
s = pd.Series(dit)
print(s)
---------------------------------
a    1
b    2
c    3
f    6
dtype: int64
數組創建
ar = np.random.rand(5)*100
s = pd.Series(ar,index=list('abcde'),dtype=np.str)
print(s)
--------------------------------------------
a    31.644744342854725
b     6.783679074968873
c     6.753556225037693
d     43.71090526035562
e     65.35205915903558
dtype: object
通過標量創建
s = pd.Series(100,index=range(10))
print(s)
-------------------------------
0    100
1    100
2    100
3    100
4    100
5    100
6    100
7    100
8    100
9    100
dtype: int64

 name屬性

ar = np.random.rand(2)
s = pd.Series(ar)
s1 = pd.Series(ar,name='test')
print(s,type(s))
print(s1,type(s1))

s2 =s1.rename("abcd")
print(s2,type(s2))
print(s1,type(s1))
---------------------------
0    0.820561
1    0.330791
dtype: float64 <class 'pandas.core.series.Series'>
0    0.820561
1    0.330791
Name: test, dtype: float64 <class 'pandas.core.series.Series'>
0    0.820561
1    0.330791
Name: abcd, dtype: float64 <class 'pandas.core.series.Series'>
0    0.820561
1    0.330791
Name: test, dtype: float64 <class 'pandas.core.series.Series'>

索引

ar = np.random.rand(5)
s = pd.Series(ar)
print(s[0])   #下標索引
print(s[2])   #下標索引
s1 = pd.Series(ar,index=list('abcde'))
print(s1)
print(s1['a'])   #標簽索引
print(s1[0:3],s1[4])  #切片索引

#布爾型索引
ar = np.random.rand(5)*100
s2 = pd.Series(ar)
s2[6]=None
print(s2)
bs1 = s2>50
bs2 = s2.isnull()
bs3 = s2.notnull()
print(bs1)
print(bs2)
print(bs3)

print(s2[s2>50])
print(s2[bs3])
------------------------
0.61815875542277
0.019856009429792598
a    0.618159
b    0.823132
c    0.019856
d    0.737151
e    0.840799
dtype: float64
0.61815875542277
a    0.618159
b    0.823132
c    0.019856
dtype: float64 0.8407993638916321
0    9.29894
1    84.7848
2    24.4915
3    59.9761
4    91.5569
6       None
dtype: object
0    False
1     True
2    False
3     True
4     True
6    False
dtype: bool
0    False
1    False
2    False
3    False
4    False
6     True
dtype: bool
0     True
1     True
2     True
3     True
4     True
6    False
dtype: bool
1    84.7848
3    59.9761
4    91.5569
dtype: object
0    9.29894
1    84.7848
2    24.4915
3    59.9761
4    91.5569
dtype: object

 其他屬性

ar = np.random.randint(100,size=10)
s = pd.Series(ar,index=list('abcdefgjkl'))
print(s)
print(s.head())  #查看前5個
print(s.tail())  #查看后5個
s['a','e','f']=100  #修改
s.drop('b',inplace=True) #刪除
s['o'] = 500   #添加
print("++++",s)


#重新索引
s1 = pd.Series(np.random.rand(5),index=list('abcde'))
s2 = s1.reindex(['b','c','d','e','f'])
print(s2)

#對齊
d = pd.Series(np.random.rand(3),index=['Tom','Marry','Jam'])
d2 = pd.Series(np.random.rand(3),index=['Tom','Lucy','Jam'])
print(d)
print(d2)
print(d2+d)
--------------------------------
a    75
b    45
c    86
d     0
e    29
f     8
g    41
j    51
k    30
l    58
dtype: int32
a    75
b    45
c    86
d     0
e    29
dtype: int32
f     8
g    41
j    51
k    30
l    58
dtype: int32
++++ a    100
c     86
d      0
e    100
f    100
g     41
j     51
k     30
l     58
o    500
dtype: int64
b    0.962842
c    0.061086
d    0.135772
e    0.845562
f         NaN
dtype: float64
Tom      0.828716
Marry    0.383809
Jam      0.600144
dtype: float64
Tom     0.048050
Lucy    0.379492
Jam     0.072854
dtype: float64
Jam      0.672999
Lucy          NaN
Marry         NaN
Tom      0.876766
dtype: float64

 

二維數組Dataframe

DataFrame是一個二維標記數據結構,具有可能不同類型的列。您可以將其視為電子表格或SQL表,或Series對象的字典。它通常是最常用的pandas對象。與Series一樣,DataFrame接受許多不同類型的輸入:

  • 1D ndarray,list,dicts或Series的Dict
  • 二維numpy.ndarray
  • 結構化或記錄 ndarray
  • 一個 Series
  • 另一個 DataFrame

除了數據,您還可以選擇傳遞索引(行標簽)和 列(列標簽)參數。如果傳遞索引和/或列,則可以保證生成的DataFrame的索引和/或列。因此, Series 的字典加上特定索引將丟棄與傳遞的索引不匹配的所有數據。

創建DataFrame的5中方式:

  由list和數組創建

#由list和數組創建
data = {
    'name':['Jack','Mary','Tom'],
    'age':[14,15,17],
    'gender':['M','W','M']
}
fr = pd.DataFrame(data)
print(fr)
print(type(fr))
print(fr.index,'數據類型是:',type(fr.index))   #行標簽
print(fr.values,'數據類型是:',type(fr.values)) #值
print(fr.columns,'數據類型是:',type(fr.columns))  #列標簽
-----------------------------------------------------------------
   name  age gender
0  Jack   14      M
1  Mary   15      W
2   Tom   17      M
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1) 數據類型是: <class 'pandas.core.indexes.range.RangeIndex'>
[['Jack' 14 'M']
 ['Mary' 15 'W']
 ['Tom' 17 'M']] 數據類型是: <class 'numpy.ndarray'>
Index(['name', 'age', 'gender'], dtype='object') 數據類型是: <class 'pandas.core.indexes.base.Index'>
由Series組成的創建
#由Series組成的創建
data1 = {'one':pd.Series(np.random.rand(2)),
         'two':pd.Series(np.random.rand(3)),
         }
print(data1)
data2 = {'one':pd.Series(np.random.rand(2),index=['a','b']),
         'two':pd.Series(np.random.rand(3),index=['a','b','c']),
         }
print(data2)

fr1 = pd.DataFrame(data1)
fr2 = pd.DataFrame(data2)
print(fr1)
print(fr2)
-----------------------------
{'one': 0    0.432652
1    0.552177
dtype: float64, 'two': 0    0.946339
1    0.326405
2    0.352883
dtype: float64}
{'one': a    0.353147
b    0.176789
dtype: float64, 'two': a    0.121450
b    0.371344
c    0.240906
dtype: float64}
        one       two
0  0.432652  0.946339
1  0.552177  0.326405
2       NaN  0.352883
        one       two
a  0.353147  0.121450
b  0.176789  0.371344
c       NaN  0.240906 
通過二維數組創建 (常用)
#通過二維數組創建 (常用)

ar = np.random.rand(9).reshape(3,3)
print(ar)
fr3 = pd.DataFrame(ar)
fr4 = pd.DataFrame(ar,index=['a','b','c'],columns=['s','h','j'])
print(fr3)
print(fr4)
------------------------
[[0.80857571 0.31437002 0.00130739]
 [0.24521627 0.04577992 0.19544072]
 [0.23923237 0.26033495 0.17534313]]
          0         1         2
0  0.808576  0.314370  0.001307
1  0.245216  0.045780  0.195441
2  0.239232  0.260335  0.175343
          s         h         j
a  0.808576  0.314370  0.001307
b  0.245216  0.045780  0.195441
c  0.239232  0.260335  0.175343 
字典組成的列表
data3 = [{'one':1,'two':2,'three':3},{'four':4,'five':5,'six':6}]
fr5 = pd.DataFrame(data3)
print(fr5)
---------------------
   one  two  three  four  five  six
0  1.0  2.0    3.0   NaN   NaN  NaN
1  NaN  NaN    NaN   4.0   5.0  6.0
字典組成的字典
data4 = {
    'Tom':{'art':67,'english':98,'china':76},
    'Mary':{'art':45,'english':78,'china':70},
    'Lucy':{'art':58,'english':79},
        }
fr6 = pd.DataFrame(data4)
print(fr6)
---------------------------
        Tom  Mary  Lucy
art       67    45  58.0
english   98    78  79.0
china     76    70   NaN

 索引

import  numpy as np
import pandas as pd


df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index=['one','two','three'],columns=['a','b','c','d']
                  )
df2 = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  columns=['a','b','c','d']
                  )

print(df)
print(df2)

data = df['a']
data1 = df[['a','b']]
print('data',data)
print('data1',data1)     #選擇列

data3 = df.loc['one']    #按index選擇行
data4 = df.loc[['three','one']]
print('data3',data3)
print('data4',data4)   #選擇行

data5 = df.iloc[-1]  #按整數位置選擇行
print('data5',data5)

print("單標簽索引/n")
print(df.loc['one'])
print(df2.loc[1])

print("多標簽索引/n")
print(df.loc[['two','one']])
print(df2.loc[[2,1]])
#
print("切片索引/n")
print(df.loc['one':'two'])
print(df2.loc[1:2])
---------------------------
               a          b          c          d
one    79.285201  73.277718  12.225063  18.830074
two     2.400540  49.604940  80.337070  47.133134
three  17.399693  92.839253  90.041425  75.505320
           a          b          c          d
0  47.065633  21.284022  30.118641  85.652279
1  12.201863  48.841603  23.367143  32.276774
2  77.422617  55.812583  56.130735  64.983035
data one      79.285201
two       2.400540
three    17.399693
Name: a, dtype: float64
data1                a          b
one    79.285201  73.277718
two     2.400540  49.604940
three  17.399693  92.839253
data3 a    79.285201
b    73.277718
c    12.225063
d    18.830074
Name: one, dtype: float64
data4                a          b          c          d
three  17.399693  92.839253  90.041425  75.505320
one    79.285201  73.277718  12.225063  18.830074
data5 a    17.399693
b    92.839253
c    90.041425
d    75.505320
Name: three, dtype: float64
單標簽索引/n
a    79.285201
b    73.277718
c    12.225063
d    18.830074
Name: one, dtype: float64
a    12.201863
b    48.841603
c    23.367143
d    32.276774
Name: 1, dtype: float64
多標簽索引/n
             a          b          c          d
two   2.400540  49.604940  80.337070  47.133134
one  79.285201  73.277718  12.225063  18.830074
           a          b          c          d
2  77.422617  55.812583  56.130735  64.983035
1  12.201863  48.841603  23.367143  32.276774
切片索引/n
             a          b          c          d
one  79.285201  73.277718  12.225063  18.830074
two   2.400540  49.604940  80.337070  47.133134
           a          b          c          d
1  12.201863  48.841603  23.367143  32.276774
2  77.422617  55.812583  56.130735  64.983035

布爾值索引

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index=['one','two','three'],columns=['a','b','c','d']
                  )
b1 = df<20
print(b1,type(b1))
print(df[b1])   #不做索引對每一個值進行判斷

b2 = df["a"]<20
print(b2,type(b2))
print(df[b2])   #單行判斷

b3 = df[["a",'b']]<20
print(b3,type(b3))
print(df[b3])   #多行判斷

b4 = df.loc[["one",'two']]<50
print(b4,type(b4))
print(df[b4])   #多行判斷

-------------------------------------
           a      b      c      d
one     True   True  False  False
two    False  False  False  False
three   True  False  False  False <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one    12.319044  16.517952 NaN NaN
two          NaN        NaN NaN NaN
three   8.939486        NaN NaN NaN
one       True
two      False
three     True
Name: a, dtype: bool <class 'pandas.core.series.Series'>
               a          b          c          d
one    12.319044  16.517952  97.270662  76.200591
three   8.939486  38.428862  25.783585  30.355222
           a      b
one     True   True
two    False  False
three   True  False <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one    12.319044  16.517952 NaN NaN
two          NaN        NaN NaN NaN
three   8.939486        NaN NaN NaN

  多重索引

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index=['one','two','three'],columns=['a','b','c','d']
                  )

print(df['a'].loc['one'])
print(df['a'].loc[['one','two']])
print(df[['b','c','d']].iloc[1:2])
print(df[df<50][['a','b']])
--------------------------------
31.995689334678335
one    31.995689
two     6.516284
Name: a, dtype: float64
             b          c          d
two  19.048351  31.111981  60.956516
               a          b
one    31.995689  38.992923
two     6.516284  19.048351
three        NaN  31.623816

 其他屬性

import  numpy as np
import pandas as pd


df = pd.DataFrame(np.random.rand(10).reshape(5,2),
                  columns=['a','b']
                  )

print(df)
print(df.T)   #轉置
print(df.head(2))  #前2列
print(df.tail(2))  #后2列

df2 = pd.DataFrame(np.random.rand(16).reshape(4,4),
                  columns=['a','b','c','d']
                  )
print(df2)
df2['c']=100  #修改
df2['e']=10   #添加
print(df2)


print(df.drop(0,inplace=True))  #刪除行,inplace  刪除后生成新數據,不改變原數據
print(df.drop(['a'],axis=1))  #刪除列,axis=1  刪除后生成新數據,不改變原數據


#對齊


#排序
print(df.sort_values(['a'],ascending=True))  #升序
print(df.sort_values(['a'],ascending=True))  #降序

-----------------------------------------------------------------
          a         b
0  0.940085  0.181402
1  0.536894  0.488670
2  0.217216  0.854319
3  0.478155  0.066919
4  0.467400  0.194862
          0         1         2         3         4
a  0.940085  0.536894  0.217216  0.478155  0.467400
b  0.181402  0.488670  0.854319  0.066919  0.194862
          a         b
0  0.940085  0.181402
1  0.536894  0.488670
          a         b
3  0.478155  0.066919
4  0.467400  0.194862
          a         b         c         d
0  0.849237  0.284547  0.353720  0.470520
1  0.294418  0.909727  0.375445  0.975046
2  0.588561  0.386173  0.703177  0.341634
3  0.180870  0.831200  0.392450  0.036837
          a         b    c         d   e
0  0.849237  0.284547  100  0.470520  10
1  0.294418  0.909727  100  0.975046  10
2  0.588561  0.386173  100  0.341634  10
3  0.180870  0.831200  100  0.036837  10
None
          b
1  0.488670
2  0.854319
3  0.066919
4  0.194862
          a         b
2  0.217216  0.854319
4  0.467400  0.194862
3  0.478155  0.066919
1  0.536894  0.488670
          a         b
2  0.217216  0.854319
4  0.467400  0.194862
3  0.478155  0.066919
1  0.536894  0.488670

時間模塊

datetime.datetime()

t1 = datetime.datetime.now()
t2 = datetime.datetime(2016,2,5)
t3 = datetime.datetime(2016,2,5,12,30,34)
print(t1,type(t1))
print(t2,type(t2))
print(t3,type(t3))
---------------------------
2019-09-17 12:25:54.780962 <class 'datetime.datetime'>
2016-02-05 00:00:00 <class 'datetime.datetime'>
2016-02-05 12:30:34 <class 'datetime.datetime'>

  datetime.delta()

t1 = datetime.datetime(2016,4,6)
t2 = datetime.timedelta(10,200)  #默認(天,秒)
print(t1+t2)
---------------------------
2016-04-16 00:03:20  
時間格式的轉化
from dateutil.parser import parse

date = "2015 2 20"
date1 = "2015-3-25"
date2 = "2016/3/8"
print(parse(date))
print(parse(date1))
print(parse(date2))

----------------------------
2015-02-20 00:00:00
2015-03-25 00:00:00
2016-03-08 00:00:00

  pd.timeStamp()時間戳

date1 = "2017-05-01 12:25:12"
date2 = datetime.datetime(2017,5,6,14,15,23)
t1 = pd.Timestamp(date1)  #時間戳
t2 = pd.Timestamp(date2)
print(t1,type(t1))
print(t2,type(t2))
print(date2,type(date2))
---------------------------------
2017-05-01 12:25:12 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-05-06 14:15:23 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-05-06 14:15:23 <class 'datetime.datetime'>

  pd.to_datetime()

date1 = "2017-05-01 12:25:12"
date2 = datetime.datetime(2017,5,6,14,15,23)

t1 = pd.to_datetime(date1)  #時間戳
t2 = pd.to_datetime(date2)
print(t1,type(t1))
print(t2,type(t2))


#多個時間數據,會轉化成pandas的Datetime的Index
ls_date = ['2017-01-01','2017-01-02','2017-01-03']
t3 = pd.to_datetime(ls_date)
print(t3,type(t3))

#當一組數據中夾雜着其他的數組
date3 = ['2017-01-01','2017-01-02','2017-01-03','hello','2018-01-05']
t4 = pd.to_datetime(date3,errors='ignore')  #返回原始數據,這里直接是生成一組數據
print(t4,type(t4))

t5 = pd.to_datetime(date3,errors='coerce')  #缺失值返回Nat,結果是DatetimeIndex
print(t5,type(t5))

------------------------------------------------
2017-05-01 12:25:12 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-05-06 14:15:23 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Index(['2017-01-01', '2017-01-02', '2017-01-03', 'hello', '2018-01-05'], dtype='object') <class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', 'NaT', '2018-01-05'], dtype='datetime64[ns]', freq=None) <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
date_range(start,end,periods,freq)
'''
date_range(start,end,periods,freq)
    start:開始時間
    end:結束時間
    periods:偏移量
    freq:頻率  默認,天  pd.date_range()默認頻率為日歷日  pd.bdate_range()默認頻率為工作日
'''
date = pd.date_range('2014-01-01','2014-02-01')
date1 = pd.date_range(start='2014-01-01',periods=10)
date2 = pd.date_range(end='2014-01-01',periods=10)
date3 = pd.date_range('2014-01-01','2014-01-02',freq="H")
print(date)
print(date1)
print(date2)
print(date3)

#normalize 時間參數值正則化到午夜時間戳
date4 = pd.date_range('2019-05-01 12:25:00',periods=2,name='hello',normalize=True)
print(date4)

print(pd.date_range('20190101','20190105'))   #默認左右閉合
print(pd.date_range('20190101','20190105',closed='right')) #右開左閉
print(pd.date_range('20190101','20190105',closed='left'))  #左開右閉
print(pd.bdate_range('20190101','20190107'))  #默認頻率是工作日

#日期范圍頻率
print(pd.date_range('20190101','20190110'))  #默認是天
print(pd.date_range('20190101','20190110',freq='B')) #每工作日
print(pd.date_range('20190101','20190110',freq='H'))  #每小時
print(pd.date_range('20190101 12:00','20190110 12:20',freq='T'))  #每分鍾
# print(pd.date_range('20190101 12:00:00','20190110 12:20:01',freq='S'))  #每秒
# print(pd.date_range('20190101 12:00:00','20190110 12:20:01',freq='L'))  #每毫秒(千分之一秒)
# print(pd.date_range('20190101 12:00:00','20190110 12:20:01',freq='U'))  #每微秒(百萬分之一秒)


#星期的縮寫:MON-TUE-WED-THU-FRI-SAT-SUN
print(pd.date_range('20190101','20190210',freq='W-MON'))  #從指定星期幾開始算起,每周一
print(pd.date_range('20190101','20190210',freq='WOM-2MON'))  #每月的第幾個星期幾開始算,這里是每月第二個星期一

#月份縮寫:JAN/FEB/MAR/APR/MAY/JUE/JUL/AUG/SEP/OCT/NOV/DEC
print(pd.date_range('2018','2020',freq='M'))#每月最后一個日歷日
print(pd.date_range('2018','2020',freq='Q-DEC'))#Q月,指定月為季度末,每個季度末的最后一個月的最后一個日歷日
print(pd.date_range('2018','2020',freq='A-DEC'))#A月,每年指定月份的最后一個日歷日

print(pd.date_range('2018','2020',freq='BM'))#每月最后一個工作日
print(pd.date_range('2018','2020',freq='BQ-DEC'))#Q月,指定月為季度末,每個季度末的最后一個月的最后一個工作日
print(pd.date_range('2018','2020',freq='BA-DEC'))#A月,每年指定月份的最后一個工作日

print(pd.date_range('2018','2020',freq='MS'))#每月第一個日歷日
print(pd.date_range('2018','2020',freq='QS-DEC'))#Q月,指定月為季度末,每個季度末的最后一個月的第一個日歷日
print(pd.date_range('2018','2020',freq='AS-DEC'))#A月,每年指定月份的第一個日歷日

print(pd.date_range('2018','2020',freq='BMS'))#每月第一個工作日
print(pd.date_range('2018','2020',freq='BQS-DEC'))#Q月,指定月為季度末,每個季度末的最后一個月的第一個工作日
print(pd.date_range('2018','2020',freq='BAS-DEC'))#A月,每年指定月份的第一個工作日


#復合頻率
print(pd.date_range('20180701','20180801',freq='7D'))#7天
print(pd.date_range('20180701','20180801',freq='2H30min'))#2小時30分鍾
print(pd.date_range('2018','2019',freq='2MS'))#2月,每月最后一個日歷日

  asfreq()  時間頻率轉化

date = pd.Series(
    np.random.rand(4),
    index=pd.date_range('20180101','20180104')
)
print(date)
print(date.asfreq('4H'))  #值是NAN
print(date.asfreq('4H',method='ffill'))  #用前面值填充
print(date.asfreq('4H',method='bfill'))  #用后面的值填充

  shift()超前、滯后數據

date = pd.Series(
    np.random.rand(4),
    index=pd.date_range('20180101','20180104')
)
print(date)
print(date.shift(2))  #前移2位
print(date.shift(-2))  #后移2位

  period()時期

#創建時期
date = pd.Period('2017',freq='M')
print(date)

date2 = pd.period_range('2017','2018',freq='M')
print(date2,type(date2))

#時間戳與日期之間的轉化
t = pd.date_range('20180101',periods=10,freq='M')
t2 = pd.period_range('2018','2019',freq='M')

ts = pd.Series(np.random.rand(len(t)),index=t)
print(ts)
print(ts.to_period())  #時間戳轉日期

ts2 = pd.Series(np.random.rand(len(t2)),index=t2)
print(ts2)
print(ts2.to_timestamp())  #日期轉時間戳

  索引與切片

date = pd.DataFrame(
    np.random.rand(30).reshape(10,3)*100,
    index = pd.date_range('20170101','20170106',freq='12H',closed='left'),
    columns=['value1','value2','value3']
)
print(date)
print(date[:4])  #前4行
print(date["20170104"].iloc[1])   #取20170104 12:00:00的值
print(date.loc["20170104":'20170105'])  #切片

  resample()重采樣

date = pd.Series(
    np.arange(1,13),
    index=pd.date_range('20170101',periods=12)
)
print(date)
#
ts = date.resample("5D")
ts2 = date.resample("5D").sum()  #求和
print(ts,type(ts))
print(ts2,type(ts2))

print(date.resample("5D").mean() ) #求平均數
print(date.resample("5D").max() ) #求最大
print(date.resample("5D").min() ) #求最小
print(date.resample("5D").median() ) #求中值
print(date.resample("5D").first() ) #求第一個
print(date.resample("5D").last() ) #求最后一個
print(date.resample("5D").ohlc() ) #金融中的OHLC樣本


#降采樣
print(date.resample("5D",closed='left').sum() )
print(date.resample("5D",closed='right').sum() )

print(date.resample("5D",label='left').sum() )
print(date.resample("5D",label='right').sum() )


#升采樣及插值
date2 = pd.DataFrame(
    np.arange(15).reshape(5,3),
    index=pd.date_range('20170101',periods=5),
    columns=['a','b','c']
)
print(date2)
print(date2.resample("12H").asfreq())
print(date2.resample("12H").ffill())
print(date2.resample("12H").bfill())

通用方法

數值計算和統計基礎

 

import  numpy as np
import pandas as pd


df1 = pd.DataFrame(
    {'key1':[4,5,6,np.nan,7],
     'key2':[1,2,np.nan,9,7],
     'key3':[2,4,5,'j','k']
     })

print(df1)
print(df1.sum())
print(df1.sum(axis=1))  #axis=1 按行計算  默認是0
print(df1.sum(skipna=True))  #skipna  是否忽略NaN,默認是True,由NaN的值計算結果還是NaN


df = pd.DataFrame(
    {'key1':np.arange(1,11),
     'key2':np.random.rand(10)*100}
)
print(df)
print(df.mean(),'求均值')
print(df.count(),'統計每列非NaN的數量')
print(df.min(),'最小值')
print(df.max(),'最大值')
print(df.quantile(q=0.5),'統計分數位,參數q確定位置')
print(df.median(),'算數中位數')
print(df.std(),'方差')
print(df.skew(),'樣本的偏度')
print(df.kurt(),'樣本的峰度')

df['key1_s']=df['key1'].cumsum()
df['key2_s']=df['key2'].cumsum()   #樣本的累計和
print(df)

df['key2_p']=df['key2'].cumprod()
df['key1_p']=df['key1'].cumprod()  #樣本的累計積
print(df)


s = pd.Series(list('aabcdfgfgtf'))
print(s)
print(s.unique()) #唯一值
print(s.value_counts(sort=True)) #計算樣本出現的頻率

print(s.isin(['a','o']))   #是否在該series成員里面
print(df.isin([1,4]))   #是否在該Dateframe成員里面
   

文本數據

s = pd.Series(['A','c','D','bbhello','b',np.nan])
df = pd.DataFrame({'key1':list('abcde'),'key2':['abc','AS',np.nan,4,'fa']})
print(df)
print(s)

print(s.str.count('b'))#統計每行的'b'
print(s.str.upper()) #大寫
print(s.str.lower())#小寫
print(s.str.len())#長度
print(s.str.startswish('a'))#判斷起始值
print(s.str.endswish('a'))#判斷結束值
print(s.str.strip())#去空格
print(s.str.replace())#代替
print(s.str.split(','))#分裂
print(s.str[0])#字符索引

合並

合並
pd.merge(
    left,
    right,
    how="inner"交集, how='outer'並集
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)

連接、修補

'''
pd.concat(
    objs,
    axis=0,  行+行  axis=1 列+列
    join="outer",  #並集   inner交集
    join_axes=None,  #指定聯合index
    ignore_index=False,
    keys=None,   序列,默認無,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=None,
    copy=True,)
      
'''
df1 =pd.DataFrame([[np.nan,3,5],[-4,6,np.nan],[np.nan,4,np.nan]])
df2 =pd.DataFrame([[-2,np.nan,5],[5,8,19]],index=[1,2])
print(df1)
print(df2)
print(df1.combine_first(df2))  #df1的空值被df2值代替
df1.update(df2)  #df2直接覆蓋df1 相同的index的位置
print(df1) 
# df1 = pd.DataFrame(np.random.rand(8).reshape(4,2),index=['a','b','c','d'],columns=['values1','values2'])
# df2 = pd.DataFrame(np.random.rand(8).reshape(4,2),index=['e','f','g','h'],columns=['values1','values2'])
# print(df1)
# print(df2)
# print(pd.concat([df1,df2]))

df1 = pd.DataFrame(np.random.rand(8).reshape(4,2),index=['a','b','c','d'],columns=['values1','values2'])
df2 = pd.DataFrame(np.arange(8).reshape(4,2),index=['a','b','c','d'],columns=['values1','values2'])
df1['values1']['a','b']=np.nan
print(df1)
print(df2)
print(df1.combine_first(df2))

去重、替換

#去重
s=pd.Series([1,1,2,2,2,3,3,3,4,5,5,56])
print(s)
print(s.duplicated())
print(s[s.duplicated() == False])

s_r = s.drop_duplicates()
print(s_r)

#替換
df=pd.Series(list('abcdeaade'))
print(df)
print(df.replace('a',1))
print(df.replace(['a','b'],1))
print(df.replace({'a':123,'d':234}))

數據分組

 

df = pd.DataFrame({
    'A':['foo','bar','foo','bar','foo','bar'],
    'B':['one','two','three','one','two','one'],
    'C':np.arange(1,7),
    'D':np.arange(8,14)
})
print(df)
a = df.groupby('A').mean()
b = df.groupby(['A','B']).mean()
c = df.groupby('A')['D'].mean() #以A分組,算D的均值
print(a,type(a))
print(b,type(b))
print(c,type(c))

#分組---可迭代的對象
df1 = pd.DataFrame({'X':['A','B','A','B'],'Y':[1,2,3,4]})
print(df1)
print(list(df1.groupby('X')))  #列表
print(list(df1.groupby('X'))[0])  #元組
for n,g in df1.groupby('X'):
    print(n)
    print(g)
print('++++++++++++')
print(df1.groupby('X').get_group('A'))  #提取分組后的組

#其他軸上分組
df = pd.DataFrame({
    'key1':['a','b'],
    'key2':['one','two'],
    'C':np.arange(1,3),
    'D':np.arange(8,10)
})
print(df)
print(df.dtypes)
for n,g in df.groupby(df.dtypes,axis=1):
    print(n)
    print(g)


#通過字典或者Series分組
df = pd.DataFrame(np.arange(16).reshape(4,4),columns=['a','b','c','d'])
date = {'a':'one','b':'two','c':'one','d':'two','e':'three'}
by= df.groupby(date,axis=1)
print(by.sum())

s = pd.Series(date)
s_b = s.groupby(s).count()
print(s_b)


#通過函數分組
df = pd.DataFrame(np.arange(16).reshape(4,4),columns=['a','b','c','d'],index=['abc','bcd','bb','a'])
s = df.groupby(len).sum()
print(s)

#多函數計算 agg()
df = pd.DataFrame({
    'A':[1,2,1,2],
    'B':np.arange(8,12),
    'C':np.arange(1,5),
    'D':np.arange(8,12)
})
print(df)
print(df.groupby('A').agg(['mean',np.sum]))
print(df.groupby('A')['B'].agg({'result1':np.mean,'result2':np.sum}))

 

文件讀取

 

 pd.read_table(
            obj,   文件路徑
            delimiter=',', 用於拆分字符,
            header=0,   用作列名序號,默認為0
            index_col=1 指定某列為行索引,否則自動索引0,1...
        )
pd.read_csv(
            obj,
            engine='python', 使用的分析引擎 可以選擇python或者C
            encoding='utf8',  指定字符集類型,編碼類型
        ) 
pd.read_excel(
            obj,
            sheetname=None, 返回多表使用sheetname=[0,1],默認返回全部
            header=0,   用作列名序號,默認為0
            index_col=1 指定某列為列索引,否則自動索引0,1...
            
        )

  


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM