[python][數據分析] ARIMA 模型的個人驗證

本文轉載自查看原文 2019-01-31 18:15 1102 python/ 數據分析/ 系統運維

寫在最前：===========

大約兩周前，閱讀《智能運維》這本書，了解到 ARIMA 模型可以根據過去的數據，做預測分析，當時覺得很有意思，但是一直到最近才真正着手實踐，當然我這個實踐也是很粗糙的，更大的意義是通過動手的過程，帶來小小的成就感的同時，檢驗自己能力上的不足。

理論基礎 ===========

ARIMA，Autoregressive Integrated Moving Average model ，差分整合移動平均自回歸模型，又稱整合移動平均自回歸模型（移動也可稱作滑動），時間序列預測分析方法

每個時間序列的數據都有自身的特性，而預測分析的關鍵點在於，確定 ARIMA(p，d，q) 這個模型中 d， p ，q的合適的數值

d，對序列進行差分處理，處理n次之后，n階差分序列比較平穩，沒有上升或者下降的趨勢，那么d就是n這個階數。有時候指數級別的序列無法直接通過差分的方式獲取合適的d值，但是可以通過序列整體取log 對數處理之后的序列進行較少的差分階數，獲取d，比如 https://www.jianshu.com/p/4130bac8ebec 這篇對湖北的產值的分析就是這個思路。

p和 q的取值我就直接上個圖吧，理解很簡單說起來很麻煩

不過，有時候 p q，經過 acf 和pacf 兩個數據集獲取之后，最好經過一下集中模型進行一下殘差檢驗，為什么這么做，這超出我的能力范圍了，建議搭建看我上邊給的 jianshu鏈接文檔。

* AIC=-2 ln(L) + 2 k 中文名字：赤池信息量 akaike information criterion
* BIC=-2 ln(L) + ln(n)*k 中文名字：貝葉斯信息量 bayesian information criterion
* HQ=-2 ln(L) + ln(ln(n))*k hannan-quinn criterion
本文當中代碼這個案例，我目測用 7,1 比較合適，但是殘差測試結果好像8,0會更小，

應用場景======================

--普遍適用於各種以時間作為索引的序列數據，比如銷售額，交通流量統計，醫院患者人數等等，也有用它做股票分析，結果當然也就那樣。我個人理解，數據場景如果約封閉，也就意味着容易受到單一因素的變化而產生巨變這種，明顯不適合，或者時間段太短，周期不固定，其實都會影響該模型的可用性，

python環境關鍵點===================

包括 numpy，pandas，matplotlib等常用數據分析模塊，有一定使用基礎，肯定會幫助理解程序的過程。當然 import statsmodels.api as sm ，這個模型工具包，我也是直接粘貼過來用，里邊的什么函數，以后慢慢研究了。另外就是，過程當中，經常會提示本地某個模塊將會在最新的python3 模塊中被廢棄，實際上，暫時不會影響當前程序的執行。強烈推薦使用 anaconda 3的 python安裝集合，全家桶的方式省去很多麻煩。不用糾結python3和2的共存問題，數據分析類的程序，現在98

% 都是python3了。

其他問題===========================

這次，我發現，不管是np pd 的使用，還是mpl得繪圖的使用，都還有提升的必要，另外，預測分析后邊是經過反復實踐的，一方面是日常的運維工作，也包括日常的生活學習，暫時還是要采取用到什么就學習什么的大體思路，畢竟數據分析和建模這個大坑之大，對於我這個自然選擇號，也如無法掙脫的黑洞。 python的基本還是要多多提升一下。

相關參考：

1 ，https://blog.csdn.net/wenyusuran/article/details/82143978

# 我直接借用了該博主的程序和結論，該博主是從源網站直接閱讀文檔（http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/tsa_arma_0.html）自己進行的實驗，其實為我們這些后來人，排除了很多困難

2，

3，

上午看的 2和3 這兩篇文章，確定 d， p ，q的思路和方法

正文請看代碼和執行結果 ================

我建議還是去看人家博主的文章吧

# -*- coding:utf-8 -*-

from __future__ import print_function
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.api import qqplot

dta=[10930,10318,10595,10972,7706,6756,9092,10551,9722,10913,11151,8186,6422,6337,11649,11652,10310,12043,7937,6476,9662,9570,9981,9331,9449,6773,6304,9355,10477,10148,10395,11261,8713,7299,10424,10795,11069,11602,11427,9095,7707,10767,12136,12812,12006,12528,10329,7818,11719,11683,12603,11495,13670,11337,10232,
13261,13230,15535,16837,19598,14823,11622,19391,18177,19994,14723,15694,13248,9543,12872,13101,15053,12619,13749,10228,9725,14729,12518,14564,15085,14722,11999,9390,13481,14795,15845,15271,14686,11054,10395]

dta=np.array(dta,dtype=np.float)
#dta
dta=pd.Series(dta)
dta.index = pd.Index(sm.tsa.datetools.dates_from_range('2001','2090'))  
dta.plot(figsize=(12,8))
#plt.show()

# 一介差分
'''
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
diff1=dta.diff(1)
diff1.plot(ax=ax1)

fig=plt.figure(figsize=(12,8))
ax2=fig.add_subplot(212)
diff2=diff1.diff(1)
diff2.plot(ax=ax2)
'''
#plt.show()

# 二階效果平穩 后邊看 p，q的 情況 
"""
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(dta,lags=40,ax=ax1)
ax2 = fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(dta,lags=40,ax=ax2)
plt.show()
"""

# 從圖上看 我會基本選定為q=7 或者8，p=1或者0 
arma_mod70 = sm.tsa.ARMA(dta,(7,0)).fit()
print(arma_mod70.aic,arma_mod70.bic,arma_mod70.hqic)
arma_mod71 = sm.tsa.ARMA(dta,(7,1)).fit()
print(arma_mod71.aic,arma_mod71.bic,arma_mod71.hqic)
arma_mod80 = sm.tsa.ARMA(dta,(8,0)).fit()
print(arma_mod80.aic,arma_mod80.bic,arma_mod80.hqic)

# 我的結果也是和轉載者的一樣 其實是8-0 這種模型的擬合效果最好


# 檢驗殘差序列
"""
resid=arma_mod80.resid

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2)
plt.show()
"""

# 預測
predict_dta = arma_mod80.predict('2090', '2100', dynamic=True)
print(predict_dta)

fig,ax=plt.subplots(figsize=(12,8))
ax=dta.ix['2000':].plot(ax=ax)
fig=arma_mod80.plot_predict('2090','2100',dynamic=True,ax=ax,plot_insample=False)
plt.show()

print('ok')

C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\scipy\signal\signaltools.py:1333: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  out_full[ind] += zi
C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\scipy\signal\signaltools.py:1336: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  out = out_full[ind]
C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\scipy\signal\signaltools.py:1342: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  zf = out_full[ind]
C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:646: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  if issubdtype(paramsdtype, float):
C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:650: FutureWarning: Conversion of the second argument of issubdtype from `complex` to `np.complexfloating` is deprecated. In future, it will be treated as `np.complex128 == np.dtype(complex).type`.
  elif issubdtype(paramsdtype, complex):

1619.1918163496919 1641.6901033826643 1628.2644481568852

C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\statsmodels\base\model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

1605.6865601026789 1630.6846568059816 1615.7672621106715

C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\statsmodels\base\model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py:577: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  if issubdtype(paramsdtype, float):
C:\ProgramData\Anaconda2\envs\py3\lib\site-packages\ipykernel_launcher.py:73: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix

1597.9359953758615 1622.9340920791642 1608.016697383854
2090-12-31     9542.819752
2091-12-31    12907.841813
2092-12-31    13980.753562
2093-12-31    14500.615250
2094-12-31    13893.643654
2095-12-31    13249.389429
2096-12-31    10961.289060
2097-12-31    10073.380008
2098-12-31    12683.742111
2099-12-31    13475.897777
2100-12-31    13614.674168
Freq: A-DEC, dtype: float64

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Python數據分析 Python數據分析 Python之數據分析常用數據分析模型數據分析——AARRR模型數據分析模型常見的數據分析模型數據分析之漏斗模型數據分析八大模型：詳解PEST模型數據分析八大模型：詳解RFM模型

[python][數據分析] ARIMA 模型的 個人驗證

免責聲明！

[python][數據分析] ARIMA 模型的個人驗證