Honestly, I'm not sure what I jotted down in this one; let me tidy it up.
time series
Time series data is data that changes (or stays constant) over time.
Time series plots
You can draw subplots with shared axes to see how several attributes trend over time.
import matplotlib.pyplot as plt
# Plot the time series in each dataset (sharing the y-axis for comparison)
fig, axs = plt.subplots(2, 1, figsize=(5, 10), sharey=True)
data.iloc[:1000].plot(y='data_values', ax=axs[0])
data2.iloc[:1000].plot(y='data_values', ax=axs[1])
plt.show()
Reviewing once more: df[[...]] extracts a subset of a DataFrame.
Double brackets select sub-columns (and keep the result a DataFrame).
from sklearn.svm import LinearSVC
# Construct data for the model (double brackets keep the result a DataFrame)
X = data[["petal length (cm)", "petal width (cm)"]]
y = data[['target']]
# Fit the model (note: LinearSVC expects a 1-D y, so a column vector
# triggers a DataConversionWarning; y.values.ravel() would avoid it)
model = LinearSVC()
model.fit(X, y)
There is also the case where you want everything in a single figure with a shared y-axis, as in the example above.
reshape
Gives an array (or DataFrame) a new shape without changing its data.
The most common case is reshape(-1, 1), which turns the data into a single column; MATLAB has a function that does the same when flattening image pixels into a column, though I forget what it's called.
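A minimal sketch of what reshape(-1, 1) does (the -1 asks NumPy to infer that dimension):
import numpy as np
arr = np.arange(6)        # shape (6,)
col = arr.reshape(-1, 1)  # shape (6, 1): one column, row count inferred
print(col.shape)          # (6, 1)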
predict
Generally, once a model is fitted you can call predict directly to make predictions.
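For example, continuing the LinearSVC snippet above (X_test here is a hypothetical held-out set with the same columns as X):
predictions = model.predict(X_test)  # one predicted label per row of X_test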
glob
The main function in the glob module is glob(), which returns a list of all matching file paths. It takes one argument, a pattern string (either an absolute or a relative path); the returned file names only cover the given directory, not its subdirectories.
librosa
librosa is a very powerful third-party Python library for audio signal processing (cnblog)
librosa.stft: short-time Fourier transform (STFT)
amplitude_to_db: converts an ordinary amplitude spectrogram to a dB-scaled spectrogram.
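A minimal sketch of these two calls (assuming audio is a 1-D waveform array, loaded e.g. with librosa.load as in the snippet below):
import numpy as np
import librosa as lr
spec = lr.stft(audio)                       # complex STFT matrix (frequencies x frames)
spec_db = lr.amplitude_to_db(np.abs(spec))  # magnitude spectrogram on a dB scale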
import numpy as np
import librosa as lr
from glob import glob
# List all the wav files in the folder (data_dir holds the folder path)
audio_files = glob(data_dir + '/*.wav')
# Read in the first audio file, create the time array
audio, sfreq = lr.load(audio_files[0])
time = np.arange(0, len(audio)) / sfreq
# Plot audio over time
fig, ax = plt.subplots()
ax.plot(time, audio)
ax.set(xlabel='Time (s)', ylabel='Sound Amplitude')
plt.show()
pd.to_datetime()
You can convert a column (or the index) to datetime and use it as the time index.
# Read in the data
data = pd.read_csv('prices.csv', index_col=0)
# Convert the index of the DataFrame to datetime
data.index = pd.to_datetime(data.index)
print(data.head())
# Loop through each column, plot its values over time
fig, ax = plt.subplots()
for column in data.columns:
    data[column].plot(ax=ax, label=column)
ax.legend()
plt.show()
raw data
A few more functions worth noting.
Before training a model, you should visualize the raw data.
arange
NumPy's arange() is mainly for generating arrays; it does not sort the way R's arrange() does.
np.vstack(): stacks arrays vertically
np.hstack(): stacks arrays horizontally
.T.ravel() transposes the array, then unravels it into a 1-D vector for looping
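A quick sketch of these helpers working together:
import numpy as np
a = np.arange(3)             # array([0, 1, 2])
b = np.arange(3, 6)          # array([3, 4, 5])
np.vstack([a, b])            # shape (2, 3): a and b stacked as rows
np.hstack([a, b])            # shape (6,): a and b laid end to end
np.vstack([a, b]).T.ravel()  # array([0, 3, 1, 4, 2, 5]): transpose, then flatten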
fig, ax = plt.subplots() is equivalent to:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
fig, ax = plt.subplots(1, 3): the arguments 1 and 3 are the numbers of rows and columns, for 1x3 subplots in total. The function returns a figure and an array of subplot axes.
plt.subplot(1, 3, 1) (note: subplot, not subplots) addresses a single subplot; the final argument 1 means the first subplot.
To set the figure's width and height, pass figsize:
fig, ax = plt.subplots(1, 3, figsize=(15, 7)) gives a 15x7 figure holding 1 row of 3 subplots. (csdn)
.squeeze()
Usage: numpy.squeeze(a, axis=None)
This function removes axes of length 1 from an array.
1) a is the input array;
2) axis specifies which axis to remove; the chosen axis must have length 1, otherwise an error is raised;
3) axis may be None, an int, or a tuple of ints (optional). If axis is None, all length-1 axes are removed;
4) the return value is an array;
5) the original array is not modified.
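A short sketch:
import numpy as np
a = np.zeros((1, 3, 1))
print(np.squeeze(a).shape)          # (3,): every length-1 axis removed
print(np.squeeze(a, axis=0).shape)  # (3, 1): only axis 0 removed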
Hmm... it feels like this video is really about speech recognition.
.rolling computes windowed statistics
Why use rolling-window data?
Point-in-time data fluctuates a lot, so a single point may not represent the underlying behavior well. The idea is to use an interval of data instead: to make the data more reliable, extend each point to an interval containing it and judge by the interval. That interval is the window (a sliding or moving window).
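A minimal sketch: a hypothetical noisy series smoothed with a 20-point rolling mean.
import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(100)).cumsum()
smooth = s.rolling(window=20).mean()  # each point becomes the mean of its trailing 20-point window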
I still don't really understand the audio-processing material; I can go back and fill that in later, but basically every example in this video uses audio data.
column_stack
Stacks two matrices by column.
row_stack
And so, correspondingly, stacking by row.
# Calculate stats
means = np.mean(audio_rectified_smooth, axis=0)
stds = np.std(audio_rectified_smooth, axis=0)
maxs = np.max(audio_rectified_smooth, axis=0)
# Create the X and y arrays
X = np.column_stack([means, stds, maxs])
y = labels.reshape([-1, 1])
print(X)
print(y)
# Fit the model and score on testing data
from sklearn.model_selection import cross_val_score
percent_score = cross_val_score(model, X, y, cv=5)
print(np.mean(percent_score))
Output (truncated; X is a 60x3 array of [mean, std, max] features, y the matching 60 'murmur'/'normal' labels):
[[ 0.04353008  0.07578418  0.35248821]
 [ 0.04553973  0.0791786   0.35960806]
 ...
 [ 0.04954006  0.08006467  0.36331815]]
[['murmur']
 ['murmur']
 ...
 ['murmur']]
0.716666666667
.items()
Python's dict items() returns the dictionary's (key, value) pairs as a traversable sequence of tuples (a list in Python 2).
Dictionary key-value pairs... ugh, stop forgetting simple things like this.
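A two-line reminder of dict iteration:
d = {'a': 1, 'b': 2}
for key, value in d.items():
    print(key, value)  # prints: a 1, then b 2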
Predicting data over time
When plotting, you can map the index to the line or scatter colors.
# Scatterplot with color relating to time
prices.plot.scatter('EBAY', 'YHOO', c=prices.index,
                    cmap=plt.cm.viridis, colorbar=False)
plt.show()
Before making predictions, the data needs cleaning.
For example outliers: extreme, anomalous values.
Usually a quick plot will reveal them.
When there are outliers, one option is to remove them.
tight_layout(): call this when plot elements overlap each other.
When there are missing values, one option is interpolation.
There are many interpolation methods.
interpolate()
This function does the interpolation;
different method parameters select the method:
1. nearest: nearest-neighbor interpolation
2. zero: zero-order (step) interpolation
3. slinear, linear: linear interpolation
4. quadratic, cubic: 2nd- and 3rd-order spline interpolation
# Create a function we'll use to interpolate and plot
def interpolate_and_plot(prices, interpolation):
    # Create a boolean mask for missing values
    missing_values = prices.isna()
    # Interpolate the missing values
    prices_interp = prices.interpolate(interpolation)
    # Plot the results, highlighting the interpolated values in black
    fig, ax = plt.subplots(figsize=(10, 5))
    prices_interp.plot(color='k', alpha=.6, ax=ax, legend=False)
    # Now plot the interpolated values on top in red
    prices_interp[missing_values].plot(ax=ax, color='r', lw=3, legend=False)
    plt.show()
# Interpolate using the latest non-missing value
interpolation_type = 'zero'
interpolate_and_plot(prices, interpolation_type)
# Interpolate linearly
interpolation_type = 'linear'
interpolate_and_plot(prices, interpolation_type)
# Interpolate with a quadratic function
interpolation_type = 'quadratic'
interpolate_and_plot(prices, interpolation_type)
A way to handle outliers:
def replace_outliers(series):
    # Calculate the absolute difference of each timepoint from the series mean
    absolute_differences_from_mean = np.abs(series - np.mean(series))
    # Calculate a mask for the differences that are > 3 standard deviations from the mean
    this_mask = absolute_differences_from_mean > (np.std(series) * 3)
    # Replace these values with the median across the data
    series[this_mask] = np.nanmedian(series)
    return series
# Apply your preprocessing function to the timeseries and plot the results
prices_perc = prices_perc.apply(replace_outliers)
prices_perc.loc["2014":"2015"].plot()
plt.show()
functools.partial
The functools module is for higher-order functions: functions that act on or return other functions. In general, any callable object can be treated as a function for this module's purposes.
functools.partial returns a callable partial object. Usage: partial(func, *args, **kw); func is required, and whatever positional or keyword arguments you pass are pre-filled into it. (隨風的山羊)
# Import partial from functools
from functools import partial
percentiles = [1, 10, 25, 50, 75, 90, 99]
# Use a list comprehension to create a partial function for each quantile
percentile_functions = [partial(np.percentile, q=percentile) for percentile in percentiles]
# Calculate each of these quantiles on the data using a rolling window
prices_perc_rolling = prices_perc.rolling(20, min_periods=5, closed='right')
features_percentiles = prices_perc_rolling.aggregate(percentile_functions)
# Plot a subset of the result
ax = features_percentiles.loc[:"2011-01"].plot(cmap=plt.cm.viridis)
ax.legend(percentiles, loc=(1.01, .5))
plt.show()
You can further extract date features:
# Extract date features from the data, add them as columns
prices_perc['day_of_week'] = prices_perc.index.dayofweek
prices_perc['week_of_year'] = prices_perc.index.weekofyear
prices_perc['month_of_year'] = prices_perc.index.month
# Print prices_perc
print(prices_perc)
For a detailed explanation see:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.dayofweek.html
Differencing and lags
shift()
pandas' shift (lag) function.
Lagging is sometimes done to make the data stationary, and sometimes to reduce endogeneity.
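A minimal sketch of building lagged features with shift() (each column shifts the values down k steps, leaving NaN at the top):
import pandas as pd
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
lags = pd.DataFrame({'lag_{}'.format(k): s.shift(k) for k in [1, 2, 3]})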
fillna()
Fills in missing values.
# Replace missing values with the median for each column
X = prices_perc_shifted.fillna(np.nanmedian(prices_perc_shifted))
y = prices_perc.fillna(np.nanmedian(prices_perc))
# Fit the model
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(X, y)
Cross-validation for time series
k-fold cross-validation
- ShuffleSplit
sklearn.model_selection.ShuffleSplit randomly shuffles the sample set and then splits it into a training set and a test set (which you can think of as a validation set; same below). (csdn)
class sklearn.model_selection.ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)
n_splits: int, the number of train/test splits to generate, 10 by default
test_size: float, int, or None, default=0.1; the test-set proportion or sample count. A float in [0.0, 1.0] is the fraction of samples used for testing; an int is the exact number of test samples; if train_size is not set, this defaults to 0.1, and if train_size is set, test_size takes the remainder
train_size: float, int, or None; the training-set proportion or sample count. A float in [0.0, 1.0] is the fraction used for training; an int is the exact number of training samples; None (the default) means the training set is everything left after removing the test set
random_state: int, RandomState instance, or None; the random seed, None by default (csdn)
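A tiny sketch of how ShuffleSplit draws its splits (the names here are illustrative):
import numpy as np
from sklearn.model_selection import ShuffleSplit
X_demo = np.arange(10).reshape(-1, 1)
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in ss.split(X_demo):
    print(train_idx, test_idx)  # 8 shuffled train indices, 2 test indices per split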
KFold
k-fold cross-validation: sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
Idea: split the dataset into n_splits mutually exclusive subsets; each round, one subset serves as the validation set and the remaining n_splits-1 as the training set, giving n_splits rounds of training and testing and n_splits results.
Note: when the dataset cannot be split evenly, the first n_samples % n_splits folds get n_samples // n_splits + 1 samples each, and the remaining folds get n_samples // n_splits.
Parameters:
n_splits: the number of folds
shuffle: whether to shuffle before splitting
(shuffling just means randomizing the order of the samples)
① if False, every run gives the same split, as if random_state were a fixed integer
② if True, every run gives a different split: the samples are shuffled and drawn randomly
random_state: the random seed
Methods:
① get_n_splits(X=None, y=None, groups=None): returns the value of n_splits
② split(X, y=None, groups=None): splits the dataset into training and test sets, returning a generator of index pairs
Try it on a dataset that doesn't split evenly, with different parameter values, and observe the results (csdn); see the sketch below.
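A small sketch of the uneven-split behavior (10 samples, 3 folds, so the first fold gets the extra sample):
import numpy as np
from sklearn.model_selection import KFold
X_demo = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=3, shuffle=False)
for train_idx, test_idx in kf.split(X_demo):
    print(train_idx, test_idx)  # test folds have sizes 4, 3, 3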
TimeSeriesSplit
Splits time-series data.
class sklearn.model_selection.TimeSeriesSplit(n_splits=5, max_train_size=None)
TimeSeriesSplit is a variant of k-fold: it returns the first k folds as the training set and the (k+1)-th fold as the test set. Note that, unlike standard cross-validation methods, successive training sets are supersets of the earlier ones. It also adds any surplus data to the first training partition, which is always used for training. The class can be used to cross-validate time-series samples observed at fixed time intervals. (機器學習中的交叉驗證, 博客園)
enumerate
enumerate() is a Python built-in, available in both Python 2.x and 3.x.
enumerate means to enumerate or list out.
Its argument is any traversable/iterable object (such as a list or string).
enumerate is mostly used for counting inside a for loop: it yields the index and the value together, so use it whenever you need both.
enumerate() returns an enumerate object. (cnblog)
datacamp
# Import TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
# Create time-series cross-validation object
cv = TimeSeriesSplit(n_splits=10)
# Iterate through CV splits
fig, ax = plt.subplots()
for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Plot the training data on each iteration, to see the behavior of the CV
    ax.plot(tr, ii + y[tr])
ax.set(title='Training data on each CV iteration', ylabel='CV iteration')
plt.show()
resample
Resampling (changing the sampling frequency of a time series).
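A minimal sketch of resample, downsampling a hypothetical daily series to monthly means ('M' is pandas' month-end frequency alias):
import numpy as np
import pandas as pd
idx = pd.date_range('2020-01-01', periods=100, freq='D')
daily = pd.Series(np.arange(100.0), index=idx)
monthly = daily.resample('M').mean()  # one mean per calendar month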
Cross-validation:
from sklearn.model_selection import cross_val_score
# Note: my_pearsonr (a custom correlation scorer), times_scores and
# bootstrap_interval below are helpers defined in the DataCamp exercise
# Generate scores for each split to see how the model performs over time
scores = cross_val_score(model, X, y, cv=cv, scoring=my_pearsonr)
# Convert to a Pandas Series object
scores_series = pd.Series(scores, index=times_scores, name='score')
# Bootstrap a rolling confidence interval for the mean score
scores_lo = scores_series.rolling(20).aggregate(partial(bootstrap_interval, percentiles=2.5))
scores_hi = scores_series.rolling(20).aggregate(partial(bootstrap_interval, percentiles=97.5))
# Plot the results
fig, ax = plt.subplots()
scores_lo.plot(ax=ax, label="Lower confidence interval")
scores_hi.plot(ax=ax, label="Upper confidence interval")
ax.legend()
plt.show()