Normalization
Data standardization (normalization) is a basic step in data mining. Evaluation metrics often come with different dimensions and units, which distorts the results of data analysis; to remove these dimensional effects, the data must be standardized so that the metrics become comparable. After standardization, all indicators sit on the same order of magnitude and can be compared and evaluated together.
Several normalization methods
MinMaxScaler
Also called min-max (deviation) normalization, this is a linear transformation of the original data that maps every value into [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
MaxAbsScaler
Similar to the method above, but it scales the training set into [-1, 1] by dividing each feature by its maximum absolute value. It is intended for data that is already centered at zero, or for sparse data containing very many zeros.
StandardScaler
Standardizes each feature to zero mean and unit variance. The mean and standard deviation are computed on the training set so that exactly the same transformation can later be applied to the test set.
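To make the three transformations concrete, here is a minimal sketch on a toy column (made-up numbers, independent of the experiment below):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, StandardScaler

x = np.array([[-4.0], [0.0], [2.0], [8.0]])       # one feature, four samples
print(MinMaxScaler().fit_transform(x).ravel())    # [0. 0.333 0.5 1.] -- mapped into [0, 1]
print(MaxAbsScaler().fit_transform(x).ravel())    # [-0.5 0. 0.25 1.] -- divided by max|x| = 8
print(StandardScaler().fit_transform(x).ravel())  # zero mean, unit variance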
Experiment
Experimental method
We compare the mean squared error (MSE) of four classic machine-learning models under each normalization method, and with no normalization at all, to measure how normalizing (or not) affects each model.
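For reference, MSE is the usual average of squared errors (lower is better):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$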
Experiment code and results
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('huodian.csv')
data = data.sort_values(by='time',ascending=True)
data.reset_index(inplace=True,drop=True)
target = data['T1AOMW_AV']  # target is the label (Y)
del data['T1AOMW_AV']  # remove Y from the feature table
# find columns that contain missing values
All_NaN = pd.DataFrame(data.isnull().sum()).reset_index()
All_NaN.columns = ['name','times']
All_NaN.describe()
| | times |
|---|---|
| count | 170.0 |
| mean | 0.0 |
| std | 0.0 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 0.0 |
# drop features whose values barely vary (keep std >= 1)
feature_describe_T = data.describe().T
unstd_feature = feature_describe_T[feature_describe_T['std']>=1].index
data = data[unstd_feature]
# drop the irrelevant time column
del data['time']
test_data = data[:5000]
# carve out extra slices of the dataset (test_data, data1/data2 are not used below)
data1 = data[5000:16060]
target1 = target[5000:16060]
data2 = data[16060:]
target2 = target[16060:]
import scipy.stats as stats
dict_corr = {
'spearman' : [],
'pearson' : [],
'kendall' : [],
'columns' : []
}
# compute three correlation coefficients between each feature column and the target
for i in data.columns:
    corr_pear,pval = stats.pearsonr(data[i],target)
    corr_spear,pval = stats.spearmanr(data[i],target)
    corr_kendall,pval = stats.kendalltau(data[i],target)
    dict_corr['pearson'].append(abs(corr_pear))
    dict_corr['spearman'].append(abs(corr_spear))
    dict_corr['kendall'].append(abs(corr_kendall))
    dict_corr['columns'].append(i)
# keep features whose correlations with the target fall in a moderate band
dict_corr = pd.DataFrame(dict_corr)
new_fea = list(dict_corr[(dict_corr['pearson']>0.1) & (dict_corr['spearman']>0.15) & (dict_corr['kendall']>0.15)
                         & (dict_corr['pearson']<0.93) & (dict_corr['spearman']<0.93) & (dict_corr['kendall']<0.93)]['columns'].values)
#new_fea = list(dict_corr[(dict_corr['pearson']<0.63) & (dict_corr['spearman']<0.69) & (dict_corr['kendall']<0.63)]['columns'].values)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import MinMaxScaler,StandardScaler,MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.svm import SVR
import warnings
warnings.filterwarnings("ignore")
## split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data[new_fea],target,test_size=0.25,random_state=12345)
print('without normalization:')
estimator_lr = Lasso(alpha=0.5).fit(X_train,y_train)
predict_lr = estimator_lr.predict(X_test)
print('Lasso:',mse(predict_lr,y_test))
estimator_rg = Ridge(alpha=0.5).fit(X_train,y_train)
predict_rg = estimator_rg.predict(X_test)
print('Ridge:',mse(predict_rg,y_test))
estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(X_train,y_train)
predict_svr = estimator_svr.predict(X_test)
print('SVR:',mse(predict_svr,y_test))
estimator_RF = RandomForestRegressor().fit(X_train,y_train)
predict_RF = estimator_RF.predict(X_test)
print('RF:',mse(predict_RF,y_test))
mm = MinMaxScaler()
mm_x_train = mm.fit_transform(X_train)
mm_x_test = mm.transform(X_test)
print('MinMaxScaler:')
estimator_lr = Lasso(alpha=0.5).fit(mm_x_train,y_train)
predict_lr = estimator_lr.predict(mm_x_test)
print('Lasso:',mse(predict_lr,y_test))
estimator_rg = Ridge(alpha=0.5).fit(mm_x_train,y_train)
predict_rg = estimator_rg.predict(mm_x_test)
print('Ridge:',mse(predict_rg,y_test))
estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(mm_x_train,y_train)
predict_svr = estimator_svr.predict(mm_x_test)
print('SVR:',mse(predict_svr,y_test))
estimator_RF = RandomForestRegressor().fit(mm_x_train,y_train)
predict_RF = estimator_RF.predict(mm_x_test)
print('RF:',mse(predict_RF,y_test))
ma = MaxAbsScaler()
ma_x_train = ma.fit_transform(X_train)
ma_x_test = ma.transform(X_test)
print('MaxAbsScaler:')
estimator_lr = Lasso(alpha=0.5).fit(ma_x_train,y_train)
predict_lr = estimator_lr.predict(ma_x_test)
print('Lasso:',mse(predict_lr,y_test))
estimator_rg = Ridge(alpha=0.5).fit(ma_x_train,y_train)
predict_rg = estimator_rg.predict(ma_x_test)
print('Ridge:',mse(predict_rg,y_test))
estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(ma_x_train,y_train)
predict_svr = estimator_svr.predict(ma_x_test)
print('SVR:',mse(predict_svr,y_test))
estimator_RF = RandomForestRegressor().fit(ma_x_train,y_train)
predict_RF = estimator_RF.predict(ma_x_test)
print('RF:',mse(predict_RF,y_test))
ss = StandardScaler()
ss_x_train = ss.fit_transform(X_train)
ss_x_test = ss.transform(X_test)
print('StandardScaler:')
estimator_lr = Lasso(alpha=0.5).fit(ss_x_train,y_train)
predict_lr = estimator_lr.predict(ss_x_test)
print('Lasso:',mse(predict_lr,y_test))
estimator_rg = Ridge(alpha=0.5).fit(ss_x_train,y_train)
predict_rg = estimator_rg.predict(ss_x_test)
print('Ridge:',mse(predict_rg,y_test))
estimator_svr = SVR(kernel='rbf',C=100,epsilon=0.1).fit(ss_x_train,y_train)
predict_svr = estimator_svr.predict(ss_x_test)
print('SVR:',mse(predict_svr,y_test))
estimator_RF = RandomForestRegressor().fit(ss_x_train,y_train)
predict_RF = estimator_RF.predict(ss_x_test)
print('RF:',mse(predict_RF,y_test))
without normalization:
Lasso: 64.48569344896079
Ridge: 52.32215979123271
SVR: 2562.6181533319277
RF: 11.342877117923145
MinMaxScaler:
Lasso: 110.64816111661362
Ridge: 55.430338750636416
SVR: 37.81036885831256
RF: 10.204243317509082
MaxAbsScaler:
Lasso: 257.7066786267883
Ridge: 63.91979829622576
SVR: 69.74587878254961
RF: 11.721070230746417
StandardScaler:
Lasso: 81.70216554870805
Ridge: 52.5282264448465
SVR: 7.996381635964344
RF: 9.615276857782204
Analysis of results
Comparing the numbers, a few patterns stand out. For Lasso, the MSE grows noticeably after normalization: squeezing the features into a small range forces the coefficients to grow, so the fixed alpha=0.5 L1 penalty bites harder. For Ridge, the effect is small except under MaxAbsScaler. For SVR, skipping normalization makes the MSE explode, and StandardScaler works best. For RF, neither the choice of normalization method nor its absence matters much (see the sketch below).
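The RF result is expected: tree splits depend only on the ordering of feature values, and all three scalers are monotonic, so scaling leaves the learned trees essentially unchanged. A minimal sketch on toy data (not the power-plant dataset used above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.uniform(0, 1000, size=(200, 3))   # features on a large scale
y = 0.5 * X[:, 0] + rng.normal(size=200)

rf_raw = RandomForestRegressor(random_state=0).fit(X, y)
rf_scaled = RandomForestRegressor(random_state=0).fit(MinMaxScaler().fit_transform(X), y)

# same random_state + order-preserving scaling -> (near-)identical forests
print(np.allclose(rf_raw.predict(X), rf_scaled.predict(MinMaxScaler().fit_transform(X))))  # expected: True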
Cause analysis
An SVM chooses the hyperplane with the maximal margin between the two classes, and both that margin and the RBF kernel used by SVR here are computed from distances between samples. A feature with a much larger numeric range therefore dominates those distances: without normalization, the fitted decision surface is determined almost entirely by the large-scale features, the resulting division of the space is inaccurate, and test performance collapses, which is exactly the huge MSE seen above.
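A minimal sketch (made-up numbers) of how one large-scale feature swamps the distances the kernel is built on:

import numpy as np
from sklearn.preprocessing import StandardScaler

# feature 1 spans ~[0, 1]; feature 2 spans thousands
X = np.array([[0.2, 3000.0],
              [0.8, 3010.0],
              [0.3, 9500.0]])

print(np.sum((X[0] - X[1])**2))   # ~100.4, dominated by feature 2's raw units
Xs = StandardScaler().fit_transform(X)
print(np.sum((Xs[0] - Xs[1])**2)) # ~5.2, now feature 1's genuinely large gap dominates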