Data Normalization
Z-score (mean-variance) normalization and min-max (range) normalization
Because variables or indicators are measured in different units, some take very large values while others take very small ones. In model computation the large values can drown out the small ones and distort the model, so the data need to be normalized, that is, made dimensionless.
Z-score (mean-variance) normalization: subtract the mean of a variable or indicator from each value and divide by the standard deviation. The resulting data have mean 0 and variance 1. The formula is:
$$x^* = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$
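As a quick illustration of the formula, here is a minimal NumPy sketch (the array values are made up for demonstration, not taken from data.npy):
import numpy as np
x = np.array([2.0, 4.0, 6.0, 8.0])      # toy data
x_star = (x - x.mean()) / x.std()       # subtract the mean, divide by the standard deviation
print(x_star.mean(), x_star.std())      # approximately 0 and 1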
Min-max (range) normalization: subtract the minimum of a variable or indicator from each value and divide by the difference between the maximum and the minimum. The resulting data lie in the range [0, 1]. The formula is:
$$x^* = \frac{x - \min(x)}{\max(x) - \min(x)}$$
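The same kind of toy sketch for min-max normalization (again with made-up values):
import numpy as np
x = np.array([2.0, 4.0, 6.0, 8.0])                # toy data
x_star = (x - x.min()) / (x.max() - x.min())      # rescale into [0, 1]
print(x_star)                                     # 0, 0.3333..., 0.6666..., 1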
1. Load the data
# load the data
import numpy as np
data = np.load('data.npy')
data = data[:, 1:]   # keep all columns except the first
data
array([[ 17. , 66.17647059, 32. , 1614.96618125,
13.15625 ],
[ 8. , 68.6875 , 36. , 143.56458056,
3.80555556],
[ 16. , 65.84375 , 43. , 1344.13137674,
12.69767442],
...,
[ 10. , 67.95 , 24. , 115.87417083,
2.79166667],
[ 21. , 66.5 , 41. , 538.71289268,
20.31707317],
[ 11. , 78.27272727, 9. , 62.98323333,
9.44444444]])
# personal preference: a DataFrame is easier to read in Jupyter
import pandas as pd
data = pd.DataFrame(data)
data
(index) | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 17.0 | 66.176471 | 32.0 | 1614.966181 | 13.156250 |
1 | 8.0 | 68.687500 | 36.0 | 143.564581 | 3.805556 |
2 | 16.0 | 65.843750 | 43.0 | 1344.131377 | 12.697674 |
3 | 2.0 | 75.000000 | 2.0 | 0.365700 | 1.000000 |
4 | 27.0 | 65.740741 | 60.0 | 991.953787 | 11.100000 |
... | ... | ... | ... | ... | ... |
830 | 35.0 | 66.057143 | 44.0 | 127.945364 | 12.250000 |
831 | 14.0 | 69.714286 | 7.0 | 32.219643 | 15.571429 |
832 | 10.0 | 67.950000 | 24.0 | 115.874171 | 2.791667 |
833 | 21.0 | 66.500000 | 41.0 | 538.712893 | 20.317073 |
834 | 11.0 | 78.272727 | 9.0 | 62.983233 | 9.444444 |
835 rows × 5 columns
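Before imputing in step 3, it can be worth checking whether the columns actually contain missing values; a small sketch on the DataFrame created above (assuming data.npy may contain NaN entries):
# count the NaN values in each column
data.isna().sum()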
2. Import the preprocessing libraries
# import the preprocessing libraries
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
3. Impute missing values
# fill missing values with the column mean
# create an imputation object imp_mean with SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform fits and transforms in a single step
imp_mean = imp_mean.fit_transform(data)   # imp_mean now holds the imputed array
imp_mean
array([[ 17. , 66.17647059, 32. , 1614.96618125,
13.15625 ],
[ 8. , 68.6875 , 36. , 143.56458056,
3.80555556],
[ 16. , 65.84375 , 43. , 1344.13137674,
12.69767442],
...,
[ 10. , 67.95 , 24. , 115.87417083,
2.79166667],
[ 21. , 66.5 , 41. , 538.71289268,
20.31707317],
[ 11. , 78.27272727, 9. , 62.98323333,
9.44444444]])
# display as a DataFrame for readability
pd.DataFrame(imp_mean)
(index) | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 17.0 | 66.176471 | 32.0 | 1614.966181 | 13.156250 |
1 | 8.0 | 68.687500 | 36.0 | 143.564581 | 3.805556 |
2 | 16.0 | 65.843750 | 43.0 | 1344.131377 | 12.697674 |
3 | 2.0 | 75.000000 | 2.0 | 0.365700 | 1.000000 |
4 | 27.0 | 65.740741 | 60.0 | 991.953787 | 11.100000 |
... | ... | ... | ... | ... | ... |
830 | 35.0 | 66.057143 | 44.0 | 127.945364 | 12.250000 |
831 | 14.0 | 69.714286 | 7.0 | 32.219643 | 15.571429 |
832 | 10.0 | 67.950000 | 24.0 | 115.874171 | 2.791667 |
833 | 21.0 | 66.500000 | 41.0 | 538.712893 | 20.317073 |
834 | 11.0 | 78.272727 | 9.0 | 62.983233 | 9.444444 |
835 rows × 5 columns
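As a quick sanity check (a sketch reusing imp_mean and data from above): after imputation no NaN should remain, and the value filled into each column is that column's mean.
# should print 0: no missing values left
print(np.isnan(imp_mean).sum())
# the fill values are the column means (pandas skips NaN when computing the mean)
print(data.mean().values)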
4. Z-score normalization (mean-variance normalization)
X1 = imp_mean
scaled_x = preprocessing.scale(X1)   # standardize each column to mean 0, std 1
pd.DataFrame(scaled_x)
(index) | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 0.200258 | -0.827606 | 0.055546 | 2.843538 | 0.769541 |
1 | -0.689187 | -0.092243 | 0.206625 | -0.185541 | -0.651561 |
2 | 0.101431 | -0.925045 | 0.471013 | 2.285988 | 0.699848 |
3 | -1.282151 | 1.756395 | -1.077545 | -0.480335 | -1.077944 |
4 | 1.188531 | -0.955211 | 1.113098 | 1.560983 | 0.457036 |
... | ... | ... | ... | ... | ... |
830 | 1.979150 | -0.862552 | 0.508783 | -0.217695 | 0.631811 |
831 | -0.096223 | 0.208455 | -0.888696 | -0.414760 | 1.136596 |
832 | -0.491533 | -0.308222 | -0.246611 | -0.242546 | -0.805650 |
833 | 0.595568 | -0.732860 | 0.395474 | 0.627925 | 1.857831 |
834 | -0.392705 | 2.714824 | -0.813157 | -0.351429 | 0.205428 |
835 rows × 5 columns
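The claim that each column now has mean 0 and standard deviation 1 can be checked directly (a sketch using the scaled_x array from above; preprocessing.scale uses the population standard deviation, ddof=0):
# per-column mean should be ~0 and per-column std should be ~1
print(scaled_x.mean(axis=0).round(6))
print(scaled_x.std(axis=0).round(6))
If the same transformation later has to be applied to new data, preprocessing.StandardScaler (fit on the training data, then transform) is the usual choice instead of the one-shot scale function.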
5. Min-max normalization (range normalization)
X2 = imp_mean
# scale the data to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
minmax_x = min_max_scaler.fit_transform(X2)
pd.DataFrame(minmax_x)
(index) | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 0.380952 | 0.044063 | 0.251969 | 0.339418 | 0.138139 |
1 | 0.166667 | 0.171583 | 0.283465 | 0.030155 | 0.031881 |
2 | 0.357143 | 0.027166 | 0.338583 | 0.282493 | 0.132928 |
3 | 0.023810 | 0.492158 | 0.015748 | 0.000057 | 0.000000 |
4 | 0.619048 | 0.021935 | 0.472441 | 0.208472 | 0.114773 |
... | ... | ... | ... | ... | ... |
830 | 0.809524 | 0.038003 | 0.346457 | 0.026872 | 0.127841 |
831 | 0.309524 | 0.223728 | 0.055118 | 0.006753 | 0.165584 |
832 | 0.214286 | 0.134130 | 0.188976 | 0.024335 | 0.020360 |
833 | 0.476190 | 0.060493 | 0.322835 | 0.113208 | 0.219512 |
834 | 0.238095 | 0.658361 | 0.070866 | 0.013218 | 0.095960 |
835 rows × 5 columns
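Similarly, each column of minmax_x should now span exactly [0, 1] (a sketch using the objects from above):
# per-column minimum should be 0 and maximum should be 1
print(minmax_x.min(axis=0))
print(minmax_x.max(axis=0))
# MinMaxScaler stores the fitted range, so new data can later be transformed consistently
print(min_max_scaler.data_min_, min_max_scaler.data_max_)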