2. Data Normalization

Mean-variance normalization and min-max (range) normalization

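Variables (or indicators) are often measured in different units, so some take very large values while others take very small ones. During model computation the large-valued features can swamp the small-valued ones and distort the model. The data therefore need to be normalized, in other words made dimensionless.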
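To make the "large values swamp small values" point concrete, here is a minimal sketch using Euclidean distance. The three samples are made-up values chosen only to resemble the magnitudes of the dataset in this post; they are not taken from it.

```python
import numpy as np

# Hypothetical samples: column 0 is on a scale of tens, column 1 on a scale of thousands.
a = np.array([17.0, 1614.0])
b = np.array([8.0, 1600.0])
c = np.array([16.0, 144.0])


def euclidean(u, v):
    """Plain Euclidean distance between two 1-D vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Fraction of the squared distance between a and c contributed by the large-scale column.
share_large = (a[1] - c[1]) ** 2 / d_ac ** 2

print(d_ab, d_ac, share_large)
```

In this toy case nearly all of the distance between `a` and `c` comes from the second column, even though the first column differs as well. Any distance-based or gradient-based model sees the data in the same lopsided way until the columns are brought to a common scale.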

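Mean-variance normalization: subtract the mean from each variable (or indicator) and divide the result by its standard deviation. The new data have mean 0 and variance 1. The formula is: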

$$x^{*} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$
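The formula can be checked by hand with NumPy. This is only a sketch on five illustrative values, not the full dataset:

```python
import numpy as np

# Illustrative values for a single variable.
x = np.array([17.0, 8.0, 16.0, 2.0, 27.0])

# x* = (x - mean(x)) / std(x); np.std uses the population standard deviation (ddof=0) by default.
z = (x - x.mean()) / x.std()

z_mean = z.mean()
z_std = z.std()
print(z)
print(z_mean, z_std)
```

Up to floating-point error, `z_mean` is 0 and `z_std` is 1. Note that the formula is undefined for a constant variable, because its standard deviation is 0.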

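Min-max (range) normalization: subtract the minimum from each variable (or indicator) and divide the result by the difference between the maximum and the minimum. The new data lie in the range [0, 1]. The formula is: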

$$x^{*} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
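A hand-computed version of the min-max formula, again as a sketch on illustrative values only:

```python
import numpy as np

# Illustrative values for a single variable.
x = np.array([17.0, 8.0, 16.0, 2.0, 27.0])

# x* = (x - min(x)) / (max(x) - min(x))
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_minmax)
```

The smallest value maps to 0 and the largest to 1. As with the mean-variance formula, a constant variable makes the denominator 0, so that case needs separate handling.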

1. Load the data

# Load the data
import numpy as np
data = np.load('data.npy')
# Drop the first column and keep the remaining columns
data = data[:, 1:]
data
array([[  17.        ,   66.17647059,   32.        , 1614.96618125,
          13.15625   ],
       [   8.        ,   68.6875    ,   36.        ,  143.56458056,
           3.80555556],
       [  16.        ,   65.84375   ,   43.        , 1344.13137674,
          12.69767442],
       ...,
       [  10.        ,   67.95      ,   24.        ,  115.87417083,
           2.79166667],
       [  21.        ,   66.5       ,   41.        ,  538.71289268,
          20.31707317],
       [  11.        ,   78.27272727,    9.        ,   62.98323333,
           9.44444444]])
# Personal preference: a DataFrame is easier to read in Jupyter
import pandas as pd
data = pd.DataFrame(data)
data
        0          1     2            3          4
0    17.0  66.176471  32.0  1614.966181  13.156250
1     8.0  68.687500  36.0   143.564581   3.805556
2    16.0  65.843750  43.0  1344.131377  12.697674
3     2.0  75.000000   2.0     0.365700   1.000000
4    27.0  65.740741  60.0   991.953787  11.100000
...   ...        ...   ...          ...        ...
830  35.0  66.057143  44.0   127.945364  12.250000
831  14.0  69.714286   7.0    32.219643  15.571429
832  10.0  67.950000  24.0   115.874171   2.791667
833  21.0  66.500000  41.0   538.712893  20.317073
834  11.0  78.272727   9.0    62.983233   9.444444

835 rows × 5 columns

2. Import the preprocessing libraries

# Import the preprocessing libraries
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

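3. Handle missing values

# Fill missing values with the column mean
# Create the imputer object imp_mean with SimpleImputer
# (the verbose argument was deprecated and then removed in newer scikit-learn releases, so it is omitted)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform fits the imputer and returns the filled array in one step
# Note: this rebinds imp_mean from the imputer object to the resulting NumPy array
imp_mean = imp_mean.fit_transform(data)
imp_mean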
array([[  17.        ,   66.17647059,   32.        , 1614.96618125,
          13.15625   ],
       [   8.        ,   68.6875    ,   36.        ,  143.56458056,
           3.80555556],
       [  16.        ,   65.84375   ,   43.        , 1344.13137674,
          12.69767442],
       ...,
       [  10.        ,   67.95      ,   24.        ,  115.87417083,
           2.79166667],
       [  21.        ,   66.5       ,   41.        ,  538.71289268,
          20.31707317],
       [  11.        ,   78.27272727,    9.        ,   62.98323333,
           9.44444444]])
# Wrap the array in a DataFrame for a more readable display
pd.DataFrame(imp_mean)
        0          1     2            3          4
0    17.0  66.176471  32.0  1614.966181  13.156250
1     8.0  68.687500  36.0   143.564581   3.805556
2    16.0  65.843750  43.0  1344.131377  12.697674
3     2.0  75.000000   2.0     0.365700   1.000000
4    27.0  65.740741  60.0   991.953787  11.100000
...   ...        ...   ...          ...        ...
830  35.0  66.057143  44.0   127.945364  12.250000
831  14.0  69.714286   7.0    32.219643  15.571429
832  10.0  67.950000  24.0   115.874171   2.791667
833  21.0  66.500000  41.0   538.712893  20.317073
834  11.0  78.272727   9.0    62.983233   9.444444

835 rows × 5 columns
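The rows shown above contain no visible missing values, so the effect of the imputer is hard to see from the printed output. The following sketch uses a small made-up array (not the post's data) to show what `SimpleImputer` with `strategy='mean'` does to NaN entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array with one missing value per column.
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy)

# The per-column means learned during fit are stored in statistics_.
col_means = imputer.statistics_

print(filled)
print(col_means)
```

Each NaN is replaced by the mean of the non-missing values in its own column. Keeping the fitted imputer in a separate variable (here `imputer`) rather than overwriting it makes it possible to apply the same fill values to new data later with `imputer.transform`.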

4. Mean-variance normalization (Z-score normalization)

X1 = imp_mean
scaled_x = preprocessing.scale(X1)
pd.DataFrame(scaled_x)
             0          1          2          3          4
0     0.200258  -0.827606   0.055546   2.843538   0.769541
1    -0.689187  -0.092243   0.206625  -0.185541  -0.651561
2     0.101431  -0.925045   0.471013   2.285988   0.699848
3    -1.282151   1.756395  -1.077545  -0.480335  -1.077944
4     1.188531  -0.955211   1.113098   1.560983   0.457036
...        ...        ...        ...        ...        ...
830   1.979150  -0.862552   0.508783  -0.217695   0.631811
831  -0.096223   0.208455  -0.888696  -0.414760   1.136596
832  -0.491533  -0.308222  -0.246611  -0.242546  -0.805650
833   0.595568  -0.732860   0.395474   0.627925   1.857831
834  -0.392705   2.714824  -0.813157  -0.351429   0.205428

835 rows × 5 columns
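`preprocessing.scale` standardizes each column in a single call but does not keep the means and standard deviations it used. When the same transformation has to be applied to new data later, `preprocessing.StandardScaler` is a common alternative. The sketch below uses randomly generated arrays (`train` and `new` are hypothetical stand-ins, not the post's data) and checks that both routes agree with the manual formula:

```python
import numpy as np
from sklearn import preprocessing

# Synthetic data: two columns on very different scales (assumed values, for illustration only).
rng = np.random.default_rng(0)
train = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(100, 2))
new = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(5, 2))

# Fit once, then reuse the learned column means and standard deviations.
scaler = preprocessing.StandardScaler().fit(train)
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)

# Both should match the one-shot function and the column-wise manual formula (population std).
same_as_scale = np.allclose(train_scaled, preprocessing.scale(train))
manual = (train - train.mean(axis=0)) / train.std(axis=0)
same_as_manual = np.allclose(train_scaled, manual)

print(same_as_scale, same_as_manual)
```

Note that `new_scaled` is computed with the statistics of `train`, so its columns will generally not have mean exactly 0 and standard deviation exactly 1.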

5. Min-max normalization (range normalization)

X2 = imp_mean

# Rescale each column to the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()

minmax_x = min_max_scaler.fit_transform(X2)
pd.DataFrame(minmax_x)
            0         1         2         3         4
0    0.380952  0.044063  0.251969  0.339418  0.138139
1    0.166667  0.171583  0.283465  0.030155  0.031881
2    0.357143  0.027166  0.338583  0.282493  0.132928
3    0.023810  0.492158  0.015748  0.000057  0.000000
4    0.619048  0.021935  0.472441  0.208472  0.114773
...       ...       ...       ...       ...       ...
830  0.809524  0.038003  0.346457  0.026872  0.127841
831  0.309524  0.223728  0.055118  0.006753  0.165584
832  0.214286  0.134130  0.188976  0.024335  0.020360
833  0.476190  0.060493  0.322835  0.113208  0.219512
834  0.238095  0.658361  0.070866  0.013218  0.095960

835 rows × 5 columns
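Because `MinMaxScaler` stores the per-column minimum and maximum it was fitted on, the scaling can be undone with `inverse_transform`. A minimal sketch on a made-up 3x2 array (not the post's data):

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical array: each column has its own minimum and maximum.
toy = np.array([[17.0, 1614.0],
                [8.0, 143.0],
                [2.0, 0.0]])

# feature_range=(0, 1) is the default; it is spelled out here for clarity.
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(toy)

# Map the scaled values back to the original units.
restored = scaler.inverse_transform(scaled)

print(scaled)
print(restored)
```

Within each column the minimum maps to 0 and the maximum to 1, and `restored` matches `toy` up to floating-point error. Min-max scaling is sensitive to outliers, since a single extreme value sets the range for the whole column.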


