Data Normalization


Map all of the data onto the same scale.

First, why do we need data normalization? Take a simple example. When features live on very different scales, the distance between samples is dominated by the large-scale feature: if sample 1 enters the model as [1, 200], the 200 can effectively drown out the 1. That is why we normalize the data, for example by converting a number of days into a fraction of a year: 200/365 = 0.5479, 100/365 = 0.2740.

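To make the scale problem concrete, here is a minimal sketch (the feature values, and the second sample, are made up purely for illustration) showing how the Euclidean distance between two samples is dominated by the large-scale feature until it is rescaled:

import numpy as np

# two samples with features on very different scales -- values are illustrative only
a = np.array([1.0, 200.0])
b = np.array([5.0, 100.0])

# the raw distance is dominated almost entirely by the second (large-scale) feature
print(np.linalg.norm(a - b))                    # ~100.08

# after converting the day counts into fractions of a year, both features matter
a_scaled = np.array([1.0, 200.0 / 365.0])
b_scaled = np.array([5.0, 100.0 / 365.0])
print(np.linalg.norm(a_scaled - b_scaled))      # ~4.01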

1. Min-Max Normalization

Min-max normalization (normalization): maps all data into the range [0, 1]. It works well when the distribution has clear boundaries, but it is strongly affected by outliers.

x_scale = (x - x_min) / (x_max - x_min)

import numpy as np
import matplotlib.pyplot as plt

# a vector of 100 random integers in [0, 100)
x = np.random.randint(0, 100, size=100)
x

Output:

array([84, 18, 75, 75, 78, 30, 39, 33, 29, 30, 48, 77, 54, 30,  1, 32, 91,
       60, 73, 78, 89, 16, 71, 47, 87, 43, 24, 67, 70, 50, 58, 56, 69, 11,
       19, 97, 64, 53, 37, 18, 84, 77,  6,  3, 91, 48, 14,  6, 70, 36, 93,
       43, 78, 78, 73, 18, 96, 58, 77, 78, 29, 96, 75, 59, 58, 19, 65, 90,
       67, 73, 72,  1, 89, 70, 59, 96, 42, 73, 58,  8, 61, 65, 78, 86, 98,
       94, 52,  1, 59, 86, 44, 28, 87,  2, 91, 75, 19, 91, 46, 92])

# min-max normalization: map x into [0, 1]
(x - np.min(x)) / (np.max(x) - np.min(x))

Output:

array([0.8556701 , 0.17525773, 0.7628866 , 0.7628866 , 0.79381443,
       0.29896907, 0.39175258, 0.32989691, 0.28865979, 0.29896907,
       0.48453608, 0.78350515, 0.54639175, 0.29896907, 0.        ,
       0.31958763, 0.92783505, 0.60824742, 0.74226804, 0.79381443,
       0.90721649, 0.15463918, 0.72164948, 0.4742268 , 0.88659794,
       0.43298969, 0.2371134 , 0.68041237, 0.71134021, 0.50515464,
       0.58762887, 0.56701031, 0.70103093, 0.10309278, 0.18556701,
       0.98969072, 0.64948454, 0.53608247, 0.37113402, 0.17525773,
       0.8556701 , 0.78350515, 0.05154639, 0.02061856, 0.92783505,
       0.48453608, 0.13402062, 0.05154639, 0.71134021, 0.36082474,
       0.94845361, 0.43298969, 0.79381443, 0.79381443, 0.74226804,
       0.17525773, 0.97938144, 0.58762887, 0.78350515, 0.79381443,
       0.28865979, 0.97938144, 0.7628866 , 0.59793814, 0.58762887,
       0.18556701, 0.65979381, 0.91752577, 0.68041237, 0.74226804,
       0.73195876, 0.        , 0.90721649, 0.71134021, 0.59793814,
       0.97938144, 0.42268041, 0.74226804, 0.58762887, 0.07216495,
       0.6185567 , 0.65979381, 0.79381443, 0.87628866, 1.        ,
       0.95876289, 0.5257732 , 0.        , 0.59793814, 0.87628866,
       0.44329897, 0.27835052, 0.88659794, 0.01030928, 0.92783505,
       0.7628866 , 0.18556701, 0.92783505, 0.46391753, 0.93814433])

# a 50x2 matrix: each column is one feature
X = np.random.randint(0, 100, (50, 2))
X[:10, :]

# cast to float so the normalized values are not truncated to integers
X = np.array(X, dtype=float)

# min-max normalize each column separately
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 0]
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
X[:, 1]
X[:10, :]

plt.scatter(X[:, 0], X[:, 1])
plt.show()

# the mean and std of min-max normalized data are not fixed values
np.mean(X[:, 0])
np.std(X[:, 0])
np.mean(X[:, 1])
np.std(X[:, 1])
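
As a side note, the per-column loop above can be collapsed with numpy's axis argument; a minimal equivalent sketch (X_raw is a hypothetical unnormalized matrix):

# column-wise min-max normalization in one expression (axis=0 -> per feature)
X_raw = np.random.randint(0, 100, (50, 2)).astype(float)
X_scaled = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))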

2. Mean-Variance Normalization (Standardization)

Mean-variance normalization (standardization): transforms all data into a distribution with mean 0 and variance 1. It is suitable when the data has no obvious boundaries and may contain extreme values.

x_scale = (x - x_mean) / s        (s is the standard deviation)

# a 50x2 matrix of random integers, cast to float
x2 = np.random.randint(0, 100, (50, 2))
x2 = np.array(x2, dtype=float)

# standardize each column: subtract the column mean, divide by the column std
x2[:, 0] = (x2[:, 0] - np.mean(x2[:, 0])) / np.std(x2[:, 0])
x2[:, 1] = (x2[:, 1] - np.mean(x2[:, 1])) / np.std(x2[:, 1])

plt.scatter(x2[:, 0], x2[:, 1])
plt.show()

# after standardization each column has mean (numerically) 0 and std 1
np.mean(x2[:, 0])
np.std(x2[:, 0])
np.mean(x2[:, 1])
np.std(x2[:, 1])
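
The same axis=0 trick applies here; a one-line sketch (x2_raw is a hypothetical unstandardized matrix):

# column-wise standardization in one expression
x2_raw = np.random.randint(0, 100, (50, 2)).astype(float)
x2_scaled = (x2_raw - x2_raw.mean(axis=0)) / x2_raw.std(axis=0)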

3. Should the training set and the test set both be normalized?

Before training a model, we first split the dataset into a training set and a test set. If the data needs to be normalized, we can easily compute the mean, variance, maximum, and minimum from the training set. But what about the test set? How should it be normalized?

Normally, the test set stands in for the real environment, and in the real environment we may never see all of the test data at once. So when a new sample needs to be predicted, we normalize it using the mean and variance (or maximum and minimum) computed from the training set. scikit-learn encapsulates this in a Scaler object, which stores the key statistics (mean, variance, and so on) of the training set.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

x = iris.data
y = iris.target
x[:10, :]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)

# fit the scaler on the training set only
standard_scaler = StandardScaler()
standard_scaler.fit(x_train)
standard_scaler.mean_    # per-feature mean of the training set
standard_scaler.scale_   # per-feature standard deviation of the training set

# transform both the training set and the test set with the training-set statistics
x_train = standard_scaler.transform(x_train)
x_train
x_test_standard = standard_scaler.transform(x_test)
x_test_standard
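
As a side note, fitting and transforming the same data can be combined with StandardScaler's fit_transform method; a minimal sketch (x_train_raw and x_test_raw are hypothetical names for untransformed splits, since x_train above has already been overwritten):

# fit on the raw training split and transform it in one step
x_train_raw, x_test_raw, _, _ = train_test_split(x, y, test_size=0.2, random_state=666)
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_raw)
x_test_scaled = scaler.transform(x_test_raw)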

Next, let's see how KNN performs after normalization:

from sklearn.neighbors import KNeighborsClassifier

# train on the standardized training set, then score on the standardized test set
knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train, y_train)
knn_clf.score(x_test_standard, y_test)

Output: 1.0

What if the training set is normalized but the test set is left unscaled?

knn_clf.score(x_test, y_test)

Output: 0.3333333333333333

The score collapses because the model was trained on standardized features; feeding it raw, unscaled test data puts the two feature spaces on completely different scales.

4. Writing mean-variance normalization ourselves (object-oriented)

from sklearn.preprocessing import StandardScaler  # sklearn's built-in counterpart
import numpy as np


class StandardScale(object):

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, x):
        """Compute the per-feature mean and standard deviation from the training set x."""
        assert x.ndim == 2, "the dimension of x must be 2"

        self.mean_ = np.array([np.mean(x[:, i]) for i in range(x.shape[1])])
        self.scale_ = np.array([np.std(x[:, i]) for i in range(x.shape[1])])

        return self

    def transform(self, x):
        """Standardize x using the mean and scale learned in fit."""
        assert x.ndim == 2, "the dimension of x must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform"
        assert x.shape[1] == len(self.mean_), \
            "the feature number of x must be equal to mean_ and scale_"

        res_x = np.empty(shape=x.shape, dtype=float)
        for col in range(x.shape[1]):
            res_x[:, col] = (x[:, col] - self.mean_[col]) / self.scale_[col]

        return res_x
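
A quick sanity check of this hand-written class against sklearn's StandardScaler; a minimal sketch (the data here is random and purely for illustration):

# both scalers use the population standard deviation, so the results should match
data = np.random.randint(0, 100, (50, 2)).astype(float)

my_scaler = StandardScale().fit(data)
sk_scaler = StandardScaler().fit(data)

print(np.allclose(my_scaler.transform(data), sk_scaler.transform(data)))  # expected: True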

5. Writing min-max normalization ourselves (object-oriented)

from sklearn.preprocessing import MinMaxScaler  # sklearn's built-in counterpart
import numpy as np

class MinMaxScale(object):

    def __init__(self):
        self.min_ = None
        self.max_ = None

    def fit(self, x):
        """Compute the per-feature minimum and maximum from the training set x."""
        assert x.ndim == 2, "the dimension of x must be 2"

        self.min_ = np.array([np.min(x[:, i]) for i in range(x.shape[1])])
        self.max_ = np.array([np.max(x[:, i]) for i in range(x.shape[1])])

        return self

    def transform(self, x):
        """Min-max normalize x using the minimum and maximum learned in fit."""
        assert x.ndim == 2, "the dimension of x must be 2"
        assert self.min_ is not None and self.max_ is not None, \
            "must fit before transform"
        assert x.shape[1] == len(self.min_), \
            "the feature number of x must be equal to min_ and max_"

        res_x = np.empty(shape=x.shape, dtype=float)
        for col in range(x.shape[1]):
            res_x[:, col] = (x[:, col] - self.min_[col]) / (self.max_[col] - self.min_[col])

        return res_x
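
And a similar sanity check against sklearn's MinMaxScaler (imported above but not used yet); a minimal sketch with random illustrative data:

data = np.random.randint(0, 100, (50, 2)).astype(float)

my_minmax = MinMaxScale().fit(data)
sk_minmax = MinMaxScaler().fit(data)

# MinMaxScaler's default feature_range is (0, 1), so the outputs should agree
print(np.allclose(my_minmax.transform(data), sk_minmax.transform(data)))  # expected: True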

In fact, there are more ways to normalize data; I will flesh them out later!

