Machine Learning with sklearn (6): Data Processing (3), Numerical Data (1): Normalization (MinMaxScaler / MaxAbsScaler)


Source: https://www.cntofu.com/book/170/docs/59.md

1 Scaling features to a range

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

The motivation to use this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data.

Here is an example of scaling a simple data matrix to the [0, 1] range:

>>> import numpy as np
>>> from sklearn import preprocessing
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

The same instance of the transformer can then be applied to new test data unseen during the fit call: the same scaling and shifting operations will be applied, consistent with the transformation performed on the training data:

>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])

It is possible to introspect the scaler attributes to find out the exact nature of the transformation learned on the training data:

>>> min_max_scaler.scale_                             
array([ 0.5       ,  0.5       ,  0.33...])

>>> min_max_scaler.min_                               
array([ 0.        ,  0.5       ,  0.33...])

If MinMaxScaler is given an explicit feature_range=(min, max), the complete formula is:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
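For illustration, a quick check of this formula with an explicit range, reusing X_train from above (the default range is (0, 1)):

>>> min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10))
>>> min_max_scaler.fit_transform(X_train)
array([[ 5.        ,  0.        , 10.        ],
       [10.        ,  5.        ,  3.33333333],
       [ 0.        , 10.        ,  0.        ]])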

The MaxAbsScaler class works in a very similar fashion, but scales the training data so that it lies within the range [-1, 1] by dividing by the maximum absolute value of each feature. It is meant for data that is already centered at zero or for sparse data.

Here is how to apply this scaler to the data from the previous example:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
>>> X_test_maxabs
array([[-1.5, -1. ,  2. ]])
>>> max_abs_scaler.scale_
array([2., 1., 2.])

2 Scaling sparse data

Centering sparse data would destroy the sparseness structure in the data, and thus is rarely a sensible thing to do. However, it can make sense to scale sparse inputs, especially when features are on different scales.

MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this. However, scale and StandardScaler can also accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised, as silently centering would break the sparsity and would often crash the execution by unintentionally allocating excessive amounts of memory. RobustScaler cannot be fitted to sparse inputs, but you can use its transform method on sparse inputs.

Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns formats (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the Compressed Sparse Rows representation. To avoid unnecessary memory copies, it is recommended to choose the CSR or CSC representation upstream.
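As a small sketch (assuming scipy is available; the name X_sparse is just for illustration), scaling a CSR matrix with MaxAbsScaler preserves its sparsity:

>>> import scipy.sparse as sp
>>> X_sparse = sp.csr_matrix([[ 1., -1.,  2.],
...                           [ 2.,  0.,  0.],
...                           [ 0.,  1., -1.]])
>>> X_sparse_scaled = preprocessing.MaxAbsScaler().fit_transform(X_sparse)
>>> X_sparse_scaled.toarray()  # same result as the dense case; zero entries are untouched
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])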

Finally, if the centered data is expected to be small enough, another option is to explicitly convert the sparse input to a dense array using the toarray method of sparse matrices.

3 Scaling data with outliers

If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.
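A brief sketch of RobustScaler, which centers each feature on its median and scales by its interquartile range (the name X_outliers is just for illustration):

>>> from sklearn.preprocessing import RobustScaler
>>> X_outliers = [[ 1., -2.,  2.],
...               [-2.,  1.,  3.],
...               [ 4.,  1., -2.]]
>>> RobustScaler().fit_transform(X_outliers)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])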

References:

Further discussion on the importance of centering and scaling data is available in this FAQ: Should I normalize/standardize/rescale the data?

Scaling vs Whitening: it is sometimes not enough to center and scale features independently, since a downstream model can further make assumptions about the linear independence of the features (linear correlations between features are detrimental to the model's learning process).

To address this issue you can use sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True to further remove the linear correlation across features.
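As a minimal sketch, reusing X_train from the first example (the exact components depend on the data, so no output is asserted here):

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2, whiten=True)
>>> X_whitened = pca.fit_transform(X_train)
>>> cov = np.cov(X_whitened.T)  # approximately the identity: decorrelated, unit-variance components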

Scaling target variables in regression

scale and StandardScaler can work with 1d arrays directly. This is very useful for scaling the target/response variables used in regression.
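For example, standardizing a 1d target vector directly with scale (the name y is just for illustration):

>>> y = np.array([10., 20., 30.])
>>> preprocessing.scale(y)
array([-1.22474487,  0.        ,  1.22474487])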

4 Centering kernel matrices

If you have a kernel matrix of a kernel K that computes a dot product in a feature space defined by a function phi, a KernelCenterer can transform the kernel matrix so that it contains the inner products in the feature space defined by phi after removing the mean in that space.
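A short sketch with a linear kernel, where centering the kernel matrix corresponds to centering the data in input space (the name X_k is just for illustration):

>>> from sklearn.preprocessing import KernelCenterer
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X_k = [[ 1., -2.,  2.],
...        [-2.,  1.,  3.],
...        [ 4.,  1., -2.]]
>>> K = pairwise_kernels(X_k, metric='linear')
>>> KernelCenterer().fit_transform(K)
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])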

5 Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

The normalize function provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')

>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
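The same call with the l1 norm instead divides each row by its sum of absolute values:

>>> preprocessing.normalize(X, norm='l1')
array([[ 0.25, -0.25,  0.5 ],
       [ 1.  ,  0.  ,  0.  ],
       [ 0.  ,  0.5 , -0.5 ]])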

The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless, as this operation treats samples independently).

This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:

>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')

The normalizer instance can then be used on sample vectors as any transformer:

>>> normalizer.transform(X)
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

>>> normalizer.transform([[-1.,  1., 0.]])
array([[-0.70...,  0.70...,  0.  ...]])

Sparse input

Both the normalize function and the Normalizer class accept dense array-like data as well as sparse matrices from scipy.sparse as input.

For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being handed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
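For instance (assuming scipy is available), normalize keeps a CSR input sparse, reusing X from the example above:

>>> import scipy.sparse as sp
>>> X_sparse = sp.csr_matrix(X)
>>> preprocessing.normalize(X_sparse, norm='l2').toarray()
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])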

 

Related APIs:

class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

Parameters
feature_range tuple (min, max), default=(0, 1)

Desired range of transformed data.

copy bool, default=True

Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

clip bool, default=False

Set to True to clip transformed values of held-out data to provided feature range.

New in version 0.24.

Attributes
min_ ndarray of shape (n_features,)

Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_

scale_ ndarray of shape (n_features,)

Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

New in version 0.17: scale_ attribute.

data_min_ ndarray of shape (n_features,)

Per feature minimum seen in the data

New in version 0.17: data_min_

data_max_ ndarray of shape (n_features,)

Per feature maximum seen in the data

New in version 0.17: data_max_

data_range_ ndarray of shape (n_features,)

Per feature range (data_max_ - data_min_) seen in the data

New in version 0.17: data_range_

n_samples_seen_ int

The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.

 

Methods

fit(X[, y])

Compute the minimum and maximum to be used for later scaling.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Undo the scaling of X according to feature_range.

partial_fit(X[, y])

Online computation of min and max on X for later scaling.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Scale features of X according to feature_range.

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
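A quick sketch of the clip parameter (available from 0.24) and of incremental fitting with partial_fit, reusing data from above:

>>> scaler = MinMaxScaler(clip=True)
>>> print(scaler.fit(data).transform([[2, 2]]))  # out-of-range values are clipped to feature_range
[[1. 0.]]
>>> scaler = MinMaxScaler()
>>> scaler = scaler.partial_fit(data[:2])  # learn min/max in batches
>>> scaler = scaler.partial_fit(data[2:])
>>> print(scaler.data_max_)
[ 1. 18.]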

class sklearn.preprocessing.MaxAbsScaler(*, copy=True)

Scale each feature by its maximum absolute value.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

This scaler can also be applied to sparse CSR or CSC matrices.

New in version 0.17.

Parameters
copy bool, default=True

Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).

Attributes
scale_ ndarray of shape (n_features,)

Per feature relative scaling of the data.

New in version 0.17: scale_ attribute.

max_abs_ ndarray of shape (n_features,)

Per feature maximum absolute value.

n_samples_seen_ int

The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

 

Methods

fit(X[, y])

Compute the maximum absolute value to be used for later scaling.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Scale back the data to the original representation

partial_fit(X[, y])

Online computation of max absolute value of X for later scaling.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Scale the data

Examples

>>> from sklearn.preprocessing import MaxAbsScaler
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
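Since this scaler only divides each feature by its maximum absolute value, inverse_transform simply multiplies it back; a quick round-trip check:

>>> transformer.inverse_transform(transformer.transform(X))
array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])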

 

class sklearn.preprocessing.Normalizer(norm='l2', *, copy=True)

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation for text classification or clustering for instance. For instance the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

Read more in the User Guide.

Parameters
norm {‘l1’, ‘l2’, ‘max’}, default=’l2’

The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.

copy bool, default=True

set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

Methods

fit(X[, y])

Do nothing and return the estimator unchanged

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X[, copy])

Scale each non zero row of X to unit norm

Examples

>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
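For comparison, norm='max' rescales each row by the maximum of its absolute values instead (fit_transform is used since the class is stateless):

>>> Normalizer(norm='max').fit_transform(X)
array([[1.        , 0.25      , 0.5       , 0.5       ],
       [0.11111111, 0.33333333, 1.        , 0.33333333],
       [0.71428571, 1.        , 0.71428571, 0.14285714]])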

 

