Data Preprocessing in sklearn: When to Use Standardization and Normalization


RESCALING attribute data to the range [0, 1] or [−1, 1] is useful for optimization algorithms, such as gradient descent, used within machine learning algorithms that weight inputs (e.g. regression and neural networks). Rescaling is also used for algorithms that rely on distance measurements, for example K-Nearest-Neighbors (KNN). Rescaling like this is sometimes called "normalization". The MinMaxScaler class in Python scikit-learn does this.

NORMALIZING attribute data rescales the components of a feature vector so that the complete vector has length 1. This is "scaling by unit length". It usually means dividing each component of the feature vector by the Euclidean length of the vector, but Manhattan or other distance measurements can also be used. This preprocessing method is useful for sparse feature vectors and for algorithms that use distance to learn, such as KNN. The Normalizer class in Python scikit-learn can be used for this.

STANDARDIZING attribute data is also a preprocessing method, but it assumes a Gaussian distribution of input features. It "standardizes" to a mean of 0 and a standard deviation of 1. This works better with linear regression, logistic regression, and linear discriminant analysis. The StandardScaler class in Python scikit-learn works for this.
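As a quick side-by-side illustration of the three methods above, here is a minimal sketch (the toy data is chosen for illustration and is not taken from the sources cited below):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Rescaling: each column is mapped to [0, 1]
print(MinMaxScaler().fit_transform(X))

# Normalizing: each row is scaled to unit (L2) length
print(Normalizer(norm='l2').fit_transform(X))

# Standardizing: each column gets mean 0 and standard deviation 1
print(StandardScaler().fit_transform(X))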

 

From the sklearn documentation:

Examples using sklearn.preprocessing.StandardScaler

 

From https://www.programcreek.com/python/example/82501/sklearn.preprocessing.MinMaxScaler (author's note: the SVM example may be wrong!)

Python sklearn.preprocessing.MinMaxScaler() Examples

The following code examples show how to use sklearn.preprocessing.MinMaxScaler(). They are extracted from open source Python projects (the source page lists 50; two are reproduced below).

 
Example 1
Project: sef, Author: passalis, File: classification.py
# Requires (in the source project): from sklearn import svm, grid_search
# and from sklearn.preprocessing import MinMaxScaler; note that in newer
# scikit-learn GridSearchCV lives in sklearn.model_selection.
def evaluate_svm(train_data, train_labels, test_data, test_labels, n_jobs=-1):
    """
    Evaluates a representation using a Linear SVM.
    It uses 3-fold cross validation for selecting the C parameter.
    :param train_data:
    :param train_labels:
    :param test_data:
    :param test_labels:
    :param n_jobs:
    :return: the test accuracy
    """
    # Scale data to 0-1
    scaler = MinMaxScaler()
    train_data = scaler.fit_transform(train_data)
    test_data = scaler.transform(test_data)

    parameters = {'kernel': ['linear'],
                  'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
    model = svm.SVC(max_iter=10000)
    clf = grid_search.GridSearchCV(model, parameters, n_jobs=n_jobs, cv=3)
    clf.fit(train_data, train_labels)

    lin_svm_test = clf.score(test_data, test_labels)
    return lin_svm_test
 
Example 2
Project: golden_touch, Author: at553, File: predict.py
def train_model(self):
    # scale
    scaler = MinMaxScaler(feature_range=(0, 1))
    dataset = scaler.fit_transform(self.data)

    # split into train and test sets
    train_size = int(len(dataset) * 0.95)
    train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]
    look_back = 5
    trainX, trainY = self.create_dataset(train, look_back)

    # reshape input to be [samples, time steps, features]
    trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))

    # create and fit the LSTM network
    model = Sequential()
    model.add(LSTM(6, input_dim=look_back))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(trainX, trainY, nb_epoch=100, batch_size=1, verbose=2)
    return model

The official DBSCAN clustering example uses StandardScaler:

http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py

Demo of DBSCAN clustering algorithm

Finds core samples of high density and expands clusters from them.

[Figure: DBSCAN clustering result on the generated blobs]

Out:

Estimated number of clusters: 3
Homogeneity: 0.953
Completeness: 0.883
V-measure: 0.917
Adjusted Rand Index: 0.952
Adjusted Mutual Information: 0.883
Silhouette Coefficient: 0.626
 
print(__doc__)

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

https://chrisalbon.com/machine_learning/clustering/k-means_clustering/ — the iris clustering example here also uses StandardScaler.

k-Means Clustering

20 Dec 2017

Preliminaries

# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Load Iris Flower Dataset

# Load data
iris = datasets.load_iris()
X = iris.data

Standardize Features

# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

Conduct k-Means Clustering

# Create k-means object
clt = KMeans(n_clusters=3, random_state=0, n_jobs=-1)

# Train model
model = clt.fit(X_std)

Show Each Observation’s Cluster Membership

# View predicted classes
model.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2,
       0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0,
       2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)

Create New Observation

# Create new observation
new_observation = [[0.8, 0.8, 0.8, 0.8]]

Predict Observation’s Cluster

# Predict observation's cluster
model.predict(new_observation)
array([0], dtype=int32)

View Centers Of Each Cluster

# View cluster centers
model.cluster_centers_
array([[ 1.13597027,  0.09659843,  0.996271  ,  1.01717187],
       [-1.01457897,  0.84230679, -1.30487835, -1.25512862],
       [-0.05021989, -0.88029181,  0.34753171,  0.28206327]])

For details, see: http://d0evi1.com/sklearn/preprocessing/

Standardization

After the transformation, each feature has zero mean and unit variance. This is also called z-score normalization (zero-mean normalization). It is computed by subtracting the mean from each feature value and dividing by the standard deviation.

import numpy as np
from sklearn.preprocessing import scale
X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
scale(X)

Usually you either standardize the train and test sets together, or fit the standardization on the train set and then apply the same fitted scaler to the test set. For the latter, use a scaler object:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)

Min-Max Scaling

Min-max scaling applies a linear transformation to the original data, mapping it to the [0, 1] interval (or any other interval with fixed minimum and maximum values).

min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)

Normalization

Normalization maps values with different ranges of variation into the same fixed range, most commonly [0, 1]. (Machine Learning, Zhou Zhihua)

X = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
sklearn.preprocessing.normalize(X, norm='l2')
# array([[ 0.40, -0.40,  0.81],
#        [ 1.  ,  0.  ,  0.  ],
#        [ 0.  ,  0.70, -0.70]])

Notice that for every sample, 0.40^2 + 0.40^2 + 0.81^2 ≈ 1. This is the L2 norm: after the transformation, the sum of squares of each sample's features is 1. Similarly, with the L1 norm the sum of absolute values of each sample's features is 1. There is also the max norm, which divides each sample's features by the maximum feature value of that sample.
When measuring similarity between samples, if a quadratic (dot-product) kernel is used, Normalization is needed.
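For comparison, a short sketch (reusing the same toy X as above) of the l1 and max variants of normalize:

from sklearn.preprocessing import normalize

X = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]

# l1: the absolute values in each row sum to 1
normalize(X, norm='l1')
# array([[ 0.25, -0.25,  0.5 ],
#        [ 1.  ,  0.  ,  0.  ],
#        [ 0.  ,  0.5 , -0.5 ]])

# max: each row is divided by its largest feature value
normalize(X, norm='max')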


Feature Binarization

Given a threshold, convert feature values to 0/1:

binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
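A runnable version of this snippet (reusing the toy X from the normalization example above) shows the effect of the 1.1 threshold:

from sklearn.preprocessing import Binarizer

X = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]

# values strictly greater than the threshold become 1, everything else becomes 0
Binarizer(threshold=1.1).fit_transform(X)
# array([[0., 0., 1.],
#        [1., 0., 0.],
#        [0., 0., 0.]])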

Label Binarization

from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])
lb.classes_
# array([1, 2, 4, 6])
lb.transform([1, 6])  # values must come from the labels used in fit: [1, 2, 6, 4, 2]
# array([[1, 0, 0, 0],
#        [0, 0, 0, 1]])

Categorical Feature Encoding
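This heading has no example in the original; a minimal sketch using sklearn's OneHotEncoder (the data here is illustrative):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# each column is expanded into one indicator column per observed category
enc.transform([[0, 1, 3]]).toarray()
# array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])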

Label Encoding
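Likewise left empty in the original; a minimal LabelEncoder sketch (illustrative labels):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.classes_
# ['amsterdam', 'paris', 'tokyo']
le.transform(["tokyo", "tokyo", "paris"])
# array([2, 2, 1])
le.inverse_transform([2, 2, 1])
# ['tokyo', 'tokyo', 'paris']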

Data with Outliers

sklearn.preprocessing.robust_scale
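A minimal sketch (a toy column with one outlier, chosen for illustration) comparing robust_scale with ordinary scale:

import numpy as np
from sklearn.preprocessing import robust_scale, scale

X = np.array([[1.], [2.], [3.], [4.], [100.]])  # 100. is an outlier

# robust_scale centers on the median and scales by the interquartile range,
# so the outlier distorts the result far less than with z-score scaling
print(robust_scale(X))
print(scale(X))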

Generating Polynomial Features

Original features: (X1, X2)

After transformation: (1, X1, X2, X1^2, X1*X2, X2^2)

 

poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
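A complete runnable version of the snippet above (the small 2-feature X shown here stands in for the images lost from the original):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
# array([[0, 1],
#        [2, 3],
#        [4, 5]])

poly = PolynomialFeatures(2)
poly.fit_transform(X)
# columns are 1, X1, X2, X1^2, X1*X2, X2^2
# array([[ 1.,  0.,  1.,  0.,  0.,  1.],
#        [ 1.,  2.,  3.,  4.,  6.,  9.],
#        [ 1.,  4.,  5., 16., 20., 25.]])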

 

http://shujuren.org/article/234.html

 

 

I. Standardization (Z-score), i.e. removing the mean and scaling to unit variance

The formula is (X - mean) / std, computed separately for each attribute/column.

For each attribute (column), subtract its mean and divide by its standard deviation. The result is that, for every attribute/column, the data is centered around 0 with variance 1.

There are two ways to do this:

  • Use the sklearn.preprocessing.scale() function to standardize the given data directly.

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

>>> # mean and std of the scaled data
>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
  • Use the sklearn.preprocessing.StandardScaler class. The advantage of this class is that it stores the parameters learned from the training set (mean, standard deviation), so the same object can be used directly to transform test data.

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_
array([ 1.  ...,  0.  ...,  0.33...])

>>> scaler.std_
array([ 0.81...,  0.81...,  1.24...])

>>> scaler.transform(X)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

>>> # the fitted scaler can be used directly to transform test data
>>> scaler.transform([[-1.,  1.,  0.]])
array([[-2.44...,  1.22..., -0.26...]])

 

II. Scaling attributes to a specified range

Besides the method above, another common approach is to scale attributes to a specified minimum and maximum value (usually [0, 1]), which can be done with the preprocessing.MinMaxScaler class.

The purposes of this method include:

1. Improving the stability of attributes with very small variance.

2. Preserving zero entries in sparse matrices.

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

>>> # apply the same scaling to the test data
>>> X_test = np.array([[-3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])

>>> # scaling factors and related attributes
>>> min_max_scaler.scale_
array([ 0.5 ,  0.5 ,  0.33...])

>>> min_max_scaler.min_
array([ 0.  ,  0.5 ,  0.33...])

Of course, when constructing the scaler object you can also specify the target range directly with feature_range=(min, max), in which case the formula becomes:

 

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
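For example (a minimal sketch reusing the X_train above), feature_range=(-1, 1) maps every column into [-1, 1]:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# same linear transformation as above, but targeting [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit_transform(X_train)
# array([[ 0., -1.,  1.        ],
#        [ 1.,  0., -0.33333333],
#        [-1.,  1., -1.        ]])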

 

III. Normalization

Normalization scales each sample to unit norm (each sample's norm becomes 1). This method is useful if you will later compute similarity between samples using a quadratic form such as the dot product, or other kernel methods.

The main idea of Normalization is to compute the p-norm of each sample and then divide every element of that sample by the norm, so that each processed sample has p-norm (l1-norm or l2-norm) equal to 1.

The p-norm is computed as ||X||_p = (|x1|^p + |x2|^p + ... + |xn|^p)^(1/p)

This method is mainly used in text classification and clustering. For example, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the two vectors.
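To make the cosine-similarity point concrete, a small sketch (two toy vectors standing in for TF-IDF rows):

import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[1., 2., 0.]])
b = np.array([[2., 1., 1.]])

# after l2 normalization, the dot product of the two rows is their cosine similarity
cos_via_normalize = np.dot(normalize(a, norm='l2'), normalize(b, norm='l2').T)

# the same value computed directly from the definition
cos_direct = np.dot(a, b.T) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos_via_normalize, cos_direct)  # both are approximately 0.73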

1. The preprocessing.normalize() function can be used to transform the given data directly:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')

>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

 

2. The preprocessing.Normalizer() class can be used to fit on the training set and then transform both training and test sets:

>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')

>>> normalizer.transform(X)
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

>>> normalizer.transform([[-1.,  1.,  0.]])
array([[-0.70...,  0.70...,  0.  ...]])

 


