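When scaling features during preprocessing, apply exactly the same transformation to the training set and the test set. In other words, the scaling parameters (here, each feature's minimum and maximum) should be learned from the training set only and then reused on the test set.
The following example shows the effect of min-max scaling: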
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Build a synthetic dataset and split it into training and test sets
X, _ = make_blobs(n_samples=60, centers=5, random_state=7, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=9, test_size=0.1)

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(13, 4))

# Plot the unscaled training and test sets
axes[0].scatter(X_train[:, 0], X_train[:, 1], c='b', label='Training set', s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c='r', label='Test set', s=60)
axes[0].legend(loc=1)
axes[0].set_title('Original Data')

# Scale with MinMaxScaler: fit on the training set, then transform both sets
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Plot the correctly scaled data
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c='b', label='Training set', s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^', c='r', label='Test set', s=60)
axes[1].set_title('Scaled Data')

# Scale the test set on its own (the mistake being demonstrated)
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)
X_test_scaled_badly = test_scaler.transform(X_test)

# Plot the incorrectly scaled data
axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c='b', label='Training set', s=60)
axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^', c='r', label='Test set', s=60)
axes[2].set_title('Improperly Scaled Data')

# Label the axes of every panel
for ax in axes:
    ax.set_xlabel('Feature 0')
    ax.set_ylabel('Feature 1')

plt.show()
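The first two plots look identical, but the axis ticks have changed: after scaling, every training feature lies in the range 0 to 1, and the test points keep their positions relative to the training points. That is the effect of feature scaling. In the third plot the test set was scaled incorrectly. Because its own minimum and maximum were used, the test points are stretched to fill the 0 to 1 range on their own, which artificially changes how they sit relative to the training data.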
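To make the difference concrete with numbers, here is a minimal sketch on a hypothetical one-feature toy dataset (the values are made up for illustration and are not part of the example above). It checks that a MinMaxScaler fitted on the training set computes (x - train_min) / (train_max - train_min), and shows how refitting on the test set gives different values for the same points.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: a single feature, values chosen only for illustration
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.0], [12.0]])

# Correct: learn min and max from the training set, reuse them on the test set
scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# The same computation by hand, using training-set statistics only
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
manual = (X_test - train_min) / (train_max - train_min)

# Incorrect: fit a second scaler on the test set itself
X_test_scaled_badly = MinMaxScaler().fit_transform(X_test)

print(X_test_scaled.ravel())        # roughly 0.2 and 1.2
print(X_test_scaled_badly.ravel())  # 0 and 1
```

With the correct scaler, the test value 12 maps to about 1.2, outside the 0 to 1 range, because it exceeds the training maximum. That is expected and harmless. The separately fitted scaler forces the same two points to 0 and 1, so they no longer line up with the scaled training data.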