1. Standardization (Mean Removal and Variance Scaling)
After standardization, each feature (column) of the data has zero mean and unit variance.
from sklearn import preprocessing

X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
X_scaled = preprocessing.scale(X)
print(X_scaled)
# [[ 0.         -1.22474487  1.33630621]
#  [ 1.22474487  0.         -0.26726124]
#  [-1.22474487  1.22474487 -1.06904497]]
print(X_scaled.mean(axis=0))
# [ 0. 0. 0.]
print(X_scaled.std(axis=0))
# [ 1. 1. 1.]
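To make the zero-mean, unit-variance claim concrete, the following is a minimal sketch that reproduces the result by hand with NumPy. It assumes the standard deviation is the population one (`ddof=0`, NumPy's default); the names `mean`, `std`, and `X_manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-column mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column.
X_manual = (X - mean) / std

# Compare with preprocessing.scale, allowing for floating-point error.
print(np.allclose(X_manual, preprocessing.scale(X)))
```

This sketch would divide by zero for a constant column, a case it does not attempt to handle.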
The same result can also be obtained with the Scaler utility class provided by the preprocessing module (named StandardScaler in versions after 0.15):
scaler = preprocessing.StandardScaler().fit(X)
print(scaler)
# StandardScaler(copy=True, with_mean=True, with_std=True)
print(scaler.mean_)
# [ 1. 0. 0.33333333]
print(scaler.scale_)  # named scaler.std_ in earlier versions
# [ 0.81649658 0.81649658 1.24721913]
print(scaler.transform(X))
# [[ 0.         -1.22474487  1.33630621]
#  [ 1.22474487  0.         -0.26726124]
#  [-1.22474487  1.22474487 -1.06904497]]
Note: the code above is equivalent to the following.
# fit_transform returns the scaled array, not a scaler object
X_scaled = preprocessing.StandardScaler().fit_transform(X)
print(X_scaled)
# [[ 0.         -1.22474487  1.33630621]
#  [ 1.22474487  0.         -0.26726124]
#  [-1.22474487  1.22474487 -1.06904497]]
print(X_scaled.mean(axis=0))
# [ 0. 0. 0.]
print(X_scaled.std(axis=0))
# [ 1. 1. 1.]
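One practical reason to keep a fitted StandardScaler object around, rather than calling `scale` directly, is that the mean and scale it learned can be reused on other data. The sketch below illustrates this; `X_train` and `X_new` are hypothetical names, and the values in `X_new` are made up for the example.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_new = np.array([[-1., 1., 0.]])  # hypothetical sample not seen during fit

# Learn the per-column mean and scale from the training data only.
scaler = preprocessing.StandardScaler().fit(X_train)

# Apply the training statistics to the new sample.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Because the new sample is scaled with the training statistics, its transformed columns are not expected to have zero mean or unit variance on their own.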
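2. Normalization
Normalization rescales each sample (each row) of the dataset to unit norm, using the L2 norm by default, so every value in a sample falls within [-1, 1].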
X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
X_normalized = preprocessing.normalize(X)
print(X_normalized)
# [[ 0.40824829 -0.40824829  0.81649658]
#  [ 1.          0.          0.        ]
#  [ 0.          0.70710678 -0.70710678]]
This is equivalent to:
normalizer = preprocessing.Normalizer().fit(X)
print(normalizer)
# Normalizer(copy=True, norm='l2')
print(normalizer.transform(X))
# [[ 0.40824829 -0.40824829  0.81649658]
#  [ 1.          0.          0.        ]
#  [ 0.          0.70710678 -0.70710678]]
Note: the code above is equivalent to the following.
# fit_transform returns the normalized array, not a Normalizer object
X_normalized = preprocessing.Normalizer().fit_transform(X)
print(X_normalized)
# [[ 0.40824829 -0.40824829  0.81649658]
#  [ 1.          0.          0.        ]
#  [ 0.          0.70710678 -0.70710678]]
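The default L2 normalization can be sketched by hand as dividing each row by its Euclidean length. The names `row_norms` and `X_manual` are illustrative, and the sketch assumes no row consists entirely of zeros.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Euclidean (L2) norm of every row, kept as a column for broadcasting.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Divide each row by its own norm so that it has length 1.
X_manual = X / row_norms

# Compare with preprocessing.normalize, allowing for floating-point error.
print(np.allclose(X_manual, preprocessing.normalize(X)))
```

Unlike standardization, this operates on rows (samples) rather than columns (features), so no statistics need to be learned from the data.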
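3. Binarization
Binarization converts numeric data into binary (Boolean-style) values according to a configurable threshold: values greater than the threshold become 1, and all other values become 0.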
X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # default threshold is 0.0
print(binarizer)
# Binarizer(copy=True, threshold=0.0)
print(binarizer.transform(X))
# [[ 1. 0. 1.]
#  [ 1. 0. 0.]
#  [ 0. 1. 0.]]
binarizer = preprocessing.Binarizer(threshold=1.1)  # set the threshold to 1.1
print(binarizer.transform(X))
# [[ 0. 0. 1.]
#  [ 1. 0. 0.]
#  [ 0. 0. 0.]]
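The thresholding rule can be written as a plain NumPy comparison, which may help make the behavior explicit. This is a sketch only; `threshold` and `X_manual` are names chosen for the example.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Values strictly greater than the threshold map to 1, all others to 0.
X_manual = (X > threshold).astype(X.dtype)

# Compare with the Binarizer transformer using the same threshold.
X_binarized = preprocessing.Binarizer(threshold=threshold).transform(X)
print(np.array_equal(X_manual, X_binarized))
```

Note that a value exactly equal to the threshold maps to 0, because the comparison is strict.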
4. Label preprocessing
4.1) Label binarization
LabelBinarizer is typically used to build a label indicator matrix from a list of multiclass labels.
lb = preprocessing.LabelBinarizer()
print(lb.fit([1, 2, 6, 4, 2]))
# LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
print(lb.classes_)
# [1 2 4 6]
print(lb.transform([1, 6]))
# [[1 0 0 0]
#  [0 0 0 1]]
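The indicator matrix can be mapped back to the original labels with `inverse_transform`. The short sketch below shows the round trip; the variable names `encoded` and `decoded` are illustrative.

```python
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

# Each row has a single 1 in the column of its class (columns follow lb.classes_).
encoded = lb.transform([1, 6])

# Recover the original labels from the indicator rows.
decoded = lb.inverse_transform(encoded)
print(decoded)
```

Each column of the matrix corresponds to one entry of `lb.classes_`, in sorted order.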
4.2) Label encoding
le = preprocessing.LabelEncoder()
print(le.fit([1, 2, 2, 6]))
# LabelEncoder()
print(le.classes_)
# [1 2 6]
print(le.transform([1, 1, 2, 6]))
# [0 0 1 2]
print(le.inverse_transform([0, 0, 1, 2]))
# [1 1 2 6]
It can also be used to convert non-numeric labels into numeric labels:
le = preprocessing.LabelEncoder()
print(le.fit(["paris", "paris", "tokyo", "amsterdam"]))
# LabelEncoder()
print(list(le.classes_))
# ['amsterdam', 'paris', 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))
# [2 2 1]
print(list(le.inverse_transform([2, 2, 1])))
# ['tokyo', 'tokyo', 'paris']
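One caveat worth keeping in mind: a fitted LabelEncoder only knows the labels it saw during `fit`. The sketch below passes a label that was not in the fitted set ("rome", a made-up value for this example) and catches the resulting error; `error_message` is an illustrative name.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

error_message = None
try:
    # "rome" was not present when the encoder was fitted.
    le.transform(["tokyo", "rome"])
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In practice this suggests fitting the encoder on the full set of labels expected at prediction time.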