sklearn, a machine learning toolkit for Python: data preprocessing

Reposted article. 2016-04-26 22:31, Python

1. Standardization (Mean Removal and Variance Scaling)

After standardization, each feature (column) has zero mean and unit variance.

from sklearn import preprocessing
X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
# Center each column to zero mean and scale it to unit variance.
X_scaled = preprocessing.scale(X)
print(X_scaled)
#[[ 0.         -1.22474487  1.33630621]
# [ 1.22474487  0.         -0.26726124]
# [-1.22474487  1.22474487 -1.06904497]]
print(X_scaled.mean(axis=0))
#[ 0.  0.  0.]
print(X_scaled.std(axis=0))
#[ 1.  1.  1.]
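
To make the arithmetic behind the scaling explicit, here is a minimal NumPy sketch of the same computation: subtract each column's mean and divide by its population standard deviation (ddof=0). The names col_mean, col_std and X_manual are illustrative only and are not part of sklearn.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-column mean and population standard deviation (ddof=0 is NumPy's default).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)

# Standardize: every column ends up with zero mean and unit variance.
X_manual = (X - col_mean) / col_std
print(X_manual)
```

The printed array should match the X_scaled values shown above.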

The same result can be obtained with the Scaler utility class from the preprocessing module (named StandardScaler in version 0.15 and later):

# fit() learns the per-column mean and standard deviation from X.
scaler = preprocessing.StandardScaler().fit(X)
print(scaler)
#StandardScaler(copy=True, with_mean=True, with_std=True)
print(scaler.mean_)
#[ 1.          0.          0.33333333]
print(scaler.scale_)  # called scaler.std_ in earlier versions
#[ 0.81649658  0.81649658  1.24721913]
print(scaler.transform(X))
#[[ 0.         -1.22474487  1.33630621]
# [ 1.22474487  0.         -0.26726124]
# [-1.22474487  1.22474487 -1.06904497]]
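
One practical reason to keep a fitted scaler object around is that the statistics it learned can be reused on data it has not seen. The sketch below assumes a made-up sample X_new (not from the original article) and also uses inverse_transform to map the scaled values back.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
scaler = preprocessing.StandardScaler().fit(X)

# Hypothetical new sample, scaled with the mean and scale learned from X.
X_new = np.array([[-1., 1., 0.]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)

# inverse_transform undoes the scaling and recovers the original values.
X_new_restored = scaler.inverse_transform(X_new_scaled)
print(X_new_restored)
```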

Note: the code above is equivalent to the following:

# fit_transform() returns the scaled array, not the scaler object.
X_scaled = preprocessing.StandardScaler().fit_transform(X)
print(X_scaled)
#[[ 0.         -1.22474487  1.33630621]
# [ 1.22474487  0.         -0.26726124]
# [-1.22474487  1.22474487 -1.06904497]]
print(X_scaled.mean(axis=0))
#[ 0.  0.  0.]
print(X_scaled.std(axis=0))
#[ 1.  1.  1.]

2. Normalization

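Normalization rescales each sample (row) of the dataset to unit norm (the L2 norm by default), so every value ends up in the range [-1, 1].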

X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
# Scale each row to unit L2 norm.
X_normalized = preprocessing.normalize(X)
print(X_normalized)
#[[ 0.40824829 -0.40824829  0.81649658]
# [ 1.          0.          0.        ]
# [ 0.          0.70710678 -0.70710678]]
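
For readers who want to see what happens per row, the following NumPy sketch divides every sample by its own Euclidean (L2) length. The names row_norms and X_manual are illustrative, and the sketch assumes no row is all zeros (which would cause a division by zero here).

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# L2 norm of each row, kept as a column vector so it broadcasts across the row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Divide each sample by its own length.
X_manual = X / row_norms
print(X_manual)

# Every row now has length 1.
print(np.linalg.norm(X_manual, axis=1))
```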

This is equivalent to:

normalizer = preprocessing.Normalizer().fit(X)
print(normalizer)
#Normalizer(copy=True, norm='l2')
print(normalizer.transform(X))
#[[ 0.40824829 -0.40824829  0.81649658]
# [ 1.          0.          0.        ]
# [ 0.          0.70710678 -0.70710678]]

Note: the code above is equivalent to the following:

# fit_transform() returns the normalized array, not the Normalizer object.
X_normalized = preprocessing.Normalizer().fit_transform(X)
print(X_normalized)
#[[ 0.40824829 -0.40824829  0.81649658]
# [ 1.          0.          0.        ]
# [ 0.          0.70710678 -0.70710678]]
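
The printed Normalizer shows a norm='l2' parameter. As a small sketch of what the other choices do, preprocessing.normalize also accepts norm='l1' (each row's absolute values sum to 1) and norm='max' (each row is divided by its largest absolute value). The variable names below are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# L1 norm: the absolute values in each row sum to 1.
X_l1 = preprocessing.normalize(X, norm='l1')
print(X_l1)

# Max norm: the largest absolute value in each row becomes 1.
X_max = preprocessing.normalize(X, norm='max')
print(X_max)
```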

3. Binarization

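Binarization converts numeric data into two-valued (0/1) data according to a configurable threshold: values greater than the threshold become 1, and all other values become 0.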

X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # the default threshold is 0.0
print(binarizer)
#Binarizer(copy=True, threshold=0.0)
print(binarizer.transform(X))
#[[ 1.  0.  1.]
# [ 1.  0.  0.]
# [ 0.  1.  0.]]

binarizer = preprocessing.Binarizer(threshold=1.1)  # set the threshold to 1.1
print(binarizer.transform(X))
#[[ 0.  0.  1.]
# [ 1.  0.  0.]
# [ 0.  0.  0.]]
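
The thresholding rule itself is a one-line comparison. The sketch below reproduces both results above in plain NumPy; binarize_manual is a hypothetical helper written for illustration, not an sklearn function.

```python
import numpy as np

def binarize_manual(X, threshold=0.0):
    """Return 1.0 where a value is strictly greater than threshold, else 0.0."""
    return (np.asarray(X) > threshold).astype(float)

X = [[1., -1., 2.],
     [2., 0., 0.],
     [0., 1., -1.]]

B_default = binarize_manual(X)               # threshold 0.0
B_custom = binarize_manual(X, threshold=1.1)
print(B_default)
print(B_custom)
```

Note that values equal to the threshold map to 0, which is why the zeros in X stay 0 under the default threshold.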

4. Label preprocessing

4.1) Label binarization

LabelBinarizer is typically used to build a label indicator matrix from a list of multiclass labels.

lb = preprocessing.LabelBinarizer()
print(lb.fit([1, 2, 6, 4, 2]))
#LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
print(lb.classes_)  # the distinct labels, sorted
#[1 2 4 6]
print(lb.transform([1, 6]))  # one column per class
#[[1 0 0 0]
# [0 0 0 1]]
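
The indicator matrix can be mapped back to the original labels. This short sketch assumes the same fitted labels as above and uses LabelBinarizer.inverse_transform; the names indicator and restored are illustrative.

```python
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

# Each row has a single 1 in the column of its class (columns follow lb.classes_).
indicator = lb.transform([1, 6])
print(indicator)

# inverse_transform picks, for each row, the class of the column that is set.
restored = lb.inverse_transform(indicator)
print(restored)
```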

4.2) Label encoding

le = preprocessing.LabelEncoder()
print(le.fit([1, 2, 2, 6]))
#LabelEncoder()
print(le.classes_)  # the distinct labels, sorted
#[1 2 6]
print(le.transform([1, 1, 2, 6]))  # each label becomes its index in classes_
#[0 0 1 2]
print(le.inverse_transform([0, 0, 1, 2]))
#[1 1 2 6]

It can also be used to convert non-numeric labels into numeric labels:

le = preprocessing.LabelEncoder()
print(le.fit(["paris", "paris", "tokyo", "amsterdam"]))
#LabelEncoder()
print(list(le.classes_))
#['amsterdam', 'paris', 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))
#[2 2 1]
print(list(le.inverse_transform([2, 2, 1])))
#['tokyo', 'tokyo', 'paris']
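
One behavior worth knowing when using the encoder: it only knows the labels it saw during fit. The sketch below uses a made-up label, "berlin", that is not in the fitted set; transform is expected to raise a ValueError in that case. The flag unseen_rejected is illustrative.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

# "berlin" is a hypothetical label that was not present during fit.
try:
    le.transform(["berlin"])
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print("unseen label rejected:", err)
```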
