(Original) (2) Machine Learning Notes: Data Preprocessing


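Data preprocessing

Data preprocessing generally includes:

(1) Standardization

This is the most common preprocessing step: all samples of a given feature are transformed so that the feature has mean 0 and variance 1.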

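Standardization (the z-score) shifts and rescales the data to zero mean and unit variance; it does not by itself make the data normally distributed.

Each feature dimension is processed separately:

$$x' = \frac{x - \mu}{\sigma}$$

where

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$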

You can call StandardScaler() from sklearn.preprocessing to standardize the data.
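As a minimal sketch of the standardization described above, the following compares a manual z-score with StandardScaler on a small made-up array (the data and the variable names are assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Manual z-score, computed separately for each feature (column).
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
X_manual = (X - mu) / sigma

# The same transformation with scikit-learn.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```

Both results should agree up to floating-point error, since StandardScaler divides by the population standard deviation.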

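(2) Min-max scaling

All sample values of a given feature are rescaled into a specified range (usually [-1,1] or [0,1]).

The scaling formula for the range [0,1] is:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$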

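You can call MinMaxScaler() from sklearn.preprocessing to map the data into [0,1] (its default range), or MaxAbsScaler() to map the data into [-1,1] by dividing each feature by its maximum absolute value.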
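The two scalers can be sketched as follows, on a small made-up array (the data is an assumption for illustration). The manual min-max computation is included only to show that it matches MinMaxScaler with its default range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Toy data (made up for illustration): 3 samples, 2 features.
X = np.array([[-4.0, 1.0],
              [ 0.0, 3.0],
              [ 2.0, 5.0]])

# Min-max scaling: each feature (column) is mapped to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Max-abs scaling: each feature is divided by its largest absolute value,
# so the result lies in [-1, 1] and zeros stay zero.
X_maxabs = MaxAbsScaler().fit_transform(X)

print(X_minmax)
print(X_maxabs)
```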

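(3) Normalization

Each sample vector is rescaled so that its norm (length) becomes 1. For the L2 norm, with x denoting one sample's feature vector:

$$x' = \frac{x}{\lVert x \rVert_2}$$

You can call Normalizer() from sklearn.preprocessing to do this.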
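A minimal sketch of this, assuming made-up data and the L2 norm: note that Normalizer works on each sample (row), not on each feature (column).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (made up for illustration): 3 samples, 2 features.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# Rescale every row to unit L2 norm.
X_unit = Normalizer(norm="l2").fit_transform(X)

# Each row now has length 1.
row_norms = np.linalg.norm(X_unit, axis=1)
print(X_unit)
print(row_norms)
```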

(4) Binarization

Feature values are converted to 0 or 1 according to a threshold.
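One way to sketch this is with Binarizer from sklearn.preprocessing, which is not mentioned in the original notes and is used here as an assumption; the data and the threshold of 0.5 are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data (made up for illustration).
X = np.array([[0.2,  1.5],
              [0.5,  0.5],
              [0.9, -1.0]])

# Values strictly greater than the threshold become 1, all others become 0.
X_bin = Binarizer(threshold=0.5).fit_transform(X)
print(X_bin)
```

Values exactly equal to the threshold map to 0, because the comparison is strict.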

(5) Missing-value handling

Missing feature values are filled in (imputed). Common choices are the mean, the median, or the mode of the feature.
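The three imputation strategies can be sketched with SimpleImputer from sklearn.impute (available in scikit-learn 0.20 and later). The original notes do not name a specific class, so this choice and the toy data are assumptions:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration); np.nan marks the missing entries.
X = np.array([[1.0,    2.0],
              [np.nan, 2.0],
              [3.0,    np.nan],
              [5.0,    8.0]])

# Each strategy is computed per feature (column) from the observed values.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(X_mean)
print(X_median)
print(X_mode)
```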

(6) Outlier handling

Outlier samples are removed.
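The notes do not say how outliers are detected. As one possible sketch, the following uses the common 1.5 × IQR rule with pandas; the rule, the column name, and the data are all assumptions made for illustration:

```python
import pandas as pd

# Toy data (made up for illustration); 100 is an obvious outlier.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 100]})

# Interquartile range of the column.
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1

# Keep only the rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR].
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(df_clean)
```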

(7) Data type conversion

If a feature is not numeric, it must be converted to a numeric representation.
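Two common conversions can be sketched with pandas: integer codes and one-hot encoding. The notes do not prescribe either, so both techniques, the column names, and the data are assumptions for illustration:

```python
import pandas as pd

# Toy data (made up for illustration) with one non-numeric feature.
df = pd.DataFrame({"color": ["red", "green", "red", "blue"],
                   "size": [1, 2, 3, 4]})

# Option 1: integer codes. Categories are sorted alphabetically,
# so blue -> 0, green -> 1, red -> 2.
df["color_code"] = df["color"].astype("category").cat.codes

# Option 2: one-hot encoding, one indicator column per category.
df_onehot = pd.get_dummies(df[["color", "size"]], columns=["color"])

print(df)
print(df_onehot)
```

Integer codes impose an ordering on the categories, so one-hot encoding is usually the safer choice when the categories have no natural order.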

1. Import the required packages

Data processing packages: NumPy, SciPy, and pandas; SciPy and pandas are built on top of NumPy.
Visualization packages: Matplotlib and Seaborn; Seaborn is built on top of Matplotlib.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

%matplotlib inline

2. Read the data

dpath = './data/'
data = pd.read_csv(dpath + "boston_housing.csv")
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null int64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null int64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(9), int64(5)
memory usage: 55.4 KB
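The info() output is mainly useful for checking column types and missing values before preprocessing. A minimal sketch of that check, assuming a tiny in-memory CSV in place of boston_housing.csv (the three rows and their values are made up):

```python
import io
import pandas as pd

# A tiny stand-in for the real file; the values are made up for illustration.
csv_text = "CRIM,RM,MEDV\n0.1,6.5,20.0\n0.2,,30.0\n0.3,7.1,40.0\n"
data = pd.read_csv(io.StringIO(csv_text))

# Column types and non-null counts, as in the output above.
data.info()

# Number of missing values in each column.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```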

3. Separate the features from the target

To delete rows or columns:

DataFrame.drop(labels, axis=0, level=None, inplace=False, errors='raise')

labels : single label or list-like 
axis : int or axis name 
level : int or level name, default None For MultiIndex 
inplace : bool, default False. If True, do operation inplace and return None. 
errors : {'ignore', 'raise'}, default 'raise'. If 'ignore', suppress error and only existing labels are dropped. 
Returns: dropped : type of caller
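The parameters above can be sketched on a small made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4, 5, 6],
                   "target": [7, 8, 9]})

# axis=1 drops a column; the original DataFrame is left unchanged
# because inplace defaults to False.
X = df.drop("target", axis=1)

# axis=0 drops rows by index label.
df_rows = df.drop([0], axis=0)

# errors='ignore' silently skips labels that do not exist.
same = df.drop("missing", axis=1, errors="ignore")

print(X.columns.tolist())
print(df_rows.index.tolist())
```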

y = data['MEDV']  # select the column named 'MEDV'
# print(y)
X = data.drop('MEDV', axis=1)  # drop the column named 'MEDV' along axis=1 (columns)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null int64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null int64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(8), int64(5)
memory usage: 51.5 KB

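4. Sample the training set and the test set

sklearn.model_selection.train_test_split(*arrays, **options)

(In older scikit-learn versions this function lived in sklearn.cross_validation; that module was removed in version 0.20.)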

*arrays : sequence of indexables with same length / shape[0]. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes. 
test_size : float, int, or None (default is None). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25. 
train_size : float, int, or None (default is None). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. 
random_state : int or RandomState. Pseudo-random number generator state used for random sampling. 
stratify : array-like or None (default is None)

X_train,X_test,y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

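X: the input features, 
y: the input labels, 
random_state: the random seed, 
test_size: the fraction of samples used for testing; defaults to 0.25 
(X_train, y_train) and (X_test, y_test) are matching pairs: the training data with the training labels, and the test data with the test labels.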
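A minimal sketch of the split on made-up arrays, assuming scikit-learn 0.18 or later (where train_test_split lives in sklearn.model_selection). The data is constructed so that row i of X starts with 2*i, which makes it easy to confirm that features and labels stay aligned after shuffling:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 20 samples, 2 features.
# Row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.25)

print(X_train.shape, y_train.shape)  # 15 training samples
print(X_test.shape, y_test.shape)    # 5 test samples
```

Fixing random_state makes the split reproducible: calling the function again with the same seed returns the same partition.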

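from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# Randomly hold out 25% of the data as the test set; the rest is the training set.
# X: input features, y: input labels, random_state: random seed (27 here),
# test_size: fraction of test samples (defaults to 0.25 when both test_size and train_size are None)
# The outputs are pandas objects holding the training and test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27, test_size=0.25)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(379, 13)
(379,)
(127, 13)
(127,)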

5. Data preprocessing

Standardization:

Initialization:

sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

with_mean : boolean, True by default. If True, center the data before scaling. 
with_std : boolean, True by default. If True, scale the data to unit variance (or equivalently, unit standard deviation). 
copy : boolean, optional, default True. If False, try to avoid a copy and do inplace scaling instead.

Methods:

X_new = fit_transform(X, y=None, **fit_params): computes the mean and standard deviation, then standardizes the data

X : numpy array of shape [n_samples, n_features]. Training set. 
y : numpy array of shape [n_samples]. Target values. 
X_new : numpy array of shape [n_samples, n_features_new]. Transformed array.

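X_new = transform(X, copy=None): standardizes the data using the mean and standard deviation that were already computed (the y argument accepted by older versions has been removed in current scikit-learn)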

X : array-like, shape [n_samples, n_features]. The data used to scale along the features axis. 
X_new : numpy array of shape [n_samples, n_features_new]. Transformed array.
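The difference between fit_transform and transform can be sketched on made-up data: the statistics are learned from the training set only and then reused for the test set. The reshape of the 1-D target and the use of inverse_transform are additions for illustration, not part of the original notes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): one feature.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training set
X_test_s = scaler.transform(X_test)        # reuses the training-set statistics

print(scaler.mean_, scaler.scale_)

# StandardScaler expects 2-D input, so a 1-D target is reshaped into a column.
y_train = np.array([10.0, 20.0, 30.0])
y_scaler = StandardScaler()
y_train_s = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# inverse_transform maps standardized values back to the original scale,
# which is useful for reporting predictions in the original units.
y_back = y_scaler.inverse_transform(y_train_s.reshape(-1, 1)).ravel()
print(y_back)
```

Fitting the scaler on the test set as well would leak test-set statistics into preprocessing, which is why only transform is called on X_test.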

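# Standardize the data
from sklearn.preprocessing import StandardScaler

# Create separate scalers for the features and the target
ss_X = StandardScaler()
ss_y = StandardScaler()

# Standardize the features and the target of both the training and test sets
X_train = ss_X.fit_transform(X_train)  # compute the mean and standard deviation, then transform
X_test = ss_X.transform(X_test)        # reuse the mean and standard deviation computed above
# StandardScaler expects 2-D input, so the 1-D target is reshaped into a column
y_train = ss_y.fit_transform(y_train.values.reshape(-1, 1))
y_test = ss_y.transform(y_test.values.reshape(-1, 1))

print(X_train)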
[[-0.37683627 -0.50304409  2.48277286 ...,  0.86555269 -0.13431739  1.60921499]
 [ 5.13573477 -0.50304409  1.0607873  ...,  0.86555269 -2.93693892  3.44576006]
 [-0.37346431  0.01751212 -0.44822848 ..., -1.31269744  0.33223834  2.45055308]
 ...,
 [-0.39101613 -0.50304409 -1.13119458 ..., -0.87704742  0.28632785 -0.36708256]
 [-0.38897021 -0.50304409 -1.2462515  ..., -0.44139739  0.38012111  0.19898553]
 [-0.31120842 -0.50304409 -0.40840109 ...,  1.30120272  0.37957325 -0.18215757]]

