Data Cleaning & Preprocessing in Machine Learning

Reposted article, 2019-03-02 20:29. Tags: Machine Learning & Deep Learning / Data Processing / Python

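Data preprocessing is the first step in building a machine learning model, and it largely determines the final result: if your dataset has not been cleaned and preprocessed, your model will most likely not work well.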

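Step 1: Importing the data

Before any learning can happen, the data has to be loaded into the program for further processing.

Loading a nii file and converting it to a numpy array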

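import os
import numpy as np
import nibabel as nib
from skimage import transform

# img_file, i and train_img come from the surrounding loading loop (not shown here)
img = nib.load(img_file)        # open the NIfTI (.nii) file
img = img.get_fdata()           # voxel data as a floating-point numpy array
img = transform.resize(img[:, :, :, 0], (256, 256, 5))  # take the first volume and resize it
img = np.squeeze(img)           # drop any axes of length 1
train_img[i - 1, :, :, :] = img[:, :, :]  # store as sample i (i counts from 1)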

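Step 2: Preprocessing the data

Python offers many libraries for data processing. The three most popular basic ones are NumPy, Matplotlib, and Pandas. NumPy provides the arrays and numerical operations that the mathematics behind the code relies on. Matplotlib (specifically matplotlib.pyplot) is the library for plotting. Pandas is one of the best libraries for importing and manipulating datasets. For data preprocessing, Pandas and NumPy are essentially required.

When importing a library with a long name, it is best to give it a short alias so the abbreviation can be used from then on. For example: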

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Importing the data

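import os
import pandas as pd

def read_data(file_name: str):
    """Load a CSV file into a DataFrame; return None for any other file type."""
    # splitext also handles names with several dots, e.g. "./my.data.csv"
    suffix = os.path.splitext(file_name)[1].lower()
    if suffix == ".csv":
        return pd.read_csv(file_name)
    return None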

The data read in is:

	animal	age	worth	friendly
0	cat	3	1200.0	yes
1	dog	4	2400.0	yes
2	dog	3	7000.0	no
3	cat	2	3400.0	yes
4	moose	6	4000.0	no
5	moose	3	NaN	yes

Split the data into the dependent variable and the independent variables ($y = f(x)$):

dataset = read_data("data.csv")  # pandas.core.frame.DataFrame
print(dataset)
x = dataset.iloc[:, :-1].values  # all columns except the last, as a numpy array
y = dataset.iloc[:, -1].values   # the last column of the dataset

\[x = \begin{bmatrix} {'cat'} & {3} & {1200.0} \\ {'dog'} & {4} & {2400.0} \\ {'dog'} & {3} & {7000.0} \\ {'cat'} & {2} & {3400.0} \\ {'moose'} & {6} & {4000.0} \\ {'moose'} & {3} & {nan} \end{bmatrix} \\ y = ['yes', 'yes', 'no', 'yes', 'no', 'yes'] \]

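As shown, $x$ has one missing value. It can be filled in with the Imputer class from scikit-learn's preprocessing module (removed in scikit-learn 0.22; newer versions provide SimpleImputer in sklearn.impute instead).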

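import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaces the old sklearn.preprocessing.Imputer (removed in scikit-learn 0.22);
# it always works column by column, so the former axis=0 argument is not needed
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # fill gaps with the column mean
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])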

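Here missing_values specifies which value is treated as missing, and strategy specifies how it is filled. This example fills with the column mean; the median ('median') or the mode ('most_frequent') can be used as well.

Result after filling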

\[\begin{bmatrix} {'cat'} & {3} & {1200.0} \\ {'dog'} & {4} & {2400.0} \\ {'dog'} & {3} & {7000.0} \\ {'cat'} & {2} & {3400.0} \\ {'moose'} & {6} & {4000.0} \\ {'moose'} & {3} & {3600.0} \\ \end{bmatrix} \]
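The filled value of 3600.0 can be checked by hand. The following is a minimal NumPy sketch of what mean imputation computes for the worth column; it is not how scikit-learn implements it, and the variable names are chosen here for illustration.

```python
import numpy as np

# The "worth" column from the sample data, with the missing entry as NaN
worth = np.array([1200.0, 2400.0, 7000.0, 3400.0, 4000.0, np.nan])

# Mean of the observed values only (NaN entries are ignored)
column_mean = np.nanmean(worth)

# Replace each NaN with that mean, leaving observed values untouched
filled = np.where(np.isnan(worth), column_mean, worth)
print(column_mean)  # 3600.0
print(filled)
```

The five observed values sum to 18000, so their mean is 3600, matching the last row of the matrix above.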

This kind of filling works for numeric data. Categorical attributes can first be encoded as numbers, using the LabelEncoder class provided by sklearn.preprocessing.

from sklearn.preprocessing import LabelEncoder

print(y)
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)
print(y)

Encoding result

\[y = ['yes', 'yes', 'no', 'yes', 'no', 'yes'] \\ \Downarrow \\ y = [1, 1, 0, 1, 0, 1] \]
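The mapping 'no' → 0, 'yes' → 1 follows from numbering the distinct labels in sorted order. As a rough illustration of that idea (a NumPy sketch with names introduced here, not LabelEncoder's actual code), the same codes can be produced with np.unique:

```python
import numpy as np

y_labels = np.array(['yes', 'yes', 'no', 'yes', 'no', 'yes'])

# classes holds the sorted unique labels;
# codes holds, for each label, its index in classes
classes, codes = np.unique(y_labels, return_inverse=True)
print(classes)
print(codes)

# Indexing classes with the codes recovers the original labels
decoded = classes[codes]
```

Keeping the classes array around is what makes it possible to turn model predictions back into readable labels later.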

Splitting into training and test sets

The split can be done with sklearn.model_selection.train_test_split:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

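A common way to split a dataset is 80/20: 80% of the data is used for training and 20% for testing, as set by test_size=0.2. random_state fixes the seed of the random shuffle, so the same split is reproduced on every run; the rows are shuffled whether or not it is set.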
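To make the role of the seed concrete, here is a simplified sketch of a shuffled split. The helper `simple_split` is hypothetical and written only for this illustration; it assumes the test size is rounded up, and it will not reproduce the exact rows that train_test_split selects.

```python
import math
import numpy as np

def simple_split(x, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices with a seeded generator, then split."""
    n_samples = len(x)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test size up
    rng = np.random.default_rng(seed)          # same seed -> same shuffle
    order = rng.permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Toy data with 6 rows, the same size as the sample dataset
x_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([1, 1, 0, 1, 0, 1])

first = simple_split(x_demo, y_demo, seed=0)
second = simple_split(x_demo, y_demo, seed=0)
print(first[0].shape, first[1].shape)  # 4 training rows, 2 test rows
```

Calling the helper twice with the same seed returns identical splits, which is the property a fixed random_state provides.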

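Feature scaling

When features span very different ranges, or when a model would otherwise weight features according to their magnitude (which is usually not what we want), the features can be rescaled with sklearn.preprocessing.StandardScaler, which standardizes each feature to zero mean and unit variance.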

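from sklearn.preprocessing import StandardScaler

# Encode the categorical column as numbers, then redo the split so that
# x_train and x_test contain the encoded values
x[:, 0] = LabelEncoder().fit_transform(x[:, 0])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_train)
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)  # learn mean and std from the training set only
x_test = sc_x.transform(x_test)        # reuse the training mean and std
print(x_train)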

Result

\[\begin{bmatrix} {1} & {4.0} & {2400.0} \\ {0} & {2.0} & {3400.0} \\ {0} & {3.0} & {1200.0} \\ {2} & {6.0} & {4000.0} \end{bmatrix} \]

\[\Downarrow \]

\[\begin{bmatrix} {0.30151134} & {0.16903085} & {-0.32961713} \\ {-0.90453403} & {-1.18321596} & {0.61214609} \\ {-0.90453403} & {-0.50709255} & {-1.45973299} \\ {1.50755672} & {1.52127766} & {1.17720402} \end{bmatrix} \]
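The scaled values above can be reproduced by hand: subtract each column's mean and divide by its population standard deviation. The NumPy sketch below uses the four training rows shown in the result; the two test rows are taken to be the remaining rows of the sample data, and all variable names are introduced here for illustration.

```python
import numpy as np

# Training rows from the result above (animal already encoded as 0/1/2)
x_train_demo = np.array([
    [1, 4.0, 2400.0],
    [0, 2.0, 3400.0],
    [0, 3.0, 1200.0],
    [2, 6.0, 4000.0],
])

# Per-column statistics computed from the training rows only
mean = x_train_demo.mean(axis=0)
std = x_train_demo.std(axis=0)  # population standard deviation (ddof=0)

x_train_scaled = (x_train_demo - mean) / std
print(x_train_scaled)

# Test rows are scaled with the same training mean and std, never their own
x_test_demo = np.array([[1, 3.0, 7000.0], [2, 3.0, 3600.0]])
x_test_scaled = (x_test_demo - mean) / std
```

Reusing the training statistics for the test rows keeps information about the test set from leaking into the preprocessing step.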
