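In machine learning, working with the MNIST dataset is the equivalent of "hello world" in a programming language. The training set contains 60,000 examples and the test set contains 10,000. Each example is an image of 28*28=784 pixels, and each label is one of the 10 digits 0-9.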
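For convenience, we want the output to be four arrays, (x_train, y_train), (x_test, y_test). x_train holds 60,000 vectors of dimension 784, one per image. The labels are one-hot encoded, so the digit label 2 becomes the array [0,0,1,0,0,0,0,0,0,0]; y_train therefore holds 60,000 vectors of dimension 10, one per label. x_test and y_test have the same layout with 10,000 rows each.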
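The one-hot encoding described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `one_hot` and the sample labels are assumptions made for the example, not part of any MNIST tooling.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing the identity matrix with the labels encodes all of them at once.
    return np.eye(num_classes, dtype=np.float32)[labels]


# Hypothetical sample labels, used only to show the shape of the result.
encoded = one_hot([2, 0, 9])
print(encoded[0])      # one-hot vector for the label 2
print(encoded.shape)   # one row per label, one column per class
```

Indexing an identity matrix avoids an explicit Python loop, which matters little for three labels but is noticeably faster for 60,000.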
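Several ways of reading MNIST are described below.
Reading from local files
Reading the .gz compressed files
Download the dataset from the official MNIST website, which provides it as four .gz files. The following script reads them: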
#!/usr/bin/env python
# coding=utf-8
'''
@Author: John
@Email: johnjim0816@gmail.com
@Date: 2020-05-21 23:36:58
@LastEditor: John
@LastEditTime: 2020-05-22 07:24:45
@Discription:
@Environment: python 3.7.7
'''
import numpy as np
from struct import unpack
import gzip
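
def __read_image(path):
    # Image files start with a 16-byte big-endian header
    # (magic number, image count, rows, cols) followed by raw uint8 pixels.
    with gzip.open(path, 'rb') as f:
        magic, num, rows, cols = unpack('>4I', f.read(16))
        img = np.frombuffer(f.read(), dtype=np.uint8).reshape(num, rows * cols)
    return img

def __read_label(path):
    # Label files start with an 8-byte big-endian header
    # (magic number, label count) followed by one uint8 per label.
    with gzip.open(path, 'rb') as f:
        magic, num = unpack('>2I', f.read(8))
        lab = np.frombuffer(f.read(), dtype=np.uint8)
    return lab

def __normalize_image(image):
    # Scale pixel values from 0-255 to 0.0-1.0.
    img = image.astype(np.float32) / 255.0
    return img

def __one_hot_label(label):
    # One row per label, with a 1 in the column given by the digit.
    lab = np.zeros((label.size, 10))
    for i, row in enumerate(lab):
        row[label[i]] = 1
    return lab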

def load_mnist(x_train_path, y_train_path, x_test_path, y_test_path, normalize=True, one_hot=True):
    '''Load the MNIST dataset.

    Parameters
    ----------
    normalize : if True, scale the image pixel values to 0.0~1.0
    one_hot : if True, return the labels as one-hot arrays,
        i.e. arrays such as [0,0,1,0,0,0,0,0,0,0]

    Returns
    ----------
    (training images, training labels), (test images, test labels)
    '''
    image = {
        'train': __read_image(x_train_path),
        'test': __read_image(x_test_path)
    }
    label = {
        'train': __read_label(y_train_path),
        'test': __read_label(y_test_path)
    }
    if normalize:
        for key in ('train', 'test'):
            image[key] = __normalize_image(image[key])
    if one_hot:
        for key in ('train', 'test'):
            label[key] = __one_hot_label(label[key])
    return (image['train'], label['train']), (image['test'], label['test'])

x_train_path = './Mnist/train-images-idx3-ubyte.gz'
y_train_path = './Mnist/train-labels-idx1-ubyte.gz'
x_test_path = './Mnist/t10k-images-idx3-ubyte.gz'
y_test_path = './Mnist/t10k-labels-idx1-ubyte.gz'
(x_train, y_train), (x_test, y_test) = load_mnist(x_train_path, y_train_path, x_test_path, y_test_path)
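
To make the header layout that the script relies on more concrete, here is a small round-trip sketch that does not need the real dataset. It writes a tiny fake image file in the same gzipped layout (magic number 2051, then count, rows and cols as big-endian 32-bit integers, then raw pixels) and reads it back. The helper names `write_idx_images` and `read_idx_images` and the file name are hypothetical, chosen only for this example.

```python
import gzip
import os
import tempfile
from struct import pack, unpack

import numpy as np


def write_idx_images(path, images):
    """Write a uint8 array of shape (num, rows, cols) as a gzipped image file."""
    num, rows, cols = images.shape
    with gzip.open(path, 'wb') as f:
        # Header: magic number 2051, image count, rows, cols (big-endian uint32).
        f.write(pack('>4I', 2051, num, rows, cols))
        f.write(images.tobytes())


def read_idx_images(path):
    """Read the file back into an array of shape (num, rows * cols)."""
    with gzip.open(path, 'rb') as f:
        magic, num, rows, cols = unpack('>4I', f.read(16))
        if magic != 2051:
            raise ValueError(f'unexpected magic number {magic}')
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(num, rows * cols)


# Five random 28x28 "images" stand in for real MNIST data.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, 'fake-images-idx3-ubyte.gz')
    write_idx_images(fake_path, fake_images)
    restored = read_idx_images(fake_path)

print(restored.shape, restored.dtype)
```

Checking the magic number is optional, but it catches the common mistake of passing a label file where an image file is expected.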
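Reading the decompressed files
That is, decompress the four .gz files first. There are several ways to read the result:
- Read with np.fromfile
- Read with the idx2numpy module
- Read with array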
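
As a rough sketch of the first and third options in the list above, the snippet below reads an uncompressed label file once with `np.fromfile` and once with the standard-library `array` module. It builds a tiny stand-in file so that it runs without the dataset; the function names, the file name and the five sample labels are assumptions made for illustration. The `offset` argument of `np.fromfile` requires NumPy 1.17 or later.

```python
import array
import os
import tempfile
from struct import pack, unpack

import numpy as np


def read_labels_fromfile(path):
    """Read an uncompressed label file with np.fromfile."""
    # offset=8 skips the header (magic number and label count, 4 bytes each).
    return np.fromfile(path, dtype=np.uint8, offset=8)


def read_labels_array(path):
    """Read the same file with the standard-library array module."""
    with open(path, 'rb') as f:
        magic, num = unpack('>2I', f.read(8))
        labels = array.array('B')  # 'B' = unsigned char, one byte per label
        labels.fromfile(f, num)
    return labels


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a decompressed label file such as train-labels-idx1-ubyte.
    demo_path = os.path.join(tmp_dir, 'demo-labels-idx1-ubyte')
    demo_labels = bytes([5, 0, 4, 1, 9])
    with open(demo_path, 'wb') as f:
        f.write(pack('>2I', 2049, len(demo_labels)))  # 2049 marks a label file
        f.write(demo_labels)
    via_numpy = read_labels_fromfile(demo_path)
    via_array = read_labels_array(demo_path)

print(via_numpy.tolist(), via_array.tolist())
```

For image files the same approach should apply with a 16-byte header, followed by a reshape to (num, 784).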
Reading online
Reading with TensorFlow
The keras module in TensorFlow already bundles MNIST loading, as follows:
from keras.datasets import mnist
# to_categorical is imported directly from keras.utils; the older
# np_utils module is not available in recent Keras releases.
from keras.utils import to_categorical
import numpy as np

def load_data():  # categorical_crossentropy
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    # Keep only the first 10000 training examples.
    number = 10000
    x_train = x_train[0:number]
    y_train = y_train[0:number]
    # Flatten each 28x28 image into a 784-dimensional vector.
    x_train = x_train.reshape(number, 28 * 28)
    x_test = x_test.reshape(x_test.shape[0], 28 * 28)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    # convert class vectors to binary class matrices
    y_train = to_categorical(y_train, 10)
    y_test = to_categorical(y_test, 10)
    x_test = np.random.normal(x_test)  # add noise: sample around each test pixel value
    x_train, x_test = x_train / 255, x_test / 255
    return (x_train, y_train), (x_test, y_test)

(x_train, y_train), (x_test, y_test) = load_data()
Using the python-mnist module
The Python ecosystem also offers a ready-made module for this, python-mnist; see its documentation for usage.