The original files can be downloaded from http://yann.lecun.com/exdb/mnist/.
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
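The four downloads are gzip-compressed, so they have to be decompressed before the byte layouts described below apply. A minimal sketch using Python's standard gzip module follows; the helper name read_gz_bytes is hypothetical, and the demo writes a tiny synthetic label file to a temporary directory instead of using the real download.

```python
import gzip
import os
import struct
import tempfile


def read_gz_bytes(path):
    """Return the decompressed contents of a .gz file as bytes."""
    with gzip.open(path, 'rb') as f:
        return f.read()


# Synthetic stand-in for a label file: 8-byte header followed by 3 labels.
payload = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo-labels-idx1-ubyte.gz')
    with gzip.open(demo_path, 'wb') as f:
        f.write(payload)
    raw = read_gz_bytes(demo_path)

print(len(raw))  # 11 (8 header bytes + 3 label bytes)
```

Reading the decompressed bytes into memory this way means the later parsing steps can work on a bytes object without an unpacked copy on disk.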
The training set contains 60000 examples, and the test set 10000 examples.
The first 5000 examples of the test set are taken from the original NIST training set. The last 5000 are taken from the original NIST test set. The first 5000 are cleaner and easier than the last 5000.
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000801(2049) magic number (MSB first)
0004 32 bit integer 60000 number of items
0008 unsigned byte ?? label
0009 unsigned byte ?? label
........
xxxx unsigned byte ?? label
The label values are 0 to 9.
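The label layout above (two big-endian 32-bit integers, then one unsigned byte per label) can be sketched as follows. The function name parse_idx1_labels is a hypothetical helper, and the demo buffer is synthetic rather than the real training file.

```python
import struct


def parse_idx1_labels(buf):
    """Parse the bytes of an IDX1 label file and return the labels as a list of ints."""
    # Header: magic number and item count, both big-endian unsigned 32-bit integers.
    magic, count = struct.unpack_from('>II', buf, 0)
    if magic != 0x00000801:
        raise ValueError('unexpected magic number: %#x' % magic)
    # The labels start at offset 8, one unsigned byte each.
    return list(struct.unpack_from('>%dB' % count, buf, 8))


# Synthetic buffer: header announcing 4 items, then the 4 label bytes.
demo = struct.pack('>II', 2049, 4) + bytes([7, 2, 1, 0])
labels = parse_idx1_labels(demo)
print(labels)  # [7, 2, 1, 0]
```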
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 60000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
........
xxxx unsigned byte ?? pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
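Because pixels are stored row-wise after a 16-byte header, the byte offset of any pixel follows from the image index, row, and column. The sketch below illustrates that arithmetic; pixel_offset and HEADER_SIZE are hypothetical names, and the demo uses a synthetic file with two 2x3 images.

```python
import struct

HEADER_SIZE = 16  # four 32-bit integers: magic number, images, rows, columns


def pixel_offset(image_index, row, col, num_rows=28, num_cols=28):
    """Byte offset of one pixel in an IDX3 image file with row-wise pixel order."""
    return HEADER_SIZE + (image_index * num_rows + row) * num_cols + col


# Synthetic file: 2 images of 2 rows x 3 columns, pixel values 0..11.
rows, cols = 2, 3
demo = struct.pack('>IIII', 2051, 2, rows, cols) + bytes(range(12))

value = demo[pixel_offset(1, 1, 0, rows, cols)]
print(value)                  # 9 (second image, second row, first column)
print(pixel_offset(1, 0, 0))  # 800 (first pixel of the second 28x28 image)
```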
TEST SET LABEL FILE (t10k-labels-idx1-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000801(2049) magic number (MSB first)
0004 32 bit integer 10000 number of items
0008 unsigned byte ?? label
0009 unsigned byte ?? label
........
xxxx unsigned byte ?? label
The label values are 0 to 9.
TEST SET IMAGE FILE (t10k-images-idx3-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 10000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
........
xxxx unsigned byte ?? pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
THE IDX FILE FORMAT
The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is
magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data
The magic number is an integer (MSB first). The first 2 bytes are always 0.
The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....
The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
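As a rough illustration of these rules, the sketch below decodes the magic number, the dimension sizes, and the data of an arbitrary IDX buffer using only the struct module. The names parse_idx and IDX_TYPE_FORMATS are hypothetical, and the demo buffer is a synthetic 2x3 matrix of shorts, not an MNIST file.

```python
import struct

# Third byte of the magic number -> struct format character for one element.
IDX_TYPE_FORMATS = {
    0x08: 'B',  # unsigned byte
    0x09: 'b',  # signed byte
    0x0B: 'h',  # short (2 bytes)
    0x0C: 'i',  # int (4 bytes)
    0x0D: 'f',  # float (4 bytes)
    0x0E: 'd',  # double (8 bytes)
}


def parse_idx(buf):
    """Return (dims, values): the dimension sizes and the flat data of an IDX buffer."""
    # Magic number: two zero bytes, the type code, the number of dimensions.
    zero, type_code, ndim = struct.unpack_from('>HBB', buf, 0)
    if zero != 0:
        raise ValueError('the first two bytes of the magic number must be 0')
    # One big-endian 4-byte size per dimension.
    dims = struct.unpack_from('>%dI' % ndim, buf, 4)
    count = 1
    for size in dims:
        count *= size
    # The data follows the sizes, last dimension changing fastest.
    fmt = '>%d%s' % (count, IDX_TYPE_FORMATS[type_code])
    values = struct.unpack_from(fmt, buf, 4 + 4 * ndim)
    return dims, values


demo = (struct.pack('>HBB', 0, 0x0B, 2)           # magic: type short, 2 dimensions
        + struct.pack('>II', 2, 3)                # sizes: 2 x 3
        + struct.pack('>6h', 1, -2, 3, 4, 5, 6))  # data in C order
dims, values = parse_idx(demo)
print(dims, values)  # (2, 3) (1, -2, 3, 4, 5, 6)
```

For the MNIST image files this would report dims of (60000, 28, 28) or (10000, 28, 28), since their magic number 0x00000803 encodes unsigned bytes in 3 dimensions.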
Reading the MNIST files in Python is really just a question of how to read a binary file in Python. The MNIST structure is shown below, taking train-images-idx3-ubyte as the example:
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 60000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
........
xxxx unsigned byte ?? pixel
In other words, we first have to read four 32-bit integers. After trying many approaches, the most convenient one, at least for me, is struct.unpack_from().
filename = 'train-images.idx3-ubyte'
binfile = open(filename, 'rb')
buf = binfile.read()
First, read the whole file into memory in binary mode.
index = 0
magic, numImages, numRows, numColumns = struct.unpack_from('>IIII', buf, index)
index += struct.calcsize('>IIII')
Then use struct.unpack_from.
'>IIII' means reading four unsigned 32-bit integers in big-endian byte order.
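To make the format string concrete, here is a small sketch that packs a header with the documented values and reads it back. The header bytes are built in memory for illustration; the little-endian read is included only to show what goes wrong when the '>' prefix is replaced by '<'.

```python
import struct

# Build a 16-byte header with the values documented for train-images-idx3-ubyte.
header = struct.pack('>IIII', 2051, 60000, 28, 28)

print(struct.calcsize('>IIII'))  # 16 (four 4-byte integers)
print(header[:4].hex())          # 00000803 (the magic number, MSB first)

# Big-endian read recovers the header fields.
big = struct.unpack_from('>IIII', header, 0)
print(big)                       # (2051, 60000, 28, 28)

# Little-endian read of the same bytes gives a wrong magic number.
little = struct.unpack_from('<IIII', header, 0)
print(little[0])                 # 50855936
```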
Then read one image to check that the file was read correctly:
im = struct.unpack_from('>784B', buf, index)
index += struct.calcsize('>784B')

im = np.array(im)
im = im.reshape(28, 28)

fig = plt.figure()
plotwindow = fig.add_subplot(111)
plt.imshow(im, cmap='gray')
plt.show()
'>784B' means reading 784 unsigned bytes in big-endian order (784 = 28 * 28, one byte per pixel).
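As an alternative sketch, the same 784 bytes can be viewed directly as a NumPy array with np.frombuffer instead of being unpacked into a Python tuple first. The buffer below is synthetic (a header plus one made-up 28x28 image), and the variable names are illustrative only.

```python
import struct
import numpy as np

rows, cols = 28, 28
# Synthetic file contents: 16-byte header followed by one image of 784 bytes.
pixels = bytes(i % 256 for i in range(rows * cols))
buf = struct.pack('>IIII', 2051, 1, rows, cols) + pixels

# Approach from the text: unpack 784 unsigned bytes, then convert and reshape.
im_struct = np.array(struct.unpack_from('>784B', buf, 16)).reshape(rows, cols)

# NumPy approach: interpret the same bytes in place as uint8 values.
im_numpy = np.frombuffer(buf, dtype=np.uint8, count=rows * cols, offset=16).reshape(rows, cols)

print(np.array_equal(im_struct, im_numpy))  # True
print(im_numpy.dtype)                       # uint8
```

The np.frombuffer version keeps the pixels as uint8 and avoids building an intermediate tuple, which matters more when all 60000 images are read at once.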
The complete code is as follows; it reads the first image in the file:
import numpy as np  # python 3.7
import struct
import matplotlib.pyplot as plt

filename = 'train-images.idx3-ubyte'
binfile = open(filename, 'rb')
buf = binfile.read()

index = 0
magic, numImages, numRows, numColumns = struct.unpack_from('>IIII', buf, index)
index += struct.calcsize('>IIII')

im = struct.unpack_from('>784B', buf, index)
index += struct.calcsize('>784B')

im = np.array(im)
im = im.reshape(28, 28)

fig = plt.figure()
plotwindow = fig.add_subplot(111)
plt.imshow(im, cmap='gray')
plt.show()

### https://www.cnblogs.com/x1957/archive/2012/06/02/2531503.html
Another example, which reads all of the images:
## from https://www.jianshu.com/p/84f72791806f
# encoding: utf-8
"""
@author: monitor1379
@contact: yy4f5da2@hotmail.com
@site: www.monitor1379.com
@version: 1.0
@license: Apache Licence
@file: mnist_decoder.py
@time: 2016/8/16 20:03

Converts the MNIST handwritten digit data files to the bmp image file format.
The dataset can be downloaded from http://yann.lecun.com/exdb/mnist.
For the format conversion, see the official site and the code comments.

========================
Parsing rules for the IDX file format:
========================
THE IDX FILE FORMAT

the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is

magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....

The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).

The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
"""

import numpy as np
import struct
import matplotlib.pyplot as plt

# Training set image file
train_images_idx3_ubyte_file = 'train-images.idx3-ubyte'
# Training set label file
train_labels_idx1_ubyte_file = 'train-labels.idx1-ubyte'

# Test set image file
test_images_idx3_ubyte_file = 't10k-images.idx3-ubyte'
# Test set label file
test_labels_idx1_ubyte_file = 't10k-labels.idx1-ubyte'


def decode_idx3_ubyte(idx3_ubyte_file):
    """
    Generic function for parsing an idx3 file
    :param idx3_ubyte_file: path of the idx3 file
    :return: the dataset
    """
    # Read the binary data
    bin_data = open(idx3_ubyte_file, 'rb').read()

    # Parse the header: magic number, number of images, image height, image width
    offset = 0
    fmt_header = '>iiii'
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, number of images: %d, image size: %d*%d' % (magic_number, num_images, num_rows, num_cols))

    # Parse the dataset
    image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)
    fmt_image = '>' + str(image_size) + 'B'
    images = np.empty((num_images, num_rows, num_cols))
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d images' % (i + 1))
        images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
        offset += struct.calcsize(fmt_image)
    return images


def decode_idx1_ubyte(idx1_ubyte_file):
    """
    Generic function for parsing an idx1 file
    :param idx1_ubyte_file: path of the idx1 file
    :return: the dataset
    """
    # Read the binary data
    bin_data = open(idx1_ubyte_file, 'rb').read()

    # Parse the header: magic number and number of labels
    offset = 0
    fmt_header = '>ii'
    magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, number of labels: %d' % (magic_number, num_images))

    # Parse the dataset
    offset += struct.calcsize(fmt_header)
    fmt_image = '>B'
    labels = np.empty(num_images)
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d labels' % (i + 1))
        labels[i] = struct.unpack_from(fmt_image, bin_data, offset)[0]
        offset += struct.calcsize(fmt_image)
    return labels


def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
    """
    TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  60000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

    :param idx_ubyte_file: path of the idx file
    :return: np.array of shape n*row*col, where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)


def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):
    """
    TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  60000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The label values are 0 to 9.

    :param idx_ubyte_file: path of the idx file
    :return: np.array of shape n*1, where n is the number of images
    """
    return decode_idx1_ubyte(idx_ubyte_file)


def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
    """
    TEST SET IMAGE FILE (t10k-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  10000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

    :param idx_ubyte_file: path of the idx file
    :return: np.array of shape n*row*col, where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)


def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
    """
    TEST SET LABEL FILE (t10k-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  10000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The label values are 0 to 9.

    :param idx_ubyte_file: path of the idx file
    :return: np.array of shape n*1, where n is the number of images
    """
    return decode_idx1_ubyte(idx_ubyte_file)


def run():
    train_images = load_train_images()
    train_labels = load_train_labels()
    # test_images = load_test_images()
    # test_labels = load_test_labels()

    # Show the first three samples and their labels to check that the data was read correctly
    for i in range(3):
        print(train_labels[i])
        plt.imshow(train_images[i], cmap='gray')
        plt.show()
    print('done')


if __name__ == '__main__':
    run()
Another example:
## https://www.e-learn.cn/content/wangluowenzhang/615391
import os
import numpy as np
import matplotlib.pyplot as plt


def load_data(data_path):
    '''
    Purpose: load the MNIST training data
    Input:  data_path    path of the directory holding the (decompressed) data
    Output: train_data   the image data
            train_label  the labels
    '''
    # Open in binary mode so the bytes are read unchanged on every platform
    with open(os.path.join(data_path, 'train-images.idx3-ubyte'), 'rb') as f_data:
        loaded_data = np.fromfile(file=f_data, dtype=np.uint8)
    # The first 16 bytes are the header and must be skipped
    # (np.float was removed from recent NumPy releases, so np.float64 is used)
    train_data = loaded_data[16:].reshape((-1, 784)).astype(np.float64)

    with open(os.path.join(data_path, 'train-labels.idx1-ubyte'), 'rb') as f_label:
        loaded_label = np.fromfile(file=f_label, dtype=np.uint8)
    # The first 8 bytes are the header and must be skipped
    train_label = loaded_label[8:].reshape((-1)).astype(np.float64)

    return train_data, train_label


if __name__ == '__main__':
    train_data, train_label = load_data('./')  ## path of files
    # Put the downloaded MNIST dataset in the xxxxxxx/minst folder and fill in that path
    print(np.shape(train_data))   # (60000, 784)
    print(np.shape(train_label))  # (60000,)

    # Show the first five images and their labels
    for i in range(5):
        img = train_data[i].reshape(28, 28)  # turn the flat vector back into a 2-D image
        plt.imshow(img)
        plt.show()
        print(train_label[i])
From:
https://www.cnblogs.com/x1957/archive/2012/06/02/2531503.html
REF:
https://www.jianshu.com/p/84f72791806f
https://www.e-learn.cn/content/wangluowenzhang/615391