Note: every post in this series is updated continuously and published on GitHub, where you can download the notebook files for all of the articles.

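Across the whole machine learning workflow, data preprocessing probably takes more effort than anything except training the model itself. It covers tasks such as reading, filtering, and transforming data. To free users from tedious preprocessing work so they can concentrate on modeling, TensorFlow provides the data module (tf.data), which offers several ways to read, process, and save data. This post focuses on the Dataset object in that module.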

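1 Creating a Dataset

The official documentation groups the ways of creating a Dataset object into two categories. I break them down further as follows:

(1) Create a Dataset holding a sequence of numbers with Dataset's range() method.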

range()

range() is a static method defined on Dataset, so it can be called directly on the class. It accepts the same argument forms as Python's built-in range(): range(stop), range(start, stop), and range(start, stop, step).

In [1]:

import tensorflow as tf
import numpy as np

In [2]:

dataset1 = tf.data.Dataset.range(5)
type(dataset1)

Out[2]:

tensorflow.python.data.ops.dataset_ops.RangeDataset

Note: RangeDataset is a subclass of Dataset. A Dataset is iterable, so you can traverse it with a loop:

In [3]:

for i in dataset1:
    print(i)
    print(i.numpy())

tf.Tensor(0, shape=(), dtype=int64)
0
tf.Tensor(1, shape=(), dtype=int64)
1
tf.Tensor(2, shape=(), dtype=int64)
2
tf.Tensor(3, shape=(), dtype=int64)
3
tf.Tensor(4, shape=(), dtype=int64)
4

As you can see, every element of a Dataset created by range() is a Tensor object, and its underlying value is available through the numpy() method.
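To make the three argument forms of range() concrete, here is a small sketch. The helper to_list is my own name for illustration and is not part of TensorFlow.

```python
import tensorflow as tf

# range(stop), range(start, stop) and range(start, stop, step),
# mirroring Python's built-in range().
ds_stop = tf.data.Dataset.range(3)
ds_start_stop = tf.data.Dataset.range(2, 5)
ds_step = tf.data.Dataset.range(2, 10, 3)

def to_list(ds):
    """Hypothetical helper: collect a dataset of scalars into a Python list."""
    return [int(x.numpy()) for x in ds]

range_results = [to_list(ds_stop), to_list(ds_start_stop), to_list(ds_step)]
print(range_results)  # [[0, 1, 2], [2, 3, 4], [2, 5, 8]]
```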

from_generator()

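If range() is not flexible or powerful enough for you, try from_generator(). It takes a callable generator function as its argument, and new data is produced on demand while you iterate over the returned Dataset. This keeps memory usage low, which is useful for large datasets.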

In [4]:

def count(stop):
  i = 0
  while i<stop:
    print('Call %s ...' % i)
    yield i
    i += 1

In [5]:

dataset2 = tf.data.Dataset.from_generator(count, args=[3], output_types=tf.int32, output_shapes = (), )

In [6]:

a = iter(dataset2)

In [7]:

next(a)

Call 0 ...

Out[7]:

<tf.Tensor: id=46, shape=(), dtype=int32, numpy=0>

In [8]:

next(a)

Call 1 ...

Out[8]:

<tf.Tensor: id=47, shape=(), dtype=int32, numpy=1>

In [9]:

for i in dataset2:
    print(i)
    print(i.numpy())

Call 0 ...
tf.Tensor(0, shape=(), dtype=int32)
0
Call 1 ...
tf.Tensor(1, shape=(), dtype=int32)
1
Call 2 ...
tf.Tensor(2, shape=(), dtype=int32)
2
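In more recent TensorFlow releases the element type and shape can also be described with the output_signature argument instead of output_types and output_shapes. The sketch below assumes TensorFlow 2.4 or later; on the 2.0 release used in this post, stick with the form shown above.

```python
import tensorflow as tf

def count(stop):
    # Plain Python generator; values are produced lazily, one per request.
    i = 0
    while i < stop:
        yield i
        i += 1

# Assumption: TensorFlow 2.4+, where from_generator() accepts output_signature.
dataset = tf.data.Dataset.from_generator(
    count,
    args=[3],
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

generated = [int(x.numpy()) for x in dataset]
print(generated)  # [0, 1, 2]
```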

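(2) Create a Dataset from another collection-like object, such as Python's built-in list and tuple or NumPy's ndarray. This is usually done with from_tensors() and from_tensor_slices(). Both methods are used a lot, so they get a closer look here.

from_tensors()
from_tensors() takes a collection-like object and returns a TensorDataset. The contents and shape of the result depend on the type of the argument.

A list, an ndarray, and a Tensor all behave the same way, because TensorFlow first converts the input to a single Tensor and then wraps it in a Dataset with exactly one element. Only the inferred dtype differs: a Python list of ints becomes int32, while the NumPy array keeps its own dtype (int64 in the output below):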

In [10]:

a = [0,1,2,3,4]
dataset1 = tf.data.Dataset.from_tensors(a)
dataset1_n = tf.data.Dataset.from_tensors(np.array(a))
dataset1_t = tf.data.Dataset.from_tensors(tf.constant(a))

In [11]:

dataset1,next(iter(dataset1))

Out[11]:

(<TensorDataset shapes: (5,), types: tf.int32>,
 <tf.Tensor: id=67, shape=(5,), dtype=int32, numpy=array([0, 1, 2, 3, 4], dtype=int32)>)

In [12]:

dataset1_n,next(iter(dataset1_n))

Out[12]:

(<TensorDataset shapes: (5,), types: tf.int64>,
 <tf.Tensor: id=73, shape=(5,), dtype=int64, numpy=array([0, 1, 2, 3, 4])>)

In [13]:

dataset1_t,next(iter(dataset1_t))

Out[13]:

(<TensorDataset shapes: (5,), types: tf.int32>,
 <tf.Tensor: id=79, shape=(5,), dtype=int32, numpy=array([0, 1, 2, 3, 4], dtype=int32)>)

The same holds for multi-dimensional input:

In [14]:

a = [0,1,2,3,4]
b = [5,6,7,8,9]
dataset2 = tf.data.Dataset.from_tensors([a,b])
dataset2_n = tf.data.Dataset.from_tensors(np.array([a,b]))
dataset2_t = tf.data.Dataset.from_tensors(tf.constant([a,b]))

In [15]:

dataset2,next(iter(dataset2))

Out[15]:

(<TensorDataset shapes: (2, 5), types: tf.int32>,
 <tf.Tensor: id=91, shape=(2, 5), dtype=int32, numpy=
 array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]], dtype=int32)>)

In [16]:

dataset2_n,next(iter(dataset2_n))

Out[16]:

(<TensorDataset shapes: (2, 5), types: tf.int64>,
 <tf.Tensor: id=97, shape=(2, 5), dtype=int64, numpy=
 array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])>)

In [17]:

dataset2_t,next(iter(dataset2_t))

Out[17]:

(<TensorDataset shapes: (2, 5), types: tf.int32>,
 <tf.Tensor: id=103, shape=(2, 5), dtype=int32, numpy=
 array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]], dtype=int32)>)

A tuple is handled differently. The Dataset then contains a single tuple whose elements are the Tensors converted from the elements of the original tuple:

In [18]:

a = [0,1,2,3,4]
b = [5,6,7,8,9]
dataset3 = tf.data.Dataset.from_tensors((a,b))

In [19]:

for i in dataset3:
    print(type(i))
    print(i)
    for j in i:
        print(j)

<class 'tuple'>
(<tf.Tensor: id=112, shape=(5,), dtype=int32, numpy=array([0, 1, 2, 3, 4], dtype=int32)>, <tf.Tensor: id=113, shape=(5,), dtype=int32, numpy=array([5, 6, 7, 8, 9], dtype=int32)>)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)
tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int32)

from_tensor_slices()
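from_tensor_slices() returns a TensorSliceDataset. Unlike from_tensors(), which stores its input as one single element, it slices the input along the first dimension and stores each slice as a separate element. That is what batching, shuffling, and other per-element operations usually need, so it is far more common in practice. The contents and shape of the result again depend on the type of the argument.

When a list is passed in, each element of the list becomes its own Tensor and is placed into the Dataset in order, so the Dataset holds multiple Tensors: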

In [20]:

a = [0,1,2,3,4]
dataset1 = tf.data.Dataset.from_tensor_slices(a)

In [21]:

dataset1

Out[21]:

<TensorSliceDataset shapes: (), types: tf.int32>

In [22]:

for i,elem in enumerate(dataset1):
    print(i, '-->', elem)

0 --> tf.Tensor(0, shape=(), dtype=int32)
1 --> tf.Tensor(1, shape=(), dtype=int32)
2 --> tf.Tensor(2, shape=(), dtype=int32)
3 --> tf.Tensor(3, shape=(), dtype=int32)
4 --> tf.Tensor(4, shape=(), dtype=int32)

In [23]:

a = [0,1,2,3,4]
b = [5,6,7,8,9]
dataset2 = tf.data.Dataset.from_tensor_slices([a,b])

In [24]:

dataset2

Out[24]:

<TensorSliceDataset shapes: (5,), types: tf.int32>

In [25]:

for i,elem in enumerate(dataset2):
    print(i, '-->', elem)

0 --> tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)
1 --> tf.Tensor([5 6 7 8 9], shape=(5,), dtype=int32)

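When a tuple is passed in, each element of the tuple is converted to a Tensor, and the slices at matching positions along the first dimension are regrouped into tuples that are placed into the Dataset one by one, so the resulting Dataset holds multiple tuples. This form is very handy for pairing features with labels.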

In [26]:

a = [0,1,2,3,4]
b = [5,6,7,8,9]
dataset1 = tf.data.Dataset.from_tensor_slices((a,b))

In [27]:

dataset1

Out[27]:

<TensorSliceDataset shapes: ((), ()), types: (tf.int32, tf.int32)>

In [28]:

for i in dataset1:
    print(i)

(<tf.Tensor: id=143, shape=(), dtype=int32, numpy=0>, <tf.Tensor: id=144, shape=(), dtype=int32, numpy=5>)
(<tf.Tensor: id=145, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=146, shape=(), dtype=int32, numpy=6>)
(<tf.Tensor: id=147, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=148, shape=(), dtype=int32, numpy=7>)
(<tf.Tensor: id=149, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=150, shape=(), dtype=int32, numpy=8>)
(<tf.Tensor: id=151, shape=(), dtype=int32, numpy=4>, <tf.Tensor: id=152, shape=(), dtype=int32, numpy=9>)

In [29]:

c = ['a','b','c','d','e']
dataset3 = tf.data.Dataset.from_tensor_slices((a,b,c))

In [30]:

dataset3

Out[30]:

<TensorSliceDataset shapes: ((), (), ()), types: (tf.int32, tf.int32, tf.string)>

In [31]:

for i in dataset3:
    print(i)

(<tf.Tensor: id=162, shape=(), dtype=int32, numpy=0>, <tf.Tensor: id=163, shape=(), dtype=int32, numpy=5>, <tf.Tensor: id=164, shape=(), dtype=string, numpy=b'a'>)
(<tf.Tensor: id=165, shape=(), dtype=int32, numpy=1>, <tf.Tensor: id=166, shape=(), dtype=int32, numpy=6>, <tf.Tensor: id=167, shape=(), dtype=string, numpy=b'b'>)
(<tf.Tensor: id=168, shape=(), dtype=int32, numpy=2>, <tf.Tensor: id=169, shape=(), dtype=int32, numpy=7>, <tf.Tensor: id=170, shape=(), dtype=string, numpy=b'c'>)
(<tf.Tensor: id=171, shape=(), dtype=int32, numpy=3>, <tf.Tensor: id=172, shape=(), dtype=int32, numpy=8>, <tf.Tensor: id=173, shape=(), dtype=string, numpy=b'd'>)
(<tf.Tensor: id=174, shape=(), dtype=int32, numpy=4>, <tf.Tensor: id=175, shape=(), dtype=int32, numpy=9>, <tf.Tensor: id=176, shape=(), dtype=string, numpy=b'e'>)

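To compare from_generator(), from_tensors(), and from_tensor_slices():

from_tensors() looks a lot like from_tensor_slices(), but it shows up far less often, because slicing the input into separate elements is what most pipelines actually need. Given a list, from_tensors() converts the whole list into one Tensor and puts it into the Dataset; given a tuple, it converts each element of the tuple to a Tensor and puts that single tuple into the Dataset.
from_generator() takes a callable generator function. While the Dataset is being iterated, the generator keeps producing new data for training and testing the model, which is useful for large datasets.
from_tensor_slices(), given a list, converts each element of the list to a Tensor and places them into the Dataset in order. More commonly the argument is a tuple, in which case every element of the tuple must have the same length along the first dimension. from_tensor_slices() splits each element along that dimension and regroups the slices at matching positions into tuples, which are placed into the Dataset one by one. This is useful for pairing features with labels. The resulting Dataset can then be processed further with batch(), shuffle(), and so on.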
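The difference between the two methods is easiest to see by checking how many elements each Dataset holds and what shape they have. A minimal sketch, reusing the lists a and b from the cells above:

```python
import tensorflow as tf

a = [0, 1, 2, 3, 4]
b = [5, 6, 7, 8, 9]

whole = tf.data.Dataset.from_tensors([a, b])         # one element of shape (2, 5)
sliced = tf.data.Dataset.from_tensor_slices([a, b])  # two elements of shape (5,)

whole_shapes = [x.shape.as_list() for x in whole]
sliced_shapes = [x.shape.as_list() for x in sliced]
print(whole_shapes)   # [[2, 5]]
print(sliced_shapes)  # [[5], [5]]

# With a tuple, slices at the same position are paired up element by element.
pairs = tf.data.Dataset.from_tensor_slices((a, b))
first_pair = tuple(int(t.numpy()) for t in next(iter(pairs)))
print(first_pair)  # (0, 5)
```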

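(3) Create a Dataset by reading files from disk (text, images, and so on). tf.data provides classes such as TextLineDataset and TFRecordDataset for this. There is a lot to cover here and it is important, so I plan to devote a separate post to it.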
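As a small preview only, here is a sketch of TextLineDataset reading a text file line by line. The file name and its contents are made up for the illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write a small throwaway text file (path and contents are made up).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")

# Each line of the file becomes one scalar tf.string element.
lines_ds = tf.data.TextLineDataset(path)
lines = [line.numpy().decode("utf-8") for line in lines_ds]
print(lines)  # ['first line', 'second line', 'third line']
```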

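2 Utility methods

(1) take()

Purpose: returns a new Dataset containing a subset of the original Dataset's data.

Parameters:

count: an integer giving how many elements, counted from the start, go into the new Dataset. If count is -1 or greater than the size of the original Dataset, the new Dataset contains all of its elements.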

In [32]:

dataset = tf.data.Dataset.range(10)
dataset_take = dataset.take(5)

In [33]:

for i in dataset_take:
    print(i)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
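The two special cases of count described above can be checked quickly. The helper size is a name I made up; it simply counts elements by iterating.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

def size(ds):
    """Hypothetical helper: count the elements of a finite dataset."""
    return sum(1 for _ in ds)

take_sizes = {
    "take(5)": size(dataset.take(5)),
    "take(-1)": size(dataset.take(-1)),    # -1 keeps everything
    "take(100)": size(dataset.take(100)),  # larger than the dataset: keeps everything
}
print(take_sizes)  # {'take(5)': 5, 'take(-1)': 10, 'take(100)': 10}
```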

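(2) batch()

Purpose: combines consecutive elements of the Dataset into batches.

Parameters:

batch_size: the number of consecutive elements of this dataset to combine into a single batch.
drop_remainder: whether to drop the last batch when it has fewer than batch_size elements. Defaults to False, which keeps it.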

In [34]:

dataset = tf.data.Dataset.range(11)
dataset_batch = dataset.batch(3)

In [35]:

for i in dataset_batch:
    print(i)

tf.Tensor([0 1 2], shape=(3,), dtype=int64)
tf.Tensor([3 4 5], shape=(3,), dtype=int64)
tf.Tensor([6 7 8], shape=(3,), dtype=int64)
tf.Tensor([ 9 10], shape=(2,), dtype=int64)

In [36]:

dataset_batch = dataset.batch(3,drop_remainder=True)

In [37]:

for i in dataset_batch:
    print(i)

tf.Tensor([0 1 2], shape=(3,), dtype=int64)
tf.Tensor([3 4 5], shape=(3,), dtype=int64)
tf.Tensor([6 7 8], shape=(3,), dtype=int64)

In [38]:

train_x = tf.random.uniform((10,3),maxval=100, dtype=tf.int32)
train_y = tf.range(10)

In [39]:

dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y))

In [40]:

for i in dataset.take(3):
    print(i)

(<tf.Tensor: id=236, shape=(3,), dtype=int32, numpy=array([81, 53, 85], dtype=int32)>, <tf.Tensor: id=237, shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: id=238, shape=(3,), dtype=int32, numpy=array([13,  7, 25], dtype=int32)>, <tf.Tensor: id=239, shape=(), dtype=int32, numpy=1>)
(<tf.Tensor: id=240, shape=(3,), dtype=int32, numpy=array([83, 25, 55], dtype=int32)>, <tf.Tensor: id=241, shape=(), dtype=int32, numpy=2>)

In [41]:

dataset_batch = dataset.batch(4)

In [42]:

for i in dataset_batch:
    print(i)

(<tf.Tensor: id=250, shape=(4, 3), dtype=int32, numpy=
array([[81, 53, 85],
       [13,  7, 25],
       [83, 25, 55],
       [53, 41, 11]], dtype=int32)>, <tf.Tensor: id=251, shape=(4,), dtype=int32, numpy=array([0, 1, 2, 3], dtype=int32)>)
(<tf.Tensor: id=252, shape=(4, 3), dtype=int32, numpy=
array([[41, 58, 39],
       [44, 68, 55],
       [52, 34, 22],
       [66, 39,  5]], dtype=int32)>, <tf.Tensor: id=253, shape=(4,), dtype=int32, numpy=array([4, 5, 6, 7], dtype=int32)>)
(<tf.Tensor: id=254, shape=(2, 3), dtype=int32, numpy=
array([[73,  8, 20],
       [67, 71, 98]], dtype=int32)>, <tf.Tensor: id=255, shape=(2,), dtype=int32, numpy=array([8, 9], dtype=int32)>)

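Why split a Dataset into batches when training a model?

For a small dataset it makes little difference, but feeding a large dataset to the model in one go can easily exhaust memory.
Batches make better use of memory through parallelism. The goal is to keep the GPU as fully loaded as possible and so speed up training.
The batch size is a trade-off, though. A larger batch means fewer iterations per epoch and therefore fewer parameter updates, so reaching the same accuracy may take more epochs.
A suitable batch size gives a more accurate gradient descent direction.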
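The relationship between batch size and the number of iterations per epoch is simple arithmetic. The helper num_batches below is my own illustration, not a TensorFlow function; its results match the batch counts in the outputs shown earlier.

```python
import math

def num_batches(num_elements, batch_size, drop_remainder=False):
    """Hypothetical helper: how many batches batch() yields per pass."""
    if drop_remainder:
        # The incomplete last batch is discarded.
        return num_elements // batch_size
    # The incomplete last batch is kept.
    return math.ceil(num_elements / batch_size)

print(num_batches(11, 3))                       # 4
print(num_batches(11, 3, drop_remainder=True))  # 3
print(num_batches(10, 4))                       # 3
```

Doubling the batch size roughly halves the number of parameter updates per epoch, which is the trade-off mentioned in the list.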

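(3) padded_batch()

Purpose: an extended version of batch() that can batch consecutive elements whose shapes differ.

Parameters:

batch_size: the number of consecutive elements of this dataset to combine into a single batch.
padded_shapes: a tf.TensorShape, or a tf.int64 vector tensor-like object, giving the shape that each component of every input element should be padded to before batching. A dimension given as None is padded to the largest size of that dimension within each batch.
padding_values: the value used to pad each component. Defaults to 0 for numeric types and to the empty string for string types.
drop_remainder: whether to drop the last batch when it has fewer than batch_size elements. Defaults to False, which keeps it.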

In [43]:

dataset = tf.data.Dataset.range(10)

In [44]:

dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))

In [45]:

dataset_padded = dataset.padded_batch(4, padded_shapes=(None,))

In [46]:

for batch in dataset_padded:
    print(batch.numpy())
    print('---------------------')

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]
---------------------
[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
---------------------
[[8 8 8 8 8 8 8 8 0]
 [9 9 9 9 9 9 9 9 9]]
---------------------

In [47]:

dataset_padded = dataset.padded_batch(4, padded_shapes=(10,),padding_values=tf.constant(9,dtype=tf.int64))  # change the padded shape and the padding value

In [48]:

for batch in dataset_padded:
    print(batch.numpy())
    print('---------------------')

[[9 9 9 9 9 9 9 9 9 9]
 [1 9 9 9 9 9 9 9 9 9]
 [2 2 9 9 9 9 9 9 9 9]
 [3 3 3 9 9 9 9 9 9 9]]
---------------------
[[4 4 4 4 9 9 9 9 9 9]
 [5 5 5 5 5 9 9 9 9 9]
 [6 6 6 6 6 6 9 9 9 9]
 [7 7 7 7 7 7 7 9 9 9]]
---------------------
[[8 8 8 8 8 8 8 8 9 9]
 [9 9 9 9 9 9 9 9 9 9]]
---------------------
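When each element is a tuple, padded_shapes takes one entry per component. The sketch below pairs a variable-length sequence with a scalar label; the data is made up for illustration.

```python
import tensorflow as tf

# Each element is (a sequence of length x filled with x, the scalar label x).
dataset = tf.data.Dataset.range(1, 4).map(
    lambda x: (tf.fill([tf.cast(x, tf.int32)], x), x)
)

# Pad the sequence to the longest one in the batch; the scalar label needs no padding.
batched = dataset.padded_batch(3, padded_shapes=([None], []))

seqs, labels = next(iter(batched))
seqs_list = seqs.numpy().tolist()
labels_list = labels.numpy().tolist()
print(seqs_list)    # [[1, 0, 0], [2, 2, 0], [3, 3, 3]]
print(labels_list)  # [1, 2, 3]
```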

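(4) map()

Purpose: calls map_func on every element of the dataset and returns a new Dataset of the results. This is very handy for modifying elements during preprocessing.

Parameters:

map_func: the function applied to each element.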

In [49]:

def change_dtype(t):  # cast the element to int32
    return tf.cast(t,dtype=tf.int32)

In [50]:

dataset = tf.data.Dataset.range(3)

In [51]:

for i in dataset:
    print(i)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)

In [52]:

dataset_map = dataset.map(change_dtype)

In [53]:

for i in dataset_map:
    print(i)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)

The parameters of map_func must match the structure of the dataset's elements. For example, if each element is a tuple, map_func can be defined like this:

In [54]:

def change_dtype_2(t1,t2):
    return t1/10,tf.cast(t2,dtype=tf.int32)*(-1)  # divide the first component by 10, multiply the second by -1

In [55]:

dataset = tf.data.Dataset.from_tensor_slices((tf.range(3),tf.range(3)))

In [56]:

dataset_map = dataset.map(change_dtype_2)

In [57]:

for i in dataset_map:
    print(i)

(<tf.Tensor: id=347, shape=(), dtype=float64, numpy=0.0>, <tf.Tensor: id=348, shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: id=349, shape=(), dtype=float64, numpy=0.1>, <tf.Tensor: id=350, shape=(), dtype=int32, numpy=-1>)
(<tf.Tensor: id=351, shape=(), dtype=float64, numpy=0.2>, <tf.Tensor: id=352, shape=(), dtype=int32, numpy=-2>)
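map() also accepts an optional num_parallel_calls argument, which lets several elements be processed concurrently. A minimal sketch; the value 2 is an arbitrary choice for the example.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# num_parallel_calls is optional; 2 is an arbitrary value chosen for illustration.
squared = dataset.map(lambda x: x * x, num_parallel_calls=2)

squares = [int(x.numpy()) for x in squared]
print(squares)  # [0, 1, 4, 9, 16]
```

By default the output order still follows the input order, so the result is the same as with a sequential map.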

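(5) filter()

Purpose: applies the given predicate to every element of the Dataset and returns a Dataset containing only the elements for which it is true.

Parameters:

predicate: the filtering function. It must return a scalar boolean (True or False).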

In [58]:

dataset = tf.data.Dataset.range(5)

In [59]:

def filter_func(t):  # keep only the even elements
    return tf.equal(t % 2, 0)

In [60]:

dataset_filter = dataset.filter(filter_func)

In [61]:

for i in dataset_filter:
    print(i)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
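As with map(), the predicate's parameters must match the element structure. Here is a sketch that filters a (features, labels) dataset by label; the data and the function name keep_label_one are made up.

```python
import tensorflow as tf

features = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]])
labels = tf.constant([0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

def keep_label_one(x, y):
    # One parameter per tuple component; returns a scalar boolean tensor.
    return tf.equal(y, 1)

kept = [
    (x.numpy().tolist(), int(y.numpy()))
    for x, y in dataset.filter(keep_label_one)
]
print(kept)  # [([3, 4], 1), ([7, 8], 1)]
```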

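(6) shuffle()

Purpose: randomly shuffles the elements.

Parameters:

buffer_size: the size of the shuffle buffer, which controls how thoroughly the data is mixed. Elements are drawn at random from a buffer of this many elements, so a value of 1 leaves the order unchanged, and a value greater than or equal to the number of elements in the Dataset gives a full shuffle.
seed: the random seed used to create the distribution.
reshuffle_each_iteration: if True, the dataset is pseudorandomly reshuffled each time it is iterated over. Defaults to True.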

In [62]:

dataset = tf.data.Dataset.range(5)

In [63]:

dataset_s = dataset.shuffle(1)

In [64]:

for i in dataset_s:
    print(i)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)

In [65]:

dataset_s = dataset.shuffle(5)

In [66]:

for i in dataset_s:
    print(i)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
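To see why buffer_size controls how well the data is mixed, here is a simplified pure-Python model of a shuffle buffer. It is my own sketch of the idea, not TensorFlow's actual implementation, and buffered_shuffle is a made-up name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of shuffle(): sample uniformly from a sliding buffer."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    out = []
    while buffer:
        # Pick a random slot, emit it, and refill that slot from the source.
        idx = rng.randrange(len(buffer))
        out.append(buffer[idx])
        try:
            buffer[idx] = next(source)
        except StopIteration:
            buffer.pop(idx)
    return out

print(buffered_shuffle(range(5), buffer_size=1))  # [0, 1, 2, 3, 4]
shuffled = buffered_shuffle(range(10), buffer_size=3, seed=0)
print(sorted(shuffled) == list(range(10)))  # True
```

With a buffer of size k, the element emitted at position i can only come from the first i + k inputs, so a small buffer only mixes elements locally.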

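(7) repeat()

Purpose: repeats the data in the Dataset to create a new Dataset.

Parameters:

count: the number of times to repeat the data. The default, None, repeats indefinitely, as does -1.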

In [67]:

dataset = tf.data.Dataset.range(3)

In [68]:

dataset_repeat = dataset.repeat(3)

In [69]:

for i in dataset_repeat:
    print(i)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
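Calling repeat() without an argument produces an endless Dataset, so a sketch like the one below bounds it with take() before iterating.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(3)

# repeat() with no count never ends on its own, so limit it with take().
endless = dataset.repeat()
first_seven = [int(x.numpy()) for x in endless.take(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```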
