TensorFlow's Three Ways of Reading Data, Explained (with next_batch)

Reposted on 2018-07-22 16:45. Category: Python / TensorFlow

Reposted from: https://blog.csdn.net/lujiandong1/article/details/53376802

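TensorFlow can read data in three ways:

Preloaded data: the data is loaded into the graph in advance.
Feeding: Python code produces the data and feeds it to the backend.
Reading from file: the data is read directly from files.

How do these three approaches differ? To answer that, we first need to know how TensorFlow (TF) works.

TF's core is written in C++, which makes it fast to run but inflexible to call. Python is the opposite, so TF combines the strengths of both languages: the core operators and the runtime framework are written in C++ and exposed to Python through an API. Python calls this API to design the training model (the Graph) and then hands the finished Graph to the backend for execution. In short, Python's role is Design and C++'s role is Run.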
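
The Design/Run split can be illustrated with a toy deferred-execution graph in plain Python. This is only a sketch of the idea: the classes `Constant`, `Placeholder`, `Add` and the function `run` are hypothetical names invented for illustration and are not TensorFlow's API.

```python
# Toy illustration only: these classes are hypothetical, NOT TensorFlow's API.

class Constant:
    """A node whose value is stored inside the graph itself."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value


class Placeholder:
    """A node that has no value until one is fed at run time."""
    def evaluate(self, feed):
        return feed[self]


class Add:
    """A node that adds its two inputs element by element."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed):
        left = self.a.evaluate(feed)
        right = self.b.evaluate(feed)
        return [p + q for p, q in zip(left, right)]


def run(node, feed=None):
    """Stand-in for the backend: this is the only place anything is computed."""
    return node.evaluate(feed or {})


# Design phase: build the graphs; nothing is computed yet.
preloaded = Add(Constant([2, 3, 4]), Constant([4, 0, 1]))
x1, x2 = Placeholder(), Placeholder()
fed = Add(x1, x2)

# Run phase: hand each graph to the "backend".
preloaded_result = run(preloaded)
fed_result = run(fed, feed={x1: [2, 3, 4], x2: [4, 0, 1]})
print(preloaded_result, fed_result)
```

Building `preloaded` and `fed` performs no arithmetic; the additions happen only inside `run`, which mirrors how a session executes a graph that Python merely described.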

I. Preloaded data:

import tensorflow as tf
# Design the Graph
x1 = tf.constant([2, 3, 4])
x2 = tf.constant([4, 0, 1])
y = tf.add(x1, x2)
# Open a session --> compute y
with tf.Session() as sess:
    print(sess.run(y))

II. Python produces the data and feeds it to the backend

import tensorflow as tf
# Design the Graph
x1 = tf.placeholder(tf.int16)
x2 = tf.placeholder(tf.int16)
y = tf.add(x1, x2)
# Produce the data in Python
li1 = [2, 3, 4]
li2 = [4, 0, 1]
# Open a session --> feed the data --> compute y
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x1: li1, x2: li2}))

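Note: here x1 and x2 are only placeholders with no concrete values, so where do the values come from at run time? This is what the feed_dict argument of sess.run() is for: it feeds the data produced in Python to the backend, which then computes y.

Drawbacks of these two approaches:

1. Preloading: the data is embedded directly in the Graph, and the Graph is then passed to the Session to run. When the data set is large, transferring the Graph becomes inefficient.

2. Feeding: placeholders stand in for the data, which is filled in at run time.

Both methods are convenient, but they struggle with large data sets. Even with Feeding, the extra intermediate steps, such as data type conversion, add considerable overhead. The best option is to define the file-reading procedure in the Graph and let TF read the files itself and decode them into a usable set of samples.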
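
The idea of a pipeline that reads and decodes files on its own, one sample at a time, can be sketched with the standard library. This is a plain-Python analogy rather than TF code: the function name `read_samples` and the demo rows are assumptions made for illustration.

```python
import csv
import os
import tempfile


def read_samples(filenames):
    """Yield one decoded (example, label) pair per CSV row, file by file."""
    for name in filenames:
        with open(name, newline="") as handle:
            for example, label in csv.reader(handle):
                yield example, label


# Create small demo files in a temporary directory (hypothetical data).
workdir = tempfile.mkdtemp()
rows_by_file = {
    "A.csv": [("Alpha1", "A1"), ("Alpha2", "A2")],
    "B.csv": [("Bee1", "B1"), ("Bee2", "B2")],
}
paths = []
for name, rows in rows_by_file.items():
    path = os.path.join(workdir, name)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    paths.append(path)

# The generator reads lazily; list() is used here only to show the result.
samples = list(read_samples(paths))
print(samples)
```

Because `read_samples` is a generator, the caller never has to hold a whole file in memory, which is the property that makes file-based reading attractive for large data sets.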

III. Reading from files: in short, build the data-reading part of the graph

1. Prepare the data: create three files, A.csv, B.csv and C.csv

$ echo -e "Alpha1,A1\nAlpha2,A2\nAlpha3,A3" > A.csv
$ echo -e "Bee1,B1\nBee2,B2\nBee3,B3" > B.csv
$ echo -e "Sea1,C1\nSea2,C2\nSea3,C3" > C.csv

2. Single Reader, single sample

#-*- coding:utf-8 -*-
import tensorflow as tf
# Create a FIFO queue and a QueueRunner that produce the filename queue
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Define the Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
# Define the Decoder
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
#example_batch, label_batch = tf.train.shuffle_batch([example,label], batch_size=1, capacity=200, min_after_dequeue=100, num_threads=2)
# Run the Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()  # create a coordinator to manage the threads
    threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners; the filename queue is now being filled
    for i in range(10):
        # Two separate eval() calls: each one runs the graph and reads a new record
        print(example.eval(), label.eval())
    coord.request_stop()
    coord.join(threads)

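Note: the samples and labels printed here do not correspond; they are out of step. Each eval() call is a separate run of the graph and makes the reader read a new record, so example.eval() and label.eval() return fields from two consecutive records. The output is: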

Alpha1 A2
Alpha3 B1
Bee2 B3
Sea1 C2
Sea3 A1
Alpha2 A3
Bee1 B2
Bee3 C1
Sea2 C3
Alpha1 A2
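
The mismatch can be reproduced without TensorFlow. The sketch below is a plain-Python simulation, assuming that every evaluation pulls one fresh record from a single in-order reader; the helper names `separate_evals` and `single_run` are hypothetical.

```python
from itertools import cycle, islice

# Records in the order a single reader yields them from A.csv, B.csv, C.csv.
records = [
    ("Alpha1", "A1"), ("Alpha2", "A2"), ("Alpha3", "A3"),
    ("Bee1", "B1"), ("Bee2", "B2"), ("Bee3", "B3"),
    ("Sea1", "C1"), ("Sea2", "C2"), ("Sea3", "C3"),
]


def separate_evals(stream):
    """Mimic example.eval() followed by label.eval(): two reads per printed line."""
    while True:
        example, _ = next(stream)   # the first run consumes one record
        _, label = next(stream)     # the second run consumes the next record
        yield example, label


def single_run(stream):
    """Mimic fetching example and label in one run: one read supplies both."""
    for example, label in stream:
        yield example, label


# cycle() stands in for the filename queue repeating the files forever.
mismatched = list(islice(separate_evals(cycle(records)), 10))
aligned = list(islice(single_run(cycle(records)), 10))

for example, label in mismatched:
    print(example, label)
```

The printed pairs follow the same pattern as the listing above (Alpha1 with A2, Alpha3 with B1, and so on), while `aligned` keeps every example with its own label.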

Solution: fetch the sample and the label in the same sess.run() call so that both values come from one record. The version below does this on the output of tf.train.shuffle_batch, and the results then correspond.

#-*- coding:utf-8 -*-
import tensorflow as tf
# Create a FIFO queue and a QueueRunner that produce the filename queue
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Define the Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
# Define the Decoder
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
example_batch, label_batch = tf.train.shuffle_batch([example,label], batch_size=1, capacity=200, min_after_dequeue=100, num_threads=2)
# Run the Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()  # create a coordinator to manage the threads
    threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners; the filename queue is now being filled
    for i in range(10):
        # One run fetches both tensors, so they come from the same dequeued batch
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)
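
What the capacity and min_after_dequeue arguments control can be pictured with a toy shuffle buffer. The sketch below is a simplified, single-threaded analogy under stated assumptions: the function `shuffle_batches` is hypothetical, the buffer is refilled synchronously (in TF a background thread does this), and the input stream is assumed to be endless.

```python
import random
from itertools import cycle

records = [
    ("Alpha1", "A1"), ("Alpha2", "A2"), ("Alpha3", "A3"),
    ("Bee1", "B1"), ("Bee2", "B2"), ("Bee3", "B3"),
    ("Sea1", "C1"), ("Sea2", "C2"), ("Sea3", "C3"),
]


def shuffle_batches(stream, batch_size, capacity, min_after_dequeue, seed=0):
    """Yield batches of whole records drawn at random from a bounded buffer."""
    if capacity <= min_after_dequeue:
        raise ValueError("capacity must exceed min_after_dequeue")
    rng = random.Random(seed)
    buffer = []
    while True:
        batch = []
        while len(batch) < batch_size:
            # Top the buffer up to capacity (assumes the stream never ends),
            # so at least min_after_dequeue records remain after each pop.
            while len(buffer) < capacity:
                buffer.append(next(stream))
            batch.append(buffer.pop(rng.randrange(len(buffer))))
        yield batch


batch_source = shuffle_batches(cycle(records), batch_size=1,
                               capacity=200, min_after_dequeue=100)
batches = [next(batch_source) for _ in range(10)]
for batch in batches:
    print(batch)
```

The order of the batches is random, but each buffered element is a whole (example, label) record, so shuffling can never separate a sample from its label.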

3. Single Reader, multiple samples, implemented here with tf.train.batch (tf.train.shuffle_batch is used the same way)

#-*- coding:utf-8 -*-
import tensorflow as tf
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
# tf.train.batch() adds a sample queue and a QueueRunner.
# Decoded data enters this queue and is then dequeued in batches.
# Although there is only one Reader here, several threads can be used. Adding threads speeds up reading, but more threads are not always better.
example_batch, label_batch = tf.train.batch(
      [example, label], batch_size=5)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)

Note: written the following way, the features and labels of the batch_size samples that are extracted are also out of sync, because example_batch.eval() and label_batch.eval() are two separate runs and each one dequeues a fresh batch.

#-*- coding:utf-8 -*-
import tensorflow as tf
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
# tf.train.batch() adds a sample queue and a QueueRunner.
# Decoded data enters this queue and is then dequeued in batches.
# Although there is only one Reader here, several threads can be used. Adding threads speeds up reading, but more threads are not always better.
example_batch, label_batch = tf.train.batch(
      [example, label], batch_size=5)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        # Two separate eval() calls: each one dequeues its own batch
        print(example_batch.eval(), label_batch.eval())
    coord.request_stop()
    coord.join(threads)

Note: the output is as follows; the features and labels clearly do not correspond:

['Alpha1' 'Alpha2' 'Alpha3' 'Bee1' 'Bee2'] ['B3' 'C1' 'C2' 'C3' 'A1']
['Alpha2' 'Alpha3' 'Bee1' 'Bee2' 'Bee3'] ['C1' 'C2' 'C3' 'A1' 'A2']
['Alpha3' 'Bee1' 'Bee2' 'Bee3' 'Sea1'] ['C2' 'C3' 'A1' 'A2' 'A3']
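
The first line of this output, and the way to avoid the mismatch, can be checked with a small plain-Python model. It assumes that every evaluation of a batch tensor dequeues the next five records in order; `dequeue_batch` is a hypothetical helper, not a TF function.

```python
from itertools import cycle, islice

records = [
    ("Alpha1", "A1"), ("Alpha2", "A2"), ("Alpha3", "A3"),
    ("Bee1", "B1"), ("Bee2", "B2"), ("Bee3", "B3"),
    ("Sea1", "C1"), ("Sea2", "C2"), ("Sea3", "C3"),
]


def dequeue_batch(stream, batch_size=5):
    """Mimic one run of a batch output: take the next batch_size records."""
    return list(islice(stream, batch_size))


# Separate evaluations: features and labels come from two different batches.
stream = cycle(records)
examples = [example for example, _ in dequeue_batch(stream)]
labels = [label for _, label in dequeue_batch(stream)]
print(examples, labels)

# A single fetch: one dequeued batch supplies both columns.
stream = cycle(records)
batch = dequeue_batch(stream)
aligned_examples = [example for example, _ in batch]
aligned_labels = [label for _, label in batch]
print(aligned_examples, aligned_labels)
```

In the second half every feature sits next to its own label, which is the effect of fetching both batch tensors in one run.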

4. Multiple readers, multiple samples

#-*- coding:utf-8 -*-
import tensorflow as tf
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [['null'], ['null']]
# Build several decoders; note that in this listing they all consume the output of the one reader defined above
example_list = [tf.decode_csv(value, record_defaults=record_defaults)
                  for _ in range(2)]  # number of parallel pipelines set to 2
# tf.train.batch_join() reads from the elements of example_list in parallel, one thread per element.
example_batch, label_batch = tf.train.batch_join(
      example_list, batch_size=5)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)

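tf.train.batch and tf.train.shuffle_batch read through a single Reader but can use multiple threads. tf.train.batch_join and tf.train.shuffle_batch_join can read through multiple Readers, each on its own thread. As for the efficiency of the two approaches: with a single Reader, two threads already reach the speed limit; with multiple Readers, two Readers already reach it. So more threads do not necessarily mean more speed, and too many threads can even reduce efficiency.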
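
The "one thread per Reader, one shared sample queue" arrangement can be sketched with Python's own threading and queue modules. This is an analogy, not the TF implementation: `reader_worker` is a hypothetical name, and giving each reader a fixed slice of the records is a simplification (in TF the readers pull filenames from a shared filename queue).

```python
import queue
import threading

records = [
    ("Alpha1", "A1"), ("Alpha2", "A2"), ("Alpha3", "A3"),
    ("Bee1", "B1"), ("Bee2", "B2"), ("Bee3", "B3"),
    ("Sea1", "C1"), ("Sea2", "C2"), ("Sea3", "C3"),
]


def reader_worker(rows, out_queue):
    """One 'reader' thread: push each whole record into the shared queue."""
    for row in rows:
        out_queue.put(row)


num_readers = 2
# A small bounded queue: producers block whenever the consumer falls behind.
shared = queue.Queue(maxsize=4)
shares = [records[i::num_readers] for i in range(num_readers)]
workers = [threading.Thread(target=reader_worker, args=(share, shared))
           for share in shares]
for worker in workers:
    worker.start()

# The consumer drains exactly as many records as the readers produce.
received = [shared.get() for _ in range(len(records))]
for worker in workers:
    worker.join()
print(received)
```

The arrival order depends on thread scheduling, but every element is a complete record. The bounded queue also hints at why extra threads stop helping: once the consumer is the bottleneck, additional producers only wait.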

5. Iteration control: set the epoch parameter to specify how many passes over the samples are allowed during training

#-*- coding:utf-8 -*-
import tensorflow as tf
filenames = ['A.csv', 'B.csv', 'C.csv']
# num_epochs: the number of passes over the file list
filename_queue = tf.train.string_input_producer(filenames, shuffle=False, num_epochs=3)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [['null'], ['null']]
# Build several decoders; note that in this listing they all consume the output of the one reader defined above
example_list = [tf.decode_csv(value, record_defaults=record_defaults)
                  for _ in range(2)]  # number of parallel pipelines set to 2
# tf.train.batch_join() reads from the elements of example_list in parallel, one thread per element.
example_batch, label_batch = tf.train.batch_join(
      example_list, batch_size=1)
# Initialize local variables (this op was named tf.initialize_local_variables() in early TF releases)
init_local_op = tf.local_variables_initializer()
with tf.Session() as sess:
    sess.run(init_local_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while not coord.should_stop():
            e_val, l_val = sess.run([example_batch, label_batch])
            print(e_val, l_val)
    except tf.errors.OutOfRangeError:
        print('Epochs Complete!')
    finally:
        coord.request_stop()
    coord.join(threads)

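With iteration control, remember to add tf.local_variables_initializer() (called tf.initialize_local_variables() in early TF releases). The epoch counter created by num_epochs is a local variable, so running without initializing it raises an error. The official tutorial does not mention this.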
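
The stopping behaviour can be modelled in plain Python. The sketch below is a toy under stated assumptions: `EpochLimitedQueue` and `OutOfRange` are hypothetical stand-ins for the filename queue and tf.errors.OutOfRangeError, and `epochs_done` plays the part of the local counter that has to be initialized.

```python
class OutOfRange(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


class EpochLimitedQueue:
    """Serve the items in order for at most num_epochs passes."""

    def __init__(self, items, num_epochs):
        self.items = list(items)
        self.num_epochs = num_epochs
        self.epochs_done = 0   # the counter that must exist before reading starts
        self.position = 0

    def dequeue(self):
        if self.epochs_done >= self.num_epochs:
            raise OutOfRange("epoch limit reached")
        item = self.items[self.position]
        self.position += 1
        if self.position == len(self.items):
            # One full pass is finished: wrap around and count the epoch.
            self.position = 0
            self.epochs_done += 1
        return item


source = EpochLimitedQueue(["A.csv", "B.csv", "C.csv"], num_epochs=3)
served = []
try:
    while True:
        served.append(source.dequeue())
except OutOfRange:
    print("Epochs Complete!")
print(len(served), "filenames served")
```

The consumer loop has no explicit bound; as in the TF listing, it runs until the source signals that the permitted number of passes is used up.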

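=========================================================================================

For traditional machine learning, take a classification problem as an example: [x1 x2 x3] are the features. For binary classification, the label after one-hot encoding is [0,1] or [1,0]. Typically we organize the data in a csv file, with one sample per row, and then read it through queues.

Note: in this data, the first three columns are the features; because this is a classification problem, the last two columns are the label obtained by one-hot encoding.

The code that reads this csv file through queues is as follows: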

#-*- coding:utf-8 -*-
import tensorflow as tf
# Create a FIFO queue and a QueueRunner that produce the filename queue.
# A.csv here must hold five integer columns per row (three features, then a two-column one-hot label).
filenames = ['A.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Define the Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
# Define the Decoder
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(value, record_defaults=record_defaults)
# tf.stack was named tf.pack before TF 1.0
features = tf.stack([col1, col2, col3])
label = tf.stack([col4, col5])
example_batch, label_batch = tf.train.shuffle_batch([features,label], batch_size=2, capacity=200, min_after_dequeue=100, num_threads=2)
# Run the Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()  # create a coordinator to manage the threads
    threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners; the filename queue is now being filled
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)

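The output is as follows:

Note:

record_defaults = [[1], [1], [1], [1], [1]]

This is the parsing template. Each sample has 5 columns, separated by ',' by default, and the template entry [1] means that every column is parsed as an integer. [1.0] parses a column as a float, and ['null'] parses it as a string.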
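
How a template of defaults fixes both the type and the fallback value of each column can be sketched in plain Python. The function `decode_csv_line` below is a hypothetical helper written for illustration; it imitates the idea behind record_defaults and is not the TF implementation.

```python
def decode_csv_line(line, record_defaults, delimiter=","):
    """Parse one CSV line; each column takes its type and fallback from record_defaults."""
    fields = line.split(delimiter)
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d columns, got %d"
                         % (len(record_defaults), len(fields)))
    decoded = []
    for text, (default,) in zip(fields, record_defaults):
        if text == "":
            decoded.append(default)               # empty field: use the default value
        else:
            decoded.append(type(default)(text))   # otherwise cast to the default's type
    return decoded


# Five integer columns: three features followed by a one-hot label.
row = decode_csv_line("1,0,1,0,1", [[1], [1], [1], [1], [1]])
# Mixed types: float, integer (left empty, so the default is used), string.
mixed = decode_csv_line("2.5,,Alpha1", [[1.0], [7], ["null"]])
print(row, mixed)
```

The second call shows the two jobs of a default at once: `[1.0]` makes the first column a float, while the empty second column falls back to the integer 7.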

Tensorflow數據讀取有三種方式：

Preloaded data: 預加載數據
Feeding: Python產生數據，再把數據喂給后端。
Reading from file: 從文件中直接讀取

這三種有讀取方式有什么區別呢？我們首先要知道TensorFlow(TF)是怎么樣工作的。

一、預加載數據：

 
                 import 
                 tensorflow as tf  
                
 
                 # 設計Graph  
                
 
                 x1  
                 = 
                 tf.constant([ 
                 2 
                 ,  
                 3 
                 ,  
                 4 
                 ])  
                
 
                 x2  
                 = 
                 tf.constant([ 
                 4 
                 ,  
                 0 
                 ,  
                 1 
                 ])  
                
 
                 y  
                 = 
                 tf.add(x1, x2)  
                
 
                 # 打開一個session --> 計算y  
                
 
                 with tf.Session() as sess:  
                
 
                    
                 print 
                 sess.run(y) 
                

二、python產生數據，再將數據喂給后端

 
                 import 
                 tensorflow as tf  
                
 
                 # 設計Graph  
                
 
                 x1  
                 = 
                 tf.placeholder(tf.int16)  
                
 
                 x2  
                 = 
                 tf.placeholder(tf.int16)  
                
 
                 y  
                 = 
                 tf.add(x1, x2)  
                
 
                 # 用Python產生數據  
                
 
                 li1  
                 = 
                 [ 
                 2 
                 ,  
                 3 
                 ,  
                 4 
                 ]  
                
 
                 li2  
                 = 
                 [ 
                 4 
                 ,  
                 0 
                 ,  
                 1 
                 ]  
                
 
                 # 打開一個session --> 喂數據 --> 計算y  
                
 
                 with tf.Session() as sess:  
                
 
                    
                 print 
                 sess.run(y, feed_dict 
                 = 
                 {x1: li1, x2: li2}) 
                

說明：在這里x1, x2只是占位符，沒有具體的值，那么運行的時候去哪取值呢？這時候就要用到sess.run()中的feed_dict參數，將Python產生的數據喂給后端，並計算y。

這兩種方案的缺點：

1、預加載：將數據直接內嵌到Graph中，再把Graph傳入Session中運行。當數據量比較大時，Graph的傳輸會遇到效率問題。

2、用占位符替代數據，待運行的時候填充數據。

三、從文件中讀取，簡單來說就是將數據讀取模塊的圖搭好

1、准備數據，構造三個文件,A.csv,B.csv,C.csv

1

2

3

 
                 $ echo  
                 - 
                 e  
                 "Alpha1,A1\nAlpha2,A2\nAlpha3,A3" 
                 > A.csv  
                
 
                 $ echo  
                 - 
                 e  
                 "Bee1,B1\nBee2,B2\nBee3,B3" 
                 > B.csv  
                
 
                 $ echo  
                 - 
                 e  
                 "Sea1,C1\nSea2,C2\nSea3,C3" 
                 > C.csv 
                

2、單個Reader，單個樣本

 
                 #-*- coding:utf-8 -*-  
                
 
                 import 
                 tensorflow as tf  
                
 
                 # 生成一個先入先出隊列和一個QueueRunner,生成文件名隊列  
                
 
                 filenames  
                 = 
                 [ 
                 'A.csv' 
                 ,  
                 'B.csv' 
                 ,  
                 'C.csv' 
                 ]  
                
 
                 filename_queue  
                 = 
                 tf.train.string_input_producer(filenames, shuffle 
                 = 
                 False 
                 )  
                
 
                 # 定義Reader  
                
 
                 reader  
                 = 
                 tf.TextLineReader()  
                
 
                 key, value  
                 = 
                 reader.read(filename_queue)  
                
 
                 # 定義Decoder  
                
 
                 example, label  
                 = 
                 tf.decode_csv(value, record_defaults 
                 = 
                 [[ 
                 'null' 
                 ], [ 
                 'null' 
                 ]])  
                
 
                 #example_batch, label_batch = tf.train.shuffle_batch([example,label], batch_size=1, capacity=200, min_after_dequeue=100, num_threads=2)  
                
 
                 # 運行Graph  
                
 
                 with tf.Session() as sess:  
                
 
                    
                 coord  
                 = 
                 tf.train.Coordinator()  
                 #創建一個協調器，管理線程  
                
 
                    
                 threads  
                 = 
                 tf.train.start_queue_runners(coord 
                 = 
                 coord)  
                 #啟動QueueRunner, 此時文件名隊列已經進隊。  
                
 
                    
                 for 
                 i  
                 in 
                 range 
                 ( 
                 10 
                 ):  
                
 
                      
                 print 
                 example. 
                 eval 
                 (),label. 
                 eval 
                 ()  
                
 
                    
                 coord.request_stop()  
                
 
                    
                 coord.join(threads)  
                

說明：這里沒有使用tf.train.shuffle_batch，會導致生成的樣本和label之間對應不上，亂序了。生成結果如下：

Alpha1 A2
Alpha3 B1
Bee2 B3
Sea1 C2
Sea3 A1
Alpha2 A3
Bee1 B2
Bee3 C1
Sea2 C3
Alpha1 A2

解決方案：用tf.train.shuffle_batch,那么生成的結果就能夠對應上。

 
                 #-*- coding:utf-8 -*-  
                
 
                 import 
                 tensorflow as tf  
                
 
                 # 生成一個先入先出隊列和一個QueueRunner,生成文件名隊列  
                
 
                 filenames  
                 = 
                 [ 
                 'A.csv' 
                 ,  
                 'B.csv' 
                 ,  
                 'C.csv' 
                 ]  
                
 
                 filename_queue  
                 = 
                 tf.train.string_input_producer(filenames, shuffle 
                 = 
                 False 
                 )  
                
 
                 # 定義Reader  
                
 
                 reader  
                 = 
                 tf.TextLineReader()  
                
 
                 key, value  
                 = 
                 reader.read(filename_queue)  
                
 
                 # 定義Decoder  
                
 
                 example, label  
                 = 
                 tf.decode_csv(value, record_defaults 
                 = 
                 [[ 
                 'null' 
                 ], [ 
                 'null' 
                 ]])  
                
 
                 example_batch, label_batch  
                 = 
                 tf.train.shuffle_batch([example,label], batch_size 
                 = 
                 1 
                 , capacity 
                 = 
                 200 
                 , min_after_dequeue 
                 = 
                 100 
                 , num_threads 
                 = 
                 2 
                 )  
                
 
                 # 運行Graph  
                
 
                 with tf.Session() as sess:  
                
 
                    
                 coord  
                 = 
                 tf.train.Coordinator()  
                 #創建一個協調器，管理線程  
                
 
                    
                 threads  
                 = 
                 tf.train.start_queue_runners(coord 
                 = 
                 coord)  
                 #啟動QueueRunner, 此時文件名隊列已經進隊。  
                
 
                    
                 for 
                 i  
                 in 
                 range 
                 ( 
                 10 
                 ):  
                
 
                      
                 e_val,l_val  
                 = 
                 sess.run([example_batch, label_batch])  
                
 
                      
                 print 
                 e_val,l_val  
                
 
                    
                 coord.request_stop()  
                
 
                    
                 coord.join(threads) 
                

3、單個Reader,多個樣本,主要也是通過tf.train.shuffle_batch來實現

 
                 #-*- coding:utf-8 -*-  
                
                 import 
                 tensorflow as tf  
                
                 filenames  
                 = 
                 [ 
                 'A.csv' 
                 ,  
                 'B.csv' 
                 ,  
                 'C.csv' 
                 ]  
                
                 filename_queue  
                 = 
                 tf.train.string_input_producer(filenames, shuffle 
                 = 
                 False 
                 )  
                
                 reader  
                 = 
                 tf.TextLineReader()  
                
                 key, value  
                 = 
                 reader.read(filename_queue)  
                
                 example, label  
                 = 
                 tf.decode_csv(value, record_defaults 
                 = 
                 [[ 
                 'null' 
                 ], [ 
                 'null' 
                 ]])  
                
                 # 使用tf.train.batch()會多加了一個樣本隊列和一個QueueRunner。  
                
                 #Decoder解后數據會進入這個隊列，再批量出隊。  
                
                 # 雖然這里只有一個Reader，但可以設置多線程，相應增加線程數會提高讀取速度，但並不是線程越多越好。  
                
                 example_batch, label_batch  
                 = 
                 tf.train.batch(  
                
                 [example, label], batch_size 
                 = 
                 5 
                 )  
                
                 with tf.Session() as sess:  
                
                 coord  
                 = 
                 tf.train.Coordinator()  
                
                 threads  
                 = 
                 tf.train.start_queue_runners(coord 
                 = 
                 coord)  
                
                 for 
                 i  
                 in 
                 range 
                 ( 
                 10 
                 ):  
                
                 e_val,l_val  
                 = 
                 sess.run([example_batch,label_batch])  
                
                 print 
                 e_val,l_val  
                
                 coord.request_stop()  
                
                 coord.join(threads)

說明：下面這種寫法，提取出來的batch_size個樣本，特征和label之間也是不同步的

 
                 #-*- coding:utf-8 -*-  
                
 
                 import 
                 tensorflow as tf  
                
 
                 filenames  
                 = 
                 [ 
                 'A.csv' 
                 ,  
                 'B.csv' 
                 ,  
                 'C.csv' 
                 ]  
                
 
                 filename_queue  
                 = 
                 tf.train.string_input_producer(filenames, shuffle 
                 = 
                 False 
                 )  
                
 
                 reader  
                 = 
                 tf.TextLineReader()  
                
 
                 key, value  
                 = 
                 reader.read(filename_queue)  
                
 
                 example, label  
                 = 
                 tf.decode_csv(value, record_defaults 
                 = 
                 [[ 
                 'null' 
                 ], [ 
                 'null' 
                 ]])  
                
 
                 # 使用tf.train.batch()會多加了一個樣本隊列和一個QueueRunner。  
                
 
                 #Decoder解后數據會進入這個隊列，再批量出隊。  
                
 
                 # 雖然這里只有一個Reader，但可以設置多線程，相應增加線程數會提高讀取速度，但並不是線程越多越好。  
                
 
                 example_batch, label_batch  
                 = 
                 tf.train.batch(  
                
 
                     
                 [example, label], batch_size 
                 = 
                 5 
                 )  
                
 
                 with tf.Session() as sess:  
                
 
                    
                 coord  
                 = 
                 tf.train.Coordinator()  
                
 
                    
                 threads  
                 = 
                 tf.train.start_queue_runners(coord 
                 = 
                 coord)  
                
 
                    
                 for 
                 i  
                 in 
                 range 
                 ( 
                 10 
                 ):  
                
 
                      
                 print 
                 example_batch. 
                 eval 
                 (), label_batch. 
                 eval 
                 ()  
                
 
                    
                 coord.request_stop()  
                
 
                    
                 coord.join(threads)  
                

說明：輸出結果如下：可以看出feature和label之間是不對應的

['Alpha1' 'Alpha2' 'Alpha3' 'Bee1' 'Bee2'] ['B3' 'C1' 'C2' 'C3' 'A1']
['Alpha2' 'Alpha3' 'Bee1' 'Bee2' 'Bee3'] ['C1' 'C2' 'C3' 'A1' 'A2']
['Alpha3' 'Bee1' 'Bee2' 'Bee3' 'Sea1'] ['C2' 'C3' 'A1' 'A2' 'A3']

4、多個reader，多個樣本

 
                 #-*- coding:utf-8 -*-  
                
 
                 import 
                 tensorflow as tf  
                
 
                 filenames  
                 = 
                 [ 
                 'A.csv' 
                 ,  
                 'B.csv' 
                 ,  
                 'C.csv' 
                 ]  
                
 
                 filename_queue  
                 = 
                 tf.train.string_input_producer(filenames, shuffle 
                 = 
                 False 
                 )  
                
 
                 reader  
                 = 
                 tf.TextLineReader()  
                
 
                 key, value  
                 = 
                 reader.read(filename_queue)  
                
 
                 record_defaults  
                 = 
                 [[ 
                 'null' 
                 ], [ 
                 'null' 
                 ]]  
                
 
                 #定義了多種解碼器,每個解碼器跟一個reader相連  
                
 
                 example_list  
                 = 
                 [tf.decode_csv(value, record_defaults 
                 = 
                 record_defaults)  
                
 
                           
                 for 
                 _  
                 in 
                 range 
                 ( 
                 2 
                 )]  
                 # Reader設置為2  
                
 
                 # 使用tf.train.batch_join()，可以使用多個reader，並行讀取數據。每個Reader使用一個線程。  
                
 
                 example_batch, label_batch  
                 = 
                 tf.train.batch_join(  
                
 
                     
                 example_list, batch_size 
                 = 
                 5 
                 )  
                
 
                 with tf.Session() as sess:  
                
 
                    
                 coord  
                 = 
                 tf.train.Coordinator()  
                
 
                    
                 threads  
                 = 
                 tf.train.start_queue_runners(coord 
                 = 
                 coord)  
                
 
                    
                 for 
                 i  
                 in 
                 range 
                 ( 
                 10 
                 ):  
                
 
                      
                 e_val,l_val  
                 = 
                 sess.run([example_batch,label_batch])  
                
 
                      
                 print 
                 e_val,l_val  
                
 
                    
                 coord.request_stop()  
                
 
                    
                 coord.join(threads)  
                

tf.train.batch與tf.train.shuffle_batch函數是單個Reader讀取，但是可以多線程。tf.train.batch_join與tf.train.shuffle_batch_join可設置多Reader讀取，每個Reader使用一個線程。至於兩種方法的效率，單Reader時，2個線程就達到了速度的極限。多Reader時，2個Reader就達到了極限。所以並不是線程越多越快，甚至更多的線程反而會使效率下降。

5、迭代控制，設置epoch參數，指定我們的樣本在訓練的時候只能被用多少輪

 
                 #-*- coding:utf-8 -*-  
                
                 import 
                 tensorflow as tf  
                
                 filenames  
                 = 
                 [ 
                 'A.csv' 
                 ,  
                 'B.csv' 
                 ,  
                 'C.csv' 
                 ]  
                
                 #num_epoch: 設置迭代數  
                
                 filename_queue  
                 = 
                 tf.train.string_input_producer(filenames, shuffle 
                 = 
                 False 
                 ,num_epochs 
                 = 
                 3 
                 )  
                
                 reader  
                 = 
                 tf.TextLineReader()  
                
                 key, value  
                 = 
                 reader.read(filename_queue)  
                
                 record_defaults  
                 = 
                 [[ 
                 'null' 
                 ], [ 
                 'null' 
                 ]]  
                
                 #定義了多種解碼器,每個解碼器跟一個reader相連  
                
                 example_list  
                 = 
                 [tf.decode_csv(value, record_defaults 
                 = 
                 record_defaults)  
                
                 for 
                 _  
                 in 
                 range 
                 ( 
                 2 
                 )]  
                 # Reader設置為2  
                
                 # 使用tf.train.batch_join()，可以使用多個reader，並行讀取數據。每個Reader使用一個線程。  
                
                 example_batch, label_batch  
                 = 
                 tf.train.batch_join(  
                
                 example_list, batch_size 
                 = 
                 1 
                 )  
                
                 #初始化本地變量  
                
                 init_local_op  
                 = 
                 tf.initialize_local_variables()  
                
                 with tf.Session() as sess:  
                
                 sess.run(init_local_op)  
                
                 coord  
                 = 
                 tf.train.Coordinator()  
                
                 threads  
                 = 
                 tf.train.start_queue_runners(coord 
                 = 
                 coord)  
                
                 try 
                 :  
                
                 while 
                 not 
                 coord.should_stop():  
                
                 e_val,l_val  
                 = 
                 sess.run([example_batch,label_batch])  
                
                 print 
                 e_val,l_val  
                
                 except 
                 tf.errors.OutOfRangeError:  
                
                 print 
                 ( 
                 'Epochs Complete!' 
                 )  
                
                 finally 
                 :  
                
                 coord.request_stop()  
                
                 coord.join(threads)  
                
                 coord.request_stop()  
                
                 coord.join(threads)

在迭代控制中，記得添加tf.initialize_local_variables()，官網教程沒有說明，但是如果不初始化，運行就會報錯。

對於傳統的機器學習而言，比方說分類問題，[x1 x2 x3]是feature。對於二分類問題,label經過one-hot編碼之后就會是[0,1]或者[1,0]。一般情況下，我們會考慮將數據組織在csv文件中,一行代表一個sample。然后使用隊列的方式去讀取數據

說明：對於該數據,前三列代表的是feature，因為是分類問題,后兩列就是經過one-hot編碼之后得到的label

使用隊列讀取該csv文件的代碼如下：

 
#-*- coding:utf-8 -*-
import tensorflow as tf
# Create a FIFO queue and a QueueRunner to build the filename queue
filenames = ['A.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Define the Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
# Define the Decoder
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(value, record_defaults=record_defaults)
# tf.stack is the TensorFlow 1.0+ name of tf.pack (use tf.pack on older versions)
features = tf.stack([col1, col2, col3])
label = tf.stack([col4, col5])
example_batch, label_batch = tf.train.shuffle_batch([features, label], batch_size=2, capacity=200, min_after_dequeue=100, num_threads=2)
# Run the Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()  # create a coordinator to manage the threads
    threads = tf.train.start_queue_runners(coord=coord)  # start the QueueRunners; the filenames are now being enqueued
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print e_val, l_val
    coord.request_stop()
    coord.join(threads)

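The output is as follows:

Notes:

record_defaults = [[1], [1], [1], [1], [1]]

This is the parsing template. Each sample has 5 columns, which are separated by ',' by default in the data. The type of each default value determines how that column is parsed: [1] means the column is parsed as an integer, [1.0] would parse it as a float, and ['null'] would parse it as a string.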
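
To make the role of the template concrete, here is a minimal pure-Python sketch (written for Python 3) that mimics the idea. The helper name `decode_csv_line` is hypothetical, and this is not how TensorFlow implements `tf.decode_csv`; it only shows how the type of each default could drive the parsing, and how an empty field could fall back to its default.

```python
import csv
import io


def decode_csv_line(line, record_defaults):
    """Parse one CSV line, taking each column's type from its default value.

    Hypothetical helper that only mimics the idea behind record_defaults.
    """
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d columns, got %d"
                         % (len(record_defaults), len(fields)))
    parsed = []
    for text, default in zip(fields, record_defaults):
        target_type = type(default[0])  # int, float or str
        # An empty field falls back to the default value of its column.
        parsed.append(default[0] if text == "" else target_type(text))
    return parsed


print(decode_csv_line("1,0,1,0,1", [[1], [1], [1], [1], [1]]))
print(decode_csv_line("1,,0.5,x,", [[1], [7], [1.0], ["null"], ["null"]]))
```

The first call returns five integers. In the second call the empty second and fifth fields are replaced by their defaults (7 and 'null').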

II. Several different next_batch implementations are given here. This article only annotates the code snippets, for future reference:

 
def next_batch(self, batch_size, fake_data=False):
    """Return the next `batch_size` examples from this data set."""
    if fake_data:
        fake_image = [1] * 784
        if self.one_hot:
            fake_label = [1] + [0] * 9
        else:
            fake_label = 0
        return [fake_image for _ in xrange(batch_size)], [
            fake_label for _ in xrange(batch_size)
        ]
    start = self._index_in_epoch
    self._index_in_epoch += batch_size
    if self._index_in_epoch > self._num_examples:  # has the index passed the total number of examples? If so, start a new pass
        # Finished epoch
        self._epochs_completed += 1
        # Shuffle the data
        perm = numpy.arange(self._num_examples)  # arange creates an evenly spaced array
        numpy.random.shuffle(perm)  # shuffle in place
        self._images = self._images[perm]
        self._labels = self._labels[perm]
        # Start next epoch
        start = 0
        self._index_in_epoch = batch_size
        assert batch_size <= self._num_examples
    end = self._index_in_epoch
    return self._images[start:end], self._labels[start:end]

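This code is taken from mnist.py. The explanation starts at the line start = self._index_in_epoch. _index_in_epoch - 1 is the index of the last image in the previous batch, so the current batch begins at index _index_in_epoch and its last image has index _index_in_epoch + batch_size - 1. If the updated _index_in_epoch is greater than the number of images in the data set, the remaining images cannot fill a whole batch. The current pass over the data then counts as finished, so the images are shuffled and a new epoch starts building batches from index 0.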
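
The index bookkeeping described above can be traced with a small sketch. `TinyDataSet` and `next_range` are hypothetical names that keep only the index logic of next_batch (no images, no shuffling), assuming 10 examples and a batch size of 4.

```python
class TinyDataSet(object):
    """Hypothetical stand-in that keeps only the index logic of next_batch."""

    def __init__(self, num_examples):
        self._num_examples = num_examples
        self._index_in_epoch = 0
        self._epochs_completed = 0

    def next_range(self, batch_size):
        """Return the (start, end) slice bounds of the next batch."""
        start = self._index_in_epoch
        self._index_in_epoch += batch_size
        if self._index_in_epoch > self._num_examples:
            # Not enough examples left for a full batch: close the epoch.
            # (The real method also shuffles the images and labels here.)
            self._epochs_completed += 1
            start = 0
            self._index_in_epoch = batch_size
            assert batch_size <= self._num_examples
        return start, self._index_in_epoch


ds = TinyDataSet(num_examples=10)
ranges = [ds.next_range(4) for _ in range(4)]
print(ranges)                # [(0, 4), (4, 8), (0, 4), (4, 8)]
print(ds._epochs_completed)  # 1
```

Note that examples 8 and 9 are never returned in the first epoch: when fewer than batch_size examples remain, they are skipped and the next epoch begins.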

 
import numpy as np
def ptb_iterator(raw_data, batch_size, num_steps):
    """Iterate on the raw PTB data.
    This generates batch_size pointers into the raw PTB data, and allows
    minibatch iteration along these pointers.
    Args:
      raw_data: one of the raw data outputs from ptb_raw_data.
      batch_size: int, the batch size.
      num_steps: int, the number of unrolls.
    Yields:
      Pairs of the batched data, each a matrix of shape [batch_size, num_steps].
      The second element of the tuple is the same data time-shifted to the
      right by one.
    Raises:
      ValueError: if batch_size or num_steps are too high.
    """
    raw_data = np.array(raw_data, dtype=np.int32)
    data_len = len(raw_data)
    batch_len = data_len // batch_size  # number of words in each of the batch_size rows
    data = np.zeros([batch_size, batch_len], dtype=np.int32)  # one row per batch element, batch_len words per row
    for i in range(batch_size):  # fill the batch_size rows one at a time
        data[i] = raw_data[batch_len * i:batch_len * (i + 1)]
    epoch_size = (batch_len - 1) // num_steps  # number of (x, y) pairs in one epoch; // is integer division
    #epoch_size = ((len(data) // model.batch_size) - 1) // model.num_steps
    if epoch_size == 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i*num_steps:(i+1)*num_steps]
        y = data[:, i*num_steps+1:(i+1)*num_steps+1]
        yield (x, y)
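
To see what the yielded pairs look like, the sketch below runs a condensed restatement of the iterator on the integers 0 to 19 instead of real PTB word ids. `ptb_pairs` is a hypothetical name, and it uses `reshape` in place of the row-by-row loop, which should give the same matrix under these assumptions.

```python
import numpy as np


def ptb_pairs(raw_data, batch_size, num_steps):
    """Condensed restatement of ptb_iterator, for illustration only."""
    raw_data = np.array(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Cut the stream into batch_size contiguous rows; leftover words are dropped.
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size == 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y


pairs = list(ptb_pairs(range(20), batch_size=2, num_steps=3))
for x, y in pairs:
    print(x.tolist(), y.tolist())
```

With 20 words and batch_size=2, each row holds 10 words, so epoch_size is (10 - 1) // 3 = 3. The first pair is x = [[0, 1, 2], [10, 11, 12]] and y = [[1, 2, 3], [11, 12, 13]]: each y is its x shifted by one position, which is the next-word target for a language model.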

The third approach:

 
def next(self, batch_size):
    """ Return a batch of data. When dataset end is reached, start over.
    """
    if self.batch_id == len(self.data):
        self.batch_id = 0
    batch_data = (self.data[self.batch_id:min(self.batch_id +
                                              batch_size, len(self.data))])
    batch_labels = (self.labels[self.batch_id:min(self.batch_id +
                                              batch_size, len(self.data))])
    batch_seqlen = (self.seqlen[self.batch_id:min(self.batch_id +
                                              batch_size, len(self.data))])
    self.batch_id = min(self.batch_id + batch_size, len(self.data))
    return batch_data, batch_labels, batch_seqlen
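
The wrap-around behaviour of this method can be checked with a toy container. `ToySequenceData` and its sample values are hypothetical; only the `next` logic follows the snippet above.

```python
class ToySequenceData(object):
    """Hypothetical container with the same fields as the snippet above."""

    def __init__(self, data, labels, seqlen):
        self.data = data
        self.labels = labels
        self.seqlen = seqlen
        self.batch_id = 0

    def next(self, batch_size):
        # Wrap around once the previous call consumed the last example.
        if self.batch_id == len(self.data):
            self.batch_id = 0
        end = min(self.batch_id + batch_size, len(self.data))
        batch = (self.data[self.batch_id:end],
                 self.labels[self.batch_id:end],
                 self.seqlen[self.batch_id:end])
        self.batch_id = end
        return batch


toy = ToySequenceData(data=["a", "b", "c", "d", "e"],
                      labels=[0, 1, 0, 1, 0],
                      seqlen=[1, 1, 1, 1, 1])
sizes = [len(toy.next(2)[0]) for _ in range(4)]
print(sizes)  # [2, 2, 1, 2]
```

Unlike next_batch, the last batch of a pass may be smaller than batch_size (here a single example), and no shuffling happens when the data starts over.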
                

The fourth approach:

 
import numpy as np
def batch_iter(sourceData, batch_size, num_epochs, shuffle=True):
    data = np.array(sourceData)  # store sourceData as an array
    data_size = len(data)
    # Round up so the final partial batch is kept; using (data_size - 1) avoids an empty batch when data_size is a multiple of batch_size
    num_batches_per_epoch = (data_size - 1) // batch_size + 1
    for epoch in range(num_epochs):
        # Shuffle the data at each epoch
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data = data[shuffle_indices]  # index the array, so a plain list input also works
        else:
            shuffled_data = data
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]

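This function is a generator (it uses yield). For how to consume it, see Python's documentation on iterators and generators.

Also note how the approaches differ in the number of passes they make over the corpus. ptb_iterator yields exactly one pass per call, while batch_iter yields num_epochs passes. The two class methods (next_batch and next) return one batch per call and start over automatically when the data runs out, so the number of passes depends on how many times the caller invokes them.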
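
As a small illustration of how such a generator is consumed, here is a sketch with a list-only, non-shuffling variant. `batch_iter_list` is a hypothetical name, and yielding the epoch number alongside each batch is an addition made only to make the passes visible.

```python
def batch_iter_list(items, batch_size, num_epochs):
    """Hypothetical list-only variant of batch_iter without shuffling."""
    num_batches_per_epoch = (len(items) - 1) // batch_size + 1
    for epoch in range(num_epochs):
        for batch_num in range(num_batches_per_epoch):
            start = batch_num * batch_size
            yield epoch, items[start:start + batch_size]


batches = batch_iter_list(["a", "b", "c", "d", "e"], batch_size=2, num_epochs=2)
# Nothing has run yet: a generator body only executes when it is iterated.
first = next(batches)
print(first)      # (0, ['a', 'b'])
rest = list(batches)
print(len(rest))  # 5
print(rest[-1])   # (1, ['e'])
```

Five items with batch_size=2 give three batches per epoch, so two epochs produce six batches in total. In a training loop the generator would normally be consumed with a plain for loop.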
