PyTorch bug: for step, data in enumerate(loader) + Connection reset by peer


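I was running a single-GPU program inside Docker. After a few hundred iterations, it suddenly crashed.

The program stopped at for step, data in enumerate(loader). Part of the error output is shown below: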

Traceback (most recent call last):
  ........
  File ".../torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File ".../torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File ".../torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256)  # reject large message
IOError: [Errno 104] Connection reset by peer

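At first I thought the problem was in enumerate, that some dirty data had turned up. On reflection that could not be it, since a full epoch had already been iterated.

I searched for the error, Connection reset by peer, and found https://github.com/pytorch/pytorch/issues/9127.

According to that thread, earlier versions had a bug, and torch/_six.py and torch/utils/data/dataloader.py need to be replaced with the files from a newer version.

That is a lot of work, and following this lead took me completely off track, so I gave up on it.

Next I looked into the DataLoader parameters. One suggestion was to reduce batch_size, so I set it to 1.

I also set num_workers to 1, but the error persisted.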

The DataLoader signature is as follows:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
num_workers=0, collate_fn=default_collate, pin_memory=False,
drop_last=False)

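1.  dataset: the dataset to load from
2.  batch_size: the number of samples per batch
3.  shuffle: whether to shuffle the data
4.  sampler: the strategy for drawing samples from the dataset
5.  num_workers: the number of worker processes used for loading; 0 means multiprocessing is not used
6.  collate_fn: how to merge multiple samples into one batch; the default is usually fine
7.  pin_memory: whether to place the data in pinned memory; data in pinned memory transfers to the GPU somewhat faster
8.  drop_last: the number of samples in the dataset may not be a multiple of batch_size; with drop_last set to True, the final incomplete batch is dropped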
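
To make the interaction between batch_size and drop_last concrete, here is a small pure-Python sketch of how many batches one pass over the data yields. The helper name num_batches is made up for illustration and is not part of PyTorch; it assumes a dataset of known length iterated with the default sampler.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Count the batches produced by one pass over num_samples items.

    Hypothetical helper for illustration only, not a PyTorch function.
    """
    if drop_last:
        # The trailing incomplete batch is discarded.
        return num_samples // batch_size
    # The trailing incomplete batch is kept as a smaller final batch.
    return math.ceil(num_samples / batch_size)


kept = num_batches(10, 4)                     # batches of 4, 4 and 2
dropped = num_batches(10, 4, drop_last=True)  # batches of 4 and 4
print(kept, dropped)
```

With batch_size=1, as tried above, every sample becomes its own batch, so drop_last has no effect.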

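So I changed num_workers back to its default value of 0, which means data loading no longer uses multiple processes, and the program ran. I was thrilled and grateful beyond words.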
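
The workaround can be sketched roughly as follows. This is only a minimal illustration, not the original training script: the toy TensorDataset, its shapes, and batch_size=4 are assumptions standing in for the real data. With num_workers=0 the batches are built in the main process, so the multiprocessing queue that appears in the traceback is not used.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset standing in for the real one:
# 10 samples with 3 features each, plus one integer label per sample.
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# num_workers=0 (the default) loads every batch in the main process,
# so no worker processes or inter-process queues are involved.
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batch_sizes = []
for step, (x, y) in enumerate(loader):
    # A real training step would go here.
    batch_sizes.append(x.shape[0])

print(batch_sizes)
```

The trade-off is that loading is no longer overlapped with computation, so each step may be slower when reading or preprocessing the data is expensive.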

