A single-GPU program running inside Docker crashed suddenly after a few hundred iterations.
It stopped at for step, data in enumerate(loader). Part of the error output is below:
Traceback (most recent call last):
  ........
  File ".../torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File ".../torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File ".../torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
IOError: [Errno 104] Connection reset by peer
My first thought was that enumerate had hit some dirty data, but on reflection that made no sense: the loop had already run through a whole epoch.
Searching for the error "Connection reset by peer" led me to https://github.com/pytorch/pytorch/issues/9127,
which says older versions had a bug and suggests replacing the old torch/_six.py
and torch/utils/data/dataloader.py with the versions from a newer release.
That is a lot of work, and following this lead took me completely off track, so I dropped it.
Next I looked into the DataLoader parameters. One suggestion was to reduce batch_size, so I dropped it to 1,
and I also set num_workers to 1, but the error persisted.
The DataLoader signature is as follows (a small usage sketch follows the parameter list):
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
num_workers=0, collate_fn=default_collate, pin_memory=False,
drop_last=False)
1. dataset: the dataset to load from
2. batch_size: number of samples per batch
3. shuffle: whether to shuffle the data each epoch
4. sampler: custom strategy for drawing samples
5. num_workers: number of worker processes used for loading; 0 means no multiprocessing
6. collate_fn: how to merge a list of samples into a batch; the default is usually fine
7. pin_memory: whether to place loaded data in pinned (page-locked) memory, which makes transfers to the GPU faster
8. drop_last: the dataset size may not be an exact multiple of batch_size; setting this to True drops the final incomplete batch
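To make the parameters concrete, here is a minimal sketch of a typical DataLoader construction. MyDataset is a hypothetical stand-in, not the code from the failing program:

    import torch
    from torch.utils.data import Dataset, DataLoader

    class MyDataset(Dataset):
        # Toy dataset: 1000 random "images" with integer labels.
        def __len__(self):
            return 1000
        def __getitem__(self, idx):
            return torch.randn(3, 32, 32), idx % 10

    loader = DataLoader(MyDataset(),
                        batch_size=32,     # samples per batch
                        shuffle=True,      # reshuffle every epoch
                        num_workers=4,     # >0 spawns worker processes for loading
                        pin_memory=True,   # page-locked memory for faster copies to GPU
                        drop_last=True)    # drop the final incomplete batch

    for step, (images, labels) in enumerate(loader):
        pass  # training step goes here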
So I set num_workers back to its default value of 0, i.e. no multi-process loading, which sidesteps the worker-to-main-process handoff that was being reset in the traceback above, and the program ran. I was overjoyed, grateful beyond words.
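For the record, the workaround amounts to nothing more than the following (the dataset and loop names are illustrative):

    # Workaround: num_workers=0 loads batches in the main process,
    # so there is no worker-to-main-process connection left to be reset.
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

    for step, data in enumerate(loader):
        ...  # training step

The obvious cost is that data loading now happens serially in the main process, so throughput may drop.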