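After deployment, the code ran correctly and the processes were indeed spread across the cluster. While running a variety of programs, I hit an error: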
$ ~/OpenMpi/bin/mpiexec -np 10 ~/NetWorkTest
My rank is 2
My rank is 7
My rank is 0
My rank is 3
My rank is 6
My rank is 8
My rank is 4
My rank is 1
My rank is 5
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18656,1],2]
  Exit code: 14
--------------------------------------------------------------------------
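What caused this? It turned out that I had accidentally commented out the MPI_Finalize() call. That means one process returned with an error (a non-zero exit code) before the others, and the master process (mpiexec, according to the log) caught it.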
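This reflects a fact: if one process in the cluster job dies, the whole set of processes goes down with it. In the log, mpiexec aborts the entire job as soon as one rank exits with a non-zero status.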
After putting the MPI_Finalize() call back, the error went away.