Running Open MPI on a Cluster


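Once deployment was done, the code ran correctly and the processes were indeed spread across the cluster. While running a variety of programs, I hit this error: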

$ ~/OpenMpi/bin/mpiexec  -np 10  ~/NetWorkTest
My rank is 2
My rank is 7
My rank is 0
My rank is 3
My rank is 6
My rank is 8
My rank is 4
My rank is 1
My rank is 5
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18656,1],2]
  Exit code:    14
--------------------------------------------------------------------------

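What caused this? It turned out I had accidentally commented out the MPI_Finalize() call. So one process returned with an error before the others, and the master process caught it.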

This shows one thing: if a single process in the job dies, the whole set of processes is brought down with it.
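To make that behavior concrete, here is a small simulation in plain Python. It is not how Open MPI is implemented and does not use MPI at all. The `WORKER` script, the `launch` function, and the choice of rank 2 exiting with status 14 are assumptions made up for illustration, mirroring the log above. The launcher starts several workers and terminates the rest as soon as one exits with a non-zero status.

```python
import subprocess
import sys
import time

# Hypothetical worker: prints its rank, then rank 2 exits with status 14
# while every other rank keeps "working" (sleeping) for a while.
WORKER = (
    "import sys, time\n"
    "rank = int(sys.argv[1])\n"
    "print(f'My rank is {rank}', flush=True)\n"
    "if rank == 2:\n"
    "    sys.exit(14)\n"
    "time.sleep(30)\n"
)


def launch(num_procs):
    """Start num_procs workers; abort them all once one exits non-zero.

    Returns (rank, exit_code) of the first failing worker, or None if
    every worker exited with status 0.
    """
    procs = {
        rank: subprocess.Popen([sys.executable, "-c", WORKER, str(rank)])
        for rank in range(num_procs)
    }
    first_failure = None
    while procs and first_failure is None:
        for rank, proc in list(procs.items()):
            code = proc.poll()
            if code is None:
                continue          # still running
            del procs[rank]       # finished; stop tracking it
            if code != 0:
                first_failure = (rank, code)
                break
        time.sleep(0.05)
    # Tear down whatever is still running once a failure was seen.
    for proc in procs.values():
        proc.terminate()
    for proc in procs.values():
        proc.wait()
    return first_failure


if __name__ == "__main__":
    failure = launch(4)
    if failure is not None:
        rank, code = failure
        print(f"rank {rank} exited with code {code}; job aborted")
```

The workers that would otherwise sleep for 30 seconds are stopped almost immediately, which is the same "one failure ends the job" effect seen in the mpiexec output.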

After putting the MPI_Finalize() call back, the error went away.

 

