Common parameter settings for the mpirun command when running MPI code

Reposted article · 2021-08-12 13:45 · Tags: MPI, high-performance parallel computing

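After writing MPI code, we launch it with an mpirun command, and adding options to that command gives finer control over how the program runs. This post covers the options I use most often.

Hardware: two computers running Ubuntu 18.04, referred to below as host A and host B.

Host A: 24 physical cores (48 logical cores)

Host B: 6 physical cores (12 logical cores)

Network:

A and B are connected through a 100 Mbps Ethernet switch (in practice, a TP-Link router).

Host A's IP address is 192.168.11.66 and host B's is 192.168.11.206.

The code used throughout this article, x.py: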

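from mpi4py import MPI
import numpy as np
import time


comm = MPI.COMM_WORLD
size = comm.Get_size()  # total number of MPI processes
rank = comm.Get_rank()  # rank of this process


# Each rank contributes 1,000,000 C ints ('i'), all equal to its rank
sendbuf = np.zeros(100*10000, dtype='i') + rank
recvbuf = None
if rank == 0:
    # Only the root needs a receive buffer: one row per rank
    recvbuf = np.empty([size, 100*10000], dtype='i')


# Print the name of the host this process runs on
print( MPI.Get_processor_name() )


# Time the Gather of every rank's buffer onto rank 0
a = time.time()
for _ in range(1):
    comm.Gather(sendbuf, recvbuf, root=0)
b = time.time()
if rank == 0:
    print(b-a)  # elapsed seconds, as measured on rank 0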


Note also that every command in this article is executed on host A, so the myhosts file is always written on host A.
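Because hosts A and B share a single 100 Mbps link, the Gather in x.py can become network-bound as soon as some ranks run on host B. The sketch below computes a rough lower bound on the transfer time, assuming rank 0 runs on host A, that dtype='i' is a 4-byte int, and ignoring latency and protocol overhead; the constant and function names are hypothetical:

```python
# Back-of-the-envelope lower bound for the Gather in x.py (assumptions above)
ELEMENTS_PER_RANK = 100 * 10000   # size of sendbuf in x.py
BYTES_PER_ELEMENT = 4             # dtype='i' is a 4-byte C int on typical platforms
LINK_MBIT_PER_S = 100             # 100 Mbps Ethernet between host A and host B

bytes_per_rank = ELEMENTS_PER_RANK * BYTES_PER_ELEMENT
link_bytes_per_s = LINK_MBIT_PER_S * 1_000_000 / 8


def min_transfer_seconds(remote_ranks):
    """Every rank on host B must push its whole buffer over the one link to rank 0."""
    return remote_ranks * bytes_per_rank / link_bytes_per_s


print(bytes_per_rank)            # 4000000
print(min_transfer_seconds(1))   # 0.32
print(min_transfer_seconds(6))   # 1.92
```

So one rank on host B already adds roughly a third of a second, which is worth keeping in mind when comparing the timings printed by rank 0.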

====================================================

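1. Parameter --machinefile

This parameter is mainly useful in a distributed setting; on a single machine it has no effect. It lists the hosts that take part in the run and can also set the maximum number of CPUs (processes) each host may use. In Open MPI, --machinefile is a synonym for --hostfile.

Command: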

mpirun -np 8  --machinefile myhosts   /home/xxx/anaconda3/bin/python x.py

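Here myhosts is a text file we write ourselves; it lists the IP address of each host in the MPI job and the maximum number of CPUs each one may use.

The most basic myhosts does not set a per-host maximum. In that case each host may run at most as many processes as it has physical CPU cores: 24 for host A (192.168.11.66) and 6 for host B (192.168.11.206).

The most basic myhosts: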

cat myhosts

192.168.11.66       
192.168.11.206

This myhosts lists the IP addresses of the two hosts; each host may use at most as many CPUs as it has physical cores.
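To make the default concrete, here is a minimal Python sketch of how a myhosts file maps to per-host slot counts as described above: a bare host line gets its physical core count, and a slots=N entry (covered in Section 2) overrides it. PHYSICAL_CORES and parse_hostfile are hypothetical helpers, not part of Open MPI, which detects core counts itself and understands more hostfile keys than this sketch reads:

```python
# Hypothetical table standing in for the core counts Open MPI detects itself
PHYSICAL_CORES = {"192.168.11.66": 24, "192.168.11.206": 6}


def parse_hostfile(text, physical_cores):
    """Return {host: slots} for hostfile text; keys other than slots are ignored."""
    slots = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        host, *options = line.split()
        n = physical_cores[host]  # default: one slot per physical core
        for option in options:
            key, _, value = option.partition("=")
            if key == "slots":
                n = int(value)
        slots[host] = n
    return slots


basic = """
192.168.11.66
192.168.11.206
"""
print(parse_hostfile(basic, PHYSICAL_CORES))
# {'192.168.11.66': 24, '192.168.11.206': 6}
```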

Run:

mpirun -np 8  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows all 8 processes running on host A (mpirun was started on host A, so host A's resources are used first).
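Because x.py prints MPI.Get_processor_name() once per process, placement can be checked by counting the printed host names. A minimal sketch, assuming the mpirun output has been saved as a list of lines; count_hosts and the host names in the sample are made up for illustration:

```python
from collections import Counter


def count_hosts(lines):
    """Count how many processes printed each processor name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            float(line)  # rank 0 also prints the Gather time; skip that line
        except ValueError:
            counts[line] += 1
    return counts


# Hypothetical output of a 25-process run; host names are invented
sample = ["host-a"] * 24 + ["host-b", "0.5321"]
print(count_hosts(sample))  # Counter({'host-a': 24, 'host-b': 1})
```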

Run:

mpirun -np 24  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows all 24 processes running on host A (again, host A's resources are used first because mpirun was started there).

Run:

mpirun -np 25  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows 25 processes in total: 24 on host A and 1 on host B (host A is filled first because mpirun was started there).

Host A can provide at most 24 CPUs, so one process has to run on host B.

Run:

mpirun -np 30  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows 30 processes in total: 24 on host A and 6 on host B (host A is filled first because mpirun was started there).

Host A can provide at most 24 CPUs, so the remaining 6 processes run on host B.
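The placement in these runs (fill host A's slots first, then move on to host B) can be modeled in a few lines. This is a sketch of the observed behavior under the assumption that the launching host is listed first, not a reimplementation of Open MPI's mapper; place_by_slot is a hypothetical name:

```python
from collections import Counter


def place_by_slot(hosts, slots, np_requested):
    """Give consecutive ranks to each host until its slots are full, then move on."""
    placement = []  # placement[r] is the host of rank r
    for host in hosts:
        remaining = np_requested - len(placement)
        if remaining == 0:
            break
        placement.extend([host] * min(slots[host], remaining))
    if len(placement) < np_requested:
        raise RuntimeError(
            f"not enough slots available to satisfy the {np_requested} slots requested")
    return placement


# Host A (where mpirun was started) first, as observed in the runs above
HOSTS = ["192.168.11.66", "192.168.11.206"]
SLOTS = {"192.168.11.66": 24, "192.168.11.206": 6}

for n in (8, 24, 25, 30):
    print(n, dict(Counter(place_by_slot(HOSTS, SLOTS, n))))
```

Asking this model for 31 processes raises an error, matching the next run.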

Run:

mpirun -np 31  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The run fails with the following message:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 31
slots that were requested by the application:

  /home/xxxxxx/anaconda3/bin/python

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

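The message says there are not enough slots. Host A provides at most 24 slots and host B at most 6, so this setup can run at most 24 + 6 = 30 processes; requesting more produces this error.

2. Parameter slots

A more advanced myhosts sets the maximum number of CPUs each host may use. For compute-bound work this number is usually set below the host's physical core count; setting it equal to the core count just reproduces the default (values above it are covered in Section 5):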

cat myhosts

192.168.11.66       slots=4
192.168.11.206      slots=4

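This limits both host A and host B to at most 4 CPUs. The slots value after each host's IP address (or host name) can be chosen freely, but for compute-bound work it should normally not exceed that host's physical core count.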
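If hostfiles are generated from scripts, writing the host/slots lines is straightforward. A small sketch; write_hostfile is a hypothetical helper, and the temporary directory is only for the demo:

```python
import os
import tempfile


def write_hostfile(path, slots_by_host):
    """Write one line per host; a value of None omits slots= (default limit)."""
    with open(path, "w") as f:
        for host, slots in slots_by_host.items():
            if slots is None:
                f.write(f"{host}\n")
            else:
                f.write(f"{host} slots={slots}\n")


path = os.path.join(tempfile.mkdtemp(), "myhosts")
write_hostfile(path, {"192.168.11.66": 4, "192.168.11.206": 4})
with open(path) as f:
    print(f.read(), end="")
# 192.168.11.66 slots=4
# 192.168.11.206 slots=4
```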

Run:

mpirun -np 4  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows all 4 processes running on host A.

Run:

mpirun -np 6  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows 4 processes on host A and 2 on host B.

Run:

mpirun -np 8  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows 4 processes on host A and 4 on host B.

Run:

mpirun -np 9  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The run fails with:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9
slots that were requested by the application:

  /home/xxxxxx/anaconda3/bin/python

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------

(The rest of the message is identical to the one shown in Section 1.)


The message says there are not enough slots: we limited host A to 4 CPUs and host B to 4 CPUs, so at most 8 processes can run; requesting more produces this error.

=================================

3. The -np parameter

What happens if we run without the -np parameter?

With myhosts containing:

192.168.11.66
192.168.11.206

Command:

mpirun --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

Result: host A runs 24 processes and host B runs 6. In other words, without -np every host runs one process per physical core.

If myhosts instead contains:

192.168.11.66       slots=4
192.168.11.206      slots=4

Command:

mpirun --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

Result: host A runs 4 processes and host B runs 4. As before, without -np every host runs as many processes as it is allowed, here 4 each, because the slots entries in myhosts cap host A and host B at 4 CPUs.
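Both runs follow one rule: without -np, the process count is the sum of the per-host slot limits, and each host runs exactly its limit. A tiny sketch of that observed rule (processes_without_np is a hypothetical name):

```python
def processes_without_np(slots_by_host):
    """Observed above: without -np, mpirun starts one process per slot."""
    per_host = dict(slots_by_host)  # each host runs exactly its slot count
    return sum(per_host.values()), per_host


# myhosts without slots= (defaults = physical cores), then with slots=4
for slots in ({"192.168.11.66": 24, "192.168.11.206": 6},
              {"192.168.11.66": 4, "192.168.11.206": 4}):
    print(processes_without_np(slots))
```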

============================================================

4. Parameter -nolocal

Adding -nolocal to the mpirun command tells it not to run any processes on the host where the command is issued. For example:

Suppose myhosts contains:

cat myhosts

192.168.11.66       slots=4
192.168.11.206      slots=4

This myhosts allows at most 4 CPUs on each of hosts A and B.

Run on host A (192.168.11.66):

mpirun  -nolocal  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

Error:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:

  /home/xxxxxx/anaconda3/bin/python

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------

(The rest of the message is identical to the one shown in Section 1.)

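In other words, -nolocal forbids the local host A from taking part, while myhosts allows host A to take part, and the two conflict. Without -np, mpirun tries to use every slot listed in myhosts (8 here), but with host A excluded by -nolocal only host B's 4 slots remain, so the request for 8 cannot be met and the run fails.

Let us adjust the command:

mpirun -np 6  -nolocal  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

It still fails: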

There are not enough slots available in the system to satisfy the 6
slots that were requested by the application:

The reason: with host A excluded, -np asks for 6 processes, but myhosts limits host B (192.168.11.206) to 4, so the request for 6 cannot be met.

Adjust the command again:

mpirun -np 4  -nolocal  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

This runs successfully. All four processes run on host B (192.168.11.206), which satisfies -np 4, -nolocal, and the limits in myhosts.

Similarly, change the command to:

mpirun -np 2  -nolocal  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

This also runs successfully. Both processes run on host B (192.168.11.206), satisfying -np 2, -nolocal, and the limits in myhosts.
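The four -nolocal runs can be reproduced with a small model: without -np the request equals all slots in myhosts (including host A's, as the first error shows), while -nolocal removes host A's slots from what is available. This sketch only mirrors the observed behavior; place_with_nolocal is a hypothetical helper:

```python
def place_with_nolocal(slots_by_host, local_host, np_requested=None, nolocal=False):
    """Return {host: process count} for the runs above, or raise like mpirun did."""
    if np_requested is None:
        # Observed: without -np the request is every slot in myhosts,
        # including the local host's slots
        np_requested = sum(slots_by_host.values())
    # -nolocal removes the launching host from the usable hosts
    usable = {h: n for h, n in slots_by_host.items()
              if not (nolocal and h == local_host)}
    if np_requested > sum(usable.values()):
        raise RuntimeError(f"not enough slots to satisfy the {np_requested} requested")
    placement, remaining = {}, np_requested
    for host, n in usable.items():
        take = min(n, remaining)
        if take:
            placement[host] = take
            remaining -= take
    return placement


SLOTS = {"192.168.11.66": 4, "192.168.11.206": 4}
LOCAL = "192.168.11.66"  # mpirun is started on host A

for np_requested in (None, 6, 4, 2):
    try:
        print(np_requested, place_with_nolocal(SLOTS, LOCAL, np_requested, nolocal=True))
    except RuntimeError as err:
        print(np_requested, "error:", err)
```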

===========================================================

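5. Parameters --use-hwthread-cpus and --oversubscribe

As noted earlier, the two hosts have:

Host A: 24 physical cores (48 logical cores)

Host B: 6 physical cores (12 logical cores)

-np sets the total number of CPUs (processes) for a run, -nolocal excludes the current host from the computation, and myhosts sets the maximum number of CPUs each host may contribute.

As mentioned above, the per-host maximum in myhosts is a value we choose, and the usual requirement is that it be below the host's physical core count. If a slots value equals the physical core count it is redundant, because without slots each host is already limited to its physical core count.

Does that mean slots can never exceed the physical core count? Not at all. The usual rule exists because, when a machine is dedicated to compute-bound work, running one process per physical core usually gives the best utilization.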

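A less well-known point: according to Intel's white papers, Hyper-Threading can improve compute-bound performance by up to about 30%. That is, on a single machine with Hyper-Threading, running one process per logical core can beat one process per physical core by up to about 30%. This only holds for short runs, however. Over long runs with one process per logical core, compute-bound work often drives the CPU into its power or thermal limit, causing heavy throttling and a large drop in performance (with ordinary cooling). That is why we normally run one process per physical core for compute-bound work.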
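Taking the up-to-30% figure at face value (best case, before any throttling), a quick calculation shows what it means per process. Let $T_n$ denote host A's total throughput with $n$ compute-bound processes:

```latex
\frac{T_{48}}{T_{24}} \le 1.3
\quad\Longrightarrow\quad
\frac{T_{48}/48}{T_{24}/24}
  = \frac{24}{48}\cdot\frac{T_{48}}{T_{24}}
  \le \frac{1.3}{2} = 0.65
```

So even in the best case, each of the 48 processes runs at about 65% of the speed it would have with a physical core to itself; the gain is only in total throughput.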

So, if performance is not a concern, setting slots above the host's physical core count in myhosts works perfectly well.

Here is the myhosts for this case:

cat myhosts

192.168.11.66       slots=100
192.168.11.206      slots=100

Command:

 mpirun -np 200  --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

This runs successfully: 200 processes in total, 100 on host A (192.168.11.66) and 100 on host B (192.168.11.206).

So the slots setting in myhosts can indeed be used to run more processes than there are physical cores.

That covered exceeding the physical core count through the slots in a myhosts file passed with --machinefile. What if we do not use --machinefile at all?

Run:

mpirun -np 8  --host 192.168.11.66:4 --host 192.168.11.206:4   /home/xxxxxx/anaconda3/bin/python x.py

This runs 4 processes on host A (192.168.11.66) and 4 on host B (192.168.11.206), 8 in total.

Run:

mpirun -np 6  --host 192.168.11.206:6   /home/xxxxxx/anaconda3/bin/python x.py

This runs 6 processes on host B (192.168.11.206).

Run:

mpirun -np 200  --host 192.168.11.66:100 --host 192.168.11.206:100   /home/xxxxxx/anaconda3/bin/python x.py

This runs 100 processes on host A (192.168.11.66) and 100 on host B (192.168.11.206), 200 in total.

Run:

mpirun -np 200 --host 192.168.11.206:200   /home/xxxxxx/anaconda3/bin/python x.py

This runs 200 processes on host B (192.168.11.206).

All of the runs above are in a distributed setting (here meaning that hosts are specified with --host).
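The --host values above use the host:N form described in the Open MPI message from Section 1, where N defaults to 1 when omitted; Open MPI also accepts several hosts as a comma-separated list in one --host value. A sketch that turns such values into per-host slot counts, assuming IPv4 addresses or plain host names and that each host appears only once (parse_host_args is hypothetical):

```python
def parse_host_args(host_args):
    """Turn --host values such as '192.168.11.66:4' into {host: slots}."""
    slots = {}
    for arg in host_args:
        for entry in arg.split(","):  # one --host value may list several hosts
            host, sep, n = entry.partition(":")
            # Per the Open MPI message quoted earlier, N defaults to 1 without ':N'
            slots[host] = int(n) if sep else 1
    return slots


print(parse_host_args(["192.168.11.66:4", "192.168.11.206:4"]))
# {'192.168.11.66': 4, '192.168.11.206': 4}
print(parse_host_args(["192.168.11.66:100,192.168.11.206:100"]))
# {'192.168.11.66': 100, '192.168.11.206': 100}
print(parse_host_args(["192.168.11.206"]))
# {'192.168.11.206': 1}
```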

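Without --host, how can we run more processes than physical cores on a single machine?

For example, run:

mpirun -np 48   /home/xxxxxx/anaconda3/bin/python x.py

This asks for 48 processes on host A (192.168.11.66), but host A has only 24 physical cores, so it fails: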

There are not enough slots available in the system to satisfy the 48
slots that were requested by the application:

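Now add the --use-hwthread-cpus option.

--use-hwthread-cpus makes the maximum number of processes on the current host equal to its logical core count instead of its physical core count.

mpirun -np 48 --use-hwthread-cpus  /home/xxxxxx/anaconda3/bin/python x.py

This runs 48 processes on host A (192.168.11.66), the host the command was issued on, which has 48 logical cores.

Change the command to:

mpirun -np 49 --use-hwthread-cpus  /home/xxxxxx/anaconda3/bin/python x.py

This fails: --use-hwthread-cpus raises the limit only to the logical core count, so anything above 48 (host A's logical core count) produces the error.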
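The single-host limit in these runs comes down to one slot per physical core by default and one per logical core with --use-hwthread-cpus. A sketch of that check for host A (24 physical, 48 logical cores); check_local_run is a hypothetical helper that mirrors the observed results, not Open MPI code:

```python
def check_local_run(np_requested, physical_cores, logical_cores,
                    use_hwthread_cpus=False):
    """Model of the single-host runs above (no hostfile, no --host)."""
    # Default: one slot per physical core; --use-hwthread-cpus: one per logical core
    limit = logical_cores if use_hwthread_cpus else physical_cores
    if np_requested > limit:
        raise RuntimeError(
            f"not enough slots: requested {np_requested}, available {limit}")
    return limit


# Host A: 24 physical cores, 48 logical cores
for np_requested, hw in ((48, False), (48, True), (49, True)):
    try:
        check_local_run(np_requested, 24, 48, use_hwthread_cpus=hw)
        print(np_requested, hw, "ok")
    except RuntimeError as err:
        print(np_requested, hw, "error:", err)
```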

Now use --oversubscribe instead.

--oversubscribe tells mpirun to ignore the slot limit, so the number of processes can be set to any value.

Run:

mpirun -np 200  --oversubscribe  /home/xxxxxx/anaconda3/bin/python x.py

This runs 200 processes on host A (192.168.11.66).
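When these runs are driven from a script, it helps to assemble the mpirun arguments in one place. The sketch below only builds the argument list from the options used in this article and does not launch anything; build_mpirun_cmd is a hypothetical helper, and the program path is the one used above:

```python
import shlex


def build_mpirun_cmd(program, np=None, machinefile=None, hosts=None,
                     nolocal=False, use_hwthread_cpus=False,
                     oversubscribe=False):
    """Assemble an mpirun argument list from the options discussed above."""
    cmd = ["mpirun"]
    if np is not None:
        cmd += ["-np", str(np)]
    if machinefile:
        cmd += ["--machinefile", machinefile]
    for host in hosts or []:
        cmd += ["--host", host]  # e.g. "192.168.11.66:4"
    if nolocal:
        cmd.append("-nolocal")
    if use_hwthread_cpus:
        cmd.append("--use-hwthread-cpus")
    if oversubscribe:
        cmd.append("--oversubscribe")
    return cmd + list(program)


PROGRAM = ["/home/xxxxxx/anaconda3/bin/python", "x.py"]
cmd = build_mpirun_cmd(PROGRAM, np=200, oversubscribe=True)
print(shlex.join(cmd))
# mpirun -np 200 --oversubscribe /home/xxxxxx/anaconda3/bin/python x.py
```

Passing the list to subprocess.run(cmd, check=True) would start the job; using a list rather than a shell string avoids quoting problems.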

=================================================================

Addendum:

On which host does the rank 0 process run?
(Rank 0 is the process whose rank is 0 once the MPI program starts.)

On host A (192.168.11.66):

myhosts contains:

192.168.11.66       slots=4
192.168.11.206      slots=4

Run:

mpirun -np 8   --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows that rank 0 runs on host A (192.168.11.66).

Similarly:

On host B (192.168.11.206):

myhosts contains:

192.168.11.66       slots=4
192.168.11.206      slots=4

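Run:

mpirun -np 8   --machinefile myhosts   /home/xxxxxx/anaconda3/bin/python x.py

The output shows that rank 0 runs on host B (192.168.11.206).

From these runs, rank 0 generally runs on the host that launches the MPI program, provided that host also runs processes. The exception is -nolocal: the launching host then takes no part in the computation, so rank 0 is not on it.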
