Deploying PBS-Torque on a Virtualized Cluster


1. Overview

This post walks through deploying, configuring, and using the PBS scheduling system on a CentOS 7 cluster.

CentOS 7 release: CentOS Linux release 7.9.2009 (Core)

Torque release: torque-6.1.3.tar.gz

Cluster information:

Three virtual nodes are used to deploy the PBS system:

Node name   Node IP          Role                           Services
node16      192.168.80.16    management node, login node    pbs_server, pbs_sched
node17      192.168.80.17    compute node                   pbs_mom
node18      192.168.80.18    compute node                   pbs_mom

This post covers only the basic deployment, configuration, and use of the Torque scheduler. More advanced topics, such as MPI environments, GUI display, GPU scheduling, Munge authentication, pulling accounting data from a database, and high-availability configuration, are not explored in detail here and may be filled in later.

2. Deployment

A cluster generally needs synchronized time and cluster-wide identity; those two prerequisites are not covered in this post.

node16 through node18 here already share identity via LDAP and SSSD.

By convention, a software installation directory on node16 is used as a shared directory, exported to node17 and node18.

2.1 Create and mount the shared directory

On node16, run mkdir -p /hpc/torque/6.1.3; this directory will hold the Torque installation and is shared with the other nodes.

Run mkdir -p /hpc/packages/ to create the working directory for building Torque.

Edit /etc/exports with the following content:

/hpc 192.168.80.0/24(rw,no_root_squash,no_all_squash)
/home 192.168.80.0/24(rw,no_root_squash,no_all_squash)

Run systemctl start nfs && systemctl enable nfs to start NFS now and enable it at boot.

Run exportfs -r to re-export the shares so they take effect immediately.
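
To confirm that the exports are visible from a client, a quick check with showmount (run from any node) should list both directories:

showmount -e 192.168.80.16
# Export list for 192.168.80.16:
# /home 192.168.80.0/24
# /hpc  192.168.80.0/24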

On node17 and node18, run:

mkdir -p /hpc
mount.nfs 192.168.80.16:/hpc /hpc
mount.nfs 192.168.80.16:/home /home
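
These mount.nfs commands do not survive a reboot. One way to make the mounts persistent, sketched here as an assumption rather than part of the original setup, is to add fstab entries on node17 and node18:

# /etc/fstab additions on node17 and node18
192.168.80.16:/hpc   /hpc   nfs  defaults,_netdev  0 0
192.168.80.16:/home  /home  nfs  defaults,_netdev  0 0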

2.2 Deploy the Torque software

Download:

wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz

Extract:

tar -zxvf torque-6.1.3.tar.gz -C /hpc/packages extracts the source into the build directory.

Configure the build:

# 1. Install the build dependencies with yum
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
# yum groupinstall "GNOME Desktop" "Graphical Administrator Tools"  # only needed when building with --enable-gui
# 2. Pass the build options to configure
./configure \
	--prefix=/hpc/torque/6.1.3 \
	--mandir=/hpc/torque/6.1.3/man \
	--enable-cgroups \
	--enable-syslog \
	--enable-drmaa \
	--enable-gui \
	--with-xauth \
	--with-hwloc \
	--with-pam \
	--with-tcl \
	--with-tk
	# --enable-numa-support  # enabling this requires editing mom.layout; the rules are unclear, so it is left out for now
# 3. To be updated: support for MPI, GPU, Munge authentication, high availability, etc. may be added in a follow-up

When configure finishes, it prints a summary:

Building components: server=yes mom=yes clients=yes
                     gui=yes drmaa=yes pam=yes
PBS Machine type    : linux
Remote copy         : /usr/bin/scp -rpB
PBS home            : /var/spool/torque
Default server      : node16

Unix Domain sockets : 
Linux cpusets       : no
Tcl                 : -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Tk                  : -L/usr/lib64 -ltk8.5 -lX11 -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Authentication      : trqauthd

Ready for 'make'.

Build and install:

# 1. Build and install
make -j4 && make install
# 2. Optionally generate self-extracting install packages
# make packages

Running make packages is optional: since the build tree sits on the shared filesystem, it is enough to run make install on each node. For reference, its output looks like this:

[root@node16 torque-6.1.3]# make packages
Building packages from /hpc/packages/torque-6.1.3/tpackages
rm -rf /hpc/packages/torque-6.1.3/tpackages
mkdir /hpc/packages/torque-6.1.3/tpackages
Building ./torque-package-server-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-mom-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-clients-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-pam-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /lib64/security'
Building ./torque-package-drmaa-linux-x86_64.sh ...
libtool: install: warning: relinking `libdrmaa.la'
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-devel-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-doc-linux-x86_64.sh ...
Done.

The package files are self-extracting packages that can be copied
and executed on your production machines.  Use --help for options.
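
If you do distribute the packages instead of relying on the shared filesystem, each script is self-extracting and accepts an --install flag (see its --help); a sketch of installing the MOM and client tools on a compute node:

./torque-package-mom-linux-x86_64.sh --install
./torque-package-clients-linux-x86_64.sh --install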

Run libtool --finish /hpc/torque/6.1.3/lib:

This step can also be skipped; make install already performs it by default.

[root@node16 torque-6.1.3]# libtool --finish /hpc/torque/6.1.3/lib
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /hpc/torque/6.1.3/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /hpc/torque/6.1.3/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

Next, set up the environment variables and the systemd unit files.

Running ls -lrt /usr/lib/systemd/system shows that the following unit files are already in place:

-rw-r--r--  1 root root 1284 Oct 12 11:17 pbs_server.service
-rw-r--r--  1 root root  704 Oct 12 11:17 pbs_mom.service
-rw-r--r--  1 root root  335 Oct 12 11:17 trqauthd.service

One unit file, pbs_sched.service, is missing; copy it from /hpc/packages/torque-6.1.3/contrib/systemd into the system directory:

cp /hpc/packages/torque-6.1.3/contrib/systemd/pbs_sched.service /usr/lib/systemd/system
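
After copying a unit file in by hand, tell systemd to re-read its configuration:

systemctl daemon-reload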

Running ls -lrt /etc/profile.d shows that torque.sh is already in place; just run source /etc/profile to load it.
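
The exact contents of torque.sh are generated by the install; with this prefix they should amount to little more than the following (an approximation, not copied from the installed file):

# /etc/profile.d/torque.sh (approximate contents)
export PATH=/hpc/torque/6.1.3/bin:/hpc/torque/6.1.3/sbin:$PATH
export LD_LIBRARY_PATH=/hpc/torque/6.1.3/lib:$LD_LIBRARY_PATH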

3. Configuration

3.1 Configure the management node

3.1.1 Add the PBS admin user

Here the admin user is set to root.

Run ./torque.setup root; the script's own header describes it as: create pbs_server database and default queue.

[root@node16 torque-6.1.3]# ./torque.setup root
hostname: node16
Currently no servers active. Default server will be listed as active server. Error  15133
Active server name: node16  pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix
trqauthd successfully started
initializing TORQUE (admin: root)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?
# enter y
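
To see exactly what torque.setup created, qmgr can dump the full server configuration, including the default batch queue:

qmgr -c 'print server'
# prints "create queue batch", the queue attributes (queue_type, enabled, started,
# resources_default, ...), and the server attributes torque.setup just set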

3.1.2 Start the authentication service

Step 3.1.1 already starts the authentication daemon trqauthd, which ps axu | grep trqauthd will confirm.

A later systemctl start trqauthd will then fail, so kill the stray process first with pkill -f trqauthd and restart it under systemd:

pkill -f trqauthd
systemctl start trqauthd
systemctl enable trqauthd

3.1.3 Start the server

Declare the compute nodes in /var/spool/torque/server_priv/nodes (np=4 advertises four job slots per node):

node17 np=4
node18 np=4

Then run the following:

systemctl status pbs_server
systemctl start pbs_server   # if this fails, check whether pbs_server is already running; if so, pkill -f pbs_server and retry
systemctl enable pbs_server

Run qnodes to inspect the node status.

If the qnodes command is not found, run source /etc/profile to load the environment variables.

node17
     state = down
     power_state = Running
     np = 4
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     total_sockets = 0
     total_numa_nodes = 0
     total_cores = 0
     total_threads = 0
     dedicated_sockets = 0
     dedicated_numa_nodes = 0
     dedicated_cores = 0
     dedicated_threads = 0

node18
     state = down
     power_state = Running
     np = 4
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     total_sockets = 0
     total_numa_nodes = 0
     total_cores = 0
     total_threads = 0
     dedicated_sockets = 0
     dedicated_numa_nodes = 0
     dedicated_cores = 0
     dedicated_threads = 0
# node17 and node18 show state = down because pbs_mom is not running on them yet

3.1.4 Start the scheduler

node16 also needs systemctl start pbs_sched; without the scheduler, every submitted job stays in the Q (queued) state.

Enable it at boot with systemctl enable pbs_sched.

3.2 Configure the compute nodes

Section 3.1 completed the deployment of the management node node16, covering:

  • installing the dependencies with yum
  • extracting the source, configuring the build, compiling, and installing
  • setting up the admin user
  • editing the configuration files
  • starting the trqauthd, pbs_server, and pbs_sched services

The compute nodes need to:

  • install the dependencies with yum
  • point at the management node
  • run the install packages, or make install
  • start the trqauthd and pbs_mom services

Because everything was done inside the shared directory, node17 and node18 mostly just need to run make install; a sketch of the full per-node sequence follows.
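
A minimal sketch of those per-node steps, assuming the default TORQUE home of /var/spool/torque (the server_name file is the standard way a node learns where pbs_server runs; it is not shown explicitly in this post):

# On node17 and node18, from the shared build tree
# (install the yum dependencies from section 2.2 first):
cd /hpc/packages/torque-6.1.3
make install                                  # installs binaries, unit files, and /etc/profile.d/torque.sh

echo node16 > /var/spool/torque/server_name   # point this node at the management node

systemctl daemon-reload
systemctl start trqauthd pbs_mom
systemctl enable trqauthd pbs_mom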

4. Usage

4.1 View and activate the queue

Running torque.setup in step 3.1.1 creates a default queue named batch and sets some basic attributes on it.

You then need to run qmgr -c 'active queue batch' before jobs can be submitted to this queue.
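
To check the queue's definition and state, list it in qmgr:

qmgr -c 'list queue batch'
# the attributes to look for:
#   enabled = True    (the queue accepts new jobs)
#   started = True    (jobs in the queue are eligible to run)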

Jobs should be submitted on the management node; submitting from a compute node fails:

[liwl01@node18 ~]$ echo "sleep 120"|qsub
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating liwl01/liwl01 from node18)
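
The error means node18 is not a trusted submit host. If submitting from compute nodes is wanted, Torque exposes a server attribute for it (an option, not something configured in this post):

qmgr -c 'set server allow_node_submit = true'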

Submit a job on node16:

[liwl01@node16 ~]$ echo "sleep 300"|qsub
1.node16
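
Piping a command into qsub produces a job named STDIN, as the qstat output below shows. For anything beyond a one-liner, the usual form is a small job script; a minimal sketch (the job name, resource values, and filename are illustrative):

#!/bin/bash
#PBS -N sleep_test            # job name
#PBS -q batch                 # target queue
#PBS -l nodes=1:ppn=1         # one core on one node
#PBS -l walltime=00:10:00     # wall-clock limit
#PBS -j oe                    # merge stderr into stdout
cd $PBS_O_WORKDIR             # start in the directory qsub was run from
sleep 300

Submit it with qsub sleep_test.pbs.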

While the job is running, the S column of qstat -a -n reports state R (running):

[liwl01@node16 ~]$ qstat -a -n

node16: 
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 R  00:04:17
   node17/0

After the job finishes, the S column reports C (completed):

[liwl01@node16 ~]$ qstat -a -n

node16: 
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 C       -- 
   node17/0
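
While the server still remembers a finished job, its full record can be inspected with qstat -f, or reconstructed from the daemon logs with tracejob (run as root on the server):

qstat -f 1.node16       # full attribute dump for the job
tracejob 1.node16       # correlates server, MOM, and accounting log entries for the job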

Meanwhile, qnodes shows the job attached to node17:

[liwl01@node16 ~]$ qnodes 
node17
     state = free
     power_state = Running
     np = 4
     ntype = cluster
     jobs = 0/1.node16
     status = opsys=linux,uname=Linux node17 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,sessions=20038,nsessions=1,nusers=1,idletime=2159,totmem=3879980kb,availmem=3117784kb,physmem=3879980kb,ncpus=4,loadave=0.00,gres=,netload=1039583908,state=free,varattr= ,cpuclock=Fixed,macaddr=00:00:00:80:00:17,version=6.1.3,rectime=1634101823,jobs=1.node16
     mom_service_port = 15002
     mom_manager_port = 15003
     total_sockets = 1
     total_numa_nodes = 1
     total_cores = 4
     total_threads = 4
     dedicated_sockets = 0
     dedicated_numa_nodes = 0
     dedicated_cores = 0
     dedicated_threads = 1

5. Maintenance

To be updated later.

