1. Overview
This post walks through deploying, configuring, and using the PBS scheduler on a CentOS 7 cluster.
CentOS 7 version: CentOS Linux release 7.9.2009 (Core)
PBS scheduler version: torque-6.1.3.tar.gz
Cluster information:
Three virtual nodes are used to deploy the PBS system.
Node name | Node IP | Role | Services
---|---|---|---
node16 | 192.168.80.16 | Management node, login node | pbs_server, pbs_sched
node17 | 192.168.80.17 | Compute node | pbs_mom
node18 | 192.168.80.18 | Compute node | pbs_mom
This post covers only the basic deployment, configuration, and use of the Torque scheduler. More advanced topics, such as MPI environments, GUI display, GPU scheduling, Munge authentication, querying job data from the database, and high-availability configuration, are not explored in detail here and may be covered in a later update.
2. Deployment
A cluster generally needs synchronized time and unified identity authentication; those two steps are not covered in this post.
node16-18 used here already have unified identity authentication through LDAP and SSSD.
By convention, a software installation directory on node16 is used as a shared directory, exported to node17 and node18.
2.1 Creating and mounting the shared directory
On node16, run mkdir -p /hpc/torque/6.1.3; this directory holds the Torque installation and is shared with the other nodes.
Run mkdir -p /hpc/packages/ to create the build directory for Torque.
Edit /etc/exports with the following content:
/hpc 192.168.80.0/24(rw,no_root_squash,no_all_squash)
/home 192.168.80.0/24(rw,no_root_squash,no_all_squash)
Run systemctl start nfs && systemctl enable nfs to start the NFS service and enable it at boot.
Run exportfs -r to re-export the shares immediately.
On node17 and node18, run:
mkdir -p /hpc
mount.nfs 192.168.80.16:/hpc /hpc
mount.nfs 192.168.80.16:/home /home
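The mounts above do not survive a reboot; one way to make them persistent is an /etc/fstab entry on node17 and node18. A sketch (the option names are standard nfs-utils; adjust to taste):

```shell
# /etc/fstab on node17 and node18: mount the shared directories at boot.
# _netdev delays the mount until the network is up.
192.168.80.16:/hpc   /hpc   nfs  defaults,_netdev  0 0
192.168.80.16:/home  /home  nfs  defaults,_netdev  0 0
```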
2.2 Deploying the Torque software
Download:
wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz
Extract into the Torque build directory:
tar -zxvf torque-6.1.3.tar.gz -C /hpc/packages
Configure the build:
# 1. Install the build dependencies with yum
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
# yum groupinstall "GNOME Desktop" "Graphical Administrator Tools"  # needed only when building with --enable-gui
# 2. Pass options to configure to set up the build
./configure \
--prefix=/hpc/torque/6.1.3 \
--mandir=/hpc/torque/6.1.3/man \
--enable-cgroups \
--enable-syslog \
--enable-drmaa \
--enable-gui \
--with-xauth \
--with-hwloc \
--with-pam \
--with-tcl \
--with-tk
# --enable-numa-support  # enabling this requires editing mom.layout; the rules are unclear, so it is left out for now
# 3. To be updated: support for MPI, GPUs, Munge authentication, high availability, etc. may be added in a later revision of this post
When configure finishes, it prints:
Building components: server=yes mom=yes clients=yes
gui=yes drmaa=yes pam=yes
PBS Machine type : linux
Remote copy : /usr/bin/scp -rpB
PBS home : /var/spool/torque
Default server : node13
Unix Domain sockets :
Linux cpusets : no
Tcl : -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Tk : -L/usr/lib64 -ltk8.5 -lX11 -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Authentication : trqauthd
Ready for 'make'.
Build and install
# 1. Build and install
make -j4 && make install
# 2. Generate self-extracting install packages (optional)
# make packages
Output of make packages (this step can be skipped: in this post the build directory sits on a shared filesystem, so it is enough to run make install on each node):
[root@node16 torque-6.1.3]# make packages
Building packages from /hpc/packages/torque-6.1.3/tpackages
rm -rf /hpc/packages/torque-6.1.3/tpackages
mkdir /hpc/packages/torque-6.1.3/tpackages
Building ./torque-package-server-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-mom-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-clients-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-pam-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /lib64/security'
Building ./torque-package-drmaa-linux-x86_64.sh ...
libtool: install: warning: relinking `libdrmaa.la'
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-devel-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-doc-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied
and executed on your production machines. Use --help for options.
Run libtool --finish /hpc/torque/6.1.3/lib (this step can also be skipped; make install does it by default):
[root@node16 torque-6.1.3]# libtool --finish /hpc/torque/6.1.3/lib
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /hpc/torque/6.1.3/lib
----------------------------------------------------------------------
Libraries have been installed in:
/hpc/torque/6.1.3/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
Next, set up the environment variables and the systemd unit files.
Running ls -lrt /usr/lib/systemd/system at this point shows that the following unit files are already present:
-rw-r--r-- 1 root root 1284 Oct 12 11:17 pbs_server.service
-rw-r--r-- 1 root root  704 Oct 12 11:17 pbs_mom.service
-rw-r--r-- 1 root root  335 Oct 12 11:17 trqauthd.service
The pbs_sched.service unit file is missing; copy it from the source tree into the system directory:
cp /hpc/packages/torque-6.1.3/contrib/systemd/pbs_sched.service /usr/lib/systemd/system
Running ls -lrt /etc/profile.d shows that torque.sh is already there; just run source /etc/profile to load it.
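For reference, a sketch of what the generated /etc/profile.d/torque.sh is assumed to contain (the exact contents may differ by version; it simply prepends the install prefix to the relevant search paths):

```shell
# Assumed contents of /etc/profile.d/torque.sh (generated by make install):
# prepend the Torque install prefix to the shell's search paths
export PATH=/hpc/torque/6.1.3/bin:/hpc/torque/6.1.3/sbin:$PATH
export LD_LIBRARY_PATH=/hpc/torque/6.1.3/lib:$LD_LIBRARY_PATH
export MANPATH=/hpc/torque/6.1.3/man:$MANPATH
```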
3. Configuration
3.1 Configuring the management node
3.1.1 Adding the PBS admin user
Here root is used as the admin user.
Run ./torque.setup; the script's own comment describes it as: create pbs_server database and default queue.
[root@node16 torque-6.1.3]# ./torque.setup root
hostname: node13
Currently no servers active. Default server will be listed as active server. Error 15133
Active server name: node13 pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix
trqauthd successfully started
initializing TORQUE (admin: root)
You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?
# enter y
3.1.2 Starting the authentication service
The step in 3.1.1 already starts the authentication service trqauthd, which can be verified with ps aux | grep trqauthd.
A later systemctl start trqauthd will fail while that process is running, so it is best to kill it first with pkill -f trqauthd and then start it through systemd:
pkill -f trqauthd
systemctl start trqauthd
systemctl enable trqauthd
3.1.3 Starting the server
Configure the compute nodes in /var/spool/torque/server_priv/nodes:
node17 np=4
node18 np=4
Then run the following commands:
systemctl status pbs_server
systemctl start pbs_server   # if this fails, check whether pbs_server is already running; if so, run pkill -f pbs_server and retry
systemctl enable pbs_server
Run qnodes to view node information.
If the qnodes command is not found, run source /etc/profile to load the environment variables.
node17
state = down
power_state = Running
np = 4
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 0
total_numa_nodes = 0
total_cores = 0
total_threads = 0
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
node18
state = down
power_state = Running
np = 4
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 0
total_numa_nodes = 0
total_cores = 0
total_threads = 0
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
# node17 and node18 show state = down because pbs_mom is not yet running on them
3.1.4 Starting the scheduler
On node16, also run systemctl start pbs_sched; otherwise every submitted job sits in the Q (queued) state.
Enable it at boot with systemctl enable pbs_sched.
3.2 Configuring the compute nodes
Section 3.1 completed the deployment of the management node node16, including:
- installing the dependencies with yum
- extracting the source, configuring, building, and installing
- setting the admin user
- editing the configuration files
- starting the trqauthd, pbs_server, and pbs_sched services
What remains for the compute nodes:
- installing the dependencies with yum
- configuring the management-node information
- running the install packages, or make install
- starting the trqauthd and pbs_mom services
Because every operation so far was done in the shared directory, it is enough to run make install on node17 and node18.
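Putting that together, a sketch of the per-node steps on node17 and node18 under the assumptions of this post (shared build tree, server named node16). The mom_priv/config `$pbsserver` line and the server_name file are standard Torque locations, but verify the paths on your system:

```shell
# On node17 and node18: install from the shared build tree
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel \
    pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
cd /hpc/packages/torque-6.1.3 && make install

# Point the node at the management node
echo '$pbsserver node16' > /var/spool/torque/mom_priv/config
echo 'node16'            > /var/spool/torque/server_name

# Load the environment and start the services
source /etc/profile
systemctl start trqauthd pbs_mom && systemctl enable trqauthd pbs_mom
```

Afterwards, qnodes on node16 should report the nodes as free instead of down.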
4. Usage
4.1 Viewing and activating the queue
Running torque.setup in step 3.1.1 creates a default queue named batch and sets some basic attributes on it.
The queue must then be activated, e.g. with qmgr -c "active queue batch", before jobs can be submitted to it.
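In qmgr, administrative commands are passed with -c. A hedged sketch of enabling and checking the batch queue, in case torque.setup has not already set these attributes:

```shell
# Make the default queue accept and run jobs, then inspect it
qmgr -c "set queue batch enabled = true"
qmgr -c "set queue batch started = true"
qmgr -c "list queue batch"
qstat -q    # summary view of all queues
```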
Jobs must be submitted on the management node; submitting from a compute node fails:
[liwl01@node18 ~]$ echo "sleep 120"|qsub
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating liwl01/liwl01 from node18)
Submit the job on node16:
[liwl01@node16 ~]$ echo "sleep 300"|qsub
1.node16
While the job runs, the S column shows state R (running):
[liwl01@node16 ~]$ qstat -a -n
node16:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16 liwl01 batch STDIN 20038 1 1 -- 01:00:00 R 00:04:17
node17/0
After the job finishes, the S column shows state C (completed):
[liwl01@node16 ~]$ qstat -a -n
node16:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16 liwl01 batch STDIN 20038 1 1 -- 01:00:00 C --
node17/0
Meanwhile qnodes shows:
[liwl01@node16 ~]$ qnodes
node17
state = free
power_state = Running
np = 4
ntype = cluster
jobs = 0/1.node16
status = opsys=linux,uname=Linux node17 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,sessions=20038,nsessions=1,nusers=1,idletime=2159,totmem=3879980kb,availmem=3117784kb,physmem=3879980kb,ncpus=4,loadave=0.00,gres=,netload=1039583908,state=free,varattr= ,cpuclock=Fixed,macaddr=00:00:00:80:00:17,version=6.1.3,rectime=1634101823,jobs=1.node16
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 1
total_numa_nodes = 1
total_cores = 4
total_threads = 4
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 1
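The examples above pipe a command string into qsub; jobs are more commonly submitted from a script of #PBS directives. A minimal sketch (the job name and resource values are illustrative):

```shell
#!/bin/bash
#PBS -N sleep_test            # job name
#PBS -q batch                 # queue created by torque.setup
#PBS -l nodes=1:ppn=1         # one core on one node
#PBS -l walltime=00:10:00     # wall-clock limit
#PBS -j oe                    # merge stderr into stdout

cd "$PBS_O_WORKDIR"           # Torque starts jobs in $HOME by default
sleep 120
```

Submit it with qsub job.sh and monitor it with qstat -a -n, as in the output above.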
5. Maintenance
To be updated in a later revision.