1. Overview
This post walks through deploying, configuring, and using the PBS scheduling system (Torque) on a CentOS 7 cluster.
CentOS 7 release: CentOS Linux release 7.9.2009 (Core)
PBS scheduler version: torque-6.1.3.tar.gz
Cluster information:
Three virtual nodes are used to deploy the PBS system.
Node name | Node IP | Node role | Node service |
---|---|---|---|
node16 | 192.168.80.16 | Management node, login node | pbs_server pbs_sched |
node17 | 192.168.80.17 | Compute node | pbs_mom |
node18 | 192.168.80.18 | Compute node | pbs_mom |
This post covers only the basic deployment, configuration, and use of the Torque PBS scheduler. More advanced features, such as an MPI environment, GUI display, GPU scheduling, Munge authentication, pulling data from the database, and high-availability configuration, are not explored here and may be filled in later.
2. Deployment
A cluster generally needs synchronized clocks and cluster-wide identity management; neither step is covered in this post.
Here node16 through node18 already share identities via LDAP and SSSD.
By convention, a software installation directory on node16 is exported as a shared directory to node17 and node18.
2.1 Create and mount the shared directory
On node16, run mkdir -p /hpc/torque/6.1.3; this directory holds the Torque installation and is shared with the other nodes.
Run mkdir -p /hpc/packages/ to create the working directory for building Torque.
Edit /etc/exports with the following contents:
/hpc 192.168.80.0/24(rw,no_root_squash,no_all_squash)
/home 192.168.80.0/24(rw,no_root_squash,no_all_squash)
Run systemctl start nfs && systemctl enable nfs to start NFS now and enable it at boot.
Run exportfs -r to make the exports take effect immediately.
On node17 and node18, run:
mkdir -p /hpc
mount.nfs 192.168.80.16:/hpc /hpc
mount.nfs 192.168.80.16:/home /home
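To confirm the exports and mounts took effect, a quick check along these lines can be run (hostnames and IPs match this post's cluster):

```shell
# On node16: list what the NFS server is exporting
showmount -e 192.168.80.16

# On node17 / node18: verify /hpc and /home are NFS mounts
mount | grep -E '/hpc|/home'
df -h /hpc /home
```

To make the mounts survive reboots, they can also be added to /etc/fstab on the compute nodes, e.g. `192.168.80.16:/hpc /hpc nfs defaults 0 0`.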
2.2 Deploy the Torque software
Download:
wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz
Extract into the build working directory:
tar -zxvf torque-6.1.3.tar.gz -C /hpc/packages
Configure the build:
# 1. Install the build dependencies with yum
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
# yum groupinstall "GNOME Desktop" "Graphical Administrator Tools"  # needed only when building with --enable-gui
# 2. Pass options to configure to set up the build and install paths
./configure \
--prefix=/hpc/torque/6.1.3 \
--mandir=/hpc/torque/6.1.3/man \
--enable-cgroups \
--enable-syslog \
--enable-drmaa \
--enable-gui \
--with-xauth \
--with-hwloc \
--with-pam \
--with-tcl \
--with-tk
# --enable-numa-support is left out for now: it requires a hand-written mom.layout, whose format is unclear
# 3. To be revisited: support for MPI, GPUs, Munge authentication, HA, etc. may be added in a later update
When configure finishes, it prints a summary:
Building components: server=yes mom=yes clients=yes
gui=yes drmaa=yes pam=yes
PBS Machine type : linux
Remote copy : /usr/bin/scp -rpB
PBS home : /var/spool/torque
Default server : node13
Unix Domain sockets :
Linux cpusets : no
Tcl : -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Tk : -L/usr/lib64 -ltk8.5 -lX11 -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Authentication : trqauthd
Ready for 'make'.
Build and install:
# 1. Build and install
make -j4 && make install
# 2. Generate self-extracting install packages (optional)
# make packages
The make packages step can be skipped here: since the build directory lives on the shared filesystem, running make install on each node is enough. For reference, its output looks like:
[root@node16 torque-6.1.3]# make packages
Building packages from /hpc/packages/torque-6.1.3/tpackages
rm -rf /hpc/packages/torque-6.1.3/tpackages
mkdir /hpc/packages/torque-6.1.3/tpackages
Building ./torque-package-server-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-mom-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-clients-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-pam-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /lib64/security'
Building ./torque-package-drmaa-linux-x86_64.sh ...
libtool: install: warning: relinking `libdrmaa.la'
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-devel-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-doc-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied
and executed on your production machines. Use --help for options.
Run libtool --finish /hpc/torque/6.1.3/lib. This step can also be skipped, since make install performs it by default.
[root@node16 torque-6.1.3]# libtool --finish /hpc/torque/6.1.3/lib
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /hpc/torque/6.1.3/lib
----------------------------------------------------------------------
Libraries have been installed in:
/hpc/torque/6.1.3/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
Next, set up the environment variables and the systemd unit files.
Running ls -lrt /usr/lib/systemd/system shows that these units are already installed:
-rw-r--r-- 1 root root 1284 Oct 12 11:17 pbs_server.service
-rw-r--r-- 1 root root 704 Oct 12 11:17 pbs_mom.service
-rw-r--r-- 1 root root 335 Oct 12 11:17 trqauthd.service
The pbs_sched.service unit is missing; copy it over from the source tree:
cp /hpc/packages/torque-6.1.3/contrib/systemd/pbs_sched.service /usr/lib/systemd/system
Running ls -lrt /etc/profile.d shows that torque.sh is already in place; just run source /etc/profile to load it.
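For reference, the generated /etc/profile.d/torque.sh is essentially a pair of path exports along these lines (a sketch; exact contents may differ by version):

```shell
# /etc/profile.d/torque.sh (approximate contents)
export PATH=/hpc/torque/6.1.3/bin:/hpc/torque/6.1.3/sbin:$PATH
export LD_LIBRARY_PATH=/hpc/torque/6.1.3/lib:$LD_LIBRARY_PATH
```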
3. Configuration
3.1 Configure the management node
3.1.1 Add the PBS admin user
Here the root user is used.
Run ./torque.setup; the script's own comment describes it as: create pbs_server database and default queue.
[root@node16 torque-6.1.3]# ./torque.setup root
hostname: node13
Currently no servers active. Default server will be listed as active server. Error 15133
Active server name: node13 pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix
trqauthd successfully started
initializing TORQUE (admin: root)
You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?
# enter y
3.1.2 Start the authentication service
The step in 3.1.1 already starts the trqauthd authentication service, which you can verify with ps aux | grep trqauthd.
Starting it again later via systemctl start trqauthd will fail, so it is best to kill the existing process with pkill -f trqauthd first and then restart it through systemd:
pkill -f trqauthd
systemctl start trqauthd
systemctl enable trqauthd
3.1.3 Start the server
Declare the compute nodes by editing /var/spool/torque/server_priv/nodes:
node17 np=4
node18 np=4
Then run the following commands:
systemctl status pbs_server
systemctl start pbs_server  # if this fails, check whether pbs_server is already running; if so, run pkill -f pbs_server and retry
systemctl enable pbs_server
Run qnodes to view the node information.
If the qnodes command is not found, run source /etc/profile to load the environment variables.
node17
state = down
power_state = Running
np = 4
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 0
total_numa_nodes = 0
total_cores = 0
total_threads = 0
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
node18
state = down
power_state = Running
np = 4
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 0
total_numa_nodes = 0
total_cores = 0
total_threads = 0
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 0
# node17 and node18 show state = down because pbs_mom has not been started on them yet
3.1.4 Start the scheduler
On node16, you must also run systemctl start pbs_sched; otherwise every submitted job sits in the Q state.
Enable it at boot with systemctl enable pbs_sched.
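Putting 3.1.2 through 3.1.4 together, the management-node startup sequence can be summarized as the following sketch:

```shell
# Management node (node16): start the services in dependency order
pkill -f trqauthd || true                              # clear the copy left running by torque.setup
systemctl start trqauthd && systemctl enable trqauthd  # authentication daemon first
systemctl start pbs_server && systemctl enable pbs_server
systemctl start pbs_sched && systemctl enable pbs_sched

# Sanity check: all three daemons should report "active"
systemctl is-active trqauthd pbs_server pbs_sched
```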
3.2 Configure the compute nodes
Section 3.1 covered deploying the management node node16, including:
- installing the dependencies with yum
- extracting the source, configuring the build, and building and installing
- setting up the admin user
- editing the configuration files
- starting the trqauthd, pbs_server, and pbs_sched services
The compute nodes need to:
- install the dependencies with yum
- be pointed at the management node
- run the install packages, or make install
- start the trqauthd and pbs_mom services
Because all of the work was done under the shared directory, running make install on node17 and node18 is all that is required.
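Concretely, the compute-node steps above might look like this on node17 and node18 (a sketch; /var/spool/torque/server_name is Torque's default server_name file):

```shell
# 1. Install the build dependencies (same list as the management node)
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel \
    pam-devel hwloc-devel numactl-devel tcl-devel tk-devel

# 2. Install from the shared build directory
cd /hpc/packages/torque-6.1.3 && make install

# 3. Point pbs_mom at the management node
echo node16 > /var/spool/torque/server_name

# 4. Load the environment and start the services
source /etc/profile
systemctl start trqauthd && systemctl enable trqauthd
systemctl start pbs_mom && systemctl enable pbs_mom
```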
4. Usage
4.1 View and activate the queue
Running torque.setup in 3.1.1 created a default queue named batch and set some basic attributes on it.
At this point you still need to run qmgr -c "active queue batch" before jobs can be submitted to the queue.
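The queue's state and attributes can be inspected with qmgr or qstat, for example:

```shell
# Show the batch queue's attributes (enabled and started should both be True)
qmgr -c "list queue batch"

# Or a one-line summary of every queue
qstat -q
```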
Jobs must be submitted on the management node; submitting on a compute node fails:
[liwl01@node18 ~]$ echo "sleep 120"|qsub
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating liwl01/liwl01 from node18)
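If submission from compute nodes is wanted, Torque's allow_node_submit server attribute can be turned on (not done in this post):

```shell
# Run on the management node; permits qsub from any node in the nodes file
qmgr -c "set server allow_node_submit = true"
```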
Submit the job on node16 instead:
[liwl01@node16 ~]$ echo "sleep 300"|qsub
1.node16
While the job is running, the S column shows state R (running):
[liwl01@node16 ~]$ qstat -a -n
node16:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16 liwl01 batch STDIN 20038 1 1 -- 01:00:00 R 00:04:17
node17/0
After the job finishes, the S column shows state C (completed):
[liwl01@node16 ~]$ qstat -a -n
node16:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16 liwl01 batch STDIN 20038 1 1 -- 01:00:00 C --
node17/0
And the qnodes output while the job runs:
[liwl01@node16 ~]$ qnodes
node17
state = free
power_state = Running
np = 4
ntype = cluster
jobs = 0/1.node16
status = opsys=linux,uname=Linux node17 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,sessions=20038,nsessions=1,nusers=1,idletime=2159,totmem=3879980kb,availmem=3117784kb,physmem=3879980kb,ncpus=4,loadave=0.00,gres=,netload=1039583908,state=free,varattr= ,cpuclock=Fixed,macaddr=00:00:00:80:00:17,version=6.1.3,rectime=1634101823,jobs=1.node16
mom_service_port = 15002
mom_manager_port = 15003
total_sockets = 1
total_numa_nodes = 1
total_cores = 4
total_threads = 4
dedicated_sockets = 0
dedicated_numa_nodes = 0
dedicated_cores = 0
dedicated_threads = 1
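The examples above pipe a one-liner into qsub; in practice jobs are usually submitted from a script with #PBS directives. A minimal sketch for this cluster (job name and limits are illustrative):

```shell
#!/bin/bash
#PBS -N sleep_test          # job name
#PBS -q batch               # target queue
#PBS -l nodes=1:ppn=1       # 1 node, 1 core
#PBS -l walltime=00:10:00   # wall-clock limit
#PBS -j oe                  # merge stderr into stdout

cd "$PBS_O_WORKDIR"         # jobs start in $HOME by default
echo "Running on $(hostname)"
sleep 120
```

Submit it with `qsub job.sh` and monitor with `qstat -a -n`, as shown above.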
5. Maintenance
To be updated later.