PBS-Torque Cluster Deployment


PBS (Portable Batch System) handles resource management and job scheduling for compute clusters. It has three main branches: OpenPBS, PBS Pro, and Torque.

Slurm cluster deployment: https://www.cnblogs.com/liu-shaobo/p/13285839.html


I. Base environment

1. Hostnames and IP addresses
Control node: 192.168.1.11 m1
Compute node: 192.168.1.12 c1
Compute node: 192.168.1.13 c2
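
The hostnames m1, c1 and c2 are used throughout the rest of this guide, so every node needs to resolve them. The source does not show how this is done; one option, sketched here as an assumption, is to add matching /etc/hosts entries on all three nodes. The helper name hosts_entries is made up for illustration.

```shell
#!/bin/sh
# Print /etc/hosts entries for the three nodes listed above.
# Assumption: name resolution is done through /etc/hosts rather than DNS.
hosts_entries() {
    cat <<'EOF'
192.168.1.11 m1
192.168.1.12 c1
192.168.1.13 c2
EOF
}

# Preview the entries; as root, append them with: hosts_entries >> /etc/hosts
hosts_entries
```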

 

2. Host configuration
OS: CentOS 7.6 x86_64
CPU: 4 cores
Memory: 4 GB


3. Disable the firewall

# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables

 

4. Adjust resource limits

# vim /etc/security/limits.conf 
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited
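
Editing the file by hand on every node is easy to get wrong. As a rough sketch, the same entries can be appended idempotently. add_limit is a hypothetical helper, and the target path is passed in so the function can be tried on a scratch file before touching /etc/security/limits.conf.

```shell
#!/bin/sh
# add_limit FILE LINE: append LINE to FILE unless an identical line already exists.
add_limit() {
    file=$1
    line=$2
    touch "$file"
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a scratch file; as root, pass /etc/security/limits.conf instead.
target=$(mktemp)
add_limit "$target" '* hard nofile 1000000'
add_limit "$target" '* soft nofile 1000000'
add_limit "$target" '* soft nofile 1000000'   # repeated on purpose, not added twice
cat "$target"
```

The new limits apply to sessions started after the change; running ulimit -n in a fresh login shell shows the open-file limit in effect.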

 

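5. Set the time zone to CST (China Standard Time)

# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Synchronize with an NTP server (install ntp first so that ntpdate is available)

# yum install ntp -y
# ntpdate 210.72.145.44
# systemctl start ntpd
# systemctl enable ntpd

Install the EPEL repository

# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm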

 

6. Install NFS

# yum install -y nfs-utils rpcbind

Edit /etc/exports so that it exports /software

# mkdir /software
# cat /etc/exports
/software *(rw,async,insecure,no_root_squash)

Start NFS

# systemctl start nfs
# systemctl start rpcbind
# systemctl enable nfs
# systemctl enable rpcbind

Mount the NFS share on the compute nodes

# yum install -y nfs-utils
# mkdir /software
# mount m1:/software /software
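
A mount made with the mount command does not survive a reboot. Below is a minimal sketch of making it persistent, assuming the standard /etc/fstab mechanism is acceptable on the compute nodes (the source itself only mounts manually). fstab_line is an illustrative helper name.

```shell
#!/bin/sh
# fstab_line SERVER DIR: print an /etc/fstab entry that mounts SERVER:DIR on DIR.
# The _netdev option delays the mount until the network is up.
fstab_line() {
    printf '%s:%s %s nfs defaults,_netdev 0 0\n' "$1" "$2" "$2"
}

# Preview only; as root on c1 and c2: fstab_line m1 /software >> /etc/fstab
fstab_line m1 /software
```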

 

7. Set up passwordless SSH from the management node

# ssh-keygen
# ssh-copy-id -i ~/.ssh/id_rsa.pub c1
# ssh-copy-id -i ~/.ssh/id_rsa.pub c2
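
With more than a couple of compute nodes, key distribution is easier as a loop. The sketch below only prints the commands unless DRY_RUN is set to 0. The NODES list and the DRY_RUN switch are illustrative names, not part of the original setup.

```shell
#!/bin/sh
NODES="c1 c2"            # compute nodes from the host table
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

distribute_key() {
    for node in $NODES; do
        run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$node"
        # BatchMode makes ssh fail instead of prompting, so a leftover
        # password prompt shows up as an error.
        run ssh -o BatchMode=yes "$node" hostname
    done
}

distribute_key
```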

 


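II. Deploy the Torque management node

Torque consists of four services:
pbs_server : the resource manager's server; it dispatches and reclaims jobs based on the list of available node resources supplied by the scheduler;
pbs_mom    : the execution daemon on each compute node; it runs jobs and monitors the node's resource usage;
trqauthd   : the authorization daemon; Torque client commands use it to establish trusted connections to pbs_server;
pbs_sched  : the job scheduler.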
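
Which daemon runs where follows from the steps in this guide: pbs_server, pbs_sched and trqauthd are started on the management node, and pbs_mom and trqauthd on each compute node. A small lookup sketch (services_for is a made-up helper) makes that mapping explicit.

```shell
#!/bin/sh
# services_for ROLE: print the Torque daemons this guide starts for that role.
services_for() {
    case $1 in
        management) echo "pbs_server pbs_sched trqauthd" ;;
        compute)    echo "pbs_mom trqauthd" ;;
        *)          echo "unknown role: $1" >&2; return 1 ;;
    esac
}

for role in management compute; do
    echo "$role: $(services_for "$role")"
done
```

On a live node, a loop such as for s in $(services_for compute); do systemctl is-active "$s"; done would report the state of each daemon.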


1. Install dependencies

# yum install -y libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++ hwloc hwloc-devel

 

2. Install Torque

# wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz
# tar zxvf torque-6.1.3.tar.gz
# cd torque-6.1.3
# ./configure --prefix=/usr/local/torque --with-scp
# make -j4
# make install

Build the install packages needed by the compute nodes; this generates five executable scripts

# make packages
# libtool --finish /usr/local/torque/lib

 

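3. Configure the Torque server

Load the environment variables

# . /etc/profile.d/torque.sh

Initialize serverdb (qterm first shuts down any pbs_server that is already running)

# qterm
# ./torque.setup root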
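
The source does not show how c1 and c2 are registered with pbs_server, although the qnodes output further down lists both with np = 4. One common way, given here as an assumption rather than the author's method, is a nodes file under server_priv that pbs_server reads at startup. write_nodes_file is a hypothetical helper, and the directory is a parameter so it can be tried on a scratch directory first.

```shell
#!/bin/sh
# write_nodes_file DIR: write DIR/nodes with one line per compute node.
# np is the number of processors offered to jobs (4 cores per host in this setup).
write_nodes_file() {
    mkdir -p "$1" || return 1
    cat > "$1/nodes" <<'EOF'
c1 np=4
c2 np=4
EOF
}

# Scratch run; on m1 the directory would be /var/spool/torque/server_priv
scratch=$(mktemp -d)
write_nodes_file "$scratch"
cat "$scratch/nodes"
```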

 

4. Start the Torque server

# qterm
# systemctl enable pbs_server
# systemctl start pbs_server
# systemctl enable trqauthd
# systemctl start trqauthd

 

 

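III. Deploy the Torque compute nodes

1. Install the client
Copy the install packages from the torque build directory to the compute nodes, or to the NFS directory

# ./torque-package-mom-linux-x86_64.sh --install
# ./torque-package-clients-linux-x86_64.sh --install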
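
The copy step can be sketched as follows, assuming the packages were built in the torque-6.1.3 source directory and that the NFS share /software from section I is the destination. copy_packages is an illustrative helper, and the example paths in the final comment are assumptions. Only the mom and clients packages are copied, since those are the two installed on the compute nodes.

```shell
#!/bin/sh
# copy_packages SRC DEST: copy the mom and clients self-extracting packages.
copy_packages() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for kind in mom clients; do
        pkg="$src/torque-package-$kind-linux-x86_64.sh"
        if [ -f "$pkg" ]; then
            cp "$pkg" "$dest/" && echo "copied $(basename "$pkg")"
        else
            echo "missing: $pkg" >&2
            return 1
        fi
    done
}

# On m1 (paths assumed): copy_packages /root/torque-6.1.3 /software/torque
```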

 

2. Configure the client

# vim /var/spool/torque/mom_priv/config
$pbsserver m1
$logevent 225
$loglevel 4
$usecp m1:/data /data

 

3. Start the client

# systemctl enable pbs_mom
# systemctl start pbs_mom
# systemctl enable trqauthd
# systemctl start trqauthd

Make sure the server_name file contains the management node's hostname

# cat /var/spool/torque/server_name
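
This check can be automated. Below is a minimal sketch, with check_server_name as a hypothetical helper that compares the first line of the file with the expected hostname.

```shell
#!/bin/sh
# check_server_name FILE EXPECTED: succeed only if FILE's first line equals EXPECTED.
check_server_name() {
    actual=$(head -n 1 "$1" 2>/dev/null)
    if [ "$actual" = "$2" ]; then
        echo "ok: server_name is $2"
    else
        echo "mismatch: found '$actual', expected '$2'" >&2
        return 1
    fi
}

# On a compute node: check_server_name /var/spool/torque/server_name m1
```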

 

Check the status of each node

# qnodes
c1
     state = free
     power_state = Running
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux c1 4.19.0-6.el7.ucloud.x86_64 #1 SMP Wed Feb 12 07:32:16 UTC 2020 x86_64,sessions=44684,nsessions=1,nusers=1,idletime=142984,totmem=3873444kb,availmem=3429928kb,physmem=3873444kb,ncpus=4,loadave=0.00,gres=,netload=358955336,state=free,varattr= ,cpuclock=Fixed,macaddr=52:54:00:ba:9e:8b,version=6.1.3,rectime=1597632442,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003

c2
     state = free
     power_state = Running
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux c2 4.19.0-6.el7.ucloud.x86_64 #1 SMP Wed Feb 12 07:32:16 UTC 2020 x86_64,nsessions=0,nusers=0,idletime=2262,totmem=3873444kb,availmem=3553212kb,physmem=3873444kb,ncpus=4,loadave=0.01,gres=,netload=61440494,state=free,varattr= ,cpuclock=Fixed,macaddr=52:54:00:e9:4a:a6,version=6.1.3,rectime=1597632467,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003
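
For more than a few nodes the full listing is hard to scan. The awk sketch below reduces qnodes output to one "name state" line per node. It relies only on the layout shown above: an unindented node name followed by an indented "state = ..." line. node_states is an illustrative name.

```shell
#!/bin/sh
# node_states: read qnodes output on stdin, print "<node> <state>" per node.
node_states() {
    awk '
        /^[^ \t]/         { node = $1 }        # unindented line: node name
        /^[ \t]+state = / { print node, $3 }   # indented state line
    '
}

# Sample input trimmed from the listing above; on m1 use: qnodes | node_states
node_states <<'EOF'
c1
     state = free
     power_state = Running
     np = 4

c2
     state = free
     power_state = Running
     np = 4
EOF
```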

 

 

IV. Configure the scheduler on the management node

1. Start the scheduler

# cp contrib/systemd/pbs_sched.service /usr/lib/systemd/system/
# systemctl enable pbs_sched
# systemctl start pbs_sched

2. Configure the queue

# qmgr -c 'create queue batch'
# qmgr -c 'set server default_queue=batch'
# qmgr -c 'set server query_other_jobs=true'
# qmgr -c 'set queue batch queue_type=execution'
# qmgr -c 'set queue batch started=true'
# qmgr -c 'set queue batch enabled=true'
# qmgr -c 'set queue batch resources_default.nodes=1'
# qmgr -c 'set server scheduling=true'
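
The same directives can be kept in one place and piped to qmgr, which reads directives from standard input when no -c option is given. This is a sketch, with queue_directives as a made-up helper name; the directives themselves are exactly the ones used in this step.

```shell
#!/bin/sh
# queue_directives: print the qmgr directives for the batch queue, one per line.
queue_directives() {
    cat <<'EOF'
create queue batch
set server default_queue=batch
set server query_other_jobs=true
set queue batch queue_type=execution
set queue batch started=true
set queue batch enabled=true
set queue batch resources_default.nodes=1
set server scheduling=true
EOF
}

# Preview only; on m1 as root: queue_directives | qmgr
queue_directives
```

Afterwards, qmgr -c 'print server' prints the resulting server and queue configuration for review.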

 

3. Test (set up passwordless SSH to the compute nodes, and run as a regular user)

$ echo "sleep 30" | qsub
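
For anything beyond a one-liner, jobs are usually submitted as a script with #PBS directives. Below is a minimal sketch, assuming a queue named batch exists; the job name, walltime and file name are arbitrary example values.

```shell
#!/bin/sh
# Write a small job script; the #PBS lines are read by qsub, not by the shell.
job_dir=$(mktemp -d)
job="$job_dir/sleep_job.sh"

cat > "$job" <<'EOF'
#!/bin/sh
#PBS -N sleep_test
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
# PBS_O_WORKDIR is set by Torque to the directory qsub was run from
cd "${PBS_O_WORKDIR:-.}"
hostname
sleep 30
EOF

echo "wrote $job; submit as a regular user with: qsub $job"
```

With -j oe, standard output and standard error are merged into a single output file named after the job, which makes a first test run easier to inspect.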

 

View job information

$ qstat -a

 

