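Slurm is an open-source job scheduler for Linux and Unix-like systems, used by many of the world's supercomputers. Its main functions are:
1. Allocating compute-node resources to users so that they can run jobs;
2. Providing a framework for starting, executing, and monitoring jobs (usually parallel jobs) on the set of allocated nodes;
3. Arbitrating contention for resources by managing a queue of pending jobs.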
Slurm architecture:
Screenshot source: https://slurm.schedmd.com/quickstart.html
PBS-Torque cluster deployment: https://www.cnblogs.com/liu-shaobo/p/13526084.html
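I. Base environment
1. Hostnames and IP addresses
Control node: 192.168.1.11 m1
Compute node: 192.168.1.12 c1
Compute node: 192.168.1.13 c2
Set the hostname on each of the three nodes (run each command on its own node):
# hostnamectl set-hostname m1
# hostnamectl set-hostname c1
# hostnamectl set-hostname c2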
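The later steps address the nodes by name (for example `ssh-copy-id ... c1`), so every node has to be able to resolve m1, c1, and c2. Assuming no DNS is available, one way to do this is to add the three addresses to /etc/hosts. This is a sketch layered on top of the original steps; the function name `add_cluster_hosts` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the cluster's IP/name pairs to a hosts file.
# Entries that are already present are skipped, so it can be re-run safely.
add_cluster_hosts() {
    local hosts_file="${1:-/etc/hosts}"
    local entry
    for entry in "192.168.1.11 m1" "192.168.1.12 c1" "192.168.1.13 c2"; do
        # -x: match the whole line, -F: treat the entry as a fixed string
        grep -qxF "$entry" "$hosts_file" 2>/dev/null || echo "$entry" >> "$hosts_file"
    done
}

# Usage (as root, on every node):
#   add_cluster_hosts /etc/hosts
```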
2. Host configuration
OS: CentOS 7.6 x86_64
CPU: 2 cores
Memory: 4 GB
3. Disable the firewall
# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables
4. Adjust resource limits
Make sure /etc/security/limits.conf contains the following entries:
# cat /etc/security/limits.conf
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited
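Entries in limits.conf are applied by PAM at login, so they only show up in a new session. A small check such as the following can confirm that the open-file limit took effect; the function name `check_nofile_limit` is hypothetical, and only the nofile entry is checked here.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed when the soft open-file limit of the current
# shell is at least the requested value (or is unlimited).
check_nofile_limit() {
    local want="$1"
    local have
    have="$(ulimit -Sn)"
    if [ "$have" = "unlimited" ]; then
        return 0
    fi
    [ "$have" -ge "$want" ]
}

# Usage (in a new login session, after editing limits.conf):
#   check_nofile_limit 1000000 && echo "nofile limit OK" || echo "nofile limit too low"
```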
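5. Configure the time zone
Set the time zone to CST (China Standard Time):
# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
Synchronize with an NTP server (install the ntp package first, since it provides ntpdate on a minimal system):
# yum install ntp -y
# ntpdate 210.72.145.44
# systemctl start ntpd
# systemctl enable ntpd
Install the EPEL repository: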
# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm
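6. Install NFS (control node)
# yum -y install nfs-utils rpcbind
Create the shared directory if it does not exist yet:
# mkdir -p /software
Edit the /etc/exports file:
# cat /etc/exports
/software/ *(rw,async,insecure,no_root_squash)
Start NFS:
# systemctl start nfs
# systemctl start rpcbind
# systemctl enable nfs
# systemctl enable rpcbind
Mount the NFS share on the clients (compute nodes):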
# yum -y install nfs-utils
# mkdir /software
# mount 192.168.1.11:/software /software
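7. Configure passwordless SSH login
On the control node, generate a key pair and copy the public key to the compute nodes:
# ssh-keygen
# ssh-copy-id -i .ssh/id_rsa.pub c1
# ssh-copy-id -i .ssh/id_rsa.pub c2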
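Before moving on, it can help to confirm that key-based login really works for every compute node, because later steps (scp, remote unmunge) rely on it. The following is a minimal sketch; `check_passwordless_ssh` is a hypothetical name, and the 5-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether key-based SSH login works for each host.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
    local host failed=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (on the control node):
#   check_passwordless_ssh c1 c2
```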
II. Configure Munge
1. Create the Munge user
The munge user must have the same UID and GID on the master node and on all compute nodes, and Munge must be installed on every node.
# groupadd -g 1108 munge
# useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
2. Build up the entropy pool
# yum install -y rng-tools
Use /dev/urandom as the entropy source:
# rngd -r /dev/urandom
# vim /usr/lib/systemd/system/rngd.service
Change the following setting:
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom
# systemctl daemon-reload
# systemctl start rngd
# systemctl enable rngd
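To see whether rngd is keeping the pool filled, the kernel's entropy counter can be read from /proc. The sketch below assumes a threshold of 1000 bits, which is an arbitrary illustrative value; `entropy_ok` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed when the kernel reports at least `min` bits of
# available entropy. The source path is a parameter only so that the function
# can be exercised against a test file.
entropy_ok() {
    local min="${1:-1000}"
    local src="${2:-/proc/sys/kernel/random/entropy_avail}"
    local avail
    avail="$(cat "$src")" || return 2
    echo "entropy_avail=$avail"
    [ "$avail" -ge "$min" ]
}

# Usage:
#   entropy_ok 1000 || echo "entropy pool is low; check that rngd is running"
```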
3. Deploy Munge
Munge is an authentication service that validates the UID and GID of processes on local or remote hosts.
# yum install munge munge-libs munge-devel -y
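Create the global key
On the master node, create the key that the whole cluster will share. Either of the following commands generates a key (the second overwrites the result of the first), so running one of them is enough:
# /usr/sbin/create-munge-key -r
# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key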
Synchronize the key to all compute nodes:
# scp -p /etc/munge/munge.key root@c1:/etc/munge
# scp -p /etc/munge/munge.key root@c2:/etc/munge
Then set the ownership and permissions of the key on every node:
# chown munge: /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key
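A mismatched or world-readable key is a common reason for Munge failing to start or rejecting credentials. As a rough sketch, the following checks the mode of the key and prints its SHA-256 digest so that the value can be compared across nodes; `check_munge_key` is a hypothetical name, and `stat -c` assumes GNU coreutils as shipped with CentOS.

```shell
#!/usr/bin/env bash
# Hypothetical check: verify that the key file has mode 400, then print its
# SHA-256 digest so the value can be compared between nodes.
check_munge_key() {
    local key="${1:-/etc/munge/munge.key}"
    local mode
    mode="$(stat -c '%a' "$key")" || return 2
    if [ "$mode" != "400" ]; then
        echo "unexpected mode $mode on $key" >&2
        return 1
    fi
    sha256sum "$key" | awk '{print $1}'
}

# Usage (as root): the digest should be identical on every node
#   check_munge_key
#   ssh c1 "sha256sum /etc/munge/munge.key"
```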
Start Munge on all nodes:
# systemctl start munge
# systemctl enable munge
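4. Test the Munge service
Verify the connection between each compute node and the control node.
Generate a credential locally:
# munge -n
Decode it locally:
# munge -n | unmunge
Verify a compute node by decoding remotely:
# munge -n | ssh c1 unmunge
Benchmark Munge credentials:
# remunge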
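The remote decode can be repeated for every compute node in one go. The loop below is a sketch of that idea and relies on unmunge returning a non-zero status when it cannot decode a credential; `check_munge_nodes` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: encode a credential locally and decode it on each node.
# The pipeline's status is that of the remote unmunge (via ssh).
check_munge_nodes() {
    local node failed=0
    for node in "$@"; do
        if munge -n | ssh -o BatchMode=yes "$node" unmunge >/dev/null 2>&1; then
            echo "$node: credential accepted"
        else
            echo "$node: credential REJECTED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (on the control node):
#   check_munge_nodes c1 c2
```

A rejected credential usually points to a key that differs between nodes or to clocks that are out of sync.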
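III. Configure Slurm
1. Create the Slurm user
As with the munge user, the slurm user should have the same UID and GID on every node.
# groupadd -g 1109 slurm
# useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm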
2. Install the Slurm dependencies
# yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel -y
Build Slurm:
# wget https://download.schedmd.com/slurm/slurm-19.05.7.tar.bz2
# rpmbuild -ta slurm-19.05.7.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64/
Install Slurm on all nodes:
# yum localinstall slurm-*
3. Configure Slurm on the control node
# cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
# cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
# vim /etc/slurm/slurm.conf
## Modify the following settings
ControlMachine=m1
ControlAddr=192.168.1.11
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=192.168.1.11
AccountingStoragePort=6819
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
## Node names must match the hostnames of the compute nodes (c1, c2)
NodeName=c[1-2] RealMemory=3400 Sockets=1 CoresPerSocket=2 State=IDLE
PartitionName=all Nodes=c[1-2] Default=YES State=UP
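In a long slurm.conf it is easy to set the same parameter twice without noticing. The following is a small lint sketch, not a Slurm tool: it prints every parameter name that is assigned more than once. `find_duplicate_keys` is a hypothetical name, and only NodeName and PartitionName are treated as legitimately repeatable here; other repeatable parameters would need to be added to the skip list.

```shell
#!/usr/bin/env bash
# Hypothetical lint: print each parameter that is assigned more than once in
# a slurm.conf-style file. Names are compared case-insensitively.
find_duplicate_keys() {
    awk -F= '
        /^[[:space:]]*#/ || !/=/ { next }   # skip comments and lines without "="
        {
            key = $1
            gsub(/[[:space:]]/, "", key)
            lkey = tolower(key)
            # these definitions may appear on several lines
            if (lkey == "nodename" || lkey == "partitionname") next
            if (++seen[lkey] == 2) print key
        }
    ' "$1"
}

# Usage:
#   find_duplicate_keys /etc/slurm/slurm.conf
```

For the NodeName line itself, running `slurmd -C` on a compute node prints the hardware values (sockets, cores, memory) that Slurm detects, which can be compared with the values written in the file.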
4. Copy the control node's configuration files to the compute nodes
# scp /etc/slurm/*.conf c1:/etc/slurm/
# scp /etc/slurm/*.conf c2:/etc/slurm/
Create the spool and log directories on the control and compute nodes and give them to the slurm user:
# mkdir /var/spool/slurm
# chown slurm: /var/spool/slurm
# mkdir /var/log/slurm
# chown slurm: /var/log/slurm
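5. Configure Slurm accounting on the control node
Slurm's accounting records collect information about every job step. They can be written to a plain text file or to a database; because a text file keeps growing, the simplest approach is to store the records in MySQL.
Create the Slurm user in the database (install MySQL yourself beforehand):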
mysql> grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'slurm*456' with grant option;
Configure the slurmdbd.conf file (replace mysql_ip with the address of the MySQL server):
# cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=192.168.1.11
DbdHost=m1
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=mysql_ip
StorageUser=slurm
StoragePass=slurm*456
StorageLoc=slurm_acct_db
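Because slurmdbd.conf stores the database password in plain text (StoragePass), it is worth restricting the file to the Slurm user. This step is not part of the original walkthrough; the sketch below assumes GNU `stat`, and `protect_dbd_conf` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make a config file readable by its owner only and,
# if an owner is given, hand the file to that user. Prints "<mode> <owner>".
protect_dbd_conf() {
    local conf="${1:-/etc/slurm/slurmdbd.conf}"
    local owner="${2:-}"
    chmod 600 "$conf" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$conf" || return 1
    fi
    stat -c '%a %U' "$conf"
}

# Usage (as root, on the control node):
#   protect_dbd_conf /etc/slurm/slurmdbd.conf slurm:
```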
6. Start the services on the nodes
Start the slurmdbd service on the control node:
# systemctl start slurmdbd
# systemctl status slurmdbd
# systemctl enable slurmdbd
Start the slurmctld service on the control node:
# systemctl start slurmctld
# systemctl status slurmctld
# systemctl enable slurmctld
Start the slurmd service on the compute nodes:
# systemctl start slurmd
# systemctl status slurmd
# systemctl enable slurmd
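The start order matters here: slurmctld talks to slurmdbd, and slurmd registers with slurmctld. When scripting the start-up, a small wait loop like the following can hold the next step until the previous service is really active. It is only a sketch; `wait_for_unit` is a hypothetical name and the 30-second default is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll systemd once per second until a unit is active,
# giving up after `timeout` seconds.
wait_for_unit() {
    local unit="$1" timeout="${2:-30}" waited=0
    until systemctl is-active --quiet "$unit"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "$unit did not become active within ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "$unit is active"
}

# Usage (on the control node):
#   systemctl start slurmdbd && wait_for_unit slurmdbd && systemctl start slurmctld
```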
IV. Check the Slurm cluster
View the cluster:
# sinfo
# scontrol show partition
# scontrol show node
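On a larger cluster, scanning the sinfo output by eye gets tedious. As a sketch, the node list can be filtered down to nodes that are not in a usable state; `unhealthy_nodes` is a hypothetical name, and the set of states treated as healthy is an assumption that may need adjusting.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read "<node> <state>" lines on stdin and print the
# nodes whose state is not one of the states treated as healthy here.
# A trailing "*" (node not responding) deliberately fails the match.
unhealthy_nodes() {
    awk '$2 !~ /^(idle|allocated|mixed|completing)$/ { print $1 }' | sort -u
}

# Usage: -h drops the header, -N prints one line per node,
# %N is the node name and %T the extended state
#   sinfo -h -N -o '%N %T' | unhealthy_nodes
```

A node that is reported as down after its problem has been fixed can be returned to service with `scontrol update NodeName=<node> State=RESUME`.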
Submit a job:
# srun -N2 hostname
# scontrol show jobs
View jobs:
# squeue -a
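Besides the interactive srun, jobs are usually submitted as batch scripts with sbatch. The following sketch writes a minimal batch script that mirrors `srun -N2 hostname`; the function name `write_hello_job`, the job name, and the output file name are illustrative assumptions, while the partition name `all` comes from the slurm.conf above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal batch script and print its path.
write_hello_job() {
    local script="${1:-hello.sbatch}"
    cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=all
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello-%j.out

# Run one task on each allocated node; every task prints its hostname.
srun hostname
EOF
    echo "$script"
}

# Usage:
#   sbatch "$(write_hello_job)"
#   squeue -a
```

In the output file name, `%j` is replaced by the job ID, so the results of repeated submissions do not overwrite each other.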