Slurm Cluster Deployment


Slurm is an open-source workload manager for Linux and Unix systems, used by many of the world's supercomputers. Its main functions are:
1. Allocating compute-node resources to users so they can run work;
2. Providing a framework for starting, executing, and monitoring work (typically parallel jobs) on the set of allocated nodes;
3. Arbitrating contention for resources by managing a queue of pending work.

Slurm architecture:

(Architecture diagram omitted; screenshot from https://slurm.schedmd.com/quickstart.html)

PBS/Torque cluster deployment: https://www.cnblogs.com/liu-shaobo/p/13526084.html

 

I. Base Environment

1. Hostnames and IPs
Control node: 192.168.1.11 m1
Compute node: 192.168.1.12 c1
Compute node: 192.168.1.13 c2

Set the hostname on each of the three nodes (m1, c1, c2 respectively):

# hostnamectl set-hostname m1
# hostnamectl set-hostname c1
# hostnamectl set-hostname c2
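The later steps run scp and ssh against c1 and c2 by name, so every node must also be able to resolve the three hostnames. A minimal sketch of the /etc/hosts entries, assuming no DNS is available:

```
192.168.1.11 m1
192.168.1.12 c1
192.168.1.13 c2
```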

 

2. Host Configuration

OS: CentOS 7.6 x86_64
CPU: 2 cores
Memory: 4 GB


3. Disable the Firewall

# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables

 

4. Raise Resource Limits

Add the following to /etc/security/limits.conf:

# cat /etc/security/limits.conf
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited

 

5. Configure the Time Zone

Set the CST time zone:

# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Install NTP and synchronize against an NTP server:

# yum install ntp -y
# ntpdate 210.72.145.44
# systemctl start ntpd
# systemctl enable ntpd

Install the EPEL repository:

# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm

 

6. Install NFS (control node)

# yum -y install nfs-utils rpcbind

Edit the /etc/exports file:

# cat /etc/exports
/software/ *(rw,async,insecure,no_root_squash)

Start NFS:

# systemctl start rpcbind
# systemctl start nfs
# systemctl enable rpcbind
# systemctl enable nfs

Mount the NFS share on the clients (compute nodes):

# yum -y install nfs-utils
# mkdir /software
# mount 192.168.1.11:/software /software
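The mount command above does not survive a reboot. To make it persistent, an /etc/fstab entry such as the following can be added on each compute node (a sketch, assuming the same export path as above):

```
192.168.1.11:/software /software nfs defaults,_netdev 0 0
```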

 

7. Configure Passwordless SSH

On the control node, generate a key pair and copy the public key to the compute nodes:

# ssh-keygen
# ssh-copy-id -i .ssh/id_rsa.pub c1
# ssh-copy-id -i .ssh/id_rsa.pub c2

 

 

II. Configure Munge

1. Create the Munge User
The munge user must have the same UID and GID on the control node and every compute node, and Munge must be installed on all nodes:

# groupadd -g 1108 munge
# useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge


2. Feed the Entropy Pool

# yum install -y rng-tools

Use /dev/urandom as the entropy source:

# rngd -r /dev/urandom
# vim /usr/lib/systemd/system/rngd.service

Modify the following parameter:
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom

# systemctl daemon-reload
# systemctl start rngd
# systemctl enable rngd
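To confirm that rngd is actually feeding the pool, the kernel's entropy estimate can be inspected; the exact number depends on the kernel version, but it should not sit near zero once rngd is running:

```shell
# Print the kernel's current estimate of available entropy (Linux-specific path)
cat /proc/sys/kernel/random/entropy_avail
```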

 

3. Deploy Munge

Munge is an authentication service that validates the UID and GID of local or remote host processes.

# yum install munge munge-libs munge-devel -y

Create the global key
On the control node, create the key that the whole cluster will share (either command works):

# /usr/sbin/create-munge-key -r
# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

 

Copy the key to all compute nodes, then fix its ownership and permissions on every node:

# scp -p /etc/munge/munge.key root@c1:/etc/munge
# scp -p /etc/munge/munge.key root@c2:/etc/munge
# chown munge: /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key

Start Munge on all nodes:

# systemctl start munge
# systemctl enable munge

 

4. Test the Munge Service
Verify the connection between each compute node and the control node.

Generate a credential locally:

# munge -n

Decode it locally:

# munge -n | unmunge

Verify a compute node by decoding remotely:

# munge -n | ssh c1 unmunge

Benchmark Munge credential throughput:

# remunge

 

 

III. Configure Slurm

1. Create the Slurm User

# groupadd -g 1109 slurm
# useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

 

2. Install Slurm Dependencies

# yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel -y

Build the Slurm RPMs:

# wget https://download.schedmd.com/slurm/slurm-19.05.7.tar.bz2
# rpmbuild -ta slurm-19.05.7.tar.bz2
# cd /root/rpmbuild/RPMS/x86_64/

Install Slurm on all nodes:

# yum localinstall slurm-*

 

3. Configure Slurm on the Control Node

# cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
# cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
# vim /etc/slurm/slurm.conf
## Modify the following settings
ControlMachine=m1
ControlAddr=192.168.1.11
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=192.168.1.11
AccountingStoragePort=6819
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=c[1-2] RealMemory=3400 Sockets=1 CoresPerSocket=2 State=IDLE
PartitionName=all Nodes=c[1-2] Default=YES State=UP
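The RealMemory, Sockets, and CoresPerSocket values must not exceed what the compute nodes actually have, or Slurm will mark the nodes invalid. On each compute node, `slurmd -C` prints a ready-made NodeName line with the detected hardware; alternatively, a quick sketch for deriving the memory figure (in MB, from /proc/meminfo) is:

```shell
# Total physical memory in MB; RealMemory is typically set a little
# below this value to leave headroom for the OS (hence 3400 on a 4 GB VM)
awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo
```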

 

Copy the configuration files from the control node to the compute nodes:

# scp /etc/slurm/*.conf c1:/etc/slurm/
# scp /etc/slurm/*.conf c2:/etc/slurm/

Create the spool and log directories on the control and compute nodes:

# mkdir /var/spool/slurm
# chown slurm: /var/spool/slurm
# mkdir /var/log/slurm
# chown slurm: /var/log/slurm

 

4. Configure Slurm Accounting on the Control Node
Accounting records collect information about job steps for Slurm. They can be written to a plain text file or to a database; since the file grows without bound, the simplest approach is to store the information in MySQL.

Create the Slurm database user (install MySQL yourself):

mysql> grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'slurm*456' with grant option;

Configure the slurmdbd.conf file:

# cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=192.168.1.11
DbdHost=m1
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=mysql_ip
StorageUser=slurm
StoragePass=slurm*456
StorageLoc=slurm_acct_db

 

5. Start the Services on Each Node

Start the slurmdbd service on the control node:

# systemctl start slurmdbd
# systemctl status slurmdbd
# systemctl enable slurmdbd
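Once slurmdbd is up, the cluster usually has to be registered in the accounting database before accounting data will flow. A sketch, assuming ClusterName in slurm.conf was left at the example default of `cluster` (substitute your own name if you changed it):

```
# sacctmgr add cluster cluster
# sacctmgr show cluster
```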

 

Start the slurmctld service on the control node:

# systemctl start slurmctld
# systemctl status slurmctld
# systemctl enable slurmctld

 

Start the slurmd service on the compute nodes:

# systemctl start slurmd
# systemctl status slurmd
# systemctl enable slurmd

 

 

IV. Verify the Slurm Cluster

View the cluster state:

# sinfo
# scontrol show partition
# scontrol show node

Submit a job:

# srun -N2 hostname
# scontrol show jobs

View the job queue:

# squeue -a
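Beyond srun, production work is normally submitted as a batch script with sbatch. A minimal sketch (the script name test.sh and the job parameters are illustrative), using the `all` partition defined in slurm.conf:

```shell
#!/bin/bash
#SBATCH --job-name=test        # job name shown in squeue
#SBATCH --partition=all        # partition defined in slurm.conf
#SBATCH --nodes=2              # run on both compute nodes
#SBATCH --ntasks-per-node=1    # one task per node
#SBATCH --output=%j.out        # stdout file named after the job ID
srun hostname
```

Submit it with `sbatch test.sh`, then check progress with `squeue -a` and the result in the output file.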

 

