[v_act]Introduction to Slurm[/v_act]
Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of all sizes. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has the following features:
1. It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work;
2. It provides a framework for starting, executing, and monitoring work (typically parallel jobs) on the set of allocated nodes;
3. It arbitrates contention for resources by managing a queue of pending work;
4. It provides tools for job accounting and job state diagnostics (see the brief sketch after this list).
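As a rough illustration of that user-facing side, these are among the standard Slurm client commands (sacct additionally requires accounting storage to be configured on the master):
sinfo              # show partitions and node states
squeue             # list pending and running jobs
scancel <jobid>    # cancel a queued or running job
sacct -j <jobid>   # accounting statistics for a job
sdiag              # scheduler diagnostic counters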
[v_act]Environment Notes[/v_act]
OS: minimal CentOS installation; apply package and kernel updates; disable SELinux and the firewall.
Dedicated Slurm account (slurm): use the same account ID on both the master and the nodes; an ID of 200 is the suggested plan.
If the Slurm master needs to support the GUI command (sview), install a GUI environment (Server with GUI).
[v_blue]For the Slurm master installation, refer to the article linked below[/v_blue]
[neilian ids=1465]
[v_act]Slurm Node Installation[/v_act]
0. Install the EPEL repository: yum install -y epel-release && yum makecache
[root@localhost ~]# yum install -y epel-release && yum makecache
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* extras: mirrors.aliyun.com
* updates: mirrors.aliyun.com
base | 3.6 kB 00:00:00
epel | 4.7 kB 00:00:00
extras | 2.9 kB 00:00:00
updates | 2.9 kB 00:00:00
(1/3): epel/x86_64/updateinfo | 1.0 MB 00:00:00
(2/3): updates/7/x86_64/primary_db | 4.5 MB 00:00:01
(3/3): epel/x86_64/primary_db | 6.9 MB 00:00:02
......(output omitted)......
(5/9): updates/7/x86_64/other_db | 318 kB 00:00:00
(6/9): updates/7/x86_64/filelists_db | 2.4 MB 00:00:01
(7/9): base/7/x86_64/filelists_db | 7.1 MB 00:00:03
(8/9): epel/x86_64/other_db | 3.3 MB 00:00:04
(9/9): epel/x86_64/filelists_db | 12 MB 00:00:04
Metadata Cache Created
1. Set the hostname: hostnamectl set-hostname slurm-node1 # reconnect your session after setting it for the new name to take effect
[root@localhost ~]# hostnamectl set-hostname slurm-node1
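Slurm and MUNGE work best when the master and nodes can resolve each other's hostnames; without DNS, the unmunge output later in this guide may show ??? in place of a hostname. A minimal /etc/hosts sketch for every machine, assuming the master (192.168.80.250) is named slurm-master and this node is 192.168.80.251:
192.168.80.250 slurm-master
192.168.80.251 slurm-node1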
2. Configure the time service and synchronize time: CentOS 7 ships with the Chrony time service enabled by default
[root@slurm-node1 ~]# systemctl status chronyd.service
● chronyd.service - NTP client/server
Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2020-09-27 17:12:25 CST; 1min 28s ago
Docs: man:chronyd(8)
man:chrony.conf(5)
Process: 817 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
Process: 793 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 805 (chronyd)
CGroup: /system.slice/chronyd.service
└─805 /usr/sbin/chronyd
Sep 27 17:12:24 localhost.localdomain systemd[1]: Starting NTP client/server...
Sep 27 17:12:24 localhost.localdomain chronyd[805]: chronyd version 3.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNC... +DEBUG)
Sep 27 17:12:24 localhost.localdomain chronyd[805]: Frequency -6.336 +/- 109.996 ppm read from /var/lib/chrony/drift
Sep 27 17:12:25 localhost.localdomain systemd[1]: Started NTP client/server.
Sep 27 17:12:33 localhost.localdomain chronyd[805]: Selected source 203.107.6.88
Sep 27 17:12:34 localhost.localdomain chronyd[805]: Source 94.130.49.186 replaced with 108.59.2.24
Hint: Some lines were ellipsized, use -l to show in full.
a1) In /etc/chrony.conf, comment out or delete the four lines of the form "server N.centos.pool.ntp.org iburst"
a2) Add a "server <Slurm-master-IP> iburst" line pointing at the Slurm master server's IP address
# These servers were defined in the installation:
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
server 192.168.80.250 iburst
a3) Restart the Chrony service (systemctl restart chronyd.service) and verify with chronyc sources; a leading ^* marks the source Chrony is currently synchronized to
[root@slurm-node1 ~]# systemctl restart chronyd.service
[root@slurm-node1 ~]# chronyc sources
210 Number of sources = 1
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* 192.168.80.250 3 6 17 1 +1266us[+1346us] +/- 21ms
3. Deploy MUNGE: the MUNGE version currently installable online is 0.5.11
[root@slurm-node1 ~]# yum install -y munge munge-libs munge-devel
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* extras: mirrors.aliyun.com
* updates: mirrors.aliyun.com
base | 3.6 kB 00:00:00
epel | 4.7 kB 00:00:00
extras | 2.9 kB 00:00:00
updates | 2.9 kB 00:00:00
(1/3): epel/x86_64/updateinfo | 1.0 MB 00:00:00
(2/3): updates/7/x86_64/primary_db | 4.5 MB 00:00:01
(3/3): epel/x86_64/primary_db | 6.9 MB 00:00:02
Resolving Dependencies
--> Running transaction check
---> Package munge.x86_64 0:0.5.11-3.el7 will be installed
---> Package munge-devel.x86_64 0:0.5.11-3.el7 will be installed
---> Package munge-libs.x86_64 0:0.5.11-3.el7 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==========================================================================================================================================================
Package Arch Version Repository Size
==========================================================================================================================================================
Installing:
munge x86_64 0.5.11-3.el7 epel 95 k
munge-devel x86_64 0.5.11-3.el7 epel 22 k
munge-libs x86_64 0.5.11-3.el7 epel 37 k
Transaction Summary
==========================================================================================================================================================
Install 3 Packages
Total download size: 154 k
Installed size: 341 k
Downloading packages:
(1/3): munge-0.5.11-3.el7.x86_64.rpm | 95 kB 00:00:00
(2/3): munge-devel-0.5.11-3.el7.x86_64.rpm | 22 kB 00:00:00
(3/3): munge-libs-0.5.11-3.el7.x86_64.rpm | 37 kB 00:00:00
----------------------------------------------------------------------------------------------------------------------------------------------------------
Total 105 kB/s | 154 kB 00:00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : munge-libs-0.5.11-3.el7.x86_64 1/3
Installing : munge-0.5.11-3.el7.x86_64 2/3
Installing : munge-devel-0.5.11-3.el7.x86_64 3/3
Verifying : munge-0.5.11-3.el7.x86_64 1/3
Verifying : munge-devel-0.5.11-3.el7.x86_64 2/3
Verifying : munge-libs-0.5.11-3.el7.x86_64 3/3
Installed:
munge.x86_64 0:0.5.11-3.el7 munge-devel.x86_64 0:0.5.11-3.el7 munge-libs.x86_64 0:0.5.11-3.el7
Complete!
Set the expected permissions on the MUNGE directories:
[root@slurm-node1 ~]# chmod -R 0700 /etc/munge /var/log/munge && chmod -R 0711 /var/lib/munge && chmod -R 0755 /var/run/munge
c1) Copy the MUNGE key file from the master node: scp root@192.168.80.250:/etc/munge/munge.key /etc/munge/
[root@slurm-node1 ~]# scp root@192.168.80.250:/etc/munge/munge.key /etc/munge/
The authenticity of host '192.168.80.250 (192.168.80.250)' can't be established.
ECDSA key fingerprint is SHA256:2Eo2WLWyofiltEAs4nLUFLOcXLFD6YvsuPSDlEDUZGk.
ECDSA key fingerprint is MD5:3c:b0:5f:a8:af:6a:15:45:eb:a9:2a:b0:20:21:65:04.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.80.250' (ECDSA) to the list of known hosts.
root@192.168.80.250's password:
munge.key 100% 1024 251.7KB/s 00:00
c2) Set ownership and permissions on the MUNGE key file: chown munge:munge /etc/munge/munge.key && chmod 0600 /etc/munge/munge.key
[root@slurm-node1 ~]# chown munge:munge /etc/munge/munge.key && chmod 0600 /etc/munge/munge.key
c3) Start the MUNGE service and enable it at boot: systemctl start munge.service && systemctl enable munge.service
[root@slurm-node1 ~]# systemctl start munge.service && systemctl enable munge.service
Created symlink from /etc/systemd/system/multi-user.target.wants/munge.service to /usr/lib/systemd/system/munge.service.
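Optionally, verify MUNGE locally first; a working setup decodes the credential with STATUS: Success (0), just as in the cross-node test below:
[root@slurm-node1 ~]# munge -n | unmunge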
c4) Verify against the Slurm master: munge -n | ssh 192.168.80.250 unmunge
[root@slurm-node1 ~]# munge -n | ssh 192.168.80.250 unmunge
root@192.168.80.250's password:
STATUS: Success (0)
ENCODE_HOST: ??? (192.168.80.251)
ENCODE_TIME: 2020-09-27 17:24:54 +0800 (1601198694)
DECODE_TIME: 2020-09-27 17:24:56 +0800 (1601198696)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
4. Install the required build components: yum install -y rpm-build bzip2-devel openssl openssl-devel zlib-devel perl-DBI perl-ExtUtils-MakeMaker pam-devel readline-devel mariadb-devel python3 gtk2 gtk2-devel
[root@slurm-node1 ~]# yum install -y rpm-build bzip2-devel openssl openssl-devel zlib-devel perl-DBI perl-ExtUtils-MakeMaker pam-devel readline-devel mariadb-devel python3 gtk2 gtk2-devel
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* extras: mirrors.aliyun.com
* updates: mirrors.aliyun.com
Package rpm-build-4.11.3-43.el7.x86_64 already installed and latest version
Package 1:openssl-1.0.2k-19.el7.x86_64 already installed and latest version
Package perl-DBI-1.627-4.el7.x86_64 already installed and latest version
Package gtk2-2.24.31-1.el7.x86_64 already installed and latest version
Resolving Dependencies
--> Running transaction check
---> Package bzip2-devel.x86_64 0:1.0.6-13.el7 will be installed
---> Package gtk2-devel.x86_64 0:2.24.31-1.el7 will be installed
--> Processing Dependency: pango-devel >= 1.20.0-1 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: glib2-devel >= 2.28.0-1 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: cairo-devel >= 1.6.0-1 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: atk-devel >= 1.29.4-2 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: pkgconfig(pangoft2) for package: gtk2-devel-2.24.31-1.el7.x86_64
......(output omitted)......
mesa-khr-devel.x86_64 0:18.3.4-7.el7_8.1 mesa-libEGL-devel.x86_64 0:18.3.4-7.el7_8.1
mesa-libGL-devel.x86_64 0:18.3.4-7.el7_8.1 ncurses-devel.x86_64 0:5.9-14.20130511.el7_4
pango-devel.x86_64 0:1.42.4-4.el7_7 pcre-devel.x86_64 0:8.32-17.el7
perl-ExtUtils-Install.noarch 0:1.58-295.el7 perl-ExtUtils-Manifest.noarch 0:1.61-244.el7
perl-ExtUtils-ParseXS.noarch 1:3.18-3.el7 perl-devel.x86_64 4:5.16.3-295.el7
pixman-devel.x86_64 0:0.34.0-1.el7 pyparsing.noarch 0:1.5.6-9.el7
python3-libs.x86_64 0:3.6.8-13.el7 python3-pip.noarch 0:9.0.3-7.el7_7
python3-setuptools.noarch 0:39.2.0-10.el7 systemtap-sdt-devel.x86_64 0:4.0-11.el7
Complete!
5. Deploy the Slurm software
[root@slurm-node1 ~]# groupadd -g 200 slurm && useradd -u 200 -g 200 -s /sbin/nologin -M slurm
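Optionally confirm the account matches the plan from the environment notes (id slurm should report UID and GID 200, identical to the master):
[root@slurm-node1 ~]# id slurm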
[root@slurm-node1 ~]# cd && scp root@192.168.80.250:/root/slurm-20.02.5.tar.bz2 ./
root@192.168.80.250's password:
slurm-20.02.5.tar.bz2 100% 6177KB 14.6MB/s 00:00
[v_blue]The RPM packages were in fact already built once during the master-side installation, so they can simply be copied over instead: scp root@192.168.80.250:/root/rpmbuild/RPMS/x86_64/slurm-*.rpm ./[/v_blue]
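If you take that shortcut, the rpmbuild step below can be skipped and the copied packages installed directly (a sketch, assuming they landed in the current directory):
[root@slurm-node1 ~]# yum install -y slurm-*.rpm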
[root@slurm-node1 ~]# rpmbuild -ta --clean slurm-20.02.5.tar.bz2
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.yM0uEb
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-20.02.5
+ /usr/bin/bzip2 -dc /root/slurm-20.02.5.tar.bz2
+ /usr/bin/tar -xvvf -
drwxr-xr-x 1000/1000 0 2020-09-11 04:56 slurm-20.02.5/
-rw-r--r-- 1000/1000 8543 2020-09-11 04:56 slurm-20.02.5/LICENSE.OpenSSL
drwxr-xr-x 1000/1000 0 2020-09-11 04:56 slurm-20.02.5/auxdir/
-rw-r--r-- 1000/1000 306678 2020-09-11 04:56 slurm-20.02.5/auxdir/libtool.m4
-rw-r--r-- 1000/1000 5860 2020-09-11 04:56 slurm-20.02.5/auxdir/ax_gcc_builtin.m4
-rwxr-xr-x 1000/1000 15368 2020-09-11 04:56 slurm-20.02.5/auxdir/install-sh
-rw-r--r-- 1000/1000 327116 2020-09-11 04:56 slurm-20.02.5/auxdir/ltmain.sh
-rw-r--r-- 1000/1000 2630 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_freeipmi.m4
-rw-r--r-- 1000/1000 1783 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_yaml.m4
-rw-r--r-- 1000/1000 2709 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_databases.m4
-rw-r--r-- 1000/1000 2018 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_http_parser.m4
-rwxr-xr-x 1000/1000 36136 2020-09-11 04:56 slurm-20.02.5/auxdir/config.sub
-rwxr-xr-x 1000/1000 23568 2020-09-11 04:56 slurm-20.02.5/auxdir/depcomp
......(output omitted)......
Checking for unpackaged file(s): /usr/lib/rpm/check-files /root/rpmbuild/BUILDROOT/slurm-20.02.5-1.el7.x86_64
Wrote: /root/rpmbuild/SRPMS/slurm-20.02.5-1.el7.src.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-perlapi-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-devel-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-example-configs-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-slurmd-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-slurmdbd-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-libpmi-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-torque-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-openlava-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-contribs-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-pam_slurm-20.02.5-1.el7.x86_64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.ojihVc
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd slurm-20.02.5
+ rm -rf /root/rpmbuild/BUILDROOT/slurm-20.02.5-1.el7.x86_64
+ exit 0
Executing(--clean): /bin/sh -e /var/tmp/rpm-tmp.lzdQo2
+ umask 022
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-20.02.5
+ exit 0
[root@slurm-node1 ~]# cd /root/rpmbuild/RPMS/x86_64 && yum install -y slurm-*.rpm
Loaded plugins: fastestmirror, langpacks
Examining slurm-20.02.5-1.el7.x86_64.rpm: slurm-20.02.5-1.el7.x86_64
Marking slurm-20.02.5-1.el7.x86_64.rpm to be installed
Examining slurm-contribs-20.02.5-1.el7.x86_64.rpm: slurm-contribs-20.02.5-1.el7.x86_64
Marking slurm-contribs-20.02.5-1.el7.x86_64.rpm to be installed
Examining slurm-devel-20.02.5-1.el7.x86_64.rpm: slurm-devel-20.02.5-1.el7.x86_64
Marking slurm-devel-20.02.5-1.el7.x86_64.rpm to be installed
Examining slurm-example-configs-20.02.5-1.el7.x86_64.rpm: slurm-example-configs-20.02.5-1.el7.x86_64
......(output omitted)......
Verifying : slurm-libpmi-20.02.5-1.el7.x86_64 12/13
Verifying : slurm-perlapi-20.02.5-1.el7.x86_64 13/13
Installed:
slurm.x86_64 0:20.02.5-1.el7 slurm-contribs.x86_64 0:20.02.5-1.el7 slurm-devel.x86_64 0:20.02.5-1.el7
slurm-example-configs.x86_64 0:20.02.5-1.el7 slurm-libpmi.x86_64 0:20.02.5-1.el7 slurm-openlava.x86_64 0:20.02.5-1.el7
slurm-pam_slurm.x86_64 0:20.02.5-1.el7 slurm-perlapi.x86_64 0:20.02.5-1.el7 slurm-slurmctld.x86_64 0:20.02.5-1.el7
slurm-slurmd.x86_64 0:20.02.5-1.el7 slurm-slurmdbd.x86_64 0:20.02.5-1.el7 slurm-torque.x86_64 0:20.02.5-1.el7
Dependency Installed:
perl-Switch.noarch 0:2.16-7.el7
Complete!
e1) Copy the Slurm configuration file from the master node: scp root@192.168.80.250:/etc/slurm/slurm.conf /etc/slurm/
[root@slurm-node1 x86_64]# scp root@192.168.80.250:/etc/slurm/slurm.conf /etc/slurm/
root@192.168.80.250's password:
slurm.conf 100% 2127 743.6KB/s 00:00
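The copied slurm.conf must already define this node and its partition on the master side; judging from the sinfo output at the end of this guide, the relevant lines would look roughly like this sketch (CPUs=1 is a placeholder for the node's real hardware values):
NodeName=slurm-node1 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=slurm-node1 Default=YES MaxTime=INFINITE State=UP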
e2) Create the spool directory and give the slurm user ownership of it: mkdir /var/spool/slurm && chown slurm:slurm /var/spool/slurm
[root@slurm-node1 x86_64]# mkdir /var/spool/slurm && chown slurm:slurm /var/spool/slurm
e3) Start the slurmd service and enable it at boot: systemctl start slurmd.service && systemctl enable slurmd.service
[root@slurm-node1 x86_64]# systemctl start slurmd.service && systemctl enable slurmd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/slurmd.service to /usr/lib/systemd/system/slurmd.service.
[root@slurm-node1 x86_64]# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2020-09-27 17:51:10 CST; 32s ago
Main PID: 69215 (slurmd)
CGroup: /system.slice/slurmd.service
└─69215 /usr/sbin/slurmd
Sep 27 17:51:10 slurm-node1 systemd[1]: Starting Slurm node daemon...
Sep 27 17:51:10 slurm-node1 systemd[1]: Started Slurm node daemon.
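As a further check, slurmd -C prints the hardware configuration (CPU count, memory, and so on) that slurmd detects on this node, which is useful for validating the NodeName line in slurm.conf:
[root@slurm-node1 x86_64]# slurmd -C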
[v_blue]This essentially completes the installation and deployment of the Slurm node; from the master you can now check it with the sinfo command or manage it with the sview command
[root@slurm-master ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle slurm-node1
[root@slurm-master slurm]# sview
[root@slurm-master ~]# srun hostname
slurm-node1
[root@slurm-master ~]#
[/v_blue]
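As a final end-to-end test from the master, a small batch job can also be submitted (a sketch; test.sh is a hypothetical script name):
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.%j.out
#SBATCH --ntasks=1
srun hostname
Submit it with sbatch test.sh and watch it with squeue; the resulting test.<jobid>.out should contain slurm-node1.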