Deploying a Slurm GPU Cluster on Ubuntu 22.04
Deploying NIS for unified user authentication
Edit /etc/hosts on all nodes
10.172.172.81 master master.nis.local nis.local
10.172.172.21 node1
Install the NIS package on all nodes
apt -y install nis
Disable ufw and SELinux on all nodes
ufw disable
Server side
Edit /etc/default/nis to configure this host as the NIS server
NISSERVER=master
Edit /var/yp/securenets to allow the cluster subnet (format: netmask, a space, then the network)
255.255.255.0 10.172.172.0
Edit /etc/defaultdomain to set the NIS domain name
nis.local
Start and enable the services
systemctl restart rpcbind ypserv yppasswdd ypxfrd
systemctl enable rpcbind ypserv yppasswdd ypxfrd
Initialize the NIS database
root@master:~# /usr/lib/yp/ypinit -m
At this point, we have to construct a list of the hosts which will run NIS
servers. master.nis.local is in the list of NIS server hosts. Please continue
to add the names for the other hosts, one per line. When you are done with
the list, type a <control D>.
next host to add: master.nis.local
next host to add: # Ctrl + D key
The current list of NIS servers looks like this:
master.nis.local
Is this correct? [y/n: y] y
We need a few minutes to build the databases...
Building /var/yp/nis.local/ypservers...
Running /var/yp/Makefile...
gmake[1]: Entering directory '/var/yp/nis.local'
Updating passwd.byname...
Updating passwd.byuid...
Updating group.byname...
Updating group.bygid...
Updating hosts.byname...
Updating hosts.byaddr...
Updating rpc.byname...
Updating rpc.bynumber...
Updating services.byname...
Updating services.byservicename...
Updating netid.byname...
Updating protocols.bynumber...
Updating protocols.byname...
Updating netgroup...
Updating netgroup.byhost...
Updating netgroup.byuser...
Updating shadow.byname...
gmake[1]: Leaving directory '/var/yp/nis.local'
master.nis.local has been set up as a NIS master server.
Now you can run ypinit -s master.nis.local on all slave server.
Type master.nis.local, press Ctrl+D to end the list, then press Enter to confirm. If the output shows "Updating ..." lines rather than errors or failures, the NIS server has been set up successfully.
Client side
Edit /etc/yp.conf
domain nis.local server master.nis.local
Edit /etc/nsswitch.conf and append nis to the following entries
passwd: files systemd nis
group: files systemd nis
shadow: files nis
gshadow: files
hosts: files mdns4_minimal [NOTFOUND=return] dns nis
Edit /etc/defaultdomain to set the NIS domain name
nis.local
Edit /etc/pam.d/common-session and add
session optional pam_mkhomedir.so skel=/etc/skel umask=077
Start and enable the NIS client services
systemctl restart rpcbind nscd ypbind
systemctl enable rpcbind nscd ypbind
Usage
After adding a user on the master node with useradd or adduser, you only need to run the following on the master node
make -C /var/yp
to push the new account to the NIS client nodes
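To confirm that a client is bound to the NIS server and can see the published accounts, a quick check from any client node (standard yp-tools commands, not part of the original notes):
ypwhich          # should print master.nis.local
ypcat passwd     # should list the NIS-managed accounts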
About home directories
To change the default home directory used by adduser, edit the DHOME value in /etc/adduser.conf on the master node; here I set it to /share/home under the shared directory.
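For reference, the relevant line in /etc/adduser.conf would then look like this (the /share path assumes the shared filesystem set up below):
DHOME=/share/home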
Shared filesystem deployment
Building a RAID 5 array with mdadm
Use lsblk to confirm that the master node has three idle NVMe drives
Install mdadm
apt install -y mdadm
Combine the three drives into a software RAID 5 array
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/nvme0n1 /dev/nvme2n1 /dev/nvme3n1
Format it with XFS here
apt install xfsprogs -y
mkfs.xfs /dev/md0
Create the mount point /share
mkdir /share
Use blkid to get the UUID of /dev/md0; we mount by UUID so the array is mounted automatically at boot. Add the following to /etc/fstab
UUID="b006efa5-12bb-43f8-80f1-f990547863f1" /share xfs defaults 0 0
Then run mount -a
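To check the array state and read back the UUID used above (your UUID will differ from the one shown in this guide):
cat /proc/mdstat        # RAID level, member disks, and sync progress
blkid /dev/md0          # prints the UUID to put in /etc/fstab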
NFS
Server side
apt install nfs-kernel-server -y
Edit /etc/exports
/share *(rw,sync,no_subtree_check)
Then run
exportfs -ra
or restart the NFS service
systemctl restart nfs-server
for the changes to take effect
Client side
apt install nfs-common -y
mkdir /share
mount master:/share /share
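If you also want the NFS mount to survive a reboot on the clients, a minimal /etc/fstab entry (assuming the master/share names used above) would be:
master:/share /share nfs defaults 0 0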
Configuring the time synchronization service
server>
Install the time synchronization service
apt install chrony -y
Enable it at boot
systemctl enable chronyd
Edit /etc/chrony/chrony.conf, allowing the cluster's IP range
server 127.0.0.1 iburst #keep only this server entry; on the server it points at the local machine
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
allow all #networks allowed to sync from this server
local stratum 10
logdir /var/log/chrony
Restart the service
systemctl restart chronyd
client>
Edit /etc/chrony/chrony.conf
server master iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony
Restart the service
systemctl restart chronyd
Check the time sources
chronyc sources
On the server, list connected clients
chronyc clients
MUNGE deployment
Configure a cluster-wide user
Create an identical munge user on every node
export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge munge
Install dependencies
Install the following on all nodes
apt install gcc man2html libnuma-dev libpam0g-dev libjwt-dev libjson-c-dev libhttp-parser-dev libyaml-dev libhdf5-dev liblz4-dev libhwloc-dev libfreeipmi-dev libipmimonitoring-dev librrd-dev rrdtool libgtk2.0-dev liblua5.2-dev libcurl-ocaml-dev libmysqlclient-dev munge libmunge-dev build-essential libdbus-glib-1-dev libgirepository1.0-dev s-nail libpmi2-0 libpmi2-0-dev libevent-dev -y
Configure munge
On the management node, generate munge.key and set its permissions
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
Distribute the generated key to the compute nodes
scp /etc/munge/munge.key XXX
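For example, assuming the node1 host from /etc/hosts above, the copy could look like:
scp /etc/munge/munge.key node1:/etc/munge/munge.key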
Set munge ownership and permissions on all nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
Start the munge service on all nodes
systemctl restart munge
systemctl status munge
From the management node, you can verify the munge service and connectivity with each compute node
munge -n
munge -n | unmunge
munge -n | ssh NODE1 unmunge
MySQL deployment
On the master node
apt install mysql-server
systemctl start mysql
mysql_secure_installation   (you can press Enter through all the prompts)
Initialize the Slurm databases (when prompted for a password, just press Enter)
mysql -uroot -p
CREATE USER 'slurm'@'%' IDENTIFIED BY 'ize2^&*FzU6';
FLUSH privileges;
CREATE DATABASE IF NOT EXISTS slurm_acct_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE DATABASE IF NOT EXISTS slurm_jobcomp_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
GRANT ALL PRIVILEGES on slurm_acct_db.* to 'slurm'@'%';
GRANT ALL PRIVILEGES on slurm_jobcomp_db.* to 'slurm'@'%';
FLUSH privileges;
quit
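As an optional sanity check (using the slurm account created above), confirm that it can log in and sees both databases:
mysql -u slurm -p -h 127.0.0.1 -e "SHOW DATABASES;"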
Edit /etc/mysql/mysql.cnf and add
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
In /etc/mysql/mysql.conf.d/mysqld.cnf, set
bind-address = 0.0.0.0
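Restart MySQL so the bind-address and InnoDB changes take effect:
systemctl restart mysql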
编译安装pmi
openpmix项目主页 Releases · openpmix/openpmix (github.com)
下载bz2结尾的(一般bz2结尾的压缩包包含rpmbuild所需的信息)
wget https://dl.ghpig.top/https://github.com/openpmix/openpmix/releases/download/v5.0.1/pmix-5.0.1.tar.bz2
tar xf pmix-5.0.1.tar.bz2
cd pmix-5.0.1
./configure --prefix=/usr/local/pmix
make -j && make install
Slurm deployment
Compiling Slurm
Go to the Slurm download page: https://www.schedmd.com/downloads.php
Pick a reasonably stable release to download (for example https://download.schedmd.com/slurm/slurm-22.05.8.tar.bz2); this guide uses 24.05.1
wget https://download.schedmd.com/slurm/slurm-24.05.1.tar.bz2
Extract it
tar xjf slurm-24.05.1.tar.bz2
Enter the extracted directory, then configure, build, and install
cd slurm-24.05.1
./configure --prefix=/usr/local/slurm --with-pmix=/usr/local/pmix
make -j && make install
The installation prefix is /usr/local/slurm.
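Because /usr/local/slurm is a non-standard prefix, the Slurm commands are not on PATH by default; one way to expose them on every node (an addition to the original steps, adjust as needed):
echo 'export PATH=/usr/local/slurm/bin:/usr/local/slurm/sbin:$PATH' > /etc/profile.d/slurm.sh
source /etc/profile.d/slurm.sh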
Configure the Slurm systemd services
cd etc/    # the etc/ directory inside the Slurm source tree, which holds the generated .service files
chmod 755 *.service
cp *.service /lib/systemd/system/
vim /lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target remote-fs.target munge.service sssd.service
Wants=network-online.target
ConditionPathExists=/usr/local/slurm/etc/slurm.conf
[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
User=root
Group=root
RuntimeDirectory=slurmctld
RuntimeDirectoryMode=0755
ExecStart=/usr/local/slurm/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null
[Install]
WantedBy=multi-user.target
vim /lib/systemd/system/slurmdbd.service
[Unit]
Description=Slurm DBD accounting daemon
After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
Wants=network-online.target
ConditionPathExists=/usr/local/slurm/etc/slurmdbd.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmdbd
EnvironmentFile=-/etc/default/slurmdbd
User=root
Group=root
RuntimeDirectory=slurmdbd
RuntimeDirectoryMode=0755
ExecStart=/usr/local/slurm/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null
[Install]
WantedBy=multi-user.target
vim /lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target sssd.service
Wants=network-online.target
ConditionPathExists=/usr/local/slurm/etc/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
TasksMax=infinity
# Uncomment the following lines to disable logging through journald.
# NOTE: It may be preferable to set these through an override file instead.
#StandardOutput=null
#StandardError=null
[Install]
WantedBy=multi-user.target
Slurm configuration
Under /usr/local/slurm/etc/
gres.conf (this one is for node1, which has A800 GPUs; on the H100 node change Type to h100)
Name=gpu Type=a800 File=/dev/nvidia0
Name=gpu Type=a800 File=/dev/nvidia1
Name=gpu Type=a800 File=/dev/nvidia2
Name=gpu Type=a800 File=/dev/nvidia3
Name=gpu Type=a800 File=/dev/nvidia4
Name=gpu Type=a800 File=/dev/nvidia5
Name=gpu Type=a800 File=/dev/nvidia6
Name=gpu Type=a800 File=/dev/nvidia7
cgroup.conf
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
slurmdbd.conf (remember to give it 600 permissions; see the commands after this file)
DbdHost=localhost
DbdAddr=127.0.0.1
SlurmUser=root
MessageTimeout=60
DebugLevel=debug5
DefaultQOS=normal
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePort=3306
StorageUser=slurm
StoragePass=ize2^&*FzU6
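A minimal way to apply the ownership and 600 permissions mentioned above (with SlurmUser=root as configured here):
chown root:root /usr/local/slurm/etc/slurmdbd.conf
chmod 600 /usr/local/slurm/etc/slurmdbd.conf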
slurm.conf
ClusterName=cool
SlurmctldHost=master
SlurmUser=root
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurm
MailProg=/usr/bin/s-nail
ReturnToService=2
MPIDefault=none
PrologFlags=CONTAIN
ProctrackType=proctrack/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/mysql
JobCompHost=master
JobCompUser=slurm
JobCompPass=ize2^&*FzU6
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
#AccountingStorageTRES=gres/gpu,cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageTRES=gres/gpu
GresTypes=gpu
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmctldTimeout=60
SlurmdTimeout=120
InactiveLimit=0
MinJobAge=600
KillWait=10
WaitTime=10
NodeName=master RealMemory=2060000 Sockets=2 CoresPerSocket=48 ThreadsPerCore=2 Gres=gpu:h100:8 State=UNKNOWN
NodeName=node1 RealMemory=1030000 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 Gres=gpu:a800:8 State=UNKNOWN
PartitionName=master Nodes=master Default=YES MaxTime=INFINITE State=UP
PartitionName=node1 Nodes=node1 Default=NO MaxTime=INFINITE State=UP
PartitionName=all Nodes=master,node1 Default=NO MaxTime=INFINITE State=UP
On the management node, start the slurmdbd, slurmctld, and (if the management node also runs jobs) slurmd services
mkdir /var/log/slurm
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl restart slurmd
systemctl enable slurmdbd
systemctl enable slurmctld
systemctl enable slurmd
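slurm.conf points StateSaveLocation and SlurmdSpoolDir at /var/spool/slurm; if that directory does not exist yet, create it on each node before starting the daemons (an extra step implied by the configuration above):
mkdir -p /var/spool/slurm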
On the compute nodes, start the slurmd service
mkdir /var/log/slurm
systemctl restart slurmd
systemctl enable slurmd
Usage
root@zmy-h100-2:/share# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
master* up infinite 1 idle master
node1 up infinite 1 idle node1
all up infinite 2 idle master,node1
sinfo shows three partitions here: master, node1, and an all partition that combines both. Use srun to submit a job to the all partition with 2 nodes and 2 GPUs per node
srun -p all -N 2 --gres=gpu:2 nvidia-smi
Other issues
1. If sinfo shows a node STATE of drain, try running the command below
scontrol update NodeName=node0 State=RESUME
2. If sinfo shows a node state with a trailing *
try restarting the slurmd service on that node
systemctl restart slurmd