Deploying a Slurm GPU Cluster on Ubuntu 22.04

2024-03-25

Deploying NIS for unified user authentication

Edit /etc/hosts on all nodes

10.172.172.81 master master.nis.local nis.local
10.172.172.21 node1

Install the NIS package on all nodes

apt -y install nis

Disable ufw on all nodes (Ubuntu does not enable SELinux by default, so there is nothing to turn off there)

ufw disable

Server

Edit /etc/default/nis to set the NIS server role

NISSERVER=master

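Restrict which network may query the server (the netmask/network format below is the one used by /etc/ypserv.securenets)

255.255.255.0 10.172.172.0

Format: netmask, a space, then the network address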
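
The network address in that line is the host address ANDed with the netmask. A minimal sketch of that arithmetic, assuming a POSIX shell; securenets_line is a hypothetical helper name, not something shipped with the NIS package:

```shell
# Print a "netmask network" line for the subnet that contains a given address.
# Usage: securenets_line ADDRESS NETMASK
securenets_line() (
  ip=$1 mask=$2
  # Split both dotted quads on "." so that $1..$4 are the address octets
  # and $5..$8 are the mask octets. The subshell keeps the IFS change local.
  IFS=.
  set -- $ip $mask
  # AND each address octet with the matching mask octet.
  printf '%s %d.%d.%d.%d\n' "$mask" \
    "$(( $1 & $5 ))" "$(( $2 & $6 ))" "$(( $3 & $7 ))" "$(( $4 & $8 ))"
)

securenets_line 10.172.172.81 255.255.255.0   # prints 255.255.255.0 10.172.172.0
```

For the master address used in this guide the result matches the line shown above.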

Set the NIS domain name in /etc/defaultdomain

nis.local

Start the services

systemctl restart rpcbind ypserv yppasswdd ypxfrd
systemctl enable rpcbind ypserv yppasswdd ypxfrd

Initialize the database

root@master:~# /usr/lib/yp/ypinit -m
At this point, we have to construct a list of the hosts which will run NIS
servers. master.nis.local is in the list of NIS server hosts. Please continue
to add
the names for the other hosts, one per line. When you are done with the
list, type a <control D>.
next host to add: master.nis.local
next host to add: # Ctrl + D key
The current list of NIS servers looks like this:
master.nis.local
Is this correct? [y/n: y] y
We need a few minutes to build the databases...
Building /var/yp/nis.local/ypservers...
Running /var/yp/Makefile...
gmake[1]: Entering directory '/var/yp/nis.local'
Updating passwd.byname...
Updating passwd.byuid...
Updating group.byname...
Updating group.bygid...
Updating hosts.byname...
Updating hosts.byaddr...
Updating rpc.byname...
Updating rpc.bynumber...
Updating services.byname...
Updating services.byservicename...
Updating netid.byname...
Updating protocols.bynumber...
Updating protocols.byname...
Updating netgroup...
Updating netgroup.byhost...
Updating netgroup.byuser...
Updating shadow.byname...
gmake[1]: Leaving directory '/var/yp/nis.local'
master.nis.local has been set up as a NIS master server.
Now you can run ypinit -s master.nis.local on all slave server.

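Type master.nis.local, press Ctrl+D to end the list, then press Enter to confirm. If the output shows "Updating ..." lines rather than errors or failures, the NIS server has been set up successfully.

Client

Edit /etc/yp.conf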

domain nis.local server master.nis.local

Edit /etc/nsswitch.conf and append nis to the following entries

passwd: files systemd nis
group: files systemd nis
shadow: files nis
gshadow: files
hosts: files mdns4_minimal [NOTFOUND=return] dns nis
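
If you prefer to script this edit, the idea is to append nis to the four relevant databases only when it is not already listed. A minimal sketch using awk; add_nis is a hypothetical helper, and it prints the result rather than modifying the file, so review the output before replacing /etc/nsswitch.conf:

```shell
# Append "nis" to the passwd, group, shadow and hosts entries unless it is
# already present. Reads the named file, or standard input if none is given.
add_nis() {
  awk '
    /^(passwd|group|shadow|hosts):/ {
      found = 0
      for (i = 2; i <= NF; i++) if ($i == "nis") found = 1
      if (!found) $0 = $0 " nis"
    }
    { print }
  ' "$@"
}

# Demonstration on sample lines: gshadow is left alone, shadow is not doubled.
printf '%s\n' 'passwd: files systemd' 'shadow: files nis' 'gshadow: files' | add_nis
```

Running it a second time on its own output changes nothing, so the edit is safe to repeat.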

Set the NIS domain name in /etc/defaultdomain

nis.local

Add the following line to /etc/pam.d/common-session

session optional pam_mkhomedir.so skel=/etc/skel umask=077

Start the NIS client services

systemctl restart rpcbind nscd ypbind
systemctl enable rpcbind nscd ypbind

Usage

After adding a user on the master node with useradd or adduser, run the following on the master node

make -C /var/yp

to propagate the new user to the NIS client nodes.

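About home directories

To change the default home directory location used by adduser, edit the value of DHOME in /etc/adduser.conf on the master node. Here it is set to /share/home on the shared directory.

Deploying the shared file system

Building a RAID 5 array with mdadm

lsblk shows three unused NVMe drives on the master node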

Install mdadm

apt install -y mdadm

Combine the three drives into a RAID 5 array (software RAID)

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/nvme0n1 /dev/nvme2n1 /dev/nvme3n1
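
RAID 5 spends one drive's worth of space on parity, so the usable capacity is (number of drives - 1) times the smallest drive. A small sketch of that arithmetic; the drive sizes below are hypothetical placeholders, not the sizes of the drives in this cluster:

```shell
# Usable capacity of a RAID 5 array: (number of drives - 1) * smallest drive.
# Usage: raid5_usable SIZE [SIZE ...]   (all sizes in the same unit)
raid5_usable() (
  n=$# min=$1
  for size in "$@"; do
    if [ "$size" -lt "$min" ]; then min=$size; fi
  done
  echo $(( (n - 1) * min ))
)

raid5_usable 3576 3576 3576   # three hypothetical 3576 GiB drives: prints 7152
```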

Format the array with XFS

apt install xfsprogs -y
mkfs.xfs /dev/md0

Create the mount point /share

mkdir /share

Use blkid to find the UUID of /dev/md0. Mounting by UUID lets the array mount automatically at boot. Add the following to /etc/fstab

UUID="b006efa5-12bb-43f8-80f1-f990547863f1" /share xfs defaults 0 0

Then run mount -a
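
When this step is scripted, it helps to avoid appending the same entry twice. A minimal sketch under that assumption; add_fstab_entry is a hypothetical helper, and the demonstration writes to a scratch file rather than the real /etc/fstab. On a real system the UUID could come from blkid -s UUID -o value /dev/md0.

```shell
# Append an fstab entry for a UUID only if the mount point is not listed yet.
# Usage: add_fstab_entry FSTAB_FILE UUID MOUNTPOINT FSTYPE
add_fstab_entry() (
  fstab=$1 uuid=$2 mnt=$3 fstype=$4
  # Field 2 of every non-comment fstab line is the mount point.
  if awk -v m="$mnt" '$1 !~ /^#/ && $2 == m { found = 1 } END { exit !found }' "$fstab"; then
    echo "entry for $mnt already present in $fstab"
  else
    printf 'UUID=%s %s %s defaults 0 0\n' "$uuid" "$mnt" "$fstype" >> "$fstab"
  fi
)

# Try it twice on a scratch file: the second call adds nothing.
scratch=$(mktemp)
add_fstab_entry "$scratch" b006efa5-12bb-43f8-80f1-f990547863f1 /share xfs
add_fstab_entry "$scratch" b006efa5-12bb-43f8-80f1-f990547863f1 /share xfs
cat "$scratch"
```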

NFS

Server

apt install nfs-kernel-server -y

Edit /etc/exports

/share *(rw,sync,no_subtree_check)

Then run

exportfs -ra

or restart the NFS service

systemctl restart nfs-server

for the export to take effect.

Client

apt install nfs-common -y
mkdir /share
mount master:/share /share

Configuring time synchronization

Server

Install the time synchronization service

apt install chrony -y

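Enable at boot (on Ubuntu the unit is named chrony; chronyd is only an alias)

systemctl enable chrony

Edit /etc/chrony/chrony.conf (the path used by Ubuntu's chrony package) and allow the client IP range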

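# Keep only this server line; on the server it points at the local address
server 127.0.0.1 iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
# Networks that are allowed to query this server
allow all
local stratum 10
logdir /var/log/chrony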

Restart the service

systemctl restart chronyd

Client

Edit /etc/chrony/chrony.conf:

server master iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
logdir /var/log/chrony

Restart the service

systemctl restart chronyd

Check the time sources

chronyc sources

List the clients from the server

chronyc clients
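
In the chronyc sources listing, the source chronyd is currently synchronized to is the line marked ^*. A small sketch that extracts its name, useful for checking that a client really follows master; selected_source is a hypothetical helper and the sample values are made up:

```shell
# Print the name of the source chronyd is currently synchronized to.
# In `chronyc sources` output that line starts with "^*".
selected_source() {
  awk 'substr($1, 1, 2) == "^*" { print $2 }'
}

# Sample input in the layout printed by `chronyc sources` (values are made up).
# On a live client: chronyc sources | selected_source
selected_source <<'EOF'
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* master                       10   6   377    23    +12us[  +15us] +/-  130us
EOF
```

If nothing is printed on a client, no source has been selected yet, which is expected for a short while after chronyd restarts.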

Deploying MUNGE

Creating a cluster-wide user

Create the same munge user on all nodes

export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge munge

Installing dependencies

Install on all nodes

apt install gcc man2html libnuma-dev libpam0g-dev libjwt-dev libjson-c-dev libhttp-parser-dev libyaml-dev libhdf5-dev liblz4-dev libhwloc-dev libfreeipmi-dev libipmimonitoring-dev libfreeipmi-dev librrd-dev rrdtool libgtk2.0-dev liblua5.2-dev libcurl-ocaml-dev libmysqlclient-dev munge libmunge-dev build-essential libdbus-glib-1-dev libgirepository1.0-dev s-nail libpmi2-0 libpmi2-0-dev libevent-dev -y

Configuring MUNGE

Generate munge.key on the management node and set its permissions

dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key 
chmod 400 /etc/munge/munge.key
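
Before distributing the key it is worth confirming that it has the mode and size the commands above should produce. A minimal sketch, assuming GNU stat as shipped with Ubuntu; check_key is a hypothetical helper, and the demonstration uses a scratch file instead of /etc/munge/munge.key:

```shell
# Report whether a key file has mode 400 and the expected size in bytes.
# Usage: check_key FILE EXPECTED_BYTES
check_key() (
  mode=$(stat -c '%a' "$1") || exit 1
  size=$(stat -c '%s' "$1") || exit 1
  if [ "$mode" = 400 ] && [ "$size" -eq "$2" ]; then
    echo "ok: $1 (mode $mode, $size bytes)"
  else
    echo "unexpected: $1 has mode $mode and $size bytes" >&2
    exit 1
  fi
)

# Demonstrate on a scratch key generated the same way as in the step above.
key=$(mktemp)
dd if=/dev/urandom bs=1 count=1024 2>/dev/null > "$key"
chmod 400 "$key"
check_key "$key" 1024
```

The same check can be run on each compute node after copying the key, since a key with looser permissions is a common reason for munged refusing to start.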

Distribute the generated key to the compute nodes

scp /etc/munge/munge.key XXX

Set ownership and permissions of the MUNGE directories on all nodes

chown -R munge: /etc/munge/ /var/log/munge/ 
chmod 0700 /etc/munge/ /var/log/munge/

Start the munge service on all nodes

systemctl restart munge
systemctl status munge

From the management node you can verify the MUNGE service locally and its connectivity to each compute node

munge -n 
munge -n | unmunge 
munge -n | ssh NODE1 unmunge

Deploying MySQL

Master node

apt install mysql-server
systemctl start mysql
mysql_secure_installation   # pressing Enter at every prompt is fine

Create the Slurm databases (when prompted for a password, just press Enter)

mysql -uroot -p 
CREATE USER 'slurm'@'%' IDENTIFIED BY 'ize2^&*FzU6';
FLUSH privileges;
CREATE DATABASE IF NOT EXISTS slurm_acct_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE DATABASE IF NOT EXISTS slurm_jobcomp_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
GRANT ALL PRIVILEGES on slurm_acct_db.* to 'slurm'@'%';
GRANT ALL PRIVILEGES on slurm_jobcomp_db.* to 'slurm'@'%';
FLUSH privileges;
quit
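
The password above contains ^, & and *, which a shell would interpret if the statements were passed unquoted on a command line. One way to script the account setup safely is to let printf insert the values literally. A minimal sketch; slurm_sql is a hypothetical helper, and it does not escape single quotes inside the password:

```shell
# Print the account-setup SQL for a given user and password.
# printf inserts the arguments literally, so characters such as ^ & * in the
# password are never interpreted by the shell. A single quote in the password
# is NOT escaped here, so avoid one.
slurm_sql() {
  printf "CREATE USER '%s'@'%%' IDENTIFIED BY '%s';\n" "$1" "$2"
  printf "GRANT ALL PRIVILEGES ON slurm_acct_db.* TO '%s'@'%%';\n" "$1"
  printf "GRANT ALL PRIVILEGES ON slurm_jobcomp_db.* TO '%s'@'%%';\n" "$1"
  printf 'FLUSH PRIVILEGES;\n'
}

# Show the generated statements for the credentials used in this guide.
slurm_sql slurm 'ize2^&*FzU6'
```

To apply the statements instead of printing them, the output can be piped into the client, for example slurm_sql slurm 'PASSWORD' | mysql -uroot -p, after the two databases have been created.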

Add the following to /etc/mysql/mysql.cnf (newer versions no longer need these parameters)

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

In /etc/mysql/mysql.conf.d/mysqld.cnf, set

bind-address = 0.0.0.0

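Building and installing PMIx

OpenPMIx project releases page: Releases · openpmix/openpmix (github.com)

Download the archive ending in .bz2 (the .bz2 archives generally include the information rpmbuild needs)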

wget https://dl.ghpig.top/https://github.com/openpmix/openpmix/releases/download/v5.0.1/pmix-5.0.1.tar.bz2
tar xf pmix-5.0.1.tar.bz2
cd pmix-5.0.1
./configure  --prefix=/usr/local/pmix
make -j && make install

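Deploying Slurm

Building Slurm

Go to the Slurm download page https://www.schedmd.com/downloads.php

Pick a reasonably stable release, for example https://download.schedmd.com/slurm/slurm-22.05.8.tar.bz2. The commands below use 24.05.1.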

wget https://download.schedmd.com/slurm/slurm-24.05.1.tar.bz2

Extract the archive

tar xjf slurm-24.05.1.tar.bz2

Enter the extracted directory, then build and install

./configure --prefix=/usr/local/slurm --with-pmix=/usr/local/pmix
make -j && make install
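
Note that make -j with no number places no limit on parallel jobs, which can exhaust memory on a large build. A common alternative is to cap the job count at the number of CPUs. A small sketch, assuming nproc from coreutils is available; njobs is just an illustrative variable name:

```shell
# Pick a bounded job count for make: one job per available CPU,
# falling back to 1 if nproc is not installed.
njobs=$(nproc 2>/dev/null || echo 1)

# Print the build command instead of running it, so this is safe to try anywhere.
echo "make -j$njobs && make install"
```

The same bounded form applies to the PMIx build earlier.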

The install prefix is set to /usr/local/slurm

Setting up the Slurm systemd services

cd etc/
chmod 755 *.service
cp *.service /lib/systemd/system/

vim /lib/systemd/system/slurmctld.service

[Unit]
Description=Slurm controller daemon
After=network-online.target remote-fs.target munge.service sssd.service
Wants=network-online.target
ConditionPathExists=/usr/local/slurm/etc/slurm.conf

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/slurmctld
EnvironmentFile=-/etc/default/slurmctld
User=root
Group=root
RuntimeDirectory=slurmctld
RuntimeDirectoryMode=0755
ExecStart=/usr/local/slurm/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity

[Install]
WantedBy=multi-user.target

vim /lib/systemd/system/slurmdbd.service

[Unit]
Description=Slurm DBD accounting daemon
After=network-online.target remote-fs.target munge.service mysql.service mysqld.service mariadb.service sssd.service
Wants=network-online.target
ConditionPathExists=/usr/local/slurm/etc/slurmdbd.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmdbd
EnvironmentFile=-/etc/default/slurmdbd
User=root
Group=root
RuntimeDirectory=slurmdbd
RuntimeDirectoryMode=0755
ExecStart=/usr/local/slurm/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
TasksMax=infinity

[Install]
WantedBy=multi-user.target

vim /lib/systemd/system/slurmd.service

[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target sssd.service
Wants=network-online.target
ConditionPathExists=/usr/local/slurm/etc/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
TasksMax=infinity

[Install]
WantedBy=multi-user.target

Slurm configuration

The following files go under /usr/local/slurm/etc/

gres.conf (shown for node1, an A800 node; on an H100 node change Type to h100)

Name=gpu Type=a800 File=/dev/nvidia0
Name=gpu Type=a800 File=/dev/nvidia1
Name=gpu Type=a800 File=/dev/nvidia2
Name=gpu Type=a800 File=/dev/nvidia3
Name=gpu Type=a800 File=/dev/nvidia4
Name=gpu Type=a800 File=/dev/nvidia5
Name=gpu Type=a800 File=/dev/nvidia6
Name=gpu Type=a800 File=/dev/nvidia7
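
Since the lines differ only in the device index, they can be generated per node type. A minimal sketch that assumes the GPUs appear as /dev/nvidia0 through /dev/nvidia(N-1); gen_gres is a hypothetical helper:

```shell
# Print one gres.conf line per GPU device, /dev/nvidia0 .. /dev/nvidia(N-1).
# Usage: gen_gres TYPE COUNT
gen_gres() (
  i=0
  while [ "$i" -lt "$2" ]; do
    printf 'Name=gpu Type=%s File=/dev/nvidia%d\n' "$1" "$i"
    i=$((i + 1))
  done
)

# Lines for an 8-GPU H100 node such as master.
gen_gres h100 8
```

Redirect the output to gres.conf on the matching node. Slurm also accepts a bracketed range such as File=/dev/nvidia[0-7] on a single line, which is a more compact way to express the same thing.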

cgroup.conf

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

slurmdbd.conf (remember to set its permissions to 600)

DbdHost=localhost
DbdAddr=127.0.0.1
SlurmUser=root
MessageTimeout=60
DebugLevel=debug5
DefaultQOS=normal
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePort=3306
StorageUser=slurm
StoragePass=ize2^&*FzU6

slurm.conf

ClusterName=cool
SlurmctldHost=master
SlurmUser=root


SlurmctldPort=6817
SlurmdPort=6818


StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurm
MailProg=/usr/bin/s-nail
ReturnToService=2

MPIDefault=none

PrologFlags=CONTAIN
ProctrackType=proctrack/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

TaskPlugin=task/cgroup


SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log

JobCompType=jobcomp/mysql
JobCompHost=master
JobCompUser=slurm
JobCompPass=ize2^&*FzU6
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30


AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
#AccountingStorageTRES=gres/gpu,cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageTRES=gres/gpu
GresTypes=gpu

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

SlurmctldTimeout=60
SlurmdTimeout=120
InactiveLimit=0
MinJobAge=600
KillWait=10
WaitTime=10


NodeName=master RealMemory=2060000 Sockets=2 CoresPerSocket=48 ThreadsPerCore=2 Gres=gpu:h100:8 State=UNKNOWN
NodeName=node1 RealMemory=1030000 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 Gres=gpu:a800:8 State=UNKNOWN
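
The Sockets, CoresPerSocket and ThreadsPerCore values have to multiply out to the number of logical CPUs on the node. A small sketch that computes that product from a NodeName line so it can be compared with nproc on the node; node_cpus is a hypothetical helper:

```shell
# Compute the logical CPU count implied by a NodeName line:
# Sockets * CoresPerSocket * ThreadsPerCore (each defaults to 1 if absent).
node_cpus() {
  awk '{
    s = 1; c = 1; t = 1
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "Sockets") s = kv[2]
      if (kv[1] == "CoresPerSocket") c = kv[2]
      if (kv[1] == "ThreadsPerCore") t = kv[2]
    }
    print s * c * t
  }'
}

# The master definition from this guide: 2 * 48 * 2, prints 192.
echo 'NodeName=master RealMemory=2060000 Sockets=2 CoresPerSocket=48 ThreadsPerCore=2 Gres=gpu:h100:8 State=UNKNOWN' | node_cpus
```

Running slurmd -C on a node prints the hardware layout Slurm detects there, which is a convenient source for these values.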

PartitionName=master Nodes=master Default=YES MaxTime=INFINITE State=UP
PartitionName=node1 Nodes=node1 Default=NO MaxTime=INFINITE State=UP
PartitionName=all Nodes=master,node1 Default=NO MaxTime=INFINITE State=UP

On the management node, start slurmdbd, slurmctld, and slurmd (slurmd only if the management node also runs jobs)

mkdir /var/log/slurm
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl restart slurmd
systemctl enable slurmdbd
systemctl enable slurmctld
systemctl enable slurmd

On the compute nodes, start slurmd

mkdir /var/log/slurm
systemctl restart slurmd
systemctl enable slurmd

Usage

root@zmy-h100-2:/share# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
master* up infinite 1 idle master
node1 up infinite 1 idle node1
all up infinite 2 idle master,node1

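sinfo lists three partitions: master, node1, and all, which combines the two. The following srun command submits a job to the all partition across 2 nodes, requesting 2 GPUs on each node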

srun -p all -N 2 --gres=gpu:2 nvidia-smi
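
For anything longer than a one-off command, the same request is usually written as a batch script. A minimal sketch equivalent to the srun line above; the job name and output file name are illustrative choices, and the script is only submitted when sbatch is actually installed:

```shell
# Write a batch script that requests the same resources as the srun command.
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --partition=all
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --output=gpu-check-%j.out
# One task per node; each lists the GPUs visible to the job on that node.
srun --ntasks-per-node=1 nvidia-smi -L
EOF

# Submit only where Slurm is installed; otherwise just show the script.
if command -v sbatch >/dev/null 2>&1; then
  sbatch "$job"
else
  cat "$job"
fi
```

With ConstrainDevices=yes in cgroup.conf, each node should list only the 2 GPUs allocated to the job rather than all 8.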

Other issues

1. If sinfo shows a node in the drain state, try the following command (replace node0 with the name of the affected node)

scontrol update NodeName=node0 State=RESUME

2. If sinfo shows a node state with a trailing *

Try restarting the slurmd service on that node

systemctl restart slurmd
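
Both situations can be spotted in one pass over the node list. A minimal sketch that filters "node state" pairs for drained or draining nodes and for states ending in *; problem_nodes is a hypothetical helper and the sample states are made up:

```shell
# Read "NODE STATE" pairs and print nodes that are drained/draining or whose
# state ends in "*" (the marker sinfo uses for a node that is not responding).
problem_nodes() {
  awk '$2 ~ /^drain/ || $2 ~ /\*$/ { print $1 }'
}

# Sample input in the shape of: sinfo -N -h -o "%N %T"
# (node2 is a made-up node used only to show the "*" case).
problem_nodes <<'EOF'
master idle
node1 drained
node2 idle*
EOF
```

On a live cluster the input would come from sinfo -N -h -o "%N %T" | problem_nodes | sort -u, where sort -u removes the duplicates that appear when a node belongs to several partitions.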