Deploying Prometheus Monitoring

2024-04-03
2 min read

Monitored hosts

Basic metrics collector (CPU, memory, network, disk)

Download node_exporter from the Prometheus download page (Download | Prometheus).


wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
mv node_exporter /usr/bin/
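Before extracting a downloaded release archive, it can be worth comparing its checksum with the value published alongside the release. The following is a minimal sketch, not part of the original steps: `verify_sha256` is a hypothetical helper name, and the expected hash is assumed to be copied by hand from the release page.

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
# Returns 0 when they match and non-zero otherwise.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Usage (the hash placeholder must be replaced with the published value):
#   verify_sha256 node_exporter-1.7.0.linux-amd64.tar.gz "<hash from the release page>" \
#     && tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
```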

Create /usr/lib/systemd/system/node_exporter.service:

[Unit]
Description=node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
ExecStart=/usr/bin/node_exporter --web.listen-address=:9100

[Install]
WantedBy=multi-user.target

Start the node_exporter service:

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
systemctl status node_exporter
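Once the service is running, a quick way to confirm that it actually serves metrics is to fetch the /metrics endpoint and look for a known metric name. This is a small sketch under the assumption that the exporter listens on port 9100 as configured in the unit file; `has_metric` is a hypothetical helper, not part of node_exporter.

```shell
# Hypothetical helper: read Prometheus text-format output on stdin and
# succeed if any line starts with the given metric name.
has_metric() {
  grep -q "^$1"
}

# Usage against the local exporter:
#   curl -s http://localhost:9100/metrics | has_metric node_cpu_seconds_total \
#     && echo "node_exporter is serving metrics"
```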

Docker collector

Enable IPv4 forwarding:

echo -e "net.ipv4.ip_forward = 1\nnet.ipv4.conf.default.rp_filter = 0 \nnet.ipv4.conf.all.rp_filter = 0" >> /etc/sysctl.conf
sysctl -p
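Because the command above appends with `>>`, running it a second time leaves duplicate entries in /etc/sysctl.conf. One possible way to make the step repeatable is sketched here; `ensure_line` is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so repeated runs do not create duplicates.
ensure_line() {
  local file="$1" line="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (run as root, then reload with sysctl -p):
#   ensure_line /etc/sysctl.conf "net.ipv4.ip_forward = 1"
#   ensure_line /etc/sysctl.conf "net.ipv4.conf.default.rp_filter = 0"
#   ensure_line /etc/sysctl.conf "net.ipv4.conf.all.rp_filter = 0"
```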

Install nvidia-container-runtime:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
yum install -y nvidia-container-runtime

Run cAdvisor (dedicated to monitoring Docker containers):

docker run -d -p 8080:8080 --name cadvisor  --privileged=true -v /:/rootfs:ro -v /var/run:/var/run:rw -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro google/cadvisor:latest

GPU collector

nvidia_gpu_exporter releases: https://github.com/utkuozdemir/nvidia_gpu_exporter/releases

yum install -y https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.2.0/nvidia-gpu-exporter_1.2.0_linux_amd64.rpm
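nvidia_gpu_exporter gathers its data by calling the `nvidia-smi` binary, so the exporter has nothing to report if that command is missing. A minimal pre-check might look like the sketch below; `require_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the named command is found on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Usage:
#   require_cmd nvidia-smi || echo "nvidia-smi not found; install the NVIDIA driver first"
```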

Monitoring server

We deploy the Prometheus server with Docker.

Enable IPv4 forwarding:

echo -e "net.ipv4.ip_forward = 1\nnet.ipv4.conf.default.rp_filter = 0 \nnet.ipv4.conf.all.rp_filter = 0" >> /etc/sysctl.conf
sysctl -p

Create /opt/prometheus/prometheus.yml:

# my global config
global:
  scrape_interval:     15s # how often metrics are scraped from the monitored targets
  evaluation_interval: 15s # how often alerting rules are evaluated

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Alerting rule files
rule_files:
  # - "first_rules.yml"
# Scrape configuration for the monitored hosts
scrape_configs:

  - job_name: 'GPU'
    static_configs:
    - targets: ['10.1.0.69:9835']
      labels:
        instance: node60

  - job_name: "Docker"
    static_configs:
    - targets: ['10.1.0.69:8080']
      labels:
        instance: node60

  - job_name: "Linux"
    static_configs:
    - targets: ['10.1.0.69:9100']
      labels:
        instance: node60
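Before starting Prometheus, it can help to confirm that every target in the config is reachable from the monitoring server. The sketch below only extracts the targets; it assumes they are written as IPv4 `host:port` pairs, as in the file above, and it would also pick up any such pair inside a comment. `list_targets` is a hypothetical helper name.

```shell
# Hypothetical helper: print every IPv4 host:port pair found in a
# Prometheus config file, one per line, in file order.
list_targets() {
  grep -oE '[0-9]+(\.[0-9]+){3}:[0-9]+' "$1"
}

# Usage: probe the /metrics endpoint of each target.
#   list_targets /opt/prometheus/prometheus.yml | while read -r t; do
#     curl -sf -o /dev/null "http://$t/metrics" && echo "$t ok" || echo "$t unreachable"
#   done
```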

Run Prometheus:

docker run -d  --name=prometheus -p 9090:9090 -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Open http://10.1.1.249:9090/ in a browser.
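Besides the web UI, the Prometheus HTTP API at /api/v1/targets reports the health of each scrape target. The sketch below counts how many are up; it assumes the API returns compact JSON without spaces around the colon, and `count_up_targets` is a hypothetical helper name.

```shell
# Hypothetical helper: read the JSON from /api/v1/targets on stdin and
# print how many targets report "health":"up".
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Usage (with the three jobs configured above, 3 would mean all targets are up):
#   curl -s http://10.1.1.249:9090/api/v1/targets | count_up_targets
```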


Grafana dashboards

Run Grafana:

docker  run -d --name=grafana  -p 3000:3000  grafana/grafana

The default username and password are both admin.
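To check from the command line that the Grafana container has finished starting, its /api/health endpoint can be queried. This is a sketch that assumes the response JSON contains a `database` field set to `ok` when Grafana is healthy; `grafana_healthy` is a hypothetical helper name.

```shell
# Hypothetical helper: read Grafana's /api/health JSON on stdin and
# succeed if the database field is "ok" (whitespace around ':' is allowed).
grafana_healthy() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage:
#   curl -s http://localhost:3000/api/health | grafana_healthy && echo "Grafana is up"
```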


Change the language and time zone


Add a data source


Enter the Prometheus address from above (http://10.1.1.249:9090) and click Save.


If the address is correct, a confirmation pop-up appears.
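The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request body; the data source name "Prometheus" and the helper name `datasource_payload` are assumptions made for illustration, and the credentials in the usage comment are the defaults mentioned earlier.

```shell
# Hypothetical helper: print the JSON body for Grafana's
# POST /api/datasources endpoint, given a display name and a Prometheus URL.
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy"}\n' "$1" "$2"
}

# Usage:
#   curl -s -u admin:admin -H 'Content-Type: application/json' \
#     -X POST http://localhost:3000/api/datasources \
#     -d "$(datasource_payload Prometheus http://10.1.1.249:9090)"
```

Since Grafana runs in its own container here, the URL should be the host address of the Prometheus server rather than localhost.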

Add a dashboard


Choose to import a dashboard.


Find a dashboard you like on Dashboards | Grafana Labs.


Open it and copy its ID.


Paste the ID into the corresponding field and click Load.


Select our data source and click Import. That is all.


This dashboard shows the data collected by node_exporter. You can add dashboards for the GPU and Docker collectors in the same way.