Enable/Disable Alertmanager Alerting
For environments running version 6.5.0, LTS 6.1.94, or later, see Enable/Disable the Observability System.
Configuration Parameters
Option | Type | Example | Description |
---|---|---|---|
alertmanagerConfig | Composite structure | See the templates in this document | Alertmanager alerting configuration |
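Procedure
Enable Alertmanager monitoring alerts in ONES
Enter the running ONES container
ones-ai-k8s.sh
1. Receive Alertmanager alerts via Webhook
vim config/private.yaml
# The following creates two receivers, each with its own webhook address (more receivers can be added)
# Alerts with priority P0-P2 are sent to the Critical receiver; all other alerts go to Default
# Add the following configuration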
alertmanagerConfig: |
  "global":
    "resolve_timeout": "1m"
  "inhibit_rules":
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "critical"
    "target_match_re":
      "severity": "warning|info"
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "warning"
    "target_match_re":
      "severity": "info"
  "receivers":
  - "name": "Default"
    "webhook_configs":
    - url: 'http://webhook-address:8129/Alert'
  - "name": "Critical"
    "webhook_configs":
    - url: 'http://webhook-address:8130/Alert'
  "route":
    "group_by":
    - "namespace"
    "group_interval": "1m"
    "group_wait": "30s"
    "receiver": "Default"
    "repeat_interval": "12h"
    "routes":
    - "match":
        "alertname": "Default"
      "receiver": "Default"
    - "match":
        "priority": "P0"
      "receiver": "Critical"
      "repeat_interval": "10m"
    - "match":
        "priority": "P1"
      "receiver": "Critical"
      "repeat_interval": "30m"
    - "match":
        "priority": "P2"
      "receiver": "Critical"
      "repeat_interval": "2h"
  "templates":
  - '/etc/alertmanager/config/default.tmpl'
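To see how the route block above dispatches an alert, here is a minimal Python sketch of first-match routing. It assumes no child route sets continue: true (none in this example does) and that a child route without its own repeat_interval inherits the parent's. ROOT, ROUTES and pick_route are hypothetical names that mirror the YAML; this is an illustration, not Alertmanager's implementation.

```python
# Hypothetical mirror of the root route and child routes in the webhook example.
ROOT = {"receiver": "Default", "repeat_interval": "12h"}
ROUTES = [
    {"match": {"alertname": "Default"}, "receiver": "Default"},
    {"match": {"priority": "P0"}, "receiver": "Critical", "repeat_interval": "10m"},
    {"match": {"priority": "P1"}, "receiver": "Critical", "repeat_interval": "30m"},
    {"match": {"priority": "P2"}, "receiver": "Critical", "repeat_interval": "2h"},
]


def pick_route(labels, routes=ROUTES, root=ROOT):
    """Return (receiver, repeat_interval) for an alert's label set."""
    for route in routes:
        # A child route matches when every label in its "match" block is equal.
        if all(labels.get(key) == value for key, value in route["match"].items()):
            interval = route.get("repeat_interval", root["repeat_interval"])
            return (route["receiver"], interval)
    # No child matched: the alert is handled by the root route.
    return (root["receiver"], root["repeat_interval"])
```

With the routes as written, an alert labelled priority: P3 matches no child route and is delivered to Default with the 12h repeat interval.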
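2. Send Alertmanager alerts by email (SMTP)
# To use email notifications in Alertmanager, define the SMTP settings
# and set the recipient address in the receiver. The SMTP settings can be defined once, in the global section of the configuration file
# If you use a QQ Mail account as the sender, which uses SSL by default, add "smtp_require_tls": false to the global section
# Add the following configuration
vim config/private.yaml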
alertmanagerConfig: |
  "global":
    "resolve_timeout": "5m"
    "smtp_smarthost": 'smtp.163.com:25'
    "smtp_from": 'example@163.com'
    "smtp_auth_username": 'example@163.com'
    "smtp_auth_password": 'yourpassword'
  "inhibit_rules":
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "critical"
    "target_match_re":
      "severity": "warning|info"
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "warning"
    "target_match_re":
      "severity": "info"
  "receivers":
  - "name": "Default"
    "email_configs":
    - "to": "example@163.com"
  - "name": "Watchdog"
  - "name": "Critical"
    "email_configs":
    - "to": "example@163.com"
  "route":
    "group_by":
    - "namespace"
    "group_interval": "1m"
    "group_wait": "30s"
    "receiver": "Default"
    "repeat_interval": "12h"
    "routes":
    - "match":
        "alertname": "Watchdog"
      "receiver": "Watchdog"
    - "match":
        "priority": "P0"
      "receiver": "Critical"
      "repeat_interval": "10m"
    - "match":
        "priority": "P1"
      "receiver": "Critical"
      "repeat_interval": "30m"
    - "match":
        "priority": "P2"
      "receiver": "Critical"
      "repeat_interval": "2h"
  "templates":
  - '/etc/alertmanager/config/default.tmpl'
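Before relying on Alertmanager to deliver mail, it can help to confirm that the SMTP account works on its own. The sketch below uses only the Python standard library to build a test message and send it through a smarthost. The function names are hypothetical, the values in the usage comment are the placeholders from the YAML above, and STARTTLS is attempted only when the server advertises it; treat it as a starting point rather than a verified procedure for any particular mail provider.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a plain-text test mail (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Alertmanager SMTP test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_message(host, port, username, password, recipient):
    """Log in to the smarthost and send one test mail."""
    msg = build_test_message(username, recipient)
    with smtplib.SMTP(host, port, timeout=10) as server:
        server.ehlo()
        # Upgrade to TLS only if the server offers STARTTLS.
        if server.has_extn("starttls"):
            server.starttls()
            server.ehlo()
        server.login(username, password)
        server.send_message(msg)


# Usage with the placeholder values from the example configuration:
# send_test_message("smtp.163.com", 25, "example@163.com",
#                   "yourpassword", "example@163.com")
```

If the login or send step raises an exception here, the same credentials are unlikely to work from Alertmanager either.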
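3. Send alerts through a WeCom (Enterprise WeChat) group bot
# The following creates two receivers; Critical is configured with two webhook addresses
# Alerts are sent to the in-cluster WeCom Bot WebhookService, which forwards them to the WeCom bot
# P0-P1 alerts are sent to the Critical receiver
# P2-P3 alerts are sent to the Default receiver
# Add the following configuration
vim config/private.yaml
enableAlertWechatRobot: true # Enable the in-cluster WeCom alert bot; the service forwards alerts to the WeCom bot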
alertmanagerConfig: |
  "global":
    "resolve_timeout": "10m"
  "inhibit_rules":
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "critical"
    "target_match_re":
      "severity": "warning|info"
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "warning"
    "target_match_re":
      "severity": "info"
  "receivers":
  - "name": "Default"
    "webhook_configs":
    - "url": "http://wechatrobot-svc.monitoring-system:13100/webhook?key=YOUR_WECOM_BOT_WEBHOOK_KEY"
      "send_resolved": true
  - "name": "Critical"
    "webhook_configs":
    - "url": "http://wechatrobot-svc.monitoring-system:13100/webhook?key=YOUR_WECOM_BOT_WEBHOOK_KEY"
      "send_resolved": true
    - "url": "http://wechatrobot-svc.monitoring-system:13100/webhook?key=YOUR_WECOM_BOT_WEBHOOK_KEY"
      "send_resolved": true
  "route":
    "group_by": ['...']
    "group_interval": "5m"
    "group_wait": "1s"
    "receiver": "Default"
    "repeat_interval": "12h"
    "routes":
    - "match":
        "alertname": "Default"
      "receiver": "Default"
      "repeat_interval": "12h"
    - "match":
        "priority": "P0"
      "receiver": "Critical"
      "repeat_interval": "10m"
    - "match":
        "priority": "P1"
      "receiver": "Critical"
      "repeat_interval": "10m"
    - "match":
        "priority": "P2"
      "receiver": "Default"
      "repeat_interval": "2h"
    - "match":
        "priority": "P3"
      "receiver": "Default"
      "repeat_interval": "12h"
  "templates":
  - "/etc/alertmanager/config/default.tmpl"
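More alert receiver configurations
The ONES Alertmanager alerting configuration is highly compatible with the upstream Alertmanager configuration.
See: the official Alertmanager configuration documentation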
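Because the configuration follows upstream Alertmanager syntax, upstream validation applies as well: every receiver named in a route has to be defined under receivers, otherwise Alertmanager refuses to load the configuration. The sketch below checks this on a configuration that has already been parsed into a Python dict (for example with a YAML library, not shown here). undefined_receivers is a hypothetical helper, not part of ONES or Alertmanager.

```python
def undefined_receivers(config):
    """Return sorted receiver names used in routes but missing from `receivers`."""
    defined = {receiver["name"] for receiver in config.get("receivers", [])}
    used = set()

    def walk(route):
        # Collect the receiver of this route, then descend into child routes.
        if "receiver" in route:
            used.add(route["receiver"])
        for child in route.get("routes", []):
            walk(child)

    walk(config.get("route", {}))
    return sorted(used - defined)
```

An empty list means every route points at a defined receiver; any name returned should be added to receivers or corrected in the route before applying the configuration.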
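Configure a test rule to verify alerting
After verification, we recommend removing this rule to avoid receiving frequent test notifications.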
testRules: |
  - name: ones-alert-health
    rules:
    - alert: ONESAlertHealth
      annotations:
        message: Alert delivery is working.
      expr: |
        absent(up{job="kube-scheduler"} == 1)
      for: 15m
      labels:
        severity: critical
        priority: P0
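The expression in this rule returns a result only when no kube-scheduler target reports up == 1, and because of for: 15m the alert first sits in the pending state before it fires. A rough Python sketch of that transition, with a hypothetical alert_state helper and times in seconds (an illustration of the rule semantics, not Prometheus code):

```python
FOR_SECONDS = 15 * 60  # mirrors `for: 15m` in the test rule


def alert_state(true_since, now, for_seconds=FOR_SECONDS):
    """Classify an alert given when its expression first became true.

    true_since is None when the expression currently returns no result.
    """
    if true_since is None:
        return "inactive"
    if now - true_since < for_seconds:
        return "pending"
    return "firing"
```

In practice this means the first test notification should not be expected until roughly 15 minutes after the rule is loaded, plus the group_wait of the matching route.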
Apply the configuration
make setup-prometheus && make setup-monitoring-system
Sample output
bash-5.0# make setup-prometheus && make setup-monitoring-system
python3 ./script/python/ones/cmd/setupapp.py --app=monitoring-system --version=v1 --ones-path=
2023-08-11 09:41:03,026 [INFO] ones_path=, k8s_root_dir=/data/ones/ones-ai-k8s
2023-08-11 09:41:03,094 [INFO] render config
2023-08-11 09:41:04,862 [WARNING] /data/ones/ones-ai-k8s/private-overlay-templates/monitoring-system/v1 not found, skip
2023-08-11 09:41:04,862 [INFO] ones_path=, k8s_root_dir=/data/ones/ones-ai-k8s
2023-08-11 09:41:04,930 [INFO] compatible overlay
2023-08-11 09:41:06,691 [INFO] collect dir info
2023-08-11 09:41:06,695 [INFO] setup global resrouces
2023-08-11 09:41:07,804 [INFO] gen registry credentials
2023-08-11 09:41:07,808 [INFO] setup namespace, registry-secret
2023-08-11 09:41:08,202 [INFO] setup local-storage static-pvc, static-pv (cache config to tmp/config.yaml)
storageclass.storage.k8s.io/ones-local-storage unchanged
storageclass.storage.k8s.io/ones-local-storage-mock unchanged
deployment.apps/localstorage-ones-cn-server-node02 unchanged
storageclass.storage.k8s.io/ones-local-storage unchanged
storageclass.storage.k8s.io/ones-local-storage-mock unchanged
2023-08-11 09:41:09,868 [INFO] diff monitoring-system before setup
2023-08-11 09:41:11,709 [INFO] backward compatible
2023-08-11 09:41:11,782 [INFO] setup monitoring-system
2023-08-11 09:41:12,358 [INFO] record latest data
render to tmp/latest-record-setup-app-comfigmap.yaml
render to tmp/date-record-setup-app-comfigmap.yaml
2023-08-11 09:41:13,453 [INFO] remove /data/ones/ones-ai-k8s/tmp
2023-08-11 09:41:13,455 [INFO] setup monitoring-system finish
2023-08-11 09:41:13,455 [INFO] elapsed time: 10.361 seconds
bash-5.0#
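Verification
### The following pods should be in a normal state
kubectl get pod -A |grep -i alert
Verify through the receiver: for email (SMTP), check the recipient's inbox; for Webhook, check whether the webhook endpoint received the POST request for the test alert.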
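For the webhook check, a throwaway receiver makes it easy to see the POST arrive. The sketch below is a minimal standard-library HTTP server. It assumes the standard Alertmanager webhook JSON payload (a top-level status and an alerts list whose entries carry labels); the default port 8129 is the placeholder from the webhook example, and summarize, AlertHandler and serve are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Return one line per alert: '<status> <alertname> priority=<priority>'."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("{} {} priority={}".format(
            alert.get("status", payload.get("status", "unknown")),
            labels.get("alertname", "?"),
            labels.get("priority", "-"),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by Alertmanager and print a short summary.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8129):
    """Listen on all interfaces until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() on the host that the receiver URL points to should print a line such as the ONESAlertHealth test alert once it fires; stop the server when verification is done.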
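Handling alerts in common scenarios
See: Troubleshooting Topics - Alert Troubleshooting
Disable Alertmanager monitoring alerts in ONES
Enter the running ONES container
ones-ai-k8s.sh
Disable alerting and clean up data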
make delete-prometheus RETAIN_DATA=false
make delete-monitoring-system RETAIN_DATA=false