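A few days ago I deployed influxdb and grafana on top of cadvisor on our mesos platform to monitor mesos and the docker apps running on it. This stack turned out to be a poor fit for a mesos + docker architecture, for two reasons:
1) The mesos task id and the docker container name do not match
cadvisor is designed around a single docker host and does not take a mesos data center into account.
cadvisor labels the data it collects with the docker name (the one shown by docker ps), while mesos identifies running tasks by task id (visible in the mesos UI or in the metrics). A mesos task can be a docker container or a non-containerized process, and mesos task ids follow a completely different naming scheme from docker container names.
As a result, once cadvisor has collected the data, it is hard to tell which mesos task a given series belongs to.
2) cadvisor and grafana do not support alerting
After some research I found that mesos-exporter + prometheus + alert-manager is a good combination that solves both problems:
mesos-exporter is a tool developed by mesosphere that exports monitoring data for a mesos cluster, including its tasks, and exposes it to prometheus; prometheus is a monitoring tool that combines a database, graphing, and statistics; alert-manager is the alerting tool for prometheus.
Setup: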
1. build mesos-exporter
git clone https://github.com/mesosphere/mesos_exporter.git
cd mesos_exporter
docker build -f Dockerfile -t mesosphere/mesos-exporter .
2. docker pull prometheus, alert-manager
3. Deploy mesos-exporter, alert-manager, prometheus
mesos-exporter:
{
  "id": "mesos-exporter-slave",
  "instances": 6,
  "cpus": 0.2,
  "mem": 128,
  "args": [
    "-slave=http://127.0.0.1:5051",
    "-timeout=5s"
  ],
  "constraints": [
    ["hostname", "UNIQUE"],
    ["hostname", "LIKE", "slave[1-6]"]
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/mesos-exporter:latest",
      "network": "HOST"
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      }
    ]
  }
}
Remember to open port 9110/tcp in the firewall on each slave.
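To confirm that an exporter is actually reachable on 9110 after the firewall change, something like the following could be used. It is only a minimal sketch: it assumes the exporter serves the standard Prometheus text format at /metrics on port 9110 (the port the scrape config in this post uses), and the helper names `parse_metrics` and `fetch_metrics` are my own, not part of mesos-exporter.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format lines into a {series: value} dict.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    An optional trailing timestamp is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the last "}" is the series name.
            head, _, tail = line.rpartition("}")
            series = head + "}"
        else:
            # Series without labels: the name ends at the first space.
            series, _, tail = line.partition(" ")
        fields = tail.split()
        samples[series] = float(fields[0])
    return samples


def fetch_metrics(host, port=9110, timeout=5):
    """Fetch and parse /metrics from one exporter (port 9110 is an assumption)."""
    url = "http://%s:%d/metrics" % (host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("172.31.17.31")` from the prometheus host should return a non-empty dict if the port is open; a timeout or connection error usually points back at the firewall.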
alert-manager:
{
  "id": "alertmanager",
  "instances": 1,
  "cpus": 0.5,
  "mem": 128,
  "constraints": [
    ["hostname", "UNIQUE"],
    ["hostname", "LIKE", "slave[1-6]"]
  ],
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "alertmanager.test.com"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/alertmanager:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 9093, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/alertmanager/config.yml",
        "hostPath": "/var/nfsshare/alertmanager/config.yml",
        "mode": "RO"
      },
      {
        "containerPath": "/alertmanager",
        "hostPath": "/var/nfsshare/alertmanager/data",
        "mode": "RW"
      }
    ]
  }
}
prometheus:
{
  "id": "prometheus",
  "instances": 1,
  "cpus": 0.5,
  "mem": 128,
  "args": [
    "-config.file=/etc/prometheus/prometheus.yml",
    "-storage.local.path=/prometheus",
    "-web.console.libraries=/etc/prometheus/console_libraries",
    "-web.console.templates=/etc/prometheus/consoles",
    "-alertmanager.url=http://alertmanager.test.com"
  ],
  "constraints": [
    ["hostname", "UNIQUE"],
    ["hostname", "LIKE", "slave[1-6]"]
  ],
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "prometheus.test.com"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/prometheus:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 9090, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/prometheus",
        "hostPath": "/var/nfsshare/prometheus/conf",
        "mode": "RO"
      },
      {
        "containerPath": "/prometheus",
        "hostPath": "/var/nfsshare/prometheus/data",
        "mode": "RW"
      }
    ]
  }
}
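The three definitions above have the shape of Marathon app definitions, so one way to deploy them is to POST each one to Marathon's /v2/apps endpoint. The sketch below assumes exactly that; the Marathon address and the JSON file name in the usage note are placeholders of mine, not values from this setup.

```python
import json
import urllib.request

# Assumption: replace with the address of your own Marathon instance.
MARATHON_URL = "http://marathon.example.com:8080"


def build_deploy_request(app_definition, marathon_url=MARATHON_URL):
    """Build (but do not send) a POST /v2/apps request for one app definition."""
    body = json.dumps(app_definition).encode("utf-8")
    return urllib.request.Request(
        marathon_url.rstrip("/") + "/v2/apps",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def deploy(path, marathon_url=MARATHON_URL):
    """Load an app definition from a JSON file and submit it to Marathon."""
    with open(path) as f:
        app = json.load(f)
    request = build_deploy_request(app, marathon_url)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status
```

For example, with the prometheus definition saved as a (hypothetical) prometheus.json, `deploy("prometheus.json")` would submit it. Marathon rejects a POST for an app id that already exists, so this only covers the first deployment.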
4. prometheus configuration
prometheus.yml
# my global config
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: 'mesos-slaves'
    scrape_interval: 5s
    metrics_path: '/metrics'
    scheme: 'http'
    target_groups:
      # labels belongs to the same target group as targets, not to a separate list item
      - targets: ['172.31.17.31:9110', '172.31.17.32:9110', '172.31.17.33:9110', '172.31.17.34:9110', '172.31.17.35:9110', '172.31.17.36:9110']
        labels:
          group: 'office'
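Once prometheus has loaded this config, a quick way to see whether all six slaves are being scraped is to ask it for the built-in `up` series of the mesos-slaves job. The sketch below assumes the running prometheus version serves the /api/v1/query HTTP endpoint (older builds may not) and that it is reachable at the prometheus.test.com vhost from the app definition; the function names are my own.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def down_instances(response):
    """Return the instances whose 'up' sample is 0 in a query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % (response,))
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value is returned as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def check_slaves(base_url="http://prometheus.test.com"):
    """Query prometheus and list the mesos-slaves targets that are down."""
    url = build_query_url(base_url, 'up{job="mesos-slaves"}')
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_instances(json.load(resp))
```

An empty list from `check_slaves()` would mean every target returned by the query was scraped successfully; it does not detect targets that are missing from the config altogether.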
To be added ...

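5. Alerting setup
To be added ...
6. Integration with grafana
The graphing features built into prometheus are fairly limited, so it can be integrated with grafana, letting grafana take over the graphing.

data source settings: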

7. Appendix: mesos metrics and statistics URLs
http://master1:5050/metrics/snapshot
http://slave4:5051/metrics/snapshot
http://master1:5050/master/state.json
http://slave4:5051/monitor/statistics.json
Users can write their own monitoring programs based on the data these pages return.
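As a starting point for such a program, here is a minimal sketch that reads /monitor/statistics.json from a slave and reports memory usage per executor. It assumes each entry in the returned list carries an `executor_id` and a `statistics` object with `mem_rss_bytes` and `mem_limit_bytes`; check the actual output of your mesos version before relying on those field names. The 0.9 threshold and the function names are arbitrary choices of mine.

```python
import json
import urllib.request


def memory_usage_by_executor(entries):
    """Map executor_id -> fraction of its memory limit currently in use."""
    usage = {}
    for entry in entries:
        stats = entry.get("statistics", {})
        limit = stats.get("mem_limit_bytes")
        rss = stats.get("mem_rss_bytes")
        if not limit or rss is None:
            continue  # skip entries without usable memory figures
        usage[entry["executor_id"]] = rss / limit
    return usage


def over_threshold(usage, threshold=0.9):
    """Return the executor ids at or above the given usage fraction."""
    return sorted(eid for eid, ratio in usage.items() if ratio >= threshold)


def fetch_statistics(slave_url):
    """Fetch the statistics list from one slave, e.g. http://slave4:5051."""
    url = slave_url.rstrip("/") + "/monitor/statistics.json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

Something like `over_threshold(memory_usage_by_executor(fetch_statistics("http://slave4:5051")))` would then list the executors close to their memory limit. Because the result is keyed by the executor id that mesos reports rather than by docker container name, it avoids the naming mismatch described at the top of this post.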
