[Cloud Computing] Mesos + Marathon: Service Discovery, Load Balancing, Monitoring and Alerting

Reposted article, dated 2016-08-05 15:42.

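Mesos-dns and Marathon-lb are two service discovery and load balancing tools published by Mesosphere. The official documentation mainly targets DCOS, and there is little Chinese-language documentation for other systems, so here are my installation notes and usage summary for CentOS 7.

1. Mesos service discovery and load balancing

By default, Marathon on Mesos deploys each app to a random port on a random node. As the number of Mesos slaves and apps grows, locating a particular group of app instances becomes difficult.

Mesosphere provides two tools for this: mesos-dns and marathon-lb. mesos-dns is a service discovery tool; marathon-lb is both a service discovery tool and a load balancer.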

2. mesos-dns

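Mesos-dns is the service discovery tool for Mesos. It can look up an app's IP address and port number, as well as information such as the masters and the current leader.

2.1 Installation

Download the mesos-dns binary from: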

https://github.com/mesosphere/mesos-dns/releases

Rename the downloaded file to mesos-dns and make it executable:

chmod +x mesos-dns

Write config.json following the official documentation, filling in the zk, masters, and related settings.
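As a rough illustration of what such a config.json might contain, the Python sketch below assembles one and prints it. The field names (zk, masters, refreshSeconds, ttl, domain, port, resolvers, timeout, email) are the ones I recall from the mesos-dns documentation and should be checked against the release you downloaded; every address and value here is a placeholder, not taken from the original setup.

```python
import json


def build_mesos_dns_config(zk_hosts, masters, resolvers):
    """Return a dict shaped like a basic mesos-dns config.json.

    All values are illustrative assumptions; adjust them to your cluster.
    """
    return {
        # ZooKeeper path used to detect the leading Mesos master
        "zk": "zk://" + ",".join(zk_hosts) + "/mesos",
        # Fallback list of masters (host:port)
        "masters": masters,
        # How often (seconds) mesos-dns refreshes its records from the master
        "refreshSeconds": 60,
        # TTL (seconds) attached to the DNS answers
        "ttl": 60,
        # Domain suffix, giving names such as test-app.marathon.mesos
        "domain": "mesos",
        # DNS listening port
        "port": 53,
        # Upstream resolvers for names outside the mesos domain
        "resolvers": resolvers,
        # Timeout (seconds) for upstream queries
        "timeout": 5,
        # Contact address placed in the SOA record
        "email": "root.mesos-dns.mesos",
    }


config = build_mesos_dns_config(
    zk_hosts=["10.0.0.1:2181", "10.0.0.2:2181", "10.0.0.3:2181"],  # placeholders
    masters=["10.0.0.1:5050", "10.0.0.2:5050", "10.0.0.3:5050"],   # placeholders
    resolvers=["8.8.8.8"],                                          # placeholder
)
config_text = json.dumps(config, indent=2)
print(config_text)
```

Redirecting the output to config.json gives a file that can be passed to mesos-dns with the -config flag.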

2.2 Starting

2.2.1 From the command line

mesos-dns -config config.json

2.2.2 Deploying through Marathon

#mesos-dns.json

{
"id": "mesos-dns",
"cpus": 0.5,
"mem": 128.0,
"instances": 3,
"constraints": [["hostname", "UNIQUE"]],
"cmd": "/opt/mesos-dns/mesos-dns -config /opt/mesos-dns/config.json"
}

# Submit the app definition to Marathon

curl -i -H 'Content-Type: application/json' 172.31.17.71:8080/v2/apps -d@mesos-dns.json
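The same POST to /v2/apps can be issued from a script. The sketch below uses only the Python standard library and builds the request without sending it; the helper name build_deploy_request is hypothetical, and the Marathon address is the one used in the curl command.

```python
import json
import urllib.request

MARATHON_URL = "http://172.31.17.71:8080"  # Marathon address from the curl example


def build_deploy_request(app_definition, marathon_url=MARATHON_URL):
    """Build (but do not send) a POST request that submits an app to Marathon."""
    body = json.dumps(app_definition).encode("utf-8")
    return urllib.request.Request(
        url=marathon_url + "/v2/apps",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same content as mesos-dns.json
mesos_dns_app = {
    "id": "mesos-dns",
    "cpus": 0.5,
    "mem": 128.0,
    "instances": 3,
    "constraints": [["hostname", "UNIQUE"]],
    "cmd": "/opt/mesos-dns/mesos-dns -config /opt/mesos-dns/config.json",
}

request = build_deploy_request(mesos_dns_app)
print(request.get_method(), request.full_url)

# To actually deploy, send the request against a reachable Marathon:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```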

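In the original screenshot, the app named mesos-dns is the mesos-dns deployed through Marathon, running two instances.

2.3 Usage

Note: slave4 is the hostname of a node running mesos-dns.

2.3.1 Look up an app's IP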

dig test-app.marathon.mesos +short @slave4

172.17.0.2

2.3.2 Look up the IPs of the nodes running the app

dig test-app.marathon.slave.mesos +short @slave4

172.31.17.33
172.31.17.31
172.31.17.32

2.3.3 Look up the app's service port

dig SRV _test-app._tcp.marathon.mesos +short @slave4

0 0 31234 test-app-s3ehn-s11.marathon.slave.mesos.

0 0 31846 test-app-zfp5d-s10.marathon.slave.mesos.

0 0 31114 test-app-3xynw-s12.marathon.slave.mesos.
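Each SRV answer consists of four fields: priority, weight, port, and target host. If a script needs the host and port pairs, the short output of dig can be split accordingly. The following is a minimal sketch; the function and type names are hypothetical, and the sample text is the output shown above.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_lines(dig_output: str) -> List[SrvRecord]:
    """Parse `dig SRV ... +short` output into SrvRecord tuples."""
    records = []
    for line in dig_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        priority, weight, port, target = fields
        # Drop the trailing dot of the fully qualified target name
        records.append(
            SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))
        )
    return records


sample = """
0 0 31234 test-app-s3ehn-s11.marathon.slave.mesos.
0 0 31846 test-app-zfp5d-s10.marathon.slave.mesos.
0 0 31114 test-app-3xynw-s12.marathon.slave.mesos.
"""

records = parse_srv_lines(sample)
for record in records:
    print(f"{record.target}:{record.port}")
```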

3. marathon-lb

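Marathon-lb is both a service discovery tool and a load balancer. It bundles haproxy, automatically collects information about every app, generates haproxy configuration for each group of app instances, and exposes them through a servicePort or a web virtual host.

To use marathon-lb, every app must set the HAPROXY_GROUP label.

At runtime marathon-lb binds to the service port defined by each app (servicePort; if an app does not define one, Marathon assigns a random port number), so each app can be reached through the corresponding service port on the node where marathon-lb runs.

For example, if marathon-lb is deployed on slave5, test-app is deployed on slave1, and the servicePort of test-app is 10004, then the service provided by test-app can be reached on port 10004 of slave5.

Because a servicePort cannot be 80 or 443 (both are held exclusively by the haproxy inside marathon-lb), this is inconvenient for web services. haproxy virtual hosts solve the problem:

Add the HAPROXY_{n}_VHOST (web virtual host) label to the definition of an app that serves web traffic. Marathon-lb then automatically publishes that app's web cluster on ports 80 and 443 of the node where marathon-lb runs, and users reach it by virtual host name once DNS is set up.

3.1 Installation

# Pull the marathon-lb image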

docker pull docker.io/mesosphere/marathon-lb

The image can be started with docker run or deployed to the Mesos cluster through Marathon.

3.2 Running

3.2.1 From the command line

docker run -d --privileged -e PORTS=9090 --net=host docker.io/mesosphere/marathon-lb sse -m http://master1_ip:8080 -m http://master2_ip:8080 -m http://master3_ip:8080 --group external

3.2.2 Deploying through Marathon

{
"id": "marathon-lb",
"instances": 3,
"constraints": [["hostname", "UNIQUE"]],
"container": {
"type": "DOCKER",
"docker": {
"image": "docker.io/mesosphere/marathon-lb",
"privileged": true,
"network": "HOST"
}
},
"args": ["sse", "-m","http://master1_ip:8080", "-m","http://master2_ip:8080", "-m","http://master3_ip:8080","--group", "external"]
}

curl -i -H 'Content-Type: application/json' 172.31.17.71:8080/v2/apps -d@marathon-lb.json

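3.3 Usage

The following uses marathon-lb to provide service discovery and load balancing for an HTTP service:

3.3.1 Publishing the app

# First create the app's JSON definition

The HAPROXY_GROUP label is mandatory. For a web service you can also add a VHOST label so that marathon-lb sets up a web virtual host.

For a web service, setting servicePort to 0 is enough; marathon-lb automatically publishes the web cluster on ports 80 and 443.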

{
"id": "test-app",
"labels": {
"HAPROXY_GROUP":"external",
"HAPROXY_0_VHOST":"test-app.XXXXX.com"
},
"cpus": 0.5,
"mem": 64.0,
"instances": 3,
"constraints": [["hostname", "UNIQUE"]],
"container": {
"type": "DOCKER",
"docker": {
"image": "httpd",
"privileged": false,
"network": "BRIDGE",
"portMappings": [
{ "containerPort": 80, "hostPort": 0, "servicePort": 0, "protocol": "tcp"}
]
}
}
}
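Before submitting a definition like this, it can help to check that the labels are what marathon-lb expects. The sketch below is a hypothetical pre-flight check, not part of marathon-lb: it reports whether the app's HAPROXY_GROUP matches the --group value marathon-lb was started with ("external" above) and collects any HAPROXY_{n}_VHOST labels. Treating HAPROXY_GROUP as a comma-separated list is an assumption based on my recollection of marathon-lb's behaviour.

```python
import re

# Matches labels such as HAPROXY_0_VHOST and captures the port index n
VHOST_LABEL = re.compile(r"^HAPROXY_(\d+)_VHOST$")


def lb_summary(app, lb_group):
    """Summarise how an app definition relates to a marathon-lb group."""
    labels = app.get("labels", {})
    # Assumption: HAPROXY_GROUP may list several groups separated by commas
    groups = [g.strip() for g in labels.get("HAPROXY_GROUP", "").split(",") if g.strip()]
    vhosts = {}
    for name, value in labels.items():
        match = VHOST_LABEL.match(name)
        if match:
            vhosts[int(match.group(1))] = value
    return {"picked_up": lb_group in groups, "vhosts": vhosts}


# Labels copied from the test-app definition
test_app = {
    "id": "test-app",
    "labels": {
        "HAPROXY_GROUP": "external",
        "HAPROXY_0_VHOST": "test-app.XXXXX.com",
    },
}

summary = lb_summary(test_app, lb_group="external")
print(summary)

# An app without labels would not be handled by marathon-lb
unlabelled = lb_summary({"id": "other-app"}, lb_group="external")
print(unlabelled)
```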

# Publish the app

curl -i -H 'Content-Type: application/json' 172.31.17.71:8080/v2/apps -d@test-app.json

3.3.2 Accessing the app

First add a DNS record or a hosts file entry:

172.31.17.34 test-app.XXXXX.com

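Open the virtual host in a browser over both http and https and you will find the service is up. What you are actually reaching is the web cluster that the haproxy built into marathon-lb configured for the three test-app instances: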

http://test-app.XXXXX.com

https://test-app.XXXXX.com

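Several marathon-lb instances can be deployed at the same time, with DNS round-robin or a keepalived virtual IP providing high availability.

A few days ago I deployed influxdb and grafana on top of cadvisor on the Mesos platform to monitor Mesos and the runtime information of Docker apps. I found that this monitoring stack is not a good fit for a Mesos + Docker architecture, for two reasons:

1) The Mesos task id and the Docker container name do not match

cadvisor is designed around a single Docker host and does not take a Mesos data center into account.

cadvisor labels the data it collects with the Docker name (as shown by docker ps), while Mesos labels running tasks with the task id (visible in the Mesos UI or in metrics). A Mesos task may be a Docker container or something that is not a container at all, and Mesos task ids follow a completely different naming scheme from Docker container names.

As a result, once cadvisor has collected the data, it is hard for users to tell which Mesos task it belongs to.

2) cadvisor and grafana did not support alerting at the time of writing

After some research I found that mesos-exporter + prometheus + alert-manager is a good combination that solves the problems above:

mesos-exporter is a tool developed by Mesosphere that exports monitoring data for the Mesos cluster, including tasks, and passes it to prometheus. prometheus is a monitoring tool that combines a database, graphing, and statistics. alert-manager is the alerting tool for prometheus.

Setup:

1. Build mesos-exporter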

 
git clone https://github.com/mesosphere/mesos_exporter.git
cd mesos_exporter
docker build -f Dockerfile -t mesosphere/mesos-exporter .

2. docker pull prometheus, alert-manager

3. Deploy mesos-exporter, alert-manager, and prometheus

mesos-exporter:

 
{
  "id": "mesos-exporter-slave",
  "instances": 6,
  "cpus": 0.2,
  "mem": 128,
  "args": [
    "-slave=http://127.0.0.1:5051",
    "-timeout=5s"
  ],
  "constraints": [
    ["hostname", "UNIQUE"],
    ["hostname", "LIKE", "slave[1-6]"]
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/mesos-exporter:latest",
      "network": "HOST"
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      }
    ]
  }
}

Open port 9110/tcp in the firewall on each slave.

alert-manager:

 
{
  "id": "alertmanager",
  "instances": 1,
  "cpus": 0.5,
  "mem": 128,
  "constraints": [
    ["hostname", "UNIQUE"],
    ["hostname", "LIKE", "slave[1-6]"]
  ],
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "alertmanager.XXXXX.com"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/alertmanager:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 9093, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/alertmanager/config.yml",
        "hostPath": "/var/nfsshare/alertmanager/config.yml",
        "mode": "RO"
      },
      {
        "containerPath": "/alertmanager",
        "hostPath": "/var/nfsshare/alertmanager/data",
        "mode": "RW"
      }
    ]
  }
}

prometheus:

 
{
  "id": "prometheus",
  "instances": 1,
  "cpus": 0.5,
  "mem": 128,
  "args": [
    "-config.file=/etc/prometheus/prometheus.yml",
    "-storage.local.path=/prometheus",
    "-web.console.libraries=/etc/prometheus/console_libraries",
    "-web.console.templates=/etc/prometheus/consoles",
    "-alertmanager.url=http://alertmanager.XXXXX.com"
  ],
  "constraints": [
    ["hostname", "UNIQUE"],
    ["hostname", "LIKE", "slave[1-6]"]
  ],
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "prometheus.XXXXX.com"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/prometheus:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 9090, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/prometheus",
        "hostPath": "/var/nfsshare/prometheus/conf",
        "mode": "RO"
      },
      {
        "containerPath": "/prometheus",
        "hostPath": "/var/nfsshare/prometheus/data",
        "mode": "RW"
      }
    ]
  }
}

4. Prometheus configuration

prometheus.yml

 
# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: 'mesos-slaves'
    scrape_interval: 5s
    metrics_path: '/metrics'
    scheme: 'http'
    target_groups:
      - targets: ['172.31.17.31:9110', '172.31.17.32:9110', '172.31.17.33:9110', '172.31.17.34:9110', '172.31.17.35:9110', '172.31.17.36:9110']
        labels:
          group: 'office'
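The targets list has one entry per slave, each pointing at the mesos-exporter port 9110. When the slaves have consecutive addresses, as they do here, the list can be generated instead of typed by hand. This is a small sketch; the helper name exporter_targets is hypothetical, and it assumes the slaves share a /24 prefix and a contiguous host range.

```python
def exporter_targets(ip_prefix, first_host, last_host, port=9110):
    """Return 'ip:port' scrape targets for a contiguous range of slave IPs."""
    return [f"{ip_prefix}.{host}:{port}" for host in range(first_host, last_host + 1)]


# The six slaves used in this setup: 172.31.17.31 through 172.31.17.36
targets = exporter_targets("172.31.17", 31, 36)

# Render the list in the inline form used in prometheus.yml
targets_line = "- targets: [" + ", ".join(f"'{t}'" for t in targets) + "]"
print(targets_line)
```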