Scenario:
As monitoring data grows, a single Prometheus can no longer keep up with the scrape volume; even with 100 GB+ of memory it eventually OOMs.
Approach:
1. Reduce the amount of data Prometheus keeps in memory by shipping blocks to long-term TSDB/object storage;
2. Split the workload into multiple Prometheus instances by business module. To aggregate data across the Prometheus instances, use Thanos Query (see the example right after this list).
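For example, once the stack below is running, a single PromQL query sent to Thanos Query (port 19192) fans out to both Prometheus instances; the results are told apart by the external label defined in step 3:
node_uname_info                        # returns series from both monitor="danny-ecs" and monitor="prometheus2"
node_uname_info{monitor="danny-ecs"}   # restrict the result to the first instance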
Assumptions for this Thanos setup:
1. Docker and docker-compose are already installed (this walkthrough deploys everything via docker-compose).
2. Two Prometheus instances are used to verify that Thanos works.
Installation steps:
1. Create the directories:
# storage paths for the two Prometheus instances
mkdir -p /home/dockerdata/prometheus
mkdir -p /home/dockerdata/prometheus2
# paths for minio (object storage) and the docker-compose files
mkdir -p /home/dockerfile/thanos
mkdir -p /home/minio/data
2. minio object-store configuration (at /home/dockerfile/thanos/bucket_config.yaml):
type: S3
config:
  bucket: "thanos"
  endpoint: 'minio:9000'
  access_key: "danny"
  insecure: true # true = plain HTTP, false = HTTPS
  signature_version2: false
  encrypt_sse: false
  secret_key: "xxxxxxxx" # the S3 secret key; must be at least 8 characters long
  put_user_metadata: {}
  http_config:
    idle_conn_timeout: 90s
    response_header_timeout: 2m
    insecure_skip_verify: false
  trace:
    enable: false
  part_size: 134217728
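Once minio is up and the thanos bucket has been created (steps 6-8), this object-store config can be sanity-checked from the command line. A sketch using the Thanos image itself; the network name thanos_default is an assumption (docker-compose derives it from the /home/dockerfile/thanos directory name):
docker run --rm --network=thanos_default \
  -v /home/dockerfile/thanos/bucket_config.yaml:/bucket_config.yaml \
  quay.io/thanos/thanos:v0.13.0-rc.2 \
  tools bucket inspect --objstore.config-file=/bucket_config.yaml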
3. Prometheus configuration files (at /home/dockerfile/thanos/prometheus.yml and /home/dockerfile/thanos/prometheus2.yml). The two files differ mainly in their mapped port and external labels. Defining external_labels is mandatory: it is what lets Thanos tell apart identical metrics coming from different sources.
prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    monitor: 'danny-ecs'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      #- targets: ['localhost:9090']
      - targets: ['node-exporter:9100']
prometheus2.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
  external_labels:
    monitor: 'prometheus2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      #- targets: ['localhost:9091']
      - targets: ['node-exporter:9100']
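Because the compose file in step 5 starts Prometheus with --web.enable-lifecycle, later edits to these two files can be applied without restarting the containers:
curl -X POST http://localhost:9090/-/reload   # reload prometheus1
curl -X POST http://localhost:9091/-/reload   # reload prometheus2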
4. Dockerfile (used in the reference setup's docker-compose; probably not strictly required here, since the compose file below pulls prebuilt Thanos images from quay.io. Located at /home/dockerfile/thanos):
FROM quay.io/prometheus/busybox:latest
LABEL maintainer="danny"
COPY /thanos_tmp_for_docker /bin/thanos
ENTRYPOINT [ "/bin/thanos" ]
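If you do want to build it, a minimal sketch (the tag my-thanos is arbitrary, and a thanos binary must first be placed at /home/dockerfile/thanos/thanos_tmp_for_docker):
cd /home/dockerfile/thanos
docker build -t my-thanos .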
5. docker-compose file (almost everything lives in here):
version: '2'
services:
  prometheus1:
    container_name: prometheus1
    image: prom/prometheus
    ports:
      - 9090:9090
    logging:
      driver: "json-file"
      options:
        max-size: "5m"
        max-file: "3"
    volumes:
      - /home/dockerdata/prometheus:/prometheus
      - /home/dockerfile/thanos/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - --web.enable-lifecycle
      - --web.enable-admin-api
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --storage.tsdb.min-block-duration=30m # small just to not wait hours to test :)
      - --storage.tsdb.max-block-duration=30m # small just to not wait hours to test :)
    depends_on:
      - minio
  sidecar1:
    container_name: sidecar1
    image: quay.io/thanos/thanos:v0.13.0-rc.2
    logging:
      driver: "json-file"
      options:
        max-size: "5m"
        max-file: "3"
    volumes:
      - /home/dockerdata/prometheus:/var/prometheus
      - /home/dockerfile/thanos/bucket_config.yaml:/bucket_config.yaml
    command:
      - sidecar
      - --tsdb.path=/var/prometheus
      - --prometheus.url=http://prometheus1:9090
      - --objstore.config-file=/bucket_config.yaml
      - --http-address=0.0.0.0:19191
      - --grpc-address=0.0.0.0:19090
    depends_on:
      - minio
      - prometheus1
  prometheus2:
    container_name: prometheus2
    image: prom/prometheus
    ports:
      - 9091:9090
    logging:
      driver: "json-file"
      options:
        max-size: "5m"
        max-file: "3"
    volumes:
      - /home/dockerdata/prometheus2:/prometheus
      - /home/dockerfile/thanos/prometheus2.yml:/etc/prometheus/prometheus.yml
    command:
      - --web.enable-lifecycle
      - --web.enable-admin-api
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --storage.tsdb.min-block-duration=30m
      - --storage.tsdb.max-block-duration=30m
    depends_on:
      - minio
  sidecar2:
    container_name: sidecar2
    image: quay.io/thanos/thanos:v0.13.0-rc.2
    logging:
      driver: "json-file"
      options:
        max-size: "5m"
        max-file: "3"
    volumes:
      - /home/dockerdata/prometheus2:/var/prometheus
      - /home/dockerfile/thanos/bucket_config.yaml:/bucket_config.yaml
    command:
      - sidecar
      - --tsdb.path=/var/prometheus
      - --prometheus.url=http://prometheus2:9090
      - --objstore.config-file=/bucket_config.yaml
      - --http-address=0.0.0.0:19191
      - --grpc-address=0.0.0.0:19090
    depends_on:
      - minio
      - prometheus2
  grafana:
    container_name: grafana
    image: grafana/grafana
    ports:
      - "3000:3000"
  # to search on old metrics
  storer:
    container_name: storer
    image: quay.io/thanos/thanos:v0.13.0-rc.2
    volumes:
      - /home/dockerfile/thanos/bucket_config.yaml:/bucket_config.yaml
    command:
      - store
      - --data-dir=/var/thanos/store
      - --objstore.config-file=/bucket_config.yaml
      - --http-address=0.0.0.0:19191
      - --grpc-address=0.0.0.0:19090
    depends_on:
      - minio
  # downsample metrics on the bucket
  compactor:
    container_name: compactor
    image: quay.io/thanos/thanos:v0.13.0-rc.2
    volumes:
      - /home/dockerfile/thanos/bucket_config.yaml:/bucket_config.yaml
    command:
      - compact
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/bucket_config.yaml
      - --http-address=0.0.0.0:19191
      - --wait
    depends_on:
      - minio
  # querier component which can be scaled
  querier:
    container_name: querier
    image: quay.io/thanos/thanos:v0.13.0-rc.2
    labels:
      - "traefik.enable=true"
      - "traefik.port=19192"
      - "traefik.frontend.rule=PathPrefix:/"
    ports:
      - "19192:19192"
    command:
      - query
      - --http-address=0.0.0.0:19192
      - --store=sidecar1:19090
      - --store=sidecar2:19090
      - --store=storer:19090
      - --query.replica-label=replica
  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - 9000:9000
    volumes:
      - "/home/minio/data:/data"
    environment:
      MINIO_ACCESS_KEY: "danny"
      MINIO_SECRET_KEY: "xxxxxxxx" # at least 8 characters
    command: server /data
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "5m"
        max-file: "3"
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - '9100:9100'
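Before bringing the stack up (step 6), the compose file can be syntax-checked; docker-compose config -q prints nothing when the file is valid:
cd /home/dockerfile/thanos
docker-compose config -q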
6. Start the stack. On the first `up`, storer will fail to start and needs to be restarted once on its own:
cd /home/dockerfile/thanos
docker-compose up -d
docker-compose up -d storer
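If storer keeps failing, its logs usually show why (for example, the thanos bucket not existing yet; see step 8):
docker-compose logs --tail=20 storer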
7. Verify:
docker ps -a
All containers should now be up and running.
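Any container that crashed instead of staying up can be spotted quickly:
docker ps -a --filter status=exited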
8. Verify that Prometheus block data is uploaded to minio.
Open http://ip:9000/minio/login
The credentials are the ones defined in bucket_config.yaml.
Create the bucket: thanos
If everything is running correctly, block data will be uploaded to the thanos bucket (which proves the sidecar component works):
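The bucket contents can also be checked from the command line with the minio client; a sketch, assuming a recent mc release is installed locally (myminio is an arbitrary alias, credentials as in bucket_config.yaml):
mc alias set myminio http://ip:9000 danny xxxxxxxx
mc ls --recursive myminio/thanos   # uploaded blocks show up as ULID-named directories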
9. Verify that the query and store components work.
Open the Query page (it is almost identical to the Prometheus UI)
and enter the metric: node_uname_info
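The same check works against Thanos Query's Prometheus-compatible HTTP API:
curl 'http://ip:19192/api/v1/query?query=node_uname_info'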
Deployment reference: https://www.cnblogs.com/rongfengliang/p/11319933.html
Thanos architecture reference: http://www.dockone.io/article/10035