Spring Boot 2 Metrics: integrating actuator with influxdb, with Grafana for monitoring and alerting

Reposted article (see the original). 2019-06-21 19:24. Tags: Metrics, Influxdb, CI/CD, Grafana

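There are by now countless open-source components for log collection and metrics monitoring, yet many people still just run tail -f on a log file. As container technology has matured, logs and metrics can no longer be inspected with tail -f alone; you may not even be able to log in to the machine where the service is deployed. Log collection and metrics reporting are therefore indispensable. Logs can be shipped to ES with logstash or filebeat for searching. For metrics, Spring Boot provides the actuator component, which records and collects indicators such as CPU, memory, threads and requests. This article gives a rough walkthrough of integrating influxdb for data collection and using Grafana for display.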

Final dashboard template: https://github.com/Ryan-Miao/boot-metrics-exporter/blob/master/grafana/grafana-dashboard-template.json

The resulting dashboards look like this:

Redis cache hit-rate statistics:

Statistics for individual important requests:

Alerts based on the health check:

Installing influxdb and Grafana

Install influxdb:

https://www.cnblogs.com/woshimrf/p/docker-influxdb.html

Install Grafana:

https://www.cnblogs.com/woshimrf/p/docker-grafana.html

Spring Boot configuration

You can use the ready-made starter directly:

https://github.com/Ryan-Miao/boot-metrics-exporter

Or:

Add the dependencies

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-influx</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>

Define a MeterConfig to set common tags in one place, such as the instance id

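import io.micrometer.core.instrument.MeterRegistry;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.stereotype.Component;

@Component
public class MeterConfig implements MeterRegistryCustomizer<MeterRegistry> {

    private static final Logger LOGGER = LoggerFactory.getLogger(MeterConfig.class);

    @Override
    public void customize(MeterRegistry registry) {
        try {
            // Prefer the host IP as the instance id so that instances can be told apart in Grafana.
            String hostAddress = InetAddress.getLocalHost().getHostAddress();
            LOGGER.debug("Setting metrics instance id to ip: {}", hostAddress);
            registry.config().commonTags("instance-id", hostAddress);
        } catch (UnknownHostException e) {
            // Fall back to a random UUID when the IP cannot be resolved.
            String uuid = UUID.randomUUID().toString();
            registry.config().commonTags("instance-id", uuid);
            LOGGER.error("Failed to resolve instance ip; using uuid {} as instance id", uuid, e);
        }
    }
}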

Add the corresponding configuration:

management:
  metrics:
    export:
      influx:
        db: my-db
        uri: http://192.168.5.9:8086
        user-name: admin
        password: admin
        enabled: true
    web:
      server:
        auto-time-requests: true
    tags:
      app: ${spring.application.name}

Here the metrics are exported to influxdb; many other storage backends are available.
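To make "exporting to influxdb" more concrete, here is a minimal sketch of how a single gauge sample can be rendered in InfluxDB line protocol. It is not the code the micrometer registry uses internally; the class name LineProtocolSketch and its methods are hypothetical, and only the tag names app and instance-id come from this article.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: formats one gauge sample as an InfluxDB line-protocol string. */
class LineProtocolSketch {

    /** Escapes the characters that are special in tag keys and tag values. */
    static String escapeTag(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    /** Escapes the characters that are special in a measurement name. */
    static String escapeMeasurement(String s) {
        return s.replace(",", "\\,").replace(" ", "\\ ");
    }

    /**
     * Builds "measurement,tag=value,... value=<number> <timestamp>".
     * The timestamp is in milliseconds here, which assumes the write request
     * is sent with millisecond precision.
     */
    static String format(String measurement, Map<String, String> tags,
                         double value, long timestampMillis) {
        StringBuilder sb = new StringBuilder(escapeMeasurement(measurement));
        // Tags are written in sorted key order, which keeps the output deterministic.
        for (Map.Entry<String, String> e : new TreeMap<>(tags).entrySet()) {
            sb.append(',')
              .append(escapeTag(e.getKey()))
              .append('=')
              .append(escapeTag(e.getValue()));
        }
        sb.append(" value=").append(value).append(' ').append(timestampMillis);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new TreeMap<>();
        tags.put("app", "demo");
        tags.put("instance-id", "10.0.0.1");
        System.out.println(format("health", tags, 100.0, 1561116240000L));
    }
}
```

With the sample values in main, the printed line is health,app=demo,instance-id=10.0.0.1 value=100.0 1561116240000.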

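Network configuration

grafana and influxdb may be deployed in one vpc, for example a monitor cluster, while the business systems to be monitored are spread across the vpcs of the individual business lines. The business clusters therefore need network access to the influxdb host and port.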
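A quick way to confirm from a business host that the influxdb port is reachable is a plain TCP connect with a timeout. The sketch below uses only the JDK; the class name InfluxReachability is hypothetical, and the address in main is simply the one from the configuration above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative only: checks whether a TCP connection to host:port can be opened. */
class InfluxReachability {

    /** Returns true if a TCP connection is established within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out or unroutable: the port is treated as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Address taken from the management.metrics.export.influx.uri setting above.
        System.out.println(canConnect("192.168.5.9", 8086, 2000));
    }
}
```

This only shows that the port is open from this host; it does not verify credentials or that the database exists.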

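Custom metrics

The health endpoint exposed by Spring Boot actuator only reports a status such as up/down. I have not found a way to use that in a Grafana threshold, so I convert it to a number.

Custom MeterBinder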

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import lombok.Data;

@Data
public class HealthMetrics implements MeterBinder {

    /**
     * 100  up
     * 0  down
     * 0 unknown
     */
    private Integer health = 100;


    @Override
    public void bindTo(MeterRegistry registry) {
        Gauge.builder("health", () -> health)
                .register(registry);
    }
}

Update the status every 30s:

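import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;

public abstract class AbstractHealthCheckStatusSetter {
    private final HealthMetrics healthMetrics;

    protected AbstractHealthCheckStatusSetter(HealthMetrics healthMetrics) {
        this.healthMetrics = healthMetrics;
    }

    /**
     * Defines how the health status is updated: set the value of HealthMetrics.health.
     */
    public abstract void setHealthStatus(HealthMetrics h);

    /**
     * Refreshes the health metric on a fixed schedule (first run after 30s, then every 30s).
     */
    @PostConstruct
    void doSet() {
        ScheduledExecutorService scheduledExecutorService = new ScheduledThreadPoolExecutor(1);
        scheduledExecutorService.scheduleWithFixedDelay(
                () -> setHealthStatus(healthMetrics), 30L, 30L, TimeUnit.SECONDS);
    }
}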

Implementation class

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.health.Status;

public class HealthCheckStatusSetter extends AbstractHealthCheckStatusSetter {
    private final HealthEndpoint healthEndpoint;

    public HealthCheckStatusSetter(HealthMetrics healthMetrics, HealthEndpoint healthEndpoint) {
        super(healthMetrics);
        this.healthEndpoint = healthEndpoint;
    }

    @Override
    public void setHealthStatus(HealthMetrics healthMetrics) {
        Health health = healthEndpoint.health();
        if (health != null) {
            Status status = health.getStatus();
            switch (status.getCode()) {
                case "UP":
                    healthMetrics.setHealth(100);
                    break;
                case "DOWN":
                case "UNKNOWN":
                default:
                    // DOWN, UNKNOWN and any other status are all reported as 0.
                    healthMetrics.setHealth(0);
                    break;
            }
        }
    }
}

Register the beans

    @Bean
    @ConditionalOnMissingBean
    public HealthMetrics healthMetrics() {
        return new HealthMetrics();
    }

    /**
     * The healthEndpoint is used here to judge the health of the system. For other needs, implement AbstractHealthCheckStatusSetter and set health yourself.
     */
    @Bean
    @ConditionalOnMissingBean
    @ConditionalOnBean(HealthEndpoint.class)
    public AbstractHealthCheckStatusSetter healthCheckSchedule(HealthEndpoint healthEndpoint, HealthMetrics healthMetrics) {
        return new HealthCheckStatusSetter(healthMetrics, healthEndpoint);
    }

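Redis cache hit-rate statistics

The whole metrics stack is built on Spring Boot actuator, and actuator records its statistics through io.micrometer. Any metric can therefore be added by defining custom micrometer metrics. For example, we commonly use redis as a cache, so the cache hit rate is something we care about. You could write your own set of counters: hit +1 on a hit, miss +1 on a miss.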
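The hand-written counter approach can be sketched with JDK classes alone. The class CountingCache and its in-memory map are hypothetical stand-ins for illustration; in a real service the two counters would more likely be micrometer counters tagged with the result (hit or miss), and the store would be redis.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Illustrative only: a cache wrapper that counts hits and misses. */
class CountingCache<K, V> {
    // In-memory stand-in for the real cache backend.
    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    /** Returns the cached value, or loads and caches it on a miss. */
    V get(K key, Function<K, V> loader) {
        V cached = store.get(key);
        if (cached != null) {
            hits.increment();      // hit + 1
            return cached;
        }
        misses.increment();        // miss + 1
        V loaded = loader.apply(key);
        if (loaded != null) {
            store.put(key, loaded);
        }
        return loaded;
    }

    long hitCount() {
        return hits.sum();
    }

    long missCount() {
        return misses.sum();
    }
}
```

LongAdder is used instead of a plain long so that concurrent request threads can increment the counters without losing updates.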

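Alternatively, use redisson directly.

We integrate with spring cache through RedissonCache, in which case the cache hit metrics are already collected.

Definition of the basic cache metrics:

However, the results are stored row by row:

How can the hit rate be computed from this?

hit-rate = sum(hit) / sum(hit + miss)

So I merged the series by hand: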

DROP CONTINUOUS QUERY "cq_cache_hit" ON "my-db"

DROP CONTINUOUS QUERY "cq_cache_miss" ON "my-db"

DROP MEASUREMENT "cache_hit_rate"

CREATE CONTINUOUS QUERY "cq_cache_hit" ON "my-db" RESAMPLE EVERY 10m BEGIN SELECT sum("value") AS hit INTO "cache_hit_rate" FROM "rp_30days"."cache_gets" WHERE ( "result" = 'hit') GROUP BY time(10m),"app", "cache" fill(0) END

CREATE CONTINUOUS QUERY "cq_cache_miss" ON "my-db" RESAMPLE EVERY 10m BEGIN SELECT sum("value") AS miss INTO "cache_hit_rate" FROM "rp_30days"."cache_gets" WHERE ( "result" = 'miss') GROUP BY time(10m),"app", "cache" fill(0) END
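Once hit and miss sit side by side in cache_hit_rate, the hit-rate formula is simple arithmetic over the 10-minute windows. The sketch below works it through in plain Java; the class HitRate and its method names are hypothetical, and returning 0.0 for an empty window is an assumption, since the formula itself is undefined when hit + miss is 0.

```java
/** Illustrative only: hit-rate arithmetic over per-window hit and miss sums. */
class HitRate {

    /** hit / (hit + miss) for one window; an empty window yields 0.0 by assumption. */
    static double of(long hit, long miss) {
        long total = hit + miss;
        return total == 0 ? 0.0 : (double) hit / total;
    }

    /** sum(hit) / sum(hit + miss) across several windows of equal index. */
    static double overall(long[] hits, long[] misses) {
        if (hits.length != misses.length) {
            throw new IllegalArgumentException("hits and misses must have the same length");
        }
        long hitSum = 0;
        long missSum = 0;
        for (int i = 0; i < hits.length; i++) {
            hitSum += hits[i];
            missSum += misses[i];
        }
        return of(hitSum, missSum);
    }

    public static void main(String[] args) {
        // Two windows: 3 hits / 1 miss, then 1 hit / 3 misses.
        System.out.println(overall(new long[] {3, 1}, new long[] {1, 3}));
    }
}
```

For the two sample windows the sums are 4 hits and 4 misses, so the overall hit rate is 0.5.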

Monitoring and alerting

Grafana provides an alert feature: when a queried metric does not meet its threshold, an alert is sent.
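The kind of condition such an alert evaluates can be sketched as follows, using the numeric health gauge from earlier (100 for up, 0 otherwise). This is not Grafana code; the class HealthAlertSketch is hypothetical, and treating an empty result as alerting is an assumption made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: "average of recent samples is below a threshold". */
class HealthAlertSketch {

    /**
     * Returns true when the average of the recent samples is below the threshold.
     * An empty sample list is treated as alerting (an assumption of this sketch).
     */
    static boolean shouldAlert(List<Double> recentValues, double threshold) {
        if (recentValues.isEmpty()) {
            return true;
        }
        double sum = 0;
        for (double v : recentValues) {
            sum += v;
        }
        return sum / recentValues.size() < threshold;
    }

    public static void main(String[] args) {
        // One of the two recent samples was 0 (down), so the average drops to 50.
        System.out.println(shouldAlert(Arrays.asList(100.0, 0.0), 100.0));
    }
}
```

With a threshold of 100, any sample below 100 in the window pulls the average down and triggers the alert, which is why mapping up/down to 100/0 makes the health status usable as a threshold.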

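influxdb or Prometheus?

As for where to store the collected metrics, most tutorials use Prometheus, whose ecosystem is fairly complete. I chose influxdb purely because of container networking. Prometheus has to reach each instance to pull data, so it must be allowed into the business networks. I would have had to keep opening up networks, and the networks of different k8s clusters are not connected to each other; I found no way to bridge them. With influx, the instances only need to push data. By the same reasoning, es is also an option.

influxdb is limited by running as a single node, and it has stability problems once the data volume grows. Data has to be computed over sensible time intervals. For example, for queries spanning days or months, summarize the fine-grained statistics in advance.

Another option that is said to scale without limit is OpenTSDB. I have not looked into it yet.

Problems you will run into

The current demo runs a single influxdb node, which is extremely fragile: a query over a slightly longer time range brings it down. It is only good for a demo, or for simple real-time views such as the last 15 minutes. For long-range aggregation over several months or a year, coarse-grained summaries have to be computed in advance with aggregate functions.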
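The idea of summarizing in advance is to roll fine-grained points up into coarse time buckets once, so that long-range queries read a few pre-computed rows instead of scanning raw data. In influxdb this would normally be done by a continuous query like the ones above; the plain Java sketch below only illustrates the bucketing arithmetic, and the class Downsampler is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: sums raw points into fixed-size time buckets. */
class Downsampler {

    /**
     * Groups each point into the bucket that contains its timestamp and sums the values.
     * The returned map is keyed by bucket start time in milliseconds, in ascending order.
     */
    static Map<Long, Double> sumByBucket(long[] timestampsMillis, double[] values,
                                         long bucketMillis) {
        if (timestampsMillis.length != values.length) {
            throw new IllegalArgumentException("timestamps and values must have the same length");
        }
        Map<Long, Double> buckets = new TreeMap<>();
        for (int i = 0; i < timestampsMillis.length; i++) {
            // floorDiv rounds toward negative infinity, so the bucket start is never after the point.
            long bucketStart = Math.floorDiv(timestampsMillis[i], bucketMillis) * bucketMillis;
            buckets.merge(bucketStart, values[i], Double::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        long[] ts = {0L, 599_999L, 600_000L};
        double[] vs = {1.0, 2.0, 3.0};
        System.out.println(sumByBucket(ts, vs, tenMinutes));
    }
}
```

In the sample, the first two points fall into the bucket starting at 0 and the third into the bucket starting at 600000, so each bucket sums to 3.0.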

References

https://github.com/OpenTSDB/opentsdb
