1. Background
Some pages in our company's product load slowly, and the logs alone could not tell us which API calls were responsible. I had also been studying influxdb recently, so I decided to build an nginx monitoring pipeline out of influxdb + telegraf + grafana to analyze per-request latency.
2. Installation
My development machine runs CentOS 7, so all of the installation commands below are for CentOS; for other operating systems, please refer to the official documentation.
2.1 influxdb
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.0.x86_64.rpm
sudo yum localinstall influxdb-1.8.0.x86_64.rpm
sudo systemctl start influxdb
Since this is only a test setup, the default configuration is good enough. In a production environment you should at least set up a user and password; the default influxdb configuration file is /etc/influxdb/influxdb.conf.
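As a minimal sketch of that hardening on influxdb 1.x (the user name and password below are placeholders, not from this setup): first create an admin user in the influx shell:

CREATE USER admin WITH PASSWORD 'change-me' WITH ALL PRIVILEGES

then enable authentication in the [http] section of /etc/influxdb/influxdb.conf and restart:

[http]
  auth-enabled = true

sudo systemctl restart influxdb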
After installation and startup, you can type influx in a terminal to try logging in.
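For example, a quick sanity check (the telegraf database will appear later, once telegraf starts writing; with default settings telegraf's influxdb output creates it automatically):

$ influx
> SHOW DATABASES
> quit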
2.2 telegraf
wget https://dl.influxdata.com/telegraf/releases/telegraf-1.14.1-1.x86_64.rpm
sudo yum localinstall telegraf-1.14.1-1.x86_64.rpm
sudo systemctl start telegraf
For now telegraf is also started with its default configuration; the actual configuration is discussed in detail in the next section.
2.3 grafana
wget https://dl.grafana.com/oss/release/grafana-6.7.2-1.x86_64.rpm
sudo yum localinstall grafana-6.7.2-1.x86_64.rpm
sudo systemctl start grafana-server
grafana likewise runs with the default configuration; once it is started, you can reach it at http://your-ip:3000 (the default login is admin/admin, and you will be prompted to change it on first login).
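If you prefer to check from the command line first, grafana exposes a health endpoint (assuming it runs on localhost with the default port):

curl http://localhost:3000/api/health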
3. Configuration
3.1 nginx log_format configuration
The idea of this setup is to have telegraf tail the nginx log, push the parsed fields into influxdb, and then visualize the latency data with grafana. So the first step is to configure nginx logging; here is the log_format I use:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" $http_x_forwarded_for $host '
                '"$upstream_addr" $upstream_status $request_time '
                '$upstream_response_time "$cookie_uin" "$cookie_luin" "$cookie_username"';
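Also make sure nginx actually writes with this format to the file telegraf will tail (the path here matches the telegraf configuration in the next section):

access_log /var/log/nginx/access.log main;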
Many of these fields are not strictly required for this monitoring, but I kept the format identical to our production one. If you are following along with this article, I recommend using the same format: the log format directly drives the telegraf log-parsing configuration, which is the fiddly part, so an identical format helps you avoid strange problems. Once the whole pipeline is working end to end, you can study telegraf's log-parsing format and then adjust your own log_format as needed.
3.2 telegraf input plugin configuration
Here I use the logparser plugin to tail the log file and grok patterns to parse it; the configuration is as follows:
# Stream and parse log file(s).
[[inputs.logparser]]
  ## Log files to parse.
  ## These accept standard unix glob matching rules, but with the addition of
  ## ** as a "super asterisk". ie:
  ##   /var/log/**.log     -> recursively find all .log files in /var/log
  ##   /var/log/*/*.log    -> find all .log files with a parent dir in /var/log
  ##   /var/log/apache.log -> only tail the apache log file
  files = ["/var/log/nginx/access.log"]

  ## Read files that currently exist from the beginning. Files that are created
  ## while telegraf is running (and that match the "files" globs) will always
  ## be read from the beginning.
  from_beginning = false

  ## Method used to watch for file updates. Can be either "inotify" or "poll".
  # watch_method = "inotify"

  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    ## This is a list of patterns to check the given log file(s) for.
    ## Note that adding patterns here increases processing time. The most
    ## efficient configuration is to have one pattern per logparser.
    ## Other common built-in patterns are:
    ##   %{COMMON_LOG_FORMAT}   (plain apache & nginx access logs)
    ##   %{COMBINED_LOG_FORMAT} (access logs + referrer & agent)
    # patterns = ["%{COMMON_LOG_FORMAT}"]
    patterns = ["%{NGINX_ACCESS_LOG}"]

    ## Name of the outputted measurement name.
    measurement = "nginx_access_log"

    ## Full path(s) to custom pattern files.
    custom_pattern_files = []

    ## Custom patterns can also be defined here. Put one pattern per line.
    custom_patterns = '''
NGINX_ACCESS_LOG %{IP:remote_addr} - (-|%{WORD:remote_user}) \[%{HTTPDATE:time_local}\] %{QS:request} %{NUMBER:status:int} %{NUMBER:body_bytes_sent:int} %{QS:referrer} %{QS:agent} %{IPORHOST:xforwardedfor} %{IPORHOST:host} %{QS:upstream_addr} (-|%{NUMBER:upstream_status:int}) %{BASE10NUM:request_time:float} (-|%{BASE10NUM:upstream_response_time:float}) %{QS:cookie_uin} %{QS:cookie_luin} %{QS:cookie_username}
    '''

    ## Timezone allows you to provide an override for timestamps that
    ## don't already include an offset
    ## e.g. 04/06/2016 12:41:45 data one two 5.43µs
    ##
    ## Default: "" which renders UTC
    ## Options are as follows:
    ##   1. Local            -- interpret based on machine localtime
    ##   2. "Canada/Eastern" -- Unix TZ values like those found in https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
    ##   3. UTC              -- or blank/unspecified, will return timestamp in UTC
    # timezone = "Canada/Eastern"

    ## When set to "disable", timestamp will not be incremented if there is a
    ## duplicate.
    # unique_timestamp = "auto"
The heart of this configuration is the NGINX_ACCESS_LOG pattern, which determines how each nginx log line is parsed. It is written in grok syntax; see the references for more on grok.
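For illustration, here is a hypothetical log line (all values invented) in the format above that the NGINX_ACCESS_LOG pattern should match, yielding request_time = 0.123 and upstream_response_time = 0.120 as floats:

10.0.0.1 - - [07/Apr/2020:15:04:05 +0800] "GET /api/user/info HTTP/1.1" 200 512 "-" "Mozilla/5.0" 10.0.0.2 example.com "127.0.0.1:8080" 200 0.123 0.120 "-" "-" "-"

You can paste lines like this into the grok debugger listed in the references to verify a pattern before deploying it.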
After editing the file, remember to run systemctl restart telegraf for the change to take effect.
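To confirm data is flowing, you can watch telegraf's log for grok parse errors and then query influxdb (telegraf is the default database name for telegraf's influxdb output; adjust if yours differs):

journalctl -u telegraf -f
influx -database 'telegraf' -execute 'SELECT * FROM "nginx_access_log" ORDER BY time DESC LIMIT 5'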
3.3 grafana configuration
grafana first needs an influxdb data source; see the grafana documentation for how to set one up. Here I will focus on how to query influxdb and chart the results, mainly using the request_time field parsed from the log.
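As an illustrative sketch of a panel query (the measurement and field names come from the telegraf configuration above; $timeFilter and $__interval are grafana's built-in template variables), charting the mean and maximum latency over time might look like:

SELECT mean("request_time") AS "mean_latency", max("request_time") AS "max_latency"
FROM "nginx_access_log"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)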
4. Results
As the dashboard shows, nginx request latency is now clearly visible. But this is not the end of the story, because we have not fully reached our goal of pinpointing which APIs are slow. What we have is a global view: we can tell whether anything is slow right now, but not which endpoint it is (short of querying influxdb directly). Thinking about why, grafana is really a tool for displaying metrics, and metrics are about aggregates, whereas finding the slow endpoint is about individual requests; that is hard to express in grafana (unless you create one query per endpoint, which would be far too tedious to configure). Properly solving this requires a distributed tracing system, which I am still evaluating; once it is deployed, I will write a follow-up article about tracing.
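For completeness, the kind of ad-hoc query I mean by "querying influxdb directly" (the 1-second threshold is arbitrary) is something like:

SELECT "request", "request_time" FROM "nginx_access_log" WHERE "request_time" > 1 ORDER BY time DESC LIMIT 20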
5. References
1. grok patterns: http://grokdebug.herokuapp.com/patterns#
2. grok debugger: http://grokdebug.herokuapp.com/
3. telegraf documentation: https://docs.influxdata.com/telegraf/v1.14/
4. grok input data format: https://docs.influxdata.com/telegraf/v1.14/data_formats/input/grok/