1. Background
Some pages in our company's product load slowly, and the logs alone could not tell us which API calls were responsible. I had also been studying influxdb recently, so I decided to build an nginx monitoring pipeline out of influxdb + telegraf + grafana to analyze per-request latency.
2. Installation
My development machine runs CentOS 7, so all of the installation commands below are for CentOS; for other operating systems, please refer to the official documentation.
2.1 influxdb
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.0.x86_64.rpm
sudo yum localinstall influxdb-1.8.0.x86_64.rpm
sudo systemctl start influxdb
Since this is only a test setup, the default configuration is good enough. In a production environment you should at least set up a user and password; the default influxdb configuration file is /etc/influxdb/influxdb.conf.
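As a minimal sketch of that hardening on influxdb 1.x (the user name and password below are placeholders, not from this setup): first create an admin user in the influx shell:

CREATE USER admin WITH PASSWORD 'change-me' WITH ALL PRIVILEGES

then enable authentication in the [http] section of /etc/influxdb/influxdb.conf and restart:

[http]
  auth-enabled = true

sudo systemctl restart influxdb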
After installation and startup, you can type influx in a terminal to try logging in.
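For example, a quick sanity check (the telegraf database will appear later, once telegraf starts writing; with default settings telegraf's influxdb output creates it automatically):

$ influx
> SHOW DATABASES
> quit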
2.2 telegraf
wget https://dl.influxdata.com/telegraf/releases/telegraf-1.14.1-1.x86_64.rpm
sudo yum localinstall telegraf-1.14.1-1.x86_64.rpm
sudo systemctl start telegraf
For now telegraf is also started with its default configuration; the actual configuration is discussed in detail in the next section.
2.3 grafana
wget https://dl.grafana.com/oss/release/grafana-6.7.2-1.x86_64.rpm
sudo yum localinstall grafana-6.7.2-1.x86_64.rpm
sudo systemctl start grafana-server
grafana likewise runs with the default configuration; once it is started, you can reach it at http://your-ip:3000 (the default login is admin/admin, and you will be prompted to change it on first login).
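If you prefer to check from the command line first, grafana exposes a health endpoint (assuming it runs on localhost with the default port):

curl http://localhost:3000/api/health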
3. Configuration
3.1 nginx log_format configuration
The idea of this setup is to have telegraf tail the nginx log, push the parsed fields into influxdb, and then visualize the latency data with grafana. So the first step is to configure nginx logging; here is the log_format I use:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" $http_x_forwarded_for $host '
                '"$upstream_addr" $upstream_status $request_time '
                '$upstream_response_time "$cookie_uin" "$cookie_luin" "$cookie_username"';
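Also make sure nginx actually writes with this format to the file telegraf will tail (the path here matches the telegraf configuration in the next section):

access_log /var/log/nginx/access.log main;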
Many of these fields are not strictly required for this monitoring, but I kept the format identical to our production one. If you are following along with this article, I recommend using the same format: the log format directly drives the telegraf log-parsing configuration, which is the fiddly part, so an identical format helps you avoid strange problems. Once the whole pipeline is working end to end, you can study telegraf's log-parsing format and then adjust your own log_format as needed.
3.2 telegraf input plugin configuration
Here I use the logparser plugin to tail the log file and grok patterns to parse it; the configuration is as follows:
# Stream and parse log file(s).
[[inputs.logparser]]
  ## Log files to parse.
  ## These accept standard unix glob matching rules, but with the addition of
  ## ** as a "super asterisk". ie:
  ##   /var/log/**.log     -> recursively find all .log files in /var/log
  ##   /var/log/*/*.log    -> find all .log files with a parent dir in /var/log
  ##   /var/log/apache.log -> only tail the apache log file
  files = ["/var/log/nginx/access.log"]

  ## Read files that currently exist from the beginning. Files that are created
  ## while telegraf is running (and that match the "files" globs) will always
  ## be read from the beginning.
  from_beginning = false

  ## Method used to watch for file updates. Can be either "inotify" or "poll".
  # watch_method = "inotify"

  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    ## This is a list of patterns to check the given log file(s) for.
    ## Note that adding patterns here increases processing time. The most
    ## efficient configuration is to have one pattern per logparser.
    ## Other common built-in patterns are:
    ##   %{COMMON_LOG_FORMAT}   (plain apache & nginx access logs)
    ##   %{COMBINED_LOG_FORMAT} (access logs + referrer & agent)
    # patterns = ["%{COMMON_LOG_FORMAT}"]
    patterns = ["%{NGINX_ACCESS_LOG}"]

    ## Name of the outputted measurement name.
    measurement = "nginx_access_log"

    ## Full path(s) to custom pattern files.
    custom_pattern_files = []

    ## Custom patterns can also be defined here. Put one pattern per line.
    custom_patterns = '''
NGINX_ACCESS_LOG %{IP:remote_addr} - (-|%{WORD:remote_user}) \[%{HTTPDATE:time_local}\] %{QS:request} %{NUMBER:status:int} %{NUMBER:body_bytes_sent:int} %{QS:referrer} %{QS:agent} %{IPORHOST:xforwardedfor} %{IPORHOST:host} %{QS:upstream_addr} (-|%{NUMBER:upstream_status:int}) %{BASE10NUM:request_time:float} (-|%{BASE10NUM:upstream_response_time:float}) %{QS:cookie_uin} %{QS:cookie_luin} %{QS:cookie_username}
    '''

    ## Timezone allows you to provide an override for timestamps that
    ## don't already include an offset
    ## e.g. 04/06/2016 12:41:45 data one two 5.43µs
    ##
    ## Default: "" which renders UTC
    ## Options are as follows:
    ##   1. Local            -- interpret based on machine localtime
    ##   2. "Canada/Eastern" -- Unix TZ values like those found in https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
    ##   3. UTC              -- or blank/unspecified, will return timestamp in UTC
    # timezone = "Canada/Eastern"

    ## When set to "disable", timestamp will not be incremented if there is a
    ## duplicate.
    # unique_timestamp = "auto"
The heart of this configuration is the NGINX_ACCESS_LOG pattern, which determines how each nginx log line is parsed. It is written in grok syntax; see the references for more on grok.
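For illustration, here is a hypothetical log line (all values invented) in the format above that the NGINX_ACCESS_LOG pattern should match, yielding request_time = 0.123 and upstream_response_time = 0.120 as floats:

10.0.0.1 - - [07/Apr/2020:15:04:05 +0800] "GET /api/user/info HTTP/1.1" 200 512 "-" "Mozilla/5.0" 10.0.0.2 example.com "127.0.0.1:8080" 200 0.123 0.120 "-" "-" "-"

You can paste lines like this into the grok debugger listed in the references to verify a pattern before deploying it.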
After editing the file, remember to run systemctl restart telegraf for the change to take effect.
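To confirm data is flowing, you can watch telegraf's log for grok parse errors and then query influxdb (telegraf is the default database name for telegraf's influxdb output; adjust if yours differs):

journalctl -u telegraf -f
influx -database 'telegraf' -execute 'SELECT * FROM "nginx_access_log" ORDER BY time DESC LIMIT 5'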
3.3 grafana configuration
grafana first needs an influxdb data source; see the grafana documentation for how to set one up. Here I will focus on how to query influxdb and chart the results, mainly using the request_time field parsed from the log.
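As an illustrative sketch of a panel query (the measurement and field names come from the telegraf configuration above; $timeFilter and $__interval are grafana's built-in template variables), charting the mean and maximum latency over time might look like:

SELECT mean("request_time") AS "mean_latency", max("request_time") AS "max_latency"
FROM "nginx_access_log"
WHERE $timeFilter
GROUP BY time($__interval) fill(null)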
4. Results
As the dashboard shows, nginx request latency is now clearly visible. But this is not the end of the story, because we have not fully reached our goal of pinpointing which APIs are slow. What we have is a global view: we can tell whether anything is slow right now, but not which endpoint it is (short of querying influxdb directly). Thinking about why, grafana is really a tool for displaying metrics, and metrics are about aggregates, whereas finding the slow endpoint is about individual requests; that is hard to express in grafana (unless you create one query per endpoint, which would be far too tedious to configure). Properly solving this requires a distributed tracing system, which I am still evaluating; once it is deployed, I will write a follow-up article about tracing.
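For completeness, the kind of ad-hoc query I mean by "querying influxdb directly" (the 1-second threshold is arbitrary) is something like:

SELECT "request", "request_time" FROM "nginx_access_log" WHERE "request_time" > 1 ORDER BY time DESC LIMIT 20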
5. References
1. grok patterns: http://grokdebug.herokuapp.com/patterns#
2. grok debugger: http://grokdebug.herokuapp.com/
3. telegraf documentation: https://docs.influxdata.com/telegraf/v1.14/
4. grok input data format: https://docs.influxdata.com/telegraf/v1.14/data_formats/input/grok/