Problem Description
After migrating applications from a virtual machine environment to Kubernetes, some applications intermittently saw failed requests: the application itself logged nothing, while NGINX recorded a 502 error. We reviewed access records from the earlier VM environment and did not find this problem there.
Basic Information
# Request flow
client --> nginx (nginx-ingress-controller) --> tomcat (container)
# nginx version
$ nginx -V
nginx version: nginx/1.15.5
built by gcc 8.2.0 (Debian 8.2.0-7)
built with OpenSSL 1.1.0h  27 Mar 2018
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --modules-path=/etc/nginx/modules --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-compat --with-pcre-jit --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_addition_module --with-http_dav_module --with-http_geoip_module --with-http_gzip_static_module --with-http_sub_module --with-http_v2_module --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-threads --with-http_secure_link_module --with-http_gunzip_module --with-file-aio --without-mail_pop3_module --without-mail_smtp_module --without-mail_imap_module --without-http_uwsgi_module --without-http_scgi_module --with-cc-opt='-g -Og -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wno-deprecated-declarations -fno-strict-aliasing -D_FORTIFY_SOURCE=2 --param=ssp-buffer-size=4 -DTCP_FASTOPEN=23 -fPIC -I/root/.hunter/_Base/2c5c6fc/a134798/92161a9/Install/include -Wno-cast-function-type -m64 -mtune=native' --with-ld-opt='-fPIE -fPIC -pie -Wl,-z,relro -Wl,-z,now -L/root/.hunter/_Base/2c5c6fc/a134798/92161a9/Install/lib' --user=www-data --group=www-data --add-module=/tmp/build/ngx_devel_kit-0.3.1rc1 --add-module=/tmp/build/set-misc-nginx-module-0.32 --add-module=/tmp/build/headers-more-nginx-module-0.33 --add-module=/tmp/build/nginx-goodies-nginx-sticky-module-ng-08a395c66e42 --add-module=/tmp/build/nginx-http-auth-digest-274490cec649e7300fea97fed13d84e596bbc0ce --add-module=/tmp/build/ngx_http_substitutions_filter_module-bc58cb11844bc42735bbaef7085ea86ace46d05b --add-module=/tmp/build/lua-nginx-module-e94f2e5d64daa45ff396e262d8dab8e56f5f10e0 --add-module=/tmp/build/lua-upstream-nginx-module-0.07 --add-module=/tmp/build/nginx_cookie_flag_module-1.1.0 --add-module=/tmp/build/nginx-influxdb-module-f20cfb2458c338f162132f5a21eb021e2cbe6383 --add-dynamic-module=/tmp/build/nginx-opentracing-0.6.0/opentracing --add-dynamic-module=/tmp/build/ModSecurity-nginx-37b76e88df4bce8a9846345c27271d7e6ce1acfb --add-dynamic-module=/tmp/build/ngx_http_geoip2_module-3.0 --add-module=/tmp/build/nginx_ajp_module-bf6cd93f2098b59260de8d494f0f4b1f11a84627 --add-module=/tmp/build/ngx_brotli
# nginx-ingress-controller version
0.20.0
# tomcat version
tomcat 7.0.72
# kubernetes version
[10:16:48 openshift@st2-control-111a27 ~]$ oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://paas-sh-01.99bill.com:443
openshift v3.11.0+0323629-61
kubernetes v1.11.0+d4cacc0

[10:16:56 openshift@st2-control-111a27 ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-10-15T09:45:30Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-11-20T19:51:55Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Investigation
1. Check the logs. There are two kinds of error entries.
# nginx error.log
2019/11/08 07:59:43 [error] 189#189: *1707448 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.55.200, server: _, request: "GET /dbpool-server/ HTTP/1.1", upstream: "http://10.128.8.193:8080/dbpool-server/", host: "192.168.111.24:8090"
2019/11/08 07:00:00 [error] 188#188: *1691956 upstream prematurely closed connection while reading response header from upstream, client: 192.168.55.200, server: _, request: "GET /dbpool-server/ HTTP/1.1", upstream: "http://10.128.8.193:8080/dbpool-server/", host: "192.168.111.24:8090"
# nginx upstream.log
- 2019-11-08T07:59:43+08:00 GET /dbpool-server/ HTTP/1.1 502 0.021 10.128.8.193:8080 502 0.021 - 192.168.111.24
- 2019-11-08T07:00:00+08:00 GET /dbpool-server/ HTTP/1.1 502 0.006 10.128.8.193:8080 502 0.006 - 192.168.111.24
# tomcat access log for the same period
192.168.55.200 - - [08/Nov/2019:06:59:00 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:06:59:20 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:06:59:40 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:07:00:20 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:07:00:40 +0800] GET /dbpool-server/ HTTP/1.1 200 222 1
...
192.168.55.200 - - [08/Nov/2019:07:59:03 +0800] GET /dbpool-server/ HTTP/1.1 200 222 1
192.168.55.200 - - [08/Nov/2019:07:59:23 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:08:00:03 +0800] GET /dbpool-server/ HTTP/1.1 200 222 1
192.168.55.200 - - [08/Nov/2019:08:00:23 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:08:00:43 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
Analysis
Searching the web for this symptom turned up little useful information. Moreover, a "Connection reset" at a proxy such as nginx can occur in many scenarios, for many different reasons, so the error alone did not point to a cause.
The problem had never appeared in the VM environment, so we compared the two setups and found that in the VM environment nginx and tomcat spoke HTTP/1.0, whereas in the Kubernetes environment they speak HTTP/1.1. The biggest difference is that HTTP/1.1 enables keepalive by default, which led us to suspect that a mismatch between the keepalive timeout configured in tomcat and the one used by nginx was causing the failures.
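One way to confirm what the controller actually renders is to look at the generated nginx.conf inside the ingress controller pod. This is only a sketch: the pod name is the controller pod listed in the reproduction section below, and the namespace flag is omitted, so it may need to be added for your deployment.

```sh
# Sketch: check which HTTP version and keepalive settings the ingress controller
# renders for its upstreams (add -n <namespace> if the controller runs in another namespace).
kubectl exec nginx-ingress-inside-controller-98b95b554-6hxbw -- \
  grep -nE 'proxy_http_version|keepalive' /etc/nginx/nginx.conf
```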
# tomcat keepalive configuration (in the server.xml file)
...
<Connector acceptCount="800" port="${http.port}" protocol="HTTP/1.1" executor="tomcatThreadPool" enableLookups="false" connectionTimeout="20000" maxThreads="1024" disableUploadTimeout="true" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
...
According to the official Tomcat documentation, keepAliveTimeout defaults to the value of connectionTimeout, which we have set to 20s here.
| Parameter | Description |
| --- | --- |
| keepAliveTimeout | The number of milliseconds this Connector will wait for another HTTP request before closing the connection. The default value is to use the value that has been set for the connectionTimeout attribute. Use a value of -1 to indicate no (i.e. infinite) timeout. |
| maxKeepAliveRequests | The maximum number of HTTP requests which can be pipelined until the connection is closed by the server. Setting this attribute to 1 will disable HTTP/1.0 keep-alive, as well as HTTP/1.1 keep-alive and pipelining. Setting this to -1 will allow an unlimited amount of pipelined or keep-alive HTTP requests. If not specified, this attribute is set to 100. |
| connectionTimeout | The number of milliseconds this Connector will wait, after accepting a connection, for the request URI line to be presented. Use a value of -1 to indicate no (i.e. infinite) timeout. The default value is 60000 (i.e. 60 seconds) but note that the standard server.xml that ships with Tomcat sets this to 20000 (i.e. 20 seconds). Unless disableUploadTimeout is set to false, this timeout will also be used when reading the request body (if any). |
Tomcat official documentation: https://tomcat.apache.org/tomcat-7.0-doc/config/http.html
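If the timeout should not simply follow connectionTimeout, keepAliveTimeout can also be set explicitly on the Connector. A minimal sketch using the attributes documented above; the port and values are illustrative, not our production configuration:

```xml
<!-- Sketch: explicit keepalive settings on the HTTP/1.1 connector in server.xml.
     keepAliveTimeout is in milliseconds and falls back to connectionTimeout when omitted. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           keepAliveTimeout="20000"
           maxKeepAliveRequests="100"/>
```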
# Configuration in nginx (nginx-ingress-controller)
Since we do not configure this explicitly, nginx uses its default, which is 60s.
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
Syntax: keepalive_timeout timeout;
Default: keepalive_timeout 60s;
Context: upstream
This directive appeared in version 1.15.3.
Sets a timeout during which an idle keepalive connection to an upstream server will stay open.
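For comparison, in a hand-written nginx configuration (outside the ingress controller) this directive lives in an upstream block alongside keepalive. A minimal sketch based on the documentation above; the upstream address mirrors the pod IP seen in the logs and is illustrative only:

```nginx
upstream tomcat_backend {
    server 10.128.8.193:8080;
    keepalive 32;               # pool of idle keepalive connections per worker
    keepalive_timeout 15s;      # keep this below tomcat's keepAliveTimeout (20s here)
}

server {
    listen 8090;
    location /dbpool-server/ {
        proxy_http_version 1.1;          # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";  # drop the "Connection: close" nginx sends by default
        proxy_pass http://tomcat_backend;
    }
}
```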
Solution
With this working hypothesis, we tried setting nginx's upstream keepalive timeout to 15s (it has to be smaller than tomcat's timeout). After testing, the failures no longer occurred.
Because the nginx-ingress-controller version we run is 0.20.0, which does not yet expose this value as a configuration option, we modified the configuration template instead, as follows:
...
{{ if $all.DynamicConfigurationEnabled }}
upstream upstream_balancer {
        server 0.0.0.1; # placeholder

        balancer_by_lua_block {
          balancer.balance()
        }

        {{ if (gt $cfg.UpstreamKeepaliveConnections 0) }}
        keepalive {{ $cfg.UpstreamKeepaliveConnections }};
        {{ end }}

        # ADD -- START
        keepalive_timeout 15s;
        # ADD -- END
}
{{ end }}
...
However, starting with version 0.21.0, this value can be changed through the configuration option [upstream-keepalive-timeout](#upstream-keepalive-timeout) (type int, default 60).
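On 0.21.0 and later the template patch is therefore unnecessary; the value can be set through the controller's ConfigMap. A sketch, assuming a ConfigMap name and namespace commonly used by ingress-nginx deployments (adjust both to your installation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration      # assumption: match the controller's --configmap flag
  namespace: ingress-nginx       # assumption: match the controller's namespace
data:
  upstream-keepalive-timeout: "15"   # seconds; keep it below tomcat's keepAliveTimeout
```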
A side thought: in the normal case it is nginx that initiates connections to tomcat, so it is also reasonable for nginx to be the side that closes them proactively; this avoids a class of problems caused by protocol configuration mismatches or implementation quirks.
Reproducing the Problem
# Prepare a test application and start a test instance
# pod instance
[10:18:49 openshift@st2-control-111a27 ~]$ kubectl get pods dbpool-server-f5d64996d-5vxqc
NAME                            READY     STATUS    RESTARTS   AGE
dbpool-server-f5d64996d-5vxqc   1/1       Running   0          1d
# nginx environment
# ingress-nginx-controller info
[10:21:24 openshift@st2-control-111a27 ~]$ kubectl get pods -owide nginx-ingress-inside-controller-98b95b554-6hxbw
NAME                                              READY     STATUS    RESTARTS   AGE       IP               NODE                        NOMINATED NODE
nginx-ingress-inside-controller-98b95b554-6hxbw   1/1       Running   0          3d        192.168.111.24   paas-ing010001.99bill.com   <none>

# ingress info
[10:19:58 openshift@st2-control-111a27 ~]$ kubectl get ingress dbpool-server-inside -oyaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: pletest-inside
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/upstream-vhost: aaaaaaaaaaaaaaaaaaaaa
  generation: 1
  name: dbpool-server-inside
  namespace: pletest
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: dbpool-server
          servicePort: http
        path: /dbpool-server/
# Test access script
The script keeps requesting the application continuously, since we do not know when the problem will occur and cannot trigger it by hand.
#!/bin/sh
n=1
while true
do
    curl http://192.168.111.24:8090/dbpool-server/
    # 20 here matches tomcat's keepalive timeout; we also tried 1, but could not reproduce the problem with it
    sleep 20
    n=$((n+1))
    echo $n
done
# Packet capture analysis
Capture packets on the nginx side:
tcpdump -i vxlan_sys_4789 host 10.128.8.193 -w /tmp/20191106-2.pcap
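When reading the capture back, it helps to jump straight to the abnormal teardown. A small sketch using the same capture file; it simply filters for segments that have the RST flag set:

```sh
# Sketch: list only the RST segments in the capture taken above
tcpdump -nn -r /tmp/20191106-2.pcap 'tcp[tcpflags] & tcp-rst != 0'
```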
The packets below show that after the TCP connection had been idle for 20s, tomcat initiated the close; but right at that moment nginx sent another request on the connection, and tomcat did not answer it, instead terminating the connection abnormally with an RST. This also explains why tomcat logged nothing for the failed request.
(nginx-side packets for the 104: Connection reset by peer error)
(nginx-side packets for the upstream prematurely closed connection error)
(tomcat-side packets for the 104: Connection reset by peer error)
(tomcat-side packets for the upstream prematurely closed connection error)
Community Issues About This Problem
This problem has also been raised in the nginx-ingress community; see the following links: