Problem Description
After migrating applications from a virtual machine environment to Kubernetes, some applications intermittently saw failed requests: the application itself logged nothing, while NGINX recorded a 502 error. We reviewed access records from the earlier VM environment and did not find this problem there.
Basic Information
# Request flow
client --> nginx (nginx-ingress-controller) --> tomcat (container)
# nginx version
$ nginx -V
nginx version: nginx/1.15.5
built by gcc 8.2.0 (Debian 8.2.0-7)
built with OpenSSL 1.1.0h  27 Mar 2018
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --modules-path=/etc/nginx/modules --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-compat --with-pcre-jit --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_addition_module --with-http_dav_module --with-http_geoip_module --with-http_gzip_static_module --with-http_sub_module --with-http_v2_module --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-threads --with-http_secure_link_module --with-http_gunzip_module --with-file-aio --without-mail_pop3_module --without-mail_smtp_module --without-mail_imap_module --without-http_uwsgi_module --without-http_scgi_module --with-cc-opt='-g -Og -fPIE -fstack-protector-strong -Wformat -Werror=format-security -Wno-deprecated-declarations -fno-strict-aliasing -D_FORTIFY_SOURCE=2 --param=ssp-buffer-size=4 -DTCP_FASTOPEN=23 -fPIC -I/root/.hunter/_Base/2c5c6fc/a134798/92161a9/Install/include -Wno-cast-function-type -m64 -mtune=native' --with-ld-opt='-fPIE -fPIC -pie -Wl,-z,relro -Wl,-z,now -L/root/.hunter/_Base/2c5c6fc/a134798/92161a9/Install/lib' --user=www-data --group=www-data --add-module=/tmp/build/ngx_devel_kit-0.3.1rc1 --add-module=/tmp/build/set-misc-nginx-module-0.32 --add-module=/tmp/build/headers-more-nginx-module-0.33 --add-module=/tmp/build/nginx-goodies-nginx-sticky-module-ng-08a395c66e42 --add-module=/tmp/build/nginx-http-auth-digest-274490cec649e7300fea97fed13d84e596bbc0ce --add-module=/tmp/build/ngx_http_substitutions_filter_module-bc58cb11844bc42735bbaef7085ea86ace46d05b --add-module=/tmp/build/lua-nginx-module-e94f2e5d64daa45ff396e262d8dab8e56f5f10e0 --add-module=/tmp/build/lua-upstream-nginx-module-0.07 --add-module=/tmp/build/nginx_cookie_flag_module-1.1.0 --add-module=/tmp/build/nginx-influxdb-module-f20cfb2458c338f162132f5a21eb021e2cbe6383 --add-dynamic-module=/tmp/build/nginx-opentracing-0.6.0/opentracing --add-dynamic-module=/tmp/build/ModSecurity-nginx-37b76e88df4bce8a9846345c27271d7e6ce1acfb --add-dynamic-module=/tmp/build/ngx_http_geoip2_module-3.0 --add-module=/tmp/build/nginx_ajp_module-bf6cd93f2098b59260de8d494f0f4b1f11a84627 --add-module=/tmp/build/ngx_brotli
# nginx-ingress-controller version
0.20.0
# tomcat version
tomcat 7.0.72
# kubernetes version
[10:16:48 openshift@st2-control-111a27 ~]$ oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://paas-sh-01.99bill.com:443
openshift v3.11.0+0323629-61
kubernetes v1.11.0+d4cacc0

[10:16:56 openshift@st2-control-111a27 ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-10-15T09:45:30Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2018-11-20T19:51:55Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Investigation
1. Check the logs. There are two kinds of error entries.
# nginx error.log
2019/11/08 07:59:43 [error] 189#189: *1707448 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.55.200, server: _, request: "GET /dbpool-server/ HTTP/1.1", upstream: "http://10.128.8.193:8080/dbpool-server/", host: "192.168.111.24:8090"
2019/11/08 07:00:00 [error] 188#188: *1691956 upstream prematurely closed connection while reading response header from upstream, client: 192.168.55.200, server: _, request: "GET /dbpool-server/ HTTP/1.1", upstream: "http://10.128.8.193:8080/dbpool-server/", host: "192.168.111.24:8090"
# nginx upstream.log
- 2019-11-08T07:59:43+08:00 GET /dbpool-server/ HTTP/1.1 502 0.021 10.128.8.193:8080 502 0.021 - 192.168.111.24
- 2019-11-08T07:00:00+08:00 GET /dbpool-server/ HTTP/1.1 502 0.006 10.128.8.193:8080 502 0.006 - 192.168.111.24
# tomcat access log for the same period
192.168.55.200 - - [08/Nov/2019:06:59:00 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:06:59:20 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:06:59:40 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:07:00:20 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:07:00:40 +0800] GET /dbpool-server/ HTTP/1.1 200 222 1
...
192.168.55.200 - - [08/Nov/2019:07:59:03 +0800] GET /dbpool-server/ HTTP/1.1 200 222 1
192.168.55.200 - - [08/Nov/2019:07:59:23 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:08:00:03 +0800] GET /dbpool-server/ HTTP/1.1 200 222 1
192.168.55.200 - - [08/Nov/2019:08:00:23 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
192.168.55.200 - - [08/Nov/2019:08:00:43 +0800] GET /dbpool-server/ HTTP/1.1 200 222 0
Analysis
Searching the web for this symptom turned up little useful information. Moreover, a "Connection reset" at a proxy such as nginx can occur in many scenarios, for many different reasons, so the error alone did not point to a cause.
The problem had never appeared in the VM environment, so we compared the two setups and found that in the VM environment nginx and tomcat spoke HTTP/1.0, whereas in the Kubernetes environment they speak HTTP/1.1. The biggest difference is that HTTP/1.1 enables keepalive by default, which led us to suspect that a mismatch between the keepalive timeout configured in tomcat and the one used by nginx was causing the failures.
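One way to confirm what the controller actually renders is to look at the generated nginx.conf inside the ingress controller pod. This is only a sketch: the pod name is the controller pod listed in the reproduction section below, and the namespace flag is omitted, so it may need to be added for your deployment.

```sh
# Sketch: check which HTTP version and keepalive settings the ingress controller
# renders for its upstreams (add -n <namespace> if the controller runs in another namespace).
kubectl exec nginx-ingress-inside-controller-98b95b554-6hxbw -- \
  grep -nE 'proxy_http_version|keepalive' /etc/nginx/nginx.conf
```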
# tomcat keepalive configuration (in the server.xml file)
...
<Connector acceptCount="800" port="${http.port}" protocol="HTTP/1.1" executor="tomcatThreadPool" enableLookups="false" connectionTimeout="20000" maxThreads="1024" disableUploadTimeout="true" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
...
According to the official Tomcat documentation, keepAliveTimeout defaults to the value of connectionTimeout, which we have set to 20s here.
| Parameter | Description |
| --- | --- |
| keepAliveTimeout | The number of milliseconds this Connector will wait for another HTTP request before closing the connection. The default value is to use the value that has been set for the connectionTimeout attribute. Use a value of -1 to indicate no (i.e. infinite) timeout. |
| maxKeepAliveRequests | The maximum number of HTTP requests which can be pipelined until the connection is closed by the server. Setting this attribute to 1 will disable HTTP/1.0 keep-alive, as well as HTTP/1.1 keep-alive and pipelining. Setting this to -1 will allow an unlimited amount of pipelined or keep-alive HTTP requests. If not specified, this attribute is set to 100. |
| connectionTimeout | The number of milliseconds this Connector will wait, after accepting a connection, for the request URI line to be presented. Use a value of -1 to indicate no (i.e. infinite) timeout. The default value is 60000 (i.e. 60 seconds) but note that the standard server.xml that ships with Tomcat sets this to 20000 (i.e. 20 seconds). Unless disableUploadTimeout is set to false, this timeout will also be used when reading the request body (if any). |
Tomcat official documentation: https://tomcat.apache.org/tomcat-7.0-doc/config/http.html
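If the timeout should not simply follow connectionTimeout, keepAliveTimeout can also be set explicitly on the Connector. A minimal sketch using the attributes documented above; the port and values are illustrative, not our production configuration:

```xml
<!-- Sketch: explicit keepalive settings on the HTTP/1.1 connector in server.xml.
     keepAliveTimeout is in milliseconds and falls back to connectionTimeout when omitted. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           keepAliveTimeout="20000"
           maxKeepAliveRequests="100"/>
```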
# Configuration in nginx (nginx-ingress-controller)
Since we do not configure this explicitly, nginx uses its default, which is 60s.
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout
Syntax: keepalive_timeout timeout;
Default: keepalive_timeout 60s;
Context: upstream
This directive appeared in version 1.15.3.
Sets a timeout during which an idle keepalive connection to an upstream server will stay open.
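For comparison, in a hand-written nginx configuration (outside the ingress controller) this directive lives in an upstream block alongside keepalive. A minimal sketch based on the documentation above; the upstream address mirrors the pod IP seen in the logs and is illustrative only:

```nginx
upstream tomcat_backend {
    server 10.128.8.193:8080;
    keepalive 32;               # pool of idle keepalive connections per worker
    keepalive_timeout 15s;      # keep this below tomcat's keepAliveTimeout (20s here)
}

server {
    listen 8090;
    location /dbpool-server/ {
        proxy_http_version 1.1;          # upstream keepalive requires HTTP/1.1
        proxy_set_header Connection "";  # drop the "Connection: close" nginx sends by default
        proxy_pass http://tomcat_backend;
    }
}
```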
Solution
With this working hypothesis, we tried setting nginx's upstream keepalive timeout to 15s (it has to be smaller than tomcat's timeout). After testing, the failures no longer occurred.
Because the nginx-ingress-controller version we run is 0.20.0, which does not yet expose this value as a configuration option, we modified the configuration template instead, as follows:
...
{{ if $all.DynamicConfigurationEnabled }}
upstream upstream_balancer {
        server 0.0.0.1; # placeholder

        balancer_by_lua_block {
          balancer.balance()
        }

        {{ if (gt $cfg.UpstreamKeepaliveConnections 0) }}
        keepalive {{ $cfg.UpstreamKeepaliveConnections }};
        {{ end }}

        # ADD -- START
        keepalive_timeout 15s;
        # ADD -- END
}
{{ end }}
...
However, starting with version 0.21.0, this value can be changed through the configuration option [upstream-keepalive-timeout](#upstream-keepalive-timeout) (type int, default 60).
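On 0.21.0 and later the template patch is therefore unnecessary; the value can be set through the controller's ConfigMap. A sketch, assuming a ConfigMap name and namespace commonly used by ingress-nginx deployments (adjust both to your installation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration      # assumption: match the controller's --configmap flag
  namespace: ingress-nginx       # assumption: match the controller's namespace
data:
  upstream-keepalive-timeout: "15"   # seconds; keep it below tomcat's keepAliveTimeout
```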
A side thought: in the normal case it is nginx that initiates connections to tomcat, so it is also reasonable for nginx to be the side that closes them proactively; this avoids a class of problems caused by protocol configuration mismatches or implementation quirks.
Reproducing the Problem
# Prepare a test application and start a test instance
# pod instance
[10:18:49 openshift@st2-control-111a27 ~]$ kubectl get pods dbpool-server-f5d64996d-5vxqc
NAME                            READY     STATUS    RESTARTS   AGE
dbpool-server-f5d64996d-5vxqc   1/1       Running   0          1d
# nginx environment
# ingress-nginx-controller info
[10:21:24 openshift@st2-control-111a27 ~]$ kubectl get pods -owide nginx-ingress-inside-controller-98b95b554-6hxbw
NAME                                              READY     STATUS    RESTARTS   AGE       IP               NODE                        NOMINATED NODE
nginx-ingress-inside-controller-98b95b554-6hxbw   1/1       Running   0          3d        192.168.111.24   paas-ing010001.99bill.com   <none>

# ingress info
[10:19:58 openshift@st2-control-111a27 ~]$ kubectl get ingress dbpool-server-inside -oyaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: pletest-inside
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/upstream-vhost: aaaaaaaaaaaaaaaaaaaaa
  generation: 1
  name: dbpool-server-inside
  namespace: pletest
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: dbpool-server
          servicePort: http
        path: /dbpool-server/
# Test access script
The script keeps requesting the application continuously, since we do not know when the problem will occur and cannot trigger it by hand.
#!/bin/sh
n=1
while true
do
    curl http://192.168.111.24:8090/dbpool-server/
    # 20 here matches tomcat's keepalive timeout; we also tried 1, but could not reproduce the problem with it
    sleep 20
    n=$((n+1))
    echo $n
done
# Packet capture analysis
Capture packets on the nginx side:
tcpdump -i vxlan_sys_4789 host 10.128.8.193 -w /tmp/20191106-2.pcap
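When reading the capture back, it helps to jump straight to the abnormal teardown. A small sketch using the same capture file; it simply filters for segments that have the RST flag set:

```sh
# Sketch: list only the RST segments in the capture taken above
tcpdump -nn -r /tmp/20191106-2.pcap 'tcp[tcpflags] & tcp-rst != 0'
```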
The packets below show that after the TCP connection had been idle for 20s, tomcat initiated the close; but right at that moment nginx sent another request on the connection, and tomcat did not answer it, instead terminating the connection abnormally with an RST. This also explains why tomcat logged nothing for the failed request.
(nginx-side packets for the 104: Connection reset by peer error)
(nginx-side packets for the upstream prematurely closed connection error)
(tomcat-side packets for the 104: Connection reset by peer error)
(tomcat-side packets for the upstream prematurely closed connection error)
Community Issues About This Problem
This problem has also been raised in the nginx-ingress community; see the following links: