Reposted from:
Author: 棧木頭
Link: https://www.jianshu.com/p/f0f05c02e93a
Background

While load-testing an application service, Nginx started returning errors after roughly one minute of sustained requests. It took some time to track down the cause, but the problem was eventually pinned down; this post summarizes the process.
Load-testing tool

The tool used here is siege. It makes it easy to specify the number of concurrent users and the test duration, and it reports clear results: successful requests, failures, throughput, and other performance metrics.
Test parameters

A single endpoint, 100 concurrent users, sustained for 1 minute.
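With siege, these parameters map directly onto command-line flags; a sketch (the URL is a placeholder for the endpoint under test):

```shell
# 100 concurrent users (-c), sustained for 1 minute (-t 1M);
# the URL is a placeholder for the endpoint under test
siege -c 100 -t 1M http://192.168.86.90/guide/v1/activities/1107
```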
Load-testing tool errors

```
The server is now under siege...
[error] socket: unable to connect sock.c:249: Connection timed out
[error] socket: unable to connect sock.c:249: Connection timed out
```
Nginx error.log errors

```
2018/11/21 17:31:23 [error] 15622#0: *24993920 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
2018/11/21 18:21:09 [error] 4469#0: *25079420 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
```
Troubleshooting

- Seeing "timed out", my first instinct was that the application service had a performance problem and could not keep up with the concurrent requests. However, checking the application service's logs showed no errors at all.
- Watching the CPU load of the application service (a Docker container, via `docker stats <container-id>`) showed CPU usage rising under concurrent load but nothing else abnormal, which is expected. Continued observation, however, revealed that once the errors started, the CPU load on the application host dropped and its logs stopped recording requests. This suggested the failure to respond came from the previous node in the chain, namely Nginx.
- Inspecting the TCP connections on the Nginx server during the test:

```shell
# Count connections on port 80
netstat -nat | grep -i "80" | wc -l
5407

# Count TCP connections by state
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 12
SYN_RECV 1
ESTABLISHED 454
FIN_WAIT1 1
TIME_WAIT 5000
```
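The awk one-liner aggregates by the last column, which is the connection state. To make its behavior concrete, here is the same aggregation run over a small, made-up netstat sample:

```shell
# A few fabricated lines standing in for real `netstat -na` output
sample='tcp        0      0 0.0.0.0:80      0.0.0.0:*          LISTEN
tcp        0      0 10.0.0.1:80     10.0.0.2:51234     ESTABLISHED
tcp        0      0 10.0.0.1:80     10.0.0.2:51235     TIME_WAIT
tcp        0      0 10.0.0.1:80     10.0.0.2:51236     TIME_WAIT'

# Count connections per state, exactly as in the command above
printf '%s\n' "$sample" | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
# Output (order may vary): LISTEN 1, ESTABLISHED 1, TIME_WAIT 2
```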
Two anomalies stand out in these TCP stats:

- there are more than 5,000 connections, and
- the TIME_WAIT count stops growing once it reaches 5,000.

Analyzing each in turn:
- In theory, a test with 100 concurrent users should hold only about 100 connections. The extra connections come from siege itself:

```shell
# Check the siege configuration
vim ~/.siege/siege.conf
```

The mystery is solved: siege's connection directive defaults to close, so during a sustained test every request closes its connection when it completes and a new one is created for the next request. That explains why the Nginx server sees 5,000+ TCP connections instead of 100:

```
# Connection directive. Options "close" and "keep-alive" Starting with
# version 2.57, siege implements persistent connections in accordance
# to RFC 2068 using both chunked encoding and content-length directives
# to determine the page size.
#
# To run siege with persistent connections set this to keep-alive.
#
# CAUTION: Use the keep-alive directive with care.
# DOUBLE CAUTION: This directive does not work well on HPUX
# TRIPLE CAUTION: We don't recommend you set this to keep-alive
# ex: connection = close
# connection = keep-alive
# connection = close
```
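A back-of-the-envelope calculation (the request rate is an assumption of mine; the 60-second TIME_WAIT duration is the kernel's fixed TCP_TIMEWAIT_LEN) shows how quickly close-per-request fills the TIME_WAIT table:

```shell
requests_per_second=500   # assumption: aggregate throughput during the test
time_wait_seconds=60      # Linux holds TIME_WAIT sockets for 60s (TCP_TIMEWAIT_LEN)
tw_bucket_limit=5000      # net.ipv4.tcp_max_tw_buckets on this server

# Steady-state TIME_WAIT population: every finished request parks one
# socket in TIME_WAIT for ~60 seconds
echo $((requests_per_second * time_wait_seconds))   # 30000, far above the cap

# Seconds until the 5000-bucket cap is reached from a cold start
echo $((tw_bucket_limit / requests_per_second))     # 10
```

Under these assumptions, the cap is hit within seconds of starting the test, which matches the errors appearing early in the run.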
- Why does TIME_WAIT stop at 5,000? To answer that, we first need to understand what the TIME_WAIT state means.

TIME-WAIT: wait long enough to be sure the remote TCP received the acknowledgment of its connection-termination request. TCP must guarantee that all data is delivered correctly under all circumstances. When a connection is closed, the side that closes actively enters TIME_WAIT, while the passive side transitions to CLOSED; this is what guarantees all data gets through.

From this definition we can see that when each request's connection is torn down, the socket on the Nginx server is not CLOSED immediately but lingers in TIME_WAIT (with Connection: close requests, the server side typically performs the active close after responding, which is why the TIME_WAIT sockets accumulate on Nginx). There are plenty of articles online about excessive TIME_WAIT sockets causing dropped connections, exactly matching what I saw during the test.
```shell
# Inspect the Nginx server's kernel settings
cat /etc/sysctl.conf
```

```
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same name in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
vm.swappiness = 0
net.ipv4.neigh.default.gc_stale_time = 120
# see details in https://help.aliyun.com/knowledge_detail/39428.html
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
# see details in https://help.aliyun.com/knowledge_detail/41334.html
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
kernel.sysrq = 1
fs.file-max = 65535
net.ipv4.ip_forward = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syn_retries = 3
net.ipv4.tcp_max_orphans = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.icmp_echo_ignore_all = 0
```

The key line is net.ipv4.tcp_max_tw_buckets = 5000: this is the maximum number of TIME_WAIT sockets the system keeps at once. Beyond this number, TIME_WAIT sockets are destroyed immediately and a warning is printed.
Optimization

Following information found online, the Linux kernel parameters were tuned as follows:

- net.ipv4.tcp_syncookies = 1: enable SYN cookies. When the SYN wait queue overflows, cookies are used to handle connections, defending against small-scale SYN flood attacks. Default is 0 (off).
- net.ipv4.tcp_tw_reuse = 1: enable reuse, allowing sockets in TIME-WAIT to be reused for new TCP connections. Default is 0 (off).
- net.ipv4.tcp_tw_recycle = 1: enable fast recycling of TIME-WAIT sockets. Default is 0 (off). Be aware that this option is known to break clients behind NAT and was removed from the Linux kernel in 4.12, so it should be avoided on modern systems.
- net.ipv4.tcp_fin_timeout = 30: how long a socket stays in FIN-WAIT-2 when the close was initiated locally.
- net.ipv4.tcp_keepalive_time = 1200: how often TCP sends keepalive messages when keepalive is enabled. Default is 2 hours; changed to 20 minutes.
- net.ipv4.ip_local_port_range = 1024 65000: the port range used for outbound connections. The default (32768 to 61000) is rather small; changed to 1024 to 65000.
- net.ipv4.tcp_max_syn_backlog = 8192: the length of the SYN queue. Default is 1024; raising it to 8192 allows more pending connections to queue.
- net.ipv4.tcp_max_tw_buckets = 5000: the maximum number of TIME_WAIT sockets the system keeps at once; beyond this they are destroyed immediately with a warning. Default is 180000; changed to 5000.
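These settings are typically persisted in /etc/sysctl.conf and reloaded with sysctl; a minimal sketch (run as root):

```shell
# Try a setting immediately (not persisted across reboots)
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=30

# Persist: add the lines to /etc/sysctl.conf, then reload everything
sysctl -p

# Verify that a value took effect
sysctl net.ipv4.tcp_fin_timeout
```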