Analyzing nginx 50x errors

Reposted article, dated 2013-08-23 22:26. Category: Nginx

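We recently went through a series of nginx 50x errors. This post summarizes how to handle them and the likely root cause behind each one.

What to have on hand before troubleshooting

1 Code release times

2 php error log

3 nginx access log

4 nginx error log

5 A log of response times for each API endpoint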

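Troubleshooting process

1. Check whether someone has just released code. Compare the incident timeline with the release timeline; if the two line up closely, the incident was almost certainly caused by the release, and rolling back is usually the most direct and effective fix.

2. Test the endpoint on the production test server. That server gets little traffic, so problems caused by heavy load on nginx are ruled out, and you can check directly whether the backend storage servers are failing.

3. Dig useful information out of the logs.

3.1 php log: check for a large number of php error messages.

3.2 nginx log: determine the point in time at which the endpoint started returning large numbers of 50x errors.

3.3 Endpoint request-time log (which you have to record yourself): check whether request times look abnormal.

3.4 Use a tool such as xhprof to analyze where slow requests spend their time.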
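Step 3.2 can be sketched in Python as follows. This is a minimal illustration, assuming the access log uses the same combined format as the sample lines later in this article; the function names and the spike threshold are hypothetical, not part of any tool.

```python
import re
from collections import Counter
from datetime import datetime

# Pulls the timestamp and status code out of a combined-format access log line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) ')


def count_50x_per_minute(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to the number of 50x responses."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or not m.group("status").startswith("50"):
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


def first_spike(counts, threshold):
    """Return the earliest minute with at least `threshold` 50x responses, or None."""
    for minute in sorted(counts):
        if counts[minute] >= threshold:
            return minute
    return None
```

The minute returned by first_spike can then be compared with the release timeline from step 1. A suitable threshold depends on normal traffic and has to be chosen per site.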

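Analyzing the causes of 50x errors:

What to know before analyzing

1. php.ini

2. php-fpm.conf (open a page containing <?php phpinfo(); and look for "Loaded Configuration File", which gives the location of php.ini; php-fpm.conf usually sits in the etc directory of the same installation. php -i | grep PATH | grep php; cd ../etc # locate the directory that holds php-fpm.conf)

3. nginx.conf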

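504:

1. php-fpm did not return a response within nginx's upstream timeout. For a FastCGI backend that timeout is fastcgi_read_timeout in nginx.conf; keepalive_timeout only controls idle keep-alive connections from clients.

2. Too few php-fpm worker processes: the number of concurrent requests reaches pm.max_children in php-fpm.conf.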

pm = dynamic
; The number of child processes to be created when pm is set to 'static' and the 
; maximum number of child processes to be created when pm is set to 'dynamic'.
; This value sets the limit on the number of simultaneous requests that will be
; served. Equivalent to the ApacheMaxClients directive with mpm_prefork.
; Equivalent to the PHP_FCGI_CHILDREN environment variable in the original PHP 
; CGI.
; Note: Used when pm is set to either 'static' or 'dynamic'
; Note: This value is mandatory.
pm.max_children = 4096

; The number of child processes created on startup.
; Note: Used only when pm is set to 'dynamic'
; Default Value: min_spare_servers + (max_spare_servers - min_spare_servers) / 2 
pm.start_servers = 768 

; The desired minimum number of idle server processes.
; Note: Used only when pm is set to 'dynamic'
; Note: Mandatory when pm is set to 'dynamic'
pm.min_spare_servers = 512

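In my view the most important parameter here is max_children: it caps the number of php-fpm worker processes, both with pm = dynamic and with pm = static.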
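To make the settings above easier to reason about, here is a small Python sketch that reads the pm.* values out of php-fpm.conf text and applies the default formula for pm.start_servers quoted in the config comment. The function names are hypothetical, and the pm.max_spare_servers value of 1024 in the usage note is an assumption, since the excerpt does not show that setting.

```python
import re


def parse_pm_settings(conf_text):
    """Collect integer pm.* settings from php-fpm.conf text, ignoring ';' comments."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop comments and surrounding spaces
        m = re.match(r"pm\.(\w+)\s*=\s*(\d+)$", line)
        if m:
            settings[m.group(1)] = int(m.group(2))
    return settings


def default_start_servers(min_spare, max_spare):
    """Formula from the config comment:
    min_spare_servers + (max_spare_servers - min_spare_servers) / 2
    """
    return min_spare + (max_spare - min_spare) // 2
```

With the pm.min_spare_servers = 512 shown above and an assumed pm.max_spare_servers of 1024, default_start_servers(512, 1024) gives 768, the same as the pm.start_servers value in the excerpt.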

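3. Requests time out while queued

Once php-fpm is at its limit, new requests wait in the listen queue of the php-fpm socket until a worker frees up. If no worker becomes free before nginx's upstream timeout expires (fastcgi_read_timeout, not keepalive_timeout), nginx returns 504.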
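The queueing effect can be estimated with some rough arithmetic. The Python sketch below is a deliberately crude model, not a description of how nginx or php-fpm schedule work: it assumes every request takes the same service_time, that all workers are busy with requests that have just started, and that the timeout covers queue wait plus execution. The function names are hypothetical; the 60-second default mirrors nginx's documented default for fastcgi_read_timeout.

```python
import math


def queue_wait_seconds(position, workers, service_time):
    """Worst-case wait for the request at 1-based `position` in the queue.

    Workers free up in batches of `workers` every `service_time` seconds.
    """
    return math.ceil(position / workers) * service_time


def would_time_out(position, workers, service_time, read_timeout=60.0):
    """True if queue wait plus the request's own run time exceeds the timeout."""
    return queue_wait_seconds(position, workers, service_time) + service_time > read_timeout
```

For example, with 4 workers and 2-second requests, the 100th queued request waits about 50 seconds and still finishes within 60 seconds, while the 120th waits about 60 seconds and would be answered with a 504 under this model.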

Error log and access log entries for a 504

2013/08/14 20:48:31 [error] 20370#0: *1948283 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: lv.com.cn, request: "GET /a.php HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "lv.com"
127.0.0.1 - - [14/Aug/2013:20:48:31 +0800] "GET /a.php HTTP/1.1" 504 183 "-" "curl/7.15.5 (x86_64-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5"

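502 cause analysis. A 502 is usually not caused by nginx itself.

1. php-fpm request_terminate_timeout exceeded

request_terminate_timeout sets the maximum time a single php script may run. When a script exceeds it, the php-fpm process manager kills the worker, which closes the FastCGI connection to nginx, and nginx then logs a "Connection reset by peer" error.

Warning: max_execution_time in php.ini rarely takes effect under php-fpm, because it does not count time spent in network requests or system calls. Most web requests spend their time waiting on a database rather than doing pure computation, so they almost never hit that limit.

Blindly raising request_terminate_timeout does not solve the problem. For a production request, 1s is already very long, so when a timeout occurs the better response is to find the slow step and optimize it.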
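The difference between the two limits comes down to CPU time versus wall-clock time. The Python sketch below is only an analogy, not PHP internals: it measures both clocks around a task that mostly waits, with time.sleep standing in for a slow database or network call. The measure and waiting_task names are hypothetical.

```python
import time


def measure(task):
    """Run `task` and return (wall_seconds, cpu_seconds) it consumed."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def waiting_task():
    # Stands in for a request that spends its time waiting on a backend.
    time.sleep(0.5)


wall, cpu = measure(waiting_task)
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

The wall-clock figure is about half a second while the CPU figure stays close to zero. A limit that excludes waiting, as described above for max_execution_time, would not notice such a request, whereas a wall-clock limit such as request_terminate_timeout would.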

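2. php-fpm process crashes

It is actually fairly hard to write PHP code that makes php-fpm segfault or exit unexpectedly. In most cases the cause is a bug in some extension. (We ran into this with redis->pconnect, as shown below.)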

PHP Notice: Redis::setex(): send of 869 bytes failed with errno=32 Broken pipe in /data/home/xxx.php on line 43

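3. memory_limit in php.ini is too small

4. Buffers in nginx.conf are too small: the client header buffer (client_header_buffer_size) or fastcgi_buffer_size

nginx error log: upstream sent too big header while reading response header from upstream

Error log and access log entries for a 502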

2013/08/23 17:14:26 [error] 20370#0: *2529767 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.18.128.37, server: lv.com.cn, request: "GET /a.php HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "lv.com"
10.18.128.37 - - [23/Aug/2013:17:14:26 +0800] "GET /a.php HTTP/1.1" 502 575 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17"
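The error log messages quoted in this article can serve as a quick first-pass classifier. The Python sketch below maps each message fragment to the status code and the cause discussed above. The SIGNATURES table and the classify function are hypothetical helpers, and the mapping only reflects the cases covered here; a real error log will contain messages it does not recognize.

```python
# Message fragments from the log samples in this article, each paired with
# the status code it accompanied and the cause suggested in the text.
SIGNATURES = [
    ("upstream timed out", 504,
     "php-fpm too slow or out of workers; check timeouts and pm.max_children"),
    ("Connection reset by peer", 502,
     "php-fpm worker killed or crashed; check request_terminate_timeout and extensions"),
    ("upstream sent too big header", 502,
     "response header larger than the buffer; check fastcgi_buffer_size"),
]


def classify(error_log_line):
    """Return (status, hint) for a known nginx error log message, else None."""
    for fragment, status, hint in SIGNATURES:
        if fragment in error_log_line:
            return status, hint
    return None
```

Running classify over the nginx error log around the time the 50x errors started gives a rough split between timeout-type and crash-type failures before looking at individual requests.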

References
http://blog.xiuwz.com/2012/09/25/php-max-execution-time-internal/

http://www.cnblogs.com/zhengyun_ustc/archive/2013/06/06/3120967.html
