(Note: neither of these two problems has anything to do with Alibaba Cloud.)
1. 503 errors
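Today, between roughly 13:00 and 13:10, we saw 503 errors. The cause was that the number of concurrent requests at that moment exceeded the Queue Length of the IIS application pool, which was still at the IIS default of 1000 (see the figure below).
We fixed this by raising Queue Length from 1000 to 2000 (the maximum allowed value is 65535).
We later found that Queue Length can be chosen by watching "HTTP Service Request queue" -> "Arrival Rate" in Performance Monitor.
For example, the figure above shows a peak "Arrival Rate" of 400, so Queue Length should be greater than 400.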
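Here is the CPU monitoring chart for one of the web servers behind the load balancer at the time:
(The red curve is % Processor Time; the green curve is Request Execution Time.)
What abnormal condition hit this cloud server at that moment? It looks as though the root cause of the 503 errors was abnormal CPU behavior on the cloud server; we have filed a ticket with Alibaba Cloud to find out more.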
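Update:
After careful investigation, the 503 errors turned out to be caused by the application pool crashing, and the crash was caused by the Couchbase client. At the time we were adding and removing servers in the Couchbase cluster.
The evidence comes from the Windows Event Log: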
Exception: System.NullReferenceException
Message: Object reference not set to an instance of an object.
StackTrace:
   at Hammock.RestClient.CompleteWithQuery(WebQuery query, RestRequest request, RestCallback callback, WebQueryAsyncResult result)
   at Hammock.RestClient.<>c__DisplayClass18.<BeginRequestImpl>b__15(Object sender, WebQueryResponseEventArgs args)
   at System.EventHandler`1.Invoke(Object sender, TEventArgs e)
   at Hammock.Web.WebQuery.OnQueryResponse(WebQueryResponseEventArgs args)
   at Hammock.Web.WebQuery.HandleWebException(WebException exception)
   at Hammock.Web.WebQuery.GetAsyncResponseCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.HttpWebRequest.SetResponse(Exception E)
   at System.Net.ConnectionReturnResult.SetResponses(ConnectionReturnResult returnResult)
   at System.Net.Connection.CompleteConnectionWrapper(Object request, Object state)
   at System.Net.PooledStream.ConnectionCallback(Object owningObject, Exception e, Socket socket, IPAddress address)
   at System.Net.ServicePoint.ConnectSocketCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)

Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
Stack:
   at System.Net.ServicePoint.ConnectSocketCallback(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x000007ff0033cbed
Faulting process id: 0x10b4
Faulting application start time: 0x01ce42fb6c5d3e18
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: unknown
Report Id: 30767fd7-aef7-11e2-8bf7-e5d3e0390d57
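2. Uneven CPU usage across the Couchbase cluster
(Couchbase admin console)
(Output of the Linux top command)
The cluster consists of two Couchbase servers, yet their CPU usage differs widely. The Couchbase version is 2.0.0.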
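To put a number on the difference without relying on screenshots, something like the following could be run on each node. It is only a sketch: `memcached_cpu` is a hypothetical helper name, and it assumes the process of interest appears in `ps` under the command name `memcached`.

```shell
#!/bin/sh
# Sum the %CPU column for every process whose command name is "memcached".
# Input: lines of "<pcpu> <command>" on stdin, as printed by `ps -eo pcpu,comm`.
# The ps header line is ignored because its second field is not "memcached".
# Note: ps reports CPU averaged over the process lifetime, while top shows
# a recent sample, so the two figures will not match exactly.
memcached_cpu() {
    awk '$2 == "memcached" { sum += $1 } END { printf "%.1f\n", sum + 0 }'
}
```

Usage on each server: `ps -eo pcpu,comm | memcached_cpu`. Comparing the two results gives a rough view of the imbalance.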
A Google search turned up "High cpu usage in memcached process": this is a bug in Couchbase 2.0.0, and upgrading to the latest release, Couchbase 2.0.1, fixes it.
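Upgrade procedure:
1. Download the package on both Couchbase servers: wget http://packages.couchbase.com/releases/2.0.1/couchbase-server-enterprise_x86_64_2.0.1.rpm
2. Open the Couchbase admin console and remove one server from the cluster; for the exact procedure see couchbase-getting-started-upgrade-online
3. Upgrade Couchbase to 2.0.1: rpm -U couchbase-server-enterprise_x86_64_2.0.1.rpm (it is best to restart the couchbase service after upgrading: service couchbase restart)
4. Add the upgraded Couchbase server back to the cluster.
5. Repeat the same upgrade on the other Couchbase server.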
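The command-line part of the steps above (download, upgrade, restart) could be wrapped up roughly as follows. This is a sketch, not a tested upgrade tool: `run` and `upgrade_node` are hypothetical helper names, the service name is taken as written in this post and may differ on your install, and removing the node from the cluster and adding it back (steps 2 and 4) still happen in the admin console.

```shell
#!/bin/sh
# Package name and URL as given in the steps above.
PKG="couchbase-server-enterprise_x86_64_2.0.1.rpm"
URL="http://packages.couchbase.com/releases/2.0.1/$PKG"
# Assumption: service name as written in the post; override via SERVICE
# if your package registers a different name.
SERVICE="${SERVICE:-couchbase}"

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run only on a node that has already been removed from the cluster.
upgrade_node() {
    run wget "$URL" || return 1
    run rpm -U "$PKG" || return 1
    run service "$SERVICE" restart
}
```

Calling `upgrade_node` prints the three commands for review; calling it as root with `DRY_RUN=0` executes them. Defaulting to a dry run is a deliberate choice here, since `rpm -U` replaces the running server.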
After the upgrade, the problem was resolved.