Two problems encountered on 2013-04-27: 503 errors and unbalanced CPU usage in a Couchbase cluster


(Note: neither of these problems has anything to do with Alibaba Cloud.)

1. 503 errors

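Between roughly 13:00 and 13:10 today, the site returned 503 errors. The apparent cause was that the number of concurrent requests exceeded the queue length (Queue Length) of the IIS application pool, which was still at the IIS default of 1000 (see the screenshot below).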

We resolved the problem by raising Queue Length from 1000 to 2000 (the maximum allowed value is 65535).

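We later found that Performance Monitor can guide the choice of Queue Length: watch the "Arrival Rate" counter under "HTTP Service Request Queues".

For example, the chart above shows a peak "Arrival Rate" of 400, so Queue Length should be set higher than 400.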
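As a rough illustration of this sizing rule, the sketch below takes a few "Arrival Rate" samples and suggests a Queue Length above the observed peak. The function name `suggest_queue_length`, the headroom factor of 2, and the lower bound of 10 are assumptions made for illustration; only the 65535 upper limit and the "larger than the peak" rule come from the text above.

```python
IIS_MAX_QUEUE_LENGTH = 65535   # upper limit quoted in the post
ASSUMED_MIN_QUEUE_LENGTH = 10  # assumption: smallest value worth configuring


def suggest_queue_length(arrival_rates, headroom=2.0):
    """Suggest a Queue Length strictly above the peak arrival rate.

    arrival_rates: samples of the "Arrival Rate" counter.
    headroom: hypothetical safety multiplier applied to the peak.
    """
    if not arrival_rates:
        raise ValueError("need at least one Arrival Rate sample")
    peak = max(arrival_rates)
    # Apply the headroom, but never suggest less than peak + 1.
    suggested = max(int(peak * headroom), int(peak) + 1)
    # Clamp the result to the assumed minimum and the documented maximum.
    return max(ASSUMED_MIN_QUEUE_LENGTH, min(suggested, IIS_MAX_QUEUE_LENGTH))


if __name__ == "__main__":
    samples = [120, 250, 400, 310]  # made-up samples with a peak of 400
    print(suggest_queue_length(samples))
```

With a peak of 400 and the assumed headroom of 2, the function returns 800. The headroom is a judgment call and should be tuned against real traffic.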

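Here is the CPU monitoring chart for one of the load-balanced web servers at that time:

(The red curve is % Processor Time; the green curve is Request Execution Time.)

What abnormal condition hit this cloud server at that moment? It looks as if the root cause of the 503 errors was a CPU anomaly on the cloud server, so we have filed a ticket with Alibaba Cloud to find out more.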

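Update:

After careful investigation, the 503 errors turned out to be caused by an application pool crash, and that crash was triggered by the Couchbase client. At the time we were adding and removing servers in the Couchbase cluster.

The evidence comes from the Windows event log: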

Exception: System.NullReferenceException
Message: Object reference not set to an instance of an object.
StackTrace:    at Hammock.RestClient.CompleteWithQuery(WebQuery query, RestRequest request, RestCallback callback, WebQueryAsyncResult result)
   at Hammock.RestClient.<>c__DisplayClass18.<BeginRequestImpl>b__15(Object sender, WebQueryResponseEventArgs args)
   at System.EventHandler`1.Invoke(Object sender, TEventArgs e)
   at Hammock.Web.WebQuery.OnQueryResponse(WebQueryResponseEventArgs args)
   at Hammock.Web.WebQuery.HandleWebException(WebException exception)
   at Hammock.Web.WebQuery.GetAsyncResponseCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.HttpWebRequest.SetResponse(Exception E)
   at System.Net.ConnectionReturnResult.SetResponses(ConnectionReturnResult returnResult)
   at System.Net.Connection.CompleteConnectionWrapper(Object request, Object state)
   at System.Net.PooledStream.ConnectionCallback(Object owningObject, Exception e, Socket socket, IPAddress address)
   at System.Net.ServicePoint.ConnectSocketCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)
Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
Stack:
   at System.Net.ServicePoint.ConnectSocketCallback(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x000007ff0033cbed
Faulting process id: 0x10b4
Faulting application start time: 0x01ce42fb6c5d3e18
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: unknown
Report Id: 30767fd7-aef7-11e2-8bf7-e5d3e0390d57

2. Unbalanced CPU usage in the Couchbase cluster

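(Couchbase admin console)

(Output of the Linux top command)

The cluster consists of two Couchbase servers, yet their CPU usage differs widely. The Couchbase version is 2.0.0.

A Google search turned up "High cpu usage in memcached process": this is a bug in Couchbase 2.0.0, and upgrading to the latest release, Couchbase 2.0.1, fixes it.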

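Upgrade procedure:

1. Download the package on both Couchbase servers: wget http://packages.couchbase.com/releases/2.0.1/couchbase-server-enterprise_x86_64_2.0.1.rpm

2. In the Couchbase admin console, remove one server from the cluster; see couchbase-getting-started-upgrade-online for the detailed steps.

3. Upgrade Couchbase to 2.0.1: rpm -U couchbase-server-enterprise_x86_64_2.0.1.rpm (it is best to restart the Couchbase service after the upgrade: service couchbase-server restart)

4. Add the upgraded Couchbase server back to the cluster.

5. Repeat the same upgrade on the other Couchbase server.

After the upgrade, the problem was resolved.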
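The one-node-at-a-time order of the steps above can be sketched as a dry-run checklist generator. It only builds and prints the plan and executes nothing. The host names `cb-node-1` and `cb-node-2` and the function `rolling_upgrade_plan` are hypothetical, and the service name `couchbase-server` is an assumption about how the RPM registers the service. The console steps are left as manual actions because the post performs them in the admin UI.

```python
PACKAGE = "couchbase-server-enterprise_x86_64_2.0.1.rpm"
PACKAGE_URL = "http://packages.couchbase.com/releases/2.0.1/" + PACKAGE


def rolling_upgrade_plan(nodes):
    """Return ordered (node, kind, action) steps for upgrading one node at a time."""
    if len(nodes) < 2:
        # With a single node there is nothing left to serve traffic during the upgrade.
        raise ValueError("a rolling upgrade needs at least two nodes")
    steps = []
    # Step 1: fetch the package on every node before touching the cluster.
    for node in nodes:
        steps.append((node, "run", "wget " + PACKAGE_URL))
    # Steps 2-5: take one node out, upgrade it, put it back, then move on.
    for node in nodes:
        steps.append((node, "manual", "remove node from the cluster in the admin console"))
        steps.append((node, "run", "rpm -U " + PACKAGE))
        # Assumption: the init script is named couchbase-server.
        steps.append((node, "run", "service couchbase-server restart"))
        steps.append((node, "manual", "add node back to the cluster in the admin console"))
    return steps


if __name__ == "__main__":
    for node, kind, action in rolling_upgrade_plan(["cb-node-1", "cb-node-2"]):
        print(f"[{node}] {kind}: {action}")
```

Keeping the plan as plain data makes it easy to review the order before anyone runs a command on a production node.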

