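Hello everyone. We are very sorry: during the afternoon traffic peak yesterday (December 3), our community (園子) saw even higher concurrency than usual, and under that load a sudden database connection failure brought down the entire blog site. We apologize for the considerable trouble this caused you.
Recently we have been working on the AWS partnership project, speeding up product improvements, unifying the UI across the whole site, and dealing with the various problems that appear under high concurrency, all at the same time. Our community is at a critical stage of its development, and we are doing everything we can to meet the challenges and usher in its next phase. Thank you for your support, and please bear with us for the trouble caused during this period.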
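The outage began at around 14:09 on the afternoon of December 3. The first symptom was 502 errors when accessing the blog admin backend.
When the outage started, the first error in the blog admin backend's logs was SqlClient failing to connect to the SQL Server database (we use an Alibaba Cloud RDS SQL Server instance):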
2020-12-03T14:09:48 ERR [Path:/healthz]/[Action:]/[Version:]
Health check "blogdb" completed after 0.3522ms with status Unhealthy and description 'null'
Microsoft.Data.SqlClient.SqlException (0x80131904): Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. This failure occurred while attempting to connect to the Principle server. The duration spent while attempting to connect to this server was - [Pre-Login] initialization=20025; handshake=3;
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
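To make the timing breakdown in the error message above easier to read, here is a minimal sketch that extracts the per-phase durations from the text. The helper name `parse_phase_timings` is made up for illustration, and the values are assumed to be milliseconds. It only shows which phase consumed the time; it does not explain why.

```python
import re
from typing import Dict

# Text copied from the log excerpt above. Units are assumed to be milliseconds.
MESSAGE = (
    "The duration spent while attempting to connect to this server was - "
    "[Pre-Login] initialization=20025; handshake=3;"
)


def parse_phase_timings(message: str) -> Dict[str, int]:
    """Return {phase_name: duration} for every 'name=number;' pair in the message."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+);", message)}


timings = parse_phase_timings(MESSAGE)
# Pick the phase with the largest duration.
slowest = max(timings, key=timings.get)
print(timings, slowest)
```

For this message, nearly all of the time was spent in the `initialization` phase of the pre-login step rather than in the handshake itself.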
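Three minutes later the blog site itself began to fail, with requests intermittently returning 500 errors.
The first error in the blog site's logs at that point was SqlClient failing to resolve the database server's hostname: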
2020-12-03 14:12:46.729 [Error] An exception occurred while iterating over the results of a query for context type '"BlogServer.Infrastructure.Data.EfUnitOfWork"'."
""Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
at System.Net.Dns.GetHostEntryOrAddressesCore(String hostName, Boolean justAddresses)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeSpan timeout, Boolean isInfiniteTimeout, String cachedFQDN, SQLDNSInfo& pendingDNSInfo)
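The stack trace above ends in a failed DNS lookup ("Name or service not known"). As an illustration only, the sketch below shows one way to probe whether a hostname resolves, retrying a few times before giving up. It is not the code running on the site, the hostname in the usage note is hypothetical, and whether retries would have helped in this outage is unknown, since the root cause has not yet been determined.

```python
import socket
import time
from typing import Callable, List


def system_resolver(host: str) -> List[str]:
    """Resolve a hostname to a sorted list of unique IP addresses via the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolve_with_retry(
    host: str,
    attempts: int = 3,
    delay_seconds: float = 0.5,
    resolver: Callable[[str], List[str]] = system_resolver,
) -> List[str]:
    """Try to resolve `host` up to `attempts` times, sleeping between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host)
        except socket.gaierror:
            if attempt == attempts:
                raise  # Out of retries: surface the last DNS error to the caller.
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

A call such as `resolve_with_retry("db.example.internal")` (a placeholder name) returns the resolved addresses, or raises `socket.gaierror` once every attempt has failed. The `resolver` parameter exists so the function can be exercised without touching real DNS.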
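From then on, the blog admin backend returned 502 continuously, while the blog site was slow and frequently returned 500 errors.
While handling the outage we performed a primary/standby failover of the database server, after which the blog admin backend recovered. The blog site, still under high-concurrency load, would not recover no matter what we tried; after the failover, the number of database connections spiked.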

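We then tried everything we could think of but still could not bring the blog site fully back to normal. Once it had partially recovered, we found that requests were sometimes very fast and sometimes very slow, depending on which pod they landed on. We then added more servers to the k8s cluster, scaled out to more pods, and forcibly stopped the longest-running batch of pods one at a time. That eased the problem, but the site only truly recovered after the traffic peak had passed.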
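The "stop the longest-running pods one at a time" step can be sketched as a simple selection by start time. This is a hedged illustration, not the procedure actually used: the `Pod` type, the pod names, and the timestamps are all hypothetical, and the sketch only chooses which pods to stop without talking to a cluster.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Pod:
    name: str             # hypothetical pod name
    started_at: datetime  # when the pod started running


def oldest_pods(pods: List[Pod], count: int) -> List[Pod]:
    """Return the `count` longest-running pods, oldest first."""
    return sorted(pods, key=lambda pod: pod.started_at)[: max(count, 0)]


# Made-up sample data for illustration only.
pods = [
    Pod("blog-web-a", datetime(2020, 12, 3, 9, 0)),
    Pod("blog-web-b", datetime(2020, 12, 3, 15, 30)),
    Pod("blog-web-c", datetime(2020, 12, 2, 22, 15)),
]

for pod in oldest_pods(pods, 2):
    # In a real cluster, this is where one would stop the pod (for example with
    # `kubectl delete pod <name>`) and wait for its replacement to become ready
    # before moving on to the next one.
    print(pod.name)
```

Stopping pods one at a time, rather than all at once, keeps most of the serving capacity available while each replacement starts.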
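We are publishing this post first to report the general course of the outage. The cause still needs further investigation and analysis. Once again, we apologize for the trouble this outage has caused you.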