數據庫服務異常停止分析


sqlserver數據庫服務異常停止分析

背景

2019年11月13日早上9:37分鍾,我們的運維監控系統告警提示某項目數據庫服務1433端口無法訪問,我們趕緊登陸服務器查看詳情,發現 數據庫服務以及 數據庫agent服務都已停止,重新手工啟動對應的服務恢復生產。接下來我們就需趕緊分析引起這個問題的原因以及如何去修復;
服務器環境情況如下:
Microsoft SQL Server 2017 (RTM-CU14) (KB4484710) - 14.0.3076.1 (X64) Mar 12 2019 19:29:19 Copyright (C) 2017 Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2012 R2 Datacenter 6.3 <X64> (Build 9600: ) (Hypervisor) 
硬件配置:8CPU , 32G memory,數據盤 SSD雲盤   550 GiB (18300 IOPS)

分析過程

第一步 查詢系統日志
a.有個eventid 1000的錯誤
- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="Application Error" /> 
  <EventID Qualifiers="0">1000</EventID> 
  <Level>2</Level> 
  <Task>100</Task> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2019-11-13T01:37:40.000000000Z" /> 
  <EventRecordID>8587</EventRecordID> 
  <Channel>Application</Channel> 
  <Computer>****************</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data>sqlservr.exe</Data> 
  <Data>2017.140.3076.1</Data> 
  <Data>5c88918c</Data> 
  <Data>ntdll.dll</Data> 
  <Data>6.3.9600.19304</Data> 
  <Data>5c7f684f</Data> 
  <Data>c0000374</Data> 
  <Data>00000000000f1cd0</Data> 
  <Data>1b8c</Data> 
  <Data>01d564b93362c212</Data> 
  <Data>E:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Binn\sqlservr.exe</Data> 
  <Data>C:\Windows\SYSTEM32\ntdll.dll</Data> 
  <Data>2db02342-05b6-11ea-80ba-a469ceaa5caa</Data> 
  <Data /> 
  <Data /> 
  </EventData>
  </Event>
 
 
 
b.有個eventid 19019的錯誤;
- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="MSSQLSERVER" /> 
  <EventID Qualifiers="16384">19019</EventID> 
  <Level>2</Level> 
  <Task>2</Task> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2019-11-13T01:37:51.000000000Z" /> 
  <EventRecordID>8589</EventRecordID> 
  <Channel>Application</Channel> 
  <Computer>*******************</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data>MSSQLSERVER 服務意外終止.</Data> 
  <Binary>B9370000190000001000000069005A006100630075006900650062006400780031007300370069005A000000070000006D00610073007400650072000000</Binary> 
  </EventData>
  </Event>
 
 
我是覺的這些信息不是非常明了,只是告知MSSQLSERVER服務確實在問題時間點意外終止;
c.夾在兩個錯誤日志中間,注意到一個eventid 1001 
- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="Windows Error Reporting" /> 
  <EventID Qualifiers="0">1001</EventID> 
  <Level>4</Level> 
  <Task>0</Task> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2019-11-13T01:37:50.000000000Z" /> 
  <EventRecordID>8588</EventRecordID> 
  <Channel>Application</Channel> 
  <Computer>**********************</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data /> 
  <Data>0</Data> 
  <Data>APPCRASH</Data> 
  <Data>不可用</Data> 
  <Data>0</Data> 
  <Data>sqlservr.exe</Data> 
  <Data>2017.140.3076.1</Data> 
  <Data>5c88918c</Data> 
  <Data>StackHash_f18b</Data> 
  <Data>6.3.9600.19304</Data> 
  <Data>5c7f684f</Data> 
  <Data>c0000374</Data> 
  <Data>PCH_99_FROM_ntdll+0x0000000000090D0A</Data> 
  <Data /> 
  <Data /> 
  <Data>C:\Users\MSSQLSERVER\AppData\Local\Temp\WERA49D.tmp.appcompat.txt C:\Users\MSSQLSERVER\AppData\Local\Temp\WERA6EF.tmp.WERInternalMetadata.xml C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_sqlservr.exe_4b5e12dc9b4c86bd34896c996afe289e6149ec98_cac3d1cb_cab_4d0da6ee\triagedump.dmp WERGenerationLog.txt</Data> 
  <Data>C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_sqlservr.exe_4b5e12dc9b4c86bd34896c996afe289e6149ec98_cac3d1cb_cab_4d0da6ee</Data> 
  <Data /> 
  <Data>0</Data> 
  <Data>2db02342-05b6-11ea-80ba-a469ceaa5caa</Data> 
  <Data>4</Data> 
  <Data /> 
  </EventData>
  </Event>
 
 
 
發現系統已打出了dump文件了,直接把上述地址的文件,拷到本地來分析。
通過 windbg查看如下:
 # Child-SP          RetAddr           Call Site
00 000000e1`14c192b8 00007ff8`6012b61e ntdll!NtWaitForMultipleObjects+0xa
01 000000e1`14c192c0 00007ff8`6012b17c ntdll!RtlReportExceptionEx+0x452
02 000000e1`14c19890 00007ff8`60151d26 ntdll!RtlReportException+0xbc
03 000000e1`14c19920 00007ff8`600e1e26 ntdll!RtlReportCriticalFailure$filt$0+0x33
04 000000e1`14c19950 00007ff8`600f030e ntdll!_C_specific_handler+0x96
05 000000e1`14c199c0 00007ff8`600f349d ntdll!_GSHandlerCheck_SEH+0x76
06 000000e1`14c199f0 00007ff8`600b48d7 ntdll!RtlpExecuteHandlerForException+0xd
07 000000e1`14c19a20 00007ff8`600b3afd ntdll!RtlDispatchException+0x197
08 000000e1`14c1a0f0 00007ff8`60151cd0 ntdll!RtlRaiseException+0x18d
09 000000e1`14c1a8b0 00007ff8`60154f12 ntdll!RtlReportCriticalFailure+0x8c
0a 000000e1`14c1a9c0 00007ff8`60155b10 ntdll!RtlpHeapHandleError+0x12
0b 000000e1`14c1a9f0 00007ff8`6010a5ff ntdll!RtlpLogHeapFailure+0xa4
0c 000000e1`14c1aa20 00007ff8`40d269d8 ntdll!RtlFreeHeap+0x74f2f
0d 000000e1`14c1aac0 00007ff8`40d27692 msvcr120!free+0x1c
0e 000000e1`14c1aaf0 00007ff8`31aa748e msvcr120!realloc+0x2e
0f 000000e1`14c1ab20 00000000`00000000 psqlodbc35w+0x5748e
從分析結果來看像是第三方功能引起,接下來就要去數據庫去驗證猜測
第二步 數據庫驗證猜測
select * from sys.dm_os_loaded_modules
 
 
 
ODBC中設置如下:
數據庫中有個dblink如下:
                
跟對應項目負責人員了解到,最近他們有個臨時需求使用job定時在9:30通過dblink與postgres做數據同步,基本時間點也能對上了,那問題基本已明確是第三方的dblink 調用造成sql server 異常停止;
那么接下的問題就是分析9:30調用的job的代碼,就是運行一個存儲過程,存儲過程的詳細代碼如下:
--同步並清洗**銷量數據
--1、分別將通過**link獲取到的**和**銷量數據同步到sfa_t_pflh_sales和sfa_t_drf_sales中
--2、將同步過來的銷量數據進行清洗,並按客戶需求保存數據到sfa_t_KaSales
CREATE PROCEDURE [dbo].[sfa_p_KaSalesLYNC]
AS
--同步****銷量數據
INSERT INTO sfa_t_pflh_sales
SELECT *
FROM OPENQUERY
     (POSTGRES,
      'select * from "tn_bflh_sales" where "tn_tenant_id" = ''****''and tn_product_code<>''合計''AND tn_dealstatus=0 '
     );
--更新*****銷量銷量數據同步狀態tn_dealstatus=1代表已同步
UPDATE OPENQUERY
       (POSTGRES, 'select * from "tn_bflh_sales"')
SET tn_dealstatus = 1
WHERE tn_tenant_id = '****'
      AND tn_dealstatus = 0;

--同步***銷量數據
INSERT INTO sfa_t_drf_sales
SELECT *
FROM OPENQUERY
     (POSTGRES, 'select * from "tn_drf_salses" where "tn_tenant_id" = ''****'' AND tn_dealstatus=0 ');

--更新****銷量銷量數據同步狀態tn_dealstatus=1代表已同步
UPDATE OPENQUERY
       (POSTGRES, 'select * from "tn_drf_salses"')
SET tn_dealstatus = 1
WHERE tn_tenant_id = '****'
      AND tn_dealstatus = 0;

--同步***link作業計划日志
TRUNCATE  TABLE sfa_ka_tasklog;
INSERT INTO sfa_ka_tasklog
SELECT * 
FROM OPENQUERY
     (POSTGRES,
      'SELECT t.tn_task_id,t.tn_ka_id,tn_ka_code,tn_dealers_account,t.tn_task_date,i.tn_status,i.tn_task_steps,i.tn_reak FROM tn_ka_task t LEFT JOIN tn_ka_task_info i on t.tn_task_id=i.tn_task_id
where t.tn_tenant_id=''*****'''
     )
	 ORDER BY tn_task_date,tn_task_steps

DELETE FROM sfa_t_KaSales WHERE DATEDIFF(DAY, tn_date, GETDATE()) = 1--清洗數據前,先刪除之前清洗好的前一天銷量
INSERT INTO sfa_t_KaSales
--清洗******前一天銷量數據
SELECT tn_tenant_id,'****'kaname,
       tn_dealer_number,
	   case    WHEN tn_supplier_code = '******' THEN
               '***'
           WHEN tn_supplier_code = '*****' THEN
               '***'
       END quyu,
       tn_supplier_code,
       tn_supplier_name,
       tn_end_date tn_date,
       tn_product_code,
       tn_product_desc producname,
       tn_store_code,
       tn_store_name,
       MAX(tn_sale_quantity) AS tn_day_sales,
       CAST(MAX(tn_sale_amount) AS NVARCHAR(20)) AS tn_sale_amount
FROM sfa_t_pflh_sales
WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 0
      --AND tn_dealstatus = 0
GROUP BY tn_tenant_id,
         tn_dealer_number,
         tn_supplier_code,
         tn_supplier_name,
         tn_end_date,
         tn_product_code,
         tn_product_desc,
         tn_store_code,
         tn_store_name
UNION
--清洗****前一天銷量數據
SELECT tn_tenant_id,'****'kaname,
       tn_dealer_number,
	   CASE
           WHEN tn_dealer_number = '*****' THEN
               '***'
           WHEN tn_dealer_number = '****' THEN
               '***'
           WHEN tn_dealer_number = '****' THEN
               '***'
       END quyu,
       CASE
           WHEN tn_dealer_number = '****' THEN
               '****'
           WHEN tn_dealer_number = '*****' THEN
               '****'
           WHEN tn_dealer_number = '****' THEN
               '***'
       END tn_supplier_code,
       '******' tn_supplier_name,
       DATEADD(DAY, -1, tn_date) tn_date,
       tn_product_code,
       p.producname,
	   '' tn_store_code,
       tn_store_name,      
       MAX(tn_day_sales) tn_day_sales,
       '' tn_sale_amount
FROM dbo.sfa_t_drf_sales s
    LEFT JOIN dbo.sfa_t_KaProduct p
        ON s.tn_product_code = p.productcode
           AND p.KaName = 'drf'
WHERE DATEDIFF(DAY, s.tn_create_date, GETDATE()) = 0
      --AND s.tn_dealstatus = 0
GROUP BY tn_tenant_id,
         tn_dealer_number,
         tn_product_code,
         tn_store_name,
         p.producname,
         DATEADD(DAY, -1, tn_date);

--為保持數據清洗速度,各***系統銷量臨時表只保留15天數據
DELETE FROM  sfa_t_pflh_sales
WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15

DELETE FROM  sfa_t_drf_sales
WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15
GO
 
 
 
/*--同步並清洗**銷量數據
--1、分別將通過**link獲取到的**和**銷量數據同步到sfa_t_pflh_sales和sfa_t_drf_sales中
--2、將同步過來的銷量數據進行清洗,並按客戶需求保存數據到sfa_t_KaSales
*/
CREATE PROCEDURE [dbo].[sfa_p_KaSalesLYNC]
AS
/*--同步****銷量數據*/
INSERT INTO sfa_t_pflh_sales
SELECT *
FROM OPENQUERY
     (POSTGRES,
      'select * from "tn_bflh_sales" where "tn_tenant_id" = ''****''and tn_product_code<>''合計''AND tn_dealstatus=0 '
     );
/*--更新*****銷量銷量數據同步狀態tn_dealstatus=1代表已同步*/
UPDATE OPENQUERY
       (POSTGRES, 'select * from "tn_bflh_sales"')
SET tn_dealstatus = 1
WHERE tn_tenant_id = '****'
      AND tn_dealstatus = 0;

/*--同步***銷量數據*/
INSERT INTO sfa_t_drf_sales
SELECT *
FROM OPENQUERY
     (POSTGRES, 'select * from "tn_drf_salses" where "tn_tenant_id" = ''****'' AND tn_dealstatus=0 ');

/*--更新****銷量銷量數據同步狀態tn_dealstatus=1代表已同步*/
UPDATE OPENQUERY
       (POSTGRES, 'select * from "tn_drf_salses"')
SET tn_dealstatus = 1
WHERE tn_tenant_id = '****'
      AND tn_dealstatus = 0;

/*--同步***link作業計划日志*/
TRUNCATE  TABLE sfa_ka_tasklog;
INSERT INTO sfa_ka_tasklog
SELECT * 
FROM OPENQUERY
     (POSTGRES,
      'SELECT t.tn_task_id,t.tn_ka_id,tn_ka_code,tn_dealers_account,t.tn_task_date,i.tn_status,i.tn_task_steps,i.tn_reak FROM tn_ka_task t LEFT JOIN tn_ka_task_info i on t.tn_task_id=i.tn_task_id
where t.tn_tenant_id=''*****'''
     )
	 ORDER BY tn_task_date,tn_task_steps

DELETE FROM sfa_t_KaSales WHERE DATEDIFF(DAY, tn_date, GETDATE()) = 1  /*--清洗數據前,先刪除之前清洗好的前一天銷量*/
INSERT INTO sfa_t_KaSales
/*--清洗******前一天銷量數據*/
SELECT tn_tenant_id,'****'kaname,
       tn_dealer_number,
	   case    WHEN tn_supplier_code = '******' THEN
               '***'
           WHEN tn_supplier_code = '*****' THEN
               '***'
       END quyu,
       tn_supplier_code,
       tn_supplier_name,
       tn_end_date tn_date,
       tn_product_code,
       tn_product_desc producname,
       tn_store_code,
       tn_store_name,
       MAX(tn_sale_quantity) AS tn_day_sales,
       CAST(MAX(tn_sale_amount) AS NVARCHAR(20)) AS tn_sale_amount
FROM sfa_t_pflh_sales
WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 0
/*      --AND tn_dealstatus = 0 */
GROUP BY tn_tenant_id,
         tn_dealer_number,
         tn_supplier_code,
         tn_supplier_name,
         tn_end_date,
         tn_product_code,
         tn_product_desc,
         tn_store_code,
         tn_store_name
UNION
/*--清洗****前一天銷量數據*/
SELECT tn_tenant_id,'****'kaname,
       tn_dealer_number,
	   CASE
           WHEN tn_dealer_number = '*****' THEN
               '***'
           WHEN tn_dealer_number = '****' THEN
               '***'
           WHEN tn_dealer_number = '****' THEN
               '***'
       END quyu,
       CASE
           WHEN tn_dealer_number = '****' THEN
               '****'
           WHEN tn_dealer_number = '*****' THEN
               '****'
           WHEN tn_dealer_number = '****' THEN
               '***'
       END tn_supplier_code,
       '******' tn_supplier_name,
       DATEADD(DAY, -1, tn_date) tn_date,
       tn_product_code,
       p.producname,
	   '' tn_store_code,
       tn_store_name,      
       MAX(tn_day_sales) tn_day_sales,
       '' tn_sale_amount
FROM dbo.sfa_t_drf_sales s
    LEFT JOIN dbo.sfa_t_KaProduct p
        ON s.tn_product_code = p.productcode
           AND p.KaName = 'drf'
WHERE DATEDIFF(DAY, s.tn_create_date, GETDATE()) = 0
/*      --AND s.tn_dealstatus = 0  */
GROUP BY tn_tenant_id,
         tn_dealer_number,
         tn_product_code,
         tn_store_name,
         p.producname,
         DATEADD(DAY, -1, tn_date);

/*--為保持數據清洗速度,各***系統銷量臨時表只保留15天數據  */
DELETE FROM  sfa_t_pflh_sales
WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15

DELETE FROM  sfa_t_drf_sales
WHERE DATEDIFF(DAY, tn_create_date, GETDATE()) = 15
GO
 
 
 

后記

后續通過觀察,發現該job某些時間段是可以正常執行,某些時間段是會失敗;
直接報錯信息如下:
Msg 109, Level 20, State 0, Line 6
A transport-level error has occurred when receiving results from the server. (provider: Shared Memory Provider, error: 0 - The pipe has been ended.)
 
 
 
微軟官方有個 hotfix針對這種情況,可惜下載執行提示已不適應當前系統了;
該job使用dblink的內存與sql server設置的最大內存以及是否鎖定內存頁都沒有直接關系,我們甚至把max memroy 值從28G調小至10G,該現象還是會出現,觀察該job執行的時候,其中有一段腳本reads超15萬,查詢有類似的案例如下:
后續跟該項目負責人溝通之后,他們評估認為直接停用該job,該問題是沒有再重現了。但我認為如 他們要實現需求還可以嘗試:
  • 替換現有的ODBC的驅動;
  • 在job中代碼做些優化;
  • 或換其他模式如SSIS等;
  • 或在數據庫啟動參數中加 -g ;

結論

從這個案例來看,直接使用 DBlink做數據對接,且每次數據量超5萬以上不是一個首選方案;

參考

 

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM