Troubleshooting an Unexpected MySQL OOM


An alert came in reporting that MySQL was down. By then the monitoring showed the load was already quite high, and the server could no longer be logged into. The system-level load graph at the time confirmed the spike.

Beyond the system-level metrics, the MySQL-level monitoring had gone blank during this window; by the time the load came back down, mysqld had already been killed by the OOM killer and automatically restarted.
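Whether mysqld was in fact killed by the kernel OOM killer can usually be confirmed from the kernel log; a quick check (the exact log path varies by distribution):

# Look for OOM killer activity around the incident window
dmesg -T | grep -i -E 'out of memory|killed process'
# On syslog-based systems the same records land in the system log
grep -i 'oom' /var/log/messages     # /var/log/syslog on Debian/Ubuntu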

The MySQL error log showed the following:

2020-08-01T09:14:11.778918+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4271ms. The settings might not be optimal. (flushed=386 and evicted=119, during the time.)

This essentially means the database was receiving more writes than it could flush: the buffer pool filled up with dirty pages, which triggered the page cleaner to flush them out. Because there were far more dirty pages than usual, each page cleaner pass took much longer than its intended 1000ms loop.
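While this kind of pressure is ongoing, the dirty-page backlog can be watched directly from the InnoDB status counters; a minimal sketch (the connection options are placeholders):

# Compare dirty pages against the buffer pool total; a persistently high
# Innodb_buffer_pool_pages_dirty means the page cleaner cannot keep up.
mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%'"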

Since the MySQL-level monitoring had gone blank during the period of abnormally high load, there were no metrics to look at there. So how do we investigate? The only thing left is the binlog, so let's analyze it and look for large transactions.

mysqlbinlog mysql-bin.005298 | grep "GTID$(printf '\t')last_committed" -B 1  | egrep -E '^# at|^#20' | awk '{print $1,$2,$3}' | sed 's/server//' | sed 'N;s/\n/ /' | awk 'NR==1 {tmp=$1} NR>1 {print $4,$NF,($3-tmp);tmp=$3}' | sort -k 3 -n -r | head -n 20

Around 8:56:44 the largest single transaction comes out at roughly 100MB, which is quite large. (The one-liner pairs each GTID event's last_committed line with the preceding "# at <position>" line, takes the difference between consecutive byte offsets as each transaction's size, and lists the top 20.) Continuing to parse the binlog within that time range revealed large-scale data updates. Still, this is not very intuitive, so let's write a small script that pushes the DML trend into ES for analysis. The code is fairly simple:

#!/usr/bin/python
#coding: utf8
""" 跟蹤一段時間內dml趨勢,用ES分析
Usage:
python dml2es.py | logstash -f /etc/logstash/conf.d/t_binlog.conf

"""
import json
import os
import sys
import logging
from datetime import datetime
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.event import QueryEvent, RotateEvent
from pymysqlreplication.row_event import (
    WriteRowsEvent,
    UpdateRowsEvent,
    DeleteRowsEvent,
)

def init_log():
    # get root logger
    mylogger = logging.getLogger()
    mylogger.setLevel(logging.DEBUG)

    # create file handler and add formatter to handler
    verbose_log_file = "/tmp/%s.verbose.log" % (os.path.basename(sys.argv[0]))
    fh = logging.FileHandler(verbose_log_file)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(filename)s:%(lineno)d %(levelname)s %(message)s",
        datefmt="%a %d %b %Y %H:%M:%S"
    ))

    # create stream handler and add formatter to handler
    sh = logging.StreamHandler(sys.stdout)
    sh.setLevel(logging.INFO)
    sh.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%H:%M:%S"
    ))

    # add two handler to logger
    mylogger.addHandler(fh)
    mylogger.addHandler(sh)
    return mylogger


if __name__ == "__main__":
    logger = init_log()
    mysql_settings = {
        'host': '127.0.0.1',
        'port': 3306,
        'user': 'xxx',
        'passwd': 'xxx'
    }

    binlog_file = 'mysql-bin.005298'
    start_position = 123
    stop_position = 1031510925

    # no blocking needed; we only read a fixed byte range of one binlog file
    stream = BinLogStreamReader(
        connection_settings = mysql_settings,
        server_id=9999,
        log_file=binlog_file,
        log_pos=start_position,
        resume_stream=True,
        blocking=False
    )

    for binlogevent in stream:
        if binlogevent.packet.log_pos >= stop_position:
            break

        # Emitting "@timestamp": binlogevent.timestamp as a raw epoch integer
        # (e.g. "@timestamp" => 1582624830) makes logstash warn:
        # [WARN ] 2020-02-25 18:00:31.313 [[main]<stdin] Event - Unrecognized @timestamp value type=class org.jruby.RubyFixnum
        # so the epoch is converted to a formatted string below;
        # datetime.fromtimestamp interprets it in the OS local timezone.
        dmltype = None
        if isinstance(binlogevent, WriteRowsEvent):
            dmltype = "insert"
        elif isinstance(binlogevent, DeleteRowsEvent):
            dmltype = "delete"
        elif isinstance(binlogevent, UpdateRowsEvent):
            dmltype = "update"
        elif isinstance(binlogevent, RotateEvent):
            if binlogevent.packet.log_pos <= 1655:
                logger.debug(u"next_binlog:" + binlogevent.next_binlog)

        # one JSON document per row event, consumed by logstash on stdout
        if dmltype:
            data = {
                "schema": binlogevent.schema,
                "table": binlogevent.table,
                "dmltype": dmltype,
                "@timestamp": datetime.fromtimestamp(binlogevent.timestamp).strftime('%Y-%m-%dT%H:%M:%S'),
            }
            print(json.dumps(data))

    stream.close()
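The usage line above pipes the script's JSON output into logstash via /etc/logstash/conf.d/t_binlog.conf. That config file is not included in the original post; a minimal sketch of what it could look like (the ES address and index name are assumptions):

# t_binlog.conf -- a sketch, not the original file:
# read one JSON document per line from stdin and index it into
# Elasticsearch, where Kibana can chart the dmltype counts over time
input {
  stdin { codec => json }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]          # assumed local ES instance
    index => "dml-trend-%{+YYYY.MM.dd}"  # assumed index name
  }
}

Because the JSON already carries an ISO-formatted "@timestamp", logstash uses it as the event time, which is exactly why the script converts the epoch beforehand.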

In Kibana, narrowing the time window down to second granularity, we can look at the DML trend over this period.

This matches what we got from analyzing the binlog directly: a handful of large transactions at that point in time drove a huge volume of data updates. Since innodb_page_cleaners was already set to 4, no adjustment was made there.
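For reference, the page cleaner thread count can be checked as below. Note that innodb_page_cleaners is not a dynamic variable, so changing it would require editing my.cnf and restarting mysqld (connection options are placeholders):

# innodb_page_cleaners = number of threads flushing dirty pages from the buffer pool
mysql -uroot -p -e "SHOW GLOBAL VARIABLES LIKE 'innodb_page_cleaners'"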

 

Summary:

1. Avoid large transactions, and check whether SQL statements make proper use of indexes.
2. Tune the relevant MySQL parameters, such as innodb_io_capacity and innodb_io_capacity_max. Since this machine uses an NVMe SSD, innodb_io_capacity was raised to 10000, as sketched below.
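Both I/O capacity parameters are dynamic, so they can be raised online and then persisted in my.cnf; a minimal sketch (the innodb_io_capacity_max value is an assumption, the post only mentions raising innodb_io_capacity to 10000):

# Raise the baseline IOPS budget for background flushing (dynamic variable)
mysql -uroot -p -e "SET GLOBAL innodb_io_capacity = 10000"
# innodb_io_capacity_max caps burst flushing; it must be >= innodb_io_capacity
# (20000 here is an assumed value, not from the original incident)
mysql -uroot -p -e "SET GLOBAL innodb_io_capacity_max = 20000"
# Persist in my.cnf so the values survive a restart:
#   [mysqld]
#   innodb_io_capacity     = 10000
#   innodb_io_capacity_max = 20000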

 

