關於刪除MySQL Logs的一點記錄


五一前,一個DBA同事反饋,在日常環境中刪除一個大的slow log文件(假設文件大小10G以上吧),然后在MySQL中執行flush slow logs,會發現mysqld hang住。

 

今天嘗試着重現了此問題,這里簡要分析下原因。

 

重現步驟:

1. 構造slow log (將long_query_time設成了0);

2. 觀察刪rm slow log瞬間,tps/qps變化;

3. 觀察執行flush slow logs瞬間,tps/qps變化;

4. 記錄flush slow logs執行時, pstack打出的調用棧情況;

 

第一步,沒啥好說的。

第二步,tps/qps沒啥變化。

第三步,會發現tps/qps瞬間跌0,如下所示:

[ 639s] threads: 32, tps: 1121.00, reads/s: 15843.98, writes/s: 4490.99
[ 640s] threads: 32, tps: 792.99, reads/s: 10803.89, writes/s: 3150.97
[ 641s] threads: 32, tps: 0.00, reads/s: 0.00, writes/s: 0.00
[ 642s] threads: 32, tps: 0.00, reads/s: 0.00, writes/s: 0.00
[ 643s] threads: 32, tps: 471.01, reads/s: 6860.08, writes/s: 1908.02

mysql命令行會發現,flush slow logs執行時間剛好為3s左右。

第四步,我們看下pstack的輸出結果,只記錄相關的:

610 Thread 5 (Thread 0x2afdc4101700 (LWP 30762)):
611 #0  0x0000003c6e40a7d6 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
612 #1  0x0000000000825135 in inline_mysql_rwlock_rdlock ()
613 #2  0x0000000000838004 in LOGGER::lock_shared() ()
614 #3  0x00000000008283bf in LOGGER::slow_log_print(THD*, char const*, unsigned int, unsigned long long) ()
615 #4  0x0000000000832b30 in slow_log_print(THD*, char const*, unsigned int, unsigned long long) ()
616 #5  0x0000000000609f23 in log_slow_statement(THD*) ()
617 #6  0x00000000006099d1 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
618 #7  0x0000000000606e02 in do_command(THD*) ()
619 #8  0x00000000006f070f in do_handle_one_connection(THD*) ()
620 #9  0x00000000006f020d in handle_one_connection ()
621 #10 0x0000003c6e4077f1 in start_thread () from /lib64/libpthread.so.0
622 #11 0x0000003c6e0e570d in clone () from /lib64/libc.so.6

623 Thread 4 (Thread 0x2afdd0080700 (LWP 30763)):
624 #0  0x0000003c6e40a7d6 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
625 #1  0x0000000000825135 in inline_mysql_rwlock_rdlock ()
626 #2  0x0000000000838004 in LOGGER::lock_shared() ()
627 #3  0x00000000008283bf in LOGGER::slow_log_print(THD*, char const*, unsigned int, unsigned long long) ()
628 #4  0x0000000000832b30 in slow_log_print(THD*, char const*, unsigned int, unsigned long long) ()
629 #5  0x0000000000609f23 in log_slow_statement(THD*) ()
630 #6  0x00000000006099d1 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
631 #7  0x0000000000606e02 in do_command(THD*) ()
632 #8  0x00000000006f070f in do_handle_one_connection(THD*) ()
633 #9  0x00000000006f020d in handle_one_connection ()
634 #10 0x0000003c6e4077f1 in start_thread () from /lib64/libpthread.so.0
635 #11 0x0000003c6e0e570d in clone () from /lib64/libc.so.6

636 Thread 3 (Thread 0x2afdd0101700 (LWP 30764)):
637 #0  0x0000003c6e40a7d6 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
638 #1  0x0000000000825135 in inline_mysql_rwlock_rdlock ()
639 #2  0x0000000000838004 in LOGGER::lock_shared() ()
640 #3  0x00000000008283bf in LOGGER::slow_log_print(THD*, char const*, unsigned int, unsigned long long) ()
641 #4  0x0000000000832b30 in slow_log_print(THD*, char const*, unsigned int, unsigned long long) ()
642 #5  0x0000000000609f23 in log_slow_statement(THD*) ()
643 #6  0x00000000006099d1 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
644 #7  0x0000000000606e02 in do_command(THD*) ()
645 #8  0x00000000006f070f in do_handle_one_connection(THD*) ()
646 #9  0x00000000006f020d in handle_one_connection ()
647 #10 0x0000003c6e4077f1 in start_thread () from /lib64/libpthread.so.0
648 #11 0x0000003c6e0e570d in clone () from /lib64/libc.so.6

649 Thread 2 (Thread 0x2afe18080700 (LWP 30855)):
650 #0  0x0000003c6e40e54d in close () from /lib64/libpthread.so.0
651 #1  0x00000000008f56ed in my_close ()
652 #2  0x0000000000825c16 in inline_mysql_file_close ()
653 #3  0x000000000082b305 in MYSQL_LOG::close(unsigned int) ()
654 #4  0x000000000082b634 in MYSQL_QUERY_LOG::reopen_file() ()
655 #5  0x0000000000828283 in LOGGER::flush_slow_log() ()
656 #6  0x000000000071d8fc in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, int*) ()
657 #7  0x0000000000610200 in mysql_execute_command(THD*) ()
658 #8  0x000000000061534d in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
659 #9  0x00000000006086a0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
660 #10 0x0000000000606e02 in do_command(THD*) ()
661 #11 0x00000000006f070f in do_handle_one_connection(THD*) ()
662 #12 0x00000000006f020d in handle_one_connection ()
663 #13 0x0000003c6e4077f1 in start_thread () from /lib64/libpthread.so.0
664 #14 0x0000003c6e0e570d in clone () from /lib64/libc.so.6

會發現Thread 2在執行flush slow logs操作,其他的線程都在等待鎖LOCK_log上邊。

 

背后的原因其實很簡單,在shell中執行rm slow log操作時,由於mysqld仍有文件句柄打開此文件,所以實際上此時文件並未刪除。執行flush logs操作,其實際執行的是1)close;2)open 操作(logger.flush_slow_log -> mysql_slow_log.reopen_file),在close操作執行時,文件系統真正刪除文件,此時該線程占用着LOCK_log鎖。

刪除時會執行刷臟(當然我構造這個場景很極端,基本所有slow log文件的內容都在文件系統緩存中),這個會很耗時間,比如我執行這個語句耗了3s。此時間段內,如果連接發來的語句需要記log(server層的log:slow log/binlog/general log共有LOCK_log這把鎖)就會處於等待狀態,那么系統對外的反應就是hang住了。

 

flush slow logs中調用執行的close所需時間和文件大小、以及文件系統緩存中該文件臟頁比例都有關系,比如我在執行flush slow logs前使用sysctl vm.drop_caches=3清空

了文件系統緩存的話,同樣大小的flush slow logs操作執行時間是0.42s,相應的阻塞時間也會減少不少。

可以考慮在slow logs的文件句柄上執行posix_fadvise調用,促使不會緩存很大的log文件內容(slow log也沒啥需要緩存的),這有篇霸爺的文章,可以參考下 posix_fadvise清除緩存的誤解和改進措施 。

另外,peter在07年就討論過這個問題, Be careful rotating MySQL logs 其給出的建議是先mv file,然后flush logs,再執行刪除文件的操作,讓真正的刪除行為由自己而不是mysqld完成。比較遺憾的是,七年過去了,LOCK_log這把鎖的問題還沒有完整的解決掉。

 

PS:

文章結尾記一點備忘,通過close/rm操作刪除一個10G大小的文件,在執行sysctl vm.drop_caches=3清空緩存后,此操作的耗時仍在百毫秒量級(我的機器上是200ms+),其背后做了什么事情還需要找內核組的同事了解下。

 

bug report:  http://bugs.mysql.com/bug.php?id=72529


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM