At 8 a.m. on a Saturday, the application started throwing login timeouts across the board.
For an app with 150,000 daily active users and more than 7 million registered users, this was a critical failure.
The only consolation was that it was the weekend, so mostly companies working overtime would be affected. Even so, the support and product teams' phones rang off the hook.
The initial suspicion fell on the previous evening's release, so ops rolled back the deployment. The rollback completed, yet the problem persisted and the service stayed down.
By 9 a.m. ops was out of ideas. Essentially every development team lead had come in, plus the architecture group: a dozen or so people analyzing the problem while support, operations, and product placated customers and posted status announcements, and the director and VP looked on as a crowd of developers scrambled for the cause.
Our first suspects were the backend infrastructure systems; historically that is where problems tend to surface. We went through mongo, mysql, redis, memcache, and rabbitmq one by one, and all of them looked normal: cluster read/write load was light and request handling times were short.
With those ruled out, suspicion shifted to one HTTP service: an old one built on OAuth 1.0 whose underlying HTTP client cannot set timeouts (newer services use async-http, which does support them). We disabled that service; the problem remained.
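For context on the timeout point: with the JDK's plain HttpURLConnection, both timeouts default to 0, meaning wait forever, which is how a client without explicit timeouts can hang indefinitely on a dead dependency. A minimal sketch of the fix (the URL and timeout values here are illustrative, not the actual service's configuration):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutSketch {
    // Open a connection that fails fast instead of hanging forever.
    public static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(2_000); // abort if the TCP connect stalls (default 0 = infinite)
        conn.setReadTimeout(5_000);    // abort if the server stops responding mid-read
        return conn;                   // openConnection() itself does not touch the network
    }
}
```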
The analysis then narrowed to requests against internal service interfaces. Hitting those interfaces directly over VPN, we tested several services that were timing out at the gateway, and all of them responded normally. That finally pinned the problem on the gateway service.
The gateway is a lightweight service with two responsibilities: (1) authentication (mobile clients, ERP) and (2) routing (by business line). In theory it should not be the culprit, but rather than speculate, we took a thread dump.
The gateway's thread dump exposed the problem immediately:
"qtp1056944384-232" prio=10 tid=0x00007f54900d0800 nid=0x63b3 waiting for monitor entry [0x00007f54492d0000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:205)
- waiting to lock <0x00000007e81c4830> (a org.apache.log4j.spi.RootLogger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:368)
More than 200 log4j threads in total were waiting on lock "0x00000007e81c4830". So who held it? Searching the dump turned up the holder:
"qtp1056944384-218" prio=10 tid=0x00007f54800bb800 nid=0x63a5 runnable [0x00007f544a0de000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:141)
at net.logstash.log4j.SocketAppender.append(SocketAppender.java:190)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
- locked <0x00000007e8210868> (a net.logstash.log4j.SocketAppender)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
- locked <0x00000007e81c4830> (a org.apache.log4j.spi.RootLogger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:368)
Thread qtp1056944384-218 held lock "0x00000007e81c4830", and every other log4j thread was waiting for it to be released. The holder's stack contains the key clue, "logstash.log4j", so the preliminary diagnosis pointed at the recently added ELK stack. ELK (Logstash + ElasticSearch + Kibana 4) is a real-time log analysis system the architecture team had introduced recently; the relevant change was to log4j.rootLogger, adding an appender that ships logs to logstash in real time. At 8 a.m., logstash's own log looked like this:
WARNING: [logstash-elk_0_47-16997-7944] [gc][young][2284683][25218] duration [1.1s], collections [1]/[1.8s], total [1.1s]/[12.2m], memory [1gb]->[915.1mb]/[3.9gb], all_pools {[young] [269.4mb]->[6.4mb]/[532.5mb]}{[survivor] [50.3mb]->[50.3mb]/[66.5mb]}{[old] [742.8mb]->[858.4mb]/[3.3gb]}
{:timestamp=>"2015-12-12T08:01:00.606000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-12-12T08:01:00.636000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-12-12T08:01:00.637000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
{:timestamp=>"2015-12-12T08:01:00.637000+0800", :message=>"retrying failed action with response code: 503", :level=>:warn}
logstash was seeing a flood of 503s. The elasticsearch log for the same window:
[2015-12-12 08:00:00,244][INFO ][cluster.metadata ] [Iron Man] [logstash-2015.12.12] creating index, cause [auto(bulk api)], templates [logstash], shards [5]/[0], mappings [_default_]
[2015-12-12 08:00:00,285][INFO ][cluster.metadata ] [Iron Man] [app-logs-2015.12.12] creating index, cause [auto(bulk api)], templates [], shards [5]/[0], mappings []
[2015-12-12 08:00:18,053][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node
[2015-12-12 08:00:48,053][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node
[2015-12-12 08:00:48,054][INFO ][cluster.routing.allocation.decider] [Iron Man] high disk watermark exceeded on one or more nodes, rerouting shards
[2015-12-12 08:01:18,087][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node
[2015-12-12 08:01:48,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node
[2015-12-12 08:02:18,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node
[2015-12-12 08:02:18,054][INFO ][cluster.routing.allocation.decider] [Iron Man] high disk watermark exceeded on one or more nodes, rerouting shards
[2015-12-12 08:02:48,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.5gb[7%], shards will be relocated away from this node
[2015-12-12 08:03:18,054][WARN ][cluster.routing.allocation.decider] [Iron Man] high disk watermark [10%] exceeded on [o-oE92qNSyyoqbzUoRTS7Q][Iron Man] free: 34.7gb[7%], shards will be relocated away from this node
[2015-12-12 08:03:18,055][INFO ][cluster.routing.allocation.decider] [Iron Man] high disk watermark exceeded on one or more nodes, rerouting shards
Ops confirmed that ELK ran on a single node. elasticsearch had exhausted its disk space and become unavailable (there was no disk-usage alerting in place), so logstash backed up; with logstash blocked, the appender's socket write blocked, and that took down the entire gateway cluster.
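The watermark messages above come from elasticsearch's disk-based shard allocator, whose thresholds are configurable. A hypothetical elasticsearch.yml fragment (values illustrative, not our actual settings) shows the knobs involved; alerting on free disk against these thresholds would have fired well before 8 a.m.:

```yaml
# Disk-based shard allocation thresholds (elasticsearch 1.x setting names).
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.low: 85%    # stop allocating new shards to the node
cluster.routing.allocation.disk.watermark.high: 90%   # start relocating shards away
```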
The immediate mitigation was to stop the synchronous writes to logstash and temporarily raise the production log level to fatal to cut log volume.
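Both the change that introduced the risk and the mitigation come down to one line of log4j configuration. A hypothetical log4j.properties sketch (appender name, host, and parameter names are assumed for illustration, not taken from the incident):

```properties
# Before the ELK rollout: file logging only.
# log4j.rootLogger=INFO, FILE

# After the rollout: a synchronous socket appender shares the rootLogger,
# so a stalled logstash stalls every logging thread.
log4j.rootLogger=INFO, FILE, LOGSTASH
log4j.appender.LOGSTASH=net.logstash.log4j.SocketAppender
log4j.appender.LOGSTASH.RemoteHost=logstash.internal
log4j.appender.LOGSTASH.Port=4560

# Emergency mitigation during the outage: drop LOGSTASH, raise the level.
# log4j.rootLogger=FATAL, FILE
```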
Why would hundreds of threads pile up waiting on a single log4j lock? Afterwards I went through the source of log4j's Category.callAppenders:
  /**
     Call the appenders in the hierrachy starting at
     <code>this</code>. If no appenders could be found, emit a
     warning.

     <p>This method calls all the appenders inherited from the
     hierarchy circumventing any evaluation of whether to log or not
     to log the particular log request.

     @param event the event to log. */
  public void callAppenders(LoggingEvent event) {
    int writes = 0;

    for(Category c = this; c != null; c=c.parent) {
      // Protected against simultaneous call to addAppender, removeAppender,...
      synchronized(c) {
        if(c.aai != null) {
          writes += c.aai.appendLoopOnAppenders(event);
        }
        if(!c.additive) {
          break;
        }
      }
    }

    if(writes == 0) {
      repository.emitNoAppenderWarning(this);
    }
  }
log4j 1.x relies on the old-fashioned synchronized(c) idiom here. All threads share the Category hierarchy configured in log4j.properties, and because loggers are additive up to the root by default, nearly every log call ends up synchronizing on the RootLogger while its appenders run. Every thread logging through the same Category must take that global monitor, so if one appender blocks inside it, as the SocketAppender did here, every other logging thread blocks behind it. This makes log4j 1.x inefficient and fundamentally unsuited to high-concurrency scenarios.
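The effect is easy to reproduce without log4j at all. The sketch below (hypothetical code, not from the incident) mimics the pattern: one thread sits in a slow "appender" while holding the shared monitor, and a second logging thread goes straight to BLOCKED, exactly like the qtp threads in the dump above.

```java
import java.util.concurrent.CountDownLatch;

// Minimal sketch of the log4j 1.x callAppenders pattern: every logging call
// takes one shared monitor, so one slow appender stalls all logging threads.
public class LoggerContention {
    private static final Object CATEGORY_LOCK = new Object(); // stands in for the RootLogger monitor

    // Analogue of Category.callAppenders(): append while holding the shared lock.
    static void callAppenders(Runnable appender) {
        synchronized (CATEGORY_LOCK) {
            appender.run();
        }
    }

    // Returns the state a second logging thread reaches while the first one
    // is stuck in a slow write. Expected: Thread.State.BLOCKED.
    public static Thread.State demo() throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Thread 1: an "appender" hanging in a socket write while holding the lock.
        Thread slowWriter = new Thread(() -> callAppenders(() -> {
            lockHeld.countDown();
            try { release.await(); } catch (InterruptedException ignored) { }
        }));
        slowWriter.start();
        lockHeld.await();

        // Thread 2: an ordinary log call; it can only queue up on the monitor.
        Thread blockedLogger = new Thread(() -> callAppenders(() -> { }));
        blockedLogger.start();
        while (blockedLogger.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10); // spin until it parks, like "waiting to lock <0x...>"
        }
        Thread.State observed = blockedLogger.getState();

        release.countDown();
        slowWriter.join();
        blockedLogger.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("second logger thread: " + demo());
    }
}
```

With 200+ request threads all funneling through one monitor whose holder is stuck in socketWrite0, the whole gateway freezes even though the gateway's own logic is healthy.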
To keep this class of problem from recurring, the lessons learned:
1. Cut unnecessary logging: on mature, stable interfaces, turn log output off to reduce overhead.
2. Replace the underlying logging implementation: drop log4j 1.x in favor of logback (recommended) or the newer log4j 2.x.
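For lesson 2, the property to insist on in the replacement is an asynchronous, non-blocking appender in front of any network sink. A hypothetical logback.xml sketch (the class names are real logback / logstash-logback-encoder classes; the host and sizes are illustrative):

```xml
<configuration>
  <appender name="STASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <destination>logstash.internal:4560</destination>
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <!-- AsyncAppender hands events to a background thread; with neverBlock=true
       a full queue drops events instead of blocking callers, so a dead
       logstash can no longer freeze the application. -->
  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <queueSize>8192</queueSize>
    <neverBlock>true</neverBlock>
    <appender-ref ref="STASH"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>
```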
Finally, two write-ups on logging deadlocks in log4j 1.x (a deadlock was our own initial suspicion):
https://bz.apache.org/bugzilla/show_bug.cgi?id=50213
http://javaeesupportpatterns.blogspot.com/2012/09/log4j-thread-deadlock-case-study.html