[Repost] Too much data cached on the reduce side triggers full GC, preventing the reduce output from being written to HDFS


Reposted from http://blog.csdn.net/bigdatahappy/article/details/41726389

I am reposting this because the troubleshooting and tuning approach in the original post is sound and worth learning from.

 

A search-recommendation job with over 1000 map tasks and 200 reduce tasks kept failing with the following exception once it had run down to a single remaining reduce (on 10.39.6.130):

 

2014-12-04 15:49:04,297 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 12 segments left of total size: 11503294914 bytes
2014-12-04 15:49:04,314 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2014-12-04 15:49:04,394 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.lzo_deflate]
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)

2014-12-04 16:05:23,743 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.Vector.ensureCapacityHelper(Vector.java:226)
    at java.util.Vector.add(Vector.java:728)
    at rec.CommonUtil.pack_Treeset(CommonUtil.java:395)
    at rec.ConvertExposure10$MyReducer.collect_exposure(ConvertExposure10.java:259)
    at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:329)
    at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:234)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
From the exception, the first thing that fails is the reduce writing to HDFS: on the write pipeline it had set up, it received a bad response, rather than a normal ack, for the last node in the pipeline:
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,889 WARN [ResponseProcessor for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223
java.io.IOException: Bad response ERROR for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010
2014-12-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1386326728-10.39.2.131-1382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
Take the write of block blk_1394149732_320469086 as an example. The last DN in its pipeline [10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010] is 10.39.5.193, so check that node's log for the block:
2014-12-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
The log on 10.39.5.193 shows it was trying to read a packet from the previous node in the pipeline, 10.39.5.185, and never received one before the 60-second (60000 ms) socket read timeout fired:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
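(The 60000 ms here is HDFS's default socket read timeout; in Hadoop 2.x it comes from dfs.client.socket-timeout, which DataNodes also apply when reading from the upstream node in a write pipeline. Raising it would only delay the symptom, so the trail continues upstream instead.)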
So look next at the second node in the pipeline, 10.39.5.185; its DN log reads:
2014-12-04 16:00:57,988 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1083)
    at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:662)
2014-12-04 16:00:58,009 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
Much like 10.39.5.193, it too timed out waiting to read a packet, in its case from the first node in the pipeline, 10.39.6.130:
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
That points the problem at 10.39.6.130, which is exactly the node running the reduce task. Its DN log shows:
2014-12-04 16:00:59,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1386326728-10.39.2.131-1382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:662)
Yet by this log, the DN on 10.39.6.130 was likewise just waiting for a packet that never arrived before the timeout:

java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]

So the DN on 10.39.6.130 is not at fault either. If none of the three DNs in the pipeline is the problem, the problem must be on the DFSClient side: the reduce task never actually pushed its data out at all; the write stalled inside the DFSClient itself. The next step is to examine the DFSClient, that is, the reduce task's own process:

On 10.39.6.130, the task id attempt_1415948652989_195149_r_000158_3 leads to process id 31050; check its memory usage:
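(The original post does not show how the pid was found. One way, assuming shell access to the node, is jps -m | grep attempt_1415948652989_195149_r_000158_3, since the YarnChild JVM is launched with the attempt id among its command-line arguments.)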

 

jstat -gcutil 31050 1000:
    S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
 68.95   0.00   0.00  92.98  66.32    111   16.825    10   25.419   42.244
 68.95   0.00   0.00  92.98  66.32    111   16.825    10   25.419   42.244
 68.95   0.00   0.00  92.98  66.32    111   16.825    10   25.419   42.244
 68.95   0.00   0.00  92.98  66.32    111   16.825    10   25.419   42.244
 68.95   0.00   0.00  92.98  66.32    111   16.825    10   25.419   42.244
 68.95   0.00   0.00  92.98  66.32    111   16.825    10   25.419   42.244
 68.95   0.00  26.75 100.00  66.32    111   16.825    10   25.419   42.244
  0.00   0.00  31.85 100.00  68.16    111   16.825    10   44.767   61.591
  0.00   0.00  35.37 100.00  68.16    111   16.825    10   44.767   61.591
  0.00   0.00  40.64 100.00  68.16    111   16.825    10   44.767   61.591
  0.00   0.00  45.35 100.00  68.16    111   16.825    10   44.767   61.591
  0.00   0.00  48.87 100.00  68.16    111   16.825    10   44.767   61.591
  0.00   0.00  54.14 100.00  68.16    111   16.825    10   44.767   61.591
  0.00   0.00  58.85 100.00  68.16    111   16.825    10   44.767   61.591

Sure enough: the old generation is full (O climbs from 92.98% to 100.00%) and the JVM is stuck in full GC. Note how FGCT jumps from 25.419 s to 44.767 s between samples, apparently a single stop-the-world pause of roughly 19 seconds. During such pauses the JVM can serve nothing, so the DFSClient hangs and cannot write packets to the pipeline nodes until the sockets time out.
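(A complementary check, assuming the job can be rerun, is to add -verbose:gc -XX:+PrintGCDetails to mapreduce.reduce.java.opts so each full-GC pause shows up in the task's own log alongside the DFSClient warnings.)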

The last line of the reduce task's log corroborates this from the other side:

Error running child : java.lang.OutOfMemoryError: Java heap space

Since the job died of an OOM, which objects were filling the heap? Run:

 

jmap -histo:live 31050 > jmap.log
cat jmap.log:
num     #instances         #bytes  class name
----------------------------------------------
   1:      71824177     2872967080  java.util.TreeMap$Entry
   2:      71822939     1723750536  java.lang.Long
   3:         10684       24777776  [B
   4:         47174        6425152  <methodKlass>
   5:         47174        6408120  <constMethodKlass>
   6:          3712        4429776  <constantPoolKlass>
   7:         66100        3979224  <symbolKlass>
   8:          3712        2938192  <instanceKlassKlass>
   9:          3125        2562728  <constantPoolCacheKlass>
  10:          3477        1267752  [I
  11:         12923        1180224  [C
  12:          1794         772488  <methodDataKlass>
  13:         13379         428128  java.lang.String
  14:          4034         419536  java.lang.Class
  15:          6234         410312  [S
  16:          6409         352576  [[I
  17:          7567         242144  java.util.HashMap$Entry
  18:           293         171112  <objArrayKlassKlass>
  19:          4655         148960  java.util.Hashtable$Entry
  20:          1535         135080  java.lang.reflect.Method
  21:           842         121696  [Ljava.util.HashMap$Entry;

There it is: the reduce code packs a huge number of objects into a TreeMap-backed structure. The TreeMap entries alone occupy 2.8 GB (71,824,177 entries at 40 bytes each), while the memory configured for our reduce is only 1.5 GB. Judging by the method name CommonUtil.pack_Treeset in the stack trace, the container is likely a TreeSet<Long>; TreeSet is implemented on top of TreeMap, which is exactly why TreeMap$Entry and boxed java.lang.Long dominate the histogram.
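To make the lesson concrete, here is a minimal sketch of the failure mode and the usual fix. It is hypothetical: the original rec.ConvertExposure10 source is not shown, so the class name ExposureReducer, the key/value types, and the logic below are invented for illustration; only the old mapred API matches the runOldReducer frames in the stack trace.

import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical sketch; not the original rec.ConvertExposure10 code.
public class ExposureReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  // Anti-pattern: buffer every value of the key before emitting anything.
  // A TreeSet<Long> is a TreeMap underneath, so tens of millions of values
  // become tens of millions of TreeMap$Entry plus boxed Longs, which is the
  // exact shape of the jmap histogram above.
  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    TreeSet<Long> buffer = new TreeSet<Long>();
    while (values.hasNext()) {
      buffer.add(values.next().get()); // heap grows with the key's cardinality
    }
    for (Long v : buffer) {
      output.collect(key, new LongWritable(v));
    }
  }

  // Safer shape: O(1) state per key, emitting as values stream past. If
  // per-key ordering is required, push it into the shuffle with a secondary
  // sort on a composite key instead of sorting inside reducer memory.
  public void reduceStreaming(Text key, Iterator<LongWritable> values,
                              OutputCollector<Text, LongWritable> output,
                              Reporter reporter) throws IOException {
    long previous = Long.MIN_VALUE; // assumes values arrive sorted via secondary sort
    while (values.hasNext()) {
      long v = values.next().get();
      if (v != previous) { // de-duplicate without buffering
        output.collect(key, new LongWritable(v));
        previous = v;
      }
    }
  }
}

The point of the second method is that heap use no longer depends on how many values a hot key has, so one skewed key can no longer stall the whole task.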

 

Summary: the exceptions this job hit generally arise in one of the following situations:

1. A DN in the write pipeline has trouble and cannot accept writes. For example, we previously saw DNs where a local-read problem drove the number of xceivers (the per-DN cap on concurrent data-transfer threads, 4096 here; dfs.datanode.max.transfer.threads in Hadoop 2.x) to the limit, exhausting all threads so the DN could not respond to newly initiated writes.

2. Network trouble: a DN's inbound or outbound bandwidth is saturated, so data cannot get in or out. Per-node bandwidth is easy to check in Ganglia; this case is rare in practice. When this job failed we did suspect bandwidth exhaustion, but Ganglia showed the relevant nodes peaking around 85 MB/s in/out, so bandwidth was ruled out.

3. The DFSClient itself misbehaves and stays unresponsive long enough for its established sockets to time out. The DFSClient side is the most varied source of problems; in this case the reduce overflowed its heap, the JVM sat in back-to-back full GCs, the DFSClient hung, and the sockets eventually timed out.
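(For this job, the durable fix is therefore two-sided: stop buffering unbounded per-key state in the reducer, as sketched above, and size the heap honestly; in Hadoop 2.x terms, keep the -Xmx in mapreduce.reduce.java.opts comfortably inside the container limit set by mapreduce.reduce.memory.mb.)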

 

