Problem Background
During the Spring Festival holiday I kept receiving data-anomaly alerts from our monitoring program. I hurried to connect through the jump server and check the state of each service, and found that the DataNodes on the second and third worker nodes had both dropped offline. Reading the DataNode and NameNode logs revealed the cause. I am recording this nerve-wracking troubleshooting session here for reference.
Problem Description
At runtime the NameNode (master) reported an out-of-memory error; an excerpt from its log follows:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Long.valueOf(Long.java:577)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$StorageBlockReportProto.<init>(DatanodeProtocolProtos.java:17327)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$StorageBlockReportProto.<init>(DatanodeProtocolProtos.java:17250)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$StorageBlockReportProto$1.parsePartialFrom(DatanodeProtocolProtos.java:17381)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$StorageBlockReportProto$1.parsePartialFrom(DatanodeProtocolProtos.java:17376)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
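"GC overhead limit exceeded" means the JVM is spending nearly all of its time in garbage collection while reclaiming almost nothing, i.e. the NameNode heap is effectively exhausted (here while parsing DataNode block reports). A quick way to confirm the heap pressure on the live process, assuming the JDK's jps and jstat tools are available on the NameNode host:

# Find the NameNode's JVM PID (jps lists local Java processes)
NN_PID=$(jps | awk '$2 == "NameNode" {print $1}')

# Sample GC activity once per second: an old generation (O) pinned
# near 100% with a climbing full-GC count (FGC) confirms heap exhaustion
jstat -gcutil "$NN_PID" 1000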
Meanwhile, the DataNodes reported socket timeouts on their connections to the NameNode:

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-029006-xxx
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From xxx/xxx to xxx:xxx failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx:xxx remote=xxx/xxx]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
    at org.apache.hadoop.ipc.Client.call(Client.java:1480)
    at org.apache.hadoop.ipc.Client.call(Client.java:1407)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy13.sendHeartbeat(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:153)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:553)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:653)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:823)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx:xxx remote=xxx/xxx]
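The trace shows the timeout occurring inside sendHeartbeat: with the NameNode stuck in full GC, the DataNodes' heartbeat RPCs go unanswered past the 60-second client timeout, and the NameNode eventually marks those nodes dead. To see which DataNodes the NameNode still considers live, assuming an hdfs client configured against the cluster:

# Report cluster status as seen by the NameNode; nodes whose
# heartbeats have stopped arriving appear under "Dead datanodes"
hdfs dfsadmin -report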
Solution
Raise the memory configured for the Hadoop daemons by updating the hadoop-env.sh file, which is located at:
$HADOOP_HOME/etc/hadoop/hadoop-env.sh
The heap that Hadoop allocates uniformly to each daemon (namenode, secondarynamenode, jobtracker, datanode, tasktracker) is set in hadoop-env.sh via the HADOOP_HEAPSIZE parameter, which defaults to 1000 MB.
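If the blanket allocation itself needs raising, that single parameter is the place to do it. A minimal sketch, assuming a Hadoop 2.x layout where the value is given in megabytes (2000 here is illustrative, not a recommendation):

# hadoop-env.sh: give every Hadoop daemon a 2000 MB heap.
# The value is in MB; size it against the host's physical memory.
export HADOOP_HEAPSIZE=2000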
In most cases, though, this one-size-fits-all value is not appropriate. For the NameNode in particular, a 1000 MB heap can only hold block references for a few million files. To size the NameNode's heap on its own, set HADOOP_NAMENODE_OPTS; likewise, HADOOP_SECONDARYNAMENODE_OPTS sets the SecondaryNameNode's heap, which should be kept in line with the NameNode's. HADOOP_DATANODE_OPTS, HADOOP_BALANCER_OPTS, and HADOOP_JOBTRACKER_OPTS are also available for the other daemons.
For the problem described above we need to raise the heaps of the NameNode and the SecondaryNameNode: append -Xmx2048m to HADOOP_NAMENODE_OPTS, setting it to 2048 MB as a reference value, and append the same -Xmx2048m to HADOOP_SECONDARYNAMENODE_OPTS so the SecondaryNameNode stays consistent. Adjust the figure to your actual data volume: the more metadata, the higher it can reasonably go, while keeping the server's physical memory in mind.

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} -Xmx2048m $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx2048m $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} -Xmx2048m $HADOOP_SECONDARYNAMENODE_OPTS"
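The new settings only take effect once the daemons are restarted. One way to roll the NameNode and verify the new maximum heap, assuming the standard sbin scripts and JDK tools are available (the SecondaryNameNode and DataNodes can be restarted the same way):

# Restart the NameNode so the updated hadoop-env.sh is picked up
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

# Confirm the running process now has the expected -Xmx
NN_PID=$(jps | awk '$2 == "NameNode" {print $1}')
jinfo -flag MaxHeapSize "$NN_PID"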