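I. Cause of the Error
The Active NameNode log shows the exception "IPC's epoch [X] is less than the last promised epoch [X+1]", and for a short time two NameNodes are Active at once.
I had configured automatic HA failover, but found that the Standby NameNode had become active. After I forced a manual failover three times, the Standby NameNode became unreachable, and I suspect this issue was the cause.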
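II. Internal Cause
[HDFS mechanism]: This is HDFS's built-in protection against split-brain. It is expected behavior and does not affect the workload.
1) ZKFC1 runs health checks against NameNode1 (Active). Because it gets no reply from NN1 for a long time, it considers NameNode1 unhealthy and voluntarily releases the ActiveStandbyElectorLock in ZooKeeper. At this point NN1 is still active, because the connection between ZKFC and NameNode1 is broken and ZKFC cannot shut it down.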
zkfc log:
2014-06-16 02:11:02,720 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenode01/172.21.248.14:9005: Call From namenode01/172.21.248.14 to namenode02:9005 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.248.14:47271 remote=namenode01/172.21.248.14:9005]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2014-06-16 02:12:12,825 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at namenode02/172.21.248.13:9005 standby (unable to connect)
java.net.SocketTimeoutException: Call From namenode01/172.21.248.14 to namenode02:9005 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.248.14:59156 remote=namenode02/172.21.248.13:9005]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
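2) ZKFC2 wins the ActiveStandbyElectorLock in ZooKeeper and turns NameNode2 (previously Standby) into Active. At the same time, the epoch recorded on the JournalNodes (JN) is incremented by 1.
3) When NameNode1 (formerly Active) next tries to write to the edit log on the JournalNodes, it finds that its own epoch is one less than the JN epoch. This forces it to restart and come back as the Standby NameNode.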
NN1 log:
2014-08-26 12:20:59,017 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.1.1.107:8485, 192.10.1.208:8485, 192.10.1.209:8485], stream=QuorumOutputStream starting at txid 22795230))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.10.1.208:8485: IPC's epoch 115 is less than the last promised epoch 116
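The three steps above can be illustrated with a toy model of epoch fencing. This is only a sketch in Python, not Hadoop's actual implementation: the names ToyJournalNode, EpochRejected, and quorum_write are hypothetical, and the model assumes each JournalNode simply remembers the highest epoch it has promised and rejects writes that carry a lower one.

```python
class EpochRejected(Exception):
    """Raised by the toy JournalNode when a writer's epoch is stale."""


class ToyJournalNode:
    """Hypothetical stand-in for a JournalNode; keeps only what the sketch needs."""

    def __init__(self):
        self.last_promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        # A NameNode becoming Active asks for a strictly higher epoch.
        if epoch <= self.last_promised_epoch:
            raise EpochRejected(
                f"Proposed epoch {epoch} <= last promise {self.last_promised_epoch}"
            )
        self.last_promised_epoch = epoch

    def journal(self, epoch, txid):
        # Writes from an older epoch are refused; this is the fencing step.
        if epoch < self.last_promised_epoch:
            raise EpochRejected(
                f"IPC's epoch {epoch} is less than the last promised epoch "
                f"{self.last_promised_epoch}"
            )
        self.edits.append((epoch, txid))


def quorum_write(journal_nodes, epoch, txid):
    """Try to write one transaction; succeed only if a majority accepts it."""
    acks, errors = 0, []
    for jn in journal_nodes:
        try:
            jn.journal(epoch, txid)
            acks += 1
        except EpochRejected as exc:
            errors.append(str(exc))
    majority = len(journal_nodes) // 2 + 1
    return acks >= majority, errors


if __name__ == "__main__":
    jns = [ToyJournalNode() for _ in range(3)]
    for jn in jns:
        jn.new_epoch(115)              # NN1 is Active with epoch 115
    print(quorum_write(jns, 115, 1))   # NN1 can still write
    for jn in jns:
        jn.new_epoch(116)              # NN2 takes over and bumps the epoch
    print(quorum_write(jns, 115, 2))   # NN1's next write is rejected everywhere
```

In this model the old Active NameNode never has to be reached for fencing to work: its stale epoch alone is enough for every JournalNode to refuse the write, which is why the brief dual-Active window does not corrupt the edit log.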
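III. Solution
In core-site.xml, increase the value of ha.health-monitor.rpc-timeout.ms to lengthen the ZKFC health-check timeout. The default is 45000 ms, which matches the "45000 millis timeout" in the ZKFC log above.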
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>180000</value>
</property>
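To confirm that the new value is actually present in the file, a small script can read it back. The following is a minimal sketch that assumes the usual Hadoop layout of property elements under a configuration root; the helper name read_hadoop_property is hypothetical, and the script only inspects XML text, so it says nothing about whether a running ZKFC has picked up the change.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name, default=None):
    """Return the value of a named <property> in Hadoop-style XML, or default."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else default
    return default


if __name__ == "__main__":
    # Sample content for illustration; in practice, read core-site.xml from disk.
    sample = """
    <configuration>
      <property>
        <name>ha.health-monitor.rpc-timeout.ms</name>
        <value>180000</value>
      </property>
    </configuration>
    """
    timeout_ms = read_hadoop_property(sample, "ha.health-monitor.rpc-timeout.ms")
    print("health monitor RPC timeout (ms):", timeout_ms)
```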