While running a MapReduce job, the map phase completed successfully, but the reduce phase hung at 17%. The symptoms were as follows:
[tianyc@TkHbase hadoop]$ hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -mapper /home/tianyc/study/mapred/python/mapper.py -reducer /home/tianyc/study/mapred/python/reduce.py -input 111/* -output 111-output2
packageJobJar: [/tmp/hadoop-tianyc/hadoop-unjar5068413447400834397/] [] /tmp/streamjob7965021791749826156.jar tmpDir=null
13/02/20 15:45:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/20 15:45:07 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/20 15:45:07 INFO mapred.FileInputFormat: Total input paths to process : 16
13/02/20 15:45:07 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-tianyc/mapred/local]
13/02/20 15:45:07 INFO streaming.StreamJob: Running job: job_201302201459_0006
13/02/20 15:45:07 INFO streaming.StreamJob: To kill this job, run:
13/02/20 15:45:07 INFO streaming.StreamJob: /home/tianyc/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=http://TeletekHbase:9001 -kill job_201302201459_0006
13/02/20 15:45:07 INFO streaming.StreamJob: Tracking URL: http://TkHbase:50030/jobdetails.jsp?jobid=job_201302201459_0006
13/02/20 15:45:08 INFO streaming.StreamJob: map 0% reduce 0%
13/02/20 15:45:21 INFO streaming.StreamJob: map 13% reduce 0%
13/02/20 15:45:22 INFO streaming.StreamJob: map 25% reduce 0%
13/02/20 15:45:28 INFO streaming.StreamJob: map 38% reduce 0%
13/02/20 15:45:30 INFO streaming.StreamJob: map 50% reduce 0%
13/02/20 15:45:34 INFO streaming.StreamJob: map 63% reduce 0%
13/02/20 15:45:36 INFO streaming.StreamJob: map 75% reduce 4%
13/02/20 15:45:39 INFO streaming.StreamJob: map 75% reduce 8%
13/02/20 15:45:40 INFO streaming.StreamJob: map 88% reduce 8%
13/02/20 15:45:42 INFO streaming.StreamJob: map 100% reduce 8%
13/02/20 15:45:48 INFO streaming.StreamJob: map 100% reduce 17%
I bounced between Baidu and Google and tried all sorts of fixes:
1. One slave host had an underscore in its hostname; that had already been fixed.
2. Made the hostname in /etc/hosts match the HOSTNAME in /etc/sysconfig/network (see the check commands after this list).
3. Deleted the 127.0.0.1 record from /etc/hosts.
4. Turned off the firewall.
…
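For reference, this is roughly how items 2 through 4 can be verified on each node; a minimal sketch, assuming RHEL/CentOS-style nodes like the ones in this cluster:
  # current hostname of the node (should contain no underscores)
  hostname
  # hostname that is configured to persist across reboots
  grep HOSTNAME /etc/sysconfig/network
  # look for 127.0.0.1 lines that map to the node's real hostname
  grep 127.0.0.1 /etc/hosts
  # check whether iptables is currently running
  service iptables status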
That post explained it this way: the problem occurs in the reduce phase, and the message means the results of the map phase cannot be fetched. Since "Failed fetch notification #1 for task attempt_201110022127_0003_m_000000_0" names the task output that could not be fetched, the namenode is working normally; the namenode tells the reduce node to start reducing, but the reduce node cannot fetch the map output, which can only mean it cannot communicate with those nodes. And because that author had configured Hadoop with hostnames rather than IPs, the fix was to add the hostname-to-IP mappings of all the datanodes to each other's /etc/hosts. He tried it, it worked, and he recorded it there.
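To illustrate what that post suggests, every node's /etc/hosts would carry an entry for every other node in the cluster. A sketch only; the hostnames below appear in this cluster's logs, but the IP addresses are placeholders:
  # /etc/hosts on every node: map each cluster hostname to its real IP
  192.168.1.10  TkHbase     # master
  192.168.1.11  TkHbase2    # slave
  192.168.1.12  TkTest      # slave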
That was somewhat enlightening, but my hosts files were already set up correctly.
So I turned to the job log on the master node and the task logs on the slave nodes (embarrassingly, I had not looked at any logs before all of those attempts):
The job log showed:
2013-02-20 15:52:57,956 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for map task: attempt_201302201459_0006_m_000002_0 running on tracker: tracker_TkTest:127.0.0.1/127.0.0.1:50861 and reduce task: attempt_201302201459_0006_r_000000_0 running on tracker: tracker_TkHbase2:127.0.0.1/127.0.0.1:55837
The task log on one slave node kept repeating:
2013-02-20 16:21:37,304 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >
2013-02-20 16:21:40,346 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >
2013-02-20 16:21:46,378 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201302201459_0006_r_000000_0 0.16666667% reduce > copy (8 of 16 at 0.00 MB/s) >
……
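The "copy (8 of 16 at 0.00 MB/s)" line means the reducer is stuck in the shuffle phase, where it copies map output over HTTP from each TaskTracker's HTTP port (50060 by default in Hadoop 1.x). A quick check at this point would have been to test, from the node running the stuck reducer (TkHbase2 in the job log above), whether that port on the map-side node was reachable at all; a sketch:
  # from TkHbase2, probe the TaskTracker HTTP port that serves map output
  telnet TkTest 50060
  # or, if curl is installed, fetch the TaskTracker status page
  curl http://TkTest:50060/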
Finally, at the very bottom of that post, someone suggested opening port 50060 in the firewall. I tried it, and only then did things work. Another post I had read actually mentioned this too; I just did not pay attention at the time.
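Concretely, opening the port on each slave looks roughly like this (a sketch; the exact firewall setup may differ):
  # allow the TaskTracker HTTP port (used by reducers to copy map output) through iptables
  iptables -I INPUT -p tcp --dport 50060 -j ACCEPT
  # save the rule so it survives an iptables restart (RHEL/CentOS style)
  service iptables save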
This problem mostly comes down to either a mismatch between /etc/hosts and the hostname, or the firewall.
Which raises the question: I had already tried turning off the firewall, so why did the error persist? Weak fundamentals, as it turns out. To stop the firewall right away you should run service iptables stop, whereas what I had actually run was chkconfig iptables off, which only disables the service at boot and leaves the currently running iptables untouched. See the reference here.
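For the record, the difference between the two commands on this kind of RHEL/CentOS system:
  # stops the running firewall immediately (but it comes back after a reboot)
  service iptables stop
  # only prevents the service from starting at boot; the running rules stay in effect
  chkconfig iptables off
  # while testing, run both to be sure the firewall is really out of the picture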