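Sigh!
A couple of days ago I ran rm -rf on every folder under /var/lib on master1. I could kick myself for having such quick hands.
The main roles of the CDH cluster all run on master1, so things broke, and I seriously felt like running away.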
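First, the problems that showed up:
1. The Cloudera Management Service became unhealthy and stopped showing any monitoring data, presumably because its configuration and data are stored under /var/lib.
2. Kafka turned red.
3. There are hidden problems too. For example, the Hadoop cluster appears to be running normally, but once a Hadoop process stops it can never be started again, because its configuration files are all under /var/lib.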
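First, some reminders to myself:
1. Important: confirm whether the cluster provider takes automatic daily snapshots. If not, find a way to do it yourself so that you always have a way back. Alibaba Cloud reportedly does daily backups.
2. If the first point was not done and the cluster is still running, immediately download and back up the important data, including what is on HDFS, in HBase, and so on.
3. If you are going to attempt recovery, the first thing to do is prevent any further writes to the disk. I originally wanted to recover the files with extundelete, but /var/lib sits on the system disk, which cannot be unmounted, so recovery was not possible.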
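As a fallback when the provider takes no snapshots, something like the following could be run daily from cron. This is only a minimal sketch: the function name backup_dir and the destination path in the usage note are my own placeholders, and a plain tar of a live MySQL data directory is not guaranteed to be consistent, so treat it as a last line of defence rather than a proper database backup.

```shell
# backup_dir SRC DEST_DIR
# Write a dated tar.gz of directory SRC into DEST_DIR and print the archive path.
# (Hypothetical helper; DEST_DIR should live on a different disk or host.)
backup_dir() {
    src=$1
    dest_dir=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    archive="$dest_dir/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    # -C makes the paths inside the archive relative to SRC's parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}
```

Usage would look like `backup_dir /var/lib /backup/master1`, where /backup/master1 is an assumed mount point. Data on HDFS can be pulled to local disk separately with `hdfs dfs -get`.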
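Starting the reinstall (a rough outline for now, to be filled in later):
1. Reinstall MySQL, because MySQL's data directory is under /var/lib.
2. Reinstall cloudera-scm-daemons, cloudera-scm-server, and cloudera-scm-agent.
3. Start CM, install the Cloudera Management Service, and check the host status.
4. Install ZooKeeper and the other big data services.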
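Problems encountered:
1. Bad: This host is in contact with the Cloudera Manager Server. This host is not in contact with the Host Monitor.
What I did: delete all files under /var/lib/cloudera-scm-agent and wipe the CM database on the master node. Wiping the database means dropping everything in MySQL that belongs to CM and creating it again, which amounts to reinstalling CM.
2. The /data/dfs/dn/current/VERSION file left over from before the CM reinstall no longer matches, which prevents every DataNode from starting.
The fix is to copy the clusterID from /data/dfs/nn2/current/VERSION on master1 into /data/dfs/dn/current/VERSION and then restart the DataNode.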
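Given how this whole mess started, the agent directory is worth clearing with a guard rather than a bare rm -rf. A minimal sketch, assuming the agent has already been stopped; clear_dir_contents is a hypothetical helper name, and the database part is not covered here:

```shell
# clear_dir_contents DIR
# Delete everything inside DIR (including dotfiles) but keep DIR itself.
# Refuses an empty argument or "/" so that a typo or unset variable
# cannot turn into another wide-reaching delete.
clear_dir_contents() {
    dir=$1
    case "$dir" in
        ""|"/") echo "refusing to clear '$dir'" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # -mindepth 1 excludes DIR itself; -delete removes children before parents.
    find "$dir" -mindepth 1 -delete
}
```

For the case above this would be called as `clear_dir_contents /var/lib/cloudera-scm-agent`.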
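The clusterID copy can be scripted so it does not have to be hand-edited on every DataNode. This is a sketch under the assumption that both VERSION files are plain key=value files with a single clusterID= line; sync_cluster_id is a hypothetical name, and it keeps a .bak copy of the DataNode file before touching it.

```shell
# sync_cluster_id NN_VERSION DN_VERSION
# Read clusterID from the NameNode VERSION file and write it into the
# DataNode VERSION file, leaving every other line untouched.
sync_cluster_id() {
    nn_version=$1
    dn_version=$2
    cid=$(sed -n 's/^clusterID=//p' "$nn_version" | head -n 1)
    [ -n "$cid" ] || { echo "no clusterID in $nn_version" >&2; return 1; }
    # Keep the original so the change can be rolled back.
    cp -p "$dn_version" "$dn_version.bak" || return 1
    # Rewrite only the clusterID line; redirecting into the existing file
    # keeps its owner and permissions.
    sed "s/^clusterID=.*/clusterID=$cid/" "$dn_version.bak" > "$dn_version"
}
```

With the paths from this post, the call would be `sync_cluster_id /data/dfs/nn2/current/VERSION /data/dfs/dn/current/VERSION`, run after the NameNode file has been copied to each DataNode host, followed by a DataNode restart.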
3. One more problem, still unresolved:
Concerning: 3 racks are required for the erasure coding policies: RS-3-2-1024k. The number of racks is only 1.

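4. How can the newly installed HDFS read the old data? The new HDFS only contains /temp.
A partial solution: take the metadata from the original metadata directories (in my case /data/dfs/nn and snn), that is, the edits_xxx_xxx, fsimage_xxx, and fsimage_xxx.md5 files, and copy it into the corresponding directories of the newly created HDFS service. When re-adding the HDFS service you are asked to specify directories; give the NameNode a different directory so that the original one is not overwritten (this does not matter for the YARN install). Then restart HDFS and the old HDFS data becomes readable, but CM reports a warning (not reproduced here).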
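The copy step for the metadata could be sketched like this. It copies only the file patterns named in this post (edits_* and fsimage_*, which also catches the .md5 files) and never writes to the old directory. The name copy_nn_metadata and both directory arguments are placeholders, and I would stop HDFS before running it.

```shell
# copy_nn_metadata OLD_DIR NEW_DIR
# Copy edits_* and fsimage_* files (including fsimage_*.md5) from the old
# NameNode metadata directory into the new one. Prints how many files
# were copied. OLD_DIR is only read, never modified.
copy_nn_metadata() {
    old_dir=$1
    new_dir=$2
    [ -d "$old_dir" ] && [ -d "$new_dir" ] || { echo "both directories must exist" >&2; return 1; }
    copied=0
    for f in "$old_dir"/edits_* "$old_dir"/fsimage_*; do
        [ -f "$f" ] || continue          # skip patterns that matched nothing
        cp -p "$f" "$new_dir"/ || return 1
        copied=$((copied + 1))
    done
    echo "$copied"
}
```

The arguments are whichever old and new metadata directories apply on your hosts; checking that the copied files are owned by the user HDFS runs as is probably worth doing before the restart.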
5. A problem when installing YARN:
Caused by: java.net.BindException: Problem binding to [jzmaster1:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
2020-09-11 11:16:47,126 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [xxxxx:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [xxxxx:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138)
    at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
    at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:232)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:786)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1159)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1199)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1195)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1195)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1235)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1437)
Caused by: java.net.BindException: Problem binding to [jzmaster1:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
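Solution: first find out what is occupying the port with netstat -anp | grep 8031, then look at the details of that process with ps -ef | grep xxxxx, and finally kill it and start the ResourceManager again.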
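A plain grep for 8031 also matches ports such as 18031 and connections that are merely established, so a slightly stricter filter can help. The sketch below assumes the usual Linux `netstat -anp` column layout (local address in column 4, state in column 6, PID/program in column 7); listener_pid is a hypothetical helper name.

```shell
# listener_pid PORT
# Read `netstat -anp` style output on stdin and print the PID of every
# process that is LISTENing on exactly PORT.
listener_pid() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, owner, "/")   # column 7 looks like "12345/java"
            print owner[1]
        }'
}
```

Typical use would be `netstat -anp | listener_pid 8031`, then `ps -fp` on the printed PID to confirm what the process is before killing it.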
