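Sigh!
A couple of days ago I ran rm -rf on every folder under /var/lib on master1. I could kick myself; how did my fingers move that fast?
Anyway, the main roles of the CDH cluster all live on master1, so things broke, and I seriously thought about running away.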
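First, the problems that showed up:
1. Cloudera Manager Service went abnormal and stopped showing any monitoring data; presumably its configuration and data were stored under /var/lib.
2. Kafka turned red.
3. There are also hidden problems: the Hadoop cluster appears to keep running normally, but once a Hadoop process stops it will never start again, because its configuration files were all under /var/lib.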
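First, a reminder to myself:
1. Important: confirm whether the cluster provider takes automatic daily snapshots. If not, find a way to do it yourself so that you always have a way back. Reportedly Alibaba Cloud does daily backups.
2. If point 1 was not done and the cluster is still running, immediately download and back up the important data, including what is on HDFS, HBase, and so on.
3. If you are going to attempt recovery, the first thing to do is stop any further writes to the disk. I originally wanted to use extundelete, but /var/lib is on the system disk, which cannot be unmounted, so recovery was not possible at all.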
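For reminder 1, a minimal sketch of a do-it-yourself daily backup might look like the following. The function name backup_dir and the destination directory used in the usage note are assumptions for illustration, not something from my actual setup. It only archives a directory into a date-stamped tarball; a live MySQL data directory is better exported with mysqldump than copied as files.

```shell
#!/usr/bin/env bash
# backup_dir SRC DEST_DIR   (hypothetical helper)
# Archive SRC into DEST_DIR as <basename>-<YYYYMMDD>.tar.gz
# and print the path of the archive that was written.
backup_dir() {
    local src="$1" dest_dir="$2"
    local name stamp archive
    name="$(basename "$src")"
    stamp="$(date +%Y%m%d)"
    archive="$dest_dir/$name-$stamp.tar.gz"
    mkdir -p "$dest_dir" || return 1
    # -C makes the paths inside the archive relative to SRC's parent directory
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    echo "$archive"
}
```

It could be called from a daily cron job, for example backup_dir /var/lib/cloudera-scm-server /data/backup, where /data/backup stands for any directory on a disk other than the system disk.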
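Starting the reinstall (rough notes for now, to be filled in later):
1. Reinstall MySQL, because MySQL's data directory is under /var/lib.
2. Reinstall cloudera-scm-daemons, cloudera-scm-server, and cloudera-scm-agent.
3. Start CM, install Cloudera Management Service, and check the host status.
4. Install ZooKeeper and the other big-data services.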
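Problems encountered:
1. Bad : This host is in contact with the Cloudera Manager Server. This host is not in contact with the Host Monitor.
What I did: delete all files under /var/lib/cloudera-scm-agent and clear the CM database on the master node. Clearing the database means dropping everything in MySQL that belongs to CM and creating it again, which amounts to reinstalling CM.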
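2. The /data/dfs/dn/current/VERSION left over from the CM installation before the reinstall is no longer valid, and it prevents every DataNode from starting.
The fix is to copy the clusterID from /data/dfs/nn2/current/VERSION on master1 into /data/dfs/dn/current/VERSION, then restart the DataNode.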
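The clusterID copy can also be scripted instead of edited by hand. The sketch below assumes both VERSION files are readable on the same machine (on a real cluster the NameNode file would first have to be copied over from master1), and the function name sync_cluster_id is made up for illustration. It keeps a .bak copy of the DataNode file before rewriting it.

```shell
#!/usr/bin/env bash
# sync_cluster_id NN_VERSION DN_VERSION   (hypothetical helper)
# Read the clusterID line from the NameNode VERSION file and write the
# same value into the DataNode VERSION file, leaving other lines untouched.
sync_cluster_id() {
    local nn_version="$1" dn_version="$2" cid
    cid="$(grep '^clusterID=' "$nn_version" | head -n 1 | cut -d= -f2-)"
    if [ -z "$cid" ]; then
        echo "no clusterID found in $nn_version" >&2
        return 1
    fi
    # Keep the old file as DN_VERSION.bak, then regenerate DN_VERSION from it
    cp -p "$dn_version" "$dn_version.bak" || return 1
    sed "s|^clusterID=.*|clusterID=$cid|" "$dn_version.bak" > "$dn_version"
}
```

With the paths from this post that would be sync_cluster_id /data/dfs/nn2/current/VERSION /data/dfs/dn/current/VERSION, run once per DataNode before restarting it.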
3. One more problem, still unresolved:
Concerning : 3 racks are required for the erasure coding policies: RS-3-2-1024k. The number of racks is only 1.
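4. How can the newly installed HDFS read the previous data? The new HDFS directory tree only contains /temp.
A partial solution: copy the metadata from the original metadata directories (in my case /data/dfs/nn and snn), that is, the edits_xxx_xxx, fsimage_xxx, and fsimage_xxx.md5 files, into the corresponding directories of the newly created HDFS service. (When I re-added the HDFS service, the wizard asked for directories; give the NameNode a different directory here so that the original one is not overwritten. This does not matter for the YARN installation.) Then restart HDFS and the original HDFS data becomes readable, but CM then reports a warning.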
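The metadata copy described above could be sketched roughly like this. The function name copy_nn_metadata is hypothetical, and the sketch assumes the NameNode is stopped while the files are copied and that it is run as a user allowed to preserve file ownership. It copies only the file types named above (fsimage_*, which also matches the .md5 files, and edits_*) and leaves everything else in the new directory alone.

```shell
#!/usr/bin/env bash
# copy_nn_metadata OLD_CURRENT NEW_CURRENT   (hypothetical helper)
# Copy fsimage_* and edits_* files from the old NameNode "current"
# directory into the new one, preserving mode and timestamps.
copy_nn_metadata() {
    local old="$1" new="$2" f
    mkdir -p "$new" || return 1
    for f in "$old"/fsimage_* "$old"/edits_*; do
        [ -e "$f" ] || continue   # skip a pattern that matched nothing
        cp -p "$f" "$new"/ || return 1
    done
}
```

The old directory is only read, never modified, so the original metadata stays available if the copy turns out not to work.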
5. Installing YARN ran into a problem:
Caused by: java.net.BindException: Problem binding to [jzmaster1:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException

2020-09-11 11:16:47,126 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [xxxxx:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [xxxxx:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138)
    at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
    at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:232)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:786)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1159)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1199)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1195)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1195)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1235)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1437)
Caused by: java.net.BindException: Problem binding to [jzmaster1:8031] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
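Solution: first find out what is occupying the port with netstat -anp | grep 8031, then look at the process details with ps -ef | grep xxxxx, and finally kill it and start the ResourceManager again.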
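A plain grep 8031 also matches ports such as 18031 and connections where 8031 is the remote port, so a slightly stricter filter can help. The sketch below is one possible way to do it; the function name pids_on_port is made up, and it assumes the usual Linux netstat -anp column layout (local address in column 4, state in column 6, PID/program in column 7).

```shell
#!/usr/bin/env bash
# pids_on_port PORT   (hypothetical helper)
# Read `netstat -anp` style output on stdin and print the PIDs of
# processes that are LISTENing on exactly PORT.
pids_on_port() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, a, "/")   # column 7 looks like "12345/java"
            print a[1]
        }' | sort -u
}
```

Usage would be netstat -anp | pids_on_port 8031, run as root so that netstat shows the PID column for other users' processes. Check the result with ps -fp followed by the PID before killing anything.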