Notes on reinstalling CDH 6 over an existing installation


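Sigh!

A couple of days ago I ran rm -rf on every directory under /var/lib on master1. I could kick myself; why were my hands so quick?

The main roles of the CDH cluster all live on master1, so things broke, and I seriously felt like running away.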


 

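First, the problems that showed up:

1. The Cloudera Management Service became abnormal and stopped showing any monitoring data, presumably because its configuration and data were stored under /var/lib.

2. Kafka turned red.

3. There were also hidden problems. For example, the Hadoop cluster appeared to keep running normally, but once a Hadoop process stopped it would never start again, because its configuration files were all under /var/lib.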


 

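First, some reminders to myself:

1. Important: confirm whether the cluster provider takes automatic daily snapshots. If not, find a way to do it yourself so that you always have a way back. Alibaba Cloud reportedly does daily backups.

2. If point 1 was not done and the cluster is still running normally, immediately download and back up the important data, including what is on HDFS, in HBase, and so on.

3. If you are going to attempt a recovery, the first thing to do is stop data from being written. I originally wanted to recover with extundelete, but /var/lib is on the system disk, and the system disk cannot be unmounted, so nothing could be recovered.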
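For point 1, even a dated archive of a directory is better than nothing. The following is only a minimal sketch, assuming tar with gzip support is available. The function name backup_dir and the archive naming scheme are my own choices for illustration, and an archive of a live database directory is not guaranteed to be consistent.

```shell
# Hypothetical helper: archive SRC_DIR into DEST_DIR as <name>-<YYYYmmdd>.tar.gz
# and print the path of the archive that was written.
backup_dir() {
    src_dir=$1
    dest_dir=$2
    stamp=$(date +%Y%m%d)
    name=$(basename "$src_dir")
    archive="$dest_dir/$name-$stamp.tar.gz"
    mkdir -p "$dest_dir" || return 1
    # -C makes the paths inside the archive relative to the parent of SRC_DIR
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$name" || return 1
    echo "$archive"
}
```

It could be called from a daily cron job, for example as backup_dir /var/lib/cloudera-scm-server /backup, where /backup is an assumed destination on a different disk.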


 

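Starting the reinstall (rough notes for now, to be filled in later):

1. Reinstall MySQL, because MySQL's data directory is under /var/lib.

2. Reinstall cloudera-scm-daemons, cloudera-scm-server, and cloudera-scm-agent.

3. Start CM, install the Cloudera Management Service, and check the status of the hosts.

4. Install ZooKeeper and the other big data services.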


 

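Problems encountered:

1. Bad : This host is in contact with the Cloudera Manager Server. This host is not in contact with the Host Monitor.

What I did: delete all files under /var/lib/cloudera-scm-agent and wipe the CM database on the master node. Wiping the database means dropping everything in MySQL that belongs to CM and creating it again, which effectively amounts to reinstalling CM.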
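The database part of that step can be sketched as follows. This is only an illustration: the function name cm_db_reset_sql is hypothetical, the database name scm used in the comments is an assumption (use whatever name your CM database actually has), and the function only prints SQL so that it can be reviewed before anything is dropped.

```shell
# Hypothetical helper: print the SQL that drops and recreates one database.
# It only prints; nothing is executed against MySQL.
cm_db_reset_sql() {
    db=$1
    # Refuse empty names and anything that is not a plain identifier.
    case $db in
        ''|*[!A-Za-z0-9_]*) echo "invalid database name: $db" >&2; return 1 ;;
    esac
    printf 'DROP DATABASE IF EXISTS %s;\n' "$db"
    printf 'CREATE DATABASE %s DEFAULT CHARACTER SET utf8;\n' "$db"
}

# Possible usage, after stopping the CM server and agent
# ("scm" is an assumed database name):
#   cm_db_reset_sql scm            # review the statements first
#   cm_db_reset_sql scm | mysql -u root -p
```

Printing first and piping to mysql second is deliberate: after what happened to /var/lib, I would rather read a destructive statement once more before running it.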

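2. The /data/dfs/dn/current/VERSION file left over from before the CM reinstall is no longer valid, which prevents every DataNode from starting.

The fix is to copy the clusterID from /data/dfs/nn2/current/VERSION on master1 into /data/dfs/dn/current/VERSION, then restart the DataNode.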
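The clusterID copy can be done by hand in an editor, or roughly like this. A minimal sketch, assuming the VERSION files contain a line of the form clusterID=CID-... and that the ID contains no slash; the function names get_cluster_id and sync_cluster_id are hypothetical.

```shell
# Print the clusterID value stored in a Hadoop storage VERSION file.
get_cluster_id() {
    sed -n 's/^clusterID=//p' "$1"
}

# Overwrite the clusterID line of DN_VERSION with the value from NN_VERSION.
# The previous DN_VERSION is kept next to it with a .bak suffix.
sync_cluster_id() {
    nn_version=$1
    dn_version=$2
    cid=$(get_cluster_id "$nn_version")
    [ -n "$cid" ] || { echo "no clusterID in $nn_version" >&2; return 1; }
    cp -p "$dn_version" "$dn_version.bak" || return 1
    # Redirecting into the existing file keeps its owner and permissions.
    sed "s/^clusterID=.*/clusterID=$cid/" "$dn_version.bak" > "$dn_version"
}
```

With the paths from this post the call would be sync_cluster_id /data/dfs/nn2/current/VERSION /data/dfs/dn/current/VERSION, run on each DataNode host with a copy of the NameNode's VERSION file at hand, followed by a DataNode restart.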

3. One more problem, still unresolved:

Concerning : 3 racks are required for the erasure coding policies: RS-3-2-1024k. The number of racks is only 1.

 

 

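4. How can the newly installed HDFS read the previous data? The new HDFS only contains /temp.

An incomplete solution: copy the metadata from the original metadata directories (in my case /data/dfs/nn and snn), including the edits_xxx_xxx, fsimage_xxx and fsimage_xxx.md5 files, into the corresponding directories of the newly created HDFS service. (When I re-added the HDFS service I was asked to specify the directories. The NameNode needs a different directory here so that the original one is not overwritten; for the YARN installation this does not matter.) Then restart HDFS. It can then read the original HDFS data, but CM reports a warning.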
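The copy step might look roughly like this. It is a sketch only: the function name copy_nn_metadata is hypothetical, it copies just the edits_* and fsimage_* files named above, and whether other files in the current directory also need to be carried over is not something these notes settle.

```shell
# Hypothetical helper: copy NameNode metadata files (edits_*, fsimage_*,
# which also covers fsimage_*.md5) from OLD_CURRENT to NEW_CURRENT.
copy_nn_metadata() {
    old_current=$1
    new_current=$2
    mkdir -p "$new_current" || return 1
    for f in "$old_current"/edits_* "$old_current"/fsimage_*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$f" ] || continue
        # -p preserves timestamps and, when run as root, ownership.
        cp -p "$f" "$new_current"/ || return 1
    done
}
```

The old directory is only read, never modified, so the original metadata stays available if the new NameNode refuses to start.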

 

5. A problem when installing YARN:

Caused by: java.net.BindException: Problem binding to [jzmaster1:8031] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException

2020-09-11 11:16:47,126 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [xxxxx:8031] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [xxxxx:8031] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
    at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138)
    at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
    at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:232)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:786)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1159)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1199)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1195)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1195)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1235)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1437)
Caused by: java.net.BindException: Problem binding to [jzmaster1:8031] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

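Fix: first find out who is using the port with netstat -anp | grep 8031, then look at the details of that process with ps -ef | grep xxxxx, and finally kill it and start the ResourceManager again.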
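A plain grep 8031 also matches ports such as 18031 and connections that are not listening. The lookup can be narrowed with a small filter. This is a sketch under the assumption that the input has the usual Linux netstat -anp column layout for TCP lines (local address in column 4, state in column 6, PID/Program name in column 7); the function name listener_pid is hypothetical.

```shell
# Read `netstat -anp` style lines on stdin and print the PID of every
# socket that is in LISTEN state on exactly the given PORT.
listener_pid() {
    port=$1
    awk -v port="$port" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, parts, "/")   # column 7 looks like "12345/java"
            print parts[1]
        }
    '
}

# Possible usage (run as root so netstat can show the PID column):
#   netstat -anp | listener_pid 8031
#   ps -fp <pid printed above>
```

Checking the process with ps before killing it is worth the extra step, since the port may be held by a leftover ResourceManager from the old installation or by something unrelated.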

 

