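Environment: Red Hat 7.3, CDH 5.15.1, deployed via parcels
Error description: jobs scheduled by Airflow have failed intermittently over the past two weeks. The errors fall into two categories: 1. the cluster configuration cannot be initialized; 2. permission problems when reading configuration files.
Error 1: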
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:143)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:108)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:101)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:477)
    at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:455)
Error 2:
19/12/21 04:06:27 ERROR conf.Configuration: error parsing conf core-site.xml
java.io.FileNotFoundException: /etc/hive/conf.cloudera.hive/core-site.xml (Permission denied)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
Error 3:
setfacl: Permission denied. user=dip is not the owner of inode=.hive-staging_hive_2019-12-22_07-44-25_997_2557429548076828737-1
java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hive/conf.cloudera.hive/hive-site.xml (Permission denied)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2811)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2663)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2559)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:1340)
    at org.apache.hadoop.hive.conf.HiveConf.getVar(HiveConf.java:2756)
    at org.apache.hadoop.hive.conf.HiveConf.getVar(HiveConf.java:2777)
    at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:2849)
    at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:2792)
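Checking every machine in the cluster showed that on one machine the configuration files in the following directories were being rewritten constantly: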
/etc/hive/conf.cloudera.hive
/etc/hadoop/conf.cloudera.yarn
/etc/hbase/conf.cloudera.hbase
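One way to spot this kind of churn is to list files whose modification time falls within the last few minutes. The following is a minimal sketch, assuming GNU find is available; the function name recently_modified is hypothetical and not part of CDH or Cloudera Manager.

```shell
# Hypothetical helper: list regular files under the given directories
# that were modified less than MINUTES minutes ago.
# Usage: recently_modified MINUTES DIR...
recently_modified() {
    minutes="$1"
    shift
    # -L follows symlinks, since config directories are often reached via links.
    # -mmin -N matches files modified less than N minutes ago (GNU find).
    find -L "$@" -type f -mmin "-${minutes}" 2>/dev/null
}

# Example (run on each host and compare the results):
#   recently_modified 2 /etc/hive/conf.cloudera.hive \
#                       /etc/hadoop/conf.cloudera.yarn \
#                       /etc/hbase/conf.cloudera.hbase
```

On a healthy host this should normally print nothing; a host that prints the same files on every run is rewriting them continuously.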
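Cause of the job errors:
The configuration files are rewritten about once per minute, and the jobs read them while running. Presumably a read occasionally lands while a file is in the middle of being updated and is therefore unreadable.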
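To illustrate the race, a wrapper script could wait until a configuration file is readable and non-empty before launching the job. This is only a sketch of a stopgap, not the fix for the underlying problem, and the check can still race with a later rewrite. The function name wait_until_readable and its default values are assumptions made for illustration.

```shell
# Hypothetical helper: poll until FILE is readable and non-empty.
# Usage: wait_until_readable FILE [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 once the file is readable, 1 if all attempts fail.
wait_until_readable() {
    file="$1"
    attempts="${2:-5}"
    delay="${3:-1}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -r: readable by the current user; -s: exists and is not empty
        if [ -r "$file" ] && [ -s "$file" ]; then
            return 0
        fi
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "still unreadable after ${attempts} attempts: ${file}" >&2
    return 1
}

# Example:
#   wait_until_readable /etc/hive/conf.cloudera.hive/hive-site.xml 10 2 || exit 1
```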
Cause of the constant configuration rewrites:
Check:
On that machine, /var/lib/alternatives contains empty files.
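The check for empty files can be done with find. A minimal sketch, assuming a find that supports -maxdepth and -empty (as GNU find does); the function name empty_alternatives is hypothetical.

```shell
# Hypothetical helper: list zero-length regular files directly inside DIR.
# DIR defaults to /var/lib/alternatives.
empty_alternatives() {
    dir="${1:-/var/lib/alternatives}"
    # -maxdepth 1: do not descend into subdirectories
    # -empty: file has size zero
    find "$dir" -maxdepth 1 -type f -empty
}

# Example:
#   empty_alternatives
```

Any path printed is a zero-byte alternatives entry of the kind found on the affected machine.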
Action: (delete the files below; after the agent is restarted, it copies the files it needs back into place)
Delete all files under /var/lib/alternatives.
Restart the agent.
cd /var/lib/alternatives
rm -rf *
systemctl restart cloudera-scm-agent.service
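Because rm -rf * cannot be undone, one cautious option is to copy the directory aside before deleting anything. This backup step is an addition and not part of the original procedure. The function name backup_alternatives and the default backup location under /root are assumptions.

```shell
# Hypothetical helper: copy the contents of SRC into DEST, preserving
# ownership, permissions and timestamps, then print the backup path.
# Usage: backup_alternatives [SRC] [DEST]
backup_alternatives() {
    src="${1:-/var/lib/alternatives}"
    dest="${2:-/root/alternatives.bak.$(date +%Y%m%d%H%M%S)}"
    # "$src"/. copies the directory contents, including hidden files.
    mkdir -p "$dest" && cp -a "$src"/. "$dest"/ && echo "$dest"
}

# Example (run as root before the delete step):
#   backup_alternatives
```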
Observation: the configuration files are no longer being rewritten constantly. Problem solved.