Adding HADOOP_CONF_DIR to spark-env.sh so that Spark reads and writes HDFS files
After a fresh Spark install I ran the wordcount program in local mode with spark-submit; the files it read and wrote lived on the host machine, not on HDFS. Testing showed that once spark-env.sh was modified as described below, the same spark-submit command read from and wrote to HDFS instead.
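For reference, the local-mode submit looks roughly like the following. This is a minimal sketch assuming the stock JavaWordCount example that ships with Spark 2.4 (the actual wordcount program and its output handling may differ); with no HADOOP_CONF_DIR set, the relative path README.MD resolves against the local filesystem of the host.

cd /usr/local/spark-2.4.0-bin-hadoop2.7
# Run wordcount in local mode with 2 threads; README.MD is read from the
# current directory on the host, not from HDFS
./bin/spark-submit \
  --master local[2] \
  --class org.apache.spark.examples.JavaWordCount \
  examples/jars/spark-examples_2.11-2.4.0.jar \
  README.MD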
Running spark-shell on YARN
spark-shell --master yarn-client
The first error:
Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:290)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:251)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Edit spark-env.sh:
/usr/local/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh
Add the line: export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
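The relevant part of spark-env.sh then looks roughly like this (assuming HADOOP_HOME is already exported on the machine; per the error message, YARN_CONF_DIR would also satisfy the yarn-client check, but HADOOP_CONF_DIR covers both the HDFS and YARN client configs):

# Point Spark at the Hadoop client configuration so that unqualified paths
# resolve against fs.defaultFS (hdfs://master:9000) instead of the local filesystem
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop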
Restart the Hadoop cluster and the Spark cluster.
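A rough restart sequence, assuming both clusters are managed with the standard sbin scripts (the exact paths are assumptions based on the install location above):

# Restart HDFS and YARN from the master node
$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh
# Restart the Spark standalone cluster so workers pick up the new spark-env.sh
/usr/local/spark-2.4.0-bin-hadoop2.7/sbin/stop-all.sh
/usr/local/spark-2.4.0-bin-hadoop2.7/sbin/start-all.sh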
This produced a second error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://master:9000/user/root/README.MD
Analysis and resolution
1. Originally the job read README.MD from the current directory and wrote its output to a directory on the local host. After modifying spark-env.sh and restarting the clusters, Spark resolved the same relative path against HDFS instead.
2. Upload the local README.MD to /user/root/README.MD with the hdfs command (see the sketch after this list); the spark-submit command then succeeds.
3. Reverting the spark-env.sh change and restarting the clusters, spark-submit runs again without errors, reading and writing local files as before.
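The upload in step 2 can be done with the hdfs CLI along these lines (treating /user/root as the HDFS home directory for the root user, matching the path in the error message):

# Create the HDFS home directory and upload the input file
hdfs dfs -mkdir -p /user/root
hdfs dfs -put -f README.MD /user/root/README.MD
hdfs dfs -ls /user/root
# Re-running the same spark-submit now resolves README.MD to
# hdfs://master:9000/user/root/README.MD and succeeds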