If you add your external files using "spark-submit --files", your files will be uploaded to this HDFS folder:
hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508
(application_1449220589084_0508 is an example YARN application ID.)
1. Find the Spark staging directory with the call below (you still need the HDFS URI and your username):
System.getenv("SPARK_YARN_STAGING_DIR"); --> .sparkStaging/application_1449220589084_0508
2. Find the complete comma-separated file paths with the call below (example output follows, then a runnable sketch of both calls):
System.getenv("SPARK_YARN_CACHE_FILES"); -->
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt
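For reference, a minimal Java sketch of reading both variables from the driver (the class name is mine and this is only a sketch; SPARK_YARN_CACHE_FILES comes from the Spark 1.x YARN integration and may not be exported by newer releases):

import org.apache.spark.sql.SparkSession;

public class StagingDirProbe {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("StagingDirProbe").getOrCreate();

        // e.g. ".sparkStaging/application_1449220589084_0508" (relative to your HDFS home)
        String stagingDir = System.getenv("SPARK_YARN_STAGING_DIR");
        // e.g. "hdfs://.../__spark__.jar,...,hdfs://.../test_file.txt#test_file.txt"
        String cacheFiles = System.getenv("SPARK_YARN_CACHE_FILES");

        System.out.println("staging dir = " + stagingDir);
        System.out.println("cache files = " + cacheFiles);
        spark.stop();
    }
}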
My summary (using --files README.md as the example):
Method 1: as described above, --files uploads the file to the .sparkStaging/applicationId directory on HDFS. Use the approach above to get that HDFS directory first, then read the file from there.
spark.read().textFile(System.getenv("SPARK_YARN_STAGING_DIR") + "/README.md") solves it (a runnable sketch follows). When textFile is given a path with no hdfs://, file://, or other scheme prefix, it is treated by default as a path relative to hdfs://yourcluster/user/your_username. I am not sure whether that is specific to how my cluster is configured.
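A minimal runnable version of Method 1 (the class name is mine; it assumes the job was submitted on YARN with --files README.md, so that SPARK_YARN_STAGING_DIR is set):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class ReadFromStagingDir {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadFromStagingDir").getOrCreate();

        // SPARK_YARN_STAGING_DIR is a relative path (".sparkStaging/application_..."),
        // so textFile resolves it against the default filesystem and user home,
        // i.e. hdfs://yourcluster/user/your_username (at least on my cluster).
        Dataset<String> readme = spark.read().textFile(
                System.getenv("SPARK_YARN_STAGING_DIR") + "/README.md");

        readme.show(5, false);
        spark.stop();
    }
}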
Method 2:
SparkFiles.get(filePath). The result I got was /hadoop/yarn/local/usercache/research/appcache/application_1504461219213_9796/spark-c39002ee-01a4-435f-8682-2ba5950de230/userFiles-e82a7f84-51b1-441a-a5e3-78bf3f4a8828/README.md, but for some reason the file could not be found at that path, neither on the local filesystem nor on HDFS. On closer inspection, the local /hadoop/yarn/local/usercache/research/... directory does contain README.md, but the worker's and the driver's local README.md paths differ.
Cause:
https://stackoverflow.com/questions/35865320/apache-spark-filenotfoundexception
https://stackoverflow.com/questions/41677897/how-to-get-path-to-the-uploaded-file
The directory returned by SparkFiles.get() is a local directory on the driver node, so sc.textFile cannot access that file from the worker nodes. It cannot be used this way (a sketch of the failure follows the quote below).
"""I think that the main issue is that you are trying to read the file via the textFile method.
What is inside the brackets of the textFile method is executed in the driver program. In the worker node only the code tobe run against an RDD is performed.
When you type textFile what happens is that in your driver program it is created a RDD object with a trivial associated DAG.But nothing happens in the worker node."""
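To make the failure concrete, here is a Java sketch of the broken pattern (the class name is mine and the path in the comment is illustrative):

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class BrokenTextFileRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("BrokenTextFileRead").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Evaluated ON THE DRIVER: yields a driver-local path such as
        // /hadoop/yarn/local/usercache/.../userFiles-.../README.md
        String driverLocalPath = SparkFiles.get("README.md");

        // Without a scheme the path is resolved against HDFS, where it does
        // not exist; with a file:// scheme each executor looks on its own
        // local disk, where it does not exist either. Either way the read
        // fails, matching the behavior described above.
        jsc.textFile(driverLocalPath).count();

        spark.stop();
    }
}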
For the difference between --files and addFile, see this question: https://stackoverflow.com/questions/38879478/sparkcontext-addfile-vs-spark-submit-files
In cluster mode, addFile cannot find a local file at all, because the file exists only on the submitting machine, so you have to upload it with --files.
Conclusion: do not use textFile to read files shipped with --files or addFile. (A working alternative is sketched below.)
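Besides Method 1 above, one commonly suggested pattern (a sketch under my own naming, not from the original post) is to call SparkFiles.get inside the task itself, where it resolves to the executor-local copy that --files localized into each container's working directory:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ReadSideFileInTask {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadSideFileInTask").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // The lambda runs on the executors; there SparkFiles.get resolves
        // the executor-local copy of the file shipped via --files README.md.
        List<String> firstLines = jsc.parallelize(Arrays.asList(1, 2, 3), 3)
                .map(i -> Files.readAllLines(Paths.get(SparkFiles.get("README.md"))).get(0))
                .collect();

        firstLines.forEach(System.out::println);
        spark.stop();
    }
}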