If you add your external files using "spark-submit --files", your files will be uploaded to this HDFS folder:
hdfs://your-cluster/user/your-user/.sparkStaging/application_1449220589084_0508
(application_1449220589084_0508 is an example YARN application ID.)
1. Find the Spark staging directory with the call below (you still need the HDFS URI and your username):
System.getenv("SPARK_YARN_STAGING_DIR"); --> .sparkStaging/application_1449220589084_0508
2. Find the complete comma-separated file paths with the call below (example output follows, then a runnable sketch of both calls):
System.getenv("SPARK_YARN_CACHE_FILES"); -->
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar#__spark__.jar,
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/your-spark-job.jar#__app__.jar,
hdfs://yourcluster/user/hdfs/.sparkStaging/application_1449220589084_0508/test_file.txt#test_file.txt
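For reference, a minimal Java sketch of reading both variables from the driver (the class name is mine and this is only a sketch; SPARK_YARN_CACHE_FILES comes from the Spark 1.x YARN integration and may not be exported by newer releases):

import org.apache.spark.sql.SparkSession;

public class StagingDirProbe {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("StagingDirProbe").getOrCreate();

        // e.g. ".sparkStaging/application_1449220589084_0508" (relative to your HDFS home)
        String stagingDir = System.getenv("SPARK_YARN_STAGING_DIR");
        // e.g. "hdfs://.../__spark__.jar,...,hdfs://.../test_file.txt#test_file.txt"
        String cacheFiles = System.getenv("SPARK_YARN_CACHE_FILES");

        System.out.println("staging dir = " + stagingDir);
        System.out.println("cache files = " + cacheFiles);
        spark.stop();
    }
}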
My summary (using --files README.md as the example):
Method 1: as described above, --files uploads the file to the .sparkStaging/applicationId directory on HDFS. Use the approach above to get that HDFS directory first, then read the file from there.
spark.read().textFile(System.getenv("SPARK_YARN_STAGING_DIR") + "/README.md") solves it (a runnable sketch follows). When textFile is given a path with no hdfs://, file://, or other scheme prefix, it is treated by default as a path relative to hdfs://yourcluster/user/your_username. I am not sure whether that is specific to how my cluster is configured.
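A minimal runnable version of Method 1 (the class name is mine; it assumes the job was submitted on YARN with --files README.md, so that SPARK_YARN_STAGING_DIR is set):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class ReadFromStagingDir {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadFromStagingDir").getOrCreate();

        // SPARK_YARN_STAGING_DIR is a relative path (".sparkStaging/application_..."),
        // so textFile resolves it against the default filesystem and user home,
        // i.e. hdfs://yourcluster/user/your_username (at least on my cluster).
        Dataset<String> readme = spark.read().textFile(
                System.getenv("SPARK_YARN_STAGING_DIR") + "/README.md");

        readme.show(5, false);
        spark.stop();
    }
}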
Method 2:
SparkFiles.get(filePath). The result I got was /hadoop/yarn/local/usercache/research/appcache/application_1504461219213_9796/spark-c39002ee-01a4-435f-8682-2ba5950de230/userFiles-e82a7f84-51b1-441a-a5e3-78bf3f4a8828/README.md, but for some reason the file could not be found at that path, neither on the local filesystem nor on HDFS. On closer inspection, the local /hadoop/yarn/local/usercache/research/... directory does contain README.md, but the worker's and the driver's local README.md paths differ.
Cause:
https://stackoverflow.com/questions/35865320/apache-spark-filenotfoundexception
https://stackoverflow.com/questions/41677897/how-to-get-path-to-the-uploaded-file
The directory returned by SparkFiles.get() is a local directory on the driver node, so sc.textFile cannot access that file from the worker nodes. It cannot be used this way (a sketch of the failure follows the quote below).
"""I think that the main issue is that you are trying to read the file via the textFile method.
What is inside the brackets of the textFile method is executed in the driver program. In the worker node only the code tobe run against an RDD is performed.
When you type textFile what happens is that in your driver program it is created a RDD object with a trivial associated DAG.But nothing happens in the worker node."""
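To make the failure concrete, here is a Java sketch of the broken pattern (the class name is mine and the path in the comment is illustrative):

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class BrokenTextFileRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("BrokenTextFileRead").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Evaluated ON THE DRIVER: yields a driver-local path such as
        // /hadoop/yarn/local/usercache/.../userFiles-.../README.md
        String driverLocalPath = SparkFiles.get("README.md");

        // Without a scheme the path is resolved against HDFS, where it does
        // not exist; with a file:// scheme each executor looks on its own
        // local disk, where it does not exist either. Either way the read
        // fails, matching the behavior described above.
        jsc.textFile(driverLocalPath).count();

        spark.stop();
    }
}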
For the difference between --files and addFile, see this question: https://stackoverflow.com/questions/38879478/sparkcontext-addfile-vs-spark-submit-files
In cluster mode, addFile cannot find a local file at all, because the file exists only on the submitting machine, so you have to upload it with --files.
Conclusion: do not use textFile to read files shipped with --files or addFile. (A working alternative is sketched below.)
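Besides Method 1 above, one commonly suggested pattern (a sketch under my own naming, not from the original post) is to call SparkFiles.get inside the task itself, where it resolves to the executor-local copy that --files localized into each container's working directory:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ReadSideFileInTask {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadSideFileInTask").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // The lambda runs on the executors; there SparkFiles.get resolves
        // the executor-local copy of the file shipped via --files README.md.
        List<String> firstLines = jsc.parallelize(Arrays.asList(1, 2, 3), 3)
                .map(i -> Files.readAllLines(Paths.get(SparkFiles.get("README.md"))).get(0))
                .collect();

        firstLines.forEach(System.out::println);
        spark.stop();
    }
}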