Various Bugs with Spark on YARN


Today I submitted my code in Spark on YARN cluster mode and ran into a whole series of problems, so I'm writing them down here.

The code runs fine when submitted with --master yarn-client, but with --master yarn-cluster it keeps failing, and with all kinds of different errors.
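For reference, the only difference between the two submissions is the --master value passed to spark-submit; the class name and jar below are placeholders, not the actual job:

# client mode: the driver runs locally - this worked
spark-submit --master yarn-client --class com.example.MyApp my-app.jar

# cluster mode: the driver runs inside a YARN container - this kept failing
spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar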

1. ClassCastException

java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
	at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
	at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
...

This error usually suggests checking whether the jar has been deployed to every slave, but in yarn-cluster mode the jars are normally distributed for you over the RPC framework, and even copying the jar onto every slave machine by hand made no difference; the error persisted.

I started searching Google and Stack Overflow. The possible causes of this problem are tangled: some say it's a classloader issue, others blame UDFs. One answer caught my attention:

If your code references Java code, it is best to put the built jar under $SPARK_HOME/jars so that it is guaranteed to be on the classpath.
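In practice that means either copying the extra jar into every node's Spark installation, or shipping it explicitly at submit time; the jar names below are placeholders:

# put the dependency jar on Spark's classpath on each node (placeholder name)
cp my-deps.jar $SPARK_HOME/jars/

# or ship it with the application via --jars (placeholder names)
spark-submit --master yarn-cluster --jars my-deps.jar --class com.example.MyApp my-app.jar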

After arranging the jars as that answer suggested and rerunning, the logs in the YARN web UI no longer showed this error. The job still failed, though, with a different one:

2. FileNotFoundException

java.io.FileNotFoundException: File does not exist: hdfs://master:9000/xxx/xxxx/xxxx/application_1495996836198_0003/__spark_libs__1200479165381142167.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
...

This error I recognized. When creating the SparkSession in code I had set the master to the Spark standalone master URL, so when the job was submitted to YARN it still tried to run in standalone mode according to the hard-coded configuration. That mismatch causes confusion and produces all sorts of baffling bugs.

Changing the code and repackaging fixed it.

Solution:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // .master("spark://master:7077")  // comment out the hard-coded master; let spark-submit decide
  .appName("xxxxxxx")
  .getOrCreate()
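With no master set in code, whatever spark-submit passes on the command line (yarn-client, yarn-cluster, or a standalone URL) takes effect, so the same jar can be submitted in any mode without rebuilding.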

Along the way I also hit plenty of other errors, such as this deserialization failure:

SerializerInstance.deserialize(JavaSerializer.scala:114)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Or this kind of class-cast error:

org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple2 cannot be cast to com.xxx.xxxxx.ResultMerge

All of these errors disappeared once the master setting was commented out.

With so many different exceptions appearing at once, it is very easy to get misled.

Fortunately the last error reported was a familiar java.io.FileNotFoundException, which is what finally led to the solution.

3. An HDFS error

The error looks like this:

java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1729427003-192.168.1.219-1527744820505:blk_1073742492_1669; getBlockSize()=24; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.219:50010,DS-e478076c-c3aa-4870-adce-7ffd6a49efe4,DISK], DatanodeInfoWithStorage[192.168.1.21:50010,DS-af806575-7404-45fd-bae0-0fcc59de7598,DISK]]}

This happens when you operate on an HDFS file that is still being written to. Typically it shows up when a file being written by Flume was not closed properly, or when an HDFS restart left files in a bad state.

You can check which files are OPENFORWRITE or MISSING with:

hadoop fsck / -openforwrite | egrep "MISSING|OPENFORWRITE"

The command above identifies the offending files; deleting them resolves the problem.
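As a sketch (the path is a placeholder, not a real file from this cluster), the flagged file can either be deleted or, if the data matters, closed by recovering its lease:

# delete the file that fsck reports as OPENFORWRITE (placeholder path)
hdfs dfs -rm /flume/data/part-0001.tmp

# or try to close the file instead of deleting it
hdfs debug recoverLease -path /flume/data/part-0001.tmp -retries 3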

