今天將代碼以Spark On Yarn Cluster
的方式提交,遇到了很多很多問題.特地記錄一下.
代碼通過--master yarn-client
提交是沒有問題的,但是通過--master yarn-cluster
總是報錯,而且是各種各樣的錯誤.
1.ClassCastException
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
...
這個bug通常會提示我們是否將Jar
包部署到所有的slave
上了,但是yarn-cluster一般會通過RPC
框架分發Jar包,即使將Jar
包一一部署到slave機器中,並沒有任何效果,仍然報這個錯誤.
開始通過google
,stackoverflow
查找相關信息.產生這種問題的原因可謂錯綜復雜,有的說類加載器的問題,有的說UDF的問題.其中有一個引起了我的注意:
如果在代碼中引用了
Java
代碼,最好將代碼打成的Jar
放在$SPARK_HOME/jars
目錄下,確保jar包是在classpath
下.
按照這個解答的方式安排了一下jar
包,然后重新執行.通過yarn的web頁面觀察運行日志,沒有這個報錯了.但是任務失敗了,報了另一個錯誤:
2.FileNotFoundException
java.io.FileNotFoundException: File does not exist: hdfs://master:9000/xxx/xxxx/xxxx/application_1495996836198_0003/__spark_libs__1200479165381142167.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
...
這個錯誤就讓我很熟悉了,我在代碼創建sparkSession
的時候設置了master
,master
地址是spark master
的url
,所以當在yarn上提交任務的時候,最終會按照代碼中的配置開始standalone
模式,這會造成混亂,所以會產生一些莫名其妙的bug.
修改一下代碼重新打包就好了
解決辦法:
val spark = SparkSession.builder()
// .master("spark://master:7077") //注釋掉master的設置
.appName("xxxxxxx")
.getOrCreate();
中間還遇到了其他很多bug,比如無法反序列化
SerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
再或者這種類型轉換錯誤
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple2 cannot be cast to com.xxx.xxxxx.ResultMerge
這些報錯通過注釋掉master
的設置后都會消失.
各種異常交錯出現,這是很容易讓人迷惑的.
幸好最后報了一個熟悉的錯誤java.io.FileNotFoundException
,問題才得以解決.
3.HDFS的bug
報錯如下:
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1729427003-192.168.1.219-1527744820505:blk_1073742492_1669; getBlockSize()=24; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.219:50010,DS-e478076c-c3aa-4870-adce-7ffd6a49efe4,DISK], DatanodeInfoWithStorage[192.168.1.21:50010,DS-af806575-7404-45fd-bae0-0fcc59de7598,DISK]]}
這是因為在操作一個正在寫入的hdfs
文件,通常可能出現在flume寫入的文件未正常關閉,或者hdfs重啟導致的文件問題.
可以通過命令查看一下哪些文件是OPENFORWRITTING
或者MISSING
:
hadoop fsck / -openforwrite | egrep "MISSING|OPENFORWRITE"
通過上面的命令可以確定具體文件,然后將其刪除即可.