Spark Error Troubleshooting Notes


1. Error when running spark-submit

The job was submitted as follows:

# ./spark-submit --class org.apache.spark.examples.SparkPi  /hadoop/spark/examples/jars/spark-examples_2.11-2.4.0.jar 100  

The error output was:

2019-02-22 09:56:26 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/1 is now RUNNING
2019-02-22 09:56:26 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO  BlockManagerMasterEndpoint:54 - Registering block manager kvm-test:36768 with 366.3 MB RAM, BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5aae8eb5{/metrics/json,null,AVAILABLE,@Spark}
2019-02-22 09:56:27 INFO  EventLoggingListener:54 - Logging events to hdfs://hadoop-cluster/spark/eventLog/app-20190222015626-0020.snappy
2019-02-22 09:56:27 INFO  StandaloneSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2019-02-22 09:56:28 INFO  SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2019-02-22 09:56:28 INFO  DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 100 output partitions
2019-02-22 09:56:28 INFO  DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2019-02-22 09:56:28 INFO  DAGScheduler:54 - Parents of final stage: List()
2019-02-22 09:56:28 INFO  DAGScheduler:54 - Missing parents: List()
2019-02-22 09:56:28 INFO  DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2019-02-22 09:56:28 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/1 is now EXITED (Command exited with code 1)
2019-02-22 09:56:28 INFO  StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/1 removed: Command exited with code 1
2019-02-22 09:56:28 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190222015626-0020/2 on worker-20190111083714-172.20.1.1-45882 (172.20.1.1:45882) with 1 core(s)
2019-02-22 09:56:28 INFO  StandaloneSchedulerBackend:54 - Granted executor ID app-20190222015626-0020/2 on hostPort 172.20.1.1:45882 with 1 core(s), 512.0 MB RAM
2019-02-22 09:56:28 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/2 is now RUNNING
2019-02-22 09:56:28 INFO  BlockManagerMaster:54 - Removal of executor 1 requested
2019-02-22 09:56:28 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 1
2019-02-22 09:56:28 INFO  BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2019-02-22 09:56:28 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
2019-02-22 09:56:28 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1236.0 B, free 366.3 MB)
2019-02-22 09:56:28 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on kvm-test:36768 (size: 1236.0 B, free: 366.3 MB)
2019-02-22 09:56:28 INFO  SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
2019-02-22 09:56:28 INFO  DAGScheduler:54 - Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2019-02-22 09:56:28 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 100 tasks
2019-02-22 09:56:29 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/2 is now EXITED (Command exited with code 1)
2019-02-22 09:56:29 INFO  StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/2 removed: Command exited with code 1
2019-02-22 09:56:29 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190222015626-0020/3 on worker-20190111083714-172.20.1.1-45882 (172.20.1.1:45882) with 1 core(s)
2019-02-22 09:56:29 INFO  BlockManagerMaster:54 - Removal of executor 2 requested
2019-02-22 09:56:29 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 2
2019-02-22 09:56:29 INFO  StandaloneSchedulerBackend:54 - Granted executor ID app-20190222015626-0020/3 on hostPort 172.20.1.1:45882 with 1 core(s), 512.0 MB RAM
2019-02-22 09:56:29 INFO  BlockManagerMasterEndpoint:54 - Trying to remove executor 2 from BlockManagerMaster.
2019-02-22 09:56:29 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/3 is now RUNNING
2019-02-22 09:56:31 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/3 is now EXITED (Command exited with code 1)
2019-02-22 09:56:31 INFO  StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/3 removed: Command exited with code 1
2019-02-22 09:56:31 INFO  BlockManagerMasterEndpoint:54 - Trying to remove executor 3 from BlockManagerMaster.
2019-02-22 09:56:31 INFO  BlockManagerMaster:54 - Removal of executor 3 requested
2019-02-22 09:56:31 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 3
2019-02-22 09:56:31 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190222015626-0020/4 on worker-20190111083714-172.20.1.1-45882 (172.20.1.1:45882) with 1 core(s)
2019-02-22 09:56:31 INFO  StandaloneSchedulerBackend:54 - Granted executor ID app-20190222015626-0020/4 on hostPort 172.20.1.1:45882 with 1 core(s), 512.0 MB RAM
2019-02-22 09:56:31 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/4 is now RUNNING
2019-02-22 09:56:33 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/4 is now EXITED (Command exited with code 1)
2019-02-22 09:56:33 INFO  StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/4 removed: Command exited with code 1
2019-02-22 09:56:33 INFO  BlockManagerMasterEndpoint:54 - Trying to remove executor 4 from BlockManagerMaster.
2019-02-22 09:56:33 INFO  BlockManagerMaster:54 - Removal of executor 4 requested  

From the log we can see that the application keeps requesting executors, but each one exits unexpectedly shortly after launch. Further down, the log also shows a warning:

2019-02-22 09:42:58 WARN  TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Analysis: this message indicates that the tasks could not acquire any resources.

Solutions:

Case 1: insufficient resources (either CPU or memory). In this case, adjust the memory (driver or executor) or the number of cores.

For example, submit with explicit resource settings as below. In many cases the problem is that the executor memory was set larger than the memory actually available on the workers:

# ./spark-submit --class org.apache.spark.examples.SparkPi  --executor-memory 512M --total-executor-cores 2 --driver-memory 512M /hadoop/spark/examples/jars/spark-examples_2.11-2.4.0.jar 100   

Case 2: the one I actually ran into. I have a Spark cluster plus a separate Spark client machine. Jobs submitted from inside the cluster ran fine, but the same job submitted from the client failed, even though the client had plenty of memory and CPU. The cause turned out to be an incorrect hostname-to-IP mapping.

The Spark cluster was set up a while ago, and when I set up the Spark client I forgot to add the client's hostname and IP mapping on the cluster nodes. After adding the mapping, everything worked.
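As a concrete illustration of the fix, the mapping must exist in /etc/hosts on every cluster node as well as on the client (the client hostname and IP below are made up for this example; substitute your own):

```
# /etc/hosts on each Spark master/worker node AND on the client
172.20.1.1    kvm-test        # existing cluster node (from the log above)
172.20.1.100  spark-client    # hypothetical submitting client -- easy to forget
```

After editing, verify from a worker that the client hostname resolves (e.g. `ping spark-client`) before resubmitting the job.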

 

總結:

This kind of problem usually has a few possible causes; check and rule them out one by one:
 
(1) The submitting node cannot communicate with the worker nodes. After a job is submitted, a driver process starts on the submitting node (its web UI usually listens on port 4040) to track job progress, and the workers need to report back to it. If the hostname or IP mapping in /etc/hosts is wrong, this communication fails, so check that the hostnames and IPs are configured correctly.

(2) It may also be caused by insufficient memory; adjust the memory settings as appropriate. In addition, check the web UI to make sure the worker nodes are in the ALIVE state.

 

2. The error log is as follows

19/04/08 23:47:19 ERROR ContextCleaner: Error cleaning broadcast 11700946
 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
   at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
   at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
   at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
   at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:148)
   at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:321)
   at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
   at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
   at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
   at scala.Option.foreach(Option.scala:257)
   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
   at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
   at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)

At the same time, the log also reported java.lang.OutOfMemoryError: Java heap space.

Analysis: from the log above, Spark ran short of memory, which triggered long GC pauses; during a long GC pause the executor cannot respond to the driver, so the RPC call between them times out.

Solutions:

(1) Add hardware resources and increase the executor memory;

(2) Increase the job's parallelism, so each task processes less data at a time;

(3) In spark-defaults.conf, increase the executor communication timeouts, e.g. spark.executor.heartbeatInterval (together with the related spark.network.timeout, which must stay larger than the heartbeat interval).
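A sketch of the corresponding spark-defaults.conf entries. The specific values below are illustrative defaults-plus-headroom, not tuned recommendations; adjust them for your workload:

```
# spark-defaults.conf -- example values only
# Heartbeat from executor to driver; raise if long GC pauses cause missed beats
spark.executor.heartbeatInterval   60s
# General network timeout; must be larger than the heartbeat interval.
# spark.rpc.askTimeout falls back to this value when not set explicitly.
spark.network.timeout              600s
```

Note that the stack trace above names spark.rpc.askTimeout as the controlling setting; raising spark.network.timeout covers it unless you have overridden askTimeout separately.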

 

 
