Exception details
20/02/27 19:36:21 INFO TaskSetManager: Starting task 17.1 in stage 3.0 (TID 56, 725.slave.adh, executor 50, partition 17, RACK_LOCAL, 9698 bytes)
20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 6603499ms passed since the last invocation, timeout is currently set to 3600000
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
    at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$2.hasNext(HBaseTableScan.scala:187)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:216)
    at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:183)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:195)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:192)
    at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$3.hasNext(HBaseTableScan.scala:215)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 39288877, already closed?
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:97)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:266)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:350)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:324)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
    ... 3 more
---
First, research pointed to the parameter hbase.client.scanner.timeout.period as the one that needs raising. However, the project uses SHC rather than an externally maintained conf, so the question is how to get the setting in.
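For context, when you control the HBase client code yourself, this setting is a one-line conf.set. A minimal sketch (the timeout value here is an arbitrary example) of what we are trying to achieve through SHC:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

object DirectClientTimeout {
  def main(args: Array[String]): Unit = {
    // When you own the Configuration, raising the scanner timeout is one line.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.client.scanner.timeout.period", "7200000") // example value: 2h
    val connection = ConnectionFactory.createConnection(conf)
    // ... scans created from this connection use the raised timeout ...
    connection.close()
  }
}

SHC builds its Configuration internally, so we cannot do this directly; hence the approaches below.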
Approach 1: modify the local configuration. Two candidate configuration files were found:
/opt/hbase/conf/hbase-site.xml
/opt/hadoop/etc/hadoop/hbase-site.xml
Add:
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>36100000</value>
</property>
Submitted; the problem persisted.
Approach 2: the official README.md has a relevant example
https://github.com/hortonworks-spark/shc
./bin/spark-submit --class your.application.class \
  --master yarn-client \
  --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
  --repositories http://repo.hortonworks.com/content/groups/public/ \
  --jars /usr/hdp/current/phoenix-client/phoenix-server.jar \
  --files /etc/hbase/conf/hbase-site.xml \
  /To/your/application/jar
The key part is that the config file is shipped with the job via --files /etc/hbase/conf/hbase-site.xml.
Modify the local hbase-site.xml to add
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>36100000</value>
</property>
then submit with spark-submit --files /etc/hbase/conf/hbase-site.xml.
The online job now failed outright and could not run at all. The guess: the cluster already has its own hbase-site.xml that differs from the local one, and shipping the local copy overrode the otherwise-correct settings, causing the failure.
One fix would be to ask the HBase maintainers for a complete copy of the production configuration file, add the hbase.client.scanner.timeout.period entry to it, and submit that.
Approach 3: without the original production hbase-site.xml, try submitting an hbase-default.xml instead.
Create a new file hbase-default.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>3620000</value>
  </property>
</configuration>
then submit with spark-submit --files /etc/hbase/conf/hbase-default.xml.
This fails:
20/02/27 22:53:40 INFO SparkContext: Successfully stopped SparkContext
20/02/27 22:53:40 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.RuntimeException: hbase-default.xml file seems to be for an older version of HBase (null), this version is 1.2.2
    at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:71)
The failure comes from HBaseConfiguration's version check: the version recorded in our hbase-default.xml (null) does not match the running HBase (1.2.2). Although it errors out, this is actually encouraging: the check firing means our file really is being loaded.
Add an hbase.defaults.for.version entry that matches the production HBase version:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>3620000</value>
  </property>
  <property>
    <name>hbase.defaults.for.version</name>
    <value>1.2.2</value>
  </property>
</configuration>
The job now submits and runs normally, but the timeout error still shows up:
20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 3803499ms passed since the last invocation, timeout is currently set to 3600000
So the hbase-default.xml setting never actually took effect, which is strange: the version-check exception showed the file was being loaded, so the value should have been applied. (A plausible explanation: HBaseConfiguration loads hbase-default.xml first and hbase-site.xml on top of it, so any key also defined in the cluster's hbase-site.xml overrides ours; see the sketch below.) Setting this aside for now.
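A quick way to check that theory is a standalone sketch (assuming the HBase client jars and the same config files are on the classpath) that prints the effective value after both files are layered:

import org.apache.hadoop.hbase.HBaseConfiguration

object EffectiveTimeoutCheck {
  def main(args: Array[String]): Unit = {
    // create() loads hbase-default.xml first, then hbase-site.xml on top,
    // so a key defined in both files ends up with the hbase-site.xml value.
    val conf = HBaseConfiguration.create()
    println(conf.get("hbase.client.scanner.timeout.period"))
  }
}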
---
Approach 4
From the official issue tracker:
https://github.com/hortonworks-spark/shc/issues/160
There are two ways to do this:
(1) put your extra configurations in a file, and make the file as the value of HBaseRelation.HBASE_CONFIGFILE. Refer to here.
(2) put your extra configurations in json format, and make the json as the value of HBaseRelation.HBASE_CONFIGURATION.
When HBaseRelation.HBASE_CONFIGFILE is not specified, SHC falls back to the configuration on the classpath, and the hbase-default.xml / hbase-site.xml tweaks above have all failed already. Option (1) needs a complete, correct config file, which is exactly what is missing here; a sketch of what it would look like follows.
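For completeness, option (1) would look roughly like this (a hedged sketch: the file path and the read pipeline around it are assumptions, reusing the catalog helper from this project's code):

// Option (1) from the issue: point SHC at an explicit HBase config file.
val df = spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
    HBaseRelation.HBASE_CONFIGFILE -> "/path/to/complete/hbase-site.xml"
  ))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()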
So let's try option (2), HBaseRelation.HBASE_CONFIGURATION. The relevant SHC internals:
val hBaseConfiguration = parameters.get(HBaseRelation.HBASE_CONFIGURATION).map(
  parse(_).extract[Map[String, String]])
val conf = HBaseConfiguration.create
hBaseConfiguration.foreach(_.foreach(e => conf.set(e._1, e._2)))
hBaseConfigFile.foreach(e => conf.set(e._1, e._2))
conf
parse reads the JSON string and extract turns it into a Map[String, String] of key-value settings. The worry from this code is whether the JSON-supplied settings end up replaced by the ones from hbase-site.xml, and it is unknown whether the production hbase-site.xml even carries this key.
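As an aside, a minimal standalone sketch of what that parse(_).extract[Map[String, String]] step does (assuming json4s, which SHC uses, is on the classpath):

import org.json4s._
import org.json4s.jackson.JsonMethods._

object ParseDemo {
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    // The option value is a JSON object of config-key -> value strings.
    val json = """{"hbase.client.scanner.timeout.period": "3820000"}"""
    val m = parse(json).extract[Map[String, String]]
    println(m) // Map(hbase.client.scanner.timeout.period -> 3820000)
  }
}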
Try it:
.options(Map(
HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
HBaseRelation.HBASE_CONFIGURATION ->"{\"hbase.client.scanner.timeout.period\": \"3820000\"}"
))
Submit the job and let it run:
20/02/28 03:35:15 ERROR Executor: Exception in task 16.1 in stage 3.0 (TID 50)
org.apache.hadoop.hbase.client.ScannerTimeoutException: 4092211ms passed since the last invocation, timeout is currently set to 3820000
Still a timeout, but it now reads "timeout is currently set to 3820000": hbase.client.scanner.timeout.period has finally taken effect.
Problem solved. One caveat to add: because the hbase-site.xml on the classpath may differ from one YARN cluster to another, this approach does not suit every scenario.
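For reference, the whole working read path looks roughly like this (a sketch, not a canonical recipe: the SparkSession, the catalog helper from this project, and the raised timeout value are assumptions):

val df = spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
    // JSON-encoded extra HBase settings; set the value above your slowest scan gap.
    HBaseRelation.HBASE_CONFIGURATION -> "{\"hbase.client.scanner.timeout.period\": \"7200000\"}"
  ))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()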