Exception details
20/02/27 19:36:21 INFO TaskSetManager: Starting task 17.1 in stage 3.0 (TID 56, 725.slave.adh, executor 50, partition 17, RACK_LOCAL, 9698 bytes)
20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 6603499ms passed since the last invocation, timeout is currently set to 3600000
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
    at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$2.hasNext(HBaseTableScan.scala:187)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:216)
    at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:183)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:195)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:192)
    at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$3.hasNext(HBaseTableScan.scala:215)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 39288877, already closed?
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:97)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:266)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:350)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:324)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
    ... 3 more
---
First, research pointed to the parameter hbase.client.scanner.timeout.period as the one that needs raising. However, the project uses SHC rather than an externally maintained conf, so the question is how to get the setting in.
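For context, when you control the HBase client code yourself, this setting is a one-line conf.set. A minimal sketch (the timeout value here is an arbitrary example) of what we are trying to achieve through SHC:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

object DirectClientTimeout {
  def main(args: Array[String]): Unit = {
    // When you own the Configuration, raising the scanner timeout is one line.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.client.scanner.timeout.period", "7200000") // example value: 2h
    val connection = ConnectionFactory.createConnection(conf)
    // ... scans created from this connection use the raised timeout ...
    connection.close()
  }
}

SHC builds its Configuration internally, so we cannot do this directly; hence the approaches below.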
Approach 1: modify the local configuration. Two candidate configuration files were found:
/opt/hbase/conf/hbase-site.xml
/opt/hadoop/etc/hadoop/hbase-site.xml
Add:
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>36100000</value>
</property>
Submitted; the problem persisted.
Approach 2: the official README.md has a relevant example
https://github.com/hortonworks-spark/shc
./bin/spark-submit --class your.application.class \
  --master yarn-client \
  --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
  --repositories http://repo.hortonworks.com/content/groups/public/ \
  --jars /usr/hdp/current/phoenix-client/phoenix-server.jar \
  --files /etc/hbase/conf/hbase-site.xml \
  /To/your/application/jar
The key part is that the config file is shipped with the job via --files /etc/hbase/conf/hbase-site.xml.
Modify the local hbase-site.xml to add
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>36100000</value>
</property>
then submit with spark-submit --files /etc/hbase/conf/hbase-site.xml.
The online job now failed outright and could not run at all. The guess: the cluster already has its own hbase-site.xml that differs from the local one, and shipping the local copy overrode the otherwise-correct settings, causing the failure.
One fix would be to ask the HBase maintainers for a complete copy of the production configuration file, add the hbase.client.scanner.timeout.period entry to it, and submit that.
Approach 3: without the original production hbase-site.xml, try submitting an hbase-default.xml instead.
Create a new file hbase-default.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>3620000</value>
  </property>
</configuration>
then submit with spark-submit --files /etc/hbase/conf/hbase-default.xml.
This fails:
20/02/27 22:53:40 INFO SparkContext: Successfully stopped SparkContext
20/02/27 22:53:40 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.RuntimeException: hbase-default.xml file seems to be for an older version of HBase (null), this version is 1.2.2
    at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:71)
The failure comes from HBaseConfiguration's version check: the version recorded in our hbase-default.xml (null) does not match the running HBase (1.2.2). Although it errors out, this is actually encouraging: the check firing means our file really is being loaded.
Add an hbase.defaults.for.version entry that matches the production HBase version:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>3620000</value>
  </property>
  <property>
    <name>hbase.defaults.for.version</name>
    <value>1.2.2</value>
  </property>
</configuration>
The job now submits and runs normally, but the timeout error still shows up:
20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 3803499ms passed since the last invocation, timeout is currently set to 3600000
So the hbase-default.xml setting never actually took effect, which is strange: the version-check exception showed the file was being loaded, so the value should have been applied. (A plausible explanation: HBaseConfiguration loads hbase-default.xml first and hbase-site.xml on top of it, so any key also defined in the cluster's hbase-site.xml overrides ours; see the sketch below.) Setting this aside for now.
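A quick way to check that theory is a standalone sketch (assuming the HBase client jars and the same config files are on the classpath) that prints the effective value after both files are layered:

import org.apache.hadoop.hbase.HBaseConfiguration

object EffectiveTimeoutCheck {
  def main(args: Array[String]): Unit = {
    // create() loads hbase-default.xml first, then hbase-site.xml on top,
    // so a key defined in both files ends up with the hbase-site.xml value.
    val conf = HBaseConfiguration.create()
    println(conf.get("hbase.client.scanner.timeout.period"))
  }
}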
---
Approach 4
From the official issue tracker:
https://github.com/hortonworks-spark/shc/issues/160
There are two ways to do this:
(1) put your extra configurations in a file, and make the file as the value of HBaseRelation.HBASE_CONFIGFILE. Refer to here.
(2) put your extra configurations in json format, and make the json as the value of HBaseRelation.HBASE_CONFIGURATION.
When HBaseRelation.HBASE_CONFIGFILE is not specified, SHC falls back to the configuration on the classpath, and the hbase-default.xml / hbase-site.xml tweaks above have all failed already. Option (1) needs a complete, correct config file, which is exactly what is missing here; a sketch of what it would look like follows.
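For completeness, option (1) would look roughly like this (a hedged sketch: the file path and the read pipeline around it are assumptions, reusing the catalog helper from this project's code):

// Option (1) from the issue: point SHC at an explicit HBase config file.
val df = spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
    HBaseRelation.HBASE_CONFIGFILE -> "/path/to/complete/hbase-site.xml"
  ))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()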
So let's try option (2), HBaseRelation.HBASE_CONFIGURATION. The relevant SHC internals:
val hBaseConfiguration = parameters.get(HBaseRelation.HBASE_CONFIGURATION).map(
  parse(_).extract[Map[String, String]])
val conf = HBaseConfiguration.create
hBaseConfiguration.foreach(_.foreach(e => conf.set(e._1, e._2)))
hBaseConfigFile.foreach(e => conf.set(e._1, e._2))
conf
parse reads the JSON string and extract turns it into a Map[String, String] of key-value settings. The worry from this code is whether the JSON-supplied settings end up replaced by the ones from hbase-site.xml, and it is unknown whether the production hbase-site.xml even carries this key.
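As an aside, a minimal standalone sketch of what that parse(_).extract[Map[String, String]] step does (assuming json4s, which SHC uses, is on the classpath):

import org.json4s._
import org.json4s.jackson.JsonMethods._

object ParseDemo {
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    // The option value is a JSON object of config-key -> value strings.
    val json = """{"hbase.client.scanner.timeout.period": "3820000"}"""
    val m = parse(json).extract[Map[String, String]]
    println(m) // Map(hbase.client.scanner.timeout.period -> 3820000)
  }
}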
Try it:
.options(Map(
HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
HBaseRelation.HBASE_CONFIGURATION ->"{\"hbase.client.scanner.timeout.period\": \"3820000\"}"
))
Submit the job and let it run:
20/02/28 03:35:15 ERROR Executor: Exception in task 16.1 in stage 3.0 (TID 50)
org.apache.hadoop.hbase.client.ScannerTimeoutException: 4092211ms passed since the last invocation, timeout is currently set to 3820000
Still a timeout, but it now reads "timeout is currently set to 3820000": hbase.client.scanner.timeout.period has finally taken effect.
Problem solved. One caveat to add: because the hbase-site.xml on the classpath may differ from one YARN cluster to another, this approach does not suit every scenario.
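For reference, the whole working read path looks roughly like this (a sketch, not a canonical recipe: the SparkSession, the catalog helper from this project, and the raised timeout value are assumptions):

val df = spark.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
    // JSON-encoded extra HBase settings; set the value above your slowest scan gap.
    HBaseRelation.HBASE_CONFIGURATION -> "{\"hbase.client.scanner.timeout.period\": \"7200000\"}"
  ))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()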