Original post: https://blog.csdn.net/njpjsoftdev/article/details/52956508
We ran into many problems while using Druid in production. By reading the official documentation and source code, and by asking questions in the community, we solved or partially solved most of them. The problems, solutions, and tuning experience are summarized below:
Problem 1: Hadoop batch ingestion fails, with the log error "No buckets?…"
Solution: this problem troubled us for about a week at first; almost everyone new to Druid runs into a timezone issue of some kind.
The problem itself is simple: the cluster's working timezone did not match the timezone of the ingested data. Druid is a time-series database and is therefore very sensitive to time. Internally it stores time as absolute milliseconds; if no timezone is specified it outputs UTC (the zero timezone), i.e. ISO 8601 yyyy-MM-ddTHH:mm:ss.SSSZ. Our production environment uses UTC+8 (the Asia/Hong Kong timezone), so all UTC times in the cluster had to be adjusted to UTC+08:00, and the timestamp column of the ingested data had to use the format yyyy-MM-ddTHH:mm:ss.SSS+08:00.
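For reference, here is a minimal sketch of the parseSpec portion of an ingestion spec, assuming JSON input with a column named "timestamp" holding ISO 8601 values such as 2016-05-26T21:00:00.000+08:00 (the column name is an assumption, and dimensionsSpec is left empty for brevity; adjust both to your data):

```json
{
  "parseSpec": {
    "format": "json",
    "timestampSpec": {
      "column": "timestamp",
      "format": "iso"
    },
    "dimensionsSpec": {
      "dimensions": []
    }
  }
}
```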
Problem 2: Druid and Hadoop have a Jackson version conflict (a Jackson-related exception shows up in the ingestion task logs).
Solution: the current version of Druid is compiled against Hadoop 2.3.0. To address the error above, either switch the Hadoop version to 2.3.0 (which is what we did in production) or follow the approach described in this document:
https://github.com/druid-io/druid/blob/master/docs/content/operations/other-hadoop.md
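One of the approaches discussed on that page is to isolate the ingestion job's classes from Hadoop's own Jackson jars through the MapReduce classloader. A hedged sketch of what that looks like in the tuningConfig of a Hadoop batch ingestion spec (verify the property name against the linked doc and your Hadoop version):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.classloader": "true"
    }
  }
}
```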
Problem 3: the recommended segment size is 300-700 MB. We currently aggregate once per hour; the raw data is roughly 50 GB per hour and the generated segments total about 10 GB. Oversized segments hurt query performance. How should we handle this?
Solution:
There are currently two approaches:
- Reduce segmentGranularity. We aggregate at hourly granularity; consider dropping to minute-level granularity, e.g. 20 minutes.
- Add a partitionsSpec to the tuningConfig, which splits each time interval into multiple smaller segments (see the sketch after this list).
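A hedged sketch of a hashed partitionsSpec inside the tuningConfig; targetPartitionSize is the approximate number of rows per segment, and the value below is only illustrative (tune it so that each segment lands in the recommended 300-700 MB range):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 5000000
    }
  }
}
```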
Problem 4: after handoff fails, the indexing task is not released for a long time and keeps waiting for handoff; the error log is as follows:
2016-05-26 22:05:25,805 ERROR i.d.s.r.p.CoordinatorBasedSegmentHandoffNotifier [coordinator_handoff_scheduled_0] Exception while checking handoff for dataSource[hm_flowinfo_analysis] Segment[SegmentDescriptor{interval=2016-05-26T21:00:00.000+0800/2016-05-26T22:00:00.000+0800, version='2016-05-26T21:00:00.000+0800', partitionNumber=0}], Will try again after [60000]secs
Solution:
This problem is very common when running Druid, and there are many similar questions in the community. The cause usually comes down to the Historical nodes not having enough memory to load the segment. Our approach to troubleshooting this class of problem is summarized below:
- First, a developer's answer to this question:
I guess due to the network storage the segment is not being pushed to deep storage at all.
Do you see a segment metadata entry in DB for above segment ?
If no, then check for any exception in the task logs or overlord logs related to segment publishing.
If the metadata entry is present in the db, make sure you have enough free space available on the historical nodes to load the segments and there are no exceptions in coordinator/historical while loading the segment.
- And a brief explanation of the workflow during the handoff phase:
The coordinator is the process that detects the new segment built by the indexing task and signals the historical nodes to load the segment. The indexing task will only complete once it gets notification that a historical has picked up the segment so it knows it can stop serving it. The coordinator logs should help determine whether or not the coordinator noticed the new segment, if it tried to signal a historical to load it but failed, if there were rules preventing it from loading, etc. Historical logs would show you if a historical received the load order but failed for some reason (e.g. out of memory).
Putting the two points above together, the root cause of the problem can be summarized as follows:
If an indexing task is not released for a long time, it is because it never received confirmation that the Historical nodes loaded the segment successfully. The CoordinatorBasedSegmentHandoffNotifier class is responsible for registering segments that are waiting for handoff and for checking the status of those segments.
Registering segments waiting for handoff simply stores their information in an internal ConcurrentMap;
Checking the status of segments waiting for handoff works as follows:
(1) query ZooKeeper through Apache Curator for live Coordinator instances;
(2) CoordinatorClient, which wraps an HttpClient, sends an HTTP GET request to a live Coordinator instance to fetch the load status (a List) of all segments in the cluster;
(3) the internal ConcurrentMap is compared against that List: segments that handed off successfully are removed, while failures produce the log message above. The Coordinator also syncs published segments from the metadata store every minute, so a failed handoff is retried once a minute.
The final solution is summarized as follows:
- Check whether the segment is registered in the metadata store. If it is not, check the task logs (on the MiddleManager) or the Overlord logs for exceptions thrown while publishing the segment;
- If the segment is registered in the metadata store, check whether the Coordinator has detected the newly built segment; if the Coordinator tried to notify the Historical nodes to load it but failed, check whether any retention rules prevent the segment from being loaded (see the sketch after this list), and check whether the Historical nodes that were told to load the segment threw exceptions while loading it.
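To illustrate the "rules" point: retention rules are evaluated top-down by the Coordinator, and a segment whose interval is not covered by any load rule will never be assigned to a Historical node. A hedged sketch of a rule set that loads only the most recent month and drops everything older (the tier name and replica count are placeholders):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "_default_tier": 2 }
  },
  {
    "type": "dropForever"
  }
]
```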
Problem 5: with the Kafka Indexing Service newly introduced in 0.9.1.1, if the Kafka retention time is set too short and Druid consumes more slowly than Kafka expires data, offsets expire: data that has not yet been consumed has already been deleted by Kafka. The Peon logs then continuously show OffsetOutOfRangeException, and all subsequent tasks fail.
Solution:
This situation is fairly extreme. Because our data volume is large and the cluster's bandwidth is limited, we set the Kafka retention time to 2 hours. Kafka does provide the auto.offset.reset policy to deal with expired offsets, and we configured {"auto.offset.reset" : "latest"} in the spec file, meaning that when an offset has expired the consumer should automatically rewind to the latest offset. Tracing the Overlord and Peon logs showed that the setting took effect in the Overlord, but in the Peon it was forced to "none", which means an expired offset is not handled at all and only an OffsetOutOfRangeException is thrown. The setting does not take effect in the Peon because it is hard-coded in the source in order to preserve exactly-once semantics; unfortunately, the developers did not at the same time provide a way out of the resulting failure loop.
The developers' response to this problem was as follows:
Because of your 2 hour data retention, I guess you’re hitting a case where the Druid Kafka indexing tasks are trying to read offsets that have already been deleted. This causes problems with the exactly-once transaction handling scheme, which requires that all offsets be read in order, without skipping any. The Github issue https://github.com/druid-io/druid/issues/3195 is about making this better – basically you would have an option to reset the Kafka indexing to latest (this would involve resetting the ingestion metadata Druid stores for the datasource).
In the meantime, maybe it’s possible to make this happen less often by either extending your Kafka retention, or by setting your Druid taskDuration lower than the default of 1 hour.
So there is no complete fix for this problem at the moment, but the following measures can partially or temporarily resolve it:
- Increase the Kafka retention time as much as possible; our 2-hour setting really was too extreme;
- Reduce taskDuration (see the supervisor spec sketch after this list);
- If neither of the above fully solves the problem, clear the druid_dataSource table in the metadata store, which records the partitions and offsets for every Kafka topic being consumed, and then restart the MiddleManager nodes.
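For reference, a hedged sketch of the ioConfig section of a Kafka supervisor spec with a taskDuration below the one-hour default; the topic name, broker addresses, and task/replica counts are placeholders:

```json
{
  "ioConfig": {
    "topic": "example_topic",
    "consumerProperties": {
      "bootstrap.servers": "kafka01:9092,kafka02:9092"
    },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT30M"
  }
}
```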
That said, this feature should be further improved in the 0.9.2 milestone; the following is quoted from GitHub:
The Kafka indexing service can get into a stuck state where it is trying to read the Kafka offset following the last one recorded in the dataSource metadata table but can’t read it because Kafka’s retention period has elapsed and that message is no longer available. We probably don’t want to automatically jump to a valid Kafka offset since we would have missed events and would no longer have ingested each message exactly once, so currently we start throwing exceptions and the user needs to acknowledge what happened by removing the last committed offset from the dataSource table.
It would be nice to include an API that will help them by either removing the dataSource table entry or setting it to a valid Kafka offset, but would still require a manual action/acknowledgment by the user.