1 java.io.IOException: java.io.IOException: java.lang.IllegalArgumentException: offset (0) + length (8) exceed the capacity of the array: 4
This came up during a simple incr operation. The cause: the earlier put stored an int, so the value length was vlen=4, which increment cannot work with; the value must be written as a long (vlen=8).
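A minimal sketch of the mismatch and the fix, with hypothetical table, family, and qualifier names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo_table");   // hypothetical table name
        byte[] row = Bytes.toBytes("row1");
        byte[] cf = Bytes.toBytes("cf");                 // hypothetical family
        byte[] col = Bytes.toBytes("counter");

        // Wrong: Bytes.toBytes(1) writes a 4-byte int, and a later increment
        // (which reads an 8-byte long) fails with
        // "offset (0) + length (8) exceed the capacity of the array: 4".
        // table.put(new Put(row).add(cf, col, Bytes.toBytes(1)));

        // Right: store an 8-byte long so increments work.
        Put put = new Put(row);
        put.add(cf, col, Bytes.toBytes(1L));
        table.put(put);

        long value = table.incrementColumnValue(row, cf, col, 1L);
        System.out.println("counter is now " + value);
        table.close();
    }
}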
2 When writing data to a column, either org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: NotServingRegionException: 1 time, servers with issues: 10.xx.xx.37:60020, or org.apache.hadoop.hbase.NotServingRegionException: Region is not online: is thrown. With both errors, master-status shows Regions in Transition stuck in the PENDING_OPEN state for ten-plus minutes, which blocks requests. After taking the 10.xx.xx.37 machine offline, the cluster ran stably overnight with no blocking caused by splits, so a machine problem is suspected. The HMaster log shows this region server endlessly opening and closing regions without doing any split or flush.
RIT stands for region in transition. Every open or close that the HBase master performs on a region inserts a record into the master's RIT, because the master's operations on a region must be atomic, and a region's open and close are carried out cooperatively by the HMaster and the region server. To coordinate these operations and support rollback and consistency, the HMaster uses the RIT mechanism together with the state of the corresponding ZooKeeper node to keep the operations safe and consistent.
OFFLINE,       // region is in an offline state
PENDING_OPEN,  // sent rpc to server to open but has not begun
OPENING,       // server has begun to open but not yet done
OPEN,          // server opened region and updated meta
PENDING_CLOSE, // sent rpc to server to close but has not begun
CLOSING,       // server has begun to close but not yet done
CLOSED,        // server closed region and updated meta
SPLITTING,     // server started split of a region
SPLIT          // server completed split of a region
Further digging showed it was a load-balance problem: the region server's regions were being opened and closed over and over. Following http://abloz.com/hbase/book.html#regions.arch.assignment, restarting the region server returned things to normal.
Later runs of the code again hit "region not online", thrown as NotServingRegionException, which is "Thrown by a region server if it is sent a request for a region it is not serving."
Why does the client keep requesting an offline region? And why are the errors concentrated on 3 of the 150 regions? Tracing the server-side log shows the region is closed by CloseRegionHandler and only reopened about 20 minutes later; after the close, the region the client requests is still this closed one.
3 A switch to stop writing to HBase does not take effect
When the code first went live we added a switch so that, if HBase ran into trouble, writes could be turned off. But when trouble did come, the program simply hung. The current explanation is the ever-lengthening retry mechanism: a 60-second timeout combined with 10 retries backing off from 1 to 32 seconds, so once something goes wrong, flipping the switch no longer helps.
The fix is to configure the RPC timeout and retry count parameters.
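A minimal client-side sketch, assuming the kill switch lives in application code and the goal is simply to fail fast; hbase.rpc.timeout is the standard per-RPC timeout in milliseconds, and the 10 s value is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class FastFailClient {
    // Build a client whose RPCs fail fast, so the application-level
    // kill switch gets a chance to run instead of blocking in retries.
    public static HTable openTable(String tableName) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.rpc.timeout", 10000);   // 10 s per RPC instead of the 60 s default
        return new HTable(conf, tableName);
    }
}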
4 flush, split, and compact cause stop-the-world pauses
Long-running flush and split operations left the HBase server side unable to respond to requests. The region size needs to be tuned and the flush frequency measured in testing; a config sketch follows.
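A hedged hbase-site.xml sketch of the knobs involved; the values here are illustrative, not the ones we finally chose:

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>4294967296</value> <!-- 4 GB per region: larger regions split less often -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- 128 MB memstore triggers a flush -->
</property>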
5 HBase parameter settings
hbase.regionserver.handler.count
Considering the I/O capacity of the SAS disks, this was set to 50.
hbase.hregion.memstore.block.multiplier
When the memstore reaches multiplier times hbase.hregion.memstore.flush.size, reads and writes are blocked while it is flushed; the default is 2.
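For reference, a hedged hbase-site.xml sketch of the two settings just described (50 handlers as chosen above, the multiplier left at its default of 2):

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>50</value>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>2</value>
</property>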
6 region server crash
The region server crashed because an overly long GC pause made the connection between the region server and ZooKeeper time out.
ZooKeeper's internal timeouts are as follows:
minSessionTimeout, in milliseconds, defaults to 2 × tickTime.
maxSessionTimeout, in milliseconds, defaults to 20 × tickTime.
(tickTime is itself a configuration item: the smallest time unit the server uses for its internal timing logic.)
If the sessionTimeout a client asks for falls outside the min-max range, the server clamps it to min or max and then creates a new Session object for that client.
The default tickTime is 2 s, so the largest timeout a client can get is 40 s; even setting the region server's zookeeper.session.timeout to 60 s has no effect.
Changes (a config sketch follows):
- Change the ZooKeeper cluster's tickTime to 9 s, making the maximum timeout 180 s, and set zookeeper.session.timeout to 120 s; this avoids timeouts triggered by GC.
- Add the parameter hbase.regionserver.restart.on.zk.expire set to true; when the region server's ZooKeeper session times out, this restarts the region server instead of shutting it down.
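A sketch of the corresponding configuration, assuming a separately managed zoo.cfg for the ZooKeeper ensemble:

In zoo.cfg:
tickTime=9000

In hbase-site.xml:
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>
</property>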
7 Code problem leading to deadlock
The master's slow-query log showed a single query running for 2 hours, which eventually slowed server responses until it could no longer keep up with heavy writes. The root cause was a getColumns operation that fetched over a hundred thousand columns in one go with no paging; after changing the program to page through about 500 columns at a time (a sketch follows), the problem has not recurred.
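A minimal paging sketch using ColumnPaginationFilter; the table, family, and row names are placeholders, and 500 matches the page size mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PagedGetColumns {
    static final int PAGE_SIZE = 500;   // roughly the page size we settled on

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo_table");    // hypothetical table name
        byte[] row = Bytes.toBytes("some_row");           // hypothetical wide row

        int offset = 0;
        while (true) {
            Get get = new Get(row);
            get.addFamily(Bytes.toBytes("cf"));           // hypothetical column family
            get.setFilter(new ColumnPaginationFilter(PAGE_SIZE, offset));
            Result result = table.get(get);
            if (result.isEmpty()) {
                break;                                    // no more columns in this row
            }
            for (KeyValue kv : result.raw()) {
                // process one page of columns instead of hundreds of thousands at once
                System.out.println(Bytes.toString(kv.getQualifier()));
            }
            offset += PAGE_SIZE;
        }
        table.close();
    }
}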
8 operation too slow
2012-07-26 05:30:39,141 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"processingtimems":69315,"ts":9223372036854775807,"client":"10.75.0.109:34780","starttimems":1343251769825,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"delete","totalColumns":1,"table":"trackurl_status_list","families":{"sl":[{"timestamp":1343251769825,"qualifier":"zzzn1VlyG","vlen":0}]},"row":""}
Deleting a single row took 69315 ms (the processingtimems above), roughly 69 seconds.
Stranger still, the row is "": a row key cannot be set to null, but an empty string can be passed in. A round of tests gave:
Empty row-key, deleting a non-existent column: 700 ms
Empty row-key, deleting an existing column: 5 ms
Non-empty row-key, deleting any column: 3 ms
It is not clear whether this is a bug, and it is still unknown how an empty row-key got passed in; the current strategy is to guard against empty row-keys on the code side.
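A minimal sketch of such a guard, assuming deletes go through a small wrapper of our own (SafeDeleter is hypothetical):

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTableInterface;

public class SafeDeleter {
    // Refuse to send a delete whose row-key is null or empty,
    // instead of letting it reach the region server.
    public static void deleteRow(HTableInterface table, byte[] rowKey) throws Exception {
        if (rowKey == null || rowKey.length == 0) {
            return;   // guard: never operate on an empty row-key
        }
        table.delete(new Delete(rowKey));
    }
}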
9 responseTooSlow
2012-07-31 17:52:06,619 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":1156438,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3dbb29e5), rpc version=1, client version=29, methodsFingerPrint=-1508511443","client":"10.75.0.109:35245","starttimems":1343727170177,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
Quoting the HBase documentation: the output is tagged (operationTooSlow) if the call was a client operation, such as a Put, Get, or Delete, for which detailed fingerprint information is exposed. If not, it is tagged (responseTooSlow) and still produces parseable JSON output, but with less verbose information, solely regarding the duration and size of the RPC itself.
The current approach is to drop the batched delete of many columns for a single key so it cannot block; no new problems have appeared since.
10 output error
2012-07-31 17:52:06,812 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call get([B@61574be4, {"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"c":["ALL"]},"maxVersions":1,"row":"zOuu6TK"}), rpc version=1, client version=29, methodsFingerPrint=-1508511443 from 10.75.0.151:52745: output error
11 rollbackMemstore problem
This DEBUG line shows up fairly often:
2012-08-07 10:21:49,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: rollbackMemstore rolled back 0 keyvalues from start:0 to end:0
The method's description reads: "Remove all the keys listed in the map from the memstore. This method is called when a Put has updated memstore but subsequently fails to update the wal. This method is then invoked to rollback the memstore."
The odd part is that both the start and end indexes are 0.
The loop in the method is: for (int i = start; i < end; i++) {
With start == end == 0 the loop body never runs, so nothing is rolled back: an empty rollback of empty data. Needs further investigation.
12 Bringing a new region server online caused "region not online"
Requests for a region were being sent to the wrong region server.
13 Requests for a region that no longer exists; rebuilding the table pool does not help
Timestamp in the request: 1342510667
Timestamp associated with the newest region's row key: 1344558957
It turned out that the table of region locations is maintained in HConnectionManager.
Single get (Get), delete (Delete), and incr (Increment) calls are handled by the ServerCallable class via withRetries:
Case 1: on an error (SocketTimeoutException, ConnectException, RetriesExhaustedException), the cached regionServer location is cleared.
Case 2: if numRetries is set to 1, the loop runs only once, so connect(tries != 0) becomes connect(false), i.e. reload=false, and the location is never refreshed; only with numRetries > 1 is it fetched again.
A List of Gets, a Put or List of Puts, and a List of Deletes go through HConnectionManager's processBatch, which refreshes the cached regionServer location when the results of the batched get/put/delete show problems.
Setting numRetries to more than 1 (3 in my case) solved the problem.
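A minimal sketch of that setting, using the standard client property hbase.client.retries.number (the pool size is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTablePool;

public class RetryingPool {
    // numRetries > 1 lets ServerCallable.withRetries reload the cached
    // region location on the second attempt instead of failing outright.
    public static HTablePool createPool() {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.client.retries.number", 3);   // with 1, the location cache is never refreshed
        return new HTablePool(conf, 10);                 // illustrative pool size
    }
}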
14 zookeeper.RecoverableZooKeeper(195): Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
This happened while testing on a single machine. Whether HBase was started from the IDE or from bin, the shell could connect fine but the test program could not. The ZooKeeper port is 2181, and the client port should in principle have nothing to do with ZooKeeper.
In the end, changing the configured port from 21818 to 2181 made it run normally; presumably this change is only needed in a single-machine setup.
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>