HBase Pre-partitioning in Practice
Author: Yin Zhengjie
Copyright notice: this is an original work; reproduction is prohibited, and violations will be pursued legally.
I. Overview of HBase Pre-partitioning

Each region maintains a startRow and an endRowKey; if incoming data falls within the rowkey range a region maintains, that region stores the data. Following this principle, we can plan roughly in advance which partitions the data will land in, improving HBase performance.

Can pre-partitioning use a timestamp as the rowkey?
    Using a timestamp as the rowkey is not recommended, for a simple reason: it easily causes data skew. As the data volume grows, regions get split, and regions split on timestamps share an obvious trait: the earlier regions hold a fixed set of data, while the latest region keeps absorbing all new writes. This skews the data and triggers frequent region splits.

How do you plan the pre-partitioning?
    This requires an estimate of the total data volume. For example, suppose HBase currently holds 5 billion rows averaging 10 KB each; the total size is 5,000,000,000 * 10 KB / 1024 / 1024 = 47,683 GB (about 47 TB). Assuming each region can store 100 GB of data, 477 regions suffice to hold the existing data.
    But that region count only covers today's data. In practice we must project the company's growth over the next 5 to 10 years. Say the data grows by 1 billion rows per year; over 5 years that adds 5 billion rows, so the projected total after 5 years is 10,000,000,000 * 10 KB / 1024 / 1024 = 95,367 GB (about 95 TB).
    In summary, at 100 GB per region we need at least 954 regions. The next task is to spread the data as evenly as possible across those 954 regions, rather than piling it into a single region and causing data skew.
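The sizing arithmetic above can be sketched as a small Java helper. This is a back-of-the-envelope check, not HBase code; the class and method names are made up for illustration, and the figures (row counts, 10 KB per row, 100 GB per region) come from the text.

```java
public class RegionEstimate {
    // Rows * KB-per-row converted to GB, then divided by region capacity, rounded up.
    public static long regionsNeeded(long rows, long kbPerRow, long gbPerRegion) {
        long totalGb = rows * kbPerRow / 1024 / 1024;       // KB -> GB
        return (totalGb + gbPerRegion - 1) / gbPerRegion;   // round up
    }

    public static void main(String[] args) {
        // 5 billion rows today -> 477 regions
        System.out.println(regionsNeeded(5_000_000_000L, 10, 100));
        // 10 billion rows after 5 years of growth -> 954 regions
        System.out.println(regionsNeeded(10_000_000_000L, 10, 100));
    }
}
```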
II. Manually Specifying Split Points
1>. Manually specify split points

hbase(main):001:0> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
Created table staff1
Took 3.0235 seconds
=> Hbase::Table - staff1
hbase(main):002:0>

hbase(main):006:0> describe 'staff1'
Table staff1 is ENABLED
staff1
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
{NAME => 'partition1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
2 row(s)
QUOTAS
0 row(s)
Took 0.0648 seconds
hbase(main):007:0>
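With the split points above, staff1 has 5 regions: (-inf, '1000'), ['1000', '2000'), ['2000', '3000'), ['3000', '4000'), and ['4000', +inf). The routing rule can be sketched in plain Java (this is not HBase client code; the class name is made up, and HBase actually compares keys as unsigned raw bytes, which `Arrays.compareUnsigned` mirrors):

```java
import java.util.Arrays;

public class RegionLocator {
    static final String[] SPLITS = {"1000", "2000", "3000", "4000"};

    // Returns the 0-based index of the region that would hold the rowkey:
    // a key belongs to the first region whose end key is strictly greater than it.
    public static int regionFor(String rowKey) {
        byte[] key = rowKey.getBytes();
        for (int i = 0; i < SPLITS.length; i++) {
            if (Arrays.compareUnsigned(key, SPLITS[i].getBytes()) < 0) {
                return i;
            }
        }
        return SPLITS.length;   // last region: ['4000', +inf)
    }

    public static void main(String[] args) {
        System.out.println(regionFor("0999"));   // region 0: (-inf, '1000')
        System.out.println(regionFor("1500"));   // region 1: ['1000', '2000')
        System.out.println(regionFor("9999"));   // region 4: ['4000', +inf)
    }
}
```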
2>. Visit the HMaster web UI
Open the HMaster address in a browser: http://hadoop101.yinzhengjie.org.cn:16010/
3>. View the region information of the staff1 table
III. Pre-splitting with a Generated Hex Sequence
1>. Generate hex-sequence split points

hbase(main):007:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
Created table staff2
Took 2.4445 seconds
=> Hbase::Table - staff2
hbase(main):008:0>

hbase(main):008:0> describe 'staff2'
Table staff2 is ENABLED
staff2
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
{NAME => 'partition2', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
2 row(s)
QUOTAS
0 row(s)
Took 0.0880 seconds
hbase(main):009:0>
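The idea behind HexStringSplit can be sketched as follows (an assumption: this mirrors the real algorithm in `org.apache.hadoop.hbase.util.RegionSplitter`, which divides the 8-digit hex key space "00000000".."FFFFFFFF" into equal slices; the class name here is illustrative, and the real implementation may format keys slightly differently):

```java
public class HexSplitSketch {
    // For numRegions regions, produce numRegions - 1 evenly spaced 8-digit hex split keys.
    public static String[] splitKeys(int numRegions) {
        long range = 0xFFFFFFFFL;   // highest 8-digit hex key
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = String.format("%08x", range * i / numRegions);
        }
        return splits;
    }

    public static void main(String[] args) {
        // For 15 regions: 11111111, 22222222, ..., eeeeeeee
        for (String s : splitKeys(15)) {
            System.out.println(s);
        }
    }
}
```

This is why hex-split tables pair well with rowkeys that are themselves uniformly distributed hex strings, e.g. an MD5 prefix of the natural key.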
2>. Visit the HMaster web UI
Open the HMaster address in a browser: http://hadoop101.yinzhengjie.org.cn:16010/
3>. View the region information of the staff2 table
IV. Pre-splitting by Rules Defined in a File
1>. Pre-split by rules defined in a file

[root@hadoop101.yinzhengjie.org.cn ~]# vim splits.txt
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# cat splits.txt
AAAAA
BBBBB
CCCCC
DDDDD
EEEEE
[root@hadoop101.yinzhengjie.org.cn ~]# 

hbase(main):009:0> create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
Created table staff3
Took 2.3140 seconds
=> Hbase::Table - staff3
hbase(main):010:0>

hbase(main):010:0> describe 'staff3'
Table staff3 is ENABLED
staff3
COLUMN FAMILIES DESCRIPTION
{NAME => 'partition3', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
1 row(s)
QUOTAS
0 row(s)
Took 0.0955 seconds
hbase(main):011:0>
2>. Visit the HMaster web UI
Open the HMaster address in a browser: http://hadoop101.yinzhengjie.org.cn:16010/
3>. View the region information of the staff3 table
V. Creating Pre-split Tables with the Java API
1>. Create a custom pre-split table with 3 regions

package cn.org.yinzhengjie.bigdata.hbase.util;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/**
 * HBase utility class.
 */
public class HBaseUtil {
    // The ThreadLocal gives each thread its own Connection, so the connection can be
    // shared within a single thread; note this is not a thread-safety mechanism per se.
    private static ThreadLocal<Connection> connHolder = new ThreadLocal<Connection>();

    /**
     * Private constructor: this utility class must not be instantiated.
     */
    private HBaseUtil(){}

    /**
     * Obtain an HBase connection object.
     */
    public static void makeHbaseConnection() throws IOException {
        // Fetch the connection bound to the current thread.
        Connection conn = connHolder.get();
        // On first use it is null, so create the Connection manually.
        if (conn == null){
            // HBaseConfiguration.create() automatically loads "hbase-default.xml" and "hbase-site.xml".
            Configuration conf = HBaseConfiguration.create();
            conn = ConnectionFactory.createConnection(conf);
            connHolder.set(conn);
        }
    }

    /**
     * Insert data.
     */
    public static void insertData(String tableName, String rowKey, String columnFamily, String column, String value) throws IOException {
        // Fetch the connection.
        Connection conn = connHolder.get();
        // Get the table.
        Table table = conn.getTable(TableName.valueOf(tableName));
        // Create a Put object targeting the given rowkey.
        Put put = new Put(Bytes.toBytes(rowKey));
        // Remember to specify the column family.
        put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(column), Bytes.toBytes(value));
        // The data is only actually written to HBase when this line executes.
        table.put(put);
        // Remember to close the table.
        table.close();
    }

    /**
     * Close the connection.
     */
    public static void close() throws IOException {
        // Fetch the connection.
        Connection conn = connHolder.get();
        if (conn != null){
            conn.close();
            // After closing, remove the entry from the ThreadLocal to free the slot.
            connHolder.remove();
        }
    }

    /**
     * Generate split keys.
     */
    public static byte[][] genRegionKeys(int regionCount){
        // 3 regions require only 2 split keys.
        byte[][] bs = new byte[regionCount - 1][];
        for (int i = 0; i < regionCount - 1; i++){
            bs[i] = Bytes.toBytes(i + "|");
        }
        return bs;
    }

    /**
     * Generate a partition number (rowkey salt).
     */
    public static String genRegionNumber(String rowkey, int regionCount){
        int regionNumber;
        // Get a hash of the rowkey.
        int hash = rowkey.hashCode();
        /**
         * (regionCount & (regionCount - 1)) == 0 tests whether regionCount is a power of two:
         *   - if so, AND the hash with (regionCount - 1) to spread keys across the partitions;
         *   - otherwise, take the hash modulo regionCount for the same effect.
         */
        if (regionCount > 0 && (regionCount & (regionCount - 1)) == 0){
            // (regionCount - 1) is the number of split keys; e.g. 3 regions means 2 split keys.
            regionNumber = hash & (regionCount - 1);
        } else {
            // Math.floorMod keeps the result non-negative even when hashCode() is negative.
            regionNumber = Math.floorMod(hash, regionCount);
        }
        return regionNumber + "_" + rowkey;
    }
}
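The interplay between the two helpers above is worth spelling out: `genRegionKeys(3)` produces split keys "0|" and "1|", and because '_' (0x5F) sorts before '|' (0x7C), every salted rowkey "0_..." lands before "0|", "1_..." lands between "0|" and "1|", and "2_..." lands after "1|". The sketch below reproduces that logic standalone so it can run without an HBase cluster (the class name is made up for illustration):

```java
public class SaltDemo {
    // Same idea as genRegionNumber; Math.floorMod guards against the negative
    // partition number a negative hashCode() would otherwise produce for
    // non-power-of-two region counts.
    public static String salt(String rowkey, int regionCount) {
        int hash = rowkey.hashCode();
        int regionNumber;
        if (regionCount > 0 && (regionCount & (regionCount - 1)) == 0) {
            regionNumber = hash & (regionCount - 1);
        } else {
            regionNumber = Math.floorMod(hash, regionCount);
        }
        return regionNumber + "_" + rowkey;
    }

    public static void main(String[] args) {
        // Each key deterministically maps to one of the prefixes 0_, 1_, 2_.
        for (String key : new String[]{"yinzhengjie1", "yinzhengjie2", "yinzhengjie3"}) {
            System.out.println(salt(key, 3));
        }
    }
}
```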

package cn.org.yinzhengjie.bigdata.hbase;

import cn.org.yinzhengjie.bigdata.hbase.util.HBaseUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class PrePartition {
    public static void main(String[] args) throws IOException {
        // Create the configuration object.
        Configuration conf = HBaseConfiguration.create();
        // Get an HBase connection.
        Connection conn = ConnectionFactory.createConnection(conf);
        // Get the HBase administration object.
        Admin admin = conn.getAdmin();
        // Build the table descriptor.
        HTableDescriptor td = new HTableDescriptor(TableName.valueOf("yinzhengjie2020:course1"));
        // Add a column family to the table.
        HColumnDescriptor cd = new HColumnDescriptor("info");
        td.addFamily(cd);
        // 3 regions require only 2 split keys.
        byte[][] splitKeys = HBaseUtil.genRegionKeys(3);
        for (byte[] splitKey : splitKeys) {
            System.out.println(Bytes.toString(splitKey));
        }
        // Create the table with the split keys (a 2-D byte array) to pre-split it.
        admin.createTable(td, splitKeys);
        System.out.println("Pre-partitioned table created successfully....");
        // Release resources.
        admin.close();
        conn.close();
    }
}
2>. Check the web UI
Open in a browser: http://hadoop105.yinzhengjie.org.cn:16010/table.jsp?name=yinzhengjie2020:course
3>. Insert data into the custom partitions


package cn.org.yinzhengjie.bigdata.hbase;

import cn.org.yinzhengjie.bigdata.hbase.util.HBaseUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class InsertPrePartition {
    public static void main(String[] args) throws IOException {
        // Create the configuration object.
        Configuration conf = HBaseConfiguration.create();
        // Get an HBase connection.
        Connection conn = ConnectionFactory.createConnection(conf);
        // Get the table to operate on.
        Table courseTable = conn.getTable(TableName.valueOf("yinzhengjie2020:course"));
        // Prepare the row to insert: salting the rowkey spreads writes evenly across
        // the partitions, much like how a HashMap distributes its entries.
        String rowkey = "yinzhengjie8";
        rowkey = HBaseUtil.genRegionNumber(rowkey, 3);
        Put put = new Put(Bytes.toBytes(rowkey));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(20));
        // Insert the row into the table.
        courseTable.put(put);
        System.out.println("Data inserted successfully......");
        // Release resources.
        courseTable.close();
        conn.close();
    }
}