HBase Pre-Splitting in Practice

Author: Yin Zhengjie

Copyright notice: original work; unauthorized reproduction is prohibited and will be pursued legally.

 

 

 

一. Overview of HBase pre-splitting

  Each region maintains a startRow and an endRowKey; a row whose rowkey falls within the range a region maintains is handled by that region. Following this principle, we can roughly plan in advance which regions data will land in, which improves HBase performance.

  Can a timestamp be used as the rowkey?
    A timestamp rowkey is not recommended, for a simple reason: it easily causes data skew. As the data volume grows, regions get split, and regions split on timestamp rowkeys show an obvious pattern: the earlier regions hold a fixed set of data while new writes keep landing only in the last region, skewing the data and triggering frequent splits of that one region.
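When writes are inherently time-ordered, a common mitigation (a minimal sketch, not from the original text; the bucket count of 4 is an arbitrary assumption) is to salt the rowkey with a small prefix derived from the timestamp, so consecutive writes no longer target a single region:

```java
public class SaltedRowkeyDemo {
    public static void main(String[] args) {
        long ts = 1_700_000_000_000L;   // example epoch-millis timestamp
        int buckets = 4;                // hypothetical number of salt buckets
        // The salt prefix breaks the monotonic ordering of raw timestamps,
        // spreading consecutive writes across up to 4 regions.
        String rowkey = (ts % buckets) + "_" + ts;
        System.out.println(rowkey);     // 0_1700000000000
    }
}
```

Reads must then fan out over all buckets, so this trades some scan locality for balanced writes.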

  How do you plan the pre-splits?
    You need an estimate of the total data volume. For example, suppose HBase currently holds 5 billion rows averaging 10 KB each; the total size is 5,000,000,000 * 10 KB / 1024 / 1024 = 47,683 GB (about 47 TB). Assuming each region stores 100 GB of data, 477 regions are enough for the existing data.
    But that region count only covers existing data; in practice you should project the company's data growth over the next 5-10 years. Say the table grows by 1 billion rows per year; over 5 years that adds another 5 billion rows, so the projected total after 5 years is 10 billion rows, i.e. 10,000,000,000 * 10 KB / 1024 / 1024 = 95,367 GB (about 93 TB).
    In summary, at 100 GB per region you need at least 954 regions, and the next task is to spread the data as evenly as possible across those 954 regions rather than concentrating it in a single region and causing skew.
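The sizing arithmetic above can be checked with a short sketch (the row counts, 10 KB average row size, and 100 GB per region are the assumptions from the example):

```java
public class RegionPlanning {
    public static void main(String[] args) {
        long avgKb = 10;                                       // average row size in KB
        long currentGb = 5_000_000_000L * avgKb / 1024 / 1024;  // existing 5 billion rows
        long futureGb = 10_000_000_000L * avgKb / 1024 / 1024;  // projected 10 billion rows
        long regions = (futureGb + 100 - 1) / 100;              // ceiling at 100 GB per region
        System.out.println(currentGb);   // 47683
        System.out.println(futureGb);    // 95367
        System.out.println(regions);     // 954
    }
}
```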

 

二. Setting split points manually

1>. Create a table with manual split points

hbase(main):001:0> create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']        #split keys are '1000','2000','3000','4000'
Created table staff1
Took 3.0235 seconds                                                                                                                                                                                                                                                           
=> Hbase::Table - staff1
hbase(main):002:0> 
hbase(main):006:0> describe 'staff1'
Table staff1 is ENABLED                                                                                                                                                                                                                                                       
staff1                                                                                                                                                                                                                                                                        
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'info', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTE
R => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                                   

{NAME => 'partition1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOO
MFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                             

2 row(s)

QUOTAS                                                                                                                                                                                                                                                                        
0 row(s)
Took 0.0648 seconds                                                                                                                                                                                                                                                           
hbase(main):007:0> 

2>. Access the HMaster WebUI

  Open the HMaster address in a browser:
    http://hadoop101.yinzhengjie.org.cn:16010/

3>. View the region information for table staff1

 

三. Pre-splitting with a generated hexadecimal sequence

1>. Create a table pre-split with HexStringSplit

hbase(main):007:0> create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
Created table staff2
Took 2.4445 seconds                                                                                                                                                                                                                                                           
=> Hbase::Table - staff2
hbase(main):008:0> 
hbase(main):008:0> describe 'staff2'
Table staff2 is ENABLED                                                                                                                                                                                                                                                       
staff2                                                                                                                                                                                                                                                                        
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'info', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTE
R => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                                   

{NAME => 'partition2', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOO
MFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                             

2 row(s)

QUOTAS                                                                                                                                                                                                                                                                        
0 row(s)
Took 0.0880 seconds                                                                                                                                                                                                                                                           
hbase(main):009:0>

2>. Access the HMaster WebUI

  Open the HMaster address in a browser:
    http://hadoop101.yinzhengjie.org.cn:16010/

3>. View the region information for table staff2

 

四. Pre-splitting from rules defined in a file

1>. Create a table pre-split from a splits file

[root@hadoop101.yinzhengjie.org.cn ~]# vim splits.txt
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# cat splits.txt
AAAAA
BBBBB
CCCCC
DDDDD
EEEEE
[root@hadoop101.yinzhengjie.org.cn ~]# 
hbase(main):009:0> create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
Created table staff3
Took 2.3140 seconds                                                                                                                                                                                                                                                           
=> Hbase::Table - staff3
hbase(main):010:0> 
hbase(main):010:0> describe 'staff3'
Table staff3 is ENABLED                                                                                                                                                                                                                                                       
staff3                                                                                                                                                                                                                                                                        
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'partition3', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOO
MFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                             

1 row(s)

QUOTAS                                                                                                                                                                                                                                                                        
0 row(s)
Took 0.0955 seconds                                                                                                                                                                                                                                                           
hbase(main):011:0> 

2>. Access the HMaster WebUI

  Open the HMaster address in a browser:
    http://hadoop101.yinzhengjie.org.cn:16010/

3>. View the region information for table staff3

 

五. Creating a pre-split table with the Java API

1>. Generate a custom pre-split table with 3 regions

package cn.org.yinzhengjie.bigdata.hbase.util;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/**
 * HBase utility class.
 */
public class HBaseUtil {

    //A ThreadLocal gives each thread its own private slot, so the Connection is shared within a single thread; it does not by itself solve thread-safety problems.
    private static ThreadLocal<Connection> connHolder = new ThreadLocal<Connection>();

    /**
     *  Private constructor: this utility class must not be instantiated.
     */
    private HBaseUtil(){}


    /**
     *  Obtain an HBase connection for the current thread.
     */
    public static void makeHbaseConnection() throws IOException {
        //Look up the cached connection
        Connection conn = connHolder.get();

        //On the first call there is no cached connection, so create one
        if (conn == null){
            //HBaseConfiguration.create() automatically loads "hbase-default.xml" and "hbase-site.xml".
            Configuration conf = HBaseConfiguration.create();
            conn = ConnectionFactory.createConnection(conf);
            connHolder.set(conn);
        }
    }

    /**
     *  Insert a row.
     */
    public static void insertData (String tableName, String rowKey, String columnFamily, String column, String value)
            throws IOException{
        //Get the thread-local connection
        Connection conn = connHolder.get();
        //Get the table
        Table table = conn.getTable(TableName.valueOf(tableName));
        //Create a Put targeting the given rowkey
        Put put = new Put(Bytes.toBytes(rowKey));
        //Add the column family, column qualifier, and value
        put.addColumn(Bytes.toBytes(columnFamily),Bytes.toBytes(column),Bytes.toBytes(value));
        //The data is only actually written to HBase when put() executes
        table.put(put);
        //Remember to close the table
        table.close();
    }

    /**
     *  Close the connection.
     */
    public static void close() throws IOException {
        //Get the thread-local connection
        Connection conn = connHolder.get();
        if (conn != null){
            conn.close();
            //After closing, remove the entry from the ThreadLocal to release the slot.
            connHolder.remove();
        }
    }

    /**
     *  Generate the split keys.
     */
    public static byte[][] genRegionKeys(int regionCount){
        //N regions need N-1 split keys; e.g. 3 regions need 2 split keys
        byte[][] bs = new byte[regionCount -1][];
        for (int i = 0;i<regionCount -1;i++){
            bs[i] = Bytes.toBytes(i + "|");
        }
        return bs;
    }


    /**
     *  Prefix a rowkey with its partition number.
     */
    public static String genRegionNumber(String rowkey,int regionCount){

        //The partition number to compute
        int regionNumber;
        //Hash the rowkey
        int hash = rowkey.hashCode();

        /**
         *  (regionCount & (regionCount - 1)) == 0 :
         *      tests whether regionCount is a power of two:
         *          if it is, AND the hash with (regionCount - 1) to map rowkeys across the partitions;
         *          if not, take the hash modulo regionCount for the same effect.
         */
        if (regionCount > 0 && (regionCount & (regionCount - 1)) == 0){
            // (regionCount - 1) is also the number of split keys; e.g. 3 regions have 2 split keys.
            regionNumber = hash & (regionCount -1);
        }else {
            //Math.floorMod keeps the result non-negative even when hashCode() is negative
            regionNumber = Math.floorMod(hash, regionCount);
        }

        return regionNumber + "_" + rowkey;
    }
}
HBaseUtil.java
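A quick way to see why "|" works as a split-key suffix: '|' (0x7C) sorts after '_' (0x5F) and after every digit in byte order, so a rowkey prefixed with partition number N always falls in region N. This standalone sketch (plain Java string comparison, which matches byte order for ASCII) illustrates the ranges produced by split keys "0|" and "1|":

```java
public class SplitKeyOrderDemo {
    public static void main(String[] args) {
        // Split keys "0|" and "1|" create three ranges:
        //   (-inf, "0|"), ["0|", "1|"), ["1|", +inf)
        System.out.println("0_yinzhengjie".compareTo("0|") < 0);  // true: lands in region 0
        System.out.println("1_yinzhengjie".compareTo("0|") > 0
                && "1_yinzhengjie".compareTo("1|") < 0);          // true: lands in region 1
        System.out.println("2_yinzhengjie".compareTo("1|") > 0);  // true: lands in region 2
    }
}
```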
package cn.org.yinzhengjie.bigdata.hbase;

import cn.org.yinzhengjie.bigdata.hbase.util.HBaseUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class PrePartition {
    public static void main(String[] args) throws IOException {
        //Create the configuration object
        Configuration conf = HBaseConfiguration.create();

        //Get the HBase connection object
        Connection conn = ConnectionFactory.createConnection(conf);

        //Get the Admin object used to manage HBase
        Admin admin = conn.getAdmin();

        //Build the table descriptor
        HTableDescriptor td = new HTableDescriptor(TableName.valueOf("yinzhengjie2020:course1"));

        //Add a column family to the table
        HColumnDescriptor cd = new HColumnDescriptor("info");
        td.addFamily(cd);

        //3 regions require 2 split keys
        byte[][] splitKeys = HBaseUtil.genRegionKeys(3);
        for (byte[] splitKey : splitKeys) {
            System.out.println(Bytes.toString(splitKey));
        }

        //Create the table with the split keys (a 2-D byte array) to pre-split it
        admin.createTable(td,splitKeys);
        System.out.println("Pre-split table created successfully....");

        //Release resources
        admin.close();
        conn.close();
    }
}
PrePartition.java

2>. View the WebUI

  Open in a browser:
    http://hadoop105.yinzhengjie.org.cn:16010/table.jsp?name=yinzhengjie2020:course

3>. Insert data into the custom pre-split table

HBaseUtil.java (identical to the utility class shown in section 1>. above)
package cn.org.yinzhengjie.bigdata.hbase;

import cn.org.yinzhengjie.bigdata.hbase.util.HBaseUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class InsertPrePartition {
    public static void main(String[] args) throws IOException {
        //Create the configuration object
        Configuration conf = HBaseConfiguration.create();
        //Get the HBase connection object
        Connection conn = ConnectionFactory.createConnection(conf);
        //Get the table to operate on
        Table courseTable = conn.getTable(TableName.valueOf("yinzhengjie2020:course"));
        //Prefix the rowkey with its partition number so rows spread evenly across the regions, much like HashMap bucketing.
        String rowkey = "yinzhengjie8";
        rowkey = HBaseUtil.genRegionNumber(rowkey, 3);
        Put put = new Put(Bytes.toBytes(rowkey));
        put.addColumn(Bytes.toBytes("info"),Bytes.toBytes("age"),Bytes.toBytes(20));
        //Insert the row into the table
        courseTable.put(put);
        System.out.println("Row inserted successfully......");
        //Release resources
        courseTable.close();
        conn.close();
    }
}
InsertPrePartition.java
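To check which region prefix a given rowkey gets, the partitioning idea behind HBaseUtil.genRegionNumber can be exercised on its own. This is a re-implementation without any HBase dependency so it runs standalone (it uses Math.floorMod to keep a negative hashCode from producing a negative prefix; the sample rowkey "abc" is just an illustration):

```java
public class PartitionNumberDemo {
    // Same partitioning idea as HBaseUtil.genRegionNumber, HBase-free.
    static String genRegionNumber(String rowkey, int regionCount) {
        int hash = rowkey.hashCode();
        int regionNumber;
        if (regionCount > 0 && (regionCount & (regionCount - 1)) == 0) {
            regionNumber = hash & (regionCount - 1);          // power of two: bit mask
        } else {
            regionNumber = Math.floorMod(hash, regionCount);  // non-negative modulo
        }
        return regionNumber + "_" + rowkey;
    }

    public static void main(String[] args) {
        // "abc".hashCode() is 96354; 96354 % 3 == 0, so the prefix is 0.
        System.out.println(genRegionNumber("abc", 3));  // 0_abc
        // With 4 regions (a power of two) the mask branch runs: 96354 & 3 == 2.
        System.out.println(genRegionNumber("abc", 4));  // 2_abc
    }
}
```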

 

