HBase Data Backup & Disaster Recovery
Tags (space separated): HBase
I. Distcp
When backing up by copying HDFS files with the distcp command, the table being backed up must be disabled to ensure no data is written to it during the copy, so this approach is not usable for an HBase cluster serving online traffic. Once the now-static table directory has been distcp'ed to another HDFS file system, all of the data can be restored by starting a new HBase cluster on top of it.
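A minimal sketch of this flow, assuming the default HBase 1.x root directory layout and hypothetical cluster and table names (src-nn, dst-nn, table_name):
# in the hbase shell: disable the table so its files stop changing
disable 'table_name'
# on the command line: copy the table directory to the other HDFS
hadoop distcp hdfs://src-nn:8020/hbase/data/default/table_name hdfs://dst-nn:8020/hbase/data/default/table_name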
II. CopyTable
Before running the command, the table must first be created on the peer cluster.
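A minimal sketch of pre-creating the destination table on the peer cluster (the table and column-family names follow the example below and are otherwise hypothetical):
hbase shell> create 'TestTable', 'myNewCf', 'cf2', 'cf3'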
CopyTable supports a time range, a row range, renaming the table, renaming column families, and optionally copying deleted data. For example:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=dstClusterZK:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
1. Copy to a different table name within the same cluster
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy srcTable
2. Copy a table across clusters
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase srcTable
Note that cross-cluster CopyTable uses a push model, i.e. the command must be run from the source cluster.
CopyTable usage (from --help):
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
Options:
rs.class hbase.regionserver.class of the peer cluster,
specify if different from current cluster
rs.impl hbase.regionserver.impl of the peer cluster,
startrow the start row
stoprow the stop row
starttime beginning of the time range (unixtime in millis)
without endtime means from starttime to forever
endtime end of the time range. Ignored if no starttime specified.
versions number of cell versions to copy
new.name new table's name
peer.adr Address of the peer cluster given in the format
hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
families comma-separated list of families to copy
To copy from cf1 to cf2, give sourceCfName:destCfName.
To keep the same name, just give "cfName"
all.cells also copy delete markers and deleted cells
Args:
tablename Name of the table to copy
Examples:
To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
For performance consider the following general options:
It is recommended that you set the following to >=100. A higher value uses more memory but
decreases the round trip time to the server and may increase performance.
-Dhbase.client.scanner.caching=100
The following should always be set to false, to prevent writing data twice, which may produce
inaccurate results.
-Dmapred.map.tasks.speculative.execution=false
Some more examples:
# Back up by time range to another cluster
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1478448000000 --endtime=1478591994506 --peer.adr=VECS00001,VECS00002,VECS00003:2181:/hbase --families=txjl --new.name=hy_membercontacts_bk hy_membercontacts
# Back up by time range
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1478448000000 --endtime=1478591994506 --new.name=hy_membercontacts_bk hy_membercontacts
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1477929600000 --endtime=1478591994506 --new.name=hy_linkman_tmp hy_linkman
# Back up the entire table
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=hy_mobileblacklist_bk_before_del hy_mobileblacklist
# Extension: query by time range in the hbase shell
scan 'hy_linkman', {COLUMNS => 'lxr:sguid', TIMERANGE => [1478966400000, 1479052799000]}
scan 'hy_mobileblacklist', {COLUMNS => 'mobhmd:sguid', TIMERANGE => [1468719824000, 1468809824000]}
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=hy_mobileblacklist_bk_before_del_20161228 hy_mobileblacklist
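To sanity-check a full-table copy, the row counts of the source and backup tables can be compared with RowCounter (a sketch, reusing the table names from the examples above):
hbase org.apache.hadoop.hbase.mapreduce.RowCounter hy_mobileblacklist
hbase org.apache.hadoop.hbase.mapreduce.RowCounter hy_mobileblacklist_bk_before_del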
III. Export/Import (via MapReduce)
Export: run the export command
Custom parameters can be passed with -D; here we specify the table name, column family, start and stop row keys, and the HDFS output directory:
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.scan.column.family=cf -D hbase.mapreduce.scan.row.start=0000001 -D hbase.mapreduce.scan.row.stop=1000000 table_name /tmp/hbase_export
Optional -D parameters:
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]
Note: -D properties will be applied to the conf used.
For example:
-D mapred.output.compress=true
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
-D mapred.output.compression.type=BLOCK
Additionally, the following SCAN properties can be specified
to control/limit what is exported..
-D hbase.mapreduce.scan.column.family=<familyName>
-D hbase.mapreduce.include.deleted.rows=true
For performance consider the following properties:
-Dhbase.client.scanner.caching=100
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
For tables with very wide rows consider setting the batch size as below:
-Dhbase.export.scanner.batch=10
Import: run the import command
The table must exist before the import:
create 'table_name','cf'
Run the import command:
hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop/tmp/hbase_export/
Optional -D parameters:
Usage: Import [options] <tablename> <inputdir>
By default Import will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimport.bulk.output=/path/for/output
To apply a generic org.apache.hadoop.hbase.filter.Filter to the input, use
-Dimport.filter.class=<name of filter class>
-Dimport.filter.args=<comma separated list of args for filter
NOTE: The filter will be applied BEFORE doing key renames via the HBASE_IMPORTER_RENAME_CFS property. Further, filters will only use the Filter#filterRowKey(byte[] buffer, int offset, int length) method to identify whether the current row needs to be ignored completely for processing and Filter#filterKeyValue(KeyValue) method to determine if the KeyValue should be added; Filter.ReturnCode#INCLUDE and #INCLUDE_AND_NEXT_COL will be considered as including the KeyValue.
For performance consider the following options:
-Dmapred.map.tasks.speculative.execution=false
-Dmapred.reduce.tasks.speculative.execution=false
-Dimport.wal.durability=<Used while writing data to hbase. Allowed values are the supported durability values like SKIP_WAL/ASYNC_WAL/SYNC_WAL/...>
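As the usage text above notes, Import can also generate HFiles for a bulk load instead of writing directly to HBase. A hedged sketch of that path, reusing the export directory from earlier (the HFile output directory is a hypothetical example):
# generate HFiles instead of writing through the HBase API
hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/tmp/hbase_import_hfiles table_name hdfs://flashhadoop/tmp/hbase_export/
# bulk-load the generated HFiles into the table
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hbase_import_hfiles table_name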
IV. Snapshot
A snapshot is a point-in-time image of an HBase table.
The snapshot feature must be enabled on the HBase cluster in advance:
<property>
<name>hbase.snapshot.enabled</name>
<value>true</value>
</property>
In the hbase shell, the snapshot, list_snapshots, clone_snapshot, restore_snapshot, and delete_snapshot commands can be used to create a snapshot, list snapshots, restore a table from a snapshot, create a new table from a snapshot, and delete a snapshot.
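A brief sketch of these commands (table and snapshot names are hypothetical):
hbase shell> snapshot 'table_name', 'table_name_snapshot'
hbase shell> list_snapshots
hbase shell> clone_snapshot 'table_name_snapshot', 'table_name_copy'
# restore_snapshot requires the table to be disabled first
hbase shell> disable 'table_name'
hbase shell> restore_snapshot 'table_name_snapshot'
hbase shell> enable 'table_name'
hbase shell> delete_snapshot 'table_name_snapshot'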
After a snapshot has been created, the ExportSnapshot tool can export it to another cluster for data backup or migration. The ExportSnapshot tool is used as follows (it must be run in push mode, i.e. from the current cluster toward the destination cluster):
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot table_name_snapshot -copy-to hdfs://flashhadoop_2/hbase -mappers 2
After running this command, the table_name_snapshot folder is copied into the /hbase/.hbase-snapshot directory on flashhadoop_2's HDFS. On the flashhadoop_2 HBase cluster, list_snapshots will now show a snapshot named table_name_snapshot. With clone_snapshot, the snapshot can be turned into a new table without creating the table in advance; the new table's region count and other metadata match the snapshot exactly. Alternatively, a table identical to the original can be created first and then restored with restore_snapshot, but this produces one extra region, and that region will be invalid.
After a snapshot has been used to copy one cluster's data to the new cluster, the application enables dual writes; the Export tool can then ship the data written between the snapshot and the start of dual writes into the new cluster, completing the migration. To make sure no data is lost, the time range given to Export can be widened somewhat.
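A hedged sketch of closing that gap with Export and Import; the timestamps (snapshot time and start of dual writes) are hypothetical, and the output directory is written straight to the new cluster's HDFS (Export arguments: <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]):
# on the source cluster: export the time window to the new cluster's HDFS
hbase org.apache.hadoop.hbase.mapreduce.Export table_name hdfs://flashhadoop_2/tmp/hbase_export_gap 1 1478448000000 1478600000000
# on the new cluster: import it into the cloned table
hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop_2/tmp/hbase_export_gap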
V. Replication
The replication mechanism can be used to run HBase clusters in master-slave mode, or even master-master mode, i.e. bidirectional synchronization between the two sides. The steps are as follows:
1. If the master and slave HBase clusters share the same ZooKeeper ensemble, zookeeper.znode.parent cannot be the default hbase on both; it can be set to, for example, hbase-master and hbase-slave. In short, the znode names in ZooKeeper must not conflict.
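A sketch of the corresponding hbase-site.xml entry on the master cluster (the znode values here are hypothetical examples; the slave cluster would use /hbase-slave):
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase-master</value>
</property>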
2. Add the following configuration to hbase-site.xml on both the master and slave clusters. (For a plain master-slave setup it is actually enough to set hbase.replication to true on the slave cluster; the other settings can be ignored.)
<property>
<name>hbase.replication</name>
<value>true</value>
</property>
<property>
<name>replication.source.nb.capacity</name>
<value>25000</value>
<description>Maximum number of entries the master cluster sends to the slave cluster per shipment; the default is 25000 and can be tuned to the cluster scale</description>
</property>
<property>
<name>replication.source.size.capacity</name>
<value>67108864</value>
<description>Maximum size of each batch of entries the master cluster ships to the slave cluster; the default is 64 MB</description>
</property>
<property>
<name>replication.source.ratio</name>
<value>1</value>
<description>Fraction of the slave cluster's RegionServers the master uses as replication targets; the default is 0.1 (0.15 in 1.x.x versions). Set it to 1 to make full use of the slave cluster's RegionServers</description>
</property>
<property>
<name>replication.sleep.before.failover</name>
<value>2000</value>
<description>How long the master cluster waits after a RegionServer goes down before failing over its replication queue; the default is 2 seconds. The actual sleep time is: sleepBeforeFailover + (long) (new Random().nextFloat() * sleepBeforeFailover)</description>
</property>
<property>
<name>replication.executor.workers</name>
<value>1</value>
<description>Number of threads performing replication; the default is 1 and can be increased if the write volume is heavy</description>
</property>
3. Restart the master and slave clusters. For a newly built cluster no restart is needed; just start it.
4. In the hbase shell on the master and slave clusters respectively, run:
add_peer 'ID' 'CLUSTER_KEY'
The ID must be a short integer. To compose the CLUSTER_KEY, use the following template:
hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
This will show you the help to setup the replication stream between both clusters. If both clusters use the same Zookeeper cluster, you have to use a different zookeeper.znode.parent since they can't write in the same folder.
1. Add replication of data tables from the primary HBase to the disaster-recovery HBase:
add_peer '1', "VECS00840,VECS00841,VECS00842,VECS00843,VECS00844:2181:/hbase"
2. Add replication of data tables from the disaster-recovery HBase back to the primary HBase:
add_peer '2', "VECS00994,VECS00995,VECS00996,VECS00997,VECS00998:2181:/hbase"
3. Then create tables with completely identical structure and attributes on the primary and standby clusters (note: they must be completely identical). Create the table on both clusters:
hbase shell>
create 't_warehouse_track', {NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
4. In the hbase shell on the primary cluster:
enable_table_replication 't_warehouse_track'
5. In the hbase shell on the disaster-recovery cluster:
disable 'your_table'
alter 'your_table', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'your_table'
The 1 in REPLICATION_SCOPE => '1' here has nothing to do with the peer ID set in the add_peer steps above; this value is only 0 or 1, indicating whether replication is enabled or disabled for the column family.
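To double-check the setup, the configured peers and replication status can be inspected from the hbase shell (a brief sketch):
hbase shell> list_peers
hbase shell> status 'replication'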