sqoop增量導入hbase


背景

  因為hadoop/hive本質上不支持更新,所以hive不能夠采用update行級別的維度數據的更新。可以采用的變通的方式。

  • hive和hbase結合

我認為這是首選的方式,hbase本質上也是不支持行級更新,只不過是追加行加上時間戳,然后取最新的時間戳的數據而已,但是對於我們來說是透明的。可以在hbase中建立一張表,然后在hive中也建立這張維度表,再hive中將這張表映射到hbase中,然后在hbase中按照行級別更新維度數據就簡單多了。在ETL中,往往從其他的在線的系統的數據庫中導出有更新的維度信息,然后加載到hadoop,用MR更新到hbase的表,這樣就達到了更新hive中維度表的作用。

下面介紹下語句:

 

English Version:

  

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

 

Argument Description
--check-column (col) Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR)
--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) Specifies the maximum value of the check column from the previous import.

 

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

 

翻譯:==================================

 

翻譯上述段落的意思其實不難理解,增量導入共有三個參數

第一個參數:

     --check-column (col):控制增量的變量字段,這個字段最好不要是字符串類型的。比如說是time, id 等等字段。

 

第二個字段:

  --incremental (mode):增加的模式選擇,共有兩個選擇一個是 append, 一個是lastmodified.

 

第三個字段:

  

  --last-value (value): 根據第一個參數的變量,從哪里開始導入,例如這個參數是 --last-value 0 那么就從0開始導入。

 

加上其余的語句如下:

     sqoop import --connect jdbc:mysql://ip:port/db --table tablename --hbase-table namespace:tablename --column-family columnfamily --hbase-create-table -username 'username' -password 'password' --incremental append --check-column 'id' --last-value 0


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM