An example of importing data from Oracle into Hive with Sqoop

Reposted article. Posted 2019-07-03 19:08. Tags: hive, sqoop, big data

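How Sqoop imports data into Hive, step by step:
1. It first imports the data into the HDFS directory given by --target-dir, stored as files (such as _SUCCESS and part-m-00000).
2. It creates the table in Hive.
3. It calls Hive's LOAD DATA INPATH to move the data from --target-dir into Hive.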
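
Step 3 can be pictured as a single Hive statement. The sketch below only prints such a statement and never calls Hive or HDFS. The helper name build_load_stmt and the sample arguments are made up for illustration, and the statement Sqoop actually generates may differ in detail.

```shell
# Hypothetical helper: print a Hive LOAD DATA statement that moves the files
# under an HDFS directory into a table, roughly what step 3 does.
# Arguments: <hdfs_dir> <db.table> [partition_spec]
build_load_stmt() {
    hdfs_dir=$1
    table=$2
    partition=$3
    stmt="LOAD DATA INPATH '${hdfs_dir}' OVERWRITE INTO TABLE ${table}"
    # Append the PARTITION clause only when a partition spec was given.
    if [ -n "$partition" ]; then
        stmt="${stmt} PARTITION (${partition})"
    fi
    printf '%s;\n' "$stmt"
}

# Example with placeholder names (not taken from a real cluster).
build_load_stmt /tmp/sqoop_staging/demo demo_db.demo_table "dt='2019-07-02'"
```

LOAD DATA INPATH without LOCAL moves the files instead of copying them, which matches the observation that the --target-dir directory is empty after the import.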

The following command imports data from an Oracle database into Hive. The database username and password are replaced with xxx:

sqoop import -D mapred.job.queue.name=hdpuser007_queue02 -D mapred.job.name=daily_registereduser_record_SQOOP \
--connect jdbc:oracle:thin:@localhost:1521:orcl \
--username xxx \
--password xxx \
--query "SELECT * FROM USERDATA.daily_registereduser_record WHERE ${updated} \$CONDITIONS" \
-m 1 \
--hive-table user_bhvr.orcl_USERDATA_daily_registereduser_record_delta \
--hive-drop-import-delims \
--null-non-string '\\N' \
--null-string '\\N' \
--target-dir /apps-data/hdpuser007/user_bhvr/orcl_USERDATA_daily_registereduser_record_delta \
--hive-partition-key y,m,d \
--hive-partition-value 2019,07,02 \
--hive-import \
--hive-overwrite \
--delete-target-dir
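
The command depends on a shell variable ${updated} that the article never shows. Because \$CONDITIONS follows it directly in the WHERE clause, the variable presumably holds a filter that ends in AND. The sketch below is one way it might be prepared. The column name UPDATED_DATE and the YYYY-MM-DD input format are assumptions, not details from the original job.

```shell
# Assumed input: the business date to import, as YYYY-MM-DD.
run_date="2019-07-02"

# Split the date into year, month, and day with POSIX parameter expansion.
y=${run_date%%-*}      # text before the first dash
rest=${run_date#*-}    # text after the first dash
m=${rest%%-*}          # month part
d=${rest#*-}           # day part

# Hypothetical filter: UPDATED_DATE is an invented column name.
# The trailing AND is needed because the query puts \$CONDITIONS right after it.
updated="UPDATED_DATE >= TO_DATE('${run_date}', 'YYYY-MM-DD') AND"

query="SELECT * FROM USERDATA.daily_registereduser_record WHERE ${updated} \$CONDITIONS"

printf '%s\n' "$query"
printf 'partition value: %s,%s,%s\n' "$y" "$m" "$d"
```

The printed query keeps the literal $CONDITIONS token for Sqoop to replace, and the three date parts line up with the value passed to --hive-partition-value.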

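To avoid ambiguity, check any syntax question against the official Apache documentation first. Running "sqoop version" shows that I am on version 1.4.5-cdh5.4.2. The Sqoop User Guide for that version is here: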

http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html

First, the options that belong to the Hive arguments:

--hive-import: use this whenever the data should end up in Hive; it needs little explanation. Official definition: Import tables into Hive (Uses Hive’s default delimiters if none are set.).
--hive-overwrite: without it, rerunning the same Sqoop command creates additional files under the same (specified) directory, such as part-m-00000 and part-m-00001. Official definition: Overwrite existing data in the Hive table.
--hive-drop-import-delims: official definition: Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-partition-key: official definition: Name of a hive field to partition are sharded on.
--hive-partition-value <v>: clearly the counterpart of the key above, and the value must be a string. Official definition: String-value that serves as partition key for this imported into hive in this job. Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string.

Next, the options that belong to the Import control arguments:

--warehouse-dir <dir>: this option goes together with --table, so it is not part of our example, but it is still worth a mention. Without it, Sqoop puts the files under the current user's default directory (By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files)). With it, Sqoop creates a directory under <dir> named after the table given to --table and stores the data files there. If several tables share the same parent directory, that parent can hold all of them. Official definition: HDFS parent for table destination.
--target-dir <dir>: in this example <dir> serves as a temporary staging path that holds the data files Sqoop imports; it is emptied once the load into Hive finishes. The option is required when importing an arbitrary SQL query, also called a free-form query, that is, when using --query (Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument. When importing a free-form query, you must specify a destination directory with --target-dir). Because an arbitrary query has no name, Sqoop does not know what to call the directory it should create in HDFS, so --target-dir has to be given in the command. Note that --target-dir and --warehouse-dir cannot be used together. Official definition: HDFS destination dir.
--delete-target-dir: adding this option is the safer choice. If an import leaves files in HDFS but no data in Hive, the import has to be rerun. On the rerun, if the directory given by --target-dir already exists in HDFS, the job fails (ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://.../apps-data/hdpuser007/user_bhvr/orcl_USERDATA_daily_registereduser_record_delta already exists). With --delete-target-dir, Sqoop deletes the directory automatically and the import then runs through cleanly.
--null-string <null-string>: how null values in string columns are handled. Official definition: The string to be interpreted as null for string columns.
--null-non-string <null-string>: how null values in non-string columns are handled. Official definition: The string to be interpreted as null for non string columns.
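
The example passes '\\N' with two backslashes inside single quotes. The short sketch below shows what the shell hands to Sqoop in that case, compared with a double-quoted form. As far as I understand it (an assumption, not something the article states), Sqoop then treats the value as an escaped string, so it ends up as \N, which is Hive's default null marker.

```shell
# Single quotes: the shell leaves both backslashes alone.
single='\\N'
# Double quotes: the shell collapses \\ into one backslash before Sqoop sees it.
double="\\N"

printf 'single-quoted: %s (%s characters)\n' "$single" "${#single}"
printf 'double-quoted: %s (%s characters)\n' "$double" "${#double}"
```

The single-quoted form reaches the program as three characters and the double-quoted form as two, which is why the quoting style matters for these two options.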

For more on --warehouse-dir and --target-dir, this article explains them clearly: http://f.dataguru.cn/hadoop-914126-1-1.html
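
The destination rules for the two options can be summarized in a few lines. The helper below is a hypothetical illustration (resolve_dest is not a Sqoop command); it only mimics the behavior described in the user guide and touches neither HDFS nor Sqoop.

```shell
# Hypothetical helper that mimics the destination rules described above.
# Usage: resolve_dest <user> <table> [warehouse_dir] [target_dir]
resolve_dest() {
    user=$1
    table=$2
    warehouse_dir=$3
    target_dir=$4
    if [ -n "$warehouse_dir" ] && [ -n "$target_dir" ]; then
        # The two options cannot be combined.
        echo "error: --warehouse-dir and --target-dir are mutually exclusive" >&2
        return 1
    elif [ -n "$target_dir" ]; then
        # --target-dir names the destination directory itself.
        printf '%s\n' "$target_dir"
    elif [ -n "$warehouse_dir" ]; then
        # --warehouse-dir is a parent; the table name becomes a subdirectory.
        printf '%s/%s\n' "$warehouse_dir" "$table"
    else
        # Default: a directory named after the table in the user's HDFS home.
        printf '/user/%s/%s\n' "$user" "$table"
    fi
}

resolve_dest someuser foo             # /user/someuser/foo
resolve_dest someuser foo /shared     # /shared/foo
resolve_dest someuser foo "" /dest    # /dest
```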

Other options:

-m 1: to import serially, run a single map task. The guide puts it this way: the query can be executed once and imported serially, by specifying a single map task with -m 1. The opposite is a parallel import, which has to be combined with --split-by (import the results of a query in parallel. You must also select a splitting column with --split-by). Serial or parallel, the query must contain $CONDITIONS (Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression). Note that in our example the WHERE clause may contain single quotes, so the query is wrapped in double quotes, and inside double quotes the token needs a backslash: "...... \$CONDITIONS". The guide says the same. Note: If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like:

　　"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

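The effect of the backslash is easy to see in the shell itself, without running Sqoop. The sketch below builds the guide's sample query twice and prints what each form would pass on.

```shell
# Make sure no variable named CONDITIONS leaks in from the environment.
unset CONDITIONS

# Unescaped inside double quotes: the shell expands $CONDITIONS (to nothing here).
unescaped="SELECT * FROM x WHERE a='foo' AND $CONDITIONS"
# Escaped: the literal token survives for Sqoop to replace.
escaped="SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

printf 'unescaped: [%s]\n' "$unescaped"
printf 'escaped:   [%s]\n' "$escaped"
```

In the unescaped form the token has already vanished by the time the query is handed over, so Sqoop would have nothing to substitute.
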
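Other cases:

If the table orcl_USERDATA_daily_registereduser_record_delta does not yet exist in Hive, the Sqoop command creates it automatically.

If the table already exists in the Hive database, the same Sqoop command still loads the data into it, provided the table structure and partitions are correct.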

References:

Apache Sqoop User Guide: http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
The difference between --target-dir and --warehouse-dir: http://f.dataguru.cn/hadoop-914126-1-1.html
Importing data into Hive with Sqoop: https://www.cnblogs.com/dongdone/p/5696233.html
Common Sqoop commands, explained in Chinese: http://www.aboutyun.com/thread-9983-1-1.html
The difference between ${} and #{}: https://www.cnblogs.com/eastwjn/p/9699966.html
Another good summary of common commands: https://blog.csdn.net/jerrydzan/article/details/88527619
