Sqoop is a commonly used tool for offline synchronization of relational-database data into a data warehouse.
Sqoop offers two ways to import:
1) Import into HDFS first, then LOAD the data into a Hive table
2) Import directly into Hive
Part 1: Import into HDFS, then LOAD into a Hive table
1: First, use Sqoop to import a MySQL table into HDFS
Export the first 10 rows of the test table, keeping only the id, name and text columns.
The data will be written to the HDFS directory /user/hadoop.
bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/dbtest \
--username root \
--password root \
--query 'select id, name, text from test where $CONDITIONS LIMIT 10' \
--target-dir /user/hadoop \
--delete-target-dir \
--num-mappers 1 \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--direct \
--fields-terminated-by '\t'
If the source database is MySQL, you can add the --direct option; it delegates the export to MySQL's own dump tooling and is usually faster than a plain JDBC transfer.
2: Start Hive and create a table
drop table if exists default.hive_test;
create table default.hive_test(
  id int,
  name string,
  text string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
3: Load the HDFS data into the Hive table (after the load, the files are moved out of the source HDFS directory)
load data inpath '/user/hadoop' into table default.hive_test ;
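A quick way to see the move, as a sketch assuming the default warehouse location /user/hive/warehouse (adjust to your cluster's hive.metastore.warehouse.dir):

# before the LOAD: Sqoop's output files sit under the import directory
hdfs dfs -ls /user/hadoop

# after the LOAD: that directory is empty and the files now live under
# the table's warehouse directory
hdfs dfs -ls /user/hive/warehouse/hive_test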
4: Query the hive_test table
select * from hive_test;
Part 2: Import directly into Hive
1. Create and edit a script file: vi hivetest.sql
use test;
drop table if exists test.hive_test;
create table test.hive_test(
  id int,
  name string,
  text string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
2. Run the SQL script when starting Hive
hive -f /root/hivetest.sql
[root@cdh01 ~]# hive -f /root/hivetest.sql
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
OK
Time taken: 4.816 seconds
OK
Time taken: 0.19 seconds
OK
Time taken: 0.725 seconds
[root@cdh01 ~]#
3. Run the Sqoop import directly into Hive. The JDBC host must not be localhost, since the map tasks run on cluster nodes and need to reach the MySQL server.
sqoop import \
--connect jdbc:mysql://IP:3306/test \
--username root \
--password root \
--table test \
--fields-terminated-by '\t' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table hive_test
Parameter notes:
# (required) sqoop import; -D passes a Hadoop property, here the YARN queue name
sqoop import -D mapred.job.queue.name=q \
# (required) MySQL JDBC connection: xxxx path/yy database?mysql parameters (separate multiple parameters with &)
# tinyInt1isBit=false fixes TINYINT(1) columns being imported into Hive as null
--connect jdbc:mysql:xxxx/yy?tinyInt1isBit=false \
# (required) username, password and the source table
--username xx --password xx --table xx \
--delete-target-dir \
# (optional) textfile is the default format; parquet is also possible
--as-textfile \
# (optional) pick specific columns
--columns 'id,title' \
# (required) import into Hive
--hive-import \
# (required) field delimiter; must match the target Hive table's delimiter
--fields-terminated-by '\t' \
# (required) Hive database and table names
--hive-database xx \
--hive-table xx \
# (required) turn null strings and null non-strings into real NULLs in Hive
--null-string '\\N' --null-non-string '\\N' \
# (optional) partition info when writing into a partitioned Hive table; skip if the table is not partitioned
--hive-partition-key dt \
--hive-partition-value $day \
# how to write into Hive
--hive-overwrite \
--num-mappers 1 \
# (required) strip \n, \r and \01 from string fields when importing into Hive
--hive-drop-import-delims \
# Sqoop generates Java files for each import; --outdir sets where they go, which makes cleanup easier
--outdir xxx
# Create the target table during import (this flag applies to HBase imports; the Hive-side equivalent is --create-hive-table, listed below)
--hbase-create-table
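Putting the annotated parameters above together, a partitioned Hive import might look like the following sketch (queue name, connection string, table, columns, partition value and --outdir path are placeholders from the notes above, so adjust them to your environment):

sqoop import -D mapred.job.queue.name=q \
--connect "jdbc:mysql:xxxx/yy?tinyInt1isBit=false" \
--username xx --password xx \
--table xx \
--columns 'id,title' \
--delete-target-dir \
--as-textfile \
--hive-import \
--fields-terminated-by '\t' \
--hive-database xx \
--hive-table xx \
--null-string '\\N' --null-non-string '\\N' \
--hive-partition-key dt \
--hive-partition-value "$day" \
--hive-overwrite \
--hive-drop-import-delims \
--num-mappers 1 \
--outdir xxx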
Hive import parameters:
--hive-database: target Hive database name
--hive-table: target Hive table name
--hive-home: override $HIVE_HOME
--hive-import: import data into Hive, using Hive's default delimiters
--hive-overwrite: overwrite the existing data
--create-hive-table: create the table; this fails if the table already exists
--hive-table [table]: name of the table in Hive
--hive-drop-import-delims: strip \n, \r and \01 when importing into Hive
--hive-delims-replacement: replace \n, \r and \01 with a custom string when importing into Hive
--hive-partition-key: Hive partition key
--hive-partition-value: Hive partition value
--map-column-hive: type mapping from SQL types to Hive types
--query 'select * from test where id >10 and $CONDITIONS': free-form SQL statement; the $CONDITIONS token is required
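As a hedged sketch of a free-form --query import: when --num-mappers is greater than 1, Sqoop also needs --split-by so it can partition the query across mappers (the split column and target directory below are illustrative, not from the original run):

sqoop import \
--connect jdbc:mysql://IP:3306/test \
--username root --password root \
--query 'select id, name, text from test where id > 10 and $CONDITIONS' \
--split-by id \
--target-dir /user/root/test_query \
--delete-target-dir \
--fields-terminated-by '\t' \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-table hive_test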
Running the import:
[root@cdh01 ~]# sqoop import \ > --connect jdbc:mysql://192.168.230.101:3306/test \ > --username root \ > --password root \ > --table test \ > --fields-terminated-by '\t' \ > --delete-target-dir \ > --num-mappers 5 \ > --hive-import \ > --hive-database test \ > --hive-table hive_test Warning: /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 20/08/14 14:47:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.2 20/08/14 14:47:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 20/08/14 14:47:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 20/08/14 14:47:43 INFO tool.CodeGenTool: Beginning code generation Fri Aug 14 14:47:44 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification. 
20/08/14 14:47:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1 20/08/14 14:47:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1 20/08/14 14:47:45 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce 20/08/14 14:48:28 INFO mapreduce.Job: Job job_1597287071548_0008 running in uber mode : false 20/08/14 14:48:28 INFO mapreduce.Job: map 0% reduce 0% 20/08/14 14:48:49 INFO mapreduce.Job: map 20% reduce 0% 20/08/14 14:49:08 INFO mapreduce.Job: map 40% reduce 0% 20/08/14 14:49:25 INFO mapreduce.Job: map 60% reduce 0% 20/08/14 14:49:42 INFO mapreduce.Job: map 80% reduce 0% 20/08/14 14:50:00 INFO mapreduce.Job: map 100% reduce 0% 20/08/14 14:50:00 INFO mapreduce.Job: Job job_1597287071548_0008 completed successfully 20/08/14 14:50:00 INFO mapreduce.Job: Counters: 33 20/08/14 14:50:05 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:05 INFO session.SessionState: Created local directory: /tmp/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:05 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2/_tmp_space.db 20/08/14 14:50:05 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:05 INFO session.SessionState: Updating thread name to d2cee9ab-0c69-4748-9259-45a1ed4e38b2 main 20/08/14 14:50:05 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:06 INFO ql.Driver: Compiling command(queryId=root_20200814145005_35ad04c9-8567-4295-b490-96d4b495a085): CREATE TABLE IF NOT EXISTS `test`.`hive_test` ( `id` INT, `name` STRING, `text` STRING) COMMENT 'Imported by sqoop on 2020/08/14 14:50:01' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE 20/08/14 14:50:11 INFO hive.metastore: HMS client filtering is enabled. 20/08/14 14:50:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://cdh01:9083 20/08/14 14:50:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1 20/08/14 14:50:11 INFO hive.metastore: Connected to metastore. 
20/08/14 14:50:12 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 20/08/14 14:50:12 INFO parse.SemanticAnalyzer: Creating table test.hive_test position=27 20/08/14 14:50:12 INFO ql.Driver: Semantic Analysis Completed 20/08/14 14:50:12 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 20/08/14 14:50:12 INFO ql.Driver: Completed compiling command(queryId=root_20200814145005_35ad04c9-8567-4295-b490-96d4b495a085); Time taken: 6.559 seconds 20/08/14 14:50:12 INFO lockmgr.DummyTxnManager: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 20/08/14 14:50:12 INFO imps.CuratorFrameworkImpl: Starting 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh6.3.2--1, built on 11/08/2019 13:15 GMT 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:host.name=cdh01 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_202 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.8.0_202/jre 20/08/14 14:50:15 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode 20/08/14 14:50:15 INFO exec.StatsTask: Executing stats task 20/08/14 14:50:15 INFO hive.metastore: Closed a connection to metastore, current connections: 0 20/08/14 14:50:15 INFO hive.metastore: HMS client filtering is enabled. 20/08/14 14:50:15 INFO hive.metastore: Trying to connect to metastore with URI thrift://cdh01:9083 20/08/14 14:50:15 INFO hive.metastore: Opened a connection to metastore, current connections: 1 20/08/14 14:50:15 INFO hive.metastore: Connected to metastore. 20/08/14 14:50:16 INFO hive.metastore: Closed a connection to metastore, current connections: 0 20/08/14 14:50:16 INFO hive.metastore: HMS client filtering is enabled. 20/08/14 14:50:16 INFO hive.metastore: Trying to connect to metastore with URI thrift://cdh01:9083 20/08/14 14:50:16 INFO hive.metastore: Opened a connection to metastore, current connections: 1 20/08/14 14:50:16 INFO hive.metastore: Connected to metastore. 20/08/14 14:50:16 INFO exec.StatsTask: Table test.hive_test stats: [numFiles=5, numRows=0, totalSize=72, rawDataSize=0, numFilesErasureCoded=0] 20/08/14 14:50:16 INFO ql.Driver: Completed executing command(queryId=root_20200814145013_1557d3d4-f1aa-4813-be67-35f865ce527f); Time taken: 2.771 seconds OK 20/08/14 14:50:16 INFO ql.Driver: OK Time taken: 3.536 seconds 20/08/14 14:50:16 INFO CliDriver: Time taken: 3.536 seconds 20/08/14 14:50:16 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:16 INFO session.SessionState: Resetting thread name to main 20/08/14 14:50:16 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:16 INFO session.SessionState: Deleted directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 on fs with scheme hdfs 20/08/14 14:50:16 INFO session.SessionState: Deleted directory: /tmp/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 on fs with scheme file 20/08/14 14:50:16 INFO hive.metastore: Closed a connection to metastore, current connections: 0 20/08/14 14:50:16 INFO hive.HiveImport: Hive import complete. 20/08/14 14:50:16 INFO hive.HiveClientCommon: Export directory is contains the _SUCCESS file only, removing the directory. 
20/08/14 14:50:16 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting 20/08/14 14:50:16 INFO zookeeper.ZooKeeper: Session: 0x173e5b94d6d13c0 closed 20/08/14 14:50:16 INFO CuratorFrameworkSingleton: Closing ZooKeeper client. 20/08/14 14:50:16 INFO zookeeper.ClientCnxn: EventThread shut down
4. Check the result
hive> select * from hive_test;
OK
1   name1   text1
2   name2   text2
3   name3   text3
4   name4   text4
5   中文    測試
Time taken: 0.668 seconds, Fetched: 5 row(s)
hive>
Listing MySQL databases and tables with Sqoop
sqoop-list-databases --connect jdbc:mysql://127.0.0.1:3306 --username root --password root
sqoop-list-tables --connect jdbc:mysql://127.0.0.1:3306 --username root --password root
[root@cdh01 ~]# sqoop-list-databases --connect jdbc:mysql://127.0.0.1:3306 --username root --password root
Warning: /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/14 14:41:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.2
20/08/14 14:41:57 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/08/14 14:41:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Fri Aug 14 14:41:58 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
information_schema
metastore
mysql
performance_schema
scm
sys
test
[root@cdh01 ~]#
Part 3: Incremental import
There are two incremental modes:
1. append mode
2. lastmodified mode, which must be combined with --append (append new files) or --merge-key (merge on a key, usually the primary key)
Create the table and seed data in MySQL:
-- ----------------------------
-- Table structure for `testdata`
-- ----------------------------
DROP TABLE IF EXISTS `testdata`;
CREATE TABLE `testdata` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` char(20) DEFAULT NULL,
  `last_mod` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of testdata
-- ----------------------------
INSERT INTO `testdata` VALUES ('1', '1', '2019-08-28 17:34:51');
INSERT INTO `testdata` VALUES ('2', '2', '2019-08-28 17:31:57');
INSERT INTO `testdata` VALUES ('3', '3', '2019-08-28 17:31:58');
Import all the data into Hive
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata
Result:
[root@cdh01 ~]# sqoop import --connect jdbc:mysql://192.168.230.101:3306/test --username root --password root --table testdata --fields-terminated-by '\t' --delete-target-dir --num-mappers 1 --hive-import --hive-database test --hive-table testdata

hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
Time taken: 0.192 seconds, Fetched: 3 row(s)
hive>
Incremental import with --incremental append
Insert 2 more rows in MySQL:
INSERT INTO `testdata` VALUES ('4', '4', '2020-08-28 17:31:57');
INSERT INTO `testdata` VALUES ('5', '5', '2020-08-28 17:31:58');
--last-value 3 means rows with id up to and including 3 are skipped; only rows whose id is greater than 3 are imported.
Set --target-dir to the path where the Hive table's data files are stored.
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental append \
--check-column id \
--last-value 3
Result:
[root@cdh01 ~]# sqoop import --connect jdbc:mysql://192.168.230.101:3306/test --username root --password root --table testdata --fields-terminated-by '\t' --num-mappers 1 --hive-import --hive-database test --hive-table testdata --target-dir /user/root/test --incremental append --check-column id --last-value 3

hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
Time taken: 0.192 seconds, Fetched: 5 row(s)
hive>
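Tracking --last-value by hand is error-prone. As a hedged sketch, a saved Sqoop job (the job name incr_testdata is illustrative) records the last imported value in Sqoop's job metastore and reuses it automatically on each run:

# create the saved job once; everything after the bare "--" is a normal import
sqoop job --create incr_testdata -- import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root --password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental append \
--check-column id \
--last-value 3

# each execution imports only rows added since the previous run;
# note it may prompt for the password unless the metastore is configured to record it
sqoop job --exec incr_testdata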
Note: trying the same thing with lastmodified mode:
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental lastmodified \
--merge-key id \
--check-column last_mod \
--last-value '2019-08-30 17:05:49'
Parameter meanings:
--incremental lastmodified: incremental import driven by a time column (all rows whose time column is greater than or equal to the threshold are imported into Hadoop)
--check-column: the time column to check
--last-value: the threshold value
--merge-key: the merge column (primary key; records with the same key value are merged)
The statement above uses lastmodified mode for the incremental import, and it fails:
Error message: --incremental lastmodified option for hive imports is not supported. Please remove the parameter --incremental lastmodified
Cause: Sqoop does not support lastmodified incremental imports from MySQL into Hive, but it does support them from MySQL into HDFS.
Using --incremental append instead:
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental append \
--merge-key id \
--check-column last_mod \
--last-value '2019-08-30 17:05:49'
It does import incrementally, but it cannot merge the updated rows:
hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
Time taken: 0.192 seconds, Fetched: 5 row(s)
hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
2   222 2020-08-14 19:13:06.0
3   333 2020-08-14 19:13:04.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
Time taken: 0.172 seconds, Fetched: 9 row(s)
hive>
Because HDFS does not support modifying files in place, Sqoop's --incremental lastmodified and --hive-import cannot be used together; the import and the merge have to be done as two separate steps.
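A hedged sketch of that two-step approach (the directories and the generated class/jar path are illustrative; Sqoop prints the actual codegen jar it builds during the import): first run the lastmodified import against plain HDFS, then merge on the primary key with the sqoop merge tool, and finally load the merged files into the Hive table.

# Step 1: incremental import to HDFS only (no --hive-import)
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root --password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--target-dir /user/root/testdata_new \
--incremental lastmodified \
--check-column last_mod \
--last-value '2019-08-30 17:05:49' \
--append

# Step 2: merge the new rows into the existing data on the primary key,
# using the record class and jar that Sqoop generated for the table
sqoop merge \
--new-data /user/root/testdata_new \
--onto /user/root/testdata_base \
--target-dir /user/root/testdata_merged \
--jar-file /tmp/sqoop-root/compile/xxx/testdata.jar \
--class-name testdata \
--merge-key id

# Step 3: load the merged files into the Hive table, replacing the old data
hive -e "load data inpath '/user/root/testdata_merged' overwrite into table test.testdata"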
Done.