Sqoop is a commonly used tool for offline synchronization of relational-database data into a data warehouse.
Sqoop offers two ways to import:
1) Import into HDFS first, then LOAD the data into a Hive table
2) Import directly into Hive
Part 1: Import into HDFS, then LOAD into a Hive table
1: First, use Sqoop to import a MySQL table into HDFS
Export the first 10 rows of the test table, keeping only the id, name and text columns.
The data will be written to the HDFS directory /user/hadoop.
bin/sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/dbtest \
--username root \
--password root \
--query 'select id, name, text from test where $CONDITIONS LIMIT 10' \
--target-dir /user/hadoop \
--delete-target-dir \
--num-mappers 1 \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--direct \
--fields-terminated-by '\t'
If the source database is MySQL, you can add the --direct option; it delegates the export to MySQL's own dump tooling and is usually faster than a plain JDBC transfer.
2: Start Hive and create a table
drop table if exists default.hive_test;
create table default.hive_test(
  id int,
  name string,
  text string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
3: Load the HDFS data into the Hive table (after the load, the files are moved out of the source HDFS directory)
load data inpath '/user/hadoop' into table default.hive_test ;
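A quick way to see the move, as a sketch assuming the default warehouse location /user/hive/warehouse (adjust to your cluster's hive.metastore.warehouse.dir):

# before the LOAD: Sqoop's output files sit under the import directory
hdfs dfs -ls /user/hadoop

# after the LOAD: that directory is empty and the files now live under
# the table's warehouse directory
hdfs dfs -ls /user/hive/warehouse/hive_test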
4: Query the hive_test table
select * from hive_test;
Part 2: Import directly into Hive
1. Create and edit a script file: vi hivetest.sql
use test;
drop table if exists test.hive_test;
create table test.hive_test(
  id int,
  name string,
  text string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
2. Run the SQL script when starting Hive
hive -f /root/hivetest.sql
[root@cdh01 ~]# hive -f /root/hivetest.sql
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
OK
Time taken: 4.816 seconds
OK
Time taken: 0.19 seconds
OK
Time taken: 0.725 seconds
[root@cdh01 ~]#
3. Run the Sqoop import directly into Hive. The JDBC host must not be localhost, since the map tasks run on cluster nodes and need to reach the MySQL server.
sqoop import \
--connect jdbc:mysql://IP:3306/test \
--username root \
--password root \
--table test \
--fields-terminated-by '\t' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table hive_test
Parameter notes:
# (required) sqoop import; -D passes a Hadoop property, here the YARN queue name
sqoop import -D mapred.job.queue.name=q \
# (required) MySQL JDBC connection: xxxx path/yy database?mysql parameters (separate multiple parameters with &)
# tinyInt1isBit=false fixes TINYINT(1) columns being imported into Hive as null
--connect jdbc:mysql:xxxx/yy?tinyInt1isBit=false \
# (required) username, password and the source table
--username xx --password xx --table xx \
--delete-target-dir \
# (optional) textfile is the default format; parquet is also possible
--as-textfile \
# (optional) pick specific columns
--columns 'id,title' \
# (required) import into Hive
--hive-import \
# (required) field delimiter; must match the target Hive table's delimiter
--fields-terminated-by '\t' \
# (required) Hive database and table names
--hive-database xx \
--hive-table xx \
# (required) turn null strings and null non-strings into real NULLs in Hive
--null-string '\\N' --null-non-string '\\N' \
# (optional) partition info when writing into a partitioned Hive table; skip if the table is not partitioned
--hive-partition-key dt \
--hive-partition-value $day \
# how to write into Hive
--hive-overwrite \
--num-mappers 1 \
# (required) strip \n, \r and \01 from string fields when importing into Hive
--hive-drop-import-delims \
# Sqoop generates Java files for each import; --outdir sets where they go, which makes cleanup easier
--outdir xxx
# Create the target table during import (this flag applies to HBase imports; the Hive-side equivalent is --create-hive-table, listed below)
--hbase-create-table
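Putting the annotated parameters above together, a partitioned Hive import might look like the following sketch (queue name, connection string, table, columns, partition value and --outdir path are placeholders from the notes above, so adjust them to your environment):

sqoop import -D mapred.job.queue.name=q \
--connect "jdbc:mysql:xxxx/yy?tinyInt1isBit=false" \
--username xx --password xx \
--table xx \
--columns 'id,title' \
--delete-target-dir \
--as-textfile \
--hive-import \
--fields-terminated-by '\t' \
--hive-database xx \
--hive-table xx \
--null-string '\\N' --null-non-string '\\N' \
--hive-partition-key dt \
--hive-partition-value "$day" \
--hive-overwrite \
--hive-drop-import-delims \
--num-mappers 1 \
--outdir xxx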
Hive import parameters:
--hive-database: target Hive database name
--hive-table: target Hive table name
--hive-home: override $HIVE_HOME
--hive-import: import data into Hive, using Hive's default delimiters
--hive-overwrite: overwrite the existing data
--create-hive-table: create the table; this fails if the table already exists
--hive-table [table]: name of the table in Hive
--hive-drop-import-delims: strip \n, \r and \01 when importing into Hive
--hive-delims-replacement: replace \n, \r and \01 with a custom string when importing into Hive
--hive-partition-key: Hive partition key
--hive-partition-value: Hive partition value
--map-column-hive: type mapping from SQL types to Hive types
--query 'select * from test where id >10 and $CONDITIONS': free-form SQL statement; the $CONDITIONS token is required
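As a hedged sketch of a free-form --query import: when --num-mappers is greater than 1, Sqoop also needs --split-by so it can partition the query across mappers (the split column and target directory below are illustrative, not from the original run):

sqoop import \
--connect jdbc:mysql://IP:3306/test \
--username root --password root \
--query 'select id, name, text from test where id > 10 and $CONDITIONS' \
--split-by id \
--target-dir /user/root/test_query \
--delete-target-dir \
--fields-terminated-by '\t' \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-table hive_test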
Running the import:
[root@cdh01 ~]# sqoop import \ > --connect jdbc:mysql://192.168.230.101:3306/test \ > --username root \ > --password root \ > --table test \ > --fields-terminated-by '\t' \ > --delete-target-dir \ > --num-mappers 5 \ > --hive-import \ > --hive-database test \ > --hive-table hive_test Warning: /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail. Please set $ACCUMULO_HOME to the root of your Accumulo installation. SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 20/08/14 14:47:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.2 20/08/14 14:47:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 20/08/14 14:47:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 20/08/14 14:47:43 INFO tool.CodeGenTool: Beginning code generation Fri Aug 14 14:47:44 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification. 
20/08/14 14:47:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1 20/08/14 14:47:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1 20/08/14 14:47:45 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce 20/08/14 14:48:28 INFO mapreduce.Job: Job job_1597287071548_0008 running in uber mode : false 20/08/14 14:48:28 INFO mapreduce.Job: map 0% reduce 0% 20/08/14 14:48:49 INFO mapreduce.Job: map 20% reduce 0% 20/08/14 14:49:08 INFO mapreduce.Job: map 40% reduce 0% 20/08/14 14:49:25 INFO mapreduce.Job: map 60% reduce 0% 20/08/14 14:49:42 INFO mapreduce.Job: map 80% reduce 0% 20/08/14 14:50:00 INFO mapreduce.Job: map 100% reduce 0% 20/08/14 14:50:00 INFO mapreduce.Job: Job job_1597287071548_0008 completed successfully 20/08/14 14:50:00 INFO mapreduce.Job: Counters: 33 20/08/14 14:50:05 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:05 INFO session.SessionState: Created local directory: /tmp/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:05 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2/_tmp_space.db 20/08/14 14:50:05 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:05 INFO session.SessionState: Updating thread name to d2cee9ab-0c69-4748-9259-45a1ed4e38b2 main 20/08/14 14:50:05 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:06 INFO ql.Driver: Compiling command(queryId=root_20200814145005_35ad04c9-8567-4295-b490-96d4b495a085): CREATE TABLE IF NOT EXISTS `test`.`hive_test` ( `id` INT, `name` STRING, `text` STRING) COMMENT 'Imported by sqoop on 2020/08/14 14:50:01' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\011' LINES TERMINATED BY '\012' STORED AS TEXTFILE 20/08/14 14:50:11 INFO hive.metastore: HMS client filtering is enabled. 20/08/14 14:50:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://cdh01:9083 20/08/14 14:50:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1 20/08/14 14:50:11 INFO hive.metastore: Connected to metastore. 
20/08/14 14:50:12 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 20/08/14 14:50:12 INFO parse.SemanticAnalyzer: Creating table test.hive_test position=27 20/08/14 14:50:12 INFO ql.Driver: Semantic Analysis Completed 20/08/14 14:50:12 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 20/08/14 14:50:12 INFO ql.Driver: Completed compiling command(queryId=root_20200814145005_35ad04c9-8567-4295-b490-96d4b495a085); Time taken: 6.559 seconds 20/08/14 14:50:12 INFO lockmgr.DummyTxnManager: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 20/08/14 14:50:12 INFO imps.CuratorFrameworkImpl: Starting 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh6.3.2--1, built on 11/08/2019 13:15 GMT 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:host.name=cdh01 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_202 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation 20/08/14 14:50:12 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.8.0_202/jre 20/08/14 14:50:15 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode 20/08/14 14:50:15 INFO exec.StatsTask: Executing stats task 20/08/14 14:50:15 INFO hive.metastore: Closed a connection to metastore, current connections: 0 20/08/14 14:50:15 INFO hive.metastore: HMS client filtering is enabled. 20/08/14 14:50:15 INFO hive.metastore: Trying to connect to metastore with URI thrift://cdh01:9083 20/08/14 14:50:15 INFO hive.metastore: Opened a connection to metastore, current connections: 1 20/08/14 14:50:15 INFO hive.metastore: Connected to metastore. 20/08/14 14:50:16 INFO hive.metastore: Closed a connection to metastore, current connections: 0 20/08/14 14:50:16 INFO hive.metastore: HMS client filtering is enabled. 20/08/14 14:50:16 INFO hive.metastore: Trying to connect to metastore with URI thrift://cdh01:9083 20/08/14 14:50:16 INFO hive.metastore: Opened a connection to metastore, current connections: 1 20/08/14 14:50:16 INFO hive.metastore: Connected to metastore. 20/08/14 14:50:16 INFO exec.StatsTask: Table test.hive_test stats: [numFiles=5, numRows=0, totalSize=72, rawDataSize=0, numFilesErasureCoded=0] 20/08/14 14:50:16 INFO ql.Driver: Completed executing command(queryId=root_20200814145013_1557d3d4-f1aa-4813-be67-35f865ce527f); Time taken: 2.771 seconds OK 20/08/14 14:50:16 INFO ql.Driver: OK Time taken: 3.536 seconds 20/08/14 14:50:16 INFO CliDriver: Time taken: 3.536 seconds 20/08/14 14:50:16 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:16 INFO session.SessionState: Resetting thread name to main 20/08/14 14:50:16 INFO conf.HiveConf: Using the default value passed in for log id: d2cee9ab-0c69-4748-9259-45a1ed4e38b2 20/08/14 14:50:16 INFO session.SessionState: Deleted directory: /tmp/hive/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 on fs with scheme hdfs 20/08/14 14:50:16 INFO session.SessionState: Deleted directory: /tmp/root/d2cee9ab-0c69-4748-9259-45a1ed4e38b2 on fs with scheme file 20/08/14 14:50:16 INFO hive.metastore: Closed a connection to metastore, current connections: 0 20/08/14 14:50:16 INFO hive.HiveImport: Hive import complete. 20/08/14 14:50:16 INFO hive.HiveClientCommon: Export directory is contains the _SUCCESS file only, removing the directory. 
20/08/14 14:50:16 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting 20/08/14 14:50:16 INFO zookeeper.ZooKeeper: Session: 0x173e5b94d6d13c0 closed 20/08/14 14:50:16 INFO CuratorFrameworkSingleton: Closing ZooKeeper client. 20/08/14 14:50:16 INFO zookeeper.ClientCnxn: EventThread shut down
4. Check the result
hive> select * from hive_test;
OK
1   name1   text1
2   name2   text2
3   name3   text3
4   name4   text4
5   中文    測試
Time taken: 0.668 seconds, Fetched: 5 row(s)
hive>
Listing MySQL databases and tables with Sqoop
sqoop-list-databases --connect jdbc:mysql://127.0.0.1:3306 --username root --password root
sqoop-list-tables --connect jdbc:mysql://127.0.0.1:3306 --username root --password root
[root@cdh01 ~]# sqoop-list-databases --connect jdbc:mysql://127.0.0.1:3306 --username root --password root
Warning: /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/14 14:41:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.2
20/08/14 14:41:57 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/08/14 14:41:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Fri Aug 14 14:41:58 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
information_schema
metastore
mysql
performance_schema
scm
sys
test
[root@cdh01 ~]#
Part 3: Incremental import
There are two incremental modes:
1. append mode
2. lastmodified mode, which must be combined with --append (append new files) or --merge-key (merge on a key, usually the primary key)
Create the table and seed data in MySQL:
-- ----------------------------
-- Table structure for `testdata`
-- ----------------------------
DROP TABLE IF EXISTS `testdata`;
CREATE TABLE `testdata` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` char(20) DEFAULT NULL,
  `last_mod` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of testdata
-- ----------------------------
INSERT INTO `testdata` VALUES ('1', '1', '2019-08-28 17:34:51');
INSERT INTO `testdata` VALUES ('2', '2', '2019-08-28 17:31:57');
INSERT INTO `testdata` VALUES ('3', '3', '2019-08-28 17:31:58');
Import all the data into Hive
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata
Result:
[root@cdh01 ~]# sqoop import --connect jdbc:mysql://192.168.230.101:3306/test --username root --password root --table testdata --fields-terminated-by '\t' --delete-target-dir --num-mappers 1 --hive-import --hive-database test --hive-table testdata

hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
Time taken: 0.192 seconds, Fetched: 3 row(s)
hive>
Incremental import with --incremental append
Insert 2 more rows in MySQL:
INSERT INTO `testdata` VALUES ('4', '4', '2020-08-28 17:31:57');
INSERT INTO `testdata` VALUES ('5', '5', '2020-08-28 17:31:58');
--last-value 3 means rows with id up to and including 3 are skipped; only rows whose id is greater than 3 are imported.
Set --target-dir to the path where the Hive table's data files are stored.
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental append \
--check-column id \
--last-value 3
Result:
[root@cdh01 ~]# sqoop import --connect jdbc:mysql://192.168.230.101:3306/test --username root --password root --table testdata --fields-terminated-by '\t' --num-mappers 1 --hive-import --hive-database test --hive-table testdata --target-dir /user/root/test --incremental append --check-column id --last-value 3

hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
Time taken: 0.192 seconds, Fetched: 5 row(s)
hive>
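Tracking --last-value by hand is error-prone. As a hedged sketch, a saved Sqoop job (the job name incr_testdata is illustrative) records the last imported value in Sqoop's job metastore and reuses it automatically on each run:

# create the saved job once; everything after the bare "--" is a normal import
sqoop job --create incr_testdata -- import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root --password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental append \
--check-column id \
--last-value 3

# each execution imports only rows added since the previous run;
# note it may prompt for the password unless the metastore is configured to record it
sqoop job --exec incr_testdata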
Note: trying the same thing with lastmodified mode:
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental lastmodified \
--merge-key id \
--check-column last_mod \
--last-value '2019-08-30 17:05:49'
Parameter meanings:
--incremental lastmodified: incremental import driven by a time column (all rows whose time column is greater than or equal to the threshold are imported into Hadoop)
--check-column: the time column to check
--last-value: the threshold value
--merge-key: the merge column (primary key; records with the same key value are merged)
The statement above uses lastmodified mode for the incremental import, and it fails:
Error message: --incremental lastmodified option for hive imports is not supported. Please remove the parameter --incremental lastmodified
Cause: Sqoop does not support lastmodified incremental imports from MySQL into Hive, but it does support them from MySQL into HDFS.
Using --incremental append instead:
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root \
--password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table testdata \
--target-dir /user/root/test \
--incremental append \
--merge-key id \
--check-column last_mod \
--last-value '2019-08-30 17:05:49'
It does import incrementally, but it cannot merge the updated rows:
hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
Time taken: 0.192 seconds, Fetched: 5 row(s)
hive> select * from testdata;
OK
1   1   2019-08-28 17:34:51.0
2   2   2019-08-28 17:31:57.0
3   3   2019-08-28 17:31:58.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
2   222 2020-08-14 19:13:06.0
3   333 2020-08-14 19:13:04.0
4   4   2020-08-28 17:31:57.0
5   5   2020-08-28 17:31:58.0
Time taken: 0.172 seconds, Fetched: 9 row(s)
hive>
Because HDFS does not support modifying files in place, Sqoop's --incremental lastmodified and --hive-import cannot be used together; the import and the merge have to be done as two separate steps.
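A hedged sketch of that two-step approach (the directories and the generated class/jar path are illustrative; Sqoop prints the actual codegen jar it builds during the import): first run the lastmodified import against plain HDFS, then merge on the primary key with the sqoop merge tool, and finally load the merged files into the Hive table.

# Step 1: incremental import to HDFS only (no --hive-import)
sqoop import \
--connect jdbc:mysql://192.168.230.101:3306/test \
--username root --password root \
--table testdata \
--fields-terminated-by '\t' \
--num-mappers 1 \
--target-dir /user/root/testdata_new \
--incremental lastmodified \
--check-column last_mod \
--last-value '2019-08-30 17:05:49' \
--append

# Step 2: merge the new rows into the existing data on the primary key,
# using the record class and jar that Sqoop generated for the table
sqoop merge \
--new-data /user/root/testdata_new \
--onto /user/root/testdata_base \
--target-dir /user/root/testdata_merged \
--jar-file /tmp/sqoop-root/compile/xxx/testdata.jar \
--class-name testdata \
--merge-key id

# Step 3: load the merged files into the Hive table, replacing the old data
hive -e "load data inpath '/user/root/testdata_merged' overwrite into table test.testdata"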
Done.