Creating a Hive ORC table with Sqoop and importing data into it
sqoop import \
  --connect jdbc:mysql://localhost:3306/spider \
  --username root --password 1234qwer \
  --table org_ic_track --driver com.mysql.jdbc.Driver \
  --create-hcatalog-table \
  --hcatalog-database spider_tmp \
  --hcatalog-table org_ic_track \
  --hcatalog-partition-keys batch \
  --hcatalog-partition-values 20190404 \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
  -m 1
Checking the resulting table structure
CREATE TABLE `org_ic_track`(
  `id` int,
  `info_id` int,
  `company` varchar(250),
  `company_url` varchar(250),
  `invest_date` varchar(150),
  `invested_company` varchar(500),
  `invested_ratio` varchar(100),
  `update_time` string)
PARTITIONED BY (
  `batch` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hadoop1:8020/home/hive/warehouse/spider_tmp.db/org_ic_track'
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'transient_lastDdlTime'='1554342988')
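The DDL above is what Hive reports for the table after the import. A minimal sketch of how to pull it up yourself, assuming the `hive` CLI is on the PATH (beeline would work too if that is what your cluster uses). The helper name `check_sql` is made up for illustration; it only prints the statements, so it is safe to run anywhere.

```shell
#!/bin/sh
# Print the HiveQL used to inspect a table created by the import.
# check_sql is a hypothetical helper name, not part of Hive or Sqoop.
check_sql() {
    db="$1"
    table="$2"
    printf 'USE %s;\n' "$db"
    printf 'SHOW CREATE TABLE %s;\n' "$table"
    printf 'SHOW PARTITIONS %s;\n' "$table"
}

# Dry run: only show the statements.
check_sql spider_tmp org_ic_track

# To actually run them (assumes the hive CLI is installed):
#   hive -e "$(check_sql spider_tmp org_ic_track)"
```

SHOW PARTITIONS is included because it is a quick way to confirm that the batch partition passed to Sqoop was really created.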
Importing data with Sqoop into an existing Hive ORC table
sqoop import \
  --connect jdbc:mysql://localhost:3306/spider \
  --username root --password 1234qwer \
  --table org_ic_track --driver com.mysql.jdbc.Driver \
  --hcatalog-database spider_tmp \
  --hcatalog-table org_ic_track \
  --hcatalog-partition-keys batch \
  --hcatalog-partition-values 20190405 \
  -m 1
Importing the result of a query with Sqoop into an existing Hive ORC table
sqoop import \
  --connect jdbc:mysql://localhost:3306/spider \
  --username root --password 1234qwer \
  --query "select * from org_ic_track where update_time between '2019-04-01 21:16:04' and '2019-04-01 21:16:05' and \$CONDITIONS" \
  --driver com.mysql.jdbc.Driver \
  --hcatalog-database spider_tmp \
  --hcatalog-table org_ic_track \
  --hcatalog-partition-keys batch \
  --hcatalog-partition-values 20190406 \
  -m 1
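The fiddly part of the query import is the quoting: the timestamps need single quotes, so the whole query sits in double quotes, and $CONDITIONS then has to be escaped so the shell hands the literal token to Sqoop. A small sketch that builds the query for an arbitrary time window and prints it, so you can check the result before running the import. The helper name `build_query` is hypothetical; the table and column names are the ones from the command above.

```shell
#!/bin/sh
# Build the free-form query for a given update_time window.
# build_query is a hypothetical helper, not part of Sqoop.
build_query() {
    start="$1"
    end="$2"
    # \$CONDITIONS is escaped so the literal token reaches Sqoop
    # instead of being expanded by the shell.
    printf "select * from org_ic_track where update_time between '%s' and '%s' and \$CONDITIONS" "$start" "$end"
}

query="$(build_query '2019-04-01 21:16:04' '2019-04-01 21:16:05')"
printf '%s\n' "$query"

# Hand it to Sqoop as a single argument (assumes sqoop is installed):
#   sqoop import --connect jdbc:mysql://localhost:3306/spider \
#     --username root --password 1234qwer \
#     --query "$query" --driver com.mysql.jdbc.Driver \
#     --hcatalog-database spider_tmp --hcatalog-table org_ic_track \
#     --hcatalog-partition-keys batch --hcatalog-partition-values 20190406 \
#     -m 1
```

With -m 1 no split column is needed. If you raise the mapper count for a free-form query, Sqoop also needs --split-by to know how to divide the work.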
Parameter reference
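--connect: JDBC connection string
--username: user name for JDBC authentication
--password: password for JDBC authentication
--table: name of the source table to import
--driver: JDBC driver class to use
--create-hcatalog-table: create the target table. If omitted, no table is created. Note that the import fails with an error if the table to be created already exists
--hcatalog-database: target database
--hcatalog-table: target table name
--hcatalog-storage-stanza: storage format; the value is appended to the generated create table statement. Default: stored as rcfile
--hcatalog-partition-keys: partition columns, comma-separated when there are several (an enhanced version of hive-partition-key)
--hcatalog-partition-values: partition values, comma-separated when there are several (an enhanced version of hive-partition-value)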
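Since the partition keys and values are two comma-separated lists paired by position, they should have the same number of items. A rough sketch of a pre-flight check, useful if the values are generated by a script. The helper names `count_items` and `check_partition_spec` are made up, and the second key `source` is a hypothetical example; the table in this post has only the `batch` partition.

```shell
#!/bin/sh
# Count the comma-separated items in a string.
# count_items and check_partition_spec are hypothetical helpers.
count_items() {
    # Delete everything except commas, then count them and add one.
    commas="$(printf '%s' "$1" | tr -cd ',')"
    echo $(( ${#commas} + 1 ))
}

# Print the two Sqoop options if keys and values line up, else fail.
check_partition_spec() {
    keys="$1"
    values="$2"
    if [ "$(count_items "$keys")" -ne "$(count_items "$values")" ]; then
        echo "partition keys and values differ in count: $keys vs $values" >&2
        return 1
    fi
    printf '%s %s %s %s\n' --hcatalog-partition-keys "$keys" \
        --hcatalog-partition-values "$values"
}

# Single partition, as used in this post
check_partition_spec batch 20190404

# Two partitions; "source" is a made-up second key for illustration
check_partition_spec batch,source 20190404,mysql
```

The printed options can be pasted into the sqoop import command as they are.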
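Note: if column types are not specified, a MySQL varchar column also ends up as varchar in Hive, and varchar columns cause a range of problems there:
1. Long text and text containing special characters is not extracted in full
2. Hive operations on varchar columns of an ORC table produce garbled characters
Solution: specify the column types when extracting the data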
--map-column-hive company=String,company_url=String
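Typing the mapping by hand gets tedious when a table has many varchar columns. A small sketch that builds the value from a list of column names, mapping each one to String. The helper name `map_to_string` is hypothetical; the column list is taken from the varchar columns in the DDL shown earlier.

```shell
#!/bin/sh
# Build a --map-column-hive value that maps every listed column to String.
# map_to_string is a hypothetical helper name, not part of Sqoop.
map_to_string() {
    mapping=""
    for col in "$@"; do
        # Append "col=String", inserting a comma between entries
        mapping="${mapping:+$mapping,}$col=String"
    done
    printf '%s\n' "$mapping"
}

# The varchar columns of org_ic_track
map_to_string company company_url invest_date invested_company invested_ratio

# Usage with Sqoop (assumes sqoop is installed):
#   sqoop import --connect jdbc:mysql://localhost:3306/spider \
#     --username root --password 1234qwer \
#     --table org_ic_track --driver com.mysql.jdbc.Driver \
#     --create-hcatalog-table \
#     --hcatalog-database spider_tmp --hcatalog-table org_ic_track \
#     --hcatalog-partition-keys batch --hcatalog-partition-values 20190404 \
#     --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
#     --map-column-hive "$(map_to_string company company_url)" \
#     -m 1
```

As far as I can tell the mapping only matters when Sqoop creates the table (--create-hcatalog-table). An existing table keeps the column types it was created with, so a table that already has varchar columns would need to be recreated or altered in Hive.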