Batch-inserting data into Hive/Impala

Reposted article. 2020-01-06 09:07. Tags: hive, Big Data, hdfs, impala, DB

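Problem description

There are a few thousand rows of data that need to be inserted into the corresponding Hive/Impala table. The task was assigned to a colleague, but after a long wait the report was that the insert still had not finished... His approach was to convert each row into its own insert statement. In practice this ran very slowly, with each row taking about 1s. That is far slower than batch-inserting into MySQL, so he complained that Impala was not very usable.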

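Problem analysis

First, it must be made clear that turning every row into its own insert statement is certainly the least efficient approach, whether in MySQL or in distributed components such as Hive and Impala.

With this approach, most of the resources go to connection setup, SQL parsing, and execution-plan generation; the cost of actually inserting the data is comparatively small.

So, to speed up batch inserts, the key is to cut the needless overhead and raise the throughput of each SQL statement, that is, to insert more data with as few SQL statements as possible.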

Solution

Test data:

aaa
bbb
ccc
ddd
eee
fff
ggg
hhh
iii
jjj

Test table:

create table if not exists test.test_batch_insert(
    f1 string
) comment 'test for batch insert'
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as textfile;

Approach 1 (the slowest): convert each row into an insert statement

Step 1: generate the SQL statements

In vim:
%s/^/insert into test.test_batch_insert select '/g
%s/$/';/g

Or with awk:
awk '{printf "insert into test.test_batch_insert select \"%s\";\n", $0}' test.txt > test.sql

Generated SQL script:

insert into test.test_batch_insert select "aaa";
insert into test.test_batch_insert select "bbb";
insert into test.test_batch_insert select "ccc";
insert into test.test_batch_insert select "ddd";
insert into test.test_batch_insert select "eee";
insert into test.test_batch_insert select "fff";
insert into test.test_batch_insert select "ggg";
insert into test.test_batch_insert select "hhh";
insert into test.test_batch_insert select "iii";
insert into test.test_batch_insert select "jjj";
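
One caveat with generating statements this way is that a value containing a double quote or a backslash would break the generated SQL. A minimal sketch of a safer generator follows, assuming that the target engine accepts backslash escapes inside double-quoted string literals; gen_insert_sql is a hypothetical helper name, not part of the original workflow:

```shell
# gen_insert_sql TABLE FILE: print one insert statement per input line.
# Backslashes and double quotes inside a value are backslash-escaped first
# (assumption: the SQL engine honors backslash escapes in string literals).
gen_insert_sql() {
    local table="$1" file="$2"
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$file" |
        awk -v table="$table" '{ printf "insert into %s select \"%s\";\n", table, $0 }'
}

# Possible usage:
#   gen_insert_sql test.test_batch_insert test.txt > test.sql
```

For the test data above, which has no special characters, the output is identical to the script shown.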

Step 2: run the generated SQL script

impala-shell -i data1 -f test.sql

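The statements run one at a time, which is rather slow...

Approach 2 (somewhat faster): insert as many rows as possible with one SQL statement

Step 1: convert to SQL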

awk 'BEGIN{print "insert into test.test_batch_insert"} NR>1{printf "select \"%s\" union all\n", prev} {prev=$0} END{if(NR>0) printf "select \"%s\";\n", prev}' test.txt > test2.sql

The same can be done with vim %s or sed. Note the use of union all: a plain union would silently drop duplicate rows.

Generated SQL script:

insert into test.test_batch_insert
select "aaa" union all
select "bbb" union all
select "ccc" union all
select "ddd" union all
select "eee" union all
select "fff" union all
select "ggg" union all
select "hhh" union all
select "iii" union all
select "jjj";

Step 2: run the generated SQL

Before running it, empty the table first;

impala-shell -i data1 -f test2.sql

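After running it, you will find that it is more than just a little faster...

However, this approach has a limitation...

The length of a single SQL statement is limited, so with a large data set, generating just one statement makes it too long to execute. In that case, consider splitting the file: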

split -l 500 test.txt test_split_

Then write a script that loops over each file chunk and repeats the steps above.
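
Such a loop could be sketched as follows. This is only an illustration under stated assumptions: gen_batch_sql and gen_all_sql are hypothetical helper names, values are assumed to contain no double quotes or backslashes, and union all is used so that duplicate rows are kept.

```shell
# gen_batch_sql TABLE FILE: print one insert statement covering every line of FILE.
# No escaping is done, so values must not contain double quotes or backslashes.
gen_batch_sql() {
    local table="$1" file="$2"
    awk -v table="$table" '
        NR == 1 { print "insert into " table }
        NR > 1  { printf "select \"%s\" union all\n", prev }
                { prev = $0 }
        END     { if (NR > 0) printf "select \"%s\";\n", prev }
    ' "$file"
}

# gen_all_sql TABLE FILE LINES_PER_CHUNK OUTDIR: split FILE into chunks and
# write one .sql file per chunk into OUTDIR.
gen_all_sql() {
    local table="$1" file="$2" lines="$3" outdir="$4" tmp chunk
    mkdir -p "$outdir"
    tmp=$(mktemp -d)
    split -l "$lines" "$file" "$tmp/chunk_"
    for chunk in "$tmp"/chunk_*; do
        [ -e "$chunk" ] || continue   # empty input: split produced no chunks
        gen_batch_sql "$table" "$chunk" > "$outdir/$(basename "$chunk").sql"
    done
    rm -rf "$tmp"
}

# Possible usage (data1 is the impalad host used throughout this article):
#   gen_all_sql test.test_batch_insert test.txt 500 sql_out
#   for f in sql_out/*.sql; do impala-shell -i data1 -f "$f"; done
```

Each chunk still costs one connection and one query plan, so the chunk size is a trade-off between statement length and the number of round trips.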

Approach 3 (the fastest, unless you have something better)

Step 1: first, look at the create-table statement of test.test_batch_insert:

impala-shell -i data1 -B -q "show create table test.test_batch_insert"

The create-table statement is as follows:

Query: show create table test.test_batch_insert
"CREATE TABLE test.test_batch_insert (
  f1 STRING
)
 COMMENT 'test for batch insert'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
WITH SERDEPROPERTIES ('field.delim'='\t', 'line.delim'='\n', 'serialization.format'='\t')
STORED AS TEXTFILE
LOCATION 'hdfs://xxxxxx:8020/user/hive/warehouse/test.db/test_batch_insert'
"

Note the LOCATION property, and look at that path on HDFS:

hdfs dfs -ls /user/hive/warehouse/test.db/test_batch_insert

Then look at the file contents:

hdfs dfs -cat /user/hive/warehouse/test.db/test_batch_insert/*data.0.

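See? It is just a readable plain-text file with one record per line, because \n was specified as the record delimiter when the table was created.

Having read this far, you can probably guess what comes next...

Step 2: upload the data file

First, empty test.test_batch_insert again;

Then upload the file: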

hdfs dfs -put test.txt /user/hive/warehouse/test.db/test_batch_insert

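At this point the data should be directly queryable from the Hive table; in Impala the table still needs to be refreshed:

Run in the impala-shell command-line window: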
refresh test.test_batch_insert;

And that's it...
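
Because this route bypasses SQL entirely, nothing checks the rows on their way in. A small pre-upload check along the following lines could catch lines whose field count does not match the table. It is only a sketch: count_bad_lines is a hypothetical helper, not part of any Hadoop tool, and it assumes the tab field delimiter from the table definition above.

```shell
# count_bad_lines FILE NFIELDS: print how many lines of FILE do not have
# exactly NFIELDS tab-separated fields (empty lines are counted as bad too).
count_bad_lines() {
    local file="$1" nfields="$2"
    awk -F '\t' -v n="$nfields" 'NF != n { bad++ } END { print bad + 0 }' "$file"
}

# Possible usage before the upload (the test table has a single column):
#   count_bad_lines test.txt 1
```

A result of 0 means every line has the expected number of fields.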

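In fact, Hive/Impala, like MySQL, have a corresponding load data statement... This section has simply shown what the load data statement actually does under the hood...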
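
For comparison, the load data route might look like the sketch below. The statement form is the standard load data inpath ... into table ...; load_data_sql and the staging directory /tmp/batch_staging are hypothetical names chosen for illustration.

```shell
# load_data_sql HDFS_PATH TABLE: print a load data statement that moves the
# file at HDFS_PATH into the directory of TABLE.
load_data_sql() {
    printf "load data inpath '%s' into table %s;\n" "$1" "$2"
}

# Possible usage; /tmp/batch_staging is a hypothetical HDFS staging directory:
#   hdfs dfs -mkdir -p /tmp/batch_staging
#   hdfs dfs -put test.txt /tmp/batch_staging/
#   impala-shell -i data1 -q "$(load_data_sql /tmp/batch_staging/test.txt test.test_batch_insert)"
```

Note that load data moves the file rather than copying it, so it disappears from the staging directory once the statement succeeds.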
