Transferring data between different systems and different storage formats (textfile, parquet)

Reposted article, dated 2018-06-17 22:47. Tags: data / hive / export / textfile / parquet

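Description:

Hive in the local test environment holds data stored as textfile. The data now has to be uploaded to the company's development environment, where the storage format is parquet. How can this be done?

tb_textfile table ---> local file ---> tb_parquet (❌)

tb_textfile table ---> local file ---> tb_textfile_tmp ---> tb_parquet (✔️)

[Because these are two different systems, the data in tb_textfile cannot be inserted directly into tb_parquet; it first has to be exported to a local file.]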

-- Create table tb_textfile: specify the delimiter, stored as textfile
create table if not exists tb_textfile(id int, name string) partitioned by(time string) row format delimited fields terminated by '\t' stored as textfile;

-- Load data into tb_textfile
insert into tb_textfile partition(time='20180616') values (111,'text111'),(222,'text222'),(333,'text333');

-- Export the tb_textfile data to a local directory, specifying the delimiter
insert overwrite local directory '/Users/wooluwalker/Desktop/export_test' row format delimited fields terminated by '\t' select * from tb_textfile;

-- The file 000000_0 appears in the target directory export_test

-- cat /Users/wooluwalker/Desktop/export_test/000000_0

        111    text111    20180616
        222    text222    20180616
        333    text333    20180616
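The exported file is plain tab-delimited text, which can be illustrated with a short Python sketch. The string `exported` is a stand-in rebuilt from the cat output above (assuming tab separators and one newline per row); it is not read from the real file.

```python
import csv
import io

# Stand-in for the contents of 000000_0, rebuilt from the cat output above.
# Assumption: fields are separated by tabs and each row ends with a newline.
exported = (
    "111\ttext111\t20180616\n"
    "222\ttext222\t20180616\n"
    "333\ttext333\t20180616\n"
)

# Parse the text the same way any tab-delimited reader would.
rows = list(csv.reader(io.StringIO(exported), delimiter="\t"))

for row in rows:
    print(row)
# ['111', 'text111', '20180616']
# ['222', 'text222', '20180616']
# ['333', 'text333', '20180616']
```

Every field comes back as a string, and because the export used `select *`, the partition column `time` is written out as a third field alongside `id` and `name`.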

-- Create table tb_parquet: specify the delimiter, stored as parquet
create table if not exists tb_parquet(id int, name string) partitioned by(time string) row format delimited fields terminated by '\t' stored as parquet;

-- Load the exported file from the export_test directory into the Hive table tb_parquet
load data local inpath '/Users/wooluwalker/Desktop/export_test/000000_0' into table tb_parquet partition(time='20180616');
-- View the loaded data
select * from tb_parquet;
The result returned is:
Failed with exception java.io.IOException:java.lang.RuntimeException: 
hdfs://0.0.0.0:9000/user/hive/warehouse/hivetest.db/tb_parquet/time=20180616/000000_0 is not a Parquet file. 
expected magic number at tail [80, 65, 82, 49] but found [54, 49, 54, 10]

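This shows that a file exported from a textfile table cannot be loaded directly into a parquet table: load data only copies the file into the table's directory and does not convert its format.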
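The two byte lists in the error message can be decoded to see what Hive actually found at the end of the file. The sketch below uses only the values printed in the error; `last_row` is an assumed reconstruction of the final exported row (tab-separated, ending in a newline).

```python
# Byte values copied from the error message.
expected_tail = bytes([80, 65, 82, 49])
found_tail = bytes([54, 49, 54, 10])

print(expected_tail)  # b'PAR1'
print(found_tail)     # b'616\n'

# Assumed reconstruction of the last row of 000000_0:
# id, name and the partition value, tab-separated, ending in a newline.
last_row = "333\ttext333\t20180616\n".encode("ascii")

# The bytes Hive found are simply the end of that text row.
print(last_row[-4:] == found_tail)  # True
```

A Parquet file ends with the 4-byte marker `PAR1`. What the reader found instead was `616` plus a newline, the tail of the partition value `20180616` in the last text row, so the file sitting in the partition directory is still plain text.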

Solution:
Load the data in the export_test directory into a Hive table stored as textfile, then insert from that table into the parquet table.
-- The data loaded in the previous step is in the wrong format and must be cleared first; otherwise select fails
truncate table tb_parquet;
-- Create the intermediate table tb_textfile_tmp: specify the delimiter, stored as textfile
create table if not exists tb_textfile_tmp(id int, name string) partitioned by(time string) row format delimited fields terminated by '\t' stored as textfile;
-- Load the data into the textfile intermediate table
load data local inpath '/Users/wooluwalker/Desktop/export_test/000000_0' into table tb_textfile_tmp partition(time='20180616');
-- Insert the data from the textfile intermediate table into the parquet target table tb_parquet
insert into tb_parquet partition(time='20180616') select id, name from tb_textfile_tmp;
-- View the table data
select * from tb_parquet;

111    text111    20180616
222    text222    20180616
333    text333    20180616

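Programming Hive states: 'Regardless of how the data is stored in the source table, Hive serializes all fields as strings when writing them to the files; Hive uses the same encoding in the generated output files as it uses for tables' internal storage.' The exported file is therefore plain text, and it cannot be loaded as-is into a parquet table.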
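Before running load data against a parquet table, a file can be checked for the Parquet magic bytes. The following is a minimal sketch, not part of Hive: `looks_like_parquet` is a hypothetical helper name, and the demo writes a temporary stand-in file rather than reading the real export.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path):
    """Return True if the file begins and ends with the Parquet magic bytes.

    Hypothetical helper: it checks the markers only and does not
    validate the footer metadata or the column data.
    """
    # Smallest plausible layout: head magic + 4-byte footer length + tail magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo on a temporary stand-in for the exported text file.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "000000_0")
    with open(text_path, "wb") as f:
        f.write(b"333\ttext333\t20180616\n")
    print(looks_like_parquet(text_path))  # False
```

A False result means the file should go through a textfile intermediate table first, as in the solution above. A True result only says the markers are present; it does not guarantee the file is a valid Parquet file.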
