impala+hdfs+parquet格式文件


[創建目錄]
hdfs dfs -mkdir -p /user/hdfs/sample_data/parquet

[賦予權限]
sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/hdfs/sample_data

[刪除目錄]
hdfs dfs -rm -r /user/hdfs/sample_data/parquet

[上傳文件]
hdfs dfs -put -f device /user/hdfs/sample_data/parquet
hdfs dfs -put -f metrics /user/hdfs/sample_data/parquet

[查看文件]
hdfs dfs -ls /user/hdfs/sample_data/parquet

[impala建表,不帶分區](創建表之后,還需要通過下面的alter語句添加分區)
DROP TABLE IF EXISTS device_parquet;
CREATE EXTERNAL TABLE device_parquet
(
deviceId STRING,
deviceName STRING,
orgId STRING
)

STORED AS PARQUET
LOCATION '/user/hdfs/sample_data/parquet/device';

[impala建表,帶分區]
DROP TABLE IF EXISTS metrics_parquet;
CREATE EXTERNAL TABLE metrics_parquet
(
deviceId STRING,
reading BIGINT,
time STRING
)
partitioned by (year string)
STORED AS PARQUET
LOCATION '/user/hdfs/sample_data/parquet/metrics';

[添加表分區]
alter table metrics_parquet add partition (year="2017");
alter table metrics_parquet add partition (year="2018");

[刪除分區]
alter table metrics_parquet drop partition (year="2017");
alter table metrics_parquet drop partition (year="2018");

[查看表分區]
show partitions metrics_parquet;

[不指定分區查詢數據]
select
T_3C75F1.`deviceId`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select device_parquet.deviceId,reading,metrics_parquet.time as time from device_parquet,metrics_parquet where device_parquet.deviceId=metrics_parquet.deviceId) as `T_3C75F1`
group by
T_3C75F1.`deviceId`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);

耗時:device表50條,metrics表1億條(261M)執行上面的查詢語句,耗時平均135秒

[指定分區查詢數據]
select
T_3C75F1.`deviceId`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select device_parquet.deviceId,reading,metrics_parquet.time as time from device_parquet,metrics_parquet where device_parquet.deviceId=metrics_parquet.deviceId and year='2017') as `T_3C75F1`
group by
T_3C75F1.`deviceId`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);

耗時:device表50條,metrics表1億條(261M)執行上面的查詢語句,耗時平均96秒

[查詢多個分區的數據]
select
T_3C75F1.`deviceId`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select device_parquet.deviceId,reading,metrics_parquet.time as time from device_parquet,metrics_parquet where device_parquet.deviceId=metrics_parquet.deviceId and year in ('2017','2018')) as `T_3C75F1`
group by
T_3C75F1.`deviceId`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);

[刷新數據](hdfs中數據發生變化時,需要執行以下命令更新impala)
refresh device_parquet;
refresh metrics_parquet;

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM