HIVE存儲格式ORC、PARQUET對比


  hive有三種默認的存儲格式,TEXT、ORC、PARQUET。TEXT是默認的格式,ORC、PARQUET是列存儲格式,占用空間和查詢效率是不同的,專門測試過后記錄一下。

一:建表語句差別

create table if not exists text(
a bigint
) partitioned by (dt string)
row format delimited fields terminated by '\001'
location '/hdfs/text/';

create table if not exists orc(
a bigint)
partitioned by (dt string)
row format delimited fields terminated by '\001'
stored as orc
location '/hdfs/orc/';

create table if not exists parquet(
a bigint)
partitioned by (dt string)
row format delimited fields terminated by '\001'
stored as parquet
location '/hdfs/parquet/';

 

其實就是stored as 后面跟的不一樣

二:HDFS存儲對比

parquet orc text
709M 275M 1G
687M 249M 1G
647M 265M 1G

 

三:查詢時間對比

parquet orc text
36.451 26.133 42.574
38.425 29.353 41.673
36.647 27.825 43.938

四:文件如何生成

val sparkSession = SparkSession.builder().master("local").appName("pushFunnelV3").getOrCreate()
val javasc = new JavaSparkContext(sparkSession.sparkContext)
val nameRDD = javasc.parallelize(util.Arrays.asList("{'name':'zhangsan','age':'18'}", "{'name':'lisi','age':'19'}")).rdd;
sparkSession.read.json(nameRDD).write.mode(SaveMode.Overwrite).csv("/data/aa")
sparkSession.read.json(nameRDD).write.mode(SaveMode.Overwrite).orc("/data/bb")
sparkSession.read.json(nameRDD).write.mode(SaveMode.Overwrite).parquet("/data/cc")


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM