Hive compression settings: hive.exec.compress.output and hive.exec.compress.intermediate

This article is a repost (2021-08-16 11:44); the original source is linked at the end.

Compression settings:
Map/reduce output compression (typically used with SequenceFile storage)
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;

Intermediate compression within a job (Snappy is the common codec choice)
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;

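1. Whether to compress files:
Hadoop jobs are more often limited by I/O than by CPU. When that is the case, compressing files can improve Hadoop performance. If a job is CPU-bound, however, compression may not be a good fit, because compressing and decompressing files takes considerable CPU time. The best way to find the right configuration for a cluster is to run experiments and measure the results.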
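As a rough way to reason about this trade-off before benchmarking, the toy model below estimates how long it takes to read a dataset with and without compression. It is only a sketch: the function name, the throughput figures, and the assumption that disk reads and decompression never overlap are all made up for illustration, and a real job should still be measured on the cluster.

```python
def read_seconds(raw_mb, ratio=1.0, disk_mb_s=100.0, decompress_mb_s=None):
    """Estimate the seconds needed to read raw_mb megabytes of data.

    ratio is compressed size / raw size (1.0 means uncompressed).
    decompress_mb_s is how many raw megabytes per second the CPU can
    restore; None means no decompression step is needed.
    Assumes disk reads and decompression run one after the other.
    """
    io_time = raw_mb * ratio / disk_mb_s
    cpu_time = 0.0 if decompress_mb_s is None else raw_mb / decompress_mb_s
    return io_time + cpu_time


if __name__ == "__main__":
    raw_mb = 1000.0  # hypothetical dataset size
    print("uncompressed:", read_seconds(raw_mb))
    print("compressed, fast CPU:", read_seconds(raw_mb, 0.25, decompress_mb_s=500.0))
    print("compressed, slow CPU:", read_seconds(raw_mb, 0.25, decompress_mb_s=100.0))
```

With these invented numbers, compression wins when decompression is much faster than the disk and loses when the CPU is the slower resource, which is the I/O-bound versus CPU-bound distinction described above.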
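2. Compression formats
GZip and BZip2 are supported by all recent Hadoop versions, and native Linux libraries also support compressing and decompressing both formats.
Snappy is a more recent addition; on Hadoop installations that do not include it, you can add it yourself.
LZO is another frequently used format.
GZip and BZip2 produce the smallest compressed files but cost the most time; Snappy and LZO compress and decompress quickly but produce larger files. Which format to choose therefore depends on whether the workload is constrained by I/O or by CPU.
BZip2 and LZO support splitting compressed files (for LZO, only after the files have been indexed).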
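The size-versus-time trade-off can be checked locally with a small benchmark. The sketch below uses only Python's standard library, so it covers GZip and BZip2 but not Snappy or LZO, which need third-party packages. The sample data and the helper names `make_sample` and `benchmark` are invented for illustration; timings on a laptop say little about a cluster, so treat the output as a starting point only.

```python
import bz2
import gzip
import time


def make_sample(rows=20000):
    """Build a repetitive, log-like payload similar to delimited table rows."""
    lines = [f"{i}\tuser_{i % 100}\tclick\t2021-08-16\n" for i in range(rows)]
    return "".join(lines).encode("utf-8")


def benchmark(data):
    """Return {codec: (compressed_bytes, compress_seconds, decompress_seconds)}."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        mid = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        assert restored == data  # the round trip must be lossless
        results[name] = (len(packed), mid - start, end - mid)
    return results


if __name__ == "__main__":
    sample = make_sample()
    print(f"raw: {len(sample)} bytes")
    for name, (size, c_s, d_s) in benchmark(sample).items():
        print(f"{name}: {size} bytes, compress {c_s:.3f}s, decompress {d_s:.3f}s")
```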
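3. Intermediate compression

Intermediate compression applies to the data passed between a job's map tasks and reduce tasks. For this data it is best to choose a codec that costs little CPU time.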
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description> This controls whether intermediate files produced by Hive between
multiple map-reduce jobs are compressed. The compression codec and other options
are determined from hadoop config variables mapred.output.compress* </description>
</property>

Hadoop compression has a default codec, which you can override with the mapred.map.output.compression.codec property. This property can be set in mapred-site.xml or in hive-site.xml. SnappyCodec is a good choice because its CPU cost is low.
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description> If the map outputs are compressed, how should they be
compressed? </description>
</property>

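4. Final output compression

A job's final output can be compressed as well; the hive.exec.compress.output property controls this. If you only need compressed output for one particular job, you can set the property directly in that script instead of changing the configuration file.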
<property>
<name>hive.exec.compress.output</name>
<value>false</value>
<description> This controls whether the final outputs of a query
(to a local/hdfs file or a Hive table) is compressed. The compression
codec and other options are determined from hadoop config variables
mapred.output.compress* </description>
</property>

If hive.exec.compress.output is set to true, GZip is a reasonable codec choice: it compresses well and therefore reduces I/O. Note, however, that GZip-compressed files cannot be split.
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
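Why a GZip file cannot be split can be seen with a small experiment: a reader handed only the second half of the file finds no header and no marker from which to resume. The sketch below mimics that situation with Python's standard `gzip` module. It is not Hadoop's split logic, and the helper name `can_decode_from` is hypothetical.

```python
import gzip
import zlib


def can_decode_from(blob, offset):
    """Return True if gzip data can be decoded starting at the given byte offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No valid gzip header or usable stream at this position.
        return False


if __name__ == "__main__":
    rows = "".join(f"{i}\tvalue_{i}\n" for i in range(5000)).encode("utf-8")
    blob = gzip.compress(rows)
    print("decodable from start:", can_decode_from(blob, 0))
    print("decodable from middle:", can_decode_from(blob, len(blob) // 2))
```

A second mapper assigned the latter half of such a file would be in the same position as the mid-stream read here, so the whole file ends up being processed by a single task.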

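5. SequenceFiles
SequenceFiles let Hadoop split a file into blocks, and they remain splittable when compressed.
In Hive, a table can be stored as a SequenceFile like this: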
-- id and value are placeholder columns
CREATE TABLE a_sequence_file_table (id INT, value STRING) STORED AS SEQUENCEFILE;

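SequenceFiles have three compression types: NONE, RECORD, and BLOCK.
RECORD is the default.
BLOCK compression is usually more effective and still allows the file to be split. Like the other properties here, this one is not specific to Hive: it can be set in Hadoop's mapred-site.xml or Hive's hive-site.xml, or in a script or an interactive session.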

<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles,
how should they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
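The difference between RECORD and BLOCK compression can be illustrated without Hadoop. The sketch below compresses each record on its own and then compresses batches of records together, using `zlib` from Python's standard library. It is a simplified model: a real SequenceFile also has headers, sync markers, and separate key and value handling, and the function names and the `block_size` of 1000 records are assumptions made for this example.

```python
import zlib


def compress_per_record(records):
    """RECORD-style: every record is compressed on its own."""
    return [zlib.compress(r) for r in records]


def compress_in_blocks(records, block_size=1000):
    """BLOCK-style: batches of records are compressed together.

    Each block is independent, so a reader can start at any block boundary.
    """
    blocks = []
    for start in range(0, len(records), block_size):
        batch = records[start:start + block_size]
        blocks.append(zlib.compress(b"\n".join(batch)))
    return blocks


def read_block(block):
    """Decode one block back into its records without touching other blocks."""
    return zlib.decompress(block).split(b"\n")


if __name__ == "__main__":
    records = [f"{i}\tuser_{i % 50}\tclick".encode("utf-8") for i in range(5000)]
    per_record = compress_per_record(records)
    blocks = compress_in_blocks(records)
    print("raw bytes:", sum(len(r) for r in records))
    print("RECORD-style bytes:", sum(len(c) for c in per_record))
    print("BLOCK-style bytes:", sum(len(b) for b in blocks))
    print("first record of the third block:", read_block(blocks[2])[0])
```

Short records carry too little redundancy to compress well individually, while a batch exposes the repetition across records. Because every block decodes on its own, different tasks can each start at a different block, which is what keeps a block-compressed file splittable.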
————————————————
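Copyright notice: this is an original article by CSDN blogger 「djd已經存在」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/djd1234567/article/details/51581354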
