Learning Hive Together: A Summary of Common Hive Optimization Tips


Today I'll summarize some optimization tips I've picked up while using Hive, and I hope they are helpful. Hive optimization is where an engineer's skill really shows, and it is one of interviewers' favorite topics.

Tip 1: Control the number of reducers

The following lines are printed every time we run a SQL statement on the Hive command line:

In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

Many people wonder what these lines are for. Let's go through them one by one, starting with

set hive.exec.reducers.bytes.per.reducer=<number>. This Hive command sets the maximum number of bytes each reducer processes during SQL execution. It can be set in the configuration file or directly on the command line. If the input is larger than number, more reducers are generated. For example, if number = 1 MB and the input is 10 MB, 10 reducers are generated. Let's verify whether this claim is correct:

  1. Run set hive.exec.reducers.bytes.per.reducer=200000; to limit each reducer to at most 200000 bytes.
  2. Run the SQL:
select user_id,count(1) as cnt 
  from orders group by user_id limit 20; 

Running the SQL above prints the following to the console:

  Number of reduce tasks not specified. Estimated from input data size: 159
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 159

The first line of the console output reads: Number of reduce tasks not specified. Estimated from input data size: 159. In other words, no reducer count was specified, so Hive estimated 159 reducer tasks from the input data size. Then look at the last line: number of mappers: 1; number of reducers: 159. This confirms that the query ultimately ran with 159 reducers. So if we know the size of the data, we can control the number of reducers simply by setting the number of bytes each reducer processes via set hive.exec.reducers.bytes.per.reducer.
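As a side note, running set with just the parameter name prints the value currently in effect, and the estimate itself can be sketched as below. The exact estimation logic may differ between Hive versions, and the input size shown is only inferred from the 159 in the log above:

```sql
-- Print the value currently in effect (Hive echoes name=value)
set hive.exec.reducers.bytes.per.reducer;

-- Rough sketch of the estimate:
--   reducers = min(hive.exec.reducers.max,
--                  ceil(total input bytes / hive.exec.reducers.bytes.per.reducer))
-- With 200000 bytes per reducer, an estimate of 159 implies roughly
-- 31.8 MB of input: ceil(31800000 / 200000) = 159.
```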

Next, look at set hive.exec.reducers.max=<number>. This is also a Hive command; it sets the maximum number of reducers. If we set number to 50, at most 50 reducers will be used.
Let's verify whether this claim is correct:

  1. Run set hive.exec.reducers.max=8; to cap the number of reducers at 8.
  2. Run the same SQL again:
select user_id,count(1) as cnt 
  from orders group by user_id limit 20; 

The console prints:

Number of reduce tasks not specified. Estimated from input data size: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8

The first line of the console output reads: Number of reduce tasks not specified. Estimated from input data size: 8. The number of reducers is 8, which confirms our claim: set hive.exec.reducers.max=8; sets an upper bound on the number of reducers.

Finally, look at set mapreduce.job.reduces=<number>. This Hive command sets the number of reducers directly, i.e., how many reducers will process the data when the SQL runs. Let's verify set mapreduce.job.reduces= using the same method as above.

  1. Run set mapreduce.job.reduces=5; to set the number of reducers to 5.
  2. Run the same SQL again:
select user_id,count(1) as cnt 
  from orders group by user_id limit 20; 

The console prints:

Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0026, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0026/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5

From Number of reduce tasks not specified. Defaulting to jobconf value of: 5 and number of mappers: 1; number of reducers: 5 we can tell that 5 reducers were generated.

If we change the value from 5 to 15 and run the same SQL, select user_id,count(1) as cnt from orders group by user_id limit 20;, the console prints:

Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 15
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0027, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0027/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job  -kill job_1538917788450_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 15

As expected, the number of reducers has changed from 5 to 15.

To sum up, there are three ways to control the number of reducers in Hive:

set hive.exec.reducers.bytes.per.reducer=<number> 
set hive.exec.reducers.max=<number>
set mapreduce.job.reduces=<number>

Among them, set mapreduce.job.reduces=<number> has the highest priority, set hive.exec.reducers.max=<number> comes next, and set hive.exec.reducers.bytes.per.reducer=<number> has the lowest priority. Since Hive 0.14, the default size of the data handled by one reducer is 256 MB.
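A minimal session sketch of the precedence, assuming the same orders table as in the examples above: when all three are set, the explicit reducer count wins.

```sql
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=50;
set mapreduce.job.reduces=10;  -- explicit count: overrides the two settings above

-- Runs with exactly 10 reducers.
select user_id, count(1) as cnt
  from orders group by user_id;

-- Restore Hive's automatic estimation afterwards.
set mapreduce.job.reduces=-1;
```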

More reducers is not always better. Each reducer produces its own output file, and too many small files occupy a lot of space in HDFS and waste resources. On the other hand, too few reducers can leave a single reducer processing a huge amount of data (exactly what happens with data skew), defeating Hadoop's divide-and-conquer design and sometimes even causing OOM errors. The right number of reducers depends on the business scenario, and different scenarios call for different approaches.

Tip 2: Use map joins

When a SQL query joins multiple tables and one of them is smaller than 1 GB, a map join can noticeably improve performance. If even the smallest table is larger than 1 GB, a map join will run into OOM errors.
Usage:

select /*+ MAPJOIN(table_a) */ a.*, b.* from table_a a join table_b b on a.id = b.id
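Besides the explicit hint, Hive can also convert a join to a map join automatically. A sketch with the same hypothetical table names; hive.auto.convert.join and hive.mapjoin.smalltable.filesize are real Hive parameters, but check the defaults for your version:

```sql
-- Let Hive pick the map join automatically when one side is small enough
set hive.auto.convert.join=true;
-- Tables below this size (in bytes) count as "small" and are loaded into memory
set hive.mapjoin.smalltable.filesize=25000000;

select a.*, b.*
  from table_a a join table_b b on a.id = b.id;
```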

Tip 3: Use distinct + union all instead of union

When you need union for deduplication, distinct + union all usually performs better than union.
distinct + union all usage:

select count(distinct order_id, user_id, order_type)
from (
select order_id,user_id,order_type from orders where order_type='0' union all
select order_id,user_id,order_type from orders where order_type='0' union all 
select order_id,user_id,order_type from orders where order_type='1'
)a;

union usage:

select count(*) 
from(
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='1')t;

Tip 4: A general fix for data skew

Symptoms of data skew: job progress sits at 99% for a long time; only a few reducer tasks are left unfinished, and those tasks read and write very large amounts of data, often more than 10 GB. This frequently happens during aggregation.
A general fix: set hive.groupby.skewindata=true;
This splits the job into two MapReduce jobs: the first distributes map output randomly across reducers and pre-aggregates it, and the second performs the final aggregation by key.
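The same two-stage idea can also be written by hand with a random salt key. A sketch against the orders table used earlier; the salt column and the factor 10 are arbitrary choices for illustration:

```sql
-- Stage 1: attach a random salt to each row, then pre-aggregate
-- by (user_id, salt) so hot keys are split across reducers.
-- Stage 2: merge the partial counts back per user_id.
select user_id, sum(partial_cnt) as cnt
from (
  select user_id, salt, count(1) as partial_cnt
  from (
    select user_id, cast(rand() * 10 as int) as salt
      from orders
  ) s
  group by user_id, salt
) t
group by user_id;
```

Note that salting with rand() can give incorrect results if failed tasks are retried, so the built-in hive.groupby.skewindata=true route is usually the safer choice.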

Let me describe a scenario I once ran into. I needed to count each user's visits for a given day, with SQL like this:

select t.user_id,count(*) from user_log t group by t.user_id

After running this statement, the job stayed at 99% for a whole hour. When I later analyzed the user_log table, I found that user_id was null in a large number of rows. All rows with a null user_id are handled by a single reducer, which causes the data skew. There are two fixes:
1. Filter out the records whose user_id is null with a where condition.
2. Replace null user_id values with random values, so that the data is spread evenly across all reducers.
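Both fixes can be sketched as follows, using the user_log table from the scenario; the null_ prefix and the factor 100 in the second query are arbitrary illustration choices:

```sql
-- Fix 1: drop the null keys entirely
select user_id, count(*) as cnt
  from user_log
 where user_id is not null
 group by user_id;

-- Fix 2: scatter null keys with a random suffix so they spread
-- across reducers instead of all landing on one
select user_id, count(*) as cnt
from (
  select coalesce(user_id, concat('null_', cast(rand() * 100 as int))) as user_id
    from user_log
) t
group by user_id;
```

With the second query the null rows are counted under synthetic null_* keys; if you still need the total for the null bucket, sum those groups afterwards.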

