Today I'll summarize some of the optimization techniques I have picked up while working with Hive; I hope you find them useful. Hive optimization is where a programmer's technical skill really shows, and it is one of interviewers' favorite topics.
Tip 1: Control the number of reducers
The following lines are printed every time we run a SQL statement on the Hive command line:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Many people wonder what these lines actually mean. Let's go through them one by one, starting with
set hive.exec.reducers.bytes.per.reducer=<number>
This is a Hive command that sets the maximum number of bytes each reducer processes while the SQL statement runs.
- Run set hive.exec.reducers.bytes.per.reducer=200000; to cap each reducer at 200000 bytes.
- Then run the SQL:
select user_id,count(1) as cnt
from orders group by user_id limit 20;
Running the SQL above prints the following to the console:
Number of reduce tasks not specified. Estimated from input data size: 159
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 159
The first line of the console output is: Number of reduce tasks not specified. Estimated from input data size: 159. In other words, no reducer count was specified, so Hive estimated 159 reducer tasks from the size of the input data. Now look at the last line: number of mappers: 1; number of reducers: 159. It confirms that the SQL ultimately runs with 159 reducers. So if we know how large the data is, we can control the number of reducers simply by setting how much data each reducer handles with set hive.exec.reducers.bytes.per.reducer.
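As a rough sanity check, Hive's size-based estimate is essentially a ceiling division of the input size by the bytes-per-reducer setting. The sketch below is a simplified model, and the ~31.7 MB input size is inferred from the log above (159 reducers × 200000 bytes), not stated in it:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer):
    """Simplified model of Hive's reducer estimate:
    one reducer per bytes_per_reducer of input, rounded up."""
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

# With bytes.per.reducer=200000 and roughly 31.7 MB of input, the
# estimate matches the "Estimated from input data size: 159" log line.
print(estimate_reducers(31_700_000, 200_000))  # -> 159
```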
Next, look at
set hive.exec.reducers.max=<number>
This is also a Hive command; it sets the maximum number of reducers. If we set number to 50, the reducer count can never exceed 50.
Let's verify that this is true:
- Run set hive.exec.reducers.max=8; to cap the number of reducers at 8.
- Run the same SQL again:
select user_id,count(1) as cnt
from orders group by user_id limit 20;
The console prints the following:
Number of reduce tasks not specified. Estimated from input data size: 8
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0020, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0020/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8
The first line of the output is: Number of reduce tasks not specified. Estimated from input data size: 8. The reducer count is 8, which confirms our claim: the set hive.exec.reducers.max=8; command sets an upper bound on the number of reducers.
Finally, look at set mapreduce.job.reduces=<number>
This Hive command sets the reducer count directly: how many reducers will process the data when the SQL runs. Let's verify set mapreduce.job.reduces the same way as above:
- Run set mapreduce.job.reduces=5; to set the number of reducers to 5.
- Run the SQL again:
select user_id,count(1) as cnt
from orders group by user_id limit 20;
The console prints the following:
Number of reduce tasks not specified. Defaulting to jobconf value of: 5
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0026, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0026/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 5
From the two lines Number of reduce tasks not specified. Defaulting to jobconf value of: 5 and number of mappers: 1; number of reducers: 5, we can see that 5 reducers are generated.
If we change the value from 5 to 15 and run the same SQL, select user_id,count(1) as cnt
from orders group by user_id limit 20;, the console prints:
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 15
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1538917788450_0027, Tracking URL = http://hadoop-master:8088/proxy/application_1538917788450_0027/
Kill Command = /usr/local/src/hadoop-2.6.1/bin/hadoop job -kill job_1538917788450_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 15
As expected, the reducer count has changed from 5 to 15.
To summarize, there are three ways to control the number of reducers in Hive:
set hive.exec.reducers.bytes.per.reducer=<number>
set hive.exec.reducers.max=<number>
set mapreduce.job.reduces=<number>
Among them, set mapreduce.job.reduces=<number> has the highest priority, set hive.exec.reducers.max=<number> comes next, and set hive.exec.reducers.bytes.per.reducer=<number> has the lowest priority. Since Hive 0.14, the default amount of data a single reducer processes is 256 MB.
More reducers is not always better. Each reducer produces a file, and too many small files take up a lot of space in HDFS and waste resources. Too few reducers, on the other hand, can leave a single reducer handling a huge amount of data (exactly what happens with data skew), defeating Hadoop's divide-and-conquer design and even triggering OOM (out-of-memory) errors. The right number of reducers depends on the business scenario; different scenarios call for different approaches.
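The interaction of the three settings can be sketched as follows. This is a simplified model of the behavior observed above, not Hive's actual source code; the defaults shown are the Hive 0.14+ values (256 MB per reducer, a cap of 1009 reducers):

```python
import math

def hive_reducer_count(input_bytes,
                       bytes_per_reducer=256 * 1024 * 1024,  # hive.exec.reducers.bytes.per.reducer
                       max_reducers=1009,                    # hive.exec.reducers.max
                       job_reduces=-1):                      # mapreduce.job.reduces (-1 = unset)
    """Sketch of how the three settings interact: an explicit
    mapreduce.job.reduces wins outright; otherwise the size-based
    estimate is capped by hive.exec.reducers.max."""
    if job_reduces > 0:
        return job_reduces
    estimated = max(1, math.ceil(input_bytes / bytes_per_reducer))
    return min(estimated, max_reducers)

print(hive_reducer_count(31_700_000, bytes_per_reducer=200_000))                  # -> 159
print(hive_reducer_count(31_700_000, bytes_per_reducer=200_000, max_reducers=8))  # -> 8
print(hive_reducer_count(31_700_000, job_reduces=5))                              # -> 5
```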
Tip 2: Use map join
When a SQL statement joins several tables and one of them is smaller than 1 GB, a map join can noticeably speed up the query. If even the smallest table is larger than 1 GB, a map join will hit an OOM error.
Usage:
select /*+ MAPJOIN(a) */ a.*, b.* from table_a a join table_b b on a.id = b.id;
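Besides the explicit hint, Hive can also convert a join to a map join automatically when one side is small. The snippet below is a sketch; the threshold value is only illustrative, and defaults vary by Hive version, so check your own configuration:

```sql
-- Let Hive rewrite the join as a map join when one table is small enough.
set hive.auto.convert.join=true;
-- Size threshold (in bytes) below which a table counts as "small";
-- 25000000 (~25 MB) is an illustrative value, tune it for your cluster.
set hive.mapjoin.smalltable.filesize=25000000;

select a.*, b.*
from table_a a
join table_b b on a.id = b.id;
```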
Tip 3: Use distinct + union all instead of union
When you need the deduplication that union provides, distinct + union all performs better than union.
Usage of distinct + union all:
select count(distinct order_id, user_id, order_type)
from (
select order_id,user_id,order_type from orders where order_type='0' union all
select order_id,user_id,order_type from orders where order_type='1' union all
select order_id,user_id,order_type from orders where order_type='1'
)a;
Usage of union:
select count(*)
from(
select order_id,user_id,order_type from orders where order_type='0' union
select order_id,user_id,order_type from orders where order_type='1' union
select order_id,user_id,order_type from orders where order_type='1')t;
Tip 4: A general fix for data skew
Symptoms of data skew: the job sits at 99% progress for a long time; only a few reducer tasks remain unfinished, and those stragglers read and write an enormous amount of data, often more than 10 GB. This frequently happens during aggregation.
A general fix: set hive.groupby.skewindata=true;
This setting splits one MapReduce job into two.
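Conceptually, hive.groupby.skewindata does something like the manual rewrite below: the first aggregation adds a random component to the grouping key so hot keys scatter across reducers, and the second aggregation merges the partial results per key. This is a sketch of the idea, not Hive's literal execution plan (the table and column names match the user_log scenario):

```sql
-- Stage 1: partial counts; the rand() salt spreads each key
-- over up to 10 reducers instead of a single one.
-- Stage 2: merge the partial counts back together per key.
select user_id, sum(partial_cnt) as cnt
from (
    select user_id, count(*) as partial_cnt
    from user_log
    group by user_id, floor(rand() * 10)
) t
group by user_id;
```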
Here is a scenario I ran into: I needed to count each user's visits for a given day, with SQL like this:
select t.user_id,count(*) from user_log t group by t.user_id
After this statement ran, the job sat at 99% for an hour. When I later dug into the user_log table, I found that many rows had a null user_id. All of the rows with a null user_id are handled by a single reducer, which is what causes the data skew. There are two fixes:
1. Filter out the records whose user_id is null in the where clause.
2. Replace each null user_id with a random value, so the data is spread evenly across all the reducers.
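Both fixes can be written as follows (a sketch using the user_log table from the scenario above; the 'null_' prefix in the second query is just an illustrative way to keep the salted keys recognizable):

```sql
-- Fix 1: drop the null keys entirely.
select user_id, count(*) as cnt
from user_log
where user_id is not null
group by user_id;

-- Fix 2: replace null keys with a random value so those rows
-- spread across the reducers instead of piling onto one.
select user_id, count(*) as cnt
from (
    select coalesce(user_id, concat('null_', cast(rand() as string))) as user_id
    from user_log
) t
group by user_id;
```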