1. Optimized code structure for multi-table joins:
select .. from JOINTABLES (A,B,C) WITH KEYS (A.key, B.key, C.key) where ....
When every table in a multi-table join is joined on the same key (as in the conceptual structure above), Hive optimizes the whole chain into a single MapReduce job (see the sketch below).
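A minimal HiveQL sketch of what this means in practice (the tables a, b, c and columns key, val are purely illustrative):
-- All ON clauses use the same key of a, so Hive compiles the chain into one MR job.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key)
       JOIN c ON (a.key = c.key);
-- If the second join used a different key, e.g. ON (b.key2 = c.key),
-- Hive would need two MR jobs instead.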
2. Left Semi-Join efficiently implements the semantics of IN/EXISTS subqueries
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
(1) Before Left Semi-Join was available, Hive had to express this semantics as:
SELECT t1.key, t1.value
FROM a t1
LEFT OUTER JOIN (SELECT DISTINCT key FROM b) t2 ON t1.key = t2.key
WHERE t2.key IS NOT NULL;
(2) This can be replaced with a Left Semi-Join as follows:
SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key);
This saves at least one MR pass. Note that the join condition of a LEFT SEMI JOIN must be an equality.
3. Pre-sorting to reduce the data scanned by map join and group by (HIVE-1194)
(1) Pre-sort important report tables; this only requires enabling the hive.enforce.sorting option (see the sketch below).
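A minimal sketch of pre-sorting such a table (pv_sorted is a hypothetical table name and the bucket count of 32 is arbitrary; page_view/userid/pageid are borrowed from the map join example below):
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
-- Bucket and sort on the key that later joins / group bys will use.
CREATE TABLE pv_sorted (userid BIGINT, pageid STRING)
CLUSTERED BY (userid) SORTED BY (userid) INTO 32 BUCKETS;
-- With the enforce options on, the insert produces correctly sorted buckets.
INSERT OVERWRITE TABLE pv_sorted
SELECT userid, pageid FROM page_view;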
(2) If the tables in a MapJoin are all sorted, this feature lets the join avoid scanning the full tables, which greatly speeds it up. Enable it with
hive.optimize.bucketmapjoin.sortedmerge=true for a significant performance gain.
set hive.mapjoin.cache.numrows = 10000000;
set hive.mapjoin.size.key = 100000;
INSERT OVERWRITE TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
(3) Sorted Group By (HIVE-931)
A GROUP BY on an already-sorted column does not need an extra MR pass, which improves execution efficiency (see the sketch below).
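A sketch of a map-side group by over sorted data, assuming the hypothetical pv_sorted table defined above (hive.map.groupby.sorted is the switch that lets Hive finish the aggregation in the mapper when the grouping key matches the table's bucket/sort key):
set hive.map.groupby.sorted = true;
-- The grouping key equals the sort/bucket key, so no extra MR stage is needed.
SELECT userid, count(1) AS pv
FROM pv_sorted
GROUP BY userid;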
4. One-pass PV/UV computation framework
(1) Submit multiple MR jobs in parallel
hive.exec.parallel [default: false]
hive.exec.parallel.thread.number [default: 8]
(2) One-pass computation framework, combined with multi group by
With small data volumes, multiple UNIONs are optimized into a single job;
conversely, when the computation is heavy, enable parallel submission of the MR jobs to relieve the pressure;
use two levels of GROUP BY to work around count(distinct) data skew (full example below, followed by a minimal sketch).
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 2;
FROM (
  SELECT yw_type,
         sum(case when log_type = 'pv'    then ct end) as pv,
         sum(case when log_type = 'pv'    then 1  end) as uv,
         sum(case when log_type = 'click' then ct end) as ipv,
         sum(case when log_type = 'click' then 1  end) as ipv_uv
  FROM (
    SELECT yw_type, log_type, uid, count(1) as ct
    FROM (
      SELECT 'total' yw_type, 'pv'    log_type, uid FROM pv_log
      UNION ALL
      SELECT 'cat'   yw_type, 'click' log_type, uid FROM click_log
    ) t
    GROUP BY yw_type, log_type, uid
  ) t
  GROUP BY yw_type
) t
INSERT OVERWRITE TABLE tmp_1 SELECT pv, uv, ipv, ipv_uv WHERE yw_type = 'total'
INSERT OVERWRITE TABLE tmp_2 SELECT pv, uv, ipv, ipv_uv WHERE yw_type = 'cat';
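To isolate the two-level GROUP BY trick from the larger query (a minimal sketch reusing the pv_log/uid names above):
-- Skew-prone form: a single reducer has to deduplicate every uid.
--   SELECT count(DISTINCT uid) FROM pv_log;
-- Two-stage rewrite: the first GROUP BY spreads deduplication across reducers,
-- and the outer query only counts the already-deduplicated rows.
SELECT count(1) AS uv
FROM (SELECT uid FROM pv_log GROUP BY uid) t;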
5. Controlling the number of maps and reduces in Hive
(1) Merge small files
set mapred.max.split.size = 100000000;
set mapred.min.split.size.per.node = 100000000;
set mapred.min.split.size.per.rack = 100000000;
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Setting hive.input.format as above makes Hive combine small files when forming splits: files larger than the 128 MB block size are split at 128 MB; pieces between 100 MB and 128 MB are split at 100 MB (mapred.max.split.size); and everything under 100 MB (small files plus the leftovers from splitting large files) is merged together, which in this example produced 74 splits.
(2) Increase the number of map/reduce tasks for expensive jobs
set mapred.reduce.tasks = 10;
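mapred.reduce.tasks sets the reducer count directly, while the map count follows from the split size; a sketch for raising it (the 32 MB value is only illustrative, and assumes the CombineHiveInputFormat configuration shown above):
-- Smaller max split size => more splits => more map tasks.
set mapred.max.split.size = 32000000;
-- Reducer count is set explicitly.
set mapred.reduce.tasks = 10;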
6. Use random numbers to reduce data skew
Joins between large tables easily skew on NULL/empty join keys, because all of those rows are sent to the same reducer; scattering them with a random key avoids the hot reducer:
SELECT a.uid
FROM big_table_a a
LEFT OUTER JOIN big_table_b b
  ON b.uid = CASE WHEN a.uid IS NULL OR length(a.uid) = 0
                  THEN concat('rd_sid', rand())
                  ELSE a.uid END;