Chapter 10: Hive Tuning [Large Table Join Large Table: bucketmapjoin]


1. The three kinds of join in Hive
    1. Reduce Join, also called Common Join or Shuffle Join
    2. MapJoin
    3. Sort Merge Bucket Join (bucketed-table join)

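2. SMB (Sort Merge Bucket) Join: joining bucketed tables
Description: when two large tables are joined and the keys are evenly distributed, so that the job fails or runs too long purely because of the data volume,
consider bucketing both large tables to optimize the job.
Principle:
    hash(key) % number of buckets = bucket number (for a small non-negative integer key the hash is the key itself, so this is key % number of buckets)
    bucket N of table A is joined only with bucket N of table B
Note: table A and table B must both be bucketed tables with the same bucketing rule (same key and bucket count); for a sort-merge join each bucket must also be sorted on the join key.
Parameters:
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;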
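The principle above (same key lands in the same bucket number, so bucket N only needs to meet bucket N) can be illustrated outside Hive. The following is a minimal Python sketch, not Hive's actual implementation: the function names `bucketize`, `merge_join` and `smb_join` and the sample rows are hypothetical, and the bucket number is taken as plain `key % num_buckets`, which assumes small non-negative integer keys.

```python
from collections import defaultdict

NUM_BUCKETS = 2  # same bucket count as the test tables in this chapter


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Split (key, value) rows into buckets by key % num_buckets, each sorted by key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return {b: sorted(buckets[b]) for b in range(num_buckets)}


def merge_join(left, right):
    """Inner-join two key-sorted lists with one forward pass over each."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def smb_join(table_a, table_b, num_buckets=NUM_BUCKETS):
    """Join bucket n of table A only with bucket n of table B."""
    a = bucketize(table_a, num_buckets)
    b = bucketize(table_b, num_buckets)
    result = []
    for n in range(num_buckets):
        result.extend(merge_join(a[n], b[n]))
    return result


# Hypothetical sample data, not taken from the test tables.
table_a = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")]
table_b = [(2, "b2"), (3, "b3"), (3, "b3x"), (5, "b5")]
joined = smb_join(table_a, table_b)
print(joined)
```

Because both sides use the same bucketing rule, no row in bucket 0 of one table can match a row in bucket 1 of the other, so each bucket pair can be joined independently. Because each bucket is sorted, the join inside a pair is a single merge pass rather than a hash build or a shuffle.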
3. Test case
-- Control group
-- Non-bucketed table A
create table bigtable2(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

-- Non-bucketed table B
create table bigtable(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

-- Load the data
load data local inpath '/root/bigtable' into table bigtable2;
load data local inpath '/root/bigtable' into table bigtable;

-- Join without bucketing
-- Note: the two yarn.* values are normally cluster-side settings (yarn-site.xml); they are kept here as in the original test.
set yarn.scheduler.maximum-allocation-mb=118784;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set yarn.nodemanager.vmem-pmem-ratio=4.2;

-- jointable is assumed to already exist with the same seven columns (its DDL is not shown here).
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
join bigtable2 b on b.id = s.id;
-- Time taken: 109.024 seconds

-- Experimental group

-- Create bucketed table 1; the number of buckets should not exceed the number of CPU cores
create table bigtable_buck1(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
clustered by(id) sorted by(id) into 2 buckets
row format delimited fields terminated by '\t';

-- Note: depending on the Hive version, load data may copy the file as-is without bucketing or sorting it;
-- if so, fill the table with insert ... select from a staging table instead.
load data local inpath '/root/bigtable' into table bigtable_buck1;

-- Create bucketed table 2; the number of buckets should not exceed the number of CPU cores
create table bigtable_buck2(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
clustered by(id) sorted by(id) into 2 buckets
row format delimited fields terminated by '\t';

load data local inpath '/root/bigtable' into table bigtable_buck2;

-- Parameter settings (enable the bucketed join)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b on b.id = s.id;
-- Time taken: 64.895 seconds
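
To put the two reported timings side by side, the short Python sketch below computes the speedup and the relative time saved. The two numbers are the ones reported in this test (109.024 s without bucketing, 64.895 s with the bucketed join); the variable names are only illustrative. Each figure comes from a single run, so the result is a rough indication rather than a benchmark.

```python
baseline_s = 109.024   # join of the two non-bucketed tables (control group)
bucketed_s = 64.895    # join of the two bucketed tables (experimental group)

speedup = baseline_s / bucketed_s
reduction_pct = (baseline_s - bucketed_s) / baseline_s * 100

print(f"speedup: {speedup:.2f}x, time saved: {reduction_pct:.1f}%")
# prints "speedup: 1.68x, time saved: 40.5%"
```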

