Chapter 10: Hive Tuning [Large Table Join Large Table: bucketmapjoin]

Reposted article · 2022-02-10 19:42 · Hive

1. The three kinds of join in Hive
    1. Reduce join, also called Common Join or Shuffle Join
    2. Map Join
    3. Sort Merge Bucket Join (join on bucketed tables)

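2. SMB (Sort Merge Bucket) Join: joining bucketed tables
 Description : When two large tables are joined and the keys are evenly distributed, the job can still fail or run too long purely because of the data volume.
              In that case, consider bucketing both large tables to optimize the job.
 Principle :
            key % number of buckets = bucket number (more precisely, Hive takes a hash of the key before the modulo)
            Bucket 1 of table A is joined only with bucket 1 of table B, and likewise for every other bucket number.
        Note : Table A and table B must both be bucketed tables with the same bucketing rule, that is, clustered on the join key with matching bucket counts (and sorted on it for a sort-merge join).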
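The bucket-to-bucket idea above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the function names (`bucket_of`, `split_into_buckets`, `bucketed_join`) and the sample rows are hypothetical, and the bucket rule is the bare `key % bucket count` from the text rather than Hive's actual type-dependent hash.

```python
from collections import defaultdict

NUM_BUCKETS = 2  # same bucket count on both tables, as in the test case below


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Simplified bucket rule: key % bucket count (illustrative only)."""
    return key % num_buckets


def split_into_buckets(rows, num_buckets: int = NUM_BUCKETS):
    """Group (id, payload) rows by their bucket number."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(table_a, table_b, num_buckets: int = NUM_BUCKETS):
    """Inner join that pairs bucket i of A only with bucket i of B."""
    a_buckets = split_into_buckets(table_a, num_buckets)
    b_buckets = split_into_buckets(table_b, num_buckets)
    result = []
    for i in range(num_buckets):
        # Build a lookup for this one bucket of B; other buckets are never read.
        lookup = defaultdict(list)
        for key, payload in b_buckets[i]:
            lookup[key].append(payload)
        for key, payload in a_buckets[i]:
            for other in lookup.get(key, []):
                result.append((key, payload, other))
    return result


table_a = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")]
table_b = [(2, "b2"), (3, "b3"), (5, "b5")]
joined = bucketed_join(table_a, table_b)
print(sorted(joined))  # [(2, 'a2', 'b2'), (3, 'a3', 'b3')]
```

Because equal keys always land in the same bucket number on both sides, each bucket pair can be joined independently, which is what lets the work be split into smaller pieces.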
 Parameters :

            set hive.optimize.bucketmapjoin=true;
            set hive.optimize.bucketmapjoin.sortedmerge=true;
            set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
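The "sorted merge" part of these settings relies on both buckets already being sorted on the join key, so matching rows can be found in one forward pass. The following is a minimal Python sketch of that merge step, not Hive's implementation; `sort_merge_join` and the sample buckets are hypothetical names chosen for illustration.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, payload) that are already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key too small, advance left
        elif lk > rk:
            j += 1  # right key too small, advance right
        else:
            # Find the run of equal keys on each side, then emit all pairs.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lp in left[i:i_end]:
                for _, rp in right[j:j_end]:
                    result.append((lk, lp, rp))
            i, j = i_end, j_end
    return result


bucket_a = [(1, "a1"), (3, "a3"), (3, "a3x"), (7, "a7")]
bucket_b = [(3, "b3"), (5, "b5"), (7, "b7"), (7, "b7x")]
merged = sort_merge_join(bucket_a, bucket_b)
print(merged)
# [(3, 'a3', 'b3'), (3, 'a3x', 'b3'), (7, 'a7', 'b7'), (7, 'a7', 'b7x')]
```

Neither input has to be loaded into a hash table or re-sorted, which is why this style of join suits two tables that are both large.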

3. Test case

-- Control group
-- Unbucketed table A
create table bigtable2(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

-- Unbucketed table B
create table bigtable(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';

-- Load the data
load data local inpath '/root/bigtable' into table bigtable2;
load data local inpath '/root/bigtable' into table bigtable;

-- Join without bucketing
-- (jointable must already exist with the same seven columns)
set yarn.scheduler.maximum-allocation-mb=118784;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set yarn.nodemanager.vmem-pmem-ratio=4.2;
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s
join bigtable2 b on b.id = s.id;

Time taken: 109.024 seconds

-- Experimental group

-- Create bucketed table 1; the number of buckets should not exceed the number of CPU cores
-- Note: on older Hive releases, load data only copies the file into the table directory
-- and does not distribute rows into buckets; there, populating the table with
-- insert overwrite table ... select from an unbucketed table is the usual way to get real buckets.
create table bigtable_buck1(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
clustered by(id) sorted by(id) into 2 buckets
row format delimited fields terminated by '\t';
load data local inpath '/root/bigtable' into table bigtable_buck1;

-- Create bucketed table 2; the number of buckets should not exceed the number of CPU cores
create table bigtable_buck2(
    id bigint, t bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
clustered by(id) sorted by(id) into 2 buckets
row format delimited fields terminated by '\t';
load data local inpath '/root/bigtable' into table bigtable_buck2;

-- Parameter settings (enable the bucketed join)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b on b.id = s.id;

Time taken: 64.895 seconds
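To put the two reported timings side by side, the arithmetic can be worked out as below. The variable names are hypothetical, and the figures come from a single run of each query on the author's setup, so the ratio is indicative rather than a benchmark.

```python
baseline_seconds = 109.024  # unbucketed join, as reported in the test case
bucketed_seconds = 64.895   # bucketed join, as reported in the test case

speedup = baseline_seconds / bucketed_seconds
reduction = (baseline_seconds - bucketed_seconds) / baseline_seconds

print(f"speedup: {speedup:.2f}x")        # speedup: 1.68x
print(f"time saved: {reduction:.1%}")    # time saved: 40.5%
```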
