Symptom

While running a query on the Tez engine, I ran into a bug:

SELECT
         t1.*,t2.activity_id,t3.timeMap
     from
         (select * from ods_order_info where dt='2020-03-29') t1  -- returns 7 rows when queried alone
             left join
         (select order_id,activity_id  from ods_activity_order where dt='2020-03-29') t2  -- t1 left join t2 returns 7 rows
         on t1.id=t2.order_id
             join
         (select order_id,str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') timeMap
          from ods_order_status_log where dt='2020-03-29'
          group by order_id) t3  -- queried alone, returns 7 rows whose order_id values match t1's primary keys
         on t3.order_id=t1.id

By rights this query should return 7 rows, but it actually returns only 4.

Running the same query on the MR engine returns the correct 7 rows:

set hive.execution.engine=mr;

Cause analysis

Tez, like MR, enables map join by default. Several parameters are involved:

-- whether to convert joins to map joins automatically; default is true
set hive.auto.convert.join=true;

-- size threshold (in bytes) that separates the small table from the large table for map join
set hive.mapjoin.smalltable.filesize=25000000;

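-- when multiple map joins are merged into one, caps the total input size; affects Tez; default is 10 MB
set hive.auto.convert.join.noconditionaltask.size=10000000;

When a table's data exceeds 10 MB, Tez cuts off the excess, and the truncated rows are lost from the result.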
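One way to check whether the join really is being converted to a map join is to look at the query plan. The following is only a sketch: it uses a trimmed-down version of the query above (t1 joined to t3, without the timeMap column), and it assumes that a converted join shows up in the EXPLAIN output as a "Map Join Operator".

```sql
-- Sketch: inspect the Tez plan for the t1/t3 join from the failing query.
set hive.execution.engine=tez;

explain
select t1.id, t3.order_id
from (select * from ods_order_info where dt='2020-03-29') t1
join (select order_id
      from ods_order_status_log
      where dt='2020-03-29'
      group by order_id) t3
on t3.order_id = t1.id;
```

If the plan lists a Map Join Operator for this join, the parameters above are in play for the query.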

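Solutions

1.

Set hive.auto.convert.join.noconditionaltask.size to the same value as hive.mapjoin.smalltable.filesize, or larger. Increasing it 10x is generally safe, and it ensures that all of the small table's data takes part in the computation.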

2.

Disable map join.

Reference: https://blog.csdn.net/qq_37714755/article/details/105438009
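As session-level settings, the two options might look like the sketch below. It assumes that option 1 means raising hive.auto.convert.join.noconditionaltask.size to at least hive.mapjoin.smalltable.filesize; the value 250000000 is only an illustration (10x the 25000000 shown earlier), not a recommended setting.

```sql
-- Option 1 (assumed reading): raise the merged map join input cap
-- to 10x the small-table threshold of 25000000 bytes.
set hive.mapjoin.smalltable.filesize=25000000;
set hive.auto.convert.join.noconditionaltask.size=250000000;

-- Option 2: turn off automatic map join conversion for this session.
set hive.auto.convert.join=false;
```

Apply only one of the two options, then rerun the query on Tez and confirm that it returns 7 rows.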
