[Hive] - Common Optimizations and Execution Plan Analysis


1.HiveSQL Optimization

  1.1 Core Idea

    This section uses Hive on MapReduce as the example; the same reasoning applies to Hive on Spark and other engines.

    HiveSQL is ultimately translated into MapReduce for execution, so a prerequisite for optimization is at least a basic understanding of MapReduce.

    Next, you must understand what kind of MapReduce job a given HiveSQL statement is translated into (the execution plan); this is the fundamental basis for optimizing HiveSQL. Remember: optimizing HiveSQL is, in essence, optimizing MapReduce jobs.

    For example, some characteristics of MapReduce:

      Reads and writes both go to HDFS (disk), so both are I/O operations

      It dislikes any single task being too large (data skew). A classic conclusion: data volume is not the problem; data skew is

      It dislikes large numbers of tiny tasks. Task initialization and management (resource requests, etc.) themselves cost time and resources, so with many tiny tasks most of the time and resources go to task overhead

    HiveSQL optimization therefore targets exactly these characteristics

  1.2 Common Optimization Ideas

    1.2.1 I/O

      Query only the columns you need. Hive prunes the columns a query does not reference, so they are never read, which lowers I/O

      Use table partitions wherever possible. With a partition filter, MapReduce skips all files of the unneeded partitions entirely, dramatically lowering I/O
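As a small sketch of the two I/O rules (the access_log table, its columns, and the dt partition column are hypothetical):

```sql
-- Bad: reads every column of every partition.
SELECT * FROM access_log;

-- Better: only the referenced columns are read (column pruning), and only
-- the files of the dt='2018-06-01' partition are scanned (partition pruning).
SELECT uid, url
FROM access_log
WHERE dt = '2018-06-01';
```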

    1.2.2 Data Skew

        1.2.2.1 Use count(distinct) with caution

          Be cautious with count(distinct) because it easily causes data skew. The MapReduce job it compiles to groups by the GROUP BY key, sorts on the distinct column, and hands the rows to the Reduce side.

          The problem is that, compared with other GROUP BY aggregations, count(distinct) lacks one key step: map-side pre-aggregation (aggregating once on the Map side and sending only the partial results to Reduce)

          When the Map side hands all raw rows to Reduce and the grouping itself is unbalanced (e.g., a pass/fail flag where 80% of the rows are "pass"), some Reducers end up processing far too much data. That is data skew

          Consider replacing count(distinct) with a GROUP BY subquery
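One common form of that rewrite, sketched here with the dept_et table used later in this article: the inner GROUP BY deduplicates and benefits from normal map-side pre-aggregation, and the outer query only counts rows that are already distinct.

```sql
-- Skew-prone: each city's rows are sorted and de-duplicated on one reducer.
SELECT city, COUNT(DISTINCT name)
FROM dept_et
GROUP BY city;

-- Rewrite: deduplicate first (with map-side pre-aggregation),
-- then count the already-distinct rows.
SELECT city, COUNT(1)
FROM (
  SELECT city, name
  FROM dept_et
  GROUP BY city, name
) t
GROUP BY city;
```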

        1.2.2.2 Watch out for data skew caused by NULL values

          All NULLs are treated as the same value and routed to the same Reducer; if NULLs make up a large share of the data, that is another source of skew. Consider whether the business logic allows filtering them out

          The same applies to business-level "null" values (common default sentinels such as 0, 1, -1, -99)
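Two common patterns for NULL keys, sketched with hypothetical tables a and b and a hypothetical join key uid:

```sql
-- Option 1: if the business allows it, filter NULL keys out before the join.
SELECT a.*, b.info
FROM a
JOIN b ON a.uid = b.uid
WHERE a.uid IS NOT NULL;

-- Option 2: keep the NULL rows (LEFT JOIN) but replace their key with a
-- random value that can never match; the NULL rows then scatter across
-- reducers instead of all landing on one, and the join result is unchanged.
SELECT a.*, b.info
FROM a
LEFT JOIN b
  ON (CASE WHEN a.uid IS NULL
           THEN concat('null_', cast(rand() AS string))
           ELSE cast(a.uid AS string) END) = cast(b.uid AS string);
```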

     1.2.3 Joins

         Put the large table last. MapReduce builds the join from back to front, so filtering the large table first brings the data volume down, which reduces the data handled by the Reduce-side hash join and improves efficiency

         Join on the same column when possible. If every table joins on the same column, any number of tables can be handled in a single MapReduce job; if the join columns differ, an additional MapReduce job is spawned
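Sketched with hypothetical tables a, b, and c: when every ON clause uses the same column, the whole join chain compiles into one job, while a different key in the chain spawns an extra one.

```sql
-- Same join column (uid) throughout: one MapReduce job handles all three tables.
SELECT a.uid, b.name, c.city
FROM a
JOIN b ON a.uid = b.uid
JOIN c ON b.uid = c.uid;

-- Mixed join columns: the join with c is on a different key (city_id),
-- so an additional MapReduce job is spawned for it.
SELECT a.uid, b.name, c.city
FROM a
JOIN b ON a.uid = b.uid
JOIN c ON b.city_id = c.city_id;
```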

    1.2.4 Configuration Tuning

       Configuration here means MapReduce or Spark settings
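A few Hive settings commonly tuned for the problems above (the values shown are illustrative, not recommendations; check the defaults of your Hive version):

```sql
-- Map-side pre-aggregation for GROUP BY (on by default in recent Hive versions).
SET hive.map.aggr=true;

-- Two-job plan for skewed GROUP BY keys: the first job scatters rows randomly
-- and pre-aggregates; the second job does the final merge by key.
SET hive.groupby.skewindata=true;

-- How many reducers Hive derives: roughly one reducer per this many input bytes.
SET hive.exec.reducers.bytes.per.reducer=256000000;

-- Merge the small output files of map-only jobs, avoiding the
-- "many tiny tasks" problem in downstream queries.
SET hive.merge.mapfiles=true;
```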

 

2.How HiveSQL Translates to MapReduce

  2.1 Cases That Do Not Run MapReduce

    Not every HiveSQL statement runs MapReduce. Simple fetch queries, or queries involving no computation (e.g., reading a partitioned table by partition), do not launch a MapReduce job

      explain select * from dept_et limit 1;

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 1
      Processor Tree:
        TableScan
          alias: dept_et
          Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: int), name (type: string), city (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            Limit
              Number of rows: 1
              Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              ListSink

  2.2 join

    explain select * from dept_et et join dept_mg mg on et.id = mg.id;

<!-- Build the MR job flow: 4 => 3 => 0 (end) -->
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  <!-- Step 1: scan table mg (dept_mg mg) with a built-in basic filter predicate (id is not null). Note that the hash-table (build) side of the join is the later table. "Map Reduce Local Work" is a localized MapReduce: because the test table is tiny, Hive chooses to pull the data to the local machine and process it directly instead of running a full distributed MapReduce -->
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        mg 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        mg 
          TableScan
            alias: mg
            Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 id (type: int)
                  1 id (type: int)
  <!-- Step 2: a MapReduce task scans et and performs a Map Join, outputting _col0, _col1, _col2, _col6, _col7, _col8 (i.e., the * in the statement, 6 columns in total). The result goes to a File Output temporary file (compressed: false, uncompressed) -->
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: et
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                Statistics: Num rows: 1 Data size: 354 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                  Statistics: Num rows: 1 Data size: 354 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 354 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

  2.3 group by

    explain select city,sum(id) from dept_et group by city;

    The execution plan is as follows:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  <!-- Stage definition: one stage corresponds to one MapReduce job -->
  Stage: Stage-1
    <!-- Map phase -->
    Map Reduce
      Map Operator Tree:
          TableScan // table scan
            alias: dept_et
            Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE // estimated statistics for table dept_et
            Select Operator // column pruning: only city (type: string) and id (type: int) are needed
              expressions: city (type: string), id (type: int)
              outputColumnNames: city, id
              Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              <!-- Map-side operation: hash city (type: string) as the key and apply sum(id); the output is _col0, _col1 (hash(city), sum(id)) -->
              Group By Operator
                aggregations: sum(id) // aggregate function per group => sum(id)
                keys: city (type: string) 
                mode: hash 
                outputColumnNames: _col0, _col1 
                Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE
                <!-- Map-side output -->
                Reduce Output Operator
                  key expressions: _col0 (type: string) // the Map output key is _col0 (hash(city))
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: bigint) // the Map output value is _col1 (the partial sum(id))
      <!-- Reduce phase: merge the outputs of multiple Maps. Keyed by _col0 (the map-side hash(city)), apply sum(VALUE._col0) (i.e., sum over the map-side partial sum(id) values); the result is again _col0, _col1 (city, sum(sum(id))) -->
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial // merge the partial results of the multiple map outputs
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 107 Basic stats: COMPLETE Column stats: NONE
          <!-- Reduce-side output: written to a temporary file, uncompressed -->
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 107 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

  2.4 distinct

    2.4.1 A single distinct

      select city,count(distinct(name)) from dept_et group by city;

      With only one distinct, the group-by column and the distinct column are combined as the Map output key; the group-by column alone is used as the Reduce partition key, and the Reduce phase deduplicates by remembering the last key seen (LastKey)

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      <!-- Map-side definition. Input: a table scan of dept_et selecting the raw city and name values. Processing: the group column (city) and the distinct column (name) together form the key, and count(DISTINCT name) is applied. Output: _col0, _col1, _col2 (city, name, count(DISTINCT name)) -->
      Map Operator Tree:
          TableScan
            alias: dept_et
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: city (type: string), name (type: string) // no aggregate function; the raw values are selected
              outputColumnNames: city, name
              Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(DISTINCT name)
                keys: city (type: string), name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
      <!-- Reduce-side definition: receive the Map output, key on _col0 (city) alone, and aggregate once more (a deduplicated count of name per city). The result is written to a temporary file -->
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

 

    2.4.2 Multiple distinct columns

      select dealid, count(distinct uid), count(distinct date) from order group by dealid;
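With multiple distinct columns, the combined-key trick from 2.4.1 cannot serve both columns at once. One hedged skew-avoiding rewrite (a sketch, reusing the same hypothetical order table): deduplicate each column in its own GROUP BY subquery, then join the partial counts back together on dealid.

```sql
SELECT u.dealid, u.uid_cnt, d.date_cnt
FROM (
  -- distinct uid count per dealid
  SELECT dealid, COUNT(1) AS uid_cnt
  FROM (SELECT dealid, uid FROM order GROUP BY dealid, uid) t1
  GROUP BY dealid
) u
JOIN (
  -- distinct date count per dealid
  SELECT dealid, COUNT(1) AS date_cnt
  FROM (SELECT dealid, date FROM order GROUP BY dealid, date) t2
  GROUP BY dealid
) d
ON u.dealid = d.dealid;
```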

 

