1. HiveSQL Optimization
1.1 Core Idea
This section uses Hive on MapReduce as the example; the reasoning for Hive on Spark and other engines is the same.
HiveSQL is ultimately compiled into MapReduce jobs for execution, so a basic understanding of MapReduce is a prerequisite for any optimization.
Beyond that, you must understand what kind of MapReduce job (execution plan) a given HiveSQL statement compiles into; this is the fundamental basis for optimizing HiveSQL. Remember: optimizing HiveSQL is, in essence, optimizing the resulting MapReduce job.
Some relevant characteristics of MapReduce:
Data reads and writes both go through HDFS (disk), so they are all IO operations.
It does not tolerate a single oversized task well (data skew). A classic conclusion: data volume is not the problem; data skew is.
It does not tolerate large numbers of tiny tasks well. Task initialization and management (resource requests and so on) cost time and resources by themselves; with many tiny tasks, most of the time and resources are spent on task overhead.
HiveSQL optimization therefore targets exactly these characteristics.
1.2 Common Optimization Approaches
1.2.1 IO
Select only the columns you need. Hive prunes columns that the query does not reference, so unused columns are simply never read, which reduces IO.
Use table partitions whenever possible. With a filter on the partition column, MapReduce skips all files of the unneeded partitions outright, which drastically reduces IO.
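A minimal sketch of both points together (the table `orders_p` and its columns and partition key `dt` are hypothetical):

```sql
-- Column pruning: only uid and amount are read; other columns are skipped.
-- Partition pruning: only files under the dt='2024-01-01' partition are scanned.
SELECT uid, amount
FROM   orders_p
WHERE  dt = '2024-01-01';
```

`SELECT *`, by contrast, forces every column of every partition touched to be read and deserialized.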
1.2.2 Data Skew
1.2.2.1 Use count(distinct) with caution
Use count(distinct) with caution, because it easily causes data skew. The MapReduce job it produces groups rows by the GROUP BY key, sorts on the distinct column, and hands the output straight to Reduce.
That is exactly the problem: compared with other GROUP BY aggregations, count(distinct) is missing a key step, the Map-side pre-aggregation (a partial aggregation performed on the Map side, so that only partial results are shipped to Reduce).
When the Map side ships every row straight to Reduce and the grouping itself is unbalanced (say a pass/fail flag where over 80% of the rows are passes), some Reduce tasks end up processing far too much data. That is data skew.
Consider replacing count(distinct) with a GROUP BY subquery.
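Using the `dept_et` table from the examples below, the rewrite looks like this: the inner GROUP BY deduplicates (and benefits from Map-side partial aggregation), and the outer query just counts:

```sql
-- Equivalent to: SELECT city, COUNT(DISTINCT name) FROM dept_et GROUP BY city;
SELECT city, COUNT(name) AS name_cnt
FROM (
  SELECT city, name
  FROM   dept_et
  GROUP  BY city, name   -- dedup step, with Map-side pre-aggregation
) t
GROUP BY city;
```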
1.2.2.2 Watch out for data skew caused by NULL values
All NULLs are treated as the same value and are routed to the same Reduce task; if NULLs make up a large share of the rows, that is another source of data skew. Consider, from the business side, whether they can be filtered out.
The same applies to business-level null substitutes (common default values such as 0, 1, -1, -99).
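Two common workarounds, sketched against hypothetical tables `a` and `b` joined on a string column `uid`: filter the NULL keys out before the join, or salt them so they spread across reducers:

```sql
-- Option 1: drop NULL keys before the join (when the business allows it).
SELECT a.*, b.val
FROM   (SELECT * FROM a WHERE uid IS NOT NULL) a
JOIN   b ON a.uid = b.uid;

-- Option 2: salt NULL keys with a random value. NULL keys can never match
-- anyway, so this only changes which reducer they land on.
-- (Assumes uid is a string; otherwise cast both sides to string first.)
SELECT a.*, b.val
FROM   a
LEFT JOIN b
  ON (CASE WHEN a.uid IS NULL THEN concat('null_', rand()) ELSE a.uid END) = b.uid;
```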
1.2.3 Table Joins
Put the large table last. MapReduce builds the join data from back to front; filtering the large table first brings the data volume down, which reduces the data flowing into the Reduce-side hash join and improves efficiency.
Join on the same column where possible. If every table joins on the same column, any number of tables can be handled by a single MapReduce job; as soon as a different join column appears, an additional MapReduce job is started.
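A sketch of the same-column rule, with hypothetical tables `a`, `b`, `c`:

```sql
-- Same join key throughout: compiled into a single MapReduce job.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.uid = b.uid
JOIN c ON a.uid = c.uid;

-- Mixed join keys: the second join requires its own MapReduce job.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.uid = b.uid
JOIN c ON b.did = c.did;
```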
1.2.4 Configuration Tuning
Configuration here means the MapReduce or Spark settings.
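A few Hive settings that address the problems above (a sketch; availability and defaults vary across Hive versions, so verify against your deployment):

```sql
-- Enable Map-side aggregation (pre-aggregation before rows reach Reduce).
SET hive.map.aggr=true;
-- Handle skewed GROUP BY keys with a two-job plan: the first job spreads
-- rows randomly across reducers for partial aggregation, the second merges.
SET hive.groupby.skewindata=true;
-- Merge small map-only output files to avoid swarms of tiny downstream tasks.
SET hive.merge.mapfiles=true;
```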
2. How HiveSQL Translates to MapReduce
2.1 Cases that run no MapReduce
Not every HiveSQL statement runs MapReduce. Simple fetch queries, and queries that involve no computation (for example, reading from a partitioned table), do not launch a MapReduce job:
explain select * from dept_et limit 1;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 1
      Processor Tree:
        TableScan
          alias: dept_et
          Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: int), name (type: string), city (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            Limit
              Number of rows: 1
              Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              ListSink
2.2 join
explain select * from dept_et et join dept_mg mg on et.id= mg.id
<!-- MR job flow: Stage-4 => Stage-3 => Stage-0 (end) -->
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  <!-- Step 1: scan table mg (dept_mg mg), with a built-in base filter predicate (id is not null).
       Note that the join's reference table is the later table.
       "Map Reduce Local Work" is a localized MapReduce: the test tables are so small that
       Hive chooses to pull the data locally and process it directly instead of running a
       full distributed MapReduce. -->
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        mg
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        mg
          TableScan
            alias: mg
            Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 id (type: int)
                  1 id (type: int)

  <!-- Step 2: a MapReduce task that scans the table and performs a Map Join,
       producing _col0, _col1, _col2, _col6, _col7, _col8 (i.e. the * in the query, six columns in total).
       The result goes to a File Output temporary file (compressed: false, uncompressed). -->
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: et
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                Statistics: Num rows: 1 Data size: 354 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col6 (type: int), _col7 (type: string), _col8 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                  Statistics: Num rows: 1 Data size: 354 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 354 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
2.3 group by
explain select city,sum(id) from dept_et group by city;
The execution plan is as follows:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  <!-- Stage definitions: one stage corresponds to one MapReduce job -->
  Stage: Stage-1
    <!-- Map phase -->
    Map Reduce
      Map Operator Tree:
          TableScan                                        // table scan
            alias: dept_et
            Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE   // estimated statistics for table dept_et
            Select Operator                                // column pruning: only city (string) and id (int) are needed
              expressions: city (type: string), id (type: int)
              outputColumnNames: city, id
              Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              <!-- Map-side operation: a hash-based partial aggregation keyed on city (type: string),
                   computing sum(id); output is _col0, _col1 (city, partial sum(id)) -->
              Group By Operator
                aggregations: sum(id)                      // aggregate function => sum(id)
                keys: city (type: string)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              <!-- Map-side output -->
              Reduce Output Operator
                key expressions: _col0 (type: string)      // Map output key: _col0 (city)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 3 Data size: 322 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: bigint)    // Map output value: _col1 (partial sum(id))
      <!-- Reduce phase: merge the outputs of multiple Maps, keyed on _col0 (the city from the Map output),
           computing sum(VALUE._col0) (i.e. summing the Map-side partial sum(id) values);
           the result is again _col0, _col1 (city, total sum(id)) -->
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial                               // merge the partial outputs of multiple Maps
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 107 Basic stats: COMPLETE Column stats: NONE
          <!-- Reduce output: a temporary file, uncompressed -->
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 107 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
2.4 distinct
2.4.1 A single distinct
select city,count(distinct(name)) from dept_et group by city;
With a single distinct, the GROUP BY column and the distinct column are combined into the Map output key, the GROUP BY column alone is used as the Reduce partition key, and the Reduce phase deduplicates by remembering the last key it has seen (LastKey).
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      <!-- Map phase
           Input: a table scan of dept_et, selecting the raw values of city and name.
           Processing: the group column (city) and the distinct column (name) together form the key,
           with the expression count(DISTINCT name).
           Output: _col0, _col1, _col2 (city, name, count(DISTINCT name)) -->
      Map Operator Tree:
          TableScan
            alias: dept_et
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: city (type: string), name (type: string)   // no aggregate function, just the raw values
              outputColumnNames: city, name
              Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(DISTINCT name)
                keys: city (type: string), name (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string), _col1 (type: string)
                sort order: ++
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
      <!-- Reduce phase: take the Map output, aggregate once more with _col0 (city) as the key
           (a deduplicated count of name per city), and write the result to a temporary file -->
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 322 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
2.4.2 Multiple distinct columns
select dealid, count(distinct uid), count(distinct date) from order group by dealid;
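With more than one distinct column, the single combined key above no longer works. One common rewrite (a sketch based on the query above; the casts assume uid and date need a common string type for the UNION) deduplicates each (dealid, value) pair first, then counts per tag:

```sql
SELECT dealid,
       COUNT(CASE WHEN tag = 'uid'  THEN val END) AS uid_cnt,
       COUNT(CASE WHEN tag = 'date' THEN val END) AS date_cnt
FROM (
  SELECT dealid, 'uid'  AS tag, CAST(uid  AS string) AS val FROM order GROUP BY dealid, uid
  UNION ALL
  SELECT dealid, 'date' AS tag, CAST(date AS string) AS val FROM order GROUP BY dealid, date
) t
GROUP BY dealid;
```

The inner GROUP BYs get Map-side pre-aggregation, so the skew-prone all-rows-to-one-reducer pattern of count(distinct) is avoided.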