Hive GROUP BY implementation: it is essentially word count


Preparing the data

SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
hive> SELECT * FROM logs;
a    apple            5
a    orange           3
a    apple            2
b    roast chicken    1
hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a    10
b    1

Computation process

[Figure: hive-groupby-cal]
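hive.map.aggr=true is set by default, so a GROUP BY is first performed on the mapper side and the partial results are merged at the end. The purpose is to reduce the amount of data the reducers have to process. Note that the mode shown in the EXPLAIN output differs between the two sides: the mapper uses hash and the reducer uses mergepartial. If hive.map.aggr is set to false, the GROUP BY is done only in the reducer, and its mode is complete.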
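The two-phase aggregation can be sketched in plain Python. This is only an illustration of the idea, not Hive's actual code: the function names, the way the rows are divided into two mapper splits, and the English item names are all assumptions made for the example.

```python
from collections import defaultdict

# Rows of the example logs table: (uid, item, count).
LOGS = [
    ("a", "apple", 5),
    ("a", "orange", 3),
    ("a", "apple", 2),
    ("b", "roast chicken", 1),
]


def map_side_hash_aggregate(rows):
    """Rough analogue of mode: hash. Partial SUM(count) per uid in one mapper."""
    partial = defaultdict(int)
    for uid, _item, count in rows:
        partial[uid] += count
    return list(partial.items())


def reduce_side_merge(pairs):
    """Rough analogue of the reducer: add up every value received per uid."""
    total = defaultdict(int)
    for uid, value in pairs:
        total[uid] += value
    return dict(total)


def group_by_sum(splits, map_aggr=True):
    """Return (result, number of records sent from mappers to the reducer)."""
    if map_aggr:
        # Mappers pre-aggregate; the reducer merges partial sums (mergepartial).
        shuffled = [kv for split in splits for kv in map_side_hash_aggregate(split)]
    else:
        # Mappers forward raw (uid, count) pairs; the reducer does everything (complete).
        shuffled = [(uid, count) for split in splits for uid, _item, count in split]
    return reduce_side_merge(shuffled), len(shuffled)


# Hypothetical split of the four rows across two mappers.
splits = [LOGS[:2], LOGS[2:]]
print(group_by_sum(splits, map_aggr=True))
print(group_by_sum(splits, map_aggr=False))
```

Both calls yield the same sums per uid. What differs is how many records cross from the mappers to the reducer, which is the saving that map-side aggregation is meant to provide.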

Operator

[Figure: hive-groupby-op]

Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs
          TableScan  // scan the table
            alias: logs
            Select Operator  // select the columns
              expressions:
                    expr: uid
                    type: string
                    expr: count
                    type: int
              outputColumnNames: uid, count
              Group By Operator  // hive.map.aggr=true by default, so the mapper aggregates once first to reduce the data the reducer must process
                aggregations:
                      expr: sum(count)  // aggregate function
                bucketGroup: false
                keys:  // grouping keys
                      expr: uid
                      type: string
                mode: hash  // hash mode, processHashAggr()
                outputColumnNames: _col0, _col1
                Reduce Output Operator  // emits key/value pairs to the reducer
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)  // aggregation
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial  // merge the partial values
          outputColumnNames: _col0, _col1
          Select Operator  // select the columns
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator  // write to file
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

