Hive group by实现-就是word 统计

本文转载自查看原文 2017-01-31 23:21 3259 数据分析

准备数据

SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

hive> SELECT * FROM logs; a 苹果 5 a 橙子 3 a 苹果 2 b 烧鸡 1 hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid; a 10 b 1

计算过程

默认设置了hive.map.aggr=true，所以会在mapper端先group by一次，最后再把结果merge起来，为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash，reducer是mergepartial。如果把hive.map.aggr=false，那将groupby放到reducer才做，他的mode是complete.

Operator

Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: logs TableScan // 扫描表 alias: logs Select Operator //选择字段 expressions: expr: uid type: string expr: count type: int outputColumnNames: uid, count Group By Operator //这里是因为默认设置了hive.map.aggr=true，会在mapper先做一次聚合，减少reduce需要处理的数据 aggregations: expr: sum(count) //聚集函数 bucketGroup: false keys: //键 expr: uid type: string mode: hash //hash方式，processHashAggr() outputColumnNames: _col0, _col1 Reduce Output Operator //输出key，value给reducer key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE._col0) //聚合 bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial //合并值 outputColumnNames: _col0, _col1 Select Operator //选择字段 expressions: expr: _col0 type: string expr: _col1 type: bigint outputColumnNames: _col0, _col1 File Output Operator //输出到文件 compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1

免责声明！

本站转载的文章为个人学习借鉴使用，本站对版权不负任何法律责任。如果侵犯了您的隐私权益，请联系本站邮箱yoyou2525@163.com删除。

猜您在找 Hive – Group By 的实现zz hive实现group_concat Hive mapreduce SQL实现原理——SQL最终分解为MR任务，而group by在MR里和单词统计MR没有区别了 hive------ Group by、join、distinct等实现原理 HIVE Group by、join、distinct等实现原理 hive中实现group_concat shell实现group by聚合操作统计 hive中实现类似MySQL中的group_concat功能 hive 分组后聚合实现GROUP_CONCAT Hive中实现group concat功能（不用udf）