GROUP BY and DISTINCT in Hive

Reposted article. 2017-10-23 17:49. Tags: hive / group by / distinct

Preface

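Only today did it become clear to me that group by also deduplicates. On reflection this makes sense: grouping by a column puts all rows with the same value into one group, which amounts to deduplication on that column. Let's look at it in detail.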

group by

Example 1:

hive> select * from test;
OK
zhao	15	20170807
zhao	14	20170809
zhao	15	20170809
zhao	16	20170809
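
To reproduce the examples, a table like the one above could be created roughly as follows. This is only a sketch: the column names name and age come from the queries in this post, while the third column name dt and all column types are assumptions, and INSERT ... VALUES needs Hive 0.14 or later.

```sql
-- Assumed schema: `dt` and the column types are guesses;
-- only `name` and `age` are named in the post.
CREATE TABLE test (
  name STRING,
  age  INT,
  dt   STRING
);

-- The four sample rows shown in the output above.
INSERT INTO TABLE test VALUES
  ('zhao', 15, '20170807'),
  ('zhao', 14, '20170809'),
  ('zhao', 15, '20170809'),
  ('zhao', 16, '20170809');
```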

hive> select name from test;
OK
zhao
zhao
zhao
zhao

hive> select name from test group by name;

...

OK
zhao
Time taken: 40.273 seconds, Fetched: 1 row(s)

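Grouping by name leaves a single row, so the result is deduplicated. Strictly speaking, deduplication only collapses rows that are identical on the grouped columns, so let's try it with two columns: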

hive> select name,age from test group by name,age;
...

OK
zhao	14
zhao	15
zhao	16
Time taken: 36.943 seconds, Fetched: 3 row(s)

hive> select name,age from test group by name;
FAILED: SemanticException [Error 10025]: Line 1:12 Expression not in GROUP BY key 'age'

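Grouping by name alone fails. With name as the only key, Hive has no way to decide which of the different age values to return for the group, so the error is reasonable.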
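
One way to make that query valid, sketched here on the assumption that a single summary value per name is acceptable, is to wrap age in an aggregate function. max and collect_set are built-in Hive functions; the aliases max_age and ages are only illustrative.

```sql
-- Return one row per name with the largest age in the group.
SELECT name, max(age) AS max_age
FROM test
GROUP BY name;

-- Or keep every distinct age of the group in an array.
SELECT name, collect_set(age) AS ages
FROM test
GROUP BY name;
```

With the sample rows above, the first query yields a single row for zhao with max_age 16.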

distinct

This one is simple as well; it removes duplicate rows:

hive> select distinct name from test;
...

OK
zhao
Time taken: 37.047 seconds, Fetched: 1 row(s)

hive> select distinct name,age from test;
OK
zhao	14
zhao	15
zhao	16
Time taken: 39.131 seconds, Fetched: 3 row(s)

hive> select distinct(name),age from test;
OK
zhao	14
zhao	15
zhao	16
Time taken: 37.739 seconds, Fetched: 3 row(s)

The last query returns the same three rows because distinct is not a function. The parentheses in distinct(name) only wrap the column expression, so deduplication still applies to the whole (name, age) pair.

Differences

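With large data volumes, distinct is often the slower of the two, so group by is generally recommended.
As for the reasons, the original post points to a separate article.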
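
This advice is usually illustrated with count(distinct ...). A commonly cited reason, given here as an assumption rather than something measured in this post, is that a global count(distinct) can end up being aggregated by a single reducer, whereas deduplicating with group by first lets the work be spread across reducers. A minimal sketch of the rewrite (the subquery alias t is arbitrary):

```sql
-- Direct form: count the distinct names in one aggregation.
SELECT count(DISTINCT name) FROM test;

-- Rewritten form: deduplicate with GROUP BY, then count the resulting groups.
SELECT count(*)
FROM (
  SELECT name
  FROM test
  GROUP BY name
) t;
```

On the sample table both forms return 1. They differ when name contains NULL: count(distinct) ignores NULLs, while the rewrite counts the NULL group as one row. Whether the rewrite is actually faster depends on the Hive version, settings, and data, so it is worth comparing the two plans with EXPLAIN.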
