Hive: querying the first record among rows that share a duplicated field

This article is a repost. 2020-06-17 16:13

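Scenario: several rows in the database have identical values in the id, toapp, topin, and toclienttype fields and differ only in receivetime. We need to return the row with the smallest receivetime and discard the others.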

select
    *
from
    (
        select
            *,
            row_number() over(partition by id order by receivetime asc) num
        from
            xxxxxxxxxxxxxxxxxxxx
        where
            dt = '2019-01-14'
            and toapp = 'xxxxxx'
            and toclienttype = 'xxxxxx'
            and msgrectype = '1'
            and systemtime is not null
    ) as t
where
    t.num = 1

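The key expression here is row_number() OVER (PARTITION BY COL1 ORDER BY COL2).

It first partitions the rows by COL1 and then sorts each partition by COL2. row_number() numbers the records inside each partition in that COL2 order, so taking the record numbered 1 from every partition gives exactly the row we need.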
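The "keep only row number 1" pattern can be tried without a Hive cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (window functions need SQLite 3.25 or newer). The table name msg, its reduced column set, and the sample rows are hypothetical and only serve the illustration.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")

# Hypothetical table with only the columns needed for the illustration.
conn.execute("CREATE TABLE msg (id TEXT, toapp TEXT, receivetime INTEGER)")
conn.executemany(
    "INSERT INTO msg VALUES (?, ?, ?)",
    [
        ("m1", "appA", 300),
        ("m1", "appA", 100),  # earliest row for id m1
        ("m1", "appA", 200),
        ("m2", "appA", 50),   # only row for id m2
    ],
)

# Number the rows of each id by ascending receivetime, then keep number 1.
first_rows = conn.execute(
    """
    SELECT id, toapp, receivetime
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY receivetime ASC) AS num
        FROM msg
    ) AS t
    WHERE t.num = 1
    ORDER BY id
    """
).fetchall()

print(first_rows)  # [('m1', 'appA', 100), ('m2', 'appA', 50)]
```

Each id keeps exactly one row, the one with the smallest receivetime. Switching the window's ORDER BY to DESC would keep the latest row instead.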

Example:

empid    deptid    salary
----------- ----------- ---------------------------------------
1          10       5500.00
2          10       4500.00
3          20       1900.00
4          20       4800.00
5          40       6500.00
6          40       14500.00
7          40       44500.00
8          50       6500.00
9          50       7500.00

row_number() OVER (PARTITION BY deptid ORDER BY salary)

SELECT *, Row_Number() OVER (partition by deptid ORDER BY salary desc) rank FROM employee

Result:

empid    deptid    salary                               rank
----------- ----------- --------------------------------------- --------------------
1          10       5500.00                               1
2          10       4500.00                               2
4          20       4800.00                               1
3          20       1900.00                               2
7          40       44500.00                              1
6          40       14500.00                              2
5          40       6500.00                               3
9          50       7500.00                               1
8          50       6500.00                               2
————————————————
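Copyright notice: this is an original article by CSDN blogger "dancheren", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/dancheren/article/details/86481376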

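How to deduplicate a Hive table: removing rows that are exactly identical

Problem: a table turns out to contain pairs of identical rows.

Goal: keep only one row from each set of duplicates.

Method:

Principle: we compare the following two counts.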

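select count(*) from table1;
-- Result: 200 rows

select count(distinct col1, col2) from table1;
-- Result: 100 rows

-- The second query counts the rows that remain after deduplication.

create table table1_bak as select distinct col1, col2 from table1;   -- back up the deduplicated rows

-- delete requires a transactional (ACID) table; on other tables,
-- "insert overwrite table table1 select * from table1_bak;" can replace the two statements below.
delete from table1;

insert into table1 select * from table1_bak;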
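The backup-and-reload steps above can be sketched end to end with Python's built-in sqlite3 module standing in for Hive. This is an illustration under assumptions: the names table1, col1, and col2 are placeholders, the five sample rows are made up, and because SQLite has no multi-column count(distinct ...), the distinct count is taken over a subquery instead.

```python
import sqlite3

# In-memory SQLite stands in for Hive; table and column names are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("c", "z")],
)

# Total rows versus distinct rows: a gap between the two means duplicates exist.
total = conn.execute("SELECT count(*) FROM table1").fetchone()[0]
unique = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT col1, col2 FROM table1)"
).fetchone()[0]

# Copy the distinct rows to a backup table, empty the original, copy them back.
conn.execute("CREATE TABLE table1_bak AS SELECT DISTINCT col1, col2 FROM table1")
conn.execute("DELETE FROM table1")
conn.execute("INSERT INTO table1 SELECT * FROM table1_bak")

remaining = conn.execute(
    "SELECT col1, col2 FROM table1 ORDER BY col1"
).fetchall()

print(total, unique)  # 5 3
print(remaining)      # [('a', 'x'), ('b', 'y'), ('c', 'z')]
```

After the reload, the row count of table1 equals the distinct count measured beforehand, which is a quick sanity check that nothing other than duplicates was dropped.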

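Pitfalls of using DISTINCT in Hive

Problem description:

While using Hive, I filtered duplicate data with DISTINCT and got a counterintuitive result that I could not explain at first.

Assume the test table holds 1,000,000 rows and is deduplicated on a, b, c, d, e.

1. The SQL using DISTINCT:

SQL1: select count(distinct a, b, c, d, e) from test;
Result: 20,000+.

Given the nature of the data, my first impression was that there could not be that many duplicates. I began to doubt my use of distinct, so I cross-checked the result with group by.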

2. The SQL using GROUP BY:

SQL2: select count(*) from (select count(*) gcount from test group by a, b, c, d, e) t;
Result: 800,000+.

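This result matches the nature of the data.

3. Modify SQL1 by dropping one column:

SQL3: select count(distinct b, c, d, e) from test;
Result: 900,000+.

4. Compare SQL1 and SQL3

In theory, distinct over 4 columns can never yield more combinations than distinct over 5 columns, yet the test shows exactly the opposite.

The cause is that column a contains NULLs. I had assumed that all NULL values would be collapsed into one value, but Hive does not do that: count(distinct ...) leaves out every row in which any of the listed columns is NULL.

So if any column listed in distinct contains NULLs, the count comes out wrong, and the NULLs need to be replaced with one uniform value first.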

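The fix:

select count(distinct nvl(a, 0), b, c, d, e) from test;
With that, the problem is solved. The replacement value, 0 here, should be one that never occurs in column a.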
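The three counts discussed above can be reproduced on a tiny data set. The sketch below uses Python's built-in sqlite3 module rather than Hive, so it rests on two assumptions: SQLite has no multi-column count(distinct ...), and Hive's behaviour of skipping rows with a NULL in any listed column is therefore emulated with an explicit WHERE filter. The two-column table and its six rows are made up for the illustration.

```python
import sqlite3

# In-memory SQLite stands in for Hive; the sample data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "y"), (None, "p"), (None, "q"), (None, "q")],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# GROUP BY treats NULL as an ordinary group key, so NULL rows form groups.
groups = scalar("SELECT count(*) FROM (SELECT 1 FROM test GROUP BY a, b)")

# Emulation of count(distinct a, b) skipping rows where any column is NULL.
non_null_distinct = scalar(
    "SELECT count(*) FROM (SELECT DISTINCT a, b FROM test "
    "WHERE a IS NOT NULL AND b IS NOT NULL)"
)

# Replacing NULL with a sentinel (0 never occurs in column a) restores the rows.
with_sentinel = scalar(
    "SELECT count(*) FROM (SELECT DISTINCT coalesce(a, 0) AS a, b FROM test)"
)

print(groups, non_null_distinct, with_sentinel)  # 4 2 4
```

The NULL-skipping count is lower than the GROUP BY count, and substituting a sentinel brings the two back into agreement, which mirrors the gap between SQL1 and SQL2 and the effect of nvl.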
————————————————
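Copyright notice: this is an original article by CSDN blogger "UncleMing5371", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/UncleMing5371/article/details/85236709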

Usage of distinct in Hive:

https://blog.csdn.net/lz6363/article/details/85842146

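ROW_NUMBER() OVER syntax and examples in Hive SQL

Basic usage of the ROW_NUMBER() OVER function

Syntax: ROW_NUMBER() OVER(PARTITION BY COLUMN ORDER BY COLUMN)

Details:

row_number() OVER (PARTITION BY COL1 ORDER BY COL2) partitions the rows by COL1 and sorts each partition by COL2. The value the function returns is the sequence number of each row within its partition after sorting; the numbers are consecutive and unique within a partition.

Scenario:

The employee table in Hive has three fields: empid, deptid, and salary. Partition by department and show each employee's salary rank within the department.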

1. Inspect the source table. The employee table and its contents in Hive are as follows:

2. Run the SQL.

SELECT *, Row_Number() OVER (partition by deptid ORDER BY salary desc) rank FROM employee
3. Check the result.
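The ranking query can be checked locally with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (SQLite 3.25 or newer is assumed for window functions) and loads the nine employee rows from the example table earlier in this article. The alias rn replaces rank to avoid clashing with the rank() function name, and the outer ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

# In-memory SQLite stands in for Hive (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (empid INTEGER, deptid INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1, 10, 5500.00),
        (2, 10, 4500.00),
        (3, 20, 1900.00),
        (4, 20, 4800.00),
        (5, 40, 6500.00),
        (6, 40, 14500.00),
        (7, 40, 44500.00),
        (8, 50, 6500.00),
        (9, 50, 7500.00),
    ],
)

# Number employees inside each department from highest to lowest salary.
rows = conn.execute(
    """
    SELECT empid, deptid, salary,
           row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rn
    FROM employee
    ORDER BY deptid, rn
    """
).fetchall()

for row in rows:
    print(row)
```

The numbering restarts at 1 for every deptid, so department 40, which has three employees, is the only one that reaches 3.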

————————————————
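Copyright notice: this is an original article by CSDN blogger "汀樺塢", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/wiborgite/article/details/80521593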

My usage:

(SELECT * FROM tm_data_room_${this_ds}_${this_ts}_per_hour_tmp1) as a
left join
(
    SELECT DISTINCT(k.setid),k.ds,k.ts,k.game_mode,k.create_clubid as clubid,k.leagueid,k.is_satellite,k.is_entity,k.ensure_chips,k.gameset_end_time from
    (
        select * from
        (
            select *,Row_Number() OVER (partition by setid ORDER BY gameset_end_time desc) rank from gameset_info_log_flow
            where ds = ${this_ds} and ts = ${this_ts} and room_mode = 3 and gameset_status=100 and gameset_start_time != 0 and gameset_end_time !=0
        ) as t
        where t.rank = 1
    ) as k
    --gameset_info_log_flow
    --where ds = ${this_ds} and ts = ${this_ts} and room_mode = 3 and gameset_status=100 and gameset_start_time != 0 and gameset_end_time !=0
) as b
on a.setid = b.setid
