Select first row in each GROUP BY group?
stackflow上面的一個問題。用窗口函數比較簡單,但是那些沒有窗口函數的數據庫怎么辦?
id | customer | total ---+----------+------ 1 | Joe | 5 2 | Sally | 3 3 | Joe | 2 4 | Sally | 1
-
WITH summary AS (
-
SELECT p.id,
-
p.customer,
-
p.total,
-
ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) AS ranks
-
FROM PURCHASES p)
-
SELECT s.*
-
FROM summary s
-
WHERE s.ranks = 1
所以給出通用方法:
-
SELECT MIN(x.id), -- change to MAX if you want the highest
-
x.customer,
-
x.total
-
FROM PURCHASES x
-
JOIN (SELECT p.customer,
-
MAX(total) AS max_total
-
FROM PURCHASES p
-
GROUP BY p.customer) y ON y.customer = x.customer
-
AND y.max_total = x.total
-
GROUP BY x.customer, x.total
PS:原博還提到了一種Postresql中特有的解法:DISTINCT ON ()
-
SELECT DISTINCT ON (customer) -
id, customer, total -
FROM purchases -
ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
-
SELECT DISTINCT ON (2) -
id, customer, total -
FROM purchases -
ORDER BY 2, 3 DESC, 1;
If total can be NULL (won't hurt either way, but you'll want to match existing indexes):
-
... -
ORDER BY customer, total DESC NULLS LAST, id;
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated.
--如果total可以為空,則最可能希望具有最大非空值的行。最后添加空值。具體可參照:
其實有點不明白distinct on,看前輩的博客點擊打開鏈接。還用了IN 子查詢
DISTINCT ON ( expression [, …] )把記錄根據[, …]的值進行分組,分組之后僅返回每一組的第一行。需要注意的是,如果你不指定ORDER BY子句,返回的第一條的不確定的。如果你使用了ORDER BY 子句,那么[, …]里面的值必須靠近ORDER BY子句的最左邊。
1. 當沒用指定ORDER BY子句的時候返回的記錄是不確定的。
-
postgres= # select distinct on(course)id,name,course,score from student;
-
id | name | course | score
-
----+--------+--------+-------
-
10 | 周星馳 | 化學 | 83
-
8 | 周星馳 | 外語 | 88
-
2 | 周潤發 | 數學 | 99
-
14 | 黎明 | 物理 | 90
-
6 | 周星馳 | 語文 | 91
-
( 5 rows)
2. 獲取每門課程的最高分
-
postgres= # select distinct on(course)id,name,course,score from student order by course,score desc;
-
id | name | course | score
-
----+--------+--------+-------
-
5 | 周潤發 | 化學 | 87
-
13 | 黎明 | 外語 | 95
-
2 | 周潤發 | 數學 | 99
-
14 | 黎明 | 物理 | 90
-
6 | 周星馳 | 語文 | 91
-
( 5 rows)
3. 如果指定ORDER BY 必須把分組的字段放在最左邊
-
postgres= # select distinct on(course)id,name,course,score from student order by score desc;
-
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
-
LINE 1: select distinct on(course)id,name,course,score from student ...
4. 獲取每門課程的最高分同樣可以使用IN子句來實現
-
postgres= # select * from student where(course,score) in(select course,max(score) from student group by course);
-
id | name | course | score
-
----+--------+--------+-------
-
2 | 周潤發 | 數學 | 99
-
5 | 周潤發 | 化學 | 87
-
6 | 周星馳 | 語文 | 91
-
13 | 黎明 | 外語 | 95
-
14 | 黎明 | 物理 | 90
-
(5 rows)
原文還提到 在 row_number() over(), distinct on和in子句之間有一個小區別 ,主要是因為前兩個方法是用行號,且行號唯一。解決辦法就是用rank()窗口函數,讓同成績的行號出現重復。
下面是一位大神提供的6種方法,有些需要在PG中實現。而且這個大神的建表語句也讓我學習了。
Queries
1. row_number() in CTE, (see other answer) 公用表達式
-
WITH cte AS ( -
SELECT id, customer_id, total -
, row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn -
FROM purchases -
) -
SELECT id, customer_id, total -
FROM cte -
WHERE rn = 1;
2. row_number() in subquery (my optimization) 子查詢
-
SELECT id, customer_id, total -
FROM ( -
SELECT id, customer_id, total -
, row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn -
FROM purchases -
) sub -
WHERE rn = 1;
3. DISTINCT ON (see other answer)
-
SELECT DISTINCT ON (customer_id) -
id, customer_id, total -
FROM purchases -
ORDER BY customer_id, total DESC, id;
4. rCTE with LATERAL subquery (see here) 遞歸和LATERAL

在FROM 或者JOIN子句的子查詢里面可以關聯查詢FROM子句或者JOIN子句的另一邊的子句或者表.
見這一篇→點擊打開鏈接
-
WITH RECURSIVE cte AS ( -
( -- parentheses required -
SELECT id, customer_id, total -
FROM purchases -
ORDER BY customer_id, total DESC -
LIMIT 1 -
) -
UNION ALL -
SELECT u.* -
FROM cte c -
, LATERAL ( -
SELECT id, customer_id, total -
FROM purchases -
WHERE customer_id > c.customer_id -- lateral reference -
ORDER BY customer_id, total DESC -
LIMIT 1 -
) u -
) -
SELECT id, customer_id, total -
FROM cte -
ORDER BY customer_id;
5. customer table with LATERAL (see here)
-
SELECT l.* -
FROM customer c -
, LATERAL ( -
SELECT id, customer_id, total -
FROM purchases -
WHERE customer_id = c.customer_id -- lateral reference -
ORDER BY total DESC -
LIMIT 1 -
) l;
6. array_agg() with ORDER BY (see other answer)
-
SELECT (array_agg(id ORDER BY total DESC))[1] AS id -
, customer_id -
, max(total) AS total -
FROM purchases -
GROUP BY customer_id;
Results(性能)
Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.
All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per customer_id
-
1. 273.274 ms -
2. 194.572 ms -
3. 111.067 ms -
4. 92.922 ms -
5. 37.679 ms -- winner -
6. 189.495 ms
B. The same with Postgres 9.5
-
1. 288.006 ms -
2. 223.032 ms -
3. 107.074 ms -
4. 78.032 ms -
5. 33.944 ms -- winner -
6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per customer_id
-
1. 381.573 ms -
2. 311.976 ms -
3. 124.074 ms -- winner -
4. 710.631 ms -
5. 311.976 ms -
6. 421.679 ms
參考資料:https://www.oschina.net/translate/postgresqls-powerful-new-join-type-lateral

