PostgreSQL: selecting the first (highest) row of each group (6 methods)


Select first row in each GROUP BY group?

 

This is a question from Stack Overflow. With window functions it is fairly simple, but what about databases that have no window functions?

 

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

 

    WITH summary AS (
        SELECT p.id,
               p.customer,
               p.total,
               ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) AS ranks
        FROM PURCHASES p)
    SELECT s.*
    FROM summary s
    WHERE s.ranks = 1

So here is a generic method that does not need window functions:

    SELECT MIN(x.id), -- change to MAX if you want the highest
           x.customer,
           x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
          FROM PURCHASES p
          GROUP BY p.customer) y ON y.customer = x.customer
                                AND y.max_total = x.total
    GROUP BY x.customer, x.total
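To see how the generic query behaves when two rows tie for the highest total, here is a minimal sketch that runs it against the sample table using Python's built-in sqlite3 module. SQLite is only a stand-in here (an assumption, chosen because it needs no server). The function name `top_purchase_per_customer` and the trailing ORDER BY are illustrative additions, not part of the original answer.

```python
import sqlite3

def top_purchase_per_customer(rows):
    """Return one (id, customer, total) row per customer, the one with the highest total."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    query = """
        SELECT MIN(x.id) AS id, x.customer, x.total
        FROM purchases x
        JOIN (SELECT p.customer, MAX(p.total) AS max_total
              FROM purchases p
              GROUP BY p.customer) y
          ON y.customer = x.customer AND y.max_total = x.total
        GROUP BY x.customer, x.total
        ORDER BY x.customer  -- added only to make the output order deterministic
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

sample = [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 1)]
print(top_purchase_per_customer(sample))
# A second Joe row with total 5 ties with id 1; MIN(x.id) keeps the lower id.
print(top_purchase_per_customer(sample + [(5, "Joe", 5)]))
```

The outer GROUP BY with MIN(x.id) is what collapses tied rows into one; without it, the join would return every row that shares the maximum total.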

PS: the original post also mentions a solution specific to PostgreSQL: DISTINCT ON ()

 

    SELECT DISTINCT ON (customer)
           id, customer, total
    FROM purchases
    ORDER BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

    SELECT DISTINCT ON (2)
           id, customer, total
    FROM purchases
    ORDER BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

    ...
    ORDER BY customer, total DESC NULLS LAST, id;

If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. 

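The reason NULLS LAST matters is that PostgreSQL sorts NULL as if it were larger than any non-null value, so a plain DESC puts NULL rows first and DISTINCT ON would pick them. The sketch below only emulates that ordering in Python, with `None` standing in for NULL; the helper name `order_desc` is hypothetical.

```python
def order_desc(totals, nulls_last=False):
    """Emulate ORDER BY total DESC [NULLS LAST] as PostgreSQL orders it (None = NULL)."""
    non_null = sorted((t for t in totals if t is not None), reverse=True)
    nulls = [t for t in totals if t is None]
    # Default for DESC is NULLS FIRST; NULLS LAST moves the NULL rows to the end.
    return non_null + nulls if nulls_last else nulls + non_null

totals = [3, None, 5]
print(order_desc(totals))                   # [None, 5, 3]
print(order_desc(totals, nulls_last=True))  # [5, 3, None]
```

With the default ordering the first row of the group is the NULL one; with NULLS LAST it is the row with total 5.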

 

I did not fully understand DISTINCT ON at first, so I read a more experienced blogger's post about it, which also uses an IN subquery.

 

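 DISTINCT ON ( expression [, …] ) groups the rows by the values of the listed expressions and returns only the first row of each group. Note that if you do not specify an ORDER BY clause, which row comes first is unpredictable. If you do use an ORDER BY clause, the DISTINCT ON expressions must match the leftmost ORDER BY expressions.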
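The "sort, then keep the first row of each group" behaviour can be sketched in plain Python. This is only an emulation of the semantics under stated assumptions, not how PostgreSQL implements it; the function `distinct_on` and its parameters are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter

def distinct_on(rows, key, order_by):
    """Emulate SELECT DISTINCT ON (key) ... ORDER BY key, <order_by>.

    rows: list of dicts; key: grouping column;
    order_by: list of (column, descending) pairs applied after the key.
    """
    ordered = list(rows)
    # Python's sort is stable, so sort by the least significant column first.
    for column, descending in reversed(order_by):
        ordered.sort(key=itemgetter(column), reverse=descending)
    ordered.sort(key=itemgetter(key))  # the DISTINCT ON column is leftmost
    # Keep only the first row of each run of equal key values.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]

purchases = [
    {"id": 1, "customer": "Joe", "total": 5},
    {"id": 2, "customer": "Sally", "total": 3},
    {"id": 3, "customer": "Joe", "total": 2},
    {"id": 4, "customer": "Sally", "total": 1},
]
best = distinct_on(purchases, "customer", [("total", True), ("id", False)])
print([row["id"] for row in best])  # [1, 2]
```

Grouping only works because the rows are already sorted by the key, which mirrors why the DISTINCT ON expressions have to lead the ORDER BY list.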

 

1. Without an ORDER BY clause, the row returned for each group is unpredictable.

    postgres=# select distinct on(course)id,name,course,score from student;
     id | name   | course | score
    ----+--------+--------+-------
     10 | 周星馳 | 化學   | 83
      8 | 周星馳 | 外語   | 88
      2 | 周潤發 | 數學   | 99
     14 | 黎明   | 物理   | 90
      6 | 周星馳 | 語文   | 91
    (5 rows)

2. Get the highest score for each course

    postgres=# select distinct on(course)id,name,course,score from student order by course,score desc;
     id | name   | course | score
    ----+--------+--------+-------
      5 | 周潤發 | 化學   | 87
     13 | 黎明   | 外語   | 95
      2 | 周潤發 | 數學   | 99
     14 | 黎明   | 物理   | 90
      6 | 周星馳 | 語文   | 91
    (5 rows)

3. If ORDER BY is specified, the grouping column must come first (leftmost)

    postgres=# select distinct on(course)id,name,course,score from student order by score desc;
    ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
    LINE 1: select distinct on(course)id,name,course,score from student ...

4. The highest score for each course can also be obtained with an IN clause

    postgres=# select * from student where(course,score) in(select course,max(score) from student group by course);
     id | name   | course | score
    ----+--------+--------+-------
      2 | 周潤發 | 數學   | 99
      5 | 周潤發 | 化學   | 87
      6 | 周星馳 | 語文   | 91
     13 | 黎明   | 外語   | 95
     14 | 黎明   | 物理   | 90
    (5 rows)

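The original post also points out a small difference between row_number() over(), DISTINCT ON and the IN clause. The first two always return exactly one row per group (row numbers are unique, and DISTINCT ON keeps a single row), whereas the IN clause returns every row tied for the highest score. To make the window-function version return ties as well, use the rank() window function, which gives rows with the same score the same rank.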
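The difference shows up as soon as two rows tie. The sketch below is a small illustration using Python's sqlite3 module as a stand-in for PostgreSQL (an assumption; it needs SQLite 3.25 or later for window functions). The table contents and the helper `top_ids` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT, course TEXT, score INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", [
    (1, "Ann", "math", 99), (2, "Bob", "math", 99), (3, "Cid", "math", 80),
    (4, "Ann", "physics", 90), (5, "Bob", "physics", 70),
])

def top_ids(window_function):
    """Ids of the rows numbered 1 within each course by the given window function."""
    query = f"""
        SELECT id FROM (
            SELECT id,
                   {window_function}() OVER (PARTITION BY course ORDER BY score DESC) AS rn
            FROM student
        ) sub
        WHERE rn = 1
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query)]

# IN with a row value: returns every row that matches the per-course maximum.
in_ids = [row[0] for row in conn.execute("""
    SELECT id FROM student
    WHERE (course, score) IN (SELECT course, max(score) FROM student GROUP BY course)
    ORDER BY id
""")]

print(len(top_ids("row_number")))  # 2: one row per course, the math tie is broken arbitrarily
print(top_ids("rank"))             # [1, 2, 4]: both tied math rows are kept
print(in_ids)                      # [1, 2, 4]: same rows as rank()
```

Which of the two tied math rows row_number() keeps is not defined unless the ORDER BY inside the window includes a tie-breaker such as id.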

 

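Below are six methods provided by an expert; some of them only work in PostgreSQL. I also learned a lot from this expert's table-creation statements.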

Queries

1. row_number() in CTE (see other answer): common table expression

    WITH cte AS (
       SELECT id, customer_id, total
            , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
       FROM purchases
       )
    SELECT id, customer_id, total
    FROM cte
    WHERE rn = 1;

2. row_number() in subquery (my optimization): subquery

    SELECT id, customer_id, total
    FROM (
       SELECT id, customer_id, total
            , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
       FROM purchases
       ) sub
    WHERE rn = 1;

3. DISTINCT ON (see other answer)

    SELECT DISTINCT ON (customer_id)
           id, customer_id, total
    FROM purchases
    ORDER BY customer_id, total DESC, id;

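4. rCTE with LATERAL subquery (see here): recursion and LATERAL

The first (non-recursive) query picks the row with the smallest customer_id and, within that customer, the largest total. Each recursive step then fetches the best row of the next larger customer_id.

A subquery marked LATERAL in the FROM or JOIN clause can reference columns of the tables or subqueries that appear before it in the same FROM clause.

For more on LATERAL, see the article listed at the end of this post.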

    WITH RECURSIVE cte AS (
       (  -- parentheses required
       SELECT id, customer_id, total
       FROM purchases
       ORDER BY customer_id, total DESC
       LIMIT 1
       )
       UNION ALL
       SELECT u.*
       FROM cte c
       , LATERAL (
          SELECT id, customer_id, total
          FROM purchases
          WHERE customer_id > c.customer_id  -- lateral reference
          ORDER BY customer_id, total DESC
          LIMIT 1
          ) u
       )
    SELECT id, customer_id, total
    FROM cte
    ORDER BY customer_id;
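The recursive query effectively hops through an index from one customer_id to the next instead of reading every row. The following Python sketch imitates that hop with a sorted list and `bisect`; it is an analogy under the assumption of an index ordered by (customer_id, total DESC), not a description of PostgreSQL internals, and the name `first_row_per_customer` is hypothetical.

```python
from bisect import bisect_right

def first_row_per_customer(index):
    """index: (customer_id, -total, id) tuples sorted ascending, like index entries."""
    result = []
    pos = 0
    while pos < len(index):
        # The first entry of each customer has the highest total (smallest -total).
        customer_id, neg_total, row_id = index[pos]
        result.append((row_id, customer_id, -neg_total))
        # Jump straight past this customer's remaining entries,
        # the analogue of WHERE customer_id > c.customer_id ... LIMIT 1.
        pos = bisect_right(index, (customer_id, float("inf")), lo=pos)
    return result

# (id, customer_id, total); the customer ids are made up for the example.
purchases = [(1, 10, 5), (2, 20, 3), (3, 10, 2), (4, 20, 1)]
index = sorted((customer_id, -total, row_id) for row_id, customer_id, total in purchases)
print(first_row_per_customer(index))  # [(1, 10, 5), (2, 20, 3)]
```

The loop runs once per customer rather than once per row, which is consistent with this approach doing well when there are many rows per customer and poorly when there are only a few.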

5. customer table with LATERAL (see here)

    SELECT l.*
    FROM customer c
    , LATERAL (
       SELECT id, customer_id, total
       FROM purchases
       WHERE customer_id = c.customer_id  -- lateral reference
       ORDER BY total DESC
       LIMIT 1
       ) l;

6. array_agg() with ORDER BY (see other answer)

    SELECT (array_agg(id ORDER BY total DESC))[1] AS id
         , customer_id
         , max(total) AS total
    FROM purchases
    GROUP BY customer_id;

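This one is interesting: within each group the ids are collected into an array ordered by total descending, and the first element is taken, which is the id of the row with the maximum total. It is the first time I have seen the grouping column not placed first in the select list. Finally, max() retrieves the largest total. The result here also came out sorted by customer_id in ascending order; I read a few posts about that. That ordering is a side effect of how the aggregation was executed and is not guaranteed without an explicit ORDER BY.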
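What the array_agg() query computes per group can be spelled out in a few lines of Python. This is only a sketch of the semantics; `array_agg_first` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict

def array_agg_first(purchases):
    """purchases: (id, customer_id, total) tuples; one result row per customer_id."""
    groups = defaultdict(list)
    for row_id, customer_id, total in purchases:
        groups[customer_id].append((total, row_id))
    result = []
    for customer_id, members in groups.items():
        # array_agg(id ORDER BY total DESC): ids ordered by total, highest first
        ids = [row_id for _, row_id in sorted(members, key=lambda m: m[0], reverse=True)]
        # [1] takes the first array element; max(total) is computed separately
        result.append((ids[0], customer_id, max(total for total, _ in members)))
    return result

purchases = [(1, 10, 5), (2, 20, 3), (3, 10, 2), (4, 20, 1)]
print(sorted(array_agg_first(purchases)))  # [(1, 10, 5), (2, 20, 3)]
```

The id and the total are derived independently, and they agree only because the array is ordered by the same column that max() is applied to.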

Results (performance)

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

    1. 273.274 ms
    2. 194.572 ms
    3. 111.067 ms
    4. 92.922 ms
    5. 37.679 ms -- winner
    6. 189.495 ms

B. The same with Postgres 9.5

    1. 288.006 ms
    2. 223.032 ms
    3. 107.074 ms
    4. 78.032 ms
    5. 33.944 ms -- winner
    6. 211.540 ms

C. Same as B., but with ~ 2.3 rows per customer_id

    1. 381.573 ms
    2. 311.976 ms
    3. 124.074 ms -- winner
    4. 710.631 ms
    5. 311.976 ms
    6. 421.679 ms
 

References: https://www.oschina.net/translate/postgresqls-powerful-new-join-type-lateral

