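PostgreSQL: getting the latest row per group after group by
References: https://www.cnblogs.com/funnyzpc/p/9311281.html, https://www.cnblogs.com/aeolian/p/9359898.html
Requirement
- Group by registration_id and district (registration number, district)
- Within each group, sort by time descending and take the latest row
Data
Note: rows id=1 and id=2 share a registration_id but differ in district, so after grouping on both columns they should yield two separate result rows.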
CREATE TABLE "public"."application" (
"id" int8 NOT NULL,
"district" varchar(255) COLLATE "pg_catalog"."default",
"registration_id" varchar(255) COLLATE "pg_catalog"."default",
"create_time" timestamp(6),
CONSTRAINT "application_pkey" PRIMARY KEY ("id")
);
ALTER TABLE "public"."application"
OWNER TO "postgres";
COMMENT ON COLUMN "public"."application"."id" IS 'Primary key';
COMMENT ON COLUMN "public"."application"."district" IS 'Application district';
COMMENT ON COLUMN "public"."application"."registration_id" IS 'Registration number';
COMMENT ON COLUMN "public"."application"."create_time" IS 'Creation time';
INSERT INTO "public"."application"("id", "district", "registration_id", "create_time") VALUES (1, '310101', '1', '2021-05-05 09:42:49');
INSERT INTO "public"."application"("id", "district", "registration_id", "create_time") VALUES (2, '310102', '1', '2021-05-05 09:42:55');
INSERT INTO "public"."application"("id", "district", "registration_id", "create_time") VALUES (3, '310101', '2', '2021-05-05 09:43:00');
INSERT INTO "public"."application"("id", "district", "registration_id", "create_time") VALUES (4, '310101', '2', '2021-05-05 09:43:59');
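Implementation 1
My first instinct was to write having create_time = max(create_time). Ha, that was a dumb idea: having is applied after group by, so by then the individual rows are gone and there is no per-row create_time left to compare with the maximum.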
SELECT
app.registration_id,
app.district,
(SELECT "id" FROM application AS temp1 WHERE temp1.registration_id = app.registration_id AND temp1.district = app.district ORDER BY create_time DESC LIMIT 1) AS "id"
FROM application AS app
GROUP BY app.registration_id, app.district
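Query result: correct
The result is correct, but in production, with a few hundred thousand rows and no index other than the primary key, this query is slow enough to make you question your life choices: the correlated subquery runs once per group, and without a usable index each run has to scan the table.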
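One way to ease that slowdown might be a composite index that covers the subquery's filter and sort columns. The following is only a sketch: the index name is made up for illustration, and I have not measured it against the production data described above, so whether the planner actually uses it depends on the real table.

```sql
-- Hypothetical index name (not from the original post).
-- Column order: the two equality-filtered grouping keys first,
-- then the sort key in the same direction as ORDER BY create_time DESC.
CREATE INDEX IF NOT EXISTS idx_application_reg_district_time
    ON application (registration_id, district, create_time DESC);
```

With an index like this, each run of the correlated subquery can in principle be answered by reading a single index entry instead of scanning the table.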
Implementation 2
SELECT * FROM (
SELECT district, row_number() over (partition by registration_id, district order by create_time desc), "id" FROM application
) AS de_dump WHERE row_number = 1
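Query result: correct
Comparison
Requirement
- Group by registration_id and district (registration number, district)
- Within each group, sort by time descending and take the latest row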
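As I understand it, partition by compares with group by as follows:
- Similarity: both group rows, and both can group on multiple columns.
- Difference 1: partition by makes it much easier to sort the rows inside each group and pick out exactly the ones you need; group by is blunt and can only get at the data through aggregate functions.
- Difference 2: a group by query cannot select columns that are neither grouped nor aggregated, so fetching any other column requires a subquery, which is clumsy.
- For my requirement the choice is clearly partition by: when I looked at the QUERY PLAN for both queries, the number of plan steps alone differed greatly.
QUERY PLAN: my database skills are limited, but judging by the number of plan steps and by my own experience, the window-function SQL clearly outperforms the traditional group by version here.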
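To reproduce this kind of comparison, the two queries can be prefixed with EXPLAIN. This is a sketch of how one might do it, not the exact commands used for the observation above. Note that the ANALYZE option really executes the query, so the reported timings are measured rather than estimated.

```sql
-- Implementation 1: group by plus correlated subquery
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    app.registration_id,
    app.district,
    (SELECT "id"
     FROM application AS temp1
     WHERE temp1.registration_id = app.registration_id
       AND temp1.district = app.district
     ORDER BY create_time DESC
     LIMIT 1) AS "id"
FROM application AS app
GROUP BY app.registration_id, app.district;

-- Implementation 2: window function, filtered in an outer query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM (
    SELECT district,
           row_number() OVER (PARTITION BY registration_id, district
                              ORDER BY create_time DESC),
           "id"
    FROM application
) AS de_dump
WHERE row_number = 1;
```

The actual time and buffer figures in the output are usually a more reliable basis for comparison than the number of plan nodes alone.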
Summary
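Although my experiments show that Implementation 2 beats Implementation 1, Implementation 2 still cannot avoid a subquery: it deduplicates by wrapping the window query and filtering on row_number = 1.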
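One candidate worth trying is PostgreSQL's DISTINCT ON clause, which keeps the first row of each group according to ORDER BY and so needs no wrapping subquery. This is a sketch of an alternative, not something benchmarked in this post; the extra "id" DESC tie-breaker is my own assumption, added so the result is deterministic when two rows share the same create_time.

```sql
-- DISTINCT ON keeps the first row per (registration_id, district);
-- the ORDER BY must start with those same expressions.
SELECT DISTINCT ON (registration_id, district)
       registration_id,
       district,
       "id",
       create_time
FROM application
ORDER BY registration_id,
         district,
         create_time DESC,   -- latest row first within each group
         "id" DESC;          -- assumed tie-breaker, not in the original queries
```

Against the sample data above this should return the rows with id 1, 2 and 4. DISTINCT ON is a PostgreSQL extension, so it will not port to other databases the way row_number() does.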
If you know a better approach, please leave a comment.