1. There are basically three methods for deduplicating rows in a PostgreSQL table; the methods I found online are given in Appendix 1. None of them, however, work for GreenPlum.
2. In GreenPlum a table's rows are distributed across segments. Within each segment the ctid is unique, but the same ctid can occur on different segments, so GreenPlum has to combine gp_segment_id with ctid to deduplicate.
3. A relatively cumbersome method found online is given in Appendix 2.
4. The final method is:
delete from test where (gp_segment_id, ctid) not in (select gp_segment_id, min(ctid) from test group by x, gp_segment_id);
Verified to work.
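A minimal end-to-end sketch of this approach (the table test, its column x, and the payload column are illustrative assumptions; it also assumes the table is distributed by x, so duplicate values of x land on the same segment and the per-segment GROUP BY removes them all):

-- Hypothetical test table, distributed by x so duplicates of x are co-located
CREATE TABLE test (x int, payload text) DISTRIBUTED BY (x);

INSERT INTO test VALUES
    (1, 'a'),
    (1, 'a'),   -- duplicate of x = 1
    (2, 'b'),
    (2, 'b'),   -- duplicate of x = 2
    (3, 'c');

-- Each row is identified by (gp_segment_id, ctid); ctid alone repeats across segments
SELECT gp_segment_id, ctid, x FROM test;

-- Keep the row with the smallest ctid in each (x, segment) group, delete the rest
DELETE FROM test
WHERE (gp_segment_id, ctid) NOT IN (
    SELECT gp_segment_id, min(ctid)
    FROM test
    GROUP BY x, gp_segment_id
);

-- Expect exactly one row per distinct x
SELECT x, count(*) FROM test GROUP BY x ORDER BY x;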
Appendix 1: Three methods for deduplicating a PostgreSQL table
Quoted from: http://my.oschina.net/swuly302/blog/144933
Using the example from the PostgreSQL 9.2 official documentation:
CREATE TABLE weather (
    city    varchar(80),
    temp_lo int,   -- low temperature
    temp_hi int,   -- high temperature
    prcp    real,  -- precipitation
    date    date
);

INSERT INTO weather VALUES
    ('San Francisco', 46, 50, 0.25, '1994-11-27'),
    ('San Francisco', 43, 57, 0, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29');  -- duplicated row
There are three methods:

Method 1: rebuild the table

-- Copy the de-duplicated rows into a new table weather_temp
SELECT DISTINCT city, temp_lo, temp_hi, prcp, date
INTO weather_temp
FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;

or:

-- Create a table weather_temp with the same structure as weather
CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS);
-- Fill weather_temp with the de-duplicated rows
INSERT INTO weather_temp SELECT DISTINCT * FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;

Easy to understand, but it involves destructive operations such as DROP, and it is slow and space-hungry on large tables. Not recommended.

Method 2: add a column

-- Add a new column of type serial
ALTER TABLE weather ADD COLUMN id SERIAL;
-- Delete the duplicate rows
DELETE FROM weather
WHERE id NOT IN (
    SELECT max(id)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);
-- Drop the added column
ALTER TABLE weather DROP COLUMN id;

This requires adding a column. "I do not yet know how Postgres handles adding a column: does it append to the existing table in place, or copy it into a new table?" If it appends in place, the new column may push rows onto new pages (typical block size: 8 KB); if it copies the table, that is even worse. Not great.

Method 3: system columns (see System Columns)

DELETE FROM weather
WHERE ctid NOT IN (
    SELECT max(ctid)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);

Postgres-specific, but simple.
---------------- But for a GreenPlum table, whose rows are split across segments, ctid alone cannot be used for deduplication.
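A quick way to see this (a sketch, assuming the weather table above were loaded into GreenPlum) is to select the system columns directly; rows on different segments can share the same ctid, so only the (gp_segment_id, ctid) pair identifies a row:

-- ctid values repeat across segments; (gp_segment_id, ctid) is unique
SELECT gp_segment_id, ctid, city, date
FROM weather
ORDER BY ctid, gp_segment_id;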
Appendix 2:
https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-
What is the most efficient way of deleting duplicate records from a table?
Currently we use Primary Keys to avoid loading duplicate data into our tables, but PK brings many restrictions. Since we can’t easily identify or prevent duplicates arriving from the variety of 3rd party upstream systems, we wanted to investigate the ‘load everything, remove duplicates afterwards’ approach.
In Postgres, you can use an efficient method such as:
DELETE FROM test WHERE ctid NOT IN ( SELECT min(ctid) FROM test GROUP BY x); (where 'x' is the unique column list)
However in Greenplum ‘ctid’ is only unique per segment.
One approach would be:
DELETE FROM test
USING (
    SELECT gp_segment_id, ctid
    FROM (
        SELECT gp_segment_id, ctid,
               rank() OVER (PARTITION BY x ORDER BY gp_segment_id, ctid) AS rk
        FROM test
    ) foo
    WHERE rk <> 1
) rows_to_delete
WHERE test.gp_segment_id = rows_to_delete.gp_segment_id
  AND test.ctid = rows_to_delete.ctid;
But the use of window functions, subqueries etc. feels pretty inefficient.
Is there a better form?
Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?
