Efficiently Removing Duplicate Rows from a GreenPlum Table


1. There are basically three methods for deduplicating a PostgreSQL table; they were found online and are reproduced in Appendix 1. None of them, however, works as-is for GreenPlum.

 

2. A GreenPlum table's rows are distributed across segments. Within each segment, ctid values are unique, but the same ctid can recur on different segments, so GreenPlum must bring in gp_segment_id alongside ctid to deduplicate.
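
To see the two system columns involved, they can be selected like ordinary columns (a minimal sketch, using the test table and deduplication column x that appear in step 4 below):

-- Show each row's segment id and within-segment tuple id;
-- (gp_segment_id, ctid) together identify a row globally.
SELECT gp_segment_id, ctid, x
FROM test
ORDER BY gp_segment_id;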

 

3. A relatively convoluted method was found online; it is reproduced in Appendix 2.

 

4. The final method is:

delete from test
where (gp_segment_id, ctid) not in (
    select gp_segment_id, min(ctid)
    from test
    group by x, gp_segment_id
);

(Here x is the column list that defines a duplicate. Grouping by gp_segment_id keeps one copy per segment, so this removes all duplicates only when rows sharing the same x are co-located, i.e. when the table's distribution key is contained in x.)

 

Verified to work.
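
A quick sanity check after the delete (a sketch, again assuming x is the column list that defines a duplicate): the query below should return zero rows once deduplication has succeeded.

-- Any row returned here is a duplicate that survived the delete
SELECT x, count(*)
FROM test
GROUP BY x
HAVING count(*) > 1;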

 

Appendix 1: three methods for deduplicating a PostgreSQL table:

Quoted from: http://my.oschina.net/swuly302/blog/144933

 

The example from the official PostgreSQL 9.2 documentation is used:

CREATE TABLE weather (
    city      varchar(80),
    temp_lo   int,          -- low temperature
    temp_hi   int,          -- high temperature
    prcp      real,         -- precipitation
    date      date
);

INSERT INTO weather VALUES
    ('San Francisco', 46, 50, 0.25, '1994-11-27'),
    ('San Francisco', 43, 57, 0, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29');  -- duplicated row

 

There are three methods:

Method 1: replace the table

-- Copy the deduplicated rows into a new table weather_temp
SELECT DISTINCT city, temp_lo, temp_hi, prcp, date
INTO weather_temp
FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;
Or:

-- Create a table weather_temp with the same structure as weather
CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS);
-- Fill weather_temp with the deduplicated rows
INSERT INTO weather_temp SELECT DISTINCT * FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;
Easy to understand, but it relies on destructive operations such as DROP, and with large data volumes it costs both time and space. Not recommended.

Method 2: add a column
-- Add a new column of type serial
ALTER TABLE weather ADD COLUMN id SERIAL;
-- Delete the duplicate rows
DELETE FROM weather
WHERE id NOT IN (
    SELECT max(id)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);
-- Drop the added column
ALTER TABLE weather DROP COLUMN id;
This requires adding a column. "It is unclear how Postgres handles adding a column: does it append to the original table in place, or copy it into a new table?" If it appends in place, the new column may push rows onto new pages (typical block size: 8 KB); if it copies the table, that is even worse. Not good.
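
The quoted doubt can be settled empirically (a sketch; pg_relation_filenode is available from PostgreSQL 9.0): a table rewrite assigns a new on-disk file, so if the reported filenode changes across the ALTER, the table was copied rather than appended to in place.

-- Note the filenode, add the column, then compare
SELECT pg_relation_filenode('weather');
ALTER TABLE weather ADD COLUMN id SERIAL;
SELECT pg_relation_filenode('weather');  -- a different value means the table was rewritten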

Method 3: system columns [see System Columns]

DELETE FROM weather
WHERE ctid NOT IN (
    SELECT max(ctid)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);
Narrowly applicable [Postgres-specific], but simple.

 

 

 

---------------- For a GreenPlum table, however, the data is split across segments, so ctid alone cannot be used for deduplication.

 

Appendix 2:

https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-

What is the most efficient way of deleting duplicate records from a table?

Currently we use Primary Keys to avoid loading duplicate data into our tables, but PKs bring many restrictions. Since we can't easily identify or prevent duplicates arriving from the variety of 3rd-party upstream systems, we wanted to investigate the 'load everything, remove duplicates afterwards' approach.

In Postgres, you can use an efficient method such as:

DELETE FROM test
WHERE ctid NOT IN (
    SELECT min(ctid)
    FROM test
    GROUP BY x);
(where 'x' is the unique column list)

 

However, in Greenplum 'ctid' is only unique per segment.

One approach would be:

DELETE FROM test USING (
    SELECT gp_segment_id, ctid
    FROM (
        SELECT gp_segment_id, ctid,
               rank() OVER (PARTITION BY x ORDER BY gp_segment_id, ctid) AS rk
        FROM test
    ) foo
    WHERE rk <> 1
) rows_to_delete
WHERE test.gp_segment_id = rows_to_delete.gp_segment_id
AND test.ctid = rows_to_delete.ctid;

 

But the use of window functions, subqueries etc. feels pretty inefficient.

Is there a better form?

Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?
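
For reference, the sequence-column idea raised in the question might look like the sketch below, mirroring the add-a-column method from Appendix 1. load_id is a hypothetical column name, the loading overhead asked about is not measured here, and it assumes the Greenplum version at hand allows adding a serial column to an existing table.

-- Tag each row with a load-time id, dedupe on the tag, then drop it
ALTER TABLE test ADD COLUMN load_id bigserial;
DELETE FROM test
WHERE load_id NOT IN (
    SELECT min(load_id)
    FROM test
    GROUP BY x);
ALTER TABLE test DROP COLUMN load_id;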

 

 

