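I've recently been working on a multithreaded crawler. The work queue contained duplicate entries, and although the program checked whether a row already existed before inserting it, several threads ran that check concurrently, so some duplicate rows still ended up in the database.
The bug in the program has since been fixed, but re-crawling everything would cost a lot of time and effort, so I chose to delete the duplicates and keep just one valid row from each set.
The approach is to group the rows by the combination of fields that uniquely identifies a record (here code, year and report_type) and keep only one row per group.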
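Before touching any rows, it can help to see how large each duplicate group is. The following is a minimal sketch, assuming MySQL and that id is an auto-increment primary key; the aliases copies and keep_id are names chosen here for illustration only.

```sql
-- One line per duplicated key: how many copies exist and which id would survive.
SELECT code, year, report_type,
       COUNT(*) AS copies,   -- number of rows sharing this key
       MIN(id)  AS keep_id   -- the row that will be kept
FROM ZYZBBData
GROUP BY code, year, report_type
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

An empty result would mean there is nothing to clean up.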
1. Find the duplicate rows
SELECT *
FROM ZYZBBData
WHERE (code, year, report_type) IN (
    SELECT code, year, report_type
    FROM (
        SELECT code, year, report_type
        FROM ZYZBBData
        GROUP BY code, year, report_type
        HAVING COUNT(*) > 1
    ) a
);
2. Keep only the row with the smallest id in each group, and select the rows that are to be deleted
SELECT *
FROM ZYZBBData
WHERE (code, year, report_type) IN (
    SELECT code, year, report_type
    FROM (
        SELECT code, year, report_type
        FROM ZYZBBData
        GROUP BY code, year, report_type
        HAVING COUNT(*) > 1
    ) a
)
AND id NOT IN (
    SELECT id
    FROM (
        SELECT MIN(id) AS id
        FROM ZYZBBData
        GROUP BY code, year, report_type
        HAVING COUNT(*) > 1
    ) b
);
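As a sanity check, the number of rows returned by the query above should equal the total number of surplus copies, that is, the sum of (copies - 1) over all duplicate groups. A small sketch, again assuming MySQL; the aliases cnt, g and rows_to_delete are illustrative.

```sql
-- Expected number of rows to delete: every copy beyond the first in each group.
SELECT COALESCE(SUM(cnt - 1), 0) AS rows_to_delete
FROM (
    SELECT COUNT(*) AS cnt
    FROM ZYZBBData
    GROUP BY code, year, report_type
    HAVING COUNT(*) > 1
) g;
```

If this figure and the row count of the previous query disagree, something is off (for example, NULLs in the key columns) and it is worth investigating before deleting.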
3. Delete the duplicate rows
DELETE FROM ZYZBBData
WHERE (code, year, report_type) IN (
    SELECT code, year, report_type
    FROM (
        SELECT code, year, report_type
        FROM ZYZBBData
        GROUP BY code, year, report_type
        HAVING COUNT(*) > 1
    ) a
)
AND id NOT IN (
    SELECT id
    FROM (
        SELECT MIN(id) AS id
        FROM ZYZBBData
        GROUP BY code, year, report_type
        HAVING COUNT(*) > 1
    ) b
);
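The extra derived tables (a and b) in the statement above are presumably there because MySQL refuses to modify a table that is also read directly in a subquery of the same statement (error 1093). An alternative that avoids the nesting is a self-join delete. This is a sketch of a different technique, not the original method; it assumes MySQL's multi-table DELETE syntax and a transactional engine such as InnoDB so that the change can be rolled back. The aliases d and k are illustrative.

```sql
START TRANSACTION;

-- Delete every row d for which another row k with the same key and a smaller id exists.
-- The row with the smallest id in each group has no such k, so it is kept.
DELETE d
FROM ZYZBBData AS d
JOIN ZYZBBData AS k
  ON  k.code        = d.code
  AND k.year        = d.year
  AND k.report_type = d.report_type
  AND k.id          < d.id;

-- Number of rows removed by the DELETE above.
SELECT ROW_COUNT();

-- COMMIT if the number looks right, otherwise ROLLBACK.
COMMIT;
```

Either way, taking a backup of the table before running a bulk delete is a sensible precaution.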
After the delete, the data is back to normal.
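To confirm the cleanup and to guard against the same race in the future, one option is to re-run the grouping query and then let the database enforce uniqueness itself. This is a sketch under assumptions: it presumes MySQL, that the three columns are short enough to be indexed together, and the index name uk_code_year_report_type is made up here.

```sql
-- Should return no rows once the duplicates are gone.
SELECT code, year, report_type, COUNT(*) AS copies
FROM ZYZBBData
GROUP BY code, year, report_type
HAVING COUNT(*) > 1;

-- Make the database reject duplicates, regardless of how many threads insert concurrently.
-- This fails if any duplicates remain, which doubles as a final check.
ALTER TABLE ZYZBBData
    ADD UNIQUE KEY uk_code_year_report_type (code, year, report_type);
```

With the unique key in place, the crawler's "insert if not exists" check could be replaced by INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, which lets the database resolve concurrent inserts of the same key instead of the application.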