更改表user_info的主鍵uid為自增的id后,忘了設置原來主鍵uid屬性為unique,結果導致產生uid重復的記錄。為此需要清理后來插入的重復記錄。
基本方法可以參考后面的附上的資料,但是由於mysql不支持同時對一個表進行操作,即子查詢和要進行的操作不能是同一個表,因此需要通過零時表中轉一下。
寫在前面:數據量大時,一定要多涉及的關鍵字段創建索引!!!否則很慢很慢很慢,慢到想死的心都有了
1 單字段重復
生成零時表,其中uid是需要去重的字段
create table tmp_uid as (select uid from user_info group by uid having count(uid)) create table tmp_id as (select min(id) from user_info group by uid having count(uid))
數據量大時一定要為uid創建索引
create index index_uid on tmp_uid create index index_id on tmp_id
刪除多余的重復記錄,保留重復項中id最小的
delete from user_info where id not in (select id from tmp_id) and uid in (select uid from tmp_uid)
2.多字段重復
由uid的重復間接的導致了relationship中的記錄重復,故繼續去重。先介紹正常處理流程,在介紹本人根據自身數據特點實踐的更加有效的方法!
2.1一般方法
基本的同上面:
生成零時表
create table tmp_relation as (select source,target from relationship group by source,target having count(*)>1) create table tmp_relationship_id as (select min(id) as id from relationship group by source,target having count(*)>1)
創建索引
create index index_id on tmp_relationship_id
刪除
delete from relationship where id not in (select id from tmp_relationship_id) and (source,target) in (select source,target from relationship)
2.2 實踐出真知
實踐中發現上面的刪除字段重復的方法,由於沒有辦法為多字段重建索引,導致數據量大時效率極低,低到無法忍受。最后,受不了等了半天沒反應的狀況,本人決定,另辟蹊徑。
考慮到,估計同一記錄的重復次數比較低。一般為2,或3,重復次數比較集中。所以可以嘗試直接刪除重復項中最大的,直到刪除到不重復,這時其id自然也是當時重復的里邊最小的。
大致流程如下:
1)選擇每個重復項中id最大的一個記錄
create table tmp_relation_id2 as (select max(id) from relationship group by source,target having count(*)>1)
2)創建索引(僅需在第一次時執行)
create index index_id on tmp_relation_id2
3)刪除 重復項中id最大的記錄
delete from relationship where id in (select id from tmp_relation_id2)
4)刪除臨時表
drop table tmp_relation_id2
重復上述步驟1),2),3),4),直到創建的臨時表中不存在記錄就結束(對於重復次數的數據,比較高效)
參考:
查詢及刪除重復記錄的方法 http://wenku.baidu.com/view/d4b3f134b90d6c85ec3ac6ad.html
查詢及刪除重復記錄的方法
(一) 1、查找表中多余的重復記錄,重復記錄是根據單個字段(peopleId)來判斷 select * from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
2、刪除表中多余的重復記錄,重復記錄是根據單個字段(peopleId)來判斷,只留有rowid最小的記錄 delete from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1) and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)
3、查找表中多余的重復記錄(多個字段) select * from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
4、刪除表中多余的重復記錄(多個字段),只留有rowid最小的記錄 delete from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1) and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)