How to Find and Delete Duplicate Files with Python

This article is a repost. Posted 2015-03-30 13:11, category: Python

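Duplicate files are a common nuisance in everyday use: the same file may sit in both directory A and directory B, and, worse, the copies may not even share a file name. With only a few files this is manageable, since at worst you can compare them by hand one by one, though even then it is hard to be sure your eyes catch everything. With many files, manual comparison becomes a mission impossible. I have recently been reading "Python for Unix and Linux System Administration" (Chinese edition: 《Python UNIX和Linux系統管理指南》), which has a section on data comparison. The scripts below build on that material, adapted to practical use.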

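The script consists of four modules: diskwalk, checksum, find_dupes, and delete. diskwalk walks a directory tree: given a path, it outputs every file under that path. checksum computes the MD5 value of a file. find_dupes imports diskwalk and checksum and uses the file size together with the MD5 value to decide whether two files are identical. delete is the deletion module. The details are as follows: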

1. diskwalk.py

import os
import sys


class diskwalk(object):
    """Collect the full path of every file under a directory tree."""

    def __init__(self, path):
        self.path = path

    def paths(self):
        path_collection = []
        # os.walk visits the root directory and every subdirectory below it
        for dirpath, dirnames, filenames in os.walk(self.path):
            for name in filenames:
                path_collection.append(os.path.join(dirpath, name))
        return path_collection


if __name__ == '__main__':
    for path in diskwalk(sys.argv[1]).paths():
        print(path)
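
The paths method above builds the complete list before returning it. For very large trees, one possible variation is a generator that yields each path as it is found. The sketch below is only an illustration: the name iter_paths and the temporary demo tree are assumptions made for this example, not part of the original script.

```python
import os
import tempfile


def iter_paths(root):
    """Yield the full path of every file under root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


# Small demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("demo")
    found = sorted(os.path.relpath(p, root) for p in iter_paths(root))
    print(found)
```

A caller that needs the full list can still write list(iter_paths(root)), so the generator does not take anything away from the list-based version.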

2. checksum.py

import hashlib
import sys


def create_checksum(path):
    """Return the MD5 hex digest of the file at path, read in 8 KB chunks."""
    checksum = hashlib.md5()
    # Binary mode, so the bytes are hashed exactly as stored on disk
    with open(path, 'rb') as fp:
        while True:
            chunk = fp.read(8192)
            if not chunk:
                break
            checksum.update(chunk)
    return checksum.hexdigest()


if __name__ == '__main__':
    print(create_checksum(sys.argv[1]))
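
To see why a checksum is useful here, the following sketch hashes three small files: two with identical bytes but unrelated names, and one with different bytes. The helper name md5_of_file, the file names, and the contents are all made up for this illustration. It assumes nothing beyond the standard hashlib, os, and tempfile modules.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Hash a file chunk by chunk so large files are never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as root:
    contents = {
        "report.txt": b"same bytes",
        "copy_of_report.bak": b"same bytes",
        "other.txt": b"different bytes",
    }
    digests = {}
    for name, data in contents.items():
        path = os.path.join(root, name)
        with open(path, "wb") as fh:
            fh.write(data)
        digests[name] = md5_of_file(path)

same = digests["report.txt"] == digests["copy_of_report.bak"]
differs = digests["report.txt"] != digests["other.txt"]
print(same, differs)
```

The file name plays no part in the digest, which is what lets the later comparison catch copies that were renamed.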

3. find_dupes.py

import sys
from os.path import getsize

from checksum import create_checksum
from diskwalk import diskwalk


def findDupes(path):
    """Map each duplicate file to the first file seen with the same content."""
    record = {}  # (size, md5) -> first file seen with that key
    dup = {}     # duplicate file -> original file
    for file in diskwalk(path).paths():
        compound_key = (getsize(file), create_checksum(file))
        if compound_key in record:
            dup[file] = record[compound_key]
        else:
            record[compound_key] = file
    return dup


if __name__ == '__main__':
    for duplicate, original in findDupes(sys.argv[1]).items():
        print("The duplicate file is %s" % duplicate)
        print("The original file is %s\n" % original)

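The findDupes function returns the dictionary dup, whose keys are the duplicate files and whose values are the corresponding original files (the first file encountered with that size and MD5 value). This answers a question many people have: how can you be sure that what you output really is a duplicate? Every reported duplicate is paired with the file it matches, so each pair can be checked.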
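
The shape of that dictionary is easier to see on a concrete tree. The sketch below is a self-contained approximation of the same idea, not the script itself: the function name find_dupes, the directory and file names, and the sorting step are assumptions made for this example. Sorting is added because which copy counts as the original depends on the order in which files are visited, and os.walk does not promise a particular order of file names.

```python
import hashlib
import os
import tempfile


def find_dupes(root):
    """Return {duplicate_path: original_path} for files with equal size and MD5."""
    record = {}
    dup = {}
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in filenames)
    # Sorting is an addition in this sketch: it makes the "original" predictable
    for path in sorted(paths):
        with open(path, "rb") as fh:
            # Reading the whole file at once is acceptable for tiny demo files
            key = (os.path.getsize(path), hashlib.md5(fh.read()).hexdigest())
        if key in record:
            dup[path] = record[key]
        else:
            record[key] = path
    return dup


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "A"))
    os.makedirs(os.path.join(root, "B"))
    files = {
        ("A", "photo.jpg"): b"pixels",
        ("B", "img_001.jpg"): b"pixels",
        ("B", "notes.txt"): b"text",
    }
    for parts, data in files.items():
        with open(os.path.join(root, *parts), "wb") as fh:
            fh.write(data)
    result = {
        os.path.relpath(k, root): os.path.relpath(v, root)
        for k, v in find_dupes(root).items()
    }
    print(result)
```

Here the copy under B is reported as the duplicate and the copy under A as the original, purely because A sorts first. Neither file is more "original" than the other in any deeper sense.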

4. delete.py

import os
import sys

try:
    input = raw_input  # Python 2: raw_input returns the typed text as a string
except NameError:
    pass  # Python 3: input already behaves that way


class deletefile(object):
    def __init__(self, file):
        self.file = file

    def delete(self):
        print("Deleting %s" % self.file)
        os.remove(self.file)

    def dryrun(self):
        print("Dry Run: %s [NOT DELETED]" % self.file)

    def interactive(self):
        answer = input("Do you really want to delete: %s [Y/N] " % self.file)
        if answer.strip().upper() == 'Y':
            os.remove(self.file)
        else:
            print("Skipping: %s" % self.file)


if __name__ == '__main__':
    from find_dupes import findDupes
    dup = findDupes(sys.argv[1])
    for file in dup:
        delete = deletefile(file)
        # delete.dryrun()
        delete.interactive()
        # delete.delete()

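The deletefile class defines three methods for handling a file that is to be removed. delete removes the file directly. dryrun is a trial run that only reports the file without deleting it. interactive asks the user to confirm each deletion. Together these cover the different levels of caution a user may want.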
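
In the script the mode is chosen by commenting lines in and out. One alternative, sketched here under assumptions, is to pass the mode as an argument. The function remove_file, its mode strings, and the ask parameter are hypothetical names introduced for this illustration. The ask parameter exists only so the confirmation prompt can be replaced in a demonstration or test.

```python
import os
import tempfile


def remove_file(path, mode="dryrun", ask=input):
    """Apply one of the three modes to path; return True if the file was deleted."""
    if mode == "dryrun":
        print("Dry Run: %s [NOT DELETED]" % path)
        return False
    if mode == "interactive":
        answer = ask("Do you really want to delete: %s [Y/N] " % path)
        if answer.strip().upper() != "Y":
            print("Skipping: %s" % path)
            return False
    elif mode != "delete":
        raise ValueError("unknown mode: %r" % mode)
    # Reached for "delete", or for "interactive" after a Y answer
    print("Deleting %s" % path)
    os.remove(path)
    return True


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "dup.txt")
    with open(target, "w") as fh:
        fh.write("x")
    kept_after_dryrun = (not remove_file(target)) and os.path.exists(target)
    kept_after_no = (
        not remove_file(target, "interactive", ask=lambda prompt: "n")
    ) and os.path.exists(target)
    gone_after_delete = remove_file(target, "delete") and not os.path.exists(target)
    print(kept_after_dryrun, kept_after_no, gone_after_delete)
```

Defaulting to the dry run means that forgetting to pass a mode never removes anything, which seems the safer default for a tool that deletes files.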

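Summary: the four modules are self-contained, and each can be run on its own for its own function. Combined, they delete duplicate files in bulk given nothing more than a path.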
