Finding Duplicate Files in the Current Directory

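On Ubuntu, the fdupes program can be used to report which files under the current directory are duplicates. Because it is a compiled binary, it runs efficiently.

A Python implementation is much slower. On the same directory, fdupes took about 2 seconds while the Python program took 15 seconds, roughly a 7-8x difference.

So the code here is only a demonstration. It is based on the book "Python for Unix and Linux System Administration".

The code is as follows: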
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
__DOC__ = '''
    identifies duplicate files within given directories including subdirectories
     pydupes.py uses files' size and md5sums to find
     duplicate files within a set of directories.
    '''
__Author__ = 'wgzhao,wgzhao##gmail.com'

import os
from hashlib import md5
from sys import exit, argv

class diskwalk(object):
    """API for getting directory walking collections"""
    def __init__(self, path):
        self.path = path
    def enumeratePaths(self):
        """Returns the path to all the files in a directory as a list"""
        path_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for filename in filenames:
                fullpath = os.path.join(dirpath, filename)
                path_collection.append(fullpath)
        return path_collection

    def enumerateFiles(self):
        """Returns all the files in a directory as a list"""
        file_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for filename in filenames:
                file_collection.append(filename)
        return file_collection

    def enumerateDir(self):
        """Returns all the directories in a directory as a list"""
        dir_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for dirname in dirnames:
                dir_collection.append(dirname)
        return dir_collection

def create_checksum(path):
    """
    Reads in file in 8 KB chunks. Creates checksum of file chunk by chunk.
    Returns complete MD5 digest for file.
    """
    checksum = md5()
    # Binary mode: hash the raw bytes, with no decoding or newline translation
    with open(path, 'rb') as fp:
        while True:
            buffer = fp.read(8192)
            if not buffer:
                break
            checksum.update(buffer)
    return checksum.digest()

def findDupes(path = '/tmp'):
    dup = {}     # compound key -> list of all paths sharing that key
    record = {}  # compound key -> first path seen with that key
    d = diskwalk(path)
    files = d.enumeratePaths()
    for file in files:
        compound_key = (os.path.getsize(file), create_checksum(file))
        if compound_key in record:
            if compound_key not in dup:
                # First duplicate found: start the group with the original
                # file, once only (re-adding it on every later match would
                # list it repeatedly)
                dup[compound_key] = [record[compound_key]]
            dup[compound_key].append(file)
        else:
            #print("Creating compound key record:", compound_key)
            record[compound_key] = file
    return dup

if __name__ == "__main__":

    if len(argv) < 2:
        try:
            path = input('pls input dir:')
        except (EOFError, KeyboardInterrupt):
            path = ''
    else:
        path = argv[1]
    if not path:
        exit(1)

    dupes = findDupes(path)

    # Print each group of duplicates, separated by a blank line
    for k, v in dupes.items():
        print()
        for filename in v:
            print(filename)

Running the script produces output like the following:
$ pydupes Dropbox/
....
Dropbox/Public/rhce-ts-9.0-1.2.noarch.rpm
Dropbox/Public/Linux/rhce-ts-9.0-1.2.noarch.rpm

Dropbox/repos/bolebi/hooks/pre-revprop-change.tmpl
Dropbox/repos/lshc/hooks/pre-revprop-change.tmpl
....

Dropbox/Public/libflashplayer.so
Dropbox/Public/Linux/libflashplayer.so
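
The script above computes an MD5 checksum for every file it visits, including files whose size is unique and which therefore cannot have a duplicate. The following is a minimal sketch of one possible refinement, not part of the original script: group the files by size first, then hash only the files that share a size with at least one other file. The names `md5_of_file` and `find_dupes_by_size_first` are hypothetical, and skipping symbolic links is an assumption made here for simplicity.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=8192):
    """Return the MD5 digest of a file, read in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def find_dupes_by_size_first(root):
    """Return sorted groups of duplicate paths found under root."""
    # Pass 1: bucket every regular file by its size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Assumption: symlinks and non-regular files are ignored.
            if os.path.isfile(full) and not os.path.islink(full):
                by_size[os.path.getsize(full)].append(full)

    # Pass 2: hash only the files whose size is shared with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of_file(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)


if __name__ == "__main__":
    import sys

    # Usage: python thisfile.py <directory>
    for group in find_dupes_by_size_first(sys.argv[1] if len(sys.argv) > 1 else "."):
        print()
        for path in group:
            print(path)
```

How much time this saves depends on the directory: when many files happen to share a size, most of them still have to be hashed.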

Article information

Author: wgzhao  Source: Linux|系统管理|WEB开发
Tags: python, directory, duplicates
Published: 2010-06-12 09:55:16
