Finding Duplicate Files in the Current Directory

Linux|System Administration|Web Development 2010-06-12 09:55:16

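Overview

When disk space runs low for no obvious reason, or an archive you are tidying seems full of redundant files, quickly locating exact copies becomes a practical need. This article covers how to do that efficiently on Linux.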

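The author focuses on fdupes, a dedicated tool available on Ubuntu. Unlike script-based approaches, it is a compiled binary written in C, which makes it noticeably faster on large numbers of files. Its core logic is to compare file sizes and MD5 signatures, then confirm candidate matches with a byte-by-byte comparison, so files are reported as duplicates only when their contents are identical.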

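For operations staff and data management work, this kind of tool is very practical. It lists the paths of every duplicate file so you can choose which copy to keep and safely delete or replace the others, reclaiming storage space. Rather than just listing tools, the article shows the tool solving the problem and compares its speed with a Python script.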

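On Ubuntu there is a program called fdupes that lists the duplicate files under the current directory. Because it is a compiled binary, it runs efficiently.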

A Python implementation is much slower: on the same directory, fdupes took about 2 seconds while the Python program took 15 seconds, roughly a 7-8x difference.

So the code below is only a demonstration. It is adapted from the book Python for Unix and Linux System Administration.

Code snippet:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__DOC__ = '''
    identifies duplicate files within given directories including subdirectories
     pydupes.py uses files' size and md5sums to find
     duplicate files within a set of directories.
    '''
__Author__ = 'wgzhao,wgzhao##gmail.com'

import os
from hashlib import md5
from sys import exit, argv

class diskwalk(object):
    """API for getting directory walking collections"""
    def __init__(self, path):
        self.path = path

    def enumeratePaths(self):
        """Returns the path to all the files in a directory as a list"""
        path_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                fullpath = os.path.join(dirpath, file)
                path_collection.append(fullpath)
        return path_collection

    def enumerateFiles(self):
        """Returns all the files in a directory as a list"""
        file_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for file in filenames:
                file_collection.append(file)
        return file_collection

    def enumerateDir(self):
        """Returns all the directories in a directory as a list"""
        dir_collection = []
        for dirpath, dirnames, filenames in os.walk(self.path):
            for dir in dirnames:
                dir_collection.append(dir)
        return dir_collection

def create_checksum(path):
    """
    Reads in file in 8192-byte chunks and feeds each chunk to MD5.
    Returns the complete MD5 digest for the file.
    """
    checksum = md5()
    # Binary mode: hashing needs raw bytes, not decoded text
    with open(path, 'rb') as fp:
        while True:
            buffer = fp.read(8192)
            if not buffer:
                break
            checksum.update(buffer)
    return checksum.digest()

def findDupes(path = '/tmp'):
    dup = {}      # compound key -> every path sharing that key
    record = {}   # compound key -> first path seen with that key
    d = diskwalk(path)
    files = d.enumeratePaths()
    for file in files:
        try:
            compound_key = (os.path.getsize(file), create_checksum(file))
        except OSError:
            # Skip broken symlinks and files that cannot be read
            continue
        if compound_key in record:
            # Add the first-seen path only once, when the group is created
            if compound_key not in dup:
                dup[compound_key] = [ record[compound_key] ]
            dup[compound_key].append(file)
        else:
            #print("Creating compound key record:", compound_key)
            record[compound_key] = file
    return dup

if __name__ == "__main__":

    if len(argv) < 2:
        try:
            path = input('pls input dir:')
        except (EOFError, KeyboardInterrupt):
            path = ''
    else:
        path = argv[1]
    if not path:
        exit(1)

    dupes = findDupes(path)

    # Print each group of duplicates, separated by a blank line
    for k, v in dupes.items():
        print()
        for filename in v:
            print(filename)
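
The script hashes every file it visits. A common refinement is to group files by size first and hash only the sizes that occur more than once, since a file with a unique size cannot have a duplicate. The sketch below illustrates that idea; it is not taken from the book and has not been measured against the timings quoted earlier. The names `md5_of` and `find_dupes_by_size_first` are hypothetical, and unlike the script this version skips symlinks.

```python
import os
from collections import defaultdict
from hashlib import md5


def md5_of(path, chunk_size=8192):
    """Return the MD5 digest of a file, read in binary chunks."""
    checksum = md5()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            checksum.update(chunk)
    return checksum.digest()


def find_dupes_by_size_first(root):
    """Return lists of paths under root that share both size and MD5 digest."""
    by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            fullpath = os.path.join(dirpath, name)
            # Assumption: only regular, non-symlink files are of interest
            if os.path.isfile(fullpath) and not os.path.islink(fullpath):
                by_size[os.path.getsize(fullpath)].append(fullpath)

    groups = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size, so no hashing is needed
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[md5_of(path)].append(path)
        # Keep only digests shared by at least two files
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

Calling `find_dupes_by_size_first('Dropbox/')` would return a list of path lists, one list per set of duplicates.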

A sample run looks like this:

$ pydupes Dropbox/
....
Dropbox/Public/rhce-ts-9.0-1.2.noarch.rpm
Dropbox/Public/Linux/rhce-ts-9.0-1.2.noarch.rpm

Dropbox/repos/bolebi/hooks/pre-revprop-change.tmpl
Dropbox/repos/lshc/hooks/pre-revprop-change.tmpl
....

Dropbox/Public/libflashplayer.so
Dropbox/Public/Linux/libflashplayer.so
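
Before deleting anything from groups like these, it may be worth confirming that the files really are byte-for-byte identical and estimating how much space removal would free. The following is a minimal sketch that takes a list of path groups as input. `confirmed_duplicates` and `reclaimable_bytes` are hypothetical helper names, and nothing here deletes files.

```python
import filecmp
import os


def confirmed_duplicates(paths):
    """Return paths[0] plus every other path that is byte-for-byte identical to it."""
    first = paths[0]
    # shallow=False makes filecmp compare file contents, not just os.stat() data
    return [first] + [p for p in paths[1:] if filecmp.cmp(first, p, shallow=False)]


def reclaimable_bytes(groups):
    """Sum the bytes freed if one copy from each confirmed group is kept."""
    total = 0
    for paths in groups:
        same = confirmed_duplicates(paths)
        # One copy stays, so only the other len(same) - 1 copies count
        total += os.path.getsize(same[0]) * (len(same) - 1)
    return total
```

With the `dupes` dictionary returned by `findDupes`, this could be called as `reclaimable_bytes(dupes.values())`.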
