Finding Duplicate Files in the Current Directory
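Ubuntu provides a program called fdupes that reports which files under the current directory are duplicates. Because it is a compiled binary, it runs faster.
The Python implementation is much slower: on the same directory, fdupes took about 2 seconds while the Python program took about 15 seconds, a difference of roughly 7-8x.
So this code is only a demonstration. It is adapted from the book "Python for Unix and Linux System Administration".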
Code listing:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    __DOC__ = '''
    identifies duplicate files within given directories including subdirectories
    pydupes.py uses files' size and md5sums to find duplicate files
    within a set of directories.
    '''
    __Author__ = 'wgzhao,wgzhao##gmail.com'

    import os
    from hashlib import md5
    from sys import exit, argv


    class diskwalk(object):
        """API for getting directory walking collections"""

        def __init__(self, path):
            self.path = path

        def enumeratePaths(self):
            """Returns the path to all the files in a directory as a list"""
            path_collection = []
            for dirpath, dirnames, filenames in os.walk(self.path):
                for file in filenames:
                    fullpath = os.path.join(dirpath, file)
                    path_collection.append(fullpath)
            return path_collection

        def enumerateFiles(self):
            """Returns all the files in a directory as a list"""
            file_collection = []
            for dirpath, dirnames, filenames in os.walk(self.path):
                for file in filenames:
                    file_collection.append(file)
            return file_collection

        def enumerateDir(self):
            """Returns all the directories in a directory as a list"""
            dir_collection = []
            for dirpath, dirnames, filenames in os.walk(self.path):
                for dirname in dirnames:
                    dir_collection.append(dirname)
            return dir_collection


    def create_checksum(path):
        """
        Reads in file in 8192-byte chunks.
        Returns the complete MD5 digest for the file.
        """
        checksum = md5()
        # Binary mode, so the digest covers the raw bytes of the file
        with open(path, 'rb') as fp:
            while True:
                buffer = fp.read(8192)
                if not buffer:
                    break
                checksum.update(buffer)
        return checksum.digest()


    def findDupes(path='/tmp'):
        dup = {}     # (size, checksum) -> every path that shares this key
        record = {}  # (size, checksum) -> first path seen with this key
        d = diskwalk(path)
        files = d.enumeratePaths()
        for file in files:
            compound_key = (os.path.getsize(file), create_checksum(file))
            if compound_key in record:
                if compound_key in dup:
                    # Third or later copy: the first path is already listed
                    dup[compound_key].append(file)
                else:
                    dup[compound_key] = [record[compound_key], file]
            else:
                # print("Creating compound key record:", compound_key)
                record[compound_key] = file
        return dup


    if __name__ == "__main__":
        if len(argv) < 2:
            try:
                path = input('pls input dir:')
            except (EOFError, KeyboardInterrupt):
                path = ''
        else:
            path = argv[1]
        if not path:
            exit(1)
        dupes = findDupes(path)
        for k, v in dupes.items():
            for filename in v:
                print(filename)
A run looks something like this:

    $ pydupes Dropbox/
    ....
    Dropbox/Public/rhce-ts-9.0-1.2.noarch.rpm
    Dropbox/Public/Linux/rhce-ts-9.0-1.2.noarch.rpm
    Dropbox/repos/bolebi/hooks/pre-revprop-change.tmpl
    Dropbox/repos/lshc/hooks/pre-revprop-change.tmpl
    ....
    Dropbox/Public/libflashplayer.so
    Dropbox/Public/Linux/libflashplayer.so
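The script computes an MD5 checksum for every file it visits, even for files whose size matches no other file. That is probably one contributor to the slowdown mentioned earlier, although this was not measured here. The following is a minimal sketch of a size-first variant, not the original author's method: it groups paths by size and hashes only the files that share a size with at least one other file. The names `file_md5` and `find_dupes_by_size_first` are hypothetical and were chosen for this illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, read in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_dupes_by_size_first(root):
    """Map (size, md5) to the sorted list of paths that share both values."""
    # Pass 1: bucket regular files by size (cheap, metadata only).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                by_size[os.path.getsize(full)].append(full)

    # Pass 2: hash only the files whose size is shared with another file.
    dupes = {}
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_md5(path)].append(path)
        for checksum, same in by_hash.items():
            if len(same) > 1:
                dupes[(size, checksum)] = sorted(same)
    return dupes
```

Calling `find_dupes_by_size_first('Dropbox/')` would return a dictionary with one entry per set of duplicates, so each group can be printed together instead of as one flat list. Symbolic links are skipped in this sketch, which is a design choice of the example and not something the original script does.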
Article information
- Author: wgzhao Source: Linux|系统管理|WEB开发
- Tags: python, directory, duplicates
- Published: 2010-06-12 09:55:16