e42.uk Circle Device

 


Finding Duplicate Files with md5sum

First, make the file list, hash every file in it, and sort the result by hash:

find . -type f > myfiles.filelist
while IFS= read -r filename ; do md5sum "$filename" ; done < myfiles.filelist > myfiles.md5sum
sort myfiles.md5sum > myfiles.md5sum.sorted

I chose a simple way to make the file list here. You may have seen other approaches using xargs and pipes; these are fine too, but I needed the approach above because I was running my check on my Western Digital MyBookWorld White Light edition, which is restricted in memory and other functionality. I only have busybox :-(.
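
For comparison, here is a minimal sketch of the xargs style mentioned above. It assumes a find that supports -print0 and an xargs that supports -0 (GNU findutils does; a given busybox build may not), and the function name hash_tree is made up for illustration.

```shell
# Hypothetical helper: hash every regular file under a directory
# and print the "hash  path" lines sorted by hash.
hash_tree() {
	# -print0 / -0 keep filenames containing spaces intact.
	find "$1" -type f -print0 | xargs -0 md5sum | sort
}

# Usage: hash_tree . > myfiles.md5sum.sorted
```

This does the listing, hashing and sorting in one pipeline, so no intermediate files are written.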

Once you have the sorted list, you must check it for duplicate md5 hashes (the hash is at the start of each line). The Perl program below finds files with matching hashes and prints each group of paths on a single line, separated by tab characters. Don't be scared: this is simple Perl without any clever optimisations, just a simple loop looking for duplicate strings!

#!/usr/bin/perl

#
# This takes a SORTED list of lines (a key, then a space, then
# anything) and writes one line for every key that occurs more
# than once: the text after the key from each matching line,
# separated by tabs.
# Will work with no spaces, just line terminators.
#

use strict;
use warnings;

if (scalar @ARGV != 2 || $ARGV[0] eq '' || $ARGV[1] eq '') {
	die "No files specified, usage: dedupe.pl <input> <output>\n";
}

my $file = $ARGV[0];
# Three-argument open, so odd characters in a name are never read as a mode.
open my $info, '<', $file or die "Could not open $file: $!\n";
open my $outpfl, '>', $ARGV[1] or die "Cannot open output file $ARGV[1]: $!\n";
binmode $outpfl;

sub trim($)
{
	my $string = shift;
	$string =~ s/^\s+//;
	$string =~ s/\s+$//;
	return $string;
}

my $prev_sum;          # hash from the previous line
my $prev_fn;           # filename from the previous line
my @filenames = ();    # earlier files that share the current hash

# Print the collected group of duplicates (if any) on one line,
# tab separated, then start a new group.
sub flush_group
{
	if (@filenames) {
		print($outpfl join("\t", @filenames, $prev_fn), "\n");
		@filenames = ();
	}
}

while (my $line = <$info>) {
	my @res = split(' ', $line);
	next unless @res;    # skip blank lines
	my $curr_fn = trim(substr($line, length($res[0])));
	if (defined $prev_sum && $res[0] eq $prev_sum) {
		# Same hash as the previous line: remember the previous file.
		push(@filenames, $prev_fn);
	} else {
		# The hash changed: write out the group that just ended.
		flush_group();
	}
	$prev_sum = $res[0];
	$prev_fn = $curr_fn;
}
# The last group is not followed by a change of hash, so flush it here.
flush_group();

close $info;
close $outpfl;
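
If Perl is not available, the same grouping can be sketched with awk, which busybox usually includes. This is an alternative under stated assumptions, not the script above: the function name group_dupes is made up for illustration, and the input is assumed to be the sorted "hash  path" list on standard input.

```shell
# Hypothetical helper: read a hash-sorted "hash  path" list on stdin and
# print each group of paths sharing a hash on one tab-separated line.
group_dupes() {
	awk '
	NF == 0 { next }                          # skip blank lines
	{
		sum = $1 ""                           # force a string comparison
		name = $0
		sub(/^[^ \t]+[ \t]+/, "", name)       # strip the hash and the gap after it
		if (sum == prev_sum) {
			group = group "\t" name
			n++
		} else {
			if (n > 1) print group            # previous group had duplicates
			group = name
			n = 1
		}
		prev_sum = sum
	}
	END { if (n > 1) print group }            # flush the final group
	'
}

# Usage: group_dupes < myfiles.md5sum.sorted > myfiles.dupes
```

Files with a unique hash produce no output; each output line holds two or more paths whose hashes match.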

Well, that is it, simple as pie. You can, of course, replace md5sum with sha1sum, sha256sum or whatever hash function you like.
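
Swapping the hash can be sketched by passing the tool name in as an argument. The function name hash_list is made up for illustration, and the chosen tool (sha256sum in the usage line) is assumed to be installed. Since only the text before the first space is compared, the longer hashes need no change to the duplicate check.

```shell
# Hypothetical helper: read file paths on stdin, hash each one with the
# tool named in the first argument, and print the lines sorted by hash.
hash_list() {
	hasher=$1
	while IFS= read -r filename ; do
		"$hasher" "$filename"
	done | sort
}

# Usage: find . -type f | hash_list sha256sum > myfiles.sha256sum.sorted
```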
