Find identical files

Lars von Wedel · Jan 26, 2004

Hello,

I have a rather large number of files and I would like to check for
duplicates. Name and date are not relevant.

I found the MD5 class, which seems to be suitable. Can I use the returned
byte[] as an index in a hashtable in order to create a mapping from
key to identical files (i.e. their names)?

Lars

Peter Rilling · Jan 26, 2004

The byte[] probably cannot be used as an index in this case because each
byte[] is a different object in memory, and the hashtable compares arrays by
reference rather than by content, so two identical hashes would not be
recognized as the same key. Try converting the bytes into a string (maybe
Base64-encode them) and use that as your index.
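
To make the point above concrete, here is a minimal sketch. It is written in Java purely for illustration (the thread does not name its language); Java arrays, like .NET arrays, are hashed and compared by reference. The class name `DigestKeys` and its methods are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class DigestKeys {
    // MD5 digest of some content.
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    // Converts a digest to a Base64 string; strings are compared by value.
    static String toKey(byte[] digest) {
        return Base64.getEncoder().encodeToString(digest);
    }

    // Stores digest a as a raw array key, then looks up digest b.
    static boolean rawArrayLookupWorks(byte[] a, byte[] b) {
        Map<byte[], String> table = new HashMap<>();
        table.put(a, "first");
        return table.containsKey(b); // compares array references, not contents
    }

    // Same experiment, but with the digests converted to strings first.
    static boolean stringLookupWorks(byte[] a, byte[] b) {
        Map<String, String> table = new HashMap<>();
        table.put(toKey(a), "first");
        return table.containsKey(toKey(b)); // compares string contents
    }
}
```

Given two separately computed digests of the same content, `rawArrayLookupWorks` returns false while `stringLookupWorks` returns true, which is why the string conversion is suggested.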

Chad Z. Hower aka Kudzu · Jan 26, 2004

Lars von Wedel said:
I have a rather large number of files and I would like to check for
duplicates. Name or date are not relevant.

I found the MD5 class which seems to be suitable. Can I use the returned
byte[] as an index in a hashtable in order to create a mapping from
key to identical files (i.e. their names)?

Depending on your data, CRC32 might be enough, and its result can be held in
an integer. CRC32 rarely returns the same value for different files. On any
suspected duplicates you run a secondary check using MD5 or a raw
byte-by-byte compare.

The advantage is that it returns integers, which are easier to index, faster,
etc.
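
The two-stage idea described above could be sketched roughly as follows. This is an illustration in Java, not code from the thread: the class `CrcPrefilter` and its methods are hypothetical, in-memory byte arrays stand in for file contents, and the secondary check used here is the raw compare. Note that Java's `CRC32.getValue()` returns the 32-bit checksum in a `long`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical helper class, for illustration only.
class CrcPrefilter {
    // CRC32 checksum of the content (an unsigned 32-bit value held in a long).
    static long crc32(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content);
        return crc.getValue();
    }

    // Returns index pairs {i, j} with i < j whose contents are identical.
    static List<int[]> findDuplicates(List<byte[]> contents) {
        Map<Long, List<Integer>> byCrc = new HashMap<>();
        List<int[]> duplicates = new ArrayList<>();
        for (int j = 0; j < contents.size(); j++) {
            long key = crc32(contents.get(j));
            List<Integer> candidates = byCrc.computeIfAbsent(key, k -> new ArrayList<>());
            // Equal checksums only mark suspects; confirm with a raw compare.
            for (int i : candidates) {
                if (Arrays.equals(contents.get(i), contents.get(j))) {
                    duplicates.add(new int[] {i, j});
                }
            }
            candidates.add(j);
        }
        return duplicates;
    }
}
```

The expensive comparison only runs for entries whose checksums already match, so most non-duplicates are rejected by a cheap integer lookup.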

--
Chad Z. Hower (a.k.a. Kudzu) - http://www.hower.org/Kudzu/
"Programming is an art form that fights back"

Lars von Wedel · Jan 26, 2004

Hi,

I found the MD5 class which seems to be suitable. Can I use the
returned byte[] as an index in a hashtable in order to create a
mapping from key to identical files (i.e. their names)?

Depending on your data, CRC32 might be enough and can be held in an
integer. CRC32 rarely returns dups. On any suspected ones you run a
secondary check using MD5, or raw compare.

Good point. However, I have already finished an implementation using MD5,
which works fine in combination with a hashtable. From a runtime or memory
point of view, I see no reason to switch to a more lightweight approach.

So, in case anyone's interested in a snippet...

Lars
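
The snippet itself does not appear in the thread. As a rough idea of what an MD5-plus-hashtable approach can look like, here is a minimal sketch in Java. It is not Lars's code; the class `DuplicateFinder` and its methods are hypothetical, and it assumes the files are small enough to be read into memory in one go.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class DuplicateFinder {
    // Base64-encoded MD5 of a file's contents, usable as a hashtable key.
    static String md5Key(Path file) throws IOException {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: reading the whole file at once is acceptable.
            byte[] digest = md5.digest(Files.readAllBytes(file));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    // Maps each digest to the files that share it, keeping only groups
    // that contain more than one file.
    static Map<String, List<Path>> groupDuplicates(List<Path> files) throws IOException {
        Map<String, List<Path>> groups = new HashMap<>();
        for (Path file : files) {
            groups.computeIfAbsent(md5Key(file), k -> new ArrayList<>()).add(file);
        }
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }
}
```

Calling `groupDuplicates` with a list of paths returns one entry per set of files with matching digests. File names and dates play no part, which matches the original requirement.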

Gary Morris · Jan 27, 2004

I'd like to have a look at a snippet.. that sounds interesting.

gwmorris [AT] hotpop [DOT] com

if it's not too much trouble!

