Find identical files

Lars von Wedel

Hello,

I have a rather large number of files and I would like to check for
duplicates. Name or date are not relevant.

I found the MD5 class, which seems to be suitable. Can I use the returned
byte[] as an index in a hashtable in order to create a mapping from
key to identical files (i.e. their names)?

Lars
 
The byte[] probably cannot be used as an index in this case, because each
byte[] is a separate object in memory and arrays are compared by reference.
Two arrays holding the same hash would therefore be treated as different keys.
Try converting the bytes into a string (Base64 or hex, for example) and use
that as your index.
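
To illustrate the point about array keys, here is a minimal sketch. It is written in Java as an assumption, since the thread shows no code and seems to concern .NET; arrays behave the same way as hashtable keys in both. The class name `DigestKeyDemo` and the helpers `md5` and `toHex` are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class DigestKeyDemo {
    // Hypothetical helper: MD5 digest of a byte array.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Hypothetical helper: lowercase hex string, usable as a map key.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] first = md5("same content".getBytes(StandardCharsets.UTF_8));
        byte[] second = md5("same content".getBytes(StandardCharsets.UTF_8));

        // Arrays use reference equality, so an equal digest is not found.
        Map<byte[], String> byArray = new HashMap<>();
        byArray.put(first, "a.txt");
        System.out.println(byArray.containsKey(second)); // false

        // Strings compare by content, so the lookup succeeds.
        Map<String, String> byString = new HashMap<>();
        byString.put(toHex(first), "a.txt");
        System.out.println(byString.containsKey(toHex(second))); // true
    }
}
```

Hex is used here instead of Base64 only because it is easier to read in logs; either encoding works as a key.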
 
Lars von Wedel said:
I have a rather large number of files and I would like to check for
duplicates. Name or date are not relevant.

I found the MD5 class, which seems to be suitable. Can I use the returned
byte[] as an index in a hashtable in order to create a mapping from
key to identical files (i.e. their names)?

Depending on your data, CRC32 might be enough, and its result can be held in
an integer. CRC32 rarely returns the same value for different files, at least
when the number of files is modest. On any suspected duplicates you run a
secondary check using MD5 or a raw byte comparison.

The advantage is that it returns integers, which are easier to index, faster
to compare, etc.
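
The two-stage idea can be sketched roughly as follows, again in Java as an assumption. The class `CrcPrefilter` and its methods are hypothetical, and the contents are held in memory to keep the example short; real code would read them from files. Note that Java's `CRC32.getValue()` returns the 32-bit checksum in a `long`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

class CrcPrefilter {
    // CRC32 checksum of a byte array (unsigned 32-bit value in a long).
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Returns groups of names whose contents are byte-for-byte identical.
    static List<List<String>> findDuplicates(Map<String, byte[]> contents) {
        // Stage 1: cheap bucketing by CRC32.
        Map<Long, List<String>> buckets = new HashMap<>();
        for (Map.Entry<String, byte[]> e : contents.entrySet()) {
            buckets.computeIfAbsent(crc32(e.getValue()), k -> new ArrayList<>())
                   .add(e.getKey());
        }

        // Stage 2: confirm suspects with a raw comparison, since different
        // contents can share a CRC32 value.
        List<List<String>> result = new ArrayList<>();
        for (List<String> bucket : buckets.values()) {
            List<List<String>> groups = new ArrayList<>();
            for (String name : bucket) {
                List<String> match = null;
                for (List<String> group : groups) {
                    byte[] representative = contents.get(group.get(0));
                    if (Arrays.equals(representative, contents.get(name))) {
                        match = group;
                        break;
                    }
                }
                if (match == null) {
                    match = new ArrayList<>();
                    groups.add(match);
                }
                match.add(name);
            }
            for (List<String> group : groups) {
                if (group.size() > 1) {
                    result.add(group);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        contents.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        contents.put("b.txt", "hello".getBytes(StandardCharsets.UTF_8));
        contents.put("c.txt", "world".getBytes(StandardCharsets.UTF_8));
        System.out.println(findDuplicates(contents)); // [[a.txt, b.txt]]
    }
}
```

The raw comparison in the second stage could be swapped for an MD5 check, as the post suggests; the byte comparison is used here because it leaves no room for a false match.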


--
Chad Z. Hower (a.k.a. Kudzu) - http://www.hower.org/Kudzu/
"Programming is an art form that fights back"
 
Hi,

Lars von Wedel said:
I found the MD5 class, which seems to be suitable. Can I use the
returned byte[] as an index in a hashtable in order to create a
mapping from key to identical files (i.e. their names)?

Chad Z. Hower said:
Depending on your data, CRC32 might be enough and can be held in an
integer. CRC32 rarely returns dups. On any suspected ones you run a
secondary check using MD5, or raw compare.

Good point. However, I have already finished it using MD5, which works fine
in combination with a hashtable. From a runtime or memory point of
view there is no reason to go for a more lightweight approach.

So, in case anyone's interested in a snippet...

Lars
 
I'd like to have a look at a snippet; that sounds interesting.

gwmorris [AT] hotpop [DOT] com

if it's not too much trouble!
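
The requested snippet does not appear in the thread as captured here. As a stand-in, the following is a minimal sketch of the approach described (MD5 digests as hashtable keys, file names as values). It is not Lars's code: it is written in Java as an assumption, and the class `DuplicateFinder` and its methods `md5Hex` and `groupByDigest` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DuplicateFinder {
    // Streams the file through MD5 and returns the digest as a hex string.
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Maps each digest to the files that share it, keeping only
    // digests seen more than once.
    static Map<String, List<Path>> groupByDigest(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, List<Path>> groups = new HashMap<>();
        for (Path file : files) {
            groups.computeIfAbsent(md5Hex(file), k -> new ArrayList<>()).add(file);
        }
        groups.values().removeIf(paths -> paths.size() < 2);
        return groups;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        groupByDigest(root).forEach((digest, paths) ->
                System.out.println(digest + " " + paths));
    }
}
```

File names and dates play no part in the grouping, which matches the original requirement that only the content matters.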

 