Numeric hash code for a file

David Glover

Hi all,

I was wondering if anyone could suggest a quick and elegant method for
generating a numeric value dependent on the contents of a file (i.e. two
identical files will create the same value, and two differing files will
generate different values)?

I really need the generation of the value to be executed as quickly
as possible, and the value generated to be an integer - preferably
unsigned.

Many thanks for any suggestions

David
 
The classic way would be to keep an unsigned integer and read the entire
contents of the file. Each byte would be XOR-ed with the current hash value
and the hash value would then be rotated/shifted left by a bit or something.
Something like this.

unsigned hash = 0;
while (more file)
{
    b = readbyte();
    hash ^= b;
    hash <<= 1;
}

This yields a good hash if, instead of unsigned, the hash value is a byte.
You might adjust how much you shift the value to make it more discriminatory
when it's a 32-bit value...
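
For what it's worth, here is a minimal C# sketch of the "rotated" variant. A plain left shift pushes bits off the top of a 32-bit value, so early bytes eventually stop counting; rotating keeps them in play. The class and method names (RotateXorHash, Compute) and the 4096-byte buffer are made up for illustration, and this is only a toy mixing function, not a strong hash.

```csharp
using System;
using System.IO;

public static class RotateXorHash
{
    // Hypothetical helper, not from the thread: XORs every byte of the
    // stream into a 32-bit value and rotates (rather than shifts) left
    // by one bit, so no bits are discarded off the top.
    public static uint Compute(Stream input)
    {
        uint hash = 0;
        byte[] buffer = new byte[4096]; // buffer size is an arbitrary choice
        int read;

        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                hash ^= buffer[i];
                hash = (hash << 1) | (hash >> 31); // rotate left by one bit
            }
        }

        return hash;
    }
}
```

With a rotate, a file of 40 bytes whose first byte is 1 and the rest 0 still hashes differently from 40 zero bytes, which is not the case with a plain shift.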

Paul T.
 
unsigned hash = 0;
while (more file)
{
    b = readbyte();
    hash ^= b;
    hash <<= 1;
}

What's this? If "hash" is a double word, then this method only takes the
last 32 bytes into account!

Why not use CRC32 or Adler32? Both are lightning fast, well-known and
respected methods. If you need more security, you should use an established
hash like MD5 or SHA.
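
As a rough sketch of the MD5 route, assuming the full .NET Framework's System.Security.Cryptography classes are available on your platform: compute the digest and take its first four bytes as the unsigned integer. The names DigestHash and FromStream are hypothetical. Note that truncating to 32 bits gives up most of the collision resistance, and the byte order of the result follows the machine's endianness.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class DigestHash
{
    // Hypothetical helper: reduces the 16-byte MD5 digest of a stream
    // to a 32-bit unsigned value by taking its first four bytes.
    public static uint FromStream(Stream input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(input);
            return BitConverter.ToUInt32(digest, 0);
        }
    }
}
```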

Here is a small class for calculating the Adler32 checksum:

public sealed class Adler32
{
    private const UInt32 BASE = 65521; // largest prime below 2^16
    private const Int32 NMAX = 5552;   // max bytes summed before a modulo is needed

    private UInt32 m_S1;
    private UInt32 m_S2;

    // Must be called before the first Update().
    public void Start()
    {
        this.m_S1 = 1;
        this.m_S2 = 0;
    }

    public void Update(byte[] buffer, int offset, int count)
    {
        int partLength;

        while (count > 0)
        {
            // Process at most NMAX bytes so the 32-bit sums cannot overflow.
            partLength = count < NMAX ? count : NMAX;
            count -= partLength;

            while (partLength > 0)
            {
                this.m_S1 += buffer[offset++];
                this.m_S2 += this.m_S1;

                partLength--;
            }

            this.m_S1 %= BASE;
            this.m_S2 %= BASE;
        }
    }

    // Returns the checksum as an 8-digit lowercase hex string.
    public string Finish()
    {
        return ((UInt32)((this.m_S2 << 16) | this.m_S1)).ToString("x8");
    }
}
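
Since the original question asked for an unsigned integer rather than a hex string, here is a compact sketch of the same Adler-32 calculation that reads a stream in chunks and returns a uint. The names Adler32Stream, Compute and ComputeFile and the 4096-byte buffer are assumptions for illustration; the buffer is kept below the 5552-byte limit so that one modulo per chunk is enough.

```csharp
using System;
using System.IO;

public static class Adler32Stream
{
    private const uint Base = 65521; // largest prime below 2^16

    // Hypothetical helper: Adler-32 of everything left in the stream.
    public static uint Compute(Stream input)
    {
        uint s1 = 1;
        uint s2 = 0;
        byte[] buffer = new byte[4096]; // below 5552, so the sums cannot overflow
        int read;

        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                s1 += buffer[i];
                s2 += s1;
            }

            s1 %= Base;
            s2 %= Base;
        }

        return (s2 << 16) | s1;
    }

    // Convenience wrapper for a file on disk.
    public static uint ComputeFile(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return Compute(fs);
        }
    }
}
```

Calling Adler32Stream.ComputeFile(path) on two files and comparing the results is then a single integer comparison. Equal values do not prove the files are identical, only that they are very likely to be.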


Regards, Christian
 
Yes, you can do any of the above. A simple XOR of every byte in the file
will work, too, for that matter. Any good compilers book will have several
hashing algorithms in it, too.

Paul T.

 
Many thanks for all the useful input. I have settled on using the
Adler32 method on small sections of the file.

Thanks again,

David Glover
 