GetHashCode for Objects that Compare Based on Value (Not reference equality)


Matt

Hello,

In a few classes I have overridden Equals, ==, and != and would like to
generate an optimal hash code but I am a little confused about the best
approach to take. All of the classes that I am developing use a value
comparison, not reference, to determine equality. The classes were all
originally generated from an XSD, so there is a hierarchical
organization of classes. Because I am doing a value comparison, the
actual contents of the objects are relevant, but this gets tricky
because the majority of these classes contain arrays of additional
objects, which in turn contain more objects.

For example, one board game is only equivalent to another board game if
it contains the same number of game pieces, and those pieces are on the
same relative spots on the board, and the pieces are the same color,
and the pieces have the same stickers on them, and the stickers have
the same adhesive, etc. You can see how it gets laborious to generate
a hash code based on the value of an object's contents if it is composed
of objects which are composed of other objects, and so on. The board game
is not the same if the adhesive on the game pieces is different, so the
hash codes should be different in that case, right? If the hash code
were based simply on the number of pieces, their location, and their
color then I could expect collisions and that would introduce all sorts
of confusion.

Would you just use Object.GetHashCode and forget about it... or would
you generate a hash code from an exhaustive dig into an object's
contents? I was thinking of generating a hash code from the hash codes
of the member variables. What is the best approach for this issue?

Thanks
 
Matt,

Your best bet is to XOR the GetHashCode() of all your significant
non-computed member variables:

public class myClass
{
    private int intID = 0;
    private string strName = null;
    private decimal decPrice = 0;
    private decimal decQuantity = 0;

    private decimal decTotalValue = 0;

    public void Calculate()
    {
        this.decTotalValue = this.decPrice * this.decQuantity;
    }

    public override int GetHashCode()
    {
        int intHash = this.intID.GetHashCode() ^ this.decPrice.GetHashCode()
            ^ this.decQuantity.GetHashCode();
        return this.strName == null ? intHash : intHash ^ this.strName.GetHashCode();
    }
}

If everyone's done their homework and implemented honest GetHashCode()
overrides in all the classes of your member variables -- everything in
the framework has -- your GetHashCode() should satisfy the
requirements. Note that I'm not including decTotalValue, because it's
nothing more than decPrice times decQuantity, and now's a good time to
ask yourself, "do I really need to cache that value?" Just beware of
the normal pitfalls, like null member variables (a good reason to use
structs and enums when they'll suffice instead of full-blown classes).


Stephan
 
Thanks for the quick reply. Would you still XOR everything if the
member variables were not primitive - so the GetHashCode calls would
call other GetHashCode calls internally? My main concern is the
performance hit: if I have a lot of objects in a hierarchical fashion
and I call GetHashCode on the topmost object, which in turn generates a
hash based on its own member variables, each of which in turn generates
its own hash code from its member variables, and so on and so forth,
generating a hash code at the top level could result in a huge number
of operations.

For a simple ADT with primitive types, your approach makes total sense
to me. When collections and nested types are brought into the picture,
the performance hit seems like it could be tremendous if I am using my
objects extensively in collections that query the hash code frequently.
Since hash codes in .NET are not guaranteed to be constant, and since
a hash code should change if the contents of a value-based-equality object
change, I would imagine that the included .NET collections would
make potentially frequent calls to GetHashCode while performing common
operations. Am I correctly understanding this?


Thanks!
 
I usually combine the ToString of each significant property and return the
hashcode of the combined string. I also override ToString on most of my
classes so that the ToString method returns a string that uniquely identifies
the instance. That takes care of the values for reference types.
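
Something like this, for instance, reusing the field names from the earlier example (purely illustrative):

public override string ToString()
{
    // A string that uniquely identifies the instance by its significant values.
    return string.Format("{0}|{1}|{2}|{3}",
        this.intID, this.strName, this.decPrice, this.decQuantity);
}

public override int GetHashCode()
{
    // Hash of the combined string representation.
    return this.ToString().GetHashCode();
}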
 
ssamuel said:
Your best bet is to XOR the GetHashCode() of all your significant
non-computed member variables:

XOR can easily lead to collisions - in your example, for instance, the
hashcodes of the two decimals are XORed - that means that an item with
a price and quantity of 2 will (everything else being equal) be the
same as an item with a price and quantity of 1.

I use the trick from Josh Bloch's Java book:

int hash = 37;
hash = 17*hash + hashOfFirstMember;
hash = 17*hash + hashOfSecondMember;
hash = 17*hash + hashOfThirdMember;
etc

37 and 17 aren't magical, just coprime. It's not guaranteed to be a
*great* hashcode, but it's a little better than XOR :)
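
Applied to the earlier myClass example, that might look something like this (same fields; the unchecked block is just to make the intended overflow harmless and explicit):

public override int GetHashCode()
{
    unchecked
    {
        int hash = 37;
        hash = 17 * hash + this.intID.GetHashCode();
        hash = 17 * hash + (this.strName == null ? 0 : this.strName.GetHashCode());
        hash = 17 * hash + this.decPrice.GetHashCode();
        hash = 17 * hash + this.decQuantity.GetHashCode();
        return hash;
    }
}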
 
XOR can easily lead to collisions - in your example, for instance, the
hashcodes of the two decimals are XORed - that means that an item with
a price and quantity of 2 will (everything else being equal) be the
same as an item with a price and quantity of 1.

Jon,

This is true. It's also irrelevant.

According to the framework documentation (ask if you need a link),
GetHashCode() isn't meant for uniqueness or object identification. It's
meant to provide a good (i.e. -- random) distribution across a
hashtable. Aside from the fact that the documentation recommends XOR
implementations, it's fairly clear from a mathematical basis that XOR
functions provide better orthogonality when correlating multiple
distributions, which is what we're trying to accomplish.

I believe the example you gave, while neat, simply adds a constant
offset to your distribution. All this serves to do is skew your random
variable: it provides no benefit to randomness of the resulting
distribution, all else being equal. The probability distribution is the
same.

There are certainly times when a purely random distribution of hashes
is undesirable. However, you'd have to know quite a bit about the
nature of your data to determine what a better hashing algorithm would
be. I'll admit that I didn't ask if this was the case, largely because
I don't want to go through the analysis for someone else's data, and
also because that's an optimization step that shouldn't come into play
until it's clear that a poor hashing algorithm is killing hash
performance. I believe the question was directed towards, "hey, the
compiler's making me do it, so how do I make it shut up simply?"


Stephan
 
ssamuel said:
This is true. It's also irrelevant.

Not unless you think that collisions are irrelevant. They're irrelevant
(unless your code is broken) in terms of *correctness* but they're
important in terms of performance. Let's face it, "return 0;" is a
*valid* implementation of GetHashCode(), but not a terribly useful one.
According to the framework documentation (ask if you need a link),
GetHashCode() isn't meant for uniqueness or object identification. It's
meant to provide a good (i.e. -- random) distribution across a
hashtable.

Yes - but that suggests that collisions should be avoided where
possible. Using XOR increases the chance of collision.
Aside from the fact that the documentation recommends XOR
implementations

There are lots of bad recommendations in the docs :)
it's fairly clear from a mathematical basis that XOR
functions provide better orthogonality when correlating multiple
distributions, which is what we're trying to accomplish.

Not really. For example, using XOR with lots of numbers which all give
small hashcodes will only give you a range limited by the few bits that
can actually differ.
I believe the example you gave, while neat, simply adds a constant
offset to your distribution. All this serves to do is skew your random
variable: it provides no benefit to randomness of the resulting
distribution, all else being equal. The probability distribution is the
same.

I believe that in real life it's more likely that you'll see hash
collisions using XOR than using the method I've shown.

For instance, if you have two string properties, it's more likely in
the real world that you'll have two instances which either both use the
same value for both properties (although that value may be different
between the instances) or that use the same values in a different
order.
There are certainly times when a purely random distribution of hashes
is undesirable. However, you'd have to know quite a bit about the
nature of your data to determine what a better hashing algorithm would
be. I'll admit that I didn't ask if this was the case, largely because
I don't want to go through the analysis for someone else's data, and
also because that's an optimization step that shouldn't come into play
until it's clear that a poor hashing algorithm is killing hash
performance. I believe the question was directed towards, "hey, the
compiler's making me do it, so how do I make it shut up simply?"

My point was trying to get to a hashing algorithm which was *more*
random than yours.

Here's an example program - it mimics a class containing 3 ints, all of
which will be in the range -10 to 10. In real life, this is far from
uncommon - even if the theoretical range of the data is large, the
actual range is often very small. We calculate every possible hashcode
using both methods, and then see how many *different* ones there are in
each case.

Using XOR, you only get 32 possible hashcodes. With multiplication
(even completely "untuned") you get 6141. With a *tiny* bit of
knowledge/guesswork about the data, you can improve this to the full
9261 possible values, just by changing the multiplication factor.

Why use a hashcode algorithm which is more likely to result in
duplicate hashcodes (which won't prevent anything from working, but
will slow things down) when it's easy to give a wider range?


using System;
using System.Collections.Generic;

class Test
{
    static void Main()
    {
        // Use maps as sets
        Dictionary<int,int> xorMap = new Dictionary<int,int>();
        Dictionary<int,int> multMap = new Dictionary<int,int>();

        for (int x = -10; x <= 10; x++)
        {
            for (int y = -10; y <= 10; y++)
            {
                for (int z = -10; z <= 10; z++)
                {
                    int xorHash = x ^ y ^ z;
                    int multHash = 37;
                    multHash = 17 * multHash + x;
                    multHash = 17 * multHash + y;
                    multHash = 17 * multHash + z;

                    xorMap[xorHash] = xorHash;
                    multMap[multHash] = multHash;
                }
            }
        }

        Console.WriteLine("Number of different hashcodes using XOR: " + xorMap.Count);
        Console.WriteLine("Number of different hashcodes using */+: " + multMap.Count);
    }
}
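
For instance (one reading of "changing the multiplication factor"): a multiplier larger than the 21 possible values of each variable, such as 23, stops the per-variable contributions from overlapping, and the same test then reports all 9261 values:

multHash = 23 * multHash + x;
multHash = 23 * multHash + y;
multHash = 23 * multHash + z;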
 
Jon,

I highly respect you for taking the time to write sample code, but I
think you're chasing a proof by example. Given your assumptions, you
are completely correct. The fact that you have to justify your
assumptions as "in real life," which is a highly subjective assumption,
your view being one with which I cannot agree, proves my point: if you
know your data well, you can always come up with a better hashing
algorithm.

My proof is more abstract than yours:

1. It is too hard to collect all the data used as member variables in
every class in every industry through the history, present and future
of all C# applications.

2. Therefore, one cannot characterize 'all data' or even 'real life
data.'

(For the record, your posit that an array of evenly distributed
integers between any x and -x is common would need lots more empirical
data before I begin to believe it; in the last three industries I've
worked in, such data is uncommon. I claim that your other posit, that a
well-designed class should contain two string members that contain the
same value, is very poor.)

3. Uncharacterizable data should be considered randomly distributed, by
the definition of random distribution, and thus so should the data in
member variables of a given class.

4. Given two random distributions, A and B, A XOR B is random.

5. It follows that given a class with random member variable values,
XOR-ing member variable values will give a random hash.

QED, unless you have specific characterization of your data, XOR is
efficient. Seeing that it's the method recommended in the docs -- I
won't even touch your comment about their incorrectness -- I'll follow
Ockham's Razor and stick with it.

Mind you, I'm not trying to say your algorithm is incorrect. I'm simply
stating that one should implement the simplest efficient algorithm
until optimization, when one can gain insight into the true
distribution of values in one's real data set.


Stephan
 
ssamuel said:
I highly respect you for taking the time to write sample code, but I
think you're chasing a proof by example. Given your assumptions, you
are completely correct. The fact that you have to justify your
assumptions as "in real life," which is a highly subjective assumption,
your view being one with which I cannot agree, proves my point: if you
know your data well, you can always come up with a better hashing
algorithm.

My proof is more abstract than yours:

It's not really more abstract. You're effectively saying that a good
default position is to treat all data as if it were randomly and evenly
distributed. I'm saying that's not a good default position because so
much of the time in the real world we see repeated values, or values
which tend to lie within relatively small ranges.
1. It is too hard to collect all the data used as member variables in
every class in every industry through the history, present and future
of all C# applications.

2. Therefore, one cannot characterize 'all data' or even 'real life
data.'

That's false logic - or at least, it's effectively claiming that
because we can't know *everything* it's not worth using what we *do*
know as a reasonable representative sample.

Here's a counterexample: I can't collect all the data used in every
English language text ever written, but if I were to look at all the
English language books I happen to have in my book cases, I can
reasonably quickly come to the conclusion that 'e' and 's' are used
more often than 'x' and 'z'. Similarly I don't think it's unreasonable
to suggest that the number 8 comes up more often as a value for real-
life data stored in ints than 8327646.
(For the record, your posit that an array of evenly distributed
integers between any x and -x is common would need lots more empirical
data before I begin to believe it; in the last three industries I've
worked in, such data is uncommon.

a) There's no need for the range to be -x to x.
b) There's no need for the distribution to be even.

In fact, there's no need for the values even to be in the same range -
try the code I gave with different ranges for the three variables:
you'll never get out of the number of bits which can be different. With
my algorithm you quickly get a broader range.

Do you really think it's particularly uncommon for real-world data to
end up using only a small part of the datatype's possible range?
I claim that your other posit, that a
well-designed class should contain two string members that contain the
same value, is very poor.)

I didn't say that well-designed classes "should" contain two string
members that contain the same value - I claimed it often happened.
Think "shipping address" and "invoice address" - as references to
Address objects, if you will (it really doesn't matter whether they're
strings or not - the point is that repeating a value effectively
cancels out the first value's hashcode when using XOR).
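
To spell that out (hypothetical field names, but any pair of equal members behaves this way):

// If shippingAddress.Equals(invoiceAddress), their hashcodes are equal,
// so XORing them contributes exactly 0 to the combined hash - the pair
// vanishes from the result no matter what value it holds.
int combined = shippingAddress.GetHashCode() ^ invoiceAddress.GetHashCode();
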
3. Uncharacterizable data should be considered randomly distributed, by
the definition of random distribution, and thus so should the data in
member variables of a given class.

I think that's as much a guess as any other distribution - but I'd
consider it far worse than my guess that numeric values, particularly
integers, are often relatively small, usually positive. This is
reasonable, as many real-world values have the same attributes.
Consider:

1) Age
2) Day of month
3) Month of year
4) Year
5) House number
6) Price (in cents)
7) Length of DVD or CD in minutes
8) Dimension (eg in centimeters)
9) Number of messages in my inbox
10) Number of replies to a blog post

Many of these will end up being represented using an int in simple
classes. Do you really believe they have a uniform random distribution
over the full range of ints?
4. Given two random distributions, A and B, A XOR B is random.

5. It follows that given a class with random member variable values,
XOR-ing member variable values will give a random hash.

If life really consisted of uniformly distributed random values, XOR
would be fine. However, I contend that the real world isn't like that.
QED, unless you have specific characterization of your data, XOR is
efficient. Seeing that it's the method recommended in the docs -- I
won't even touch your comment about their incorrectness -- I'll follow
Ockham's Razor and stick with it.

Why won't you touch my comment about the docs not being correct? I've
submitted many corrections to MSDN before, and there are *huge* numbers
of bad code examples. I'm sure I could easily find some if you doubt
me...

Now, as to it being good solely by being recommended in documentation -
"my" algorithm is recommended in a book which is widely recognised as a
great text. Okay, so it's about Java, but I don't believe that there's
any significant difference between .NET and Java that would affect the
advice about hashcodes.
Mind you, I'm not trying to say your algorithm is incorrect. I'm simply
stating that one should implement the simplest efficient algorithm
until optimization, when one can gain insight into the true
distribution of values in one's real data set.

I don't think there's any significant difference in simplicity, to be
honest. I *do* believe you'll get more collisions in the real world
using XOR than using multiply-and-add.
 
Jon,

You've proven my point. Every time you list an example, you're
demonstrating that by knowing your data, you can come up with a better
hash than a generic implementation.

The likelihood of producing collisions using my method is sufficiently
low in classes that contain the following data sets:

1. stock ticker, price
2. patient name, age, ssn, collection of descriptive flags
3. date/time, event description, event data

Further, for those sets of data, an XOR hash produces a sufficiently
random distribution for hashes to work just fine. Your algorithm works
just as well in these cases, but mine does well also. If you disagree
with that, we need to start to develop a metric for how "good" a hash
has to be and what your benchmark is. Then we're departing from the
realm of developing a good generic algorithm; thus specific examples
haven't gotten us far.

Saying that XOR produces collisions is not very useful. XOR works on
the basis of collisions, so the task is not to prove that you could
have collisions, but that the collisions inside two data structures
that contain this data would lead to low orthogonality if the data was
highly orthogonal.

Your argument that you can generalize about the English language
doesn't apply because your assumptions have no bearing on the
assumptions that make up this case. It's about as logical as me saying,
"I'm right because the Cardinals won the World Series in 2006," then
going on to prove that the Cardinals did in fact win.

You've made statements about how you have some ability to characterize
all "real world" data. All you've done to prove it is list single data
types that aren't themselves randomly distributed, although some that
are used as commonly as those which you didn't mention are. Correlating
two or more of those changes the orthogonality of your random
variables, be they of Gaussian, random, lognormal or any other
distribution. A few cases, or two dozen cases for that matter, don't
prove a point.

Your hash works well in some scenarios and my hash works well in
general scenarios. That's not to say that yours doesn't work well in
general scenarios -- it appears as if it would -- or that there are
some in which mine doesn't work well -- we know there are. You have,
however, failed to prove that mine doesn't work well in the "real
world."

I'm not going to get into a petty argument over whose book is better.
You're certainly not the only person ever to notice an error somewhere,
and you're even more certainly not an authority on whether MSDN has
more errors per unit volume than any particular book in existence,
except perhaps any that you may have written. Either prove that the
specific passage I quoted is incorrect or drop the "MSDN is the root of
all evil" argument.



Stephan
 
Matt said:
For a simple ADT with primitive types, your approach makes total sense
to me. When collections and nested types are brought into the picture,
the performance hit seems like it could be tremendous if I am using my
objects extensively in collections that query the hash code frequently.
Since hash codes in .NET are not guaranteed to be constant, and since
a hash code should change if the contents of a value-based-equality object
change, I would imagine that the included .NET collections would
make potentially frequent calls to GetHashCode while performing common
operations. Am I correctly understanding this?

Hashtable shouldn't make too many calls to GetHashCode(). I can't speak
for the actual implementation of System.Collections.Hashtable, but it's
entirely possible that GetHashCode() would only need to be called once
per object, on insert. If you don't move or re-sort stuff, the original
hash should stand. I don't know all that much about specific
performance enhancements that would require re-hashing, but whoever
used one of these would have to account for hash generation time when
deciding whether the implementation would save or lose time.

As for increasing times in cascaded hashes, your statements are
dead-on. GetHashCode() should return "quickly," but that's a relative
thing. If you have sub objects, they too should return quickly, so
overall hash computation should be fast. Keeping things to simple
integer math speeds a lot up, as opposed to any string operation, which
may take close to an order of magnitude longer. If you're designing
a data structure, you may want to build the hash while you build the
data structure to prevent having to do it later. A good DBA type may be
able to help you with this: it's like building a table index after
adding data instead of before.
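
For the nested case, the cascade might look something like this (hypothetical Piece/Board classes, just to illustrate combining child hashes):

public class Piece
{
    private int intColor = 0;
    private int intPosition = 0;

    public override int GetHashCode()
    {
        // Each child hashes its own members...
        return this.intColor.GetHashCode() ^ this.intPosition.GetHashCode();
    }
}

public class Board
{
    private Piece[] pieces = new Piece[0];

    public override int GetHashCode()
    {
        // ...and the parent combines the hashes of its children.
        int intHash = 0;
        foreach (Piece piece in this.pieces)
        {
            intHash ^= piece.GetHashCode();
        }
        return intHash;
    }
}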

This is a necessary evil. In order to have a competent hash, you have
to have some representation of all the essential component parts of an
object. While it's true that we want to reduce the number of cycles our
programs suck up, the whole point of the CPU is to process things that
need to be processed.


Stephan
 
As for increasing times in cascaded hashes, your statements are
dead-on. GetHashCode() should return "quickly," but that's a relative
thing. If you have sub objects, they too should return quickly, so
overall hash computation should be fast. Keeping things to simple
integer math speeds a lot up, as opposed to any string operation, which
may take close to an order of magnitude longer. If you're designing
a data structure, you may want to build the hash while you build the
data structure to prevent having to do it later. A good DBA type may be
able to help you with this: it's like building a table index after
adding data instead of before.

One option is to determine the hash lazily. Either have one "magic
value" which indicates "I haven't worked out the hash yet" (and make
sure your hash algorithm never ends up with that as the value) or have
a separate flag to say whether or not the hash has been calculated.

This is the approach Sun's Java implementation takes for strings,
although there's nothing there to stop the "magic number" (0) being the
result of a real hash, and thus being recalculated every time. Indeed,
it's guaranteed to be the case for empty strings - although at that
point it doesn't take long to recalculate :)
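
A rough sketch of the flag-based version, reusing fields from the earlier example (illustrative only; reset the flag whenever the significant members change):

private bool hashCalculated = false;
private int cachedHash = 0;

public override int GetHashCode()
{
    if (!this.hashCalculated)
    {
        // Any of the hashes discussed above would do here.
        this.cachedHash = this.intID.GetHashCode() ^ this.decPrice.GetHashCode()
            ^ this.decQuantity.GetHashCode();
        this.hashCalculated = true;
    }
    return this.cachedHash;
}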
 