HashTable

Sreekanth · Jan 20, 2005

Hello,
Is there any better collection than HashTable in terms of performance, when
the type of the key is integer?
Regards,
Sreekanth.

Jon Skeet [C# MVP] · Jan 20, 2005

Sreekanth said:
Is there any better collection than HashTable in terms of performance, when
the type of the key is integer?

Well, you *could* write your own version - but have you made sure that
the bottleneck in your code is in Hashtable first?

Dennis Myrén · Jan 20, 2005

I have written such a Hashtable.
It uses Int32 for keys and values(the type to use for the values one could
change), beating System.Collections.Hashtable in terms of performance.

Let me know if this is anything of interest for you.

Ollie Riches · Jan 20, 2005

By how much?

Dennis Myrén said:
I have written such a Hashtable.
It uses Int32 for keys and values(the type to use for the values one could
change), beating System.Collections.Hashtable in terms of performance.

Let me know if this is anything of interest for you.

Dennis Myrén · Jan 20, 2005

You mean by how much, in performance?
If so, i think it is almost twice as fast.

If you mean the price, of course it is free.

Ollie Riches · Jan 20, 2005

when you 'twice as fast' how many values did you test up to? 100, 1000,
10000, 1000000 etc

Cheers

Ollie

Dennis Myrén · Jan 20, 2005

Well, first i tested adding a million slots to my hashtable and
System.Collections.Hashtable.
I repeated that action 5 times.
The results(on my computer) is as follows:
Baretta.Int32Hashtable added 1000000 slots in 500ms
Baretta.Int32Hashtable added 1000000 slots in 490ms
Baretta.Int32Hashtable added 1000000 slots in 550ms
Baretta.Int32Hashtable added 1000000 slots in 490ms
Baretta.Int32Hashtable added 1000000 slots in 590ms
System.Collections.Hashtable added 1000000 slots in 540ms
System.Collections.Hashtable added 1000000 slots in 741ms
System.Collections.Hashtable added 1000000 slots in 931ms
System.Collections.Hashtable added 1000000 slots in 801ms
System.Collections.Hashtable added 1000000 slots in 821ms

So, when adding the keys and values to the table, my implementation
is a little, but not much faster.

But, the real performance gain we get when we want to access these keys and
values.
I looped a million times and accessed each slot in the hashtables.
I repeated that action 5 times.
These are the results here:
Baretta.Int32Hashtable read 1000000 slots in 40ms
Baretta.Int32Hashtable read 1000000 slots in 30ms
Baretta.Int32Hashtable read 1000000 slots in 40ms
Baretta.Int32Hashtable read 1000000 slots in 40ms
Baretta.Int32Hashtable read 1000000 slots in 30ms
System.Collections.Hashtable read 1000000 slots in 290ms
System.Collections.Hashtable read 1000000 slots in 280ms
System.Collections.Hashtable read 1000000 slots in 430ms
System.Collections.Hashtable read 1000000 slots in 340ms
System.Collections.Hashtable read 1000000 slots in 280ms

As you can see, now we are starting to really earn in terms of performance.

You are free to obtain my implementation and do your own benchmarks.

Helge Jensen · Jan 20, 2005

Dennis said:
You mean by how much, in performance?
If so, i think it is almost twice as fast.

Is there a place I can see this implementation?

Have you any idea about how much of this performance gain came from not
doing boxing?

I have implemented a hashtable mapping object to object, and have
achieved about 10-30% better performance than
System.Collections.Hashtable, while using less memory. Testing Add's and
Remove's was done using ints both as key and value, either ordered or
randomly selected, sizes of 100 upto 3M was tested, showing my
implementation to be consistently faster and less memory expending than
System.Collections.Hashtable.

System.Collections.Hashtable didn't work very well with sizes above 3M
which is acceptable for me. BTW: I have 512M real-mem.

The (important) choice that made a performance impact was the choice of
datatstructure and indexing: hash in object[2*sz*Math.Log/Math.Log(sz)],
index using key: x*y_sz+y, value: x*y_sz+y+1, instead of [x,y].

The resulting datastructure has really good locality of reference for
searching the collided keys, they are litterally next to eachother in
memory.

Some day, when i switch to .NET2 i'll try the impl out as a
IDictionary<T>, and see what I gain from that in the "int"-case.

Helge Jensen · Jan 20, 2005

Helge said:
Is there a place I can see this implementation?

Rudely, I forgot to show you mine before asking you to show me yours.
Sorry, forgot to put the URL in.

Straight from the subversion repo:
https://slog.dk/svn/home/jensen/graphutil/trunk/

WARNING: while the code passes all my tests, there may be bugs.
Certainly some areas of it need to get cleaned up, (and now for the
exuses) but while experinmenting to gain performance strange things
happens to code, and in not in a hurry to fix it, since it seems to work
so...

Dennis Myrén · Jan 20, 2005

I will be happy to see your implementation.

I have now put mine in a ZIP so you can download it.

Regarding your question about how much i earn from not boxing/unboxing,
yes i think actually i earn a lot from that.
In .NET/C# 2.0, i might not earn anything from this implementation anymore,
since the support for Generics will solve the boxing/unboxing "issue".

I also posted a small article to DeveloperLand as i had written this
implementation,
you can read here if you want:
http://www.developerland.com/DotNet/General/97.aspx
But, do *not* download that source, but rather download from the URL below,
as the version posted to DeveloperLand is old.
So, use this one, updated today:
http://dev.oslokb.no:8080/dennis/widgets/Int32Hashtable.zip

Please let me know of any eventual bugs you might find.

Helge Jensen · Jan 20, 2005

Dennis said:
I will be happy to see your implementation.

Please let me know of any eventual bugs you might find.

I can see that your implementation uses a linked list for colliding
keys. You can try my idea out, with using an array, it really made quite
a lot of runtime difference, esp. on the lookups and on big
data-structures, since the list of collided keys to be searched when
looking up would be on the same 1 or 2 pages of memory.

The 2 optimizations: data-structure and and removal of boxing are
independant, and you should probably see a considerable performance benefit.

Dennis Myrén · Jan 20, 2005

Helge,

That is probably a good idea.

I will look into it.

Thank you for your advice.

Helge Jensen · Jan 20, 2005

Please let me know of any eventual bugs you might find.

Here are a few things that jumped into my eyes:

* try/catch:

try
{
return _host [_keys [_i]];
}
catch (System.IndexOutOfRangeException e)
{
throw e;
}

Could be written just as:

return _host [_keys [_i]];

There is really no good reason to catch an rethrow that exception, is there?

If you need to do some work on the way out in a callstack, you can use
finally, or when you need to use the Exception (for logging for
example), using "throw" instead of "throw e" preserves the original
stacktrace.

* Enumerator:

You define enumeration to be sorted, there is no reason for that, the
basic iteration of hashtables or dictionaries are not guranteed to be
sorted, sorting is expensive (in this context where operations are O(1)

and requires you to spen O(capacity) memory, which is pretty much for
large hashed.

To do sorted iteration, the user can easily do:

int[] keys = new int[dict.Count]
dict.keys.CopyTo(keys, 0);
foreach ( int key in keys ) {
object val = dict[key];
}

or you can define a class SortedCollection that accepts an ICollection
and provides sorted Enumeration.

* indexing:

why do your do "& int.MaxValue"?

int i = (key & int.MaxValue) % _slots.Length;

Rather than

int index = (uint)key % _slots.Length;

It's less expensive, and doesn't disregard the most-significant bit of
your key

* .Values

Return a new instance of ICollection that simply iterates the hashtable,
returning only the keys. This saves a lot of memory.

If the user want's an array he can do dict.Values.CopyTo(array, index).

* int this[int key] { set }

* Add(key, value, overwrite)

Why do you accept ref's to key and value?

* Clear:

use the System.Array methods for clearing an array, or simple assign
a new array to _slots

Dennis Myrén · Jan 20, 2005

Thank you, Helge.
I have done all changes you suggested except in get_Values.
The ZIP is updated:
http://dev.oslokb.no:8080/dennis/widgets/Int32Hashtable.zip

The reason of the Add method of signature:
Add ( ref int, ref int, bool )
is that it is a private method used internally only.
I use ref to prevent a copy of the Int32 being made, doing what i can to
gain performance.

--
Regards,
Dennis JD Myrén
Oslo Kodebureau

Helge Jensen said:
Please let me know of any eventual bugs you might find.

Click to expand...

Here are a few things that jumped into my eyes:

* try/catch:

try
{
return _host [_keys [_i]];
}
catch (System.IndexOutOfRangeException e)
{
throw e;
}

Could be written just as:

return _host [_keys [_i]];

There is really no good reason to catch an rethrow that exception, is
there?

If you need to do some work on the way out in a callstack, you can use
finally, or when you need to use the Exception (for logging for example),
using "throw" instead of "throw e" preserves the original stacktrace.

* Enumerator:

You define enumeration to be sorted, there is no reason for that, the
basic iteration of hashtables or dictionaries are not guranteed to be
sorted, sorting is expensive (in this context where operations are O(1)
and requires you to spen O(capacity) memory, which is pretty much for
large hashed.

To do sorted iteration, the user can easily do:

int[] keys = new int[dict.Count]
dict.keys.CopyTo(keys, 0);
foreach ( int key in keys ) {
object val = dict[key];
}

or you can define a class SortedCollection that accepts an ICollection and
provides sorted Enumeration.

* indexing:

why do your do "& int.MaxValue"?

int i = (key & int.MaxValue) % _slots.Length;

Rather than

int index = (uint)key % _slots.Length;

It's less expensive, and doesn't disregard the most-significant bit of
your key

* .Values

Return a new instance of ICollection that simply iterates the hashtable,
returning only the keys. This saves a lot of memory.

If the user want's an array he can do dict.Values.CopyTo(array, index).

* int this[int key] { set }

* Add(key, value, overwrite)

Why do you accept ref's to key and value?

* Clear:

use the System.Array methods for clearing an array, or simple assign a
new array to _slots

Richard Blewett [DevelopMentor] · Jan 20, 2005

Aren't you just putting a 32 bit reference rather than a 32 bit integer on the stack instead? I don't see the benefit myself

Regards

Richard Blewett - DevelopMentor
http://www.dotnetconsult.co.uk/weblog
http://www.dotnetconsult.co.uk

Thank you, Helge.
I have done all changes you suggested except in get_Values.
The ZIP is updated:
http://dev.oslokb.no:8080/dennis/widgets/Int32Hashtable.zip

The reason of the Add method of signature:
Add ( ref int, ref int, bool )
is that it is a private method used internally only.
I use ref to prevent a copy of the Int32 being made, doing what i can to
gain performance.

Helge Jensen · Jan 20, 2005

Dennis said:
Thank you, Helge.

No problemo.

I have done all changes you suggested except in get_Values.

Don't accept my advice as definitive, run your benchmarks and see if
things change, and in what direction

Optimizing code-structure to suit the compiler is often a guessing-game.
For example I wrote 4 different implementations of an interpreter for a
small language: iterative, recursive and
multiple-tailrecursive-functions(passing the state in an "accumulator"),
which was the fastest do you think?

HINT: gcc and MSVC6 produced the fastest code on one of them, MSVC7 on
another.

The reason of the Add method of signature:
Add ( ref int, ref int, bool )
is that it is a private method used internally only.
I use ref to prevent a copy of the Int32 being made, doing what i can to
gain performance.

Off the top of my head I would doubt that you are getting any benefit
from the ref's, Does your benchmarks show a performance gain from it?

Dennis Myrén · Jan 20, 2005

Helge and Richard,
Yes you are absolutely right, i probably do not earn anything(in this
particular case of Int32)
using ref.
I decided to remove the internal Add method, resulting in 10 more lines of
code
but saving a method call and an if statement. It is getting quick now.
Now i have an average access time for a million slots of 30ms, compared
to System.COllections.Hashtable's 380ms.
That is a greater benefit than i would ever imagine.
Have you done any benchmarks on it? Please give me your results in that
case.
I once again updated the ZIP:
http://dev.oslokb.no:8080/dennis/widgets/Int32Hashtable.zip
if you are interested.

And Helge, I would guess that the iterative approach was your fastest
implementation?

--
Regards,
Dennis JD Myrén
Oslo Kodebureau

Dennis Myrén said:
Thank you, Helge.
I have done all changes you suggested except in get_Values.
The ZIP is updated:
http://dev.oslokb.no:8080/dennis/widgets/Int32Hashtable.zip

The reason of the Add method of signature:
Add ( ref int, ref int, bool )
is that it is a private method used internally only.
I use ref to prevent a copy of the Int32 being made, doing what i can to
gain performance.

--
Regards,
Dennis JD Myrén
Oslo Kodebureau

Helge Jensen said:

Please let me know of any eventual bugs you might find.

Click to expand...

Here are a few things that jumped into my eyes:

* try/catch:

try
{
return _host [_keys [_i]];
}
catch (System.IndexOutOfRangeException e)
{
throw e;
}

Could be written just as:

return _host [_keys [_i]];

There is really no good reason to catch an rethrow that exception, is
there?

If you need to do some work on the way out in a callstack, you can use
finally, or when you need to use the Exception (for logging for example),
using "throw" instead of "throw e" preserves the original stacktrace.

* Enumerator:

You define enumeration to be sorted, there is no reason for that, the
basic iteration of hashtables or dictionaries are not guranteed to be
sorted, sorting is expensive (in this context where operations are O(1)
and requires you to spen O(capacity) memory, which is pretty much for
large hashed.

To do sorted iteration, the user can easily do:

int[] keys = new int[dict.Count]
dict.keys.CopyTo(keys, 0);
foreach ( int key in keys ) {
object val = dict[key];
}

or you can define a class SortedCollection that accepts an ICollection
and provides sorted Enumeration.

* indexing:

why do your do "& int.MaxValue"?

int i = (key & int.MaxValue) % _slots.Length;

Rather than

int index = (uint)key % _slots.Length;

It's less expensive, and doesn't disregard the most-significant bit of
your key

* .Values

Return a new instance of ICollection that simply iterates the hashtable,
returning only the keys. This saves a lot of memory.

If the user want's an array he can do dict.Values.CopyTo(array, index).

* int this[int key] { set }

* Add(key, value, overwrite)

Why do you accept ref's to key and value?

* Clear:

use the System.Array methods for clearing an array, or simple assign a
new array to _slots

Click to expand...

HashTable

Sreekanth

Jon Skeet [C# MVP]

Dennis Myrén

Ollie Riches

Dennis Myrén

Ollie Riches

Dennis Myrén

Helge Jensen

Helge Jensen

Dennis Myrén

Helge Jensen

Dennis Myrén

Helge Jensen

Dennis Myrén

Richard Blewett [DevelopMentor]

Helge Jensen

Dennis Myrén