My question is related to the discussion "about the hashtable and
equals" where I learnt a few interesting things.
In my company we deal with large volumes of data indexed by unique
integer ids (for background, we deal with road traffic data, and the
road network is modelled as a set of edges representing road
sections). The edges' ids run from 0 to the total number of edges
(i.e. around 20 million for the French network). However, we often
work with only a subset of the road network, typically 100,000,
1 million or 10 million ids, which are uniformly spread between 0 and
20 million.
We usually use a C# Dictionary with integer keys to store any data
indexed by edge id (road section characteristics, archived data, ...),
and it scales very well up to several million ids. However, using a
generic hashing algorithm over 32-bit integers (potentially ranging
from -2.14 x 10^9 to +2.14 x 10^9) when our keys actually occupy a
much narrower range has always seemed like overkill to me.
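To make the setup concrete, here is a stripped-down sketch of how we
typically store data by edge id (the EdgeData struct and its fields
are invented just for this example, our real payloads are richer):

    using System;
    using System.Collections.Generic;

    // Hypothetical per-edge payload, for illustration only.
    struct EdgeData
    {
        public float LengthMeters;
        public float FreeFlowSpeedKmh;
    }

    class Example
    {
        static void Main()
        {
            // Keys are edge ids drawn from 0..20,000,000; only a subset is present.
            var byEdgeId = new Dictionary<int, EdgeData>();
            byEdgeId[42] = new EdgeData { LengthMeters = 120f, FreeFlowSpeedKmh = 50f };

            EdgeData edge;
            if (byEdgeId.TryGetValue(42, out edge))
                Console.WriteLine(edge.LengthMeters);
        }
    }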
One optimization I came up with was simply to use a dense array of the
maximal size, where the key of each object is its index in that array.
It works pretty well: for example, when initializing with 10 million
keys randomly chosen between 0 and 20 million, I save more than 200 MB
compared with a 10-million-entry key/value Dictionary, even though the
array actually holds 20 million slots. In addition, even though
Dictionary access time is theoretically O(1), in my tests random key
access was on average about 2 times faster with the array. Memory-wise,
this approach is also very good when I want to access read-only data
modelled as a C# struct, since I can pre-allocate one big array of
structs, which stores the values inline and removes all the overhead
created by C# objects (object headers, GC references, and so on). I
even pushed the idea further by implementing a generic
ArrayDictionary<int,T> class that implements the C# IDictionary<int,T>
interface, so I can switch between the two implementations at runtime.
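To give an idea of what I mean, here is a simplified sketch of the
array-backed map (the class and member names below are illustrative
only; my actual ArrayDictionary<int,T> implements the full
IDictionary<int,T> interface, whose remaining members are omitted here):

    using System.Collections.Generic;

    // Simplified sketch of an array-backed integer-keyed map.
    class ArrayMap<T>
    {
        private readonly T[] _values;      // dense storage indexed directly by key
        private readonly bool[] _present;  // tracks which slots actually hold a value

        public ArrayMap(int maxKeyExclusive)
        {
            _values = new T[maxKeyExclusive];
            _present = new bool[maxKeyExclusive];
        }

        public T this[int key]
        {
            get
            {
                if (!_present[key]) throw new KeyNotFoundException();
                return _values[key];
            }
            set
            {
                _values[key] = value;
                _present[key] = true;
            }
        }

        public bool TryGetValue(int key, out T value)
        {
            if (key >= 0 && key < _values.Length && _present[key])
            {
                value = _values[key];
                return true;
            }
            value = default(T);
            return false;
        }

        public bool ContainsKey(int key)
        {
            return key >= 0 && key < _present.Length && _present[key];
        }
    }

The extra bool[] is what lets it still behave like a dictionary
(distinguishing "no value" from a default value) rather than a bare array.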
So, all in all, my questions are:
- What hashing algorithm does the C# Dictionary use? Is it generic to
all types of data once a GetHashCode function is provided?
- Does my optimization sound reasonable, or should I have looked
further into what the existing Dictionary offers (for example, a
different hashing algorithm adapted to my key distribution)?
Sorry for the long message, and thank you if you read through it!
Guillaume
equals" where I learnt a few interesting things.
In my company we deal with big volume of data indexed by unique
integer ids (for background information, we deal with road traffic
data and and the road network is modelled by a set of edges
representing road sections). The edges' ids are ordered from 0 to the
total number of edges (ie around 20 millions for the French network).
However we often only deal with a subset of the road network of only
100000, 1 million or 10 millions of ids which are uniformly spread
between 0 and 20 millions.
We usually use C# Dictionary with integer keys to store any data by
edges' ids (road section characteristics or archived data, ...) and it
scales very well until several millions of ids. However using a
generic hashing algorithm with 32bits integers (potentially stretching
from -2.14 x 10^9 to +2.14 x 10^9) whereas the repartition of our keys
are much coarser always seemed to me as an overkill.
One optimization I came up with was just using a dense array of the
maximal size where the key of each object is in fact its index inside
this array. It works pretty well : for example when initializing a
dictionary with 10 millions of keys randomly chosen between 0 and 20
millions, there is a gain of more than 200Mb not using a 10 millions
key/value Dictionary in spite of the fact that our array contains
actually 20 millions values. In addition, even if the Dictionary
access time theoretically approaches O(1), in my example random key
access was in average 2 times quicker using the array. Memory-wise,
this approach is also very good when I want to access read-only data
modelled by a C# struct object, as I can pre-allocate a big array of
structs on the stack and removing all the overhead creating by C#
objects (pointers, GC reference and so on...). I even pushed my idea
forward by implementing a generic ArrayDictionary<int,T> object
inheriting C# IDictionary<int,T> interface in order to switch at
runtime between the two implementation.
So all in all, my questions are :
- What is the hashing algorithm used by C# Dictionary? Is it generic
to all type of data once the GetHashCode function is provided?
- Is my optimization sounds good or should I have studied more the
possibilities provided by the existing Dictionary (using another
hashing algorithm adapted to my needs?).
Sorry for the long message, and thanks you if you read through it!
Guillaume