My question is related to the discussion "about the hashtable and
equals" where I learnt a few interesting things.
In my company we deal with large volumes of data indexed by unique
integer ids (for background, we deal with road traffic data, and the
road network is modelled as a set of edges representing road
sections). The edges' ids run from 0 to the total number of edges
(i.e. around 20 million for the French network). However, we often
work with only a subset of the road network, typically 100,000,
1 million or 10 million ids, which are uniformly spread between 0 and
20 million.
We usually use a C# Dictionary with integer keys to store any data
indexed by edge id (road section characteristics, archived data, ...),
and it scales very well up to several million ids. However, using a
generic hashing algorithm over 32-bit integers (potentially ranging
from -2.14 x 10^9 to +2.14 x 10^9) when our keys actually occupy a
much narrower range has always seemed like overkill to me.
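To make the setup concrete, here is a stripped-down sketch of how we
typically store data by edge id (the EdgeData struct and its fields
are invented just for this example, our real payloads are richer):

    using System;
    using System.Collections.Generic;

    // Hypothetical per-edge payload, for illustration only.
    struct EdgeData
    {
        public float LengthMeters;
        public float FreeFlowSpeedKmh;
    }

    class Example
    {
        static void Main()
        {
            // Keys are edge ids drawn from 0..20,000,000; only a subset is present.
            var byEdgeId = new Dictionary<int, EdgeData>();
            byEdgeId[42] = new EdgeData { LengthMeters = 120f, FreeFlowSpeedKmh = 50f };

            EdgeData edge;
            if (byEdgeId.TryGetValue(42, out edge))
                Console.WriteLine(edge.LengthMeters);
        }
    }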
One optimization I came up with was simply to use a dense array of the
maximal size, where the key of each object is its index in that array.
It works pretty well: for example, when initializing with 10 million
keys randomly chosen between 0 and 20 million, I save more than 200 MB
compared with a 10-million-entry key/value Dictionary, even though the
array actually holds 20 million slots. In addition, even though
Dictionary access time is theoretically O(1), in my tests random key
access was on average about 2 times faster with the array. Memory-wise,
this approach is also very good when I want to access read-only data
modelled as a C# struct, since I can pre-allocate one big array of
structs, which stores the values inline and removes all the overhead
created by C# objects (object headers, GC references, and so on). I
even pushed the idea further by implementing a generic
ArrayDictionary<int,T> class that implements the C# IDictionary<int,T>
interface, so I can switch between the two implementations at runtime.
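To give an idea of what I mean, here is a simplified sketch of the
array-backed map (the class and member names below are illustrative
only; my actual ArrayDictionary<int,T> implements the full
IDictionary<int,T> interface, whose remaining members are omitted here):

    using System.Collections.Generic;

    // Simplified sketch of an array-backed integer-keyed map.
    class ArrayMap<T>
    {
        private readonly T[] _values;      // dense storage indexed directly by key
        private readonly bool[] _present;  // tracks which slots actually hold a value

        public ArrayMap(int maxKeyExclusive)
        {
            _values = new T[maxKeyExclusive];
            _present = new bool[maxKeyExclusive];
        }

        public T this[int key]
        {
            get
            {
                if (!_present[key]) throw new KeyNotFoundException();
                return _values[key];
            }
            set
            {
                _values[key] = value;
                _present[key] = true;
            }
        }

        public bool TryGetValue(int key, out T value)
        {
            if (key >= 0 && key < _values.Length && _present[key])
            {
                value = _values[key];
                return true;
            }
            value = default(T);
            return false;
        }

        public bool ContainsKey(int key)
        {
            return key >= 0 && key < _present.Length && _present[key];
        }
    }

The extra bool[] is what lets it still behave like a dictionary
(distinguishing "no value" from a default value) rather than a bare array.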
So, all in all, my questions are:
- What hashing algorithm does the C# Dictionary use? Is it generic to
all types of data once a GetHashCode function is provided?
- Does my optimization sound reasonable, or should I have looked
further into what the existing Dictionary offers (for example, a
different hashing algorithm adapted to my key distribution)?
Sorry for the long message, and thank you if you read through it!
Guillaume
equals" where I learnt a few interesting things.
In my company we deal with big volume of data indexed by unique
integer ids (for background information, we deal with road traffic
data and and the road network is modelled by a set of edges
representing road sections). The edges' ids are ordered from 0 to the
total number of edges (ie around 20 millions for the French network).
However we often only deal with a subset of the road network of only
100000, 1 million or 10 millions of ids which are uniformly spread
between 0 and 20 millions.
We usually use C# Dictionary with integer keys to store any data by
edges' ids (road section characteristics or archived data, ...) and it
scales very well until several millions of ids. However using a
generic hashing algorithm with 32bits integers (potentially stretching
from -2.14 x 10^9 to +2.14 x 10^9) whereas the repartition of our keys
are much coarser always seemed to me as an overkill.
One optimization I came up with was just using a dense array of the
maximal size where the key of each object is in fact its index inside
this array. It works pretty well : for example when initializing a
dictionary with 10 millions of keys randomly chosen between 0 and 20
millions, there is a gain of more than 200Mb not using a 10 millions
key/value Dictionary in spite of the fact that our array contains
actually 20 millions values. In addition, even if the Dictionary
access time theoretically approaches O(1), in my example random key
access was in average 2 times quicker using the array. Memory-wise,
this approach is also very good when I want to access read-only data
modelled by a C# struct object, as I can pre-allocate a big array of
structs on the stack and removing all the overhead creating by C#
objects (pointers, GC reference and so on...). I even pushed my idea
forward by implementing a generic ArrayDictionary<int,T> object
inheriting C# IDictionary<int,T> interface in order to switch at
runtime between the two implementation.
So all in all, my questions are :
- What is the hashing algorithm used by C# Dictionary? Is it generic
to all type of data once the GetHashCode function is provided?
- Is my optimization sounds good or should I have studied more the
possibilities provided by the existing Dictionary (using another
hashing algorithm adapted to my needs?).
Sorry for the long message, and thanks you if you read through it!
Guillaume