fastest searchable data structure?

Pieter

Hi,

I need some type of array/list/... in which I can store objects together
with a unique key. The most important thing is performance: I will need to
search the list by key all the time. Which data structure is best suited
for this? A Hashtable? The list may contain up to 10^12 items, but more
probably around 10^9 or even 10^6 most of the time...

Thanks a lot in advance,


Pieter
 
Pieter said:
I need some type of array/list/... in which I can store objects together
with a unique key.

Sounds like Dictionary<TKey, TValue>.

Pieter said:
The list may contain up to 10^12

Seriously? You do realise that even at one byte per item, with no
index overhead, padding, etc., that's a TB?

I'm going to assume that is a typo - but even so, for large numbers
you may be better off using a database approach, with a non-clustered
unique index on this field (only; don't span - let it use a bookmark
lookup to get the value), and ideally with the index on a different
filegroup to the main data.

Marc
 
Thanks both for your answers!

Actually, it's not a typo: it could theoretically go up to 6*10^12 :-)
But as I said, I expect a practical implementation of around 10^6...

I will indeed use SQL Server when the amount gets too big... But in case
it's under 10^6: will a Dictionary be better than a Hashtable? Or
something else?
 
Do you know how much memory 10^12 of these "items" is going to require?
Because if you run out of physical RAM, your whole process will either blow
up or slow down dramatically due to paging.
So the best alternative may be a database to hold your items. In this case,
a search on a table with a clustered index on the primary key will be the
fastest.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
 
Hehe, I do know :-)
The problem is: it will be for an experiment, and not every possibility will
occur as often as the others. So I definitely want to put the most popular ones in
some kind of local cache... See it as keeping the first block of a B-tree in
RAM...
 
Pieter said:
Thanks both for your answers!

Actually, it's not a typo: it could theoretically go up to 6*10^12 :-)
But as I said, I expect a practical implementation of around 10^6...

I will indeed use SQL Server when the amount gets too big... But in case
it's under 10^6: will a Dictionary be better than a Hashtable? Or
something else?
I would carefully avoid solutions that require you to write the same
type of code multiple times: one version for < 10^6, one for < 10^9 and one
for < 10^12. If you need as many as 10^12, use a database.

If you absolutely want to draw the best performance out of everything,
abstract the storage of this list out into a new class, so that you can
inherit from it for an in-memory data structure and inherit from it for
a database-backed structure.
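
Just to make the idea concrete, here's a minimal C# sketch of that kind of
abstraction (the class and member names are invented for the example, not
from any existing library):

using System.Collections.Generic;

// Callers only see TryGet/Put, so the backing store can be swapped later.
public abstract class StateStore<TKey, TValue>
{
    public abstract bool TryGet(TKey key, out TValue value);
    public abstract void Put(TKey key, TValue value);
}

// In-memory implementation backed by Dictionary<TKey, TValue>.
public class MemoryStateStore<TKey, TValue> : StateStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> items = new Dictionary<TKey, TValue>();

    public override bool TryGet(TKey key, out TValue value)
    {
        return items.TryGetValue(key, out value);
    }

    public override void Put(TKey key, TValue value)
    {
        items[key] = value;
    }
}

// A database-backed subclass would override the same two members and issue
// an indexed SELECT / INSERT instead of touching the in-memory dictionary.

That way the calling code doesn't change when you move from the in-memory
version to the database-backed one.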
 
Thanks Lasse. What it actually will do is: every object will be another
possible state. But some states will be needed much more than others. So
the 'popular' ones will be put in a local cache. But it's the structure of
that cache that I'm worrying about: as maybe 80% of the searching will happen
in there, it should be as fast as possible. So: which structure to use?
 
Pieter said:
I need some type of array/list/... in which I can store objects together
with a unique key. The most important thing is performance: I will need to
search the list by key all the time. Which data structure is best suited
for this? A Hashtable? The list may contain up to 10^12 items, but more
probably around 10^9 or even 10^6 most of the time...

You will get the fastest performance with a hash table. Assuming that
you choose a good algorithm to assign the hash values, the hash table has
the advantage that the average number of accesses needed to find a key
does not depend on the size of the table, but only on how full the table
is. With a table that is not terribly full (say 80 or 90%; I don't
remember the exact figures), the average number of accesses is one point
something. This beats a search in a B-tree, which requires a number of
accesses that grows with the logarithm of the size of the table. Note that
I'm speaking about the *average* number of accesses. The *worst-case*
scenario is horribly bad, since it would require a full scan of the hash
table. Fortunately, this is extremely improbable, on the condition that the
hashing algorithm and the collision-handling algorithm are properly chosen.

If you are going to deal with in-memory elements using .NET, you can use
a Dictionary<key,value>, which is implemented as a hash table. You will
need to use 64-bit code (and run it on a huge machine) if you want
to address 10^12 elements. If you need to store your data on disk, a
properly programmed hashing algorithm against a flat file can outperform a
database server, which uses trees to store its indices.
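
For the in-memory case that boils down to something like this (just a
sketch; the key and value types here are only examples):

using System;
using System.Collections.Generic;

class Demo
{
    static void Main()
    {
        Dictionary<long, string> states = new Dictionary<long, string>();
        states.Add(42L, "some state");
        states.Add(1234567L, "another state");

        // Average-case O(1) lookup by key, independent of how many items are stored.
        string value;
        if (states.TryGetValue(42L, out value))
        {
            Console.WriteLine(value);
        }
    }
}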
 
A Hashtable would be faster in terms of searching, as long as one was
searching by a key value.

--
HTH,

Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP
 
(of course, given the numbers in question, *any* in-memory approach is
doomed... doomed...)

[Cor Ligthert]
System.Collections.SortedList
&
[Kevin Spencer]
A Hashtable would be faster in terms of searching, as long as one was
searching by a key value.

Yes, a hash-table implementation is better than a flat list, but
Hashtable is the non-typed version; for the reasons already given by
Peter the generic versions would generally be preferable -
specifically one of Dictionary<TKey, TValue>, SortedDictionary<TKey,
TValue> and SortedList<TKey, TValue> (depending on what operations are
important). The OP indicates that *read* performance is the key here
(not insertion/deletion); these report O(1), O(log n) and O(log n) for
retrieval - hence Dictionary<TKey, TValue> would be my first choice.
But the real answer is to test it with typical data: if you code the
app to work against the IDictionary<TKey, TValue> interface (rather
than any concrete implementation) you should be able to try different
implementations with likely data.
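
Something along these lines would do for the test harness (rough C# sketch;
the sizes and types are arbitrary):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupTest
{
    // Works against the interface, so any implementation can be timed.
    static TimeSpan TimeLookups(IDictionary<int, string> map, int[] keys)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string value;
        foreach (int key in keys)
        {
            map.TryGetValue(key, out value);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    static void Main()
    {
        const int size = 1000000;
        int[] keys = new int[size];
        Random rnd = new Random(12345);
        for (int i = 0; i < keys.Length; i++) keys[i] = rnd.Next(size);

        // Same data, three implementations; only the construction differs.
        IDictionary<int, string>[] candidates =
        {
            new Dictionary<int, string>(),
            new SortedDictionary<int, string>(),
            new SortedList<int, string>()
        };
        foreach (IDictionary<int, string> map in candidates)
        {
            for (int i = 0; i < size; i++) map[i] = "value " + i;
            Console.WriteLine("{0}: {1}", map.GetType().Name, TimeLookups(map, keys));
        }
    }
}

Run it in release mode, with realistic key distributions, before drawing
any conclusions.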

Marc
 
Hello Pieter

Balena did some tests and posted the results on his website; however, with
the transition from VB2themax to dotnet2themax the article seems to have
disappeared.
The result was that the hashtable was, and still is, the fastest key/value
pair object: the larger the amount of data, the more efficient the
hashtable was in comparison to the other methods, for both insertion and
retrieval of the key/value pairs (Balena timed both the time it took to
build up the data structure and the time it took to retrieve n elements).
However, with the huge amounts you are talking about, a database would be a
much better idea.


HTH

Michel
 
Found some numbers (needed to do some translation).

The test was with 100,000 items.

Compared were ArrayList, Hashtable and SortedList.

ArrayList was 4 times faster than the Hashtable; the Hashtable was 8-100 times
faster than the SortedList.

Please, if you write some comparison routines yourself, don't forget
to set the compile options in VB.Net: remove overflow checks and enable
optimizations. Otherwise we get trolling messages here again that C# is so
much faster than VB.Net; with these options set the same as they are in C#,
they should run at the same speed.
 
Pieter said:
Thanks Lasse. What it actually will do is: every object will be another
possible state. But some states will be needed much more than others. So
the 'popular' ones will be put in a local cache. But it's the structure of
that cache that I'm worrying about: as maybe 80% of the searching will happen
in there, it should be as fast as possible. So: which structure to use?

If you have a key and need the object related to that key, use a
Dictionary<Key, Value> for this; it uses a hash table and is generally
the fastest one for this kind of query.

If you have a key that is "in the neighbourhood of" something, then you need
a structure that keeps values in sorted order; a tree or similar would
be good here, I think.
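
For that second case, a rough sketch of a "nearest key at or below" lookup
over a SortedList<TKey, TValue>, using a hand-rolled binary search (purely
illustrative, not from any library):

using System;
using System.Collections.Generic;

static class NearestKey
{
    // Returns the largest key <= wanted; assumes the list is non-empty
    // and that at least one key qualifies.
    public static int FloorKey(SortedList<int, string> list, int wanted)
    {
        IList<int> keys = list.Keys;   // already sorted ascending
        int lo = 0, hi = keys.Count - 1, best = -1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (keys[mid] <= wanted)
            {
                best = mid;     // candidate; look for a closer one to the right
                lo = mid + 1;
            }
            else
            {
                hi = mid - 1;
            }
        }
        if (best < 0) throw new ArgumentException("no key <= " + wanted);
        return keys[best];
    }
}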
 
Found some numbers (needed to do some translation).

The test was with 100,000 items.

Compared were ArrayList, Hashtable and SortedList.

ArrayList was 4 times faster than the Hashtable; the Hashtable was 8-100 times
faster than the SortedList.

It really surprises me that you found an ArrayList faster than a
Hashtable with that many items. I could understand it for a really
small dataset, but at 100,000 a hashtable should be much faster than a
list. Could you post a link to the code?

Jon
 
ArrayList was 4 times faster than the Hashtable
The other question would be: "at what, exactly?"

If the timings include the insertion time, then yes: a hash-table
(Hashtable or Dictionary<TKey,TValue>) will take longer: it needs to
obtain hashes, build buckets, etc (an ArrayList or List<T> just has to
copy an array a few times). But the *read* performance should then be
/significantly/ better, assuming that the items are being fetched by
key (and not by looping).

Without inspectable, reproducible code I'm going to take this with a
pinch of salt...
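
A quick way to separate the two effects (rough sketch, not Balena's code):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class InsertVsLookup
{
    static void Main()
    {
        const int count = 100000;   // same order of magnitude as the quoted test

        // Insertion: the list just appends; the dictionary also hashes and buckets.
        Stopwatch watch = Stopwatch.StartNew();
        List<int> list = new List<int>();
        for (int i = 0; i < count; i++) list.Add(i);
        Console.WriteLine("List insert:        " + watch.Elapsed);

        watch = Stopwatch.StartNew();
        Dictionary<int, int> dict = new Dictionary<int, int>();
        for (int i = 0; i < count; i++) dict[i] = i;
        Console.WriteLine("Dictionary insert:  " + watch.Elapsed);

        // Keyed reads: the list has to scan (O(n) per lookup); the dictionary hashes (O(1)).
        Random rnd = new Random(1);
        watch = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++) list.Contains(rnd.Next(count));
        Console.WriteLine("List lookups:       " + watch.Elapsed);

        rnd = new Random(1);
        watch = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++) dict.ContainsKey(rnd.Next(count));
        Console.WriteLine("Dictionary lookups: " + watch.Elapsed);
    }
}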

Marc
 
Before anyone takes this out of perspective:

1. These tests were done by Balena, not by me; however, the results and code
are no longer on the website, but I have read on a Dutch website that
you can find them in his 2003 book (I have the 2002 and 2005 versions).

2. The conclusion was that the Hashtable was the fastest when values were
retrieved by key; however, an ArrayList was faster when looping
through the values.

"Although the statistics might tell you that the Amazon river's average
depth is 50 cm, this does not mean that you can walk to the other side."

Michel
 
Michel Posseth said:
Before anyone takes this out of perspective:

1. These tests were done by Balena, not by me; however, the results and code
are no longer on the website, but I have read on a Dutch website that
you can find them in his 2003 book (I have the 2002 and 2005 versions).

2. The conclusion was that the Hashtable was the fastest when values were
retrieved by key; however, an ArrayList was faster when looping
through the values.

Right. That makes a lot more sense. As far as I'm aware, this thread is
about looking up by key, at which point the hashtable should win
easily.
 