What are HashTables good for?

  • Thread starter Thread starter Andy B.
  • Start date Start date
A

Andy B.

I am new to HashTables. What are they good for and what kinds of uses do
they have? I am looking for a broad generalized list of what people most
commonly use them for. It is for research on my project to determine if they
would be of some sort of good use in the project.
 
Andy said:
I am new to HashTables. What are they good for and what kinds of uses do
they have? I am looking for a broad generalized list of what people most
commonly use them for. It is for research on my project to determine if they
would be of some sort of good use in the project.

http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Hash_table

--
John Mishefske, Microsoft MVP 2007 - 2009
UtterAccess Editor
Tigeronomy Software
web: http://www.tigeronomy.com
email: sales ~at~ tigeronomy.com
 
Like everything in Net, there is probably nobody which knows everything, so
there is no need that you try that.

Be aware that you should use things as soon as you need them or when a
course asks for that.

The hashtable is a very old mechanism to store whatever data using tag keys
which can help to find and sort things quicker.

Today, it is more a kind of base class, because there are more classes which
are easier and more maintainable to use.

Cor
 
Andy said:
I am new to HashTables. What are they good for and what kinds of uses do
they have? I am looking for a broad generalized list of what people most
commonly use them for. It is for research on my project to determine if they
would be of some sort of good use in the project.

The short simplified answer is to serve as fast Lookup method. You
would be very safe with that basic definition. A 2nd safe answer is
they are used to create "unique" hash representation of data (i.e.,
common hash methods are MD5, SHA1, SHA256 which uses a specific hash
table to generate a unique hash).

Long answer:

A HASH is a translation of a string or array of bytes to a fixed size
representation, could be a string or number, so that you can use the
HASH to find where something is stored in a block of data.

For example, lets say you have a list of contacts with people names
and email address:

Contacts["Andy B."] = "(e-mail address removed)"
Contacts["Mike"] = "(e-mail address removed)"

Behind the scene, this "mapping" concept will hash "Andy B" and "Mike"
to some generic unique set of indexes in a HASH TABLE that closely
match the string.

Its a way of INDEXING a string that allows for a faster "ZOOM" of
where the the answer might be.

Lets use a very simplicity example here:

lets say your hash table is an array of 26 bytes (characters) "A" to "Z"

HashTable[0] = "A"
HashTable[1] = "B"
HashTable[2] = "C"
...
HashTable[12] = "M"
...
HashTable[25] = "Z"

Its not really a hash table because a HASH uses some algorithm to
create a numeric representation of "A" to "Z", but for the sake of
simplicity, this example offers the same principle.

So what is the hash index would "Andy B" fit and "Mike" fit?

Well, HashTable[0] for "Andy B" because it starts with "A" and
HashTable[12] for "Mike' because it starts with a "M"

Thats the basic idea of a HastTable. But like I said, a HashTable
generically uses some HASH algorithm or function to calculate a hash
of some string. The algorithm could be just the first four bytes of a
string to generate a HASH to determine where to place the string in
the table. Maybe the first four character people names is good enough
to separate names into different groups.

What that means is that one or more people can have the same hash,

ANDY B.
ANDY GARCIA
ANDY SMITH
ANDY WHOEVER

etc, would all have the same hash index.

So one make ask, it is possible to had an untold number of "ANDY" in
the world. How many can you fit in the same Hash location?

That is where Hash Table Page Size come in to play. You might want to
say a page will have 100 people before a new PAGE is create for the
hash. So you can have:

HashTable[0][0] = 1st group of 100 "ANDY" people
HastTable[0][1] = 2nd group of 100 "ANDY" people
HastTable[0][2] = 3nd group of 100 "ANDY" people, but 1/2 full

etc.

When it comes time to adding more "ANDY" people, it will fill up the
page before a new hash page is created. Since Page 2 is 1/2 empty,
it will begin there.

Now, consider the situation where not everyone is ANDY, but maybe ANDI
or ANDREW.

Depending of hashing function, it could maybe lump all of these under
the same hash. For example if only the first letter was used they
would all be lumped together in page 0, HashTable[0][0].

HashTable[0][0] = "ANDY", "ANDI", "ANDREW"

So this is very much dependent of how the Hash function is defined,
the number of indexes you have and the page size.

Sometimes in practice, the hasttable helps create a tree:

HashTable[index] = array of pages of size X

and each page is further expand with another tree of pages that is
based on the LEFT or RIGHT sort order of a string. So for example,
using ANDY, ANDI and ANDREW, the sort order is:

ANDI
ANDREW
ANDY

As the page is filled with the sorted names, once it fills up, a new
page is created in the position that make its sort consistent. So for
example, given new names ANDRIOD and ANDRA, you will may be placed
like this in the hash table tree of pages:

ANDI
ANDRA ANDREW ANDRIOD
ANDY

The above is probably wrong placements, including the idea that pages
may be moved around, but overall the ideas is that left and right
pages based on sorting.

The basic idea of using hash table is to create a way to store a list
of information using a specific hash indexing method that will allow
for a fast lookup.

Almost all, if not all, the collection classes or templates use a hash
table for the key=value associations. Generally, the type of
collection is based on your needs:

For example, like in the mapping example used above:

Collection[key] = value

If the collection is one dimension, using the same key MAY replace the
existing one:

Contacts["Andy B."] = "(e-mail address removed)"
Contacts["Andy B."] = "(e-mail address removed)"

The 2nd one is a replacement of the value for the same key.

So just like DATABASE ideas of UNIQUE or ALLOW DUPES, you have classes
that may offer the same idea.

If that was the case here, then you might has two "ANDY B." keys.

It all depends on the collection class behavior and what it offers
you. These are things you might look for.

Some classes even allow you to replace it default HASH function.

A good example for that is case sensitivity:

Contacts["Andy B."] = "(e-mail address removed)"
Contacts["andy b."] = "(e-mail address removed)"

That could be two or one key=value pairs depending on collection class
you are using. If the class does not offer a option for case
sensitivity, it may offer you a HASH function override where in hash
function override you lower or upper case the key before calculating
the hash. As you can probably imagine, this need is common to
consider, "Andy B" and "andy b" are really the same person. :-)

Finally, what are practical examples of a hash function?

I can't tell you what the .NET collection class uses, but the C/C++
CMap collection class template uses this:

template<class VALUE, class ARG_VALUE>
AFX_INLINE UINT CMapEx<VALUE, ARG_VALUE>::HashString(TKEYARG Key) const
{
LPCSTR key = Key;
UINT nHash = 0;
while (*key) {
nHash = (nHash<<5) + nHash + towlower(*key++);
}
return nHash;
}

It loops thru all key characters. It lower cases the character. Makes
the key=value association case insensitive. It accumulates the lower
case with the last hash value and the binary bit left shifting of the
hash value. Go figure! Probably some computer science guy can tell us
why the above is good enough. :-)

But other common hash functions used in practice are:

CRC Cyclic Redundancy Check, Adds all bytes, result 2 bytes
CRC32 32 bit (four byte) version of CRC
MD5 Message Digest
SHA1 Secure Hash Algorithm
SHA256 Secure Hash Algorithm 256 bit version
SOUNDEX Popular method for getting a common phonetic of a word

and many others, but the above the long time common ones you hear
about used in many things and embedded in many things we take for
granted.

Modern Modem and File transfers and even the internet (TCP) uses
cyclic redundancy checks to make sure what was was sent is the same
that was received.

MD5 and SHAx uses specific HASH TABLES to generate a final hash value.
With these the idea is to create an unique hash for a given string.
You should explore th .NET crypto assembly for these.

DIM hash as string = MD5("Andy B.")

SHA1 and SHA256 is popular in security applications (like SSL, DKIM).

SHA1 is now considered obsolete since the Chinese has shown it is
"easy" (relatively speaking) for two different strings can produce the
same hash value. This is called clobber.

SHA256 is the replacement recommendation and more secured (harder to
clobber), but it was also been recently shown that it has weaknesses
for security purposes but good enough for the next 2-3 years.


--
 
Do you have a useful replacement then?
Cor Ligthert said:
Like everything in Net, there is probably nobody which knows everything,
so there is no need that you try that.

Be aware that you should use things as soon as you need them or when a
course asks for that.

The hashtable is a very old mechanism to store whatever data using tag
keys which can help to find and sort things quicker.

Today, it is more a kind of base class, because there are more classes
which are easier and more maintainable to use.

Cor
 
Andy said:
I am new to HashTables. What are they good for and what kinds of uses do
they have? I am looking for a broad generalized list of what people most
commonly use them for. It is for research on my project to determine if they
would be of some sort of good use in the project.

Nothing - you should use the Dictionary<> class instead. It's the
generic replacement for the HashTable class, which means that it's
strongly typed so that you don't have to cast everything that you read
from it.

A Dictionary is used whenever you want a fast lookup on the key for a
key-value pair. Getting an item by key is close to an O(1) operation,
which means that it practically takes the same time to get the item
regardless of how many items there are in the Dictionary.
 
I use the hashtable a lot in my projects , for the simple reasson that it is
lightweight and fast in fact it is unbeatable in terms of speed when
retrieving key value pairs ( tested and proofed by Balena see the 2003
book )

Where do i use it for ?

If for some reasson i need to store a key value and must check if this value
already exists in the table in a later stage or if i want to filter
duplicates and simply want to ignore the same keys
( if you set the item property in a hashtable you do not need to check for
existence ) or if i need a storage that can hold anny sort of object mixed

I you have a rough estimation of data and initilaize the table with so manny
slots ( or with a slight higher margin ) then the hashtable uses turbo boost

Regards

Michel
 
Andy

The hashtable is what i call a programming fundament , just like for
instance arrays and collections they are always there somewhere behind the
scenes of the fancy new methods
however if you know how and when ( and also important when not ) to use
them you can sure program more efective and can even code the fastest code
as these "Low level" mechanisms are not bloated in anny way

HTH

Michel
 
Back
Top