Andy said:
I am new to HashTables. What are they good for and what kinds of uses do
they have? I am looking for a broad generalized list of what people most
commonly use them for. It is for research on my project to determine if they
would be of some sort of good use in the project.
The short simplified answer is to serve as fast Lookup method. You
would be very safe with that basic definition. A 2nd safe answer is
they are used to create "unique" hash representation of data (i.e.,
common hash methods are MD5, SHA1, SHA256 which uses a specific hash
table to generate a unique hash).
Long answer:
A HASH is a translation of a string or array of bytes to a fixed size
representation, could be a string or number, so that you can use the
HASH to find where something is stored in a block of data.
For example, lets say you have a list of contacts with people names
and email address:
Contacts["Andy B."] = "(e-mail address removed)"
Contacts["Mike"] = "(e-mail address removed)"
Behind the scene, this "mapping" concept will hash "Andy B" and "Mike"
to some generic unique set of indexes in a HASH TABLE that closely
match the string.
Its a way of INDEXING a string that allows for a faster "ZOOM" of
where the the answer might be.
Lets use a very simplicity example here:
lets say your hash table is an array of 26 bytes (characters) "A" to "Z"
HashTable[0] = "A"
HashTable[1] = "B"
HashTable[2] = "C"
...
HashTable[12] = "M"
...
HashTable[25] = "Z"
Its not really a hash table because a HASH uses some algorithm to
create a numeric representation of "A" to "Z", but for the sake of
simplicity, this example offers the same principle.
So what is the hash index would "Andy B" fit and "Mike" fit?
Well, HashTable[0] for "Andy B" because it starts with "A" and
HashTable[12] for "Mike' because it starts with a "M"
Thats the basic idea of a HastTable. But like I said, a HashTable
generically uses some HASH algorithm or function to calculate a hash
of some string. The algorithm could be just the first four bytes of a
string to generate a HASH to determine where to place the string in
the table. Maybe the first four character people names is good enough
to separate names into different groups.
What that means is that one or more people can have the same hash,
ANDY B.
ANDY GARCIA
ANDY SMITH
ANDY WHOEVER
etc, would all have the same hash index.
So one make ask, it is possible to had an untold number of "ANDY" in
the world. How many can you fit in the same Hash location?
That is where Hash Table Page Size come in to play. You might want to
say a page will have 100 people before a new PAGE is create for the
hash. So you can have:
HashTable[0][0] = 1st group of 100 "ANDY" people
HastTable[0][1] = 2nd group of 100 "ANDY" people
HastTable[0][2] = 3nd group of 100 "ANDY" people, but 1/2 full
etc.
When it comes time to adding more "ANDY" people, it will fill up the
page before a new hash page is created. Since Page 2 is 1/2 empty,
it will begin there.
Now, consider the situation where not everyone is ANDY, but maybe ANDI
or ANDREW.
Depending of hashing function, it could maybe lump all of these under
the same hash. For example if only the first letter was used they
would all be lumped together in page 0, HashTable[0][0].
HashTable[0][0] = "ANDY", "ANDI", "ANDREW"
So this is very much dependent of how the Hash function is defined,
the number of indexes you have and the page size.
Sometimes in practice, the hasttable helps create a tree:
HashTable[index] = array of pages of size X
and each page is further expand with another tree of pages that is
based on the LEFT or RIGHT sort order of a string. So for example,
using ANDY, ANDI and ANDREW, the sort order is:
ANDI
ANDREW
ANDY
As the page is filled with the sorted names, once it fills up, a new
page is created in the position that make its sort consistent. So for
example, given new names ANDRIOD and ANDRA, you will may be placed
like this in the hash table tree of pages:
ANDI
ANDRA ANDREW ANDRIOD
ANDY
The above is probably wrong placements, including the idea that pages
may be moved around, but overall the ideas is that left and right
pages based on sorting.
The basic idea of using hash table is to create a way to store a list
of information using a specific hash indexing method that will allow
for a fast lookup.
Almost all, if not all, the collection classes or templates use a hash
table for the key=value associations. Generally, the type of
collection is based on your needs:
For example, like in the mapping example used above:
Collection[key] = value
If the collection is one dimension, using the same key MAY replace the
existing one:
Contacts["Andy B."] = "(e-mail address removed)"
Contacts["Andy B."] = "(e-mail address removed)"
The 2nd one is a replacement of the value for the same key.
So just like DATABASE ideas of UNIQUE or ALLOW DUPES, you have classes
that may offer the same idea.
If that was the case here, then you might has two "ANDY B." keys.
It all depends on the collection class behavior and what it offers
you. These are things you might look for.
Some classes even allow you to replace it default HASH function.
A good example for that is case sensitivity:
Contacts["Andy B."] = "(e-mail address removed)"
Contacts["andy b."] = "(e-mail address removed)"
That could be two or one key=value pairs depending on collection class
you are using. If the class does not offer a option for case
sensitivity, it may offer you a HASH function override where in hash
function override you lower or upper case the key before calculating
the hash. As you can probably imagine, this need is common to
consider, "Andy B" and "andy b" are really the same person.
Finally, what are practical examples of a hash function?
I can't tell you what the .NET collection class uses, but the C/C++
CMap collection class template uses this:
template<class VALUE, class ARG_VALUE>
AFX_INLINE UINT CMapEx<VALUE, ARG_VALUE>::HashString(TKEYARG Key) const
{
LPCSTR key = Key;
UINT nHash = 0;
while (*key) {
nHash = (nHash<<5) + nHash + towlower(*key++);
}
return nHash;
}
It loops thru all key characters. It lower cases the character. Makes
the key=value association case insensitive. It accumulates the lower
case with the last hash value and the binary bit left shifting of the
hash value. Go figure! Probably some computer science guy can tell us
why the above is good enough.
But other common hash functions used in practice are:
CRC Cyclic Redundancy Check, Adds all bytes, result 2 bytes
CRC32 32 bit (four byte) version of CRC
MD5 Message Digest
SHA1 Secure Hash Algorithm
SHA256 Secure Hash Algorithm 256 bit version
SOUNDEX Popular method for getting a common phonetic of a word
and many others, but the above the long time common ones you hear
about used in many things and embedded in many things we take for
granted.
Modern Modem and File transfers and even the internet (TCP) uses
cyclic redundancy checks to make sure what was was sent is the same
that was received.
MD5 and SHAx uses specific HASH TABLES to generate a final hash value.
With these the idea is to create an unique hash for a given string.
You should explore th .NET crypto assembly for these.
DIM hash as string = MD5("Andy B.")
SHA1 and SHA256 is popular in security applications (like SSL, DKIM).
SHA1 is now considered obsolete since the Chinese has shown it is
"easy" (relatively speaking) for two different strings can produce the
same hash value. This is called clobber.
SHA256 is the replacement recommendation and more secured (harder to
clobber), but it was also been recently shown that it has weaknesses
for security purposes but good enough for the next 2-3 years.
--