Nick said:
Hello Helge,
I could have replied to your reply, but I wanted to reply to the long post
so that I can refer to your text.
Let me first restate the problem so that you can tell me if you and I are on
Thank you. I replied in length another place in the thread with some
additional information, since I gather that I have not been too clear
the first time: news://msnews.microsoft.com:119/
[email protected]
Thank you for the very long and elaborate reply, unfortunatly we seem to
have a slight misunderstanding about what I was asking for. Please
accept my apologies for not formulating that clearly enough.
the same page. You have implemented a Collection type that you have called
Set. You recognize that the methods and behavior of this class are typical
to the mathematical concept of a Set, where it is important not only to add
and delete members but also to determine membership.
Correct. I have implemented a collection-type, with the properties of a
mathematical Set. The actual class is called "HashSet".
You did not find a
Collection object that performed this function for you, and you wanted to
know (a) if it were there, how could you have missed it? (b) why isn't it
there, since a Set is fundamental to so many mathematical concepts? and (c)
is there a simple workaround that you missed?
Did I capture your questions correctly?
Not really, I have the implementation, what I am missing is a suitable
interface that defined the operations of a mathematical set.
Like System.Collections contains an interface called IList to ease
cooperation between components that exposes parts or data-structures
that can be thought of as having list-structure without divulging into
the details of how that list-structure is implemented, I was kind of
hoping there would be an ISet fulfilling the same purpose for the Set
datastructure as IList does for lists, and IDictionary does for
dictionaries/maps.
Different parts of code could then easily pass set-structured data to
each-other without knowing the exact implementation or having to use
adapters between each others definitions of the set-interface.
I realize that not every datastructure in the world can have an
interface in System.Collections, but I was rather surprised sets weren't
included.
The standard-library for most languages (at-least: C++, JAVA, ML,
python) have a definition of (in whatever is the appropriate parallel in
that language to "interface") sets. I'm really not interested in the
implementation, I have that.
I suggest that you could use a Hashtable and your reply was that your
requirement is not for a Dictionary.
I am not looking for an implementation, I have an implementation. That
implementation does not implement a dictionary (it can be thought of as
a function into {true,false}, but it's not *convinient* (or common
practice) to formulate the behaviour of mathematical sets as dictionaries.
OK... so I looked again at your text. You said that you want a collection
with three abilities: add, delete, and member, where the member method has
to run quickly.
I think I said I *had* a collection, where these member methods run quickly.
exists.
From this, I gather that you expect and plan to use a hashing function not
only to determine membership but also to determine equality. E.G. The hash
value is the objects identity. Let me know if I am off here. Therefore,
two objects are equal if their hash value is equal.
No, I use the hashing function to hash the nodes down to "buckets", and
search linerly in the appropriate bucket. If the buckets grow larger
than some amount (currently log^2 N), I rehash over a larger set of buckets.
This is a standard way to implement (near-)constant-time set operations
for hashable members, and probably very similar to the implementation of
Hashtable.
Now, who determines the hash value of an object. Let's look at an
inheritance tree, where you have a basic INode interface. From INode, you
inherit various network members (I'm guessing: computer, resource, person,
etc... like a directory). So you will have various different items that you
want to store in your Set collection. While I do not know your hash
function, in general, I would consider it a good practice to define the
interface for the hash function at INode (or above) and to implement the
hash function in the object itself.
I Use a IHashCodeProvider, passed to the HashSet constructor to
calculate the hash.
The actual hashcode provider implementation uses only information
obtainable via the INode interface, so i do not have a problem with
implementations having to provide proper implementation of GetHashCode()
Using a IHashCodeProvider alos allows me to easily experiment with
different hashing functions on the same objects.
Therefore, each object needs to provide its own hash function. So, not only
do you need to have a generic Set collection, but you need to have a generic
GetHashCode method defined in the inheritance tree that your "hashable
objects" can use. Also, Hash functions do not, by their nature, guarantee
that the hash value is absolutely unique. Their promise is that the value
is evenly distributed across a wide spectrum of values, and that two objects
that are the same will produce the same hash function. However, two objects
that are different could, potentially, produce the same hash function.
Therefore, you also need to override the Equal() method of your hashable
objects, so that your generic collection has a good way to compare items.
I am aware what a hash-function is, this is not the first time I have
implemented high-performance datastructures. It is, however, the first
time I have done it in C#.
I do not need to override the Equals method, because I compare members
using an IComparator, passed in the Hashset constructor. The actual
IComparator implementation used uses only data obtainable via the INode
interface; that exposes a rather lengthy unique ID; thus the expensive
equality coparison.
The is *astoundingly* similar to the way the
System.Collections.Hashtable works

except it doesn't map keys to
values, it just decides membership, thereby halving memory-consumption.
Let's look again at the criteria: you need a collection that can support
add, delete, membership, generate-hash, and equality. Let me turn that
around: ANY collection that can efficiently support add, delete, membership,
and can appropriatly delegate the generate-hash and equality methods can be
considered to be a valid and useful collection for Set.
I have such a collection implementation. I just wish there was a name
for the properties it has, instead of a name for the specific
implementation, lists have "IList", dictionaries (or maps, as they are
often called in other languages) have "IDictionary", but apparently sets
didn't cut it to get a System.Collections.ISet...
Look at the second sentence carefully in the above paragraph. This is the
key to understanding why you didn't find the Set collection when you were
looking right at it. The HashTable collection completely fills the criteria
stated in the second collection.
1) Hashtable is not a description of properties, it's an implementation
(of IDictionary)
2) IDictionaries are not (naturally) sets
3) While a mathematical set *can* be thought of as a dictionary from
elements into true/false, that is not a very common thing to do. And
using "just the keys" of a hashtable as a set is a hack; it can at times
be a good hack, but it's not good for making people understand what they
can do: "what do you mean i should say foo.Add(item, null) to insert
into the set?".
4) Of course one can implement a HashSet by using an IDictionary, and
just defining that "all the entries which have keys are members, and you
are not supposed to look at the values they map to". But I would rather
just have:
interface ISet: ICollection {
void Add(Object item);
void Remove(Object item);
bool Contains(Object item);
}
Which formulates the properties of a mathematical set very nicely.
As i said, I have ended up just declaring this interface myself, in my
own namespace, which solves my immedeate itch: to just have to explain
to my users what they can do with the data they have been given, which
is: set-operations, without tying me down to a specific implementation.
It allows me to talk about the properties of the data instead of the
representation of the data.
Insert is called HashTable.Add
Delete is called HashTable.Delete
Membership is called HashTable.Contains
Hashtable is not a set, I cannot new HashTable().Add(item), since a
Hashtable maps keys to values, So it would be .Add(item, null).
Generate-Hash is called Object.GetHashCode
Equality is called Object.Equals
HashSet, (and Hashtable) construtors accept an IHashCodeProvider and
IComparator, so I am not bound to using the objects GetHashCode() and/or
Equals() for hashing/comparison, even *if* I used Hashtable as an
implementation for my sets.
Finally, you complain that you don't have a dictionary and therefore you
cannot use a hashtable collection because it is implemented using the
dictionary collection base class (DictionaryBase). Do you care how the
collection is implemented, as long as the implementation is efficient? The
Dictionary collection IS a hash table. It isn't a sorted list.
Implementing the Dictionary as a sorted list requires a different base class
(HybridList).
I have never said anything about DictionaryBase, or claimed that I was
unable to use the Hashtable for implementation of a set, what I did
claim was that a Hashtable (or more generally: IDictionary) does not fit
the description of a set: a set does not map keys to values, it simply
decides containment.
Therefore, I assert categorically that .Net did not leave out the Set
you better have a catch clause for that assertion
A dictionary and a set are not the same. The claim that a set is just a
dictionary where you "only look at the keys" doesn't really help.
collection. It just used a different name. Not only does the collection
exist, but it is highly efficient.
I do not doubt the efficiency of the Hashtable implementation. I known
how hashtables work, and have implemented many different data-structures
for specific applications before: red-black-, AVL-, splay-trees, and
some space-efficient sets and maps for use in code analysis, which
provides runtime-tradeability between memory and cpu utilization, while
still being able to guarantee the asymptotic performance of the
operations on the structures.
You can easily prove this by creating a
simple performance test using your objects. Create a large list of your
objects, and run each collection through its paces: inserting, deleting, and
determining membership. To be fair, the hash function must produce widely
distributed values and both collections must use the same hash function.
Given that criteria, I will put $20 on the table right now that the .Net
HashTable object is just as efficient or more than your collection class.
Out of curiosity, i did a small (totally unscientific) test: HashSet is
my first-draft set implementation, looping 1000 times over inserting and
deleting the ints from 0 to 100000. Of course you could start twiddling
with loadfactors and stuff like that, and probably come up with a
reverse result.
Hashtableset is:
public class HashtableSet: Hashtable, ISet
{
public HashtableSet(int size): base(size) {}
void ISet.Add(Object item) { base.Add(item, null); }
void ISet.Remove(Object item) { base.Remove(item); }
bool ISet.Contains(Object item) { return base.ContainsKey(item); }
}
Both ISet implementations are given an initial capacity argument of
100000/3.
Inserting and removing 100000 elements: 1000 times
graphutil.HashSet: 00:00:28.6111408
graphutil.HashtableSet: 00:00:39.7571680