Generic Dictionary - Is there a speed difference keying off an int,rather then a string.

  • Thread starter Thread starter DaTurk
  • Start date Start date
D

DaTurk

That's basically my question, if I use a generic dictionary, and key
it off a 32 bit int, rather then a string ... will there be any speed
difference when I search? Even if there is a benefit, however
negligible, I'd still like to know about it.

Thank you in advance,

Matthew
 
DaTurk said:
That's basically my question, if I use a generic dictionary, and key
it off a 32 bit int, rather then a string ... will there be any speed
difference when I search? Even if there is a benefit, however
negligible, I'd still like to know about it.

The only way to answer the question with certainty is for _you_ to
measure the performance in both scenarios, in the context of your own code.

As a general rule, I'd guess that int could be much faster than a string
in a pure benchmark. After all, an array of ints is a single
allocation, whereas every new string you have to allocate and add as a
key to your dictionary requires a new object. And calculating the hash
code for a 32-bit int is much faster than for a string, since it's just
the int value itself.

But: strings used as indices for your collection may be strings that
have to be allocated anyway, the hash code may be cached and/or
precalculated, and if the int isn't a natural connection to your
underlying data, there could be additional overhead associated with
using an int that isn't readily apparent in a pure benchmark.

There's no way for anyone to be able to know whether _in your specific
case_ which would be faster, because there are too many things that can
affect the overall performance. If it's that important (and frankly, I
doubt it is) you'll just have to implement it yourself and measure each
way to find out which is faster.

Pete
 
DaTurk said:
That's basically my question, if I use a generic dictionary, and key
it off a 32 bit int, rather then a string ... will there be any speed
difference when I search? Even if there is a benefit, however
negligible, I'd still like to know about it.
Searching in a dictionary that uses ints as keys is faster, becausing
hashing an int is an O(1) operation while hashing a string is O(N). It's as
simple as that.

But of course that's not the question. Because why would you even choose to
hash a string if you could hash an int with equal ease? The question is:
what's the impact on the *rest* of your system if you do this, in terms of
maintainability but also performance? Why exactly do you think you need to
be faster in the first place? Why do you think improving hashing performance
will be significant for the performance of your system overall? *Those* are
the interesting questions. If you were foolish enough to use only
dictionaries with ints for keys because "it's faster", you'd end up with
very convoluted systems that might actually be slower by sheer virtue of
their cleverness.
 
That's basically my question, if I use a generic dictionary, and key
it off a 32 bit int, rather then a string ... will there be any speed
difference when I search? Even if there is a benefit, however
negligible, I'd still like to know about it.

Anything left in the binary world is faster, so YES an int will be faster,
as there is no need to "translate" individual letters. This is an
oversimplification and someone might ding me on a technicality, but strings
are slower than numbers.

Having said that, performance is often overrated as the highest concern. In
business, maintainability often costs far more than performance and gets
far less press.

Peace and Grace,

--
Gregory A. Beamer
MVP; MCP: +I, SE, SD, DBA

Twitter: @gbworld
Blog: http://gregorybeamer.spaces.live.com

*******************************************
| Think outside the box! |
*******************************************
 
Thank you all for the replies. I figured ints were faster, I just
wanted to confirm, and the why of it. My applications are used in
finance, so any time I can shave off time of a movement through the
code it is a benefit to me and my firm. But using ints just doesn't
seem feasible, so strings it is.

Matthew
 
DaTurk said:
Thank you all for the replies. I figured ints were faster, I just
wanted to confirm, and the why of it. [...]

Just to be clear: so far, all you've confirmed is that ints _might_ be
faster. The only way to _prove_ that they are is to implement your code
that way and compare the performance to the string-based version.

Noting, of course, that anything you prove today could be wrong
tomorrow...for example, the System.String class could start caching the
calculated hash codes, or precalculate them as the string is created.
Though, the fact that the hash code isn't cached for the String class
now (an implementation detail no one should rely on knowing) suggests
that someone at Microsoft looked at the question, and determined that
the hash code calculation for String isn't generally a bottleneck, and
so isn't worth avoiding as a way to improve overall throughput.

Pete
 
Peter said:
DaTurk said:
Thank you all for the replies. I figured ints were faster, I just
wanted to confirm, and the why of it. [...]

Just to be clear: so far, all you've confirmed is that ints _might_ be
faster. The only way to _prove_ that they are is to implement your code
that way and compare the performance to the string-based version.

Noting, of course, that anything you prove today could be wrong
tomorrow...for example, the System.String class could start caching the
calculated hash codes, or precalculate them as the string is created.

This cannot affect the overall observation that hashing ints will be faster,
because "hashing" an integer will never require such a step.
Though, the fact that the hash code isn't cached for the String class
now (an implementation detail no one should rely on knowing) suggests
that someone at Microsoft looked at the question, and determined that
the hash code calculation for String isn't generally a bottleneck

This is a bit misleading. It's not that it's not a bottleneck or that
optimizing it wouldn't be worth it, it's just that caching the results won't
help because you wouldn't be able to use it often enough.

The problem is that you only benefit from this if you very frequently hashed
the same string instances -- not just the same *strings*, the same
*instances*. If it's not the same instance, you would at least need to
consider the entire string to ascertain you've processed it before -- which
defeats the purpose, and is also why interning the string doesn't get around
this.

Frequently hashing strings that are only computed once is not a common
scenario. If the strings are compile-time constants, you'd use enums
instead; if they're computed once at runtime... Well, I have trouble even
imagining a useful program where this would happen, but it's doubtlessly not
worth adding an integer field to every string for.
 
Jeroen said:
Peter said:
DaTurk said:
Thank you all for the replies. I figured ints were faster, I just
wanted to confirm, and the why of it. [...]

Just to be clear: so far, all you've confirmed is that ints _might_ be
faster. The only way to _prove_ that they are is to implement your
code that way and compare the performance to the string-based version.

Noting, of course, that anything you prove today could be wrong
tomorrow...for example, the System.String class could start caching
the calculated hash codes, or precalculate them as the string is created.

This cannot affect the overall observation that hashing ints will be
faster, because "hashing" an integer will never require such a step.

You can't make any statement like that without knowing the full details
of the rest of the code. For example, the OP could (likely does) have
strings that are natural candidates as keys already. If in the future,
all strings have their hash precalculated and cached, then for strings
the program has to have anyway, the hash calculation is sunk cost and
doesn't show up in any performance measurement of the dictionary itself.
This is a bit misleading. It's not that it's not a bottleneck or that
optimizing it wouldn't be worth it, it's just that caching the results
won't help because you wouldn't be able to use it often enough.

Again, you can't say that for sure. Not the least of which because my
statement was a hypothetical, necessarily lacking the details that would
be required for you to make any definitive statement about what is or is
not a bottleneck.
The problem is that you only benefit from this if you very frequently
hashed the same string instances -- not just the same *strings*, the
same *instances*. If it's not the same instance, you would at least need
to consider the entire string to ascertain you've processed it before --
which defeats the purpose, and is also why interning the string doesn't
get around this.

That entire paragraph is irrelevant to my point.
Frequently hashing strings that are only computed once is not a common
scenario. If the strings are compile-time constants, you'd use enums
instead; if they're computed once at runtime... Well, I have trouble
even imagining a useful program where this would happen, but it's
doubtlessly not worth adding an integer field to every string for.

Likewise.

I stand by my previous statement: unless someone has actually written
both implementations _in context_ of the actual program in which they
are being considered for use and compared the performance costs of each,
there is no way to know a priori which one will perform faster in the
context of the actual program.

Pete
 
Peter said:
Jeroen said:
Peter said:
DaTurk wrote:
Thank you all for the replies. I figured ints were faster, I just
wanted to confirm, and the why of it. [...]

Just to be clear: so far, all you've confirmed is that ints _might_
be faster. The only way to _prove_ that they are is to implement
your code that way and compare the performance to the string-based
version.

Noting, of course, that anything you prove today could be wrong
tomorrow...for example, the System.String class could start caching
the calculated hash codes, or precalculate them as the string is
created.

This cannot affect the overall observation that hashing ints will be
faster, because "hashing" an integer will never require such a step.

You can't make any statement like that without knowing the full details
of the rest of the code.

I am not discussing the full details of the rest of the code. I in fact
pointed out to the OP that focusing on the hashing itself was the wrong
thing to do, precisely because focusing on the hashing only tells you
nothing about the performance of the program overall.

But if we *are* focusing on the hashing itself, my statement is that *on the
whole*, hashing integers is faster than hashing strings, irrespective of how
or when System.String computes hashes. Nothing more. There's really nothing
controversial about that, I hope. It's trivial to conceive of programs that
"hash ints slowly" and "hash strings fast" because they compute those ints
in extremely roundabout ways, but obviously any discussion of relative
performance of individual operations can be rendered moot that way.

I'm granting you the rest of the discussion, because you're not reading what
I'm writing at all. You're only reading that I'm disagreeing with you,
without even trying to consider my points outside of the boxing ring you're
willing to fight in. This is no basis for a productive discussion.

Luckily we've gotten around to just repeating arguments, so we can leave it
at this anyway.
 
Jeroen said:
I am not discussing the full details of the rest of the code.

But I am. Why reply to my post, which is discussing the full details of
the rest of the code, if you do not intend to discuss the full details
of the code?
I in fact
pointed out to the OP that focusing on the hashing itself was the wrong
thing to do, precisely because focusing on the hashing only tells you
nothing about the performance of the program overall.

But if we *are* focusing on the hashing itself, [...]

If we *are* focusing on the hashing itself, then there was no need to
reply to my post, as I was specifically excluding the analysis that
considers only the hashing itself.

Pete
 
Peter said:
But I am. Why reply to my post, which is discussing the full details of
the rest of the code, if you do not intend to discuss the full details
of the code?
Because, my pointlessly specific friend, neither of us *have* the full
details of the code. Platitudinous meandering about how we really can't talk
in specifics about anything is not interesting. The point that the OP needs
to consider the full picture was already addressed by both of us.
I in fact pointed out to the OP that focusing on the hashing itself
was the wrong thing to do, precisely because focusing on the hashing
only tells you nothing about the performance of the program overall.

But if we *are* focusing on the hashing itself, [...]

If we *are* focusing on the hashing itself, then there was no need to
reply to my post, as I was specifically excluding the analysis that
considers only the hashing itself.
You were almost, but not quite. My post was about why your interesting
excursions into optimizing hash code calculation for strings were unlikely
to hold water in the majority of cases, and possibly none that were
realistic. If your defense then is that you have a whole big ball of
"complete program" that by virtue of its elusiveness could validate or
invalidate any point you'd care to make, I am not impressed.
 
Jeroen said:
Because, my pointlessly specific friend,

Well, I'm coming across as rather bitchy here, for which due apologies. It's
high time I went to bed anyway, so you're free to chew me out any way you
like and I won't read it for quite a while. :-)
 
Back
Top