Fast deserialisation of strings from byte[]

I have an application that performs custom deserialisation of object state
from byte arrays. This happens very regularly, so it needs to be fast. In
addition, most of the strings repeat, meaning I'm deserialising the same
sequence of bytes repeatedly, giving the same output string. Let's ignore
the text encoding method, as it's not relevant to my question.

Right now, I'm using BinaryReader.ReadString() which gives the correct
result, however it creates a new instance of System.String for each byte
sequence. What I'd really like is to detect the repeated byte sequence, and
return a reference to an existing deserialised version.

A colleague put me onto string.Intern, but this won't help as by the time
I'm calling that method, I've already allocated the string.
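
To illustrate the point about string.Intern, here is a minimal sketch. The class name InternDemo and the sample value are made up for illustration, and UTF-8 stands in for whatever encoding is really in use. Both decoded strings exist on the heap before Intern is ever called; Intern only hands back a canonical reference afterwards.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the application described above.
public static class InternDemo
{
    // Returns { ReferenceEquals(a, b), ReferenceEquals(ia, ib) }.
    public static bool[] Run()
    {
        byte[] data = Encoding.UTF8.GetBytes("EURUSD");

        // Each decode allocates a fresh System.String.
        string a = Encoding.UTF8.GetString(data, 0, data.Length);
        string b = Encoding.UTF8.GetString(data, 0, data.Length);

        // Intern returns one canonical instance, but a and b were
        // already allocated by the time it is called.
        string ia = string.Intern(a);
        string ib = string.Intern(b);

        return new[]
        {
            object.ReferenceEquals(a, b),   // false: two separate allocations
            object.ReferenceEquals(ia, ib)  // true: same interned instance
        };
    }
}
```

Interned strings also stay in the intern pool, which may not suit values that are only known at runtime.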

Note that these strings are very short lived. After deserialisation, they
will be processed and (for the most part) garbage collected before they get
promoted to generation 1. This happens several thousand times a second under
normal conditions, giving the garbage collector (what I assume is) a lot of
work. I'm seeing the classic sawtooth pattern in a heap timeline but with
very high frequency.

I'd like to know whether this is a situation in which I can improve
performance. I can envisage some sort of structure (perhaps a Trie) that
homes in on the stored string as we progress through the byte sequence.
However this structure cannot be pre-populated (the strings will be
determined at runtime).
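
One way the trie idea might look is sketched below. It is an assumption-laden sketch, not a tuned implementation: the class name ByteTrieStringCache is hypothetical, the caller is assumed to already know the offset and length of the string's bytes, and UTF-8 stands in for the real encoding. The trie starts empty and is filled as sequences are seen; a repeated sequence walks existing nodes and returns the stored string without allocating.

```csharp
using System;
using System.Text;

// Hypothetical helper, not a framework type.
public sealed class ByteTrieStringCache
{
    private sealed class Node
    {
        public Node[] Children; // 256 slots, created on first use
        public string Value;    // set once a sequence ending here has been decoded
    }

    private readonly Node _root = new Node();

    // Returns the cached string for buffer[offset .. offset + count),
    // decoding and storing it the first time the sequence is seen.
    public string GetOrAdd(byte[] buffer, int offset, int count)
    {
        Node node = _root;
        for (int i = 0; i < count; i++)
        {
            byte b = buffer[offset + i];
            if (node.Children == null)
            {
                node.Children = new Node[256];
            }
            Node next = node.Children[b];
            if (next == null)
            {
                next = new Node();
                node.Children[b] = next;
            }
            node = next;
        }

        if (node.Value == null)
        {
            // First sighting: the only point where a string is allocated.
            node.Value = Encoding.UTF8.GetString(buffer, offset, count);
        }
        return node.Value;
    }
}
```

A 256-slot child array per node is fast to index but costs memory for every distinct prefix, and this sketch never evicts anything, so it only suits a bounded set of distinct strings. Whether the walk beats a plain allocation would need measuring.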

The big question is: do the benefits of reducing string allocation justify
the overhead in finding a stored string? This, no doubt, depends upon the
implementation.

There may also be knock-on benefits from knowing that strings with the same
value are the same object (e.g. using object.ReferenceEquals rather than
object.Equals), but this is secondary.

This seems to me a great performance question. I hope others find it as
interesting as I do and will share their ideas and experience.

Regards,

Drew Noakes.
 
Drew,

After reading your question twice, I think the answer is in the first
section of your question.

A string in .NET is never mutable. A new one is always built, even for the
slightest change.

The only "string-like" alternative is a StringBuilder, which is a kind of
collection of characters; maybe that can help you.

http://msdn.microsoft.com/library/d...ml/frlrfsystemtextstringbuilderclasstopic.asp

Be aware that the description there is wrong: there cannot be a mutable
string. The remarks section states it correctly.

I hope this helps,

Cor
 
Hi Cor,

Thanks for your prompt response. I'm aware of the behaviour of strings with
regard to mutability, but this issue is different. Perhaps I didn't explain
myself clearly enough. I simply do not want to instantiate two different
string objects that have the same value.

Therefore, when I'm stepping through the byte[], the first time I see a
given pattern I would create the string and store it. The next time I see
the same pattern, I'll return a reference to the string I have stored. This
avoids the overhead of having two strings on the heap that have identical
values.
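
A rough sketch of that scheme, under stated assumptions: the class name ByteSequenceStringCache is hypothetical, the offset and length of the bytes are assumed to be known, UTF-8 stands in for the real encoding, and FNV-1a is just one convenient hash. The bytes are hashed and compared in place, so a repeated pattern is found without creating a key object; allocation happens only the first time a pattern is seen.

```csharp
using System;
using System.Text;

// Hypothetical helper, not a framework type.
public sealed class ByteSequenceStringCache
{
    private sealed class Entry
    {
        public byte[] Key;   // private copy of the bytes, made once on a miss
        public string Value; // the decoded string
        public Entry Next;   // next entry in the same bucket
    }

    private readonly Entry[] _buckets;

    public ByteSequenceStringCache(int bucketCount)
    {
        if (bucketCount <= 0) throw new ArgumentOutOfRangeException(nameof(bucketCount));
        _buckets = new Entry[bucketCount];
    }

    public string GetOrAdd(byte[] buffer, int offset, int count)
    {
        // FNV-1a hash computed directly over the source buffer.
        uint hash = 2166136261u;
        for (int i = 0; i < count; i++)
        {
            hash = unchecked((hash ^ buffer[offset + i]) * 16777619u);
        }
        int index = (int)(hash % (uint)_buckets.Length);

        // Hit path: compares bytes in place, allocates nothing.
        for (Entry e = _buckets[index]; e != null; e = e.Next)
        {
            if (Matches(e.Key, buffer, offset, count)) return e.Value;
        }

        // Miss path: copy the key and decode the string, once per pattern.
        byte[] key = new byte[count];
        Buffer.BlockCopy(buffer, offset, key, 0, count);
        Entry added = new Entry();
        added.Key = key;
        added.Value = Encoding.UTF8.GetString(buffer, offset, count);
        added.Next = _buckets[index];
        _buckets[index] = added;
        return added.Value;
    }

    private static bool Matches(byte[] key, byte[] buffer, int offset, int count)
    {
        if (key.Length != count) return false;
        for (int i = 0; i < count; i++)
        {
            if (key[i] != buffer[offset + i]) return false;
        }
        return true;
    }
}
```

The table in this sketch has a fixed number of buckets and never evicts, and it is not thread-safe. Cached strings also live as long as the cache does, which changes the garbage collection profile from many short-lived objects to fewer long-lived ones; whether that is a net gain is something to profile.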

Bear in mind that I'm talking about doing this many, many times a second, to
a point where I believe there is a performance gain to be reaped from this
added complexity.

Regards,

Drew.
 
Hi Cor,

If I use a Hashtable, I must create a new byte[] which in turn is another
object allocation. I wish to achieve this lookup without allocating any
object on the heap.

Drew.
 
Hi Cor,

Keying a hash table on byte[] will not reduce my memory overhead. Besides,
I still have to allocate an object (byte[] is an object, not a value type)
before I can look up the string in the hashtable.

Drew.
 