I have an application that performs custom deserialisation of object state
from byte arrays. This happens very regularly, so it needs to be fast. In
addition, most of the strings repeat, meaning I'm deserialising the same
sequence of bytes repeatedly, giving the same output string. Let's ignore
the text encoding method, as it's not relevant to my question.
Right now, I'm using BinaryReader.ReadString(), which gives the correct
result; however, it creates a new instance of System.String for each byte
sequence. What I'd really like is to detect a repeated byte sequence and
return a reference to an existing deserialised string.
A colleague put me onto string.Intern, but that won't help: by the time I
call that method, I've already allocated the string.
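For illustration, here's roughly what I'm doing today (a minimal sketch;
reader stands for the BinaryReader over the serialised payload):

    // Current approach: ReadString always allocates a new string, and
    // interning afterwards cannot undo that allocation.
    string value = reader.ReadString();  // new System.String every time
    value = string.Intern(value);        // too late: already allocated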
Note that these strings are very short-lived. After deserialisation, they
will be processed and (for the most part) garbage collected before they get
promoted to generation 1. This happens several thousand times a second under
normal conditions, giving the garbage collector (what I assume is) a lot of
work. I'm seeing the classic sawtooth pattern in the heap timeline, but at
a very high frequency.
I'd like to know whether this is a situation in which I can improve
performance. I can envisage some sort of structure (perhaps a trie) that
homes in on the stored string as we progress through the byte sequence.
However, this structure cannot be pre-populated (the strings will only be
determined at runtime).
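To make the idea concrete, here's a rough sketch of the kind of structure
I have in mind: a byte-keyed trie, populated lazily at runtime. On a hit
we return the cached string; only on a miss do we decode and allocate.
UTF-8 is assumed purely for illustration, and all names are hypothetical:

    using System.Collections.Generic;
    using System.Text;

    sealed class StringPool
    {
        private sealed class Node
        {
            public readonly Dictionary<byte, Node> Children =
                new Dictionary<byte, Node>();
            public string Value; // set once this sequence has been seen
        }

        private readonly Node _root = new Node();

        // Walks the trie byte by byte; decodes (and allocates) only on
        // the first occurrence of a given byte sequence.
        public string GetOrAdd(byte[] bytes, int offset, int count)
        {
            Node node = _root;
            for (int i = 0; i < count; i++)
            {
                byte b = bytes[offset + i];
                Node child;
                if (!node.Children.TryGetValue(b, out child))
                {
                    child = new Node();
                    node.Children.Add(b, child);
                }
                node = child;
            }
            return node.Value
                ?? (node.Value = Encoding.UTF8.GetString(bytes, offset, count));
        }
    }

Of course, each miss allocates trie nodes and dictionaries of its own, so
whether this wins will depend on how often sequences actually repeat.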
The big question is: do the benefits of reducing string allocation justify
the overhead of finding a stored string? This, no doubt, depends upon the
implementation.
There may also be knock-on benefits from knowing that strings with the same
value are identical objects (e.g. using object.ReferenceEquals rather than
object.Equals), but this is secondary.
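For instance, an equality check on pooled strings could become a single
reference comparison (a trivial sketch, assuming a and b both came from
the pool):

    // With a pool guaranteeing one instance per distinct value, this
    // replaces a character-by-character object.Equals comparison.
    bool same = object.ReferenceEquals(a, b);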
This seems to me a great performance question. I hope others find it as
interesting as I do and will share their ideas and experience.
Regards,
Drew Noakes.