StreamReader: How to read from a Stream with changing encoding?

  • Thread starter: Andreas Huber

Andreas Huber

Hi

I have a stream containing data with potentially many different encodings.
It seems this precludes the direct use of the StreamReader class, for the
following reasons:

1) The encoding is fixed at StreamReader construction time
2) When I'm done reading data in encoding X and want to continue in
encoding Y, I cannot simply trash my current StreamReader and continue with
a new one. The old one very likely still holds undecoded bytes in its
internal buffer: bytes that have already been read from the stream but were
not decoded, because we've concluded that we need to continue with a
different encoding. Those bytes would therefore be lost.
3) Constructing a StreamReader with

// note the request for a buffer size of 1 byte
new StreamReader(myStream, myEncoding, false, 1)

doesn't help, as the constructor internally enforces a minimum buffer size of
128 bytes.
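
The read-ahead described in points 2) and 3) can be observed directly. The following is a minimal sketch, not code from the thread: the class and method names are hypothetical, and a MemoryStream stands in for the real stream. It asks for a 1-byte buffer, reads a single character, and reports how far the underlying stream has advanced.

```csharp
using System.IO;
using System.Text;

// Hypothetical demo class, used only to illustrate the read-ahead.
static class ReadAheadDemo
{
    // Returns how many bytes the StreamReader took from the stream
    // in order to deliver a single character.
    public static long BytesConsumedForOneChar()
    {
        byte[] data = Encoding.ASCII.GetBytes("0123456789");
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.ASCII, false, 1))
        {
            reader.Read();           // decode exactly one character
            return stream.Position;  // bytes already pulled from the stream
        }
    }
}
```

If the reader really honoured the 1-byte request, the result would be 1. A larger value means the remaining bytes are sitting in the reader's internal buffer, which is exactly what is lost when the reader is thrown away.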

Just wanted to check if someone else has a better idea before I go and
either implement my own StreamReader (duplicating much of the stock .NET
StreamReader) or chop up the underlying stream at the byte level (e.g. by
using one byte buffer for each area with a given encoding).

Thanks,
 
Andreas said:
I have a stream containing data with potentially many different
encodings. It seems this precludes the direct use of the StreamReader
class, for the following reasons:

1) The encoding is fixed at StreamReader construction time
2) When I'm done reading data in encoding X and want to continue in
encoding Y, I cannot simply trash my current StreamReader and continue
with a new one. The old one very likely still holds undecoded bytes in
its internal buffer: bytes that have already been read from the stream
but were not decoded, because we've concluded that we need to continue
with a different encoding. Those bytes would therefore be lost.

You can't "trash" the StreamReader and continue with another one even if you
wanted to, because a StreamReader owns the stream it's reading, for this (and
other) reasons. Closing a StreamReader closes the underlying stream. Even if
you do not explicitly close it, it could be garbage collected and disposed,
so you'd have to keep it alive artificially for this to work. You could get
away with a single StreamReader per encoding and calling
.DiscardBufferedData() when you switch, but the administration of this is
tedious.

3) Constructing a StreamReader with

// note the request for a buffer size of 1 byte
new StreamReader(myStream, myEncoding, false, 1)

doesn't help, as the constructor internally enforces a minimum buffer
size of 128 bytes.

Just wanted to check if someone else has a better idea before I go and
either implement my own StreamReader (duplicating much of the stock .NET
StreamReader) or chop up the underlying stream at the byte level (e.g.
by using one byte buffer for each area with a given encoding).

I think implementing a new MultiEncodingStreamReader is your best (cleanest)
option, if only because using multiple StreamReaders on a single stream is
hard for reasons outlined above. The "chop up the stream" idea could be used
*within* the new class, together with the multi-StreamReader approach I
outlined above. You would only have to implement the encoding switching, not
the actual reading. This does not play very well with asynchronous I/O,
though -- if you need that, you're better off not using the existing
StreamReader at all.
 
Peter said:
[...]
I think implementing a new MultiEncodingStreamReader is your best
(cleanest) option, if only because using multiple StreamReaders on a
single stream is hard for reasons outlined above. The "chop up the
stream" idea could be used *within* the new class, together with the
multi-StreamReader approach I outlined above. You would only have to
implement the encoding switching, not the actual reading. This does
not play very well with asynchronous I/O, though -- if you need that,
you're better off not using the existing StreamReader at all.

Another alternative would be to create a Stream-derived class that wraps
an actual Stream, and doesn't close/dispose that Stream when it itself
is closed/disposed. That would allow for switching of StreamReaders
without having to do the "discard buffered data" bit.

This was my first idea (several framework classes use this trick as well),
but it's not as simple as that. If you don't end the stream, you have to
know exactly how many characters (not bytes) are left in order for the
StreamReader to pick them up, as .Read() works in characters, not bytes. If
you have to jump through hoops like that, you might as well not use
StreamReader at all and use Stream.Read() and Encoding (this is actually an
obvious solution I didn't think of first...)

Or, just reopen and reposition the Stream each time the encoding
changes. It's a bit inefficient, but it's probably the simplest
solution, and simplest is often the best. :)

It may not be possible to reopen the stream or seek in it (depending on what
kind of stream it is), but if it is, sure.
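
The "Stream.Read() and Encoding" idea mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: SwitchableReader and its members are hypothetical names, not part of .NET, and the caller is assumed to know when an encoding area ends. Bytes are taken from the stream one at a time and fed to a System.Text.Decoder, so nothing beyond the current character is ever consumed.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of .NET.
sealed class SwitchableReader
{
    private readonly Stream _stream;
    private readonly byte[] _oneByte = new byte[1];
    private readonly Queue<char> _pending = new Queue<char>();
    private Decoder _decoder;

    public SwitchableReader(Stream stream, Encoding initial)
    {
        _stream = stream;
        _decoder = initial.GetDecoder();
    }

    // Assumption: called only on a character boundary. Any partially
    // decoded byte sequence held by the old decoder is dropped.
    public void SwitchEncoding(Encoding next)
    {
        _decoder = next.GetDecoder();
    }

    // Returns the next character, or -1 at end of stream.
    public int Read()
    {
        while (_pending.Count == 0)
        {
            int b = _stream.ReadByte();
            if (b < 0) return -1;
            _oneByte[0] = (byte)b;
            // Size the output first; one byte may complete zero or more chars.
            int count = _decoder.GetCharCount(_oneByte, 0, 1);
            var chars = new char[count];
            _decoder.GetChars(_oneByte, 0, 1, chars, 0);
            foreach (char c in chars) _pending.Enqueue(c);
        }
        return _pending.Dequeue();
    }
}
```

Reading byte by byte is the simplest way to avoid read-ahead, but it may be slow on some streams. The sketch trades speed for never holding undecoded bytes that belong to the next encoding.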
 
Jeroen & Peter,

Thank you both for your answers. It is correct that closing a StreamReader
also inevitably closes the underlying stream so we have one more reason why
using different readers on the same stream is probably not a good idea.

The stream I was talking about is actually a NetworkStream so
repositioning/seeking is not possible.

I'll probably implement a MultiEncodingStreamReader as suggested by Jeroen.
The class will internally split the original stream into multiple
MemoryStream objects (one per area with a given encoding) and then use one
stock .NET StreamReader for each of the MemoryStream objects.
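
A minimal sketch of that plan follows, under one loud assumption: the byte length of each encoding area is known before it is read (for example from a length prefix in the protocol, which the thread does not specify). SegmentReader and OpenSegment are hypothetical names.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of .NET.
static class SegmentReader
{
    // Copies exactly byteCount bytes from source into a MemoryStream and
    // returns a stock StreamReader over that copy.
    public static StreamReader OpenSegment(Stream source, int byteCount, Encoding encoding)
    {
        var buffer = new byte[byteCount];
        int filled = 0;
        while (filled < byteCount)
        {
            // Stream.Read may return fewer bytes than requested,
            // which is common on a NetworkStream, so loop until full.
            int n = source.Read(buffer, filled, byteCount - filled);
            if (n == 0)
                throw new EndOfStreamException("Stream ended inside a segment.");
            filled += n;
        }
        return new StreamReader(new MemoryStream(buffer), encoding, false);
    }
}
```

Because each reader only sees its own MemoryStream, its internal buffering cannot swallow bytes of the next area, and disposing it closes only the copy, not the original stream.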

Thanks & Regards,
 