GZipStream and buffering

  • Thread starter: Agendum

I wrote a client/server networking application and all communication
between the two is compressed using GZipStream. However, I found a weird
buffering problem. Despite how I had the network stream configured,
buffering would still occur (on a small scale). I pinpointed the problem
to GZipStream: it seems to buffer content, and calling Flush has no
effect.

Below is a small sample application which demonstrates this. In two
command prompts, run "App.exe L" to listen and just "App.exe" to start
the client. You will notice the GZipStream is buffering. Simply
uncommenting "//#define DONT_USE_GZIP_STREAM_READ" and "//#define
DONT_USE_GZIP_STREAM_WRITE" will cause the application to begin working.
I understand I am not getting any gain from compression here because it
is just 1 byte; that is beside the point -- this is just a demonstration
of the problem (the original app has sizeable messages to transmit).
Regardless of size, GZipStream.Flush should flush data to the network
stream, and it does not. I also know the problem is on the sending side:
if I only uncomment "//#define DONT_USE_GZIP_STREAM_READ", then every
couple of seconds the stream reads nothing, implying nothing was sent
over the network.

How do I get this demo app to work as expected?

Thanks!

// BUGGY CODE USED FOR DEMONSTRATION ONLY
//#define DONT_USE_GZIP_STREAM_READ
//#define DONT_USE_GZIP_STREAM_WRITE
using System;
using System.IO.Compression;
using System.Net;
using System.Net.Sockets;

class Program
{
    static void Main(String[] args)
    {
        if (args.Length > 0 && args[0] == "L")
        {
            TcpListener listener = new TcpListener(IPAddress.Loopback, 40);
            listener.Start();
            using (TcpClient client = listener.AcceptTcpClient())
            {
                client.NoDelay = true;
                client.ReceiveBufferSize = 1;
#if DONT_USE_GZIP_STREAM_READ
                NetworkStream stream = client.GetStream();
#else
                GZipStream stream = new GZipStream(
                    client.GetStream(), CompressionMode.Decompress, false);
#endif
                while (true)
                {
                    Console.WriteLine("{0}", stream.ReadByte());
                }
            }
        }
        else
        {
            using (TcpClient client = new TcpClient())
            {
                client.Connect(IPAddress.Loopback, 40);
                client.NoDelay = true;
                client.SendBufferSize = 1;
#if DONT_USE_GZIP_STREAM_WRITE
                NetworkStream stream = client.GetStream();
#else
                GZipStream stream = new GZipStream(
                    client.GetStream(), CompressionMode.Compress, false);
#endif
                for (Byte b = 0; ; ++b)
                {
                    stream.WriteByte(b);
                    stream.Flush();
                    Console.WriteLine("{0}", b);
                    System.Threading.Thread.Sleep(1000);
                }
            }
        }
    }
}
 
> I wrote a client/server networking application and all communication
> between the two is compressed using GZipStream. However, I found a weird
> buffering problem. Despite how I had the network stream configured,
> buffering would still occur (on a small scale). I pinpointed the problem
> to GZipStream: it seems to buffer content, and calling Flush has no
> effect.

Not that I think it's such a great idea to:

-- Set NoDelay to true,
-- Set the send and receive buffers to 1 byte in length, or
-- Flush the GZipStream after each write

But, basically the problem here is that you expect there to be no
buffering when it's impossible for there to be no buffering.

The job of GZipStream is to take a stream of bytes and turn it into a
shorter stream of bytes. Since you get fewer bytes on the receiving end,
it should be obvious that for at least some of the bytes you send, you
will not receive a byte on the output of GZipStream.

Likewise, at the receiving end, the job of the class there is to take a
short stream of bytes and turn it back into the longer stream. Thus,
there it should also be obvious that for every byte you actually do
receive on the network, for at least some of them, you will get more than
one byte on the output of the GZipStream.

In other words, GZipStream is doing exactly what it's supposed to, as is
each TcpClient given how you've configured them (however obscenely that
may be :p).

In general, trying to disable buffering on a network stream is a really
bad idea. But at the very least, it is simply impossible to avoid at
least some buffering within the compression/decompression stages, because
that's a fundamental aspect of how compression works (it's essentially a
corollary of the pigeonhole principle...you only have so many
"pigeonholes" on the output of the GZipStream to put the input, which has
more elements than there are "pigeonholes", so obviously some of the
input elements don't have their own unique output "pigeonhole").

Pete
 
The fact that I use NoDelay, have a 1-byte buffer, and flush is just to
demonstrate that there are no options other than for the byte to be
transmitted. Also, I don't Flush after each write (each byte!) in the
original app -- it is just a demonstration here.

In any case, I understand what you are saying about the GZipStream.
Basically, to apply a reasonable amount of compression, GZipStream reads
a minimum number of bytes. The fact that GZipStream has a Flush method is
irrelevant... it just flushes the already-compressed bytes to the stream.
I was incorrectly assuming it would compress any remaining bytes in the
stream and write them out. Apparently there is no method for doing that.

I mentioned that the original application sends messages of a sizeable
length, and I am experiencing the same problem there. I guess I can
conclude from this that:

1) GZipStream compresses bytes on some internally defined byte boundary.
This would explain why just a "minimum number of bytes" is not enough.

2) The only way to trigger "compress any remaining bytes in the stream"
is to actually close the GZipStream itself.
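Out of curiosity I reduced this to a sketch with no sockets at all, which
seems to confirm point 2, at least on the .NET Framework version I'm
using (the exact byte counts vary by runtime, so I only print them for
comparison):

```csharp
// Minimal sketch: compare what Flush and Close actually emit.
// Byte counts vary by runtime, so none are asserted here.
using System;
using System.IO;
using System.IO.Compression;

class FlushDemo
{
    static void Main()
    {
        MemoryStream output = new MemoryStream();
        // leaveOpen: true, so 'output' can be inspected after Close
        GZipStream gzip = new GZipStream(
            output, CompressionMode.Compress, true);

        byte[] payload = new byte[100];   // arbitrary demo payload
        gzip.Write(payload, 0, payload.Length);

        gzip.Flush();
        Console.WriteLine("after Flush: {0} bytes in output", output.Length);

        gzip.Close();   // only now is the final compressed block written
        Console.WriteLine("after Close: {0} bytes in output", output.Length);
    }
}
```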

Thanks for your response.
 
> The fact that I use NoDelay, have a 1-byte buffer, and flush is just to
> demonstrate that there are no options other than for the byte to be
> transmitted.

Obviously, there _are_ other options other than for the byte to be
transmitted. The GZipStream instance can (and does) buffer it.
> Also, I don't Flush after each write (each byte!) in the original app --
> it is just a demonstration here.

Okay, that's a relief.
> In any case, I understand what you are saying about the GZipStream.
> Basically, to apply a reasonable amount of compression, GZipStream reads
> a minimum number of bytes.

It's not really about being "reasonable". It's simply how that particular
compression algorithm works. It builds a dictionary as it goes, and when
certain conditions are fulfilled (e.g. some new sequence of bytes not
already in the dictionary is seen, or a given sequence of bytes seen does
match something in the dictionary, etc.) the compression algorithm emits
bytes on the output end.

Depending on the input, this may in fact result in unreasonable amounts of
compression, or even inflation of the stream. "Reasonable" doesn't come
into play; it's basically a dynamic state machine, and at certain states,
bytes are emitted, hopefully (but not always) in a compressed state as
compared to the input.
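For what it's worth, this is easy to see directly. A small sketch (my
own, not from the demo above) that gzips a repetitive buffer and a random
one of the same length; the random one typically comes out slightly
larger than the input:

```csharp
// Sketch: compression ratio depends entirely on the input.
// High-entropy data can inflate under gzip (header/trailer overhead
// plus incompressible blocks); repetitive data shrinks dramatically.
using System;
using System.IO;
using System.IO.Compression;

class RatioDemo
{
    static byte[] Compress(byte[] input)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(
                output, CompressionMode.Compress, true))
            {
                gzip.Write(input, 0, input.Length);
            }   // closing the GZipStream forces out the final block
            return output.ToArray();
        }
    }

    static void Main()
    {
        byte[] repetitive = new byte[1000];   // all zeros
        byte[] noisy = new byte[1000];
        new Random(42).NextBytes(noisy);      // pseudo-random bytes

        Console.WriteLine("repetitive: {0} -> {1} bytes",
            repetitive.Length, Compress(repetitive).Length);
        Console.WriteLine("noisy:      {0} -> {1} bytes",
            noisy.Length, Compress(noisy).Length);
    }
}
```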
> The fact that GZipStream has a Flush method is irrelevant... it just
> flushes the already-compressed bytes to the stream. I was incorrectly
> assuming it would compress any remaining bytes in the stream and write
> them out. Apparently there is no method for doing that.

Allowing that would be counter-productive from a compression point of
view, and in any case would prevent the decompression side from working.
> I mentioned that the original application sends messages of a sizeable
> length, and I am experiencing the same problem there. I guess I can
> conclude from this that:
>
> 1) GZipStream compresses bytes on some internally defined byte boundary.
> This would explain why just a "minimum number of bytes" is not enough.

It's not "some internally defined byte boundary". It has to do with the
progress of the compression algorithm in matching the input to the current
state of its dictionary. The compression algorithm (DEFLATE) is
documented; if you care how it works, you should read up on it.
> 2) The only way to trigger "compress any remaining bytes in the stream"
> is to actually close the GZipStream itself.

Yes. That is the only way for that particular compression algorithm to
work.

Pete
 
* Peter Duniho wrote, On 22-9-2009 5:20:
> [...]
>
> Yes. That is the only way for that particular compression algorithm to
> work.

In my opinion, the best way to make this work is to compress the data
first, and then send the compressed data over the wire as if it were a
message.

For that you have two options:
1) Create the message beforehand by writing it to a MemoryStream, then
stream the contents of that over the network. The problem with this
approach is that it requires more memory.
2) Create the GZipStream with the three-parameter constructor (see
http://msdn.microsoft.com/en-us/library/27ck2z1y.aspx), and specify
true for the leaveOpen parameter. This allows you to close the
GZipStream, forcing it to write its remaining contents to the network,
while leaving the connection open. It means you would have to create a
new GZipStream for each message you send over the wire. The problem with
this approach is that the receiving end needs to know when the end of one
zipped message has been received, so that it in turn can create a new
GZipStream to decompress the next message on the other end. Meaning
you'll have to add some protocol handling (framing) on both ends.
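A sketch of what option 2 could look like, combined with a length prefix
so the receiver knows where each compressed message ends. The method
names here are mine, purely for illustration:

```csharp
// Sketch: length-prefixed, per-message gzip framing over a stream.
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class MessageFraming
{
    // Compress one message and write it length-prefixed, leaving the
    // underlying connection stream open for the next message.
    static void SendMessage(Stream connection, byte[] message)
    {
        byte[] compressed;
        using (MemoryStream buffer = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(
                buffer, CompressionMode.Compress, true))   // leaveOpen
            {
                gzip.Write(message, 0, message.Length);
            }   // closing the GZipStream flushes the final block
            compressed = buffer.ToArray();
        }
        // 4-byte length prefix (assumes both ends share endianness)
        connection.Write(BitConverter.GetBytes(compressed.Length), 0, 4);
        connection.Write(compressed, 0, compressed.Length);
    }

    // Read one length-prefixed message and decompress it.
    static byte[] ReceiveMessage(Stream connection)
    {
        int length = BitConverter.ToInt32(ReadExactly(connection, 4), 0);
        byte[] compressed = ReadExactly(connection, length);
        using (MemoryStream buffer = new MemoryStream(compressed))
        using (GZipStream gzip = new GZipStream(
            buffer, CompressionMode.Decompress))
        using (MemoryStream result = new MemoryStream())
        {
            gzip.CopyTo(result);   // Stream.CopyTo exists from .NET 4 on
            return result.ToArray();
        }
    }

    // Loop until exactly 'count' bytes have been read.
    static byte[] ReadExactly(Stream stream, int count)
    {
        byte[] data = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(data, offset, count - offset);
            if (read <= 0) throw new EndOfStreamException();
            offset += read;
        }
        return data;
    }

    static void Main()
    {
        // Round-trip demo over a MemoryStream standing in for a socket.
        MemoryStream wire = new MemoryStream();
        SendMessage(wire, Encoding.UTF8.GetBytes("hello"));
        wire.Position = 0;
        Console.WriteLine(Encoding.UTF8.GetString(ReceiveMessage(wire)));
    }
}
```

The framing itself is trivial; the point is only that each message gets
its own GZipStream on both ends, so closing it is safe.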

I don't know exactly what you're trying to do here, but I wonder whether
it wouldn't be a better idea to use WCF or some other existing
communication stack to solve your problem.
 