simple string compression?

  • Thread starter Thread starter Chris LaJoie
  • Start date Start date
C

Chris LaJoie

I'm looking for some kind of simple string compression code I can use. I'm
not looking for SharpZipLib. Their implimentation spans several classes and
is very complex. I'm just looking for something simple. A single class,
preferrably. Does such a thing exist? Thanks,

Chris
 
Hi Chris,

What kind of strings are you talking about here? What's in them and how
long are they? What strength of compression do you want? (yes, ideally down to
1 byte, lol, but what would be reasonable)

Regards,
Fergus

|| I'm looking for some kind of simple string compression code I can
|| use. I'm not looking for SharpZipLib. Their implimentation spans
|| several classes and is very complex. I'm just looking for something
|| simple. A single class, preferrably. Does such a thing exist?
|| Thanks,
|| Chris
 
the files being compressed will be about 300 to 600k. The files are plain
text, and consist mostly of spaces, so they compress extremely well. I just
tried one, normal zip compression brought the file from 599,111 to 54,337
bytes. That's 9% of the original file size. I'm looking for something
(anything) that compresses the text by at least 50%.

Chris
 
Chris LaJoie said:
the files being compressed will be about 300 to 600k. The files are plain
text, and consist mostly of spaces, so they compress extremely well. I just
tried one, normal zip compression brought the file from 599,111 to 54,337
bytes. That's 9% of the original file size. I'm looking for something
(anything) that compresses the text by at least 50%.

Could you say exactly why it matters that SharpZipLib is several
classes? The reason that's important might constrain other solutions.
Do you have a sample file that you could put on a website somewhere for
us to test potential solutions?
 
Hi Chris,

If over 50% of the file is spaces, run-length-encoding is an extremely
simple method (ie. you could do it in a single method, let alone a single
class!).

The files also sound as if they would be amenable to Huffman coding. This
is more complicated but still pretty straightforward.

Would you like further help with this?

Regards,
Fergus
 
Hi Jon,

|| Do you have a sample file that you could put on a
|| website somewhere for us to test potential solutions?

Yes, a few of your files in a zip for us to play with! :-)

Regards,
Fergus
 
Fergus Cooney said:
If over 50% of the file is spaces, run-length-encoding is an extremely
simple method (ie. you could do it in a single method, let alone a single
class!).

The files also sound as if they would be amenable to Huffman coding. This
is more complicated but still pretty straightforward.

Would you like further help with this?

I've been thinking about this a bit, and I reckon it would be quite
interesting as it's *text only* to work it out as a
System.Text.Encoding class - that way it would be really simple to use,
for one thing.

Of course, if we knew the type of text which would need to be encoded,
that would help - if it were all ASCII, for instance, that increases
the options somewhat. (As a *very* simple example, you could store any
non-single spaces as a byte with the top bit set and the rest of the
byte being the number of spaces (2-129 encoded as 0-127) and that would
do quite nicely.)
 
Chris LaJoie said:

Like Fergus, I couldn't get any useful data out of that.
Also, I found exactly what I was looking for here:
http://www.chipstips.info/showtopic.asp?topic=cshacc

the compressed text was 42% of the original size. Not good, but it will do.

Interesting that it returns it as a string - I'd anticipated that you'd
want it converting to a byte array. (One thing that's nasty about the
above is that it seems to believe that ASCII is 8 bit rather than 7 bit
- I don't believe it *is* actually compatible in environments that only
deal with ASCII.)

If you're still interested in seeing whether there are ways of getting
it down lower than that, feel free to mail me a couple of sample files
directly - if you only need the compressor to be able to compress real
ASCII, I'll try the scheme I proposed in my last post, just to see how
it does.
 
Like Fergus, I couldn't get any useful data out of that.

My appologies, the link now works.
If you're still interested in seeing whether there are ways of getting
it down lower than that, feel free to mail me a couple of sample files
directly - if you only need the compressor to be able to compress real
ASCII, I'll try the scheme I proposed in my last post, just to see how
it does.

Sure, if you think you can, I'd certainly be interrested in seeing what you
come up with. Even if it doesn't produce better compression ratios.
Thanks,

Chris
 
Chris LaJoie said:
My appologies, the link now works.


Sure, if you think you can, I'd certainly be interrested in seeing what you
come up with. Even if it doesn't produce better compression ratios.

Well, using my "space saver" technique got the 519,942 byte input file
down to 217,875 bytes - again, 42%. I suspect that getting it down
significantly from there would take a fair amount more work -
basically, writing a Huffman encoder, which is reasonably
straightforward (I've done it before) but not trivial, and not
something I fancy doing at half four in the morning :)
 
Well, using my "space saver" technique got the 519,942 byte input file
down to 217,875 bytes - again, 42%. I suspect that getting it down
significantly from there would take a fair amount more work -
basically, writing a Huffman encoder, which is reasonably
straightforward (I've done it before) but not trivial, and not
something I fancy doing at half four in the morning :)

Sounds good Jon. I'm anxious to see what you can do with that. If you
don't mind, I plan on submitting this code (with proper credit of course) to
various sites so compression code isn't so hard to come by.

Chris
 
Chris LaJoie said:
Sounds good Jon. I'm anxious to see what you can do with that. If you
don't mind, I plan on submitting this code (with proper credit of course) to
various sites so compression code isn't so hard to come by.

Sure. This is really only "early morning for fun" code though - no
comments, no error checking, hideously inefficient etc - I just wanted
to see how it would do.

Also, due to the fact that it *can* compress so well, the
GetMaxCharCount may well mean that some other classes will seriously
overallocate space.

So, with all the caveats over with, here's the code.

using System;
using System.Text;
using System.IO;

public class SpaceSaver : Encoding
{
public override int GetByteCount (char[] text, int start,
int length)
{
return ToBinary(text, start, length).Length;
}

public override int GetBytes (char[] text, int textStart,
int length, byte[] buffer,
int bufferStart)
{
byte[] binary = ToBinary(text, textStart, length);
Array.Copy (binary, 0, buffer, bufferStart, binary.Length);
return binary.Length;
}

public override int GetCharCount (byte[] binary, int start,
int length)
{
return ToString(binary, start, length).Length;
}

public override int GetChars (byte[] binary, int binaryStart,
int length, char[] buffer,
int bufferStart)
{
char[] chars = ToString (binary, binaryStart,
length).ToCharArray();
Array.Copy (chars, 0, buffer, bufferStart, chars.Length);
return chars.Length;
}

public override int GetMaxByteCount(int length)
{
return length;
}

public override int GetMaxCharCount(int length)
{
return length*129;
}

String ToString (byte[] data, int start, int length)
{
StringBuilder builder = new StringBuilder();
foreach (byte b in data)
{
if (b < 128)
builder.Append ((char)b);
else
builder.Append (new string (' ', b-126));
}
return builder.ToString();
}

byte[] ToBinary (char[] text, int start, int length)
{
using (MemoryStream ms = new MemoryStream())
{
int spaces=0;
foreach (char t in text)
{
if (t==' ')
{
spaces++;
if (spaces==130)
{
ms.WriteByte(255);
spaces=1;
}
}
else
{
switch (spaces)
{
case 0:
break;
case 1:
ms.WriteByte(32);
break;
default:
ms.WriteByte((byte)(spaces+126));
break;
}
spaces=0;
ms.WriteByte ((byte)t);
}
}
switch (spaces)
{
case 0:
break;
case 1:
ms.WriteByte(32);
break;
default:
ms.WriteByte((byte)(spaces+126));
break;
}
ms.Flush();
return ms.ToArray();
}
}

static void Main(string[] args)
{
string inputFile = args[0];

Encoding encoding = new SpaceSaver();

try
{
// Create the reader and writer with appropriate encodings.
using (StreamReader inputReader =
new StreamReader (inputFile, Encoding.ASCII))
{
string allText = inputReader.ReadToEnd();

byte[] data = encoding.GetBytes(allText);

Console.WriteLine ("data length="+data.Length);

string recovered = encoding.GetString(data);
if (recovered != allText)
{
Console.WriteLine ("Oops!");
Console.WriteLine (recovered.Substring (0, 100));
}
}
}
catch (IOException e)
{
Console.WriteLine ("Exception during processing: {0}",
e.Message);
}
}
}
 
Also, due to the fact that it *can* compress so well, the
GetMaxCharCount may well mean that some other classes will seriously
overallocate space.

Thanks a lot for the code. A quick test proved it works, and gets a
compression ratio of about 41% of the original file size. Good job :)
I'll go into the code more tommorow, but I may have to mail you if I can't
figure out what's going on.

Chris
 
Hi Jon,

Very simple, very effective - nice routine. :-)

I don't know if you looked at the reference that Chris gave - the HACC
op-code method. It's pleasing that yours achieves a comparable result without
the complexity (not that it's that much more). It does, of course, depend on
what the data is.

I had a play with it and, noticing repeated sequences of '---', turned it
from a space-saver to a true RLE encoder. There weren't enough dashes to
overcome the loss of using the high-bit, however, and it added an extra 10K. I
think that some more could be squeezed out by allowing it to be a two-char
saver and using 0x40 as the switch. Reduced run lengths, but it would allow
for a second favourite character, ie the '-'. Not worth the effort though.

Like you say, Jon, the next (decent) level of compression is going to come
with a lot more complexity.


GetMaxCharCount(). Oh boy. Lol. 'Nuff said !! :-))
So too GetBytes and GetChars - oh, a copying we will go....
But, no worries, I know this was just an experiment of implementing an
Encoding.


Chris, when you submit Jon's code, can I recommend that you make some
small changes. A key point in this method is the use of the top bit, so I
think this should be emphasised.

Instead of
builder.Append (new string (' ', b-126));
have
builder.Append (new string (' ', (b+2)-128));

Instead of
ms.WriteByte((byte)(spaces+126));
have
ms.WriteByte((byte)(128 + (spaces-2)));

And to show that this is in the same scheme:
if (t==' ')
{
spaces++;
if (spaces==130)
{
ms.WriteByte(255);
spaces=1;
}
Have
if (t==' ')
{
if (spaces==129)
{
ms.WriteByte(128 + (spaces -2));
spaces=0;
spaces++;
}

Thus there is a consistency in the use of +/-2 and +/-128. Maybe you'd
even go as far as having a constant for the compressed-spaces indicator, ie
the top bit.

Regards,
Fergus

ps. And now I know how an Encoding looks from the inside. :-)
 
Fergus Cooney said:
Very simple, very effective - nice routine. :-)

Cheers :)
I don't know if you looked at the reference that Chris gave - the HACC
op-code method. It's pleasing that yours achieves a comparable result without
the complexity (not that it's that much more). It does, of course, depend on
what the data is.

I looked at it briefly, but the description wasn't terribly well
written unfortunately...
I had a play with it and, noticing repeated sequences of '---', turned it
from a space-saver to a true RLE encoder. There weren't enough dashes to
overcome the loss of using the high-bit, however, and it added an extra 10K. I
think that some more could be squeezed out by allowing it to be a two-char
saver and using 0x40 as the switch. Reduced run lengths, but it would allow
for a second favourite character, ie the '-'. Not worth the effort though.

Right. Possibly not - depending on just how many of them there were.
GetMaxCharCount(). Oh boy. Lol. 'Nuff said !! :-))

Indeed. It's quite possible that by overriding more of the Encoding
methods, that wouldn't be a problem - I suspect it's used by the
default implementations though.
So too GetBytes and GetChars - oh, a copying we will go....
But, no worries, I know this was just an experiment of implementing an
Encoding.
:)

Chris, when you submit Jon's code, can I recommend that you make some
small changes. A key point in this method is the use of the top bit, so I
think this should be emphasised.

Instead of
builder.Append (new string (' ', b-126));
have
builder.Append (new string (' ', (b+2)-128));

Instead of
ms.WriteByte((byte)(spaces+126));
have
ms.WriteByte((byte)(128 + (spaces-2)));

Those make sense, yes.
And to show that this is in the same scheme:
if (t==' ')
{
spaces++;
if (spaces==130)
{
ms.WriteByte(255);
spaces=1;
}
Have
if (t==' ')
{
if (spaces==129)
{
ms.WriteByte(128 + (spaces -2));
spaces=0;
spaces++;
}

I presume you meant the spaces++; to come after the closing brace - and
note that you then need a cast to byte. One alternative would be:

if (t==' ')
{
spaces++;
if (spaces==129) // The most we can write in one byte
{
ms.WriteByte ((byte) (128+(spaces-2)));
spaces=0;
}
}

In other words, writing the spaces a bit more eagerly, and starting
with a clean slate, as it were.
Thus there is a consistency in the use of +/-2 and +/-128. Maybe you'd
even go as far as having a constant for the compressed-spaces indicator, ie
the top bit.

Possibly - although that would imply that it could be changed without
too much problem, whereas it can't - if it's anything *other* than the
top bit, you end up with it conflicting with ASCII. I'd prefer to just
document it carefully.
ps. And now I know how an Encoding looks from the inside. :-)

If you want a slightly more polished example, have a look at
http://www.pobox.com/~skeet/csharp/ebcdic
 
Hi Jon,

Effort for saving dashes? Using your code for '-' instead of ' ' reduced
it by c28700.

The corrections to the [if (t==' ') etc]. Yes, rather than copy from the
project (I'd only just got up and VS was still asleep) I copied the code from
the email and hacked away (too hurriedly, lol, I had an appointment to get
to). What you presumed was what I was actually using in my version of the live
code. But keeping the ++ early and writing as-soon-as ('eagerly') - yes, that
is cleaner.

With the constant I said 'maybe' because it's not hugely important. Anyone
working with bytes should recognise a 128 for what it is. But if I did have
the constant, I'd have a comment - top bit only because.. - or some such.

I've looked at the page for your other encoder but I'll leave the
examination for when the time comes. Shame it's for ebcdic - I tend to run
away from anything I can't pronounce, lol.! Lord of the Rings and Dune escaped
being devoured because of that!

Lader, Dude,

Best regards,
Fergus
 
Back
Top