morten.skarstad
ZIP has become somewhat of an industry standard, and is supported by
everything and everybody. It has been overtaken by other formats in
terms of performance and features, but it still remains the most widely
used and supported archive format.
The benefit of alternative formats is not only that compression
ratios are higher in general; many of them also support something
called solid archives. That is, redundancy is removed not only within
each single file, but also between all files in the archive. In some
cases, the differences in compression ratios caused by solid archiving
can be extreme.
My question is: Can solid archives somehow be emulated without breaking
compatibility with the ZIP standard?
Case in point: I have a bunch of similar files. Let's look at only a
couple of them, say file1 and file2. They are both 464 kB in size.
Compressed to individual ZIP files they both shrink to 403 kB each.
ZIPping both into one single ZIP gives me a larger ZIP file of 806 kB,
which is to be expected.
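
For anyone who wants to reproduce this without my files, here is a
minimal sketch, assuming Python's standard zipfile module as a stand-in
for TugZip and random bytes as a stand-in for the real data (so the
payload itself does not shrink). The helper name zip_size is made up
for illustration. It only shows that a ZIP holding two identical
members comes out about twice the size of a ZIP holding one, since each
member is compressed on its own.

```python
import io
import random
import zipfile


def zip_size(members):
    """Return the size in bytes of an in-memory ZIP built from {name: data}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return len(buffer.getvalue())


# Assumption: 464,000 random bytes stand in for one of the real files.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(464_000))

one_member = zip_size({"file1": payload})
# file2 is an exact copy of file1, the best possible case for a solid archive.
two_members = zip_size({"file1": payload, "file2": payload})

print(one_member, two_members)
```

Even with byte-identical members, the second number should be about
double the first.
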
Now, let's try something else. I compress file1 and file2 individually
into 7-Zip files. The resulting files are 399 kB, i.e. slightly smaller
than the ZIP files. However, compressing file1 and file2 into one
single 7z file takes only 400 kB! Adding even more of these files (I
have a bunch of them) only seems to increase the 7z archive by about 1
kB each.
All examples are made using TugZip, maximum compression, unless
otherwise stated.
Next, I tried to first put file1 and file2 into a container without
compressing them (i.e. ZIP with no compression), resulting in a single
uncompressed 929 kB file, and then compressing this. I was disappointed
to find that compressing this container with ZIP gave me an 805 kB file,
only slightly smaller than the standard ZIP. Compressing the container
using 7-Zip yet again produced a 400 kB file. Why this difference? Does
it have something to do with search span or dictionary size of the two
algorithms? Can this difference somehow be worked around?
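
To test the search span guess, here is a rough sketch, assuming
Python's zlib module (Deflate, which looks back at most 32 kB) and lzma
module (a dictionary of several megabytes at the default preset) as
stand-ins for ZIP and 7-Zip. The two synthetic blobs and the function
names are hypothetical; they are not my actual files.

```python
import lzma
import random
import zlib


def make_similar_pair(size=464_000, edits=200, seed=1):
    """Build two blobs of `size` bytes that differ in at most `edits` positions."""
    rng = random.Random(seed)
    first = bytearray(rng.getrandbits(8) for _ in range(size))
    second = bytearray(first)
    for _ in range(edits):
        second[rng.randrange(size)] = rng.getrandbits(8)
    return bytes(first), bytes(second)


def compare(first, second):
    """Compressed sizes when the blobs are packed separately vs. back to back."""
    joined = first + second  # crude "solid" container: plain concatenation
    return {
        "deflate_separate": len(zlib.compress(first, 9)) + len(zlib.compress(second, 9)),
        "deflate_joined": len(zlib.compress(joined, 9)),
        "lzma_separate": len(lzma.compress(first)) + len(lzma.compress(second)),
        "lzma_joined": len(lzma.compress(joined)),
    }


sizes = compare(*make_similar_pair())
for label, value in sizes.items():
    print(f"{label:18s} {value:>9d} bytes")
```

If the guess is right, deflate_joined should stay close to
deflate_separate, because the matching data in the second blob lies
464 kB back and far outside the 32 kB window, while lzma_joined should
drop to roughly half of lzma_separate.
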
Out of curiosity I also tried making tgz (.tar.gz) and tbz (.tar.bz2)
archives of file1 and file2, since these formats are also solid. The
resulting archives were 805 kB and 505 kB respectively.
The reason for my concern with this is that I routinely receive and
send lots of files to various recipients in my work, either via e-mail
or from closed web download sites. In particular, mails bouncing due to
attachment sizes are a common problem. I have tried convincing some of
my contacts to consider the possibility of using something like .7z, so
far without results. From what I can gather, people are either using
WinZip or the built-in shell extension in Windows XP. Self-extracting
executables are also out of the question, since these are commonly
blocked due to security policies of various companies.