ZIP and solid archives

morten.skarstad

ZIP has become something of an industry standard, and is supported by
everything and everybody. It has been overtaken by other formats in
terms of performance and features, but it still remains the most widely
used and supported archive format.

The benefit of the alternative formats is not only that their
compression ratios are higher as such. Many of them also support
so-called solid archives. That is, redundancy is removed not only within
each single file, but also between all files in the archive. In some
cases, the differences in compression ratios caused by solid archiving
can be extreme.

My question is: Can solid archives somehow be emulated without breaking
compatibility with the ZIP standard?

Case in point: I have a bunch of similar files. Let's look at only a
couple of them, say file1 and file2. They are both 464 kB in size.
Compressed to individual ZIP files they both shrink to 403 kB each.
ZIPping both into one single ZIP gives me a larger ZIP file of 806 kB,
which is to be expected.

Now, let's try something else. I compress file1 and file2 individually
into 7-Zip files. The resulting files are 399 kB, i.e. slightly smaller
than the ZIP files. However, compressing file1 and file2 into one
single 7z file takes only 400 kB! Adding even more of these files (I
have a bunch of them) only seems to increase the 7z archive by about 1
kB each.

All examples are made using TugZip, maximum compression, unless
otherwise stated.

Next, I tried to first put file1 and file2 into a container without
compressing them (i.e. ZIP with no compression), resulting in a single
uncompressed 929 kB file, and then compressing this. I was disappointed
to find that compressing this container with ZIP gave me an 805 kB file,
only slightly smaller than the standard ZIP. Compressing the container
using 7-Zip yet again produced a 400 kB file. Why this difference? Does
it have something to do with search span or dictionary size of the two
algorithms? Can this difference somehow be worked around?
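
The difference is consistent with how far back each algorithm can look for repeats: Deflate (used by ZIP and gzip) can only refer to the previous 32 kB, while Python's lzma module at its default preset works with a dictionary of several megabytes. Below is a minimal Python sketch, assuming pseudo-random bytes as a stand-in for the poorly compressible files described above. The size, seed, and variable names are illustrative assumptions, not the poster's data.

```python
import lzma
import random
import zlib

rng = random.Random(0)
SIZE = 200_000  # hypothetical stand-in for one of the 464 kB files

# Poorly compressible data, loosely mimicking files that barely shrink.
file1 = rng.getrandbits(8 * SIZE).to_bytes(SIZE, "little")

# file2 is nearly identical to file1: flip one byte every 50 kB.
tmp = bytearray(file1)
for pos in range(0, SIZE, 50_000):
    tmp[pos] ^= 0xFF
file2 = bytes(tmp)

# "Join first": both files back to back, as in an uncompressed container.
container = file1 + file2

deflate_size = len(zlib.compress(container, 9))  # 32 kB window
lzma_size = len(lzma.compress(container))        # multi-megabyte dictionary

print(f"one file, raw:       {SIZE}")
print(f"both files, Deflate: {deflate_size}")
print(f"both files, LZMA:    {lzma_size}")
```

On input like this, the Deflate result should stay close to the combined size of both files, while the LZMA result should land close to the size of a single file, since the second copy lies within the dictionary.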

Out of curiosity I also tried making tgz (.tar.gz) and tbz (.tar.bz2)
archives of file1 and file2, since these formats are also solid. The
resulting archives were 805 kB and 505 kB respectively.

The reason for my concern with this is that I routinely receive and
send lots of files to various recipients in my work, either via e-mail
or from closed web download sites. In particular, mails bouncing due to
attachment sizes are a common problem. I have tried convincing some of
my contacts to consider the possibility of using something like .7z, so
far without results. From what I can gather, people are either using
WinZip or the built-in shell extension in Windows XP. Self-extracting
executables are also out of the question, since these are commonly
blocked due to security policies of various companies.
 
zip doesn't support solid archives, but 7z is also a widely supported format
and 7z archives can be unpacked by most archivers. The simplest solution to
your problem would be: use 7z.
 
Fran said:
zip doesn't support solid archives, but 7z is also a widely supported format
and 7z archives can be unpacked by most archivers. The simplest solution to
your problem would be: use 7z.

I know that zip does not support solid archives. I know that 7z is
supported by a lot of archivers. That is not the issue. Please read my
entire post.

In case my point was unclear: I wanted to emulate solid archiving by
using the following technique:
1) Put all the files I want to compress into a single ZIP using _no_
compression. The result is a single file with size equal to the sum of
the original files.
2) Compress the uncompressed ZIP. Since I am now compressing one single
file rather than several small ones, redundancy throughout should be
removed even if the compressor does not support solid archiving.
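
The two steps above can be sketched with Python's standard zipfile module. This is only an illustration under stated assumptions: the helper names `plain_zip` and `nested_zip`, the member name `inner.zip`, and the sample data (14 small, nearly identical text files) are all hypothetical and not taken from any tool mentioned in this thread.

```python
import io
import random
import zipfile


def plain_zip(files):
    """Ordinary ZIP: every member is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def nested_zip(files, inner_name="inner.zip"):
    """Step 1: store all files uncompressed. Step 2: deflate that container."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w", zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(inner_name, inner.getvalue())
    return outer.getvalue()


# Hypothetical sample data: 14 small text files sharing most of their content.
rng = random.Random(1)
base = rng.getrandbits(8 * 1500).to_bytes(1500, "little").hex().encode()
files = {
    f"doc{i:02d}.txt": base + f"\nfile number {i}\n".encode()
    for i in range(14)
}

plain = plain_zip(files)
nested = nested_zip(files)
print(f"plain ZIP:  {len(plain)} bytes")
print(f"nested ZIP: {len(nested)} bytes")
```

Both results are ordinary ZIP files that any unzipper can open. The recipient of the nested one has to unzip twice. With members this small, each file is still inside Deflate's look-back range when the next one is compressed, so the nested variant should come out clearly smaller.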

The task performed by most archivers is actually a two step process:
Joining several files into a single container, and compression. The
difference between solid and non-solid archiving is basically in which
order these two tasks are performed. Non-solid archivers compress the
files first, before joining the compressed files. Solid archivers join
the files first, before compressing the full container.

I have previously achieved good results using the above technique with
LhA, which does not support solid archiving either. Actually, I
originally picked up this tip from the LhA manual more than a decade
ago. However, using this technique with ZIP seems to yield little to
none of the potential gain, and I do not understand why.
 
Fran wrote:

In case my point was unclear: I wanted to emulate solid archiving by
using the following technique:
1) Put all the files I want to compress into a single ZIP using _no_
compression. The result is a single file with size equal to the sum of
the original files.
2) Compress the uncompressed ZIP. Since I am now compressing one single
file rather than several small ones, redundancy throughout should be
removed even if the compressor does not support solid archiving.

Tar and gzip

http://www.gzip.org/

<quote>
The gzip file format holds a single compressed file. On Unix systems,
compressed archives are typically created by rolling collections of
files into a tar archive, and then compressing that archive with gzip.
The final .tar.gz or .tgz file is usually called a "compressed tarball."
<quote>

http://www.irnis.net/soft/wingzip/
WinGZip utility

----

http://advancemame.sourceforge.net/comp-readme.html
Advance Projects - AdvanceCOMP

<quote>
The main purpose of this utility is to recompress and test the zip
archives to get the smallest possible size.
<quote>

----

http://www.bzip.org/
The bzip2 and libbzip2 home page

<quote>
bzip2 can be used combined or independently of tar: bzip2 file to
compress and bzip2 -d file.bz2 to uncompress (the alias bunzip2 for
decompression may also be used).
<quote>

http://gnuwin32.sourceforge.net/packages/bzip2.htm
bzip2 for Windows
 
FirstName said:
Tar and gzip

I am aware of tar + gzip, as well as tar + bzip2. I also tried these,
see my original post. tbz did provide better compression, tgz did not.

However, neither brings me nearer to my goal, which is better
compression ratios without breaking ZIP compatibility. As mentioned in
my original post, I need to be able to exchange files with contacts
who, due to their own unwillingness and/or company policies, rely solely
on WinZip and/or zipfldr.dll.

http://advancemame.sourceforge.net/comp-readme.html
Advance Projects - AdvanceCOMP

<quote>
The main purpose of this utility is to recompress and test the zip
archives to get the smallest possible size.
<quote>

This is somewhat interesting. Not really related to solid archiving,
but rather an attempt to increase compression using an alternative ZIP
implementation. The achieved gain (less than 1% off the original ZIP
size in my initial tests) leaves a lot to be desired, but I'll look closer
into this one. Thanks.
 
I said:
My question is: Can solid archives somehow be emulated without breaking
compatibility with the ZIP standard?

After a little digging following up on Mr. Lastname's post, I discovered
that the procedure I have been describing is commonly known as nested
zipping. Further digging revealed that there actually exists a tool
that does this job without manual rezipping: VelcroFly
(http://www.randelshofer.ch/velcrofly/download.html)

The description fits my needs, but unfortunately the performance seems
to be identical to that of my manually nested ZIP files: file1+file2 in
a nested ZIP still takes up 805 kB.

However, results for much smaller files seem to be good. For instance,
the docs for AdvanceCOMP (the program suggested by Mr. Lastname) take
up 43.6 kB in 14 files. Compressed to a plain ZIP they occupy 16 kB,
but in a nested ZIP they only take up 8 kB. In other words, solid
archiving with ZIP can be done. The problem seems to be that once the
original files exceed a certain size the ZIP deflate algorithm is
useless. This is further supported by my experience with tgz, which
uses the same algorithm and achieves the same mediocre results.

Is there really no way around this?
 
I have previously achieved good results using the above technique with
LhA, which does not support solid archiving either. Actually, I
originally picked up this tip from the LhA manual more than a decade
ago. However, using this technique with ZIP seems to yield little to
none of the potential gain, and I do not understand why.

I seem to recall reading when zip compression was released into the
public domain after the PKarc lawsuit, that the compression of several
files in a zip file was performed individually on each file. I never
tested it because I didn't care. However your tests seem to confirm
it.

I don't know why they chose to do it that way. Possibly to make
splitting files off from a .zip easier? That's just a guess.
 
On 15 Mar 2006 01:24:10 -0800, (e-mail address removed) wrote:

My question is: Can solid archives somehow be emulated without breaking
compatibility with the ZIP standard?
<SNIPPED>

There's really a lot more to it than just making an archive solid. As
you've experienced, the Unix world has been doing things this way for some
time with tar+gzip. The method described is really the only way to make a
solid ZIP archive, that is, archive first, then zip the archive. Better
compression rates are achieved using better compression algorithms. It most
certainly is possible to improve the ZIP algorithm to provide better
compression, the trick is to leave it compatible with the existing UNZIP
algorithm. This may not be possible at all, and even if it were, the
tradeoff to keep it compatible might not provide enough gain to be worth
the effort.

In order to build a better ZIP it almost certainly is necessary to break
the "ZIP" compatibility. I suggest that what you do is create a .7z SFX and
then ZIP it as a ZIP file. The receiving party can then unzip it and
execute it. This is certainly not ideal but it won't get dumped because it
is an EXE and the receiving party won't need 7z to open the archive.
Another trick I've often used is to make the archive a 7z SFX and simply
rename it to filename.xyz. In the body of the email, instruct the receiving
party to save it to a folder and rename it to filename.exe, then execute
it. Your choice, but better compression requires better algorithms, end of
story.
 
After a little digging following up on Mr. Lastname's post, I discovered
that the procedure I have been describing is commonly known as nested
zipping. Further digging revealed that there actually exists a tool
that does this job without manual rezipping: VelcroFly
(http://www.randelshofer.ch/velcrofly/download.html)

The description fits my needs, but unfortunately the performance seems
to be identical to that of my manually nested ZIP files: file1+file2 in
a nested ZIP still takes up 805 kB.

However, results for much smaller files seem to be good. For instance,
the docs for AdvanceCOMP (the program suggested by Mr. Lastname) take
up 43.6 kB in 14 files. Compressed to a plain ZIP they occupy 16 kB,
but in a nested ZIP they only take up 8 kB. In other words, solid
archiving with ZIP can be done. The problem seems to be that once the
original files exceed a certain size the ZIP deflate algorithm is
useless. This is further supported by my experience with tgz, which
uses the same algorithm and achieves the same mediocre results.

Is there really no way around this?

The ZIP (and GZIP) dictionary size is only 32k, far too small to take
advantage of inter-file similarity except for small files. Nested zip
was a bit of a false dawn. Not sure if it would work any better under
"deflate64" - not sure if they extended the dictionary size (or even
added a "solid" option).
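
The effect of the 32 kB window can be checked with a rough Python sketch that compresses a block of pseudo-random data followed by an exact copy of itself. The helper name `solid_gain`, the seed, and the chosen sizes are illustrative assumptions. A result near 1.0 means the second copy compressed away almost for free, and a result near 2.0 means Deflate could not see the first copy any more.

```python
import random
import zlib


def solid_gain(size, seed=0):
    """Size of deflate(block + block) relative to deflate(block)."""
    rng = random.Random(seed)
    block = rng.getrandbits(8 * size).to_bytes(size, "little")
    single = len(zlib.compress(block, 9))
    doubled = len(zlib.compress(block + block, 9))
    return doubled / single


# Sizes on either side of the 32 kB Deflate window.
for kib in (4, 16, 64, 256):
    print(f"{kib:4d} KiB block: ratio {solid_gain(kib * 1024):.2f}")
```

Blocks comfortably below 32 kB should give a ratio close to 1, and blocks above it a ratio close to 2, which matches the observation that nested zipping only pays off for small files.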

ZIP has had its day, and it's a pity Microsoft integrated handling of
such an old compression system, as it props up the old relic for
longer.

Either persuade your contacts to get 7-Zip or another (often freeware
too) program that can handle 7Z, or as suggested, bundle a 7Z self
extractor inside a ZIP to prevent it being blocked.

If they insist on commercial software, then the answer is RAR (with a
free UNRAR).

7-Zip's "7Z" format needs to take its place as the new, de facto
archiving format, as an open format that is far superior to legacy
zip.
 
Matth wrote:
The ZIP (and GZIP) dictionary size is only 32k, far too small to take
advantage of inter-file similarity except for small files. Nested zip
was a bit of a false dawn.

Ah well. Dead end, then.

Thanks for the info.
 