binary diff

  • Thread starter Thread starter Ching-Lung
  • Start date Start date
C

Ching-Lung

Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL
 
Binary diff is a tough problem. So far, the best paper I have seen on the
subject is Andrew Tridgell's paper on rsync. It actually tackles a slightly
more difficult problem that just binary diff, i.e. binary diff between local
and remote file, but the general idea (rolling checksum + strong checksum)
can be used to compute a binary diff between two local files in an efficient
way.

Andrew's paper is on http://samba.org/rsync/tech_report/

I don't undestand how binary serialization would help you implement binary
diff. Seems like it is only going to make matters worse, but maybe you want
to do something slightly different than standard binary diff.

Bruno.
 
Instead of serializing the byte[] diff, why can't you just write that to a
binary diff file (i.e. no serialization)?
 
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL
 
Hi Will,

I actually store the byte diff in a hashtable (key =
offset, value = byte diff). With serialization, it's
easier to maintain the state of the hashtable.

I was once thought about put everything in one byte[] and,
like you suggested, no serialization-write it to a file.
But this way, for each byte diff, I have to store the
offset, length of the byte diff, and the byte itself. And
to reconstruct file B back, I have to have 3 readings to
get the offset, length, and read the next length byte from
get the byte diff.

File A:
+-------------------+
| A | B | C | D | E |
+-------------------+

File B:
+-------------------+
| A | b | c | D | e |
+-------------------+

byte[] diff:
+---------------------------+
| 1 | 2 | b | c | 4 | 1 | e |
+---------------------------+

So:
- diff[0] = 1 indicates that A[1] has changed
- diff[1] = 2 indicates that I have to read the next 2
bytes from diff[] to get 'b' and 'c' and stick them into
file A
- and this goes on and on...

Isn't it too much work to do?

Well, above is just one example and the size of file A and
B happen to be the same. One also needs to consider other
cases.

Any more idea?
-CL




-----Original Message-----
Instead of serializing the byte[] diff, why can't you just write that to a
binary diff file (i.e. no serialization)?

--
William Stacey, DNS MVP

Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL
 
Your algorithm will work and produce a diff that is significantly smaller
than the complete files only in very special circumstances, i.e., if a few
isolated bytes get replaced by some other bytes in your file. But it does
not qualify for a general purpose binary diff utility.

If the file is reorganized in some way (bytes inserted, deleted or moved
around), your algo will not find any match and will produce an output which
is much larger than the files themselves.

And even if the file is not reorganized like that, your algo will perform
very poorly if sequences of bytes change between two versions, because your
are storing the diffs on individual bytes. You should at least try to find
differing sequences rather than individual bytes.

Finally, dumping a Hashtable is not an optimal way to save the difference. I
suggest you take a look at the GDIFF format for a good example on how a
binary diff could be stored: http://www.w3.org/TR/NOTE-gdiff-19970901.

Bruno.

"Ching-Lung" <[email protected]> a écrit dans le message de
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL
 
To get a feel for what is being stored, serialize the class using XML
serializer and look at. As you will see, its not the data but the metadata
that takes a lot of space. Now imagin how you might store the same data and
metadata using a binary format. Its not xml (although there is bxml in the
works), but you see the issue with creating a general format to reconstruct
your object graph. Unless you come up with your own special method to
serialize your structure, binary is probably the smallest you will get. If
you must go more compact, I would probably look at a function to take your
hashtable and convert it to a byte[] or a comma delimited text file and back
again.

--
William Stacey, DNS MVP

Ching-Lung said:
Hi Will,

I actually store the byte diff in a hashtable (key =
offset, value = byte diff). With serialization, it's
easier to maintain the state of the hashtable.

I was once thought about put everything in one byte[] and,
like you suggested, no serialization-write it to a file.
But this way, for each byte diff, I have to store the
offset, length of the byte diff, and the byte itself. And
to reconstruct file B back, I have to have 3 readings to
get the offset, length, and read the next length byte from
get the byte diff.

File A:
+-------------------+
| A | B | C | D | E |
+-------------------+

File B:
+-------------------+
| A | b | c | D | e |
+-------------------+

byte[] diff:
+---------------------------+
| 1 | 2 | b | c | 4 | 1 | e |
+---------------------------+

So:
- diff[0] = 1 indicates that A[1] has changed
- diff[1] = 2 indicates that I have to read the next 2
bytes from diff[] to get 'b' and 'c' and stick them into
file A
- and this goes on and on...

Isn't it too much work to do?

Well, above is just one example and the size of file A and
B happen to be the same. One also needs to consider other
cases.

Any more idea?
-CL




-----Original Message-----
Instead of serializing the byte[] diff, why can't you just write that to a
binary diff file (i.e. no serialization)?

--
William Stacey, DNS MVP

Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL
 
Hi Ching-Lung,

I read your reply to William and saw that your algo is smart enough to find
sequences that differ, not just isolated bytes. So, my judgement was a bit
severe.

Anyway, you should still take a look at the GDIFF format, it is a smart
format for storing the diff and it is relatively easy to generate and
process (you can take some shortcuts and eliminate the byte/short/int
variants of the tags if you don't need an optimal size).

Bruno.

"Ching-Lung" <[email protected]> a écrit dans le message de
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL
 
Back
Top