File-Compare "fc" falsely reports mismatch between identical files

  • Thread starter: Rich Pasco

Rich Pasco

I'm using Windows XP Professional SP3.

I'm trying to compare pairs of binary files and test the result
through the errorlevel. I thought "fc" would do the job, but it
seems to fail by reporting a mismatch between certain pairs of
identical files.

Please refer to the transcript below. First I copy a raw file to
a temporary file, and verify that the copy was successful. It was:
C:\test>copy IMG_0001.dcm.raw file.bin
1 file(s) copied.

C:\test>comp IMG_0001.dcm.raw file.bin
Comparing IMG_0001.dcm.raw and file.bin...
Files compare OK

Compare more files (Y/N) ? n

Now, I use "fc" to compare the same two files:
C:\test>fc IMG_0001.dcm.raw file.bin
Comparing files IMG_0001.dcm.raw and FILE.BIN
***** IMG_0001.dcm.raw
0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0
0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0
***** FILE.BIN
0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0
0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°0°
*****

Notice that a mismatch is reported! This should not be!

Next, I use fc to compare the copy to itself:
C:\test>fc file.bin file.bin
Comparing files file.bin and FILE.BIN
FC: no differences encountered

Of course. But why didn't I get the same result above?

Here are the versions of the tools I am using:
C:\test>which fc
C:\WINDOWS\system32\fc.exe 04/14/2008 05:00 AM 14848 bytes

C:\test\abo>which comp
C:\WINDOWS\system32\comp.exe 04/14/2008 05:00 AM 15872 bytes


- Rich
 
Now, I use "fc" to compare the same two files:


Notice that a mismatch is reported! This should not be!

How long are the lines in the files? Are they unicode?

I'm wondering if the text-mode compare in FC is being fooled by
something... and if the files are binary files then you should be using a
binary compare instead of a text compare.
 
foxidrive said:
How long are the lines in the files? Are they unicode?

There are no lines. This is a binary file.

foxidrive said:
I'm wondering if the text-mode compare in FC is being fooled by
something... and if the files are binary files then you should be using a
binary compare instead of a text compare.

Thanks for the suggestion, now I see that fc /b does a binary compare.

Anyway:

fc IMG_0001.dcm.raw file.bin

reports a bunch of differences, while

fc /b IMG_0001.dcm.raw file.bin

reports "no differences encountered"

I still think it's strange that, even if fc thought it was
comparing text files, it would report any differences between
two copies of the same file.

- Rich
 
Rich Pasco said:
Thanks for the suggestion, now I see that fc /b does a binary compare.

Anyway:

fc IMG_0001.dcm.raw file.bin

reports a bunch of differences, while

fc /b IMG_0001.dcm.raw file.bin

reports "no differences encountered"

I still think it's strange that, even if fc thought it was
comparing text files, it would report any differences between
two copies of the same file.

One thing to consider is the buffer size used by FC when reading files.

It is conceivable that a fairly large binary file may not contain a CR/LF
pair and so the buffer overflows when comparing in text mode.
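
One way to test that idea - a sketch using fsutil, which zero-fills the
file it creates, so the result contains no CR/LF pair at all (fsutil may
need admin rights; the size and names are arbitrary):

rem Two identical files with no CR/LF anywhere (all zero bytes).
fsutil file createnew a.bin 200000
copy /b a.bin b.bin
rem Default (text-mode) compare vs binary compare:
fc a.bin b.bin
fc /b a.bin b.bin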
 
foxidrive said:
It is conceivable that a fairly large binary file may not contain a CR/LF
pair and so the buffer overflows when comparing in text mode.

Certainly possible, but would two copies of the same file overflow in
different places? Apparently that's what happened.

Interestingly,
fc IMG_0001.dcm.raw file.bin
reported differences, but
fc file.bin IMG_0001.dcm.raw
reported that the files were identical
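
A little batch sketch to capture both orders and their exit codes in
one run:

@echo off
rem Compare in both argument orders; report fc's exit code each time.
fc IMG_0001.dcm.raw file.bin >nul
echo A-B exit code: %errorlevel%
fc file.bin IMG_0001.dcm.raw >nul
echo B-A exit code: %errorlevel%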

- Rich
 
Rich Pasco said:
Certainly possible, but would two copies of the same file overflow in
different places? Apparently that's what happened.

Interestingly,
fc IMG_0001.dcm.raw file.bin
reported differences, but
fc file.bin IMG_0001.dcm.raw
reported that the files were identical

- Rich

Intriguing. I've seen FC do strange things in the past - comparing A against
B giving different results from B against A, but didn't have time to
investigate the problem.

The only difference I see in this case is the filename length - what happens
if both filenames have the same length, both as 8.3 names and long
filenames?

May be clutching at straws, but perhaps the buffer is shared between the
filename and the data read; fine for real-world ASCII but not for binary
being interpreted as line-oriented ASCII.
 
billious said:
Intriguing. I've seen FC do strange things in the past - comparing A
against B giving different results from B against A, but didn't have time
to investigate the problem.

The only difference I see in this case is the filename length - what
happens if both filenames have the same length, both as 8.3 names and long
filenames?

May be clutching at straws, but perhaps the buffer is shared between the
filename and the data read; fine for real-world ASCII but not for binary
being interpreted as line-oriented ASCII.

While this is indeed an interesting observation, it is illogical to
insist/assume that a tool designed to compare text files should be able to
consistently determine that two identical binary files are identical,
because the flip side of that is that it should therefore always be able to
determine that two non-identical binary files differ.

/Al
 
Al Dunbar said:
While this is indeed an interesting observation, it is illogical to
insist/assume that a tool designed to compare text files should be able to
consistently determine that two identical binary files are identical,
because the flip side of that is that it should therefore always be able
to determine that two non-identical binary files differ.

/Al

Hmm, I'd not completely agree.

Whereas the RESULT of the operation may not be correct, I'd suggest that the
result should be CONSISTENT between runs; FC A B and FC B A for identical
files being simply a particular case that /might/ point to a reason.

Talking theory, FC/B must be the SIMPLEST (though not the most-frequently used
and hence not the default) mode. In the FC/B case, all that is required is to read
the data from the two files into a buffer, compare them byte-by-byte and
report any differences. Nothing complicated there.

FC/A is a completely different kettle of fish. The data has to be read and
assembled line-by-line using CRLF as an EOL, and stored to allow the
line-buffer to be used. How is the line stored? No doubt C-style as an
ASCIIZ - so how does the program react to a NULL read from its input data?
How is ^Z treated? Is it just accepted as a "normal" character, or is it EOF
since we're dealing with an ASCII-compare? COPY for instance appears to
recognise ^Z as end-of-file and will terminate copying a binary file at the
first one encountered - does FC follow this idea? By extension, should FC
report two ASCII files as being identical if one is straight-ASCII and the
other identical EXCEPT for an appended ^Z?

Then there's the /w problem - evidently implemented as a comparison between
the data after it has been buffered.

What, in /A (default) mode is the result of reading a long-long-long line -
is there a limit? And what about the last "line" of an ASCII file that ISN'T
CRLF-terminated?

I'd suggest that IDENTICAL input should produce IDENTICAL results, and any
inconsistent behaviour indicates an unwarranted assumption has been made -
possibly uninitialised variables, and I'd claim that if this behaviour
depends on the filenames involved, then that evidence reinforces my
suspicions.

I'd challenge your "flip-side" notion, too. There is a substantial amount of
software that has been written that appears to work, but actually it fails
to fail; working by chance rather than design. This can sometimes be proved
by applying carefully-selected data; often it's been discovered by users and
they've created manual procedures to avoid the known problems with the tools
that they've been supplied. Often also, to prove the point one has to try to
get an office manager with no IT comprehension to understand.

In the case of the FC problem to which I referred, I recall now that I was
executing FC A B where A and B were COBOL source files resident on remote
machines. Repeatedly running FC A B would suddenly fail with FC claiming
that one of the files couldn't be found, although DIR, COPY, EDIT, etc.
could find it - FC stubbornly refused to find it once it had decided that it
didn't exist.

I strongly suspect that this was actually something wrong with the network -
but the insane network administrator had to be in a rare calm mood for such
matters to be discussed, and she'd fly into a rage of screamed accusations
of "living in the past" if you were to use anything that wasn't
point-click-and-giggle. Nothing about her set-ups could ever, EVER be her
fundamental lack of appreciation of cause and effect.
 
billious said:
Hmm, I'd not completely agree.

No problem. I don't always fully agree with my own ideas...
Whereas the RESULT of the operation may not be correct, I'd suggest that
the result should be CONSISTENT between runs; FC A B and FC B A for
identical files being simply a particular case that /might/ point to a
reason.

While it seems reasonable to say that the result should be consistent
between runs, one could also say that, for consistency's sake, one should
only use tools in the manner and for the purpose they were designed and
intended.
Talking theory, FC/B must be the SIMPLEST (though not the most-frequently
used, and hence not the default) mode. In the FC/B case, all that is required is to
read the data from the two files into a buffer, compare them byte-by-byte
and report any differences. Nothing complicated there.
Agreed.

FC/A is a completely different kettle of fish.

Right. On my system FC /? indicates that this means to display only the
first and last lines of each set of differences, whereas /L is said to
compare files as ascii text.
The data has to be read and assembled line-by-line using CRLF as an
EOL, and stored to allow the line-buffer to be used. How is the line
stored? No doubt C-style as an ASCIIZ - so how does the program react to a
NULL read from its input data? How is ^Z treated? Is it just accepted as a
"normal" character, or is it EOF since we're dealing with a ASCII-compare?
COPY for instance appears to recognise ^Z as end-of-file and will
terminate copying a binary file at the first one encountered - does FC
follow this idea? By extension, should FC report two ASCII files as being
identical if one is straight-ASCII and the other identical EXCEPT for an
appended ^Z?

That depends on whether or not that is what everybody wants it to do. Since
we, as a species, have been unable to provide a rock-solid definition of
what a text file is (see
http://www.google.ca/search?hl=en&q=define:+text+file&meta=&aq=f&oq=), we
can hardly complain when this lack of clarity results in anomalies...
Then there's the /w problem - evidently implemented as a comparison
between the data after it has been buffered.

What, in /A (default) mode is the result of reading a long-long-long
line - is there a limit? And what about the last "line" of an ASCII file
that ISN'T CRLF-terminated?

I'd suggest that IDENTICAL input should produce IDENTICAL results, and any
inconsistent behaviour indicates an unwarranted assumption has been made -
possibly uninitialised variables, and I'd claim that if this behaviour
depends on the filenames involved, then that evidence reinforces my
suspicions.

I agree with you almost completely. But I come to the conclusion that, for
the most part, the "assumptions" are valid for most of the files that we
generally consider to be "text files". Show me a couple of "text files" that
fc/a does not compare properly, and I would argue that they are so extreme
in some way that I would not consider them "text files". One of the
definitions found by google is this: "A file that contains characters
organized into one or more lines. The lines must not contain null characters
and none can exceed the maximum line length allowed by the implementation."
Ah, the implementation. In this case FC would be the implementation, would
it not?

But, that said, what is your definition of a text file, and is that the
authoritative definition? I mean, if there is no general agreement on a
definition, then how can it be said that the assumptions made were
incorrect?

But I see no indication here that FC gives non-identical results on
identical input. FC A B may give different results from FC B A, however, I
would suggest that, by definition, and from the point of view of FC, the
input is therefore NOT identical. I would also suggest that FC A B appears to
ALWAYS give the same results as itself, and that the same goes for FC B A.
Reading ahead, you seem to suggest that this may not be true. I'll address
that further down-thread.
I'd challenge your "flip-side" notion, too.

That is the part where I expected the most challenges...
There is a substantial amount of software that has been written that
appears to work, but actually it fails to fail; working by chance rather
than design.

Oh, you've seen some of my work, then, have you? ;-)
This can sometimes be proved by applying carefully-selected data; often
it's been discovered by users and they've created manual procedures to
avoid the known problems with the tools that they've been supplied. Often
also, to prove the point one has to try to get an office manager with no
IT comprehension to understand.

OK, let's put this into context: which software packages would you propose
as having absolutely no flaws?
In the case of the FC problem to which I referred, I recall now that I was
executing FC A B where A and B were COBOL source files resident on remote
machines. Repeatedly running FC A B would suddenly fail with FC claiming
that one of the files couldn't be found, although DIR, COPY, EDIT, etc.
could find it - FC stubbornly refused to find it once it had decided that
it didn't exist.

You have basically demonstrated that, like most of the world's known
software, FC is not completely flawless. But you err by anthropomorphizing
its motives ;-)
I strongly suspect that this was actually something wrong with the
network - but the insane network administrator had to be in a rare calm
mood for such matters to be discussed, and she'd fly into a rage of
screamed accusations of "living in the past" if you were to use anything
that wasn't point-click-and-giggle. Nothing about her set-ups could ever,
EVER be her fundamental lack of appreciation of cause and effect.

If the writers of FC could have anticipated our expectations, perhaps they
would have had its help text explain its limitations in terms of line
length, number of lines, size, or whatever other assumptions might have been
implicit at the time. I would also have suggested to them that if either or
both of the files turned out to exceed the specifications of its definition
of a "text file", that it should return an indicator that the files are not
identical text files.

/Al
 
Al Dunbar said:
While it seems reasonable to say that the result should be consistent
between runs, one could also say that, for consistency's sake, one should
only use tools in the manner and for the purpose they were designed and
intended.


Right. On my system FC /? indicates that this means to display only the
first and last lines of each set of differences, whereas /L is said to
compare files as ascii text.

Frankly, the distinction escapes me. /A may or may not be ASCII - I believe
that it's assumed to be ASCII on the grounds that /B is Binary. Perhaps it
is not so.

To reproduce the line-before and the line-after (which appears to be the /A
documented and the default behaviour) it would seem that FC interprets the
file - let's assume it's a text file for simplicity's sake - in a
line-oriented manner. So what is the distinction between /L and /A?

Also, the documentation and behaviour appear difficult to reconcile. What is
really meant by "first and last lines for each set of differences?" For
instance, if we have the sequence SSDDSDDDS (Same/Different lines) then FC
appears to show SDDSDDDS, starting at the second "Same" line and ending with
the fourth. It could be argued that SDDS is one "set" and SDDDS is a second
"set" hence the third Same line should be reproduced both as the last line
of the first "set of differences" and also as the first of the second "set
of differences."

And if it is LINES that are being compared, what is the difference between
/A and /L mode? Both are line-oriented.
[behaviour with "non-ASCII" files]
That depends on whether or not that is what everybody wants it to do.
Since we, as a species, have been unable to provide a rock-solid
definition of what a text file is (see
http://www.google.ca/search?hl=en&q=define:+text+file&meta=&aq=f&oq=),
we can hardly complain when this lack of clarity results in anomalies...

Much heartache is caused by the assumption that what is "standard" in
Redmond is some variety of universal standard. The screaming banshee to
which I've referred would claim that I was lying when the SAME file was sent
to two different printers and produced different results. The fact that the
printers were attached to different machines using different OSs and
different drivers, being in her claimed area of expertise, was beyond her
comprehension. The fact that one printer was set to a UK character set and
another to a US character set, so that H'23' was rendered either as the
pound-currency symbol or as an octothorpe, was of no consequence; the files had
to be different because the results, as processed by her perfect self-correcting
creation were different.
I agree with you almost completely. But I come to the conclusion that, for
the most part, the "assumptions" are valid for most of the files that we
generally consider to be "text files". Show me a couple of "text files"
that fc/a does not compare properly, and I would argue that they are so
extreme in some way that I would not consider them "text files". One of
the definitions found by google is this: "A file that contains characters
organized into one or more lines. The lines must not contain null
characters and none can exceed the maximum line length allowed by the
implementation." Ah, the implementation. In this case FC would be the
implementation, would it not?

I'd suspect that the earliest FC implementations were assembler, oriented
toward 80-column data. It would seem unreasonable in that environment to
produce a report wider than 80 columns, given the peripherals commonly in
use at the time. Even had the output of FC been sent to a file and typed,
word-wrapping on an 80-column screen would have been tedious and difficult
to interpret.

Also, in those days 7-bit ASCII was de rigueur. A few control characters were
used - CR,LF,FF,TAB - but the others had little relevance to the printed
document. Were the "high-ASCII" characters graphics or special characters
used in non-English alphabets? Unicode was way in the future...

As techniques have moved away from these earlier ideas, so the definitions
have become more fuzzy.
But, that said, what is your definition of a text file, and is that the
authoritative definition? I mean, if there is no general agreement on a
definition, then how can it be said that the assumptions made were
incorrect?

Aye, that's the nub of the problem. I believe text files were originally
assumed to be 7-bit ASCII, organised as "lines" being terminated by a CRLF
sequence. "Lines" could be up to 80 characters long.

But each of these "requirements" is rubbery. "80" characters could be 132 -
the common printer width for 15" printers. Or 164 or so, in
compressed-print, or more with proportional-print, or more if the "text" was
data not meant to be printed. 7-bit could be expanded to take care of
accented characters, etc.

In the end, it becomes a meaningless, yet surprisingly commonly-used term.
If the line length is not limited, and the character-set processed is not
limited, then what is the difference between "ASCII" and "Binary?" It
becomes simply a binary-compare in blocks delimited by the arbitrary CRLF
sequence. What "authority" is going to impose a line-length or character-set
limit - and remember that there will always be the dissenters who want "just
a few more characters" or "oh - and this character, too."

I suppose it's one of those things that slowly drifts. What is a "database"
for instance? Is it that set of data that is controlled by an instance of a
DBMS (which is what I'd tend to use?) Or is it the entirety of data owned,
as some would have it? Or perhaps it is some random subset of that data, as
others would claim? I was surprised by a prompt from one DBMS that asked
whether I wanted to "format the database" when in fact just one table in its
own individual file had to be altered. Formatting something over 200 tables
seemed to be a little overkill...

Or perhaps like the Screaming Banshee to which I have referred who insisted
(despite allegedly holding a degree and therefore presumably having been
taught at least some of the terminology) on calling a fixed-column format
file a "tab-deliminated file" (not "tab-delimited," I suspect specifically
to annoy me) despite the fact that it didn't actually contain tab
characters...
But I see no indication here that FC gives non-identical results on
identical input. FC A B may give different results from FC B A, however, I
would suggest that, by definition, and from the point of view of FC, the
input is therefore NOT identical. I would also suggest that FC A B appears
to ALWAYS give the same results as itself, and that the same goes for FC B
A. Reading ahead, you seem to suggest that this may not be true. I'll
address that further down-thread.

Hmm. If A and B are two separate files with identical contents, then FC
processes the file differently depending on their names. This indicates that
FC's output does not depend WHOLLY on the contents of the files examined.

What guarantee is there then, that there are no other circumstances when the
NAMES of the files will influence the outcome of FC's processing?

Could it be that FC simply fails to fail? (and from there, why?)

OK, let's put this into context: which software packages would you propose
as having absolutely no flaws?

Well - anything I've written, naturally :D

IEFBR14? Oh wait - that was a single-instruction program that turned out to
have a bug...

I believe that there are bugs in most programs. The difference is the
willingness of the constructor to correct them. This varies from case to
case. In my case, as an old-time small-h hacker (in the original sense of
the word) I spend a lot of time perfecting code - even when it's only a
theoretical possibility that the fault will occur. It's called pride in work
or professionalism.

On the other hand, a well-known accounting-system manufacturer once
released a "localised" software version that printed dates in dd/mm/yy format
EXCEPT for one particular report - and then point-blank REFUSED to issue a
corrective patch. What worries me here is the technical competence involved.
The very fact that this occurred indicates that their code does not use a
centralised convert-date-to-display-format routine, contributing to
winbloat. What other poor coding practices have they used in rushing to
market, and that they are prepared to cover up and deny?

Again, the fail-to-fail scenario. At the company I've referred to on many
occasions, their previous programmers had incorrectly implemented a
price-loading formula (costing the company money) and had also incorrectly
calculated tax payable - for a period of six years. Company management
insisted on the first one being corrected, but that the second be ignored. A
few months later, the tax department reacted to a customer's complaint and
insisted that the faulty tax calculation be fixed. No pride from management
in doing the right thing, leading to problems. Believe me, they couldn't
AFFORD to have a proper tax investigation of their affairs....
You have basically demonstrated that, like most of the world's known
software, FC is not completely flawless. But you err by anthropomorphizing
its motives ;-)

...But I'm intrigued by its apparent capacity to remember. And it seems that
the little silicon monsters have you convinced they're not out to take
over...:D

If the writers of FC could have anticipated our expectations, perhaps they
would have had its help text explain its limitations in terms of line
length, number of lines, size, or whatever other assumptions might have
been implicit at the time. I would also have suggested to them that if
either or both of the files turned out to exceed the specifications of its
definition of a "text file", that it should return an indicator that the
files are not identical text files.

You mean - meaningful documentation?

In that case, the files were strictly 7-bit ASCII, limited to 80-character
records and even in 8.3-named files, but the pathnames were LFN-style. The
only "hiccough" being that they were mapped to servers. Perhaps there was a
problem with the mapping mechanism, or perhaps it was an FC weirdness. I
could not properly investigate the matter because SB defined her system as
flawless (despite its periodic crashes) and any such problems were
"obviously" using obsolete software to do something that didn't need to be
done anyway ('coz she said it didn't need to be done - not that she actually
knew what was being done, it just didn't need to be done because it didn't.
She'd decided...)

But as for FC - I'd like it to have a /Q option - to suppress output and
simply set ERRORLEVEL, as a simple same/different condition is often all
that is required. Now that brings up /W - should /W simply define that any
sequence of 1 or more TABs is the same as 1 or more SPACES, or should the
TABS be expanded out as spaces to - well, columns of 8, 5, 4 or 2 (I've seen
all defined as "standard" in different environments) - or perhaps go back to
the old typewriter standard of "wherever you want them this time." What
about tabs/spaces appearing as trailing whitespace? In this case, one or
more space/tabs would match zero or more space/tabs before CRLF. Perhaps we
need a switch for this, too. Trailing whitespace can cause much grief on a
batch "SET var=" line...

Possibly even a /i switch for a case-insensitive version...
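
In the meantime, redirection approximates the /Q I'm wishing for - a
sketch:

rem Poor man's /Q: discard the report, keep only the pass/fail status.
rem Note that any nonzero status (including errors) takes the || branch.
fc /b file1.bin file2.bin >nul 2>&1 && echo same || echo different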
 
Richard,
this has *nothing* to do with filename length. FC is never
influenced by the filename - it does not compare filenames, period.

It is probably to do with as small a thing as a single extra
terminating CR+LF!

What you need to do to check this out is simply compare both files in a
binary editor.

If you do not have one to hand, you can use EDIT.COM with a
"line-wrap switch", such as: edit.com /78 - but note that the CR+LF
characters will show up as "??".

If you do have a binary editor, then "invisible characters" will show
up as numerical values, whereas in a normal text editor two files can
look exactly the same even though there are differences when viewed in
binary.


Here is an example using a simple boot.ini file, viewed in a binary
editor.



Boot1.ini -
00000000  5B 62 6F 6F 74 20 6C 6F - 61 64 65 72 5D 0D 0A 74  [boot loader]..t
00000010  69 6D 65 6F 75 74 3D 35 - 0D 0A 64 65 66 61 75 6C  imeout=5..defaul
00000020  74 3D 6D 75 6C 74 69 28 - 30 29 64 69 73 6B 28 30  t=multi(0)disk(0
00000030  29 72 64 69 73 6B 28 30 - 29 70 61 72 74 69 74 69  )rdisk(0)partiti
00000040  6F 6E 28 31 29 5C 57 49 - 4E 44 4F 57 53 0D 0A 5B  on(1)\WINDOWS..[
00000050  6F 70 65 72 61 74 69 6E - 67 20 73 79 73 74 65 6D  operating system
00000060  73 5D 0D 0A 6D 75 6C 74 - 69 28 30 29 64 69 73 6B  s]..multi(0)disk
00000070  28 30 29 72 64 69 73 6B - 28 30 29 70 61 72 74 69  (0)rdisk(0)parti
00000080  74 69 6F 6E 28 31 29 5C - 57 49 4E 44 4F 57 53 3D  tion(1)\WINDOWS=
00000090  22 4D 69 63 72 6F 73 6F - 66 74 20 57 69 6E 64 6F  "Microsoft Windo
000000A0  77 73 20 58 50 22 0D 0A -                          ws XP"..



Boot2.ini -
00000000  5B 62 6F 6F 74 20 6C 6F - 61 64 65 72 5D 0D 0A 74  [boot loader]..t
00000010  69 6D 65 6F 75 74 3D 35 - 0D 0A 64 65 66 61 75 6C  imeout=5..defaul
00000020  74 3D 6D 75 6C 74 69 28 - 30 29 64 69 73 6B 28 30  t=multi(0)disk(0
00000030  29 72 64 69 73 6B 28 30 - 29 70 61 72 74 69 74 69  )rdisk(0)partiti
00000040  6F 6E 28 31 29 5C 57 49 - 4E 44 4F 57 53 0D 0A 5B  on(1)\WINDOWS..[
00000050  6F 70 65 72 61 74 69 6E - 67 20 73 79 73 74 65 6D  operating system
00000060  73 5D 0D 0A 6D 75 6C 74 - 69 28 30 29 64 69 73 6B  s]..multi(0)disk
00000070  28 30 29 72 64 69 73 6B - 28 30 29 70 61 72 74 69  (0)rdisk(0)parti
00000080  74 69 6F 6E 28 31 29 5C - 57 49 4E 44 4F 57 53 3D  tion(1)\WINDOWS=
00000090  22 4D 69 63 72 6F 73 6F - 66 74 20 57 69 6E 64 6F  "Microsoft Windo
000000A0  77 73 20 58 50 22 0D 0A - 0D 0A                    ws XP"....




...please notice the extra CR+LF terminating the second file. Run a
comparison of these two boot.ini files with FC.EXE and they will report
a difference, though they look identical in Notepad - and there will not
even be any visible difference in what FC reports as the differences!

If you look at your files in a binary editor - I think you will find the
differences there.
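
If no binary editor is to hand at all, the bytes can also be compared as
hex text - a sketch using certutil's -encodehex verb (I am assuming
certutil is available; it is built in on later versions of Windows):

rem Dump both files as hex text, then compare the dumps; trailing
rem CR+LF or ^Z bytes become visible characters in the output.
certutil -encodehex boot1.ini boot1.hex
certutil -encodehex boot2.ini boot2.hex
fc boot1.hex boot2.hex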

==

Cheers, Tim Meddick, Peckham, London. :-)
 
Tim Meddick said:
Rich Pasco said:
Certainly possible, but would two copies of the same file overflow in
different places? Apparently that's what happened.

Interestingly,
fc IMG_0001.dcm.raw file.bin
reported differences, but
fc file.bin IMG_0001.dcm.raw
reported that the files were identical

- Rich
this has *nothing* to do with filename length. FC is never
influenced by the filename - it does not compare filenames, period.

[snip]
==

Cheers, Tim Meddick, Peckham, London. :-)

Tim, Rich's complaint is that

FC A B
and
FC B A

produced different results (where A is IMG_0001.dcm.raw and B is file.bin)

FC A B responded with differences
FC B A responded that the files were identical.

The problem seems to be that the files are binary not text (as may be
gathered from their names) hence the /b switch should have been used.

But the question then arises why the default-mode comparison yields
different results depending on the sequence of the arguments. Assuming that
the files have identical content (which has not been confirmed by OP, but is
implied) then what reason could there be that makes FC sensitive to the
sequence of arguments?

Or, put another way - by chance, "FC A B" was used, leading to the discovery
of the problem. Had the command "FC B A" been used in the batch, the error
(wrong compare mode) might never have been discovered. "FC B A" would have
failed to fail.

If "FC is never influenced by the filename" as you claim, why is it that "FC
A B" and "FC B A" yield different results?
 
I see, so I got that completely wrong then.

I should be reading more closely, in future...

==

Cheers, Tim Meddick, Peckham, London. :-)




billious said:
Tim Meddick said:
Rich Pasco said:
foxidrive wrote:

It is conceivable that a fairly large binary file may not contain a CR/LF
pair and so the buffer overflows when comparing in text mode.

Certainly possible, but would two copies of the same file overflow in
different places? Apparently that's what happened.

Interestingly,
fc IMG_0001.dcm.raw file.bin
reported differences, but
fc file.bin IMG_0001.dcm.raw
reported that the files were identical

- Rich
this has *nothing* to do with filename length. FC is
never influenced by the filename - it does not compare filenames,
period.

[snip]
==

Cheers, Tim Meddick, Peckham, London. :-)

Tim, Rich's complaint is that

FC A B
and
FC B A

produced different results (where A is IMG_0001.dcm.raw and B is
file.bin)

FC A B responded with differences
FC B A responded that the files were identical.

The problem seems to be that the files are binary not text (as may be
gathered from their names) hence the /b switch should have been used.

But the question then arises why the default-mode comparison yields
different results depending on the sequence of the arguments. Assuming
that the files have identical content (which has not been confirmed by
OP, but is implied) then what reason could there be that makes FC
sensitive to the sequence of arguments?

Or, put another way - by chance, "FC A B" was used, leading to the
discovery of the problem. Had the command "FC B A" been used in the
batch, the error (wrong compare mode) might never have been
discovered. "FC B A" would have failed to fail.

If "FC is never influenced by the filename" as you claim, why is it
that "FC A B" and "FC B A" yield different results?
 
billious said:
Frankly, the distinction escapes me. /A may or may not be ASCII - I
believe that it's assumed to be ASCII on the grounds that /B is Binary.
Perhaps it is not so.

To reproduce the line-before and the line-after (which appears to be the
/A documented and the default behaviour) it would seem that FC interprets
the file - let's assume it's a text file for simplicity's sake - in a
line-oriented manner. So what is the distinction between /L and /A?

I suspect that /A may *imply* /L. But, according to FC /?, /A does not mean
to do an ASCII comparison.
Also, the documentation and behaviour appear difficult to reconcile. What
is really meant by "first and last lines for each set of differences?" For
instance, if we have the sequence SSDDSDDDS (Same/Different lines) then FC
appears to show SDDSDDDS, starting at the second "Same" line and ending
with the fourth. It could be argued that SDDS is one "set" and SDDDS is a
second "set" hence the third Same line should be reproduced both as the
last line of the first "set of differences" and also as the first of the
second "set of differences."

And if it is LINES that are being compared, what is the difference between
/A and /L mode? Both are line-oriented.
apparently
[behaviour with "non-ASCII" files]
That depends on whether or not that is what everybody wants it to do.
Since we, as a species, have been unable to provide a rock-solid
definition of what a text file is (see
http://www.google.ca/search?hl=en&q=define:+text+file&meta=&aq=f&oq=),
we can hardly complain when this lack of clarity results in anomalies...

Much heartache is caused by the assumption that what is "standard" in
Redmond is some variety of universal standard.

You've made this assumption? I haven't.

I'd suspect that the earliest FC implementations were assembler, oriented
toward 80-column data. It would seem unreasonable in that environment to
produce a report wider than 80 columns, given the peripherals commonly in
use at the time. Even had the output of FC been sent to a file and typed,
word-wrapping on an 80-column screen would have been tedious and difficult
to interpret.

Also, in those days 7-bit ASCII was de rigueur. A few control characters
were used - CR,LF,FF,TAB - but the others had little relevance to the
printed document. Were the "high-ASCII" characters graphics or special
characters used in non-English alphabets? Unicode was way in the future...

As techniques have moved away from these earlier ideas, so the definitions
have become more fuzzy.

You are probably right.
Aye, that's the nub of the problem. I believe text files were originally
assumed to be 7-bit ASCII, organised as "lines" being terminated by a CRLF
sequence. "Lines" could be up to 80 characters long.

But each of these "requirements" is rubbery. "80" characters could be
132 - the common printer width for 15" printers. Or 164 or so, in
compressed-print, or more with proportional-print, or more if the "text"
was data not meant to be printed. 7-bit could be expanded to take care of
accented characters, etc.

In the end, it becomes a meaningless, yet surprisingly commonly-used term.
If the line length is not limited, and the character-set processed is not
limited, then what is the difference between "ASCII" and "Binary?" It
becomes simply a binary-compare in blocks delimited by the arbitrary CRLF
sequence. What "authority" is going to impose a line-length or
character-set limit - and remember that there will always be the
dissenters who want "just a few more characters" or "oh - and this
character, too."

The lack of an absolute definition of what a text file is does not mean that
there is no benefit in making the distinction between ASCII or text and
binary.

Hmm. If A and B are two separate files with identical contents, then FC
processes the file differently depending on their names. This indicates
that FC's output does not depend WHOLLY on the contents of the files
examined.

What guarantee is there then, that there are no other circumstances when
the NAMES of the files will influence the outcome of FC's processing?

The names, or some other factor that we have not considered - now *that* is
the question, especially if we are talking about the outcome when FC
compares files that are well within the simpler definitions of text files.
Could it be that FC simply fails to fail? (and from there, why?)

Well, as I have tried to imply, I am not convinced that it is rational to
suggest that a tool can be said to fail when it is used in a manner and for
a purpose for which it was clearly not intended ;-)

Again, the fail-to-fail scenario. At the company I've referred to on many
occasions, their previous programmers had incorrectly implemented a
price-loading formula (costing the company money) and had also incorrectly
calculated tax payable - for a period of six years. Company management
insisted on the first one being corrected, but that the second be ignored.
A few months later, the tax department reacted to a customer's complaint
and insisted that the faulty tax calculation be fixed. No pride from
management in doing the right thing, leading to problems. Believe me, they
couldn't AFFORD to have a proper tax investigation of their affairs....

Interesting analogy. Perhaps MS will be forced to fix FC when it has been
demonstrated to be a serious problem, serious meaning something like
lawsuits ;-)

You mean - meaningful documentation?

Yes. But I'm not holding my breath on this or on any of the many other
things that could stand to be corrected. Like, for example, when you
shutdown a windows 98 system, it is left with a message on the screen to the
effect that: it is now safe to turn your computer off. When it was my
employer's computer that I was shutting down, I wondered how the o/s could
tell that it would actually be safe for me to power off my computer at home
at that particular moment.
But as for FC - I'd like it to have a /Q option - to suppress output and
simply set ERRORLEVEL, as a simple same/different condition is often all
that is required. Now that brings up /W - should /W simply define that any
sequence of 1 or more TABs is the same as 1 or more SPACES, or should the
TABS be expanded out as spaces to - well, columns of 8, 5, 4 or 2 (I've
seen all defined as "standard" in different environments) - or perhaps go
back to the old typewriter standard of "wherever you want them this time."
What about tabs/spaces appearing as trailing whitespace? In this case, one
or more space/tabs would match zero or more space/tabs before CRLF.
Perhaps we need a switch for this, too. Trailing whitespace can cause much
grief on a batch "SET var=" line...

Possibly even a /i switch for a case-insensitive version...

Perhaps there is yet time for you to develop the ultimate file comparison
program...

/Al
 
Al said:
billious said:
Right. On my system FC /? indicates that this means to display only
the first and last lines of each set of differences, whereas /L is
said to compare files as ascii text.

[FC /L /A]

And if it is LINES that are being compared, what is the difference
between /A and /L mode? Both are line-oriented.

apparently

Ah - be very, very literal.

The default mode is apparently /L

If /A is used, then the begin/end lines of the mismatch-block are shown.

For instance, if lines are APPENDED to a file, then FC and FC/L give
identical results, showing the last-line-matching and the new-lines-added.

FC/A shows

the last-line-matching
...
the last-line-added

(where ... is literally that, ...)

(I tried this by simply appending lines; I did not try inserting a block of
lines in the middle of a file)

So there IS a difference.
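
For reference, the append experiment as a sketch (file names invented):

rem Append two lines to a copy of a file, then compare in /L and /A modes.
copy /y base.txt new.txt
echo appended line one>> new.txt
echo appended line two>> new.txt
fc /l base.txt new.txt
fc /a base.txt new.txt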
[behaviour with "non-ASCII" files]

Much heartache is caused by the assumption that what is "standard" in
Redmond is some variety of universal standard.

You've made this assumption? I haven't.

Far from it - in fact, I've done a great deal of work to REVERSE the
consequences of this assumption.

There is however, a crowd based in Redmond that has accumulated itself a
vast herd of zombie-acolytes who, should you protest their assumption of the
right to create and change standards at will and whim, will hiss with
religious fervour "reboot the unbeliever."

The lack of an absolute definition of what a text file is does not
mean that there is no benefit in making the distinction between ASCII
or text and binary.

The distinction is made by humans, for human purposes.
[what constitutes a failure?]
Well, as I have tried to imply, I am not convinced that it is
rational to suggest that a tool can be said to fail when it is used
in a manner and for a purpose for which it was clearly not intended ;-)

FC has no documented limit. It may be that this problem only occurs when the
"line length" exceeds a given size. "clearly not intended?" Er, there's
always going to be someone who wants one more character, as I said - either
on the length or in the character-set being processed.

FC /B processes byte-by-byte
FC [/L] processes in records delimited by a terminal CRLF (or EOF)
FC /A modifies and restricts FC /L's report.

There's no documented limit on how long these records may be. There is an
application-specific human interpretation, but that depends on the human's
definition.


Interesting analogy. Perhaps MS will be forced to fix FC when it has
been demonstrated to be a serious problem, serious meaning something
like lawsuits ;-)

Oh - you mean like they fixed the parsing into FCB2 with Dos - er, 6?, 5? 4?
Yes. But I'm not holding my breath on this or on any of the many other
things that could stand to be corrected. Like, for example, when you
shutdown a windows 98 system, it is left with a message on the screen
to the effect that: it is now safe to turn your computer off. When it
was my employer's computer that I was shutting down, I wondered how
the o/s could tell that it would actually be safe for me to power off
my computer at home at that particular moment.

Much less which one or ones of "my" computers it was safe to shut down.

But that's stretching things a little - sort of how the lawyers carry on (at
an astronomical fee, of course)
Perhaps there is yet time for you to develop the ultimate file
comparison program...

...to have it classed as a virus, or as spam, or as an unsupported
third-party utility (by the most knowledgeable of Office Managers.)
 
billious said:
Tim, Rich's complaint is that

FC A B
and
FC B A

produced different results (where A is IMG_0001.dcm.raw and B is file.bin)

FC A B responded with differences
FC B A responded that the files were identical.

The problem seems to be that the files are binary not text (as may be
gathered from their names) hence the /b switch should have been used.

But the question then arises why the default-mode comparison yields
different results depending on the sequence of the arguments. Assuming that
the files have identical content (which has not been confirmed by OP, but is
implied)

The files are identical, as confirmed by the "COMP" command (which
is *only* binary). In fact they came to exist by the COPY command:

copy IMG_0001.dcm.raw file.bin
then what reason could there be that makes FC sensitive to the
sequence of arguments?

Or, put another way - by chance, "FC A B" was used, leading to the discovery
of the problem. Had the command "FC B A" been used in the batch, the error
(wrong compare mode) might never have been discovered. "FC B A" would have
failed to fail.

If "FC is never influenced by the filename" as you claim, why is it that "FC
A B" and "FC B A" yield different results?

Good question.

- Rich
 
sw0rdfish said:
why not get diff for windows at http://gnuwin32.sourceforge.net/packages/diffutils.htm
and give it a try.

Thanks, but it is at least as complicated as the troublesome Windows
command fc, in that it includes the machinery to compare text files
line by line and put out a list of differences.

All I wanted was a quick check to determine whether two binary files
are or are not identical, and return the answer to a client program
(e.g. by an ERRORLEVEL).
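
Something like this little wrapper (the name is invented) would do - it
hands fc /b's verdict straight back to the caller through the exit code;
GNU diff's -q/--brief switch offers much the same via its exit status:

@echo off
rem bincomp.cmd - exit 0 only when the two argument files are
rem byte-for-byte identical; nonzero otherwise.
rem usage, from another batch file:
rem   call bincomp.cmd a.bin b.bin
rem   if errorlevel 1 echo not identical
fc /b "%~1" "%~2" >nul 2>&1
exit /b %errorlevel%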

- Rich
 
Alex K. Angelopoulos said:
By the way, although the article doesn't explicitly mention the problem
you've described, it does appear that there is a more recent version of fc
available for XP that fixes at least one bug.

http://support.microsoft.com/kb/953929

Good spotting, Alex!

But this raises even more questions:

If Ulib.dll is ONLY used by fc.exe, why - at 275,456 bytes - is it not part
of FC.EXE? What on earth could FC.EXE be doing that requires code of this
size for its entire suite of operations?

Surely there is a vanishingly small possibility that there are so many
FC.EXE instances in parallel that the time and effort taken in loading and
management of a DLL is worthwhile? Even so, wouldn't it be better to use the
standard paging scheme and let the underlying OS take care of the bloat?

If ulib.dll is NOT only used by fc.exe, what other utilities - even
commercially-available applications - will be affected by this fix - which
implies that there is an underlying BUG in the .dll ?

Yes, I know that there is a warning that this patch should ONLY be applied
where there has been a problem found with using FC.EXE; but I note that on
the download page, it appears to be scheduled for inclusion in SP4. This
indicates a bug - which will be manifested where?

Utterly unbelievable!
 