Al Dunbar said:
While it seems reasonable to say that the result should be consistent
between runs, one could also say that, for consistency's sake, one should
only use tools in the manner and for the purpose they were designed and
intended.
Right. On my system FC /? indicates that /A means to display only the
first and last lines of each set of differences, whereas /L is said to
compare files as ASCII text.
Frankly, the distinction escapes me. /A may or may not be ASCII - I believe
that it's assumed to be ASCII on the grounds that /B is Binary. Perhaps it
is not so.
To reproduce the line-before and the line-after (which appears to be the
documented /A behaviour, and also the default) it would seem that FC
interprets the file - let's assume it's a text file for simplicity's sake -
in a line-oriented manner. So what is the distinction between /L and /A?
Also, the documentation and behaviour appear difficult to reconcile. What is
really meant by "first and last lines for each set of differences?" For
instance, if we have the sequence SSDDSDDDS (Same/Different lines) then FC
appears to show SDDSDDDS, starting at the second "Same" line and ending with
the fourth. It could be argued that SDDS is one "set" and SDDDS is a second
"set" hence the third Same line should be reproduced both as the last line
of the first "set of differences" and also as the first of the second "set
of differences."
And if it is LINES that are being compared, what is the difference between
/A and /L mode? Both are line-oriented.
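The observed grouping can at least be modelled. The Python sketch below is purely a hypothetical model of the behaviour described above, not FC's actual algorithm: it treats a difference "set" as running from the matching line before the first differing line to the matching line after the last, and merges adjacent sets unless they are separated by at least `resync` matching lines (the threshold of 2 is my assumption, chosen to reproduce the SSDDSDDDS observation).

```python
def report_windows(flags, resync=2):
    """Model the 'first and last lines of each set of differences' rule.

    flags  -- list of 'S' (same) / 'D' (different), one per line pair
    resync -- assumed number of consecutive matching lines needed before
              FC closes one set and starts another (hypothetical)
    Returns (start, end) index pairs of the reported windows.
    """
    windows, i, n = [], 0, len(flags)
    while i < n:
        if flags[i] != 'D':
            i += 1
            continue
        start, end = max(i - 1, 0), i        # include the Same line before
        j, same_run = i + 1, 0
        while j < n:
            if flags[j] == 'D':
                same_run, end = 0, j
            else:
                same_run += 1
                if same_run >= resync:       # resynchronised: close the set
                    break
            j += 1
        windows.append((start, min(end + 1, n - 1)))  # Same line after
        i = j
    return windows

# With a threshold of 2, SSDDSDDDS yields one merged window covering
# indices 1..8 (SDDSDDDS) -- matching the behaviour observed above.
print(report_windows(list("SSDDSDDDS")))            # [(1, 8)]
# With a threshold of 1 it would instead split into SDDS and SDDDS,
# the third Same line appearing in both windows, as argued above.
print(report_windows(list("SSDDSDDDS"), resync=1))  # [(1, 4), (4, 8)]
```

Under this model the "second Same line reproduced in both sets" reading and the observed merged output differ only in the assumed resync threshold.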
[behaviour with "non-ASCII" files]
That depends on whether or not that is what everybody wants it to do.
Since we, as a species, have been unable to provide a rock-solid
definition of what a text file is (see
http://www.google.ca/search?hl=en&q=define:+text+file&meta=&aq=f&oq=),
we can hardly complain when this lack of clarity results in anomalies...
Much heartache is caused by the assumption that what is "standard" in
Redmond is some variety of universal standard. The screaming banshee to
which I've referred would claim that I was lying when the SAME file was sent
to two different printers and produced different results. The fact that the
printers were attached to different machines using different OSs and
different drivers, being in her claimed area of expertise, was beyond her
comprehension. The fact that one printer was set to a UK character set and
another to a US character set, so that H'23' was rendered either as the
pound-currency symbol or as the octothorpe, was of no consequence: the files
had to be different, because the results as processed by her perfect
self-correcting creation were different.
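The character-set point is easy to demonstrate: in US-ASCII, byte 0x23 is the octothorpe '#', while the UK national variant of ISO 646 (BS 4730) reassigns that same code point to the pound sign. A quick Python sketch - the manual mapping is mine, since Python ships no BS 4730 codec:

```python
raw = b"Total: \x23 42"          # one byte stream, H'23' in the middle

us_text = raw.decode("ascii")    # US character set renders 0x23 as '#'

# Model the UK variant by hand: only the 0x23 -> pound-sign swap is
# shown here; BS 4730 reassigns a handful of other positions too.
uk_text = us_text.replace("\x23", "\u00a3")

print(us_text)   # Total: # 42
print(uk_text)   # Total: £ 42
```

Same bytes, two renderings - which is exactly the two-printers situation.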
I agree with you almost completely. But I come to the conclusion that, for
the most part, the "assumptions" are valid for most of the files that we
generally consider to be "text files". Show me a couple of "text files"
that fc/a does not compare properly, and I would argue that they are so
extreme in some way that I would not consider them "text files". One of
the definitions found by google is this: "A file that contains characters
organized into one or more lines. The lines must not contain null
characters and none can exceed the maximum line length allowed by the
implementation." Ah, the implementation. In this case FC would be the
implementation, would it not?
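Taking FC as the implementation, that quoted definition is small enough to check mechanically. A sketch in Python, where the 80-character limit and the CRLF line ending are my assumptions about an early FC, not anything documented:

```python
def is_text_file(data: bytes, max_line_len=80, eol=b"\r\n"):
    """Test bytes against the quoted definition: characters organized
    into lines, no null characters, no line over the implementation's
    maximum (80 and CRLF are assumed, not taken from FC's docs)."""
    if b"\x00" in data:
        return False                       # "must not contain null characters"
    return all(len(line) <= max_line_len   # "none can exceed the maximum"
               for line in data.split(eol))

print(is_text_file(b"hello\r\nworld"))   # True
print(is_text_file(b"bad\x00byte"))      # False
print(is_text_file(b"x" * 100))          # False: one 100-character "line"
```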
I'd suspect that the earliest FC implementations were assembler, oriented
toward 80-column data. It would seem unreasonable in that environment to
produce a report wider than 80 columns, given the peripherals commonly in
use at the time. Even had the output of FC been sent to a file and typed,
word-wrapping on an 80-column screen would have been tedious and difficult
to interpret.
Also, in those days 7-bit ASCII was de rigueur. A few control characters were
used - CR,LF,FF,TAB - but the others had little relevance to the printed
document. Were the "high-ASCII" characters graphics or special characters
used in non-English alphabets? Unicode was way in the future...
As techniques have moved away from these earlier ideas, so the definitions
have become more fuzzy.
But, that said, what is your definition of a text file, and is that the
authoritative definition? I mean, if there is no general agreement on a
definition, then how can it be said that the assumptions made were
incorrect?
Aye, that's the nub of the problem. I believe text files were originally
assumed to be 7-bit ASCII, organised as "lines" terminated by a CRLF
sequence. "Lines" could be up to 80 characters long.
But each of these "requirements" is rubbery. "80" characters could be 132 -
the common printer width for 15" printers. Or 164 or so, in
compressed-print, or more with proportional-print, or more if the "text" was
data not meant to be printed. 7-bit could be expanded to take care of
accented characters, etc.
In the end, it becomes a meaningless, yet surprisingly commonly-used term.
If the line length is not limited, and the character-set processed is not
limited, then what is the difference between "ASCII" and "Binary?" It
becomes simply a binary-compare in blocks delimited by the arbitrary CRLF
sequence. What "authority" is going to impose a line-length or character-set
limit - and remember that there will always be the dissenters who want "just
a few more characters" or "oh - and this character, too."
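That reduction can be stated precisely: strip away the limits and an "ASCII" compare is nothing more than a binary compare on CRLF-delimited blocks. A minimal Python sketch of the claim:

```python
import os

def ascii_compare(a: bytes, b: bytes) -> bool:
    """With no line-length or character-set limit, a line-oriented
    compare is just a binary compare on CRLF-delimited blocks."""
    return a.split(b"\r\n") == b.split(b"\r\n")

def binary_compare(a: bytes, b: bytes) -> bool:
    return a == b

# Splitting on CRLF is reversible (joining recovers the bytes), so the
# two compares can never disagree -- spot-check on random byte blobs:
for _ in range(1000):
    a, b = os.urandom(16), os.urandom(16)
    assert ascii_compare(a, b) == binary_compare(a, b)
print("ASCII-as-CRLF-blocks compare == binary compare")
```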
I suppose it's one of those things that slowly drifts. What is a "database"
for instance? Is it that set of data that is controlled by an instance of a
DBMS (which is what I'd tend to use)? Or is it the entirety of data owned,
as some would have it? Or perhaps it is some random subset of that data, as
others would claim? I was surprised by a prompt from one DBMS that asked
whether I wanted to "format the database" when in fact just one table in its
own individual file had to be altered. Formatting something over 200 tables
seemed to be a little overkill...
Or perhaps like the Screaming Banshee to which I have referred who insisted
(despite allegedly holding a degree and therefore presumably having been
taught at least some of the terminology) on calling a fixed-column format
file a "tab-deliminated file" (not "tab-delimited," I suspect specifically
to annoy me) despite the fact that it didn't actually contain tab
characters...
But I see no indication here that FC gives non-identical results on
identical input. FC A B may give different results from FC B A, however, I
would suggest that, by definition, and from the point of view of FC, the
input is therefore NOT identical. I would also suggest that FC A B appears
to ALWAYS give the same results as itself, and that the same goes for FC B
A. Reading ahead, you seem to suggest that this may not be true. I'll
address that further down-thread.
Hmm. If A and B are two separate files with identical contents, then FC
processes the files differently depending on their names. This indicates that
FC's output does not depend WHOLLY on the contents of the files examined.
What guarantee is there then, that there are no other circumstances when the
NAMES of the files will influence the outcome of FC's processing?
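The order-dependence itself isn't unique to FC: any differ labels its output by argument position, so swapping the operands changes the report even though the same/different verdict is symmetric. Python's difflib - used here only as a stand-in, since FC's internals aren't available - shows the shape of it:

```python
import difflib

a = ["one\n", "two\n", "three\n"]
b = ["one\n", "TWO\n", "three\n"]

fwd = list(difflib.unified_diff(a, b, "A", "B"))
rev = list(difflib.unified_diff(b, a, "B", "A"))

# The verdict is symmetric: both directions report a difference...
print(bool(fwd), bool(rev))              # True True
# ...but the reports are not: swapping the arguments swaps the +/- roles.
print("-two\n" in fwd, "+two\n" in rev)  # True True
```

So "FC A B differs from FC B A" is expected; what would be alarming is the verdict itself changing with the file names.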
Could it be that FC simply fails to fail? (and from there, why?)
OK, let's put this into context: which software packages would you propose
as having absolutely no flaws?
Well - anything I've written, naturally
IEFBR14? Oh wait - that was a single-instruction program that turned out to
have a bug...
I believe that there are bugs in most programs. The difference is the
willingness of the constructor to correct them. This varies from case to
case. In my case, as an old-time small-h hacker (in the original sense of
the word) I spend a lot of time perfecting code - even when it's only a
theoretical possibility that the fault will occur. It's called pride in work
or professionalism.
On the other hand, a well-known accounting-system manufacturer released a
"localised" software version that printed dates in dd/mm/yy format
EXCEPT for one particular report - and then point-blank REFUSED to issue a
corrective patch. What worries me here is the technical competence involved.
The very fact that this occurred indicates that their code does not use a
centralised convert-date-to-display-format routine, contributing to
winbloat. What other poor coding practices have they used in rushing to
market, and that they are prepared to cover up and deny?
Again, the fail-to-fail scenario. At the company I've referred to on many
occasions, their previous programmers had incorrectly implemented a
price-loading formula (costing the company money) and had also incorrectly
calculated tax payable - for a period of six years. Company management
insisted on the first one being corrected, but that the second be ignored. A
few months later, the tax department reacted to a customer's complaint and
insisted that the faulty tax calculation be fixed. No pride from management
in doing the right thing, leading to problems. Believe me, they couldn't
AFFORD to have a proper tax investigation of their affairs....
You have basically demonstrated that, like most of the world's known
software, FC is not completely flawless. But you err by anthropomorphizing
its motives ;-)
....But I'm intrigued by its apparent capacity to remember. And it seems that
the little silicon monsters have you convinced they're not out to take
over...
If the writers of FC could have anticipated our expectations, perhaps they
would have had its help text explain its limitations in terms of line
length, number of lines, size, or whatever other assumptions might have
been implicit at the time. I would also have suggested to them that if
either or both of the files turned out to exceed the specifications of its
definition of a "text file", that it should return an indicator that the
files are not identical text files.
You mean - meaningful documentation?
In that case, the files were strictly 7-bit ASCII, limited to 80-character
records and even in 8.3-named files, but the pathnames were LFN-style. The
only "hiccough" was that they were mapped to servers. Perhaps there was a
problem with the mapping mechanism, or perhaps it was an FC weirdness. I
could not properly investigate the matter because SB defined her system as
flawless (despite its periodic crashes) and any such problems were
"obviously" due to using obsolete software to do something that didn't need
to be done anyway ('coz she said it didn't need to be done - not that she
actually knew what was being done, it just didn't need to be done because it didn't.
She'd decided...)
But as for FC - I'd like it to have a /Q option to suppress output and
simply set ERRORLEVEL, as a simple same/different indication is often all
that is required. Now that brings up /W - should /W simply define any
sequence of one or more TABs as equivalent to one or more SPACES, or should
the TABs be expanded out as spaces to - well, columns of 8, 5, 4 or 2 (I've
seen all defined as "standard" in different environments) - or perhaps go
back to the old typewriter standard of "wherever you want them this time"?
What about tabs/spaces appearing as trailing whitespace? In that case, one
or more spaces/tabs would match zero or more spaces/tabs before CRLF.
Perhaps we need a switch for this, too. Trailing whitespace can cause much
grief on a batch "SET var=" line...
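The two readings of /W can be made concrete. In the Python sketch below (the tab stop of 8 is just one of the several "standards" mentioned, picked arbitrarily), `squeeze_ws` treats any run of tabs/spaces as equivalent and ignores trailing whitespace, while `expand_tabs` resolves tabs to column stops first - and the two readings can disagree on the same pair of lines:

```python
import re

def squeeze_ws(line: str) -> str:
    """Reading 1: any run of one or more TABs/SPACEs matches any other
    such run, and trailing whitespace before CRLF is ignored entirely."""
    return re.sub(r"[ \t]+$", "", re.sub(r"[ \t]+", " ", line))

def expand_tabs(line: str, stop=8) -> str:
    """Reading 2: expand TABs to column stops first (8 assumed here)."""
    return line.expandtabs(stop)

a, b = "x\ty", "x y"
print(squeeze_ws(a) == squeeze_ws(b))    # True: runs are equivalent
print(expand_tabs(a) == expand_tabs(b))  # False: the columns differ

# Trailing whitespace -- the "SET var=" grief:
print(squeeze_ws("set v=1 \t") == squeeze_ws("set v=1"))  # True
```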
Possibly even a /i switch for a case-insensitive version...