How to determine stream type?

  • Thread starter: Kaki

Kaki

Given a file, how do I know if it's ASCII, Unicode, or binary? And how
do I know if it's RTF, HTML, etc.? In other words, how do I find the
stream type or MIME type?
(No, file extension cannot be the answer)

Thanks
 
Given a file, how do I know if it's ascii or unicode or binary? And how
do I know if it's rtf or html or etc? In other words, how do I find the
stream type or mime type?
(No, file extension cannot be the answer)

There's no way of doing it, basically. A stream is just a sequence of
bytes, and it's perfectly possible to have a stream of bytes which is a
valid document when viewed from more than one perspective (e.g. a text
file in two different encodings).
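
To make that ambiguity concrete, here is a minimal sketch in Python (chosen purely for illustration; the thread itself concerns .NET). The same two bytes decode without error as both UTF-8 and Latin-1 but produce different text, so the bytes alone cannot say which reading was intended.

```python
# The same byte sequence is a valid document in more than one encoding.
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")      # decodes to a single character
as_latin1 = data.decode("latin-1")  # decodes to two characters

# Both decodes succeed, so neither can be ruled out from the bytes alone.
print(repr(as_utf8), repr(as_latin1))
```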
 
Kaki said:
Given a file, how do I know if it's ascii or unicode or binary ?
And how do I know if it's rtf or html or etc? In other words,
how do I find the stream type or mime type?

(No, file extension cannot be the answer)

Only large-system operating systems such as VMS [DEC / Compaq] and MVS [IBM]
make any formal distinction between file types. In these systems there are
even physical differences between file types in so far as they are stored
differently, and are accessed with different code routines.

Under operating systems such as DOS / Windows-family, and *NIX / Linux, a
'file' is merely a named, persistent collection of bytes. The only way to
tell whether a file contains data that is to be interpreted as text or as
binary is by adherence to some convention, such as file extension usage
[e.g. '.txt' indicates a text file], or schemes such as searching files for
'magic numbers' [i.e. byte sequences known to uniquely identify file types],
an approach heavily used in the *NIX / Linux world [those systems also
distinguish things like sockets and devices at the operating system level,
but this hardly helps in identifying file types].
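
The 'magic number' scheme can be sketched roughly as follows, in Python for illustration only. The signature table and the name `sniff_magic` are assumptions made for this example, not part of any library, and the table is deliberately tiny. A match is a hint, not proof, since any file may happen to begin with these bytes.

```python
# Illustrative signature table; real tools such as file(1) know far more.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"{\\rtf", "application/rtf"),
]


def sniff_magic(head: bytes):
    """Return a guessed MIME type for the leading bytes, or None."""
    for signature, mime in MAGIC_NUMBERS:
        if head.startswith(signature):
            return mime
    return None
```

In use, you would read the first few bytes of the file in binary mode (for instance `open(path, "rb").read(16)`) and pass them to `sniff_magic`.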

Thus, the answer is: there is no way of guaranteeing what a file's 'type'
actually is. All you can do is adhere to some convention, and hope that
everyone else follows suit. When attempting to access a particular file you
would check to ensure that the data read in conforms to the expected pattern
/ format for that file type.

For example, an HTML file could be expected to contain a <HTML> tag
somewhere near the start of the file, while many proprietary file formats
[e.g. MS Excel, Word etc] would sport a byte collection known as a 'header'
containing 'fields' with version information and the like. If, in reading
such files, the expected tags are found, or 'sensible' values for each
field are read in, then you can be reasonably sure [though not absolutely
certain] that the 'correct' file type has been accessed.
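
A rough Python sketch of both checks, for illustration only. The helper names are hypothetical. The 8-byte signature is that of the OLE2 compound file container used by older binary Word and Excel documents; it is my addition, not something stated in the post. Validating individual header fields for 'sensible' values would go further than this sketch does.

```python
import re

# Signature of the OLE2 compound file container (older .doc / .xls files).
OLE2_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")

# An <html ...> tag or an HTML doctype, matched case-insensitively.
HTML_HINT = re.compile(rb"<\s*html\b|<!doctype\s+html", re.IGNORECASE)


def looks_like_html(head: bytes) -> bool:
    """True if an HTML tag or doctype appears within the first 1024 bytes."""
    return HTML_HINT.search(head[:1024]) is not None


def looks_like_ole2(head: bytes) -> bool:
    """True if the data begins with the OLE2 container signature."""
    return head.startswith(OLE2_SIGNATURE)
```

Even when one of these returns True, the result only says the file is plausibly of that type; a text file that merely discusses HTML would also pass the first check.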

Note that I made no mention of 'streams' which are nothing more than
program objects that are temporarily connected or linked to file(s) for
purposes of file data access / updating. Now, it might be possible for such
objects to report information about the file, or the current connection /
linkage status. However, when first establishing a link to a
specified file, such objects can merely make the checks mentioned earlier to
ascertain the 'correctness' of the file.

I'm not sure this is the type of response you were after, but the rather
general nature of your query seemed to warrant it. Additionally, it is the
type of issue that transcends any one programming language / environment.

I hope this helps.

Anthony Borla
 
Anthony Borla said:
Under operating systems such as DOS / Windows-family, and *NIX / Linux, a
'file' is merely a named, persistent collection of bytes

Actually that's not true - a file has other attributes under all of the
above. Under Windows a file may be read-only, or hidden, with various
security attributes. Under NT-based systems it may also have alternate
"streams" (not to be confused with the .NET concept of a stream) which
may give additional information. Some Linux file-systems have metadata
too.

A plain Stream in .NET terms, however, has none of this - that really
*is* just a sequence of bytes. Derived types may add more information,
as you've said.
 
Kaki said:
Given a file, how do I know if it's ascii or unicode or binary? And how
do I know if it's rtf or html or etc? In other words, how do I find the
stream type or mime type?
(No, file extension cannot be the answer)

Thanks

Although it's not possible to be certain, enough tests should allow you to
figure out what it is (within a limited domain). There is a function[1],
FindMimeFromData, that comes with Internet Explorer and can test for 26
different types[2], according to the docs. It's not perfect, but it's the
safest bet you have.
As for Unicode/ASCII differentiation, unless you find byte order marks and
are reasonably sure it's text, not binary, it's not possible to say. Above
all, you should do your best to keep track of the type upon loading, but
these should allow you to do some very basic checks.

[1] http://msdn.microsoft.com/library/d...iker/reference/functions/findmimefromdata.asp
[2] http://msdn.microsoft.com/library/d...op/networking/moniker/overview/appendix_a.asp
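
The byte order mark check mentioned above might look something like this in Python (illustration only; `detect_bom` is a hypothetical helper, while the `codecs.BOM_*` constants are from the standard library). Note that the UTF-32 marks are tested first, because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Longest marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head: bytes):
    """Return the encoding named by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A result of None does not mean the data is ASCII or binary; it only means no mark was present, which is the common case for plain ASCII and much UTF-8 text.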
 
Hopefully we'll see this potentially nice feature in framework v1.2 and
beyond...

I hadn't really considered the issue, but I do side with the original poster
in that there SHOULD be a common code base that can determine the type of a
stream. And since MIME is becoming a convenient standard, so be it.


--
Eric Newton
C#/ASP Application Developer
http://ensoft-software.com/

Daniel O'Connell said:
Although it's not possible to be certain, enough tests should allow you to
figure out what it is (within a limited domain). There is a function,
FindMimeFromData, that comes with Internet Explorer and can test for 26
different types, according to the docs. It's not perfect, but it's the
safest bet you have.
As for Unicode/ASCII differentiation, unless you find byte order marks and
are reasonably sure it's text, not binary, it's not possible to say. Above
all, you should do your best to keep track of the type upon loading, but
these should allow you to do some very basic checks.
 
Eric Newton said:
Hopefully we'll see this potentially nice feature in framework v1.2 and
beyond...

I hadn't really considered the issue, but I do side with the original poster
in that there SHOULD be a common code base that can determine the type of a
stream. And since MIME is becoming a convenient standard, so be it.

It has its ups, but it is still, unfortunately, mostly a guess. Outside of
creating standard formats (for example, an XML document that had a <format>
tag), this will always be a guess, and bad luck could result in an incorrect
detection.
I suspect that it should be fairly trivial to get a good guess between image
formats, SGML-derived, XML and other text formats, and perhaps other RIFF
type objects, but more complicated, proprietary binary formats are probably
out of the question. Also, text encoding is an issue because, with the
exception of some forms of Unicode, there is no marker, only text data.
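
Since unmarked text carries no encoding label, about the best one can do is a heuristic. Here is a minimal Python sketch (illustration only; `classify_text` and its three labels are assumptions made for this example). It treats NUL bytes as a sign of non-text data and uses a strict UTF-8 decode as a plausibility test, which is a guess and nothing more.

```python
def classify_text(data: bytes) -> str:
    """Rough guess at the nature of unmarked data: 'ascii', 'utf-8' or 'unknown'."""
    if b"\x00" in data:
        # NUL bytes are rare in plain 8-bit text; likely binary or UTF-16/32.
        return "unknown"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Could be Latin-1, some other code page, or binary; we cannot tell.
        return "unknown"
    # Valid UTF-8: pure 7-bit content is indistinguishable from ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    return "utf-8"
```

Legacy 8-bit encodings all land in 'unknown' here, which matches the point that, without a marker, there is only text data to go on.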

However, a managed implementation would be of value, especially if you could
plug in your own recognizers. Even if it's not provided in the 1.2/2.0
framework, it is something an independent developer could write.
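
One possible shape for such a pluggable design, sketched in Python for illustration rather than as managed code. Every name here (`TypeSniffer`, `register`, `guess`, the two sample recognizers) is hypothetical and does not correspond to any framework API.

```python
from typing import Callable, List, Optional

# A recognizer inspects the leading bytes and returns a MIME type or None.
Recognizer = Callable[[bytes], Optional[str]]


class TypeSniffer:
    """Tries registered recognizers in order and returns the first match."""

    def __init__(self) -> None:
        self._recognizers: List[Recognizer] = []

    def register(self, recognizer: Recognizer) -> None:
        self._recognizers.append(recognizer)

    def guess(self, head: bytes, default: str = "application/octet-stream") -> str:
        for recognizer in self._recognizers:
            result = recognizer(head)
            if result is not None:
                return result
        return default


def recognize_png(head: bytes) -> Optional[str]:
    return "image/png" if head.startswith(b"\x89PNG\r\n\x1a\n") else None


def recognize_xml(head: bytes) -> Optional[str]:
    # Only recognizes documents that start with an XML declaration.
    return "text/xml" if head.lstrip().startswith(b"<?xml") else None


sniffer = TypeSniffer()
sniffer.register(recognize_png)
sniffer.register(recognize_xml)
```

Registration order doubles as priority, so more specific recognizers should be registered before more permissive ones. The result is still only a guess.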
 