Marty said:
Our web app lets users upload tab-delimited files to the site, where
the content is parsed & loaded to a database. Aside from looking for
tab characters, control/linefeed characters, what would be a good way
to detect if the user uploaded a binary file?
There is no hard rule for what a binary file is, other than the
definition in your file spec. However, there are some traditional
guidelines:
If the file contains no characters below ASCII code 32 (apart from the
usual tab, CR, and LF) and none above 127, then it's generally
considered a text file. Otherwise it's binary or "garbage."
In other words, your field type defines what is valid. If you don't
expect something, then it's garbage.
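
As a rough illustration of "the field type defines what is valid," here is a minimal Python sketch. The two-column schema (a numeric id and a printable-ASCII name) and the names FIELD_PATTERNS and invalid_fields are made up for the example; substitute whatever your own file spec says.

```python
import re

# Hypothetical schema for illustration only: one (column name, pattern)
# pair per expected tab-delimited field.
FIELD_PATTERNS = [
    ("id", re.compile(r"[0-9]+")),          # digits only
    ("name", re.compile(r"[\x20-\x7e]*")),  # printable ASCII only
]


def invalid_fields(line):
    """Return the names of the fields in one line that fail their pattern."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELD_PATTERNS):
        # A wrong number of fields is treated as garbage as well.
        return ["<field count>"]
    return [
        name
        for (name, pattern), value in zip(FIELD_PATTERNS, values)
        if not pattern.fullmatch(value)
    ]
```

An empty list means the line matched the spec; anything else names the fields to complain about.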
In binary file transfer protocols (e.g., ZMODEM), you generally escape
characters that would interfere with the protocol. In the ASCII
transfer protocol, you have to escape "terminal" control characters,
especially FF, ESC, XON, and XOFF. For example, a text file with form
feeds uploaded via the ASCII protocol contains a control character but
is still considered a "text file."
But you are doing a web upload, which is a binary file transfer with
no escaping, and you will have a Content-Type header too if you set
the encoding correctly on the form tag.
So the first thing is to make sure the content type is not one that is
considered binary, like an image, zip, exe, etc., and that it says
text/plain, perhaps.
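
A minimal sketch of that first check in Python, assuming you can get at the raw Content-Type value for the uploaded part. The function name and the lists of accepted and rejected types are illustrative assumptions, not a complete registry, so adjust them to what your users actually send.

```python
# Illustrative lists only; extend them to match your own policy.
TEXT_TYPES = {"text/plain", "text/tab-separated-values"}
BINARY_PREFIXES = ("image/", "audio/", "video/")
BINARY_TYPES = {"application/zip", "application/x-msdownload"}


def classify_content_type(header_value):
    """Return "text", "binary", or "unknown" for a Content-Type value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in TEXT_TYPES:
        return "text"
    if media_type.startswith(BINARY_PREFIXES) or media_type in BINARY_TYPES:
        return "binary"
    # Anything else (including application/octet-stream) is undecided.
    return "unknown"
```

A "binary" result can be rejected outright. "text" and "unknown" only mean the upload has not been ruled out yet, since the header is supplied by the client.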
After that, your only real option is to check for control characters
(ASCII codes less than 32) and possibly ASCII codes > 127 as you parse
it. I think you have to do this anyway, because you should not trust a
content type that says it's text/plain. If it is given as an image
type, then you don't need to bother with the parser; just reject it.
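
The byte check could look something like the following Python sketch. It assumes the upload is available as a binary file-like object, allows tab, CR, and LF, and treats DEL (127) and everything above it as unexpected. Those choices and the name first_bad_offset are my assumptions; loosen the allowed set if your spec permits non-ASCII text.

```python
import io

# Assumption: printable ASCII (32..126) plus tab, CR, and LF are the only
# bytes a valid tab-delimited upload may contain.
ALLOWED = bytes(range(32, 127)) + b"\t\r\n"


def first_bad_offset(stream, chunk_size=8192):
    """Return the offset of the first unexpected byte, or None if clean."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None
        # translate() with a delete set strips every allowed byte, so any
        # leftover means this chunk holds at least one unexpected byte.
        if chunk.translate(None, ALLOWED):
            for i, byte in enumerate(chunk):
                if byte not in ALLOWED:
                    return offset + i
        offset += len(chunk)
```

For example, first_bad_offset(io.BytesIO(b"ab\x00cd")) returns 2, and a clean tab-delimited file returns None. Reading in chunks lets you reject a large binary upload without loading it all into memory first.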