Hi All,
I wonder how to detect whether a file is comma-separated or tab-separated
(given that the input file is one of the two) without looking at the file
extension (i.e. the extension could be just .txt while the content is
structured as TSV/CSV).
The only safe way is to scan the entire file, look for (and count) each
delimiter you want to support, split the lines, and verify the actual
data. You may then need to rescan the file from the start if your default
assumption does not hold. Done this way, you could even support several
different text quoting methods.
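
To make the counting idea concrete, here is a minimal sketch in Python. The
names count_outside_quotes, detect_delimiter and CANDIDATES are made up for
illustration, and it assumes double-quote quoting with no quoted field
spanning several lines; treat it as a starting point, not a finished detector.

```python
CANDIDATES = [",", "\t"]  # assumed set of delimiters to support


def count_outside_quotes(line, delimiter, quote='"'):
    """Count delimiter occurrences that are not inside a quoted field."""
    in_quotes = False
    count = 0
    for ch in line:
        if ch == quote:
            # A doubled quote ("") toggles twice, so the state is preserved.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            count += 1
    return count


def detect_delimiter(lines, candidates=CANDIDATES):
    """Return the single delimiter whose count is the same non-zero
    number on every non-blank line, or None if zero or several qualify."""
    survivors = []
    for delimiter in candidates:
        counts = {
            count_outside_quotes(line, delimiter)
            for line in lines
            if line.strip()
        }
        if len(counts) == 1 and counts != {0}:
            survivors.append(delimiter)
    return survivors[0] if len(survivors) == 1 else None
```

If it returns None, either no candidate fits or more than one does, and you
would have to fall back on verifying the actual data in the fields.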
Depending on the context, if I needed to make a really flexible import
function, I would probably try parsing as much as possible with the
different delimiters simultaneously, until all but one method had failed.
If memory isn't a big problem, just import into internal structures, one
for each delimiter. Once you are sure what kind of file it is, you can get
rid of all the in-memory data and dump the correct version to your database
or whatever you want to do with it.
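
A rough sketch of the in-memory variant, using Python's standard csv module.
The function name parse_with_all and the rule "every row has the same width,
and more than one column" are my own assumptions about what counts as a
successful parse; real data may need stricter checks.

```python
import csv
import io


def parse_with_all(text, candidates=(",", "\t")):
    """Parse the text once per candidate delimiter and keep only the
    tables that look consistent. Returns {delimiter: list_of_rows}."""
    tables = {}
    for delimiter in candidates:
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
        rows = [row for row in reader if row]  # skip blank lines
        widths = {len(row) for row in rows}
        # Assumed plausibility test: one row width, more than one column.
        if len(widths) == 1 and widths != {1}:
            tables[delimiter] = rows
    return tables
```

If exactly one key is left in the result, that table is the one to keep, and
the rest of the in-memory data has already been thrown away.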
If there's a memory constraint, you may need to parse the file in several
passes, one for each delimiter.
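
The multi-pass variant could look roughly like this. It streams the file
once per delimiter and keeps only the expected row width in memory. The
names delimiter_fits and detect_by_passes, the UTF-8 default and the
"same width, more than one column" test are all assumptions on my part.

```python
import csv


def delimiter_fits(path, delimiter, encoding="utf-8"):
    """One streaming pass over the file: True if every non-empty row
    has the same width and that width is greater than one."""
    width = None
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if not row:
                continue  # ignore blank lines
            if width is None:
                width = len(row)
            elif len(row) != width:
                return False  # inconsistent, stop this pass early
    return width is not None and width > 1


def detect_by_passes(path, candidates=(",", "\t")):
    """Run one pass per candidate; return the only delimiter that fits,
    or None if none or several fit."""
    fitting = [d for d in candidates if delimiter_fits(path, d)]
    return fitting[0] if len(fitting) == 1 else None
```

Once the delimiter is known, one more pass does the actual import.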
Oh, and remember that CSV files aren't always comma-separated on Windows.
Depending on the system locale, the comma may be the decimal separator
in numbers, and CSV files will then typically be semicolon-delimited.
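
A small illustration of the semicolon case, with invented sample data.
The helpers row_widths and parse_decimal_comma are hypothetical, and the
number conversion assumes no thousands separators appear in the fields.

```python
import csv
import io

# Invented sample: semicolon-delimited, decimal commas in the price column.
sample = "name;price;qty\nwidget;1,5;3\ngadget;2,25;10\n"


def row_widths(text, delimiter):
    """Return the set of row widths seen when parsing with this delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return {len(row) for row in reader if row}


def parse_decimal_comma(field):
    """Convert a decimal-comma number such as '1,5' to a float.
    Assumes the field contains no thousands separators."""
    return float(field.replace(",", "."))
```

Here splitting on commas gives rows of different widths, while splitting on
semicolons gives three columns on every row, so adding ";" to the list of
candidate delimiters is enough to catch this kind of file.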
Regards,
Hans