Character Set Identification

David Elliott

Given a file of unknown origin, I would like to determine the character set that the file is in.
Such formats would include: ASCII, CPxxxx, ISO-8859-xxxx, UTF-xxxx, etc.

I am NOT interested in file headers or file extensions, JUST the encoding type.

Is there a MS or 3rd Party library for detecting the encoding?

Thanks,
Dave
 
David Elliott said:
Given a file of unknown origin, I would like to determine the
character set that the file is in. Such formats would include: ASCII,
CPxxxx, ISO-8859-xxxx, UTF-xxxx, etc.

I am NOT interested in file headers or file extensions, JUST the encoding type.

Is there a MS or 3rd Party library for detecting the encoding?

There may be, but it would be heuristic at best. Many files will be
valid in many different encodings, for instance.
 
The best you will find is a program that tries to "guess" what the encoding
is.

With the ISO-8859 encodings, the bytes that you find when you read the file
won't help you much "as is", because there is nothing to tell you whether
byte 0xC0 is an accented character in Latin-1 or Latin-2, or a Greek
character, for example. So you have to guess by investigating, for example,
the frequency of bytes in the 0x80-0xFF range, or whether some
French/German/Greek/Russian/etc. words can be recognized.
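
To make the ambiguity concrete, here is a rough Python sketch (the function
names are made up for illustration, and the list of candidate encodings is
just an assumption about what you might want to test). It shows that the same
byte decodes cleanly under several ISO-8859 variants, and it counts the high
bytes that a frequency-based guesser would have to work from.

```python
from collections import Counter

# Hypothetical candidate list: Latin-1, Latin-2 and Greek.
CANDIDATES = ["iso8859_1", "iso8859_2", "iso8859_7"]


def decode_byte_everywhere(byte_value, encodings=CANDIDATES):
    """Decode one byte under each candidate encoding.

    Every candidate accepts the byte, which is exactly why the byte
    alone cannot tell you which encoding the file uses.
    """
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in encodings}


def high_byte_histogram(data):
    """Count how often each byte in the 0x80-0xFF range occurs."""
    return Counter(b for b in data if b >= 0x80)


if __name__ == "__main__":
    for name, char in decode_byte_everywhere(0xC0).items():
        print(name, repr(char))
    print(high_byte_histogram(b"caf\xe9 \xe9t\xe9"))
```

The histogram by itself decides nothing. You would still have to compare it
against typical letter frequencies for each language and encoding you care
about, and that comparison is where the guessing comes in.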

You are in better shape with UTF-8 and UTF-16 (what Windows calls "Unicode")
because there is a short preamble, the byte order mark (3 bytes in UTF-8, 2
in UTF-16), that acts as a signature. The problem is that this preamble is
optional. So, when you get it, there is a very high probability that the
file is UTF-8/UTF-16, but when you don't get it, you don't know. They are
still a bit easier to recognize, though: UTF-16 files have lots of zero
bytes in them if they contain straight ASCII, and in UTF-8 bytes at or above
0x80 never come alone but in sequences of two to four.
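
A minimal Python sketch of those checks, in that order: signature first, then
the zero-byte test, then UTF-8 validity. The function name, the 30% zero-byte
threshold and the "unknown-8bit" label are my own assumptions, not anything
standard, and UTF-32 is deliberately ignored.

```python
import codecs

# Signatures (byte order marks) and the Python codec name to report.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def guess_encoding(data, zero_threshold=0.3):
    """Return a best guess for the encoding of `data` (a bytes object)."""
    # 1. A signature, when present, is the strongest hint.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if not data:
        return "ascii"

    # 2. Mostly-ASCII text in UTF-16 has a zero byte in every pair.
    if data.count(0) / len(data) > zero_threshold:
        zeros_even = data[0::2].count(0)
        zeros_odd = data[1::2].count(0)
        # Little-endian puts the zero high byte second (odd offsets).
        return "utf-16-le" if zeros_odd > zeros_even else "utf-16-be"

    # 3. No bytes at or above 0x80: plain ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"

    # 4. High bytes that form valid multi-byte sequences suggest UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit code page; telling which one needs the guessing
        # described earlier (byte frequencies, word recognition).
        return "unknown-8bit"
```

Something like guess_encoding(open(path, "rb").read()) would be the typical
call. Keep in mind that a file which passes the UTF-8 test is only very
probably UTF-8, since a short Latin-1 file can happen to form valid sequences
too.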

Bruno.
 