How to detect if a text file is ISO8859-1, ISO8859-15, UTF-8 or Unicode encoded

  • Thread starter: Karl Mondale

Karl Mondale

Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
Unicode

Karl
 
Karl Mondale said:
Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
Unicode

Karl

I would check Google or Wikipedia, e.g. here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1. That page explains the code
page in detail. To find out programmatically, you need to read the first
few bytes; the exact method depends on the tool you wish to use.
 
Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
Unicode

You can't really, except by minutely examining the contents and seeing
whether something makes sense in one system but not another.
Even then you may not be sure what the creator intended, and it might
not matter anyway.

In UTF-8, for example, characters in the 7-bit ASCII set are given in a
single byte. (The Unicode code points for those characters are the same as
the 7-bit ASCII encoding, just with zeroes in front.) Other Unicode
characters are expressed in two to four bytes. So if the entire file
consists of 7-bit ASCII characters, the file will be byte-for-byte
identical whether UTF-8 or ASCII was intended.
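That point can be sketched in Python: pure 7-bit content is indistinguishable, and only bytes at 0x80 or above let you rule anything out. The function name and return labels here are mine, purely for illustration:

```python
def classify(data: bytes) -> str:
    """Rough classification of raw bytes; a heuristic, not a guarantee."""
    if all(b < 0x80 for b in data):
        # Pure 7-bit ASCII: identical under ASCII, UTF-8, ISO8859-1/-15.
        return "ascii"
    try:
        data.decode("utf-8")
        # Decodes cleanly as UTF-8 (could still be intended as ISO8859-x,
        # but random single-byte text rarely forms valid UTF-8 sequences).
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8; likely a single-byte encoding like ISO8859-1/-15.
        return "single-byte"
```

For example, `classify("héllo".encode("latin-1"))` returns `"single-byte"`, because the lone 0xE9 byte is not a valid UTF-8 sequence.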
 
The short answer is that you can't always determine the encoding from the
content of a file.

To see why, you can use Notepad to experiment with creating and saving text
as ANSI, Unicode, Unicode Big Endian, and UTF-8. Try pasting in some text
from foreign web pages, as well as plain English text. Looking at the
files in a hex editor, like XVI32, you will see that for all but ANSI,
Notepad prepends a few bytes (called a Byte Order Mark) to indicate the type
of text file. For Unicode, it is the two-byte sequence (hex) FF FE or FE FF,
indicating little-endian or big-endian Unicode respectively; for UTF-8 it is
EF BB BF. Not all applications prepend a BOM. ANSI and your two ISO
encodings always use one byte per character. Notepad's "Unicode" (UTF-16)
uses two bytes per character for most text (four for characters outside the
Basic Multilingual Plane), and UTF-32 uses four bytes per character. UTF-8
uses a variable number of bytes per character (one to four) and can encode
every Unicode character. For saving as ANSI, Notepad complains if all
characters can't be saved as one-byte characters.
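The BOM check described above can be sketched like this; the constants come from Python's standard `codecs` module, and note that UTF-32 LE must be tested before UTF-16 LE, because its BOM starts with the same two bytes:

```python
import codecs

# Standard byte-order marks, longest prefixes first so that
# UTF-32 LE (FF FE 00 00) is not misread as UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: many applications never write one
```

Remember that a missing BOM proves nothing; it only helps when it is there.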

-Paul Randall
 
(e-mail address removed) (Karl Mondale) wrote in
Assume I have a text file. How can I detect if the text inside is encoded in

ISO8859-1
ISO8859-15
UTF-8
Unicode

Karl

That's rather difficult.

ISO8859-1 is almost identical to -15: -15 replaces the generic currency
sign (¤) with the Euro symbol and swaps in a few French and Finnish
letters, with eight byte positions differing in total. The only way to
tell them apart would be to look at the symbols in context.

UTF-8 is identical to ISO8859 for the first 128 ASCII characters which
include all the standard keyboard characters. After that, characters
are encoded as a multi-byte sequence.

Unicode is usually encoded in UTF-16. If you're lucky, there might be
a BOM (Byte Order Mark) of FF FE or FE FF as the first two bytes of the
file. Otherwise, look for a 0x00 (null byte) in every other position,
which is what you get if the text contains only basic 7-bit ASCII
characters.
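That null-byte heuristic can be sketched as follows; it only works under the assumption that the text is mostly ASCII stored as UTF-16, and the function name is mine:

```python
def utf16_guess(data: bytes):
    """Guess UTF-16 endianness from the pattern of null bytes that
    ASCII-only text produces; returns None when the pattern is absent."""
    if len(data) < 2 or len(data) % 2 != 0:
        return None
    # Little-endian: ASCII byte first, null second, e.g. b"h\x00i\x00".
    if all(b == 0 for b in data[1::2]) and all(b != 0 for b in data[0::2]):
        return "utf-16-le"
    # Big-endian: null first, ASCII byte second, e.g. b"\x00h\x00i".
    if all(b == 0 for b in data[0::2]) and all(b != 0 for b in data[1::2]):
        return "utf-16-be"
    return None
```

Text with many non-ASCII characters defeats this heuristic, since those code units have nonzero bytes in both positions.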

HTH,
John
 