Can you please enumerate the mistakes in that article, or point to a
link that clears them up? He does have a disclaimer about the
elementary nature of the article: ....
But I am interested in knowing what parts of it are mistaken/inaccurate.
Here is a quick list.
I might spend some time to clean it up, add references, and make it into a
blog entry. But for most of them the Unicode standard (especially chapter 4)
is the best reference.
Also, I have only commented on incorrect statements, not on simplifications.
"A bit oversimplified" is ok; incorrect is not.
Some of the inaccuracies are benign (it does not matter for a programmer if
a code page is standard ANSI), but some are bad for developers (mixing UTF-16
and UCS-2, confusion about the Unicode ranges, what happens when you convert
from Unicode to a code page, or what kind of text wchar_t or UTF-7 can
represent).
================================
<joel>
Eventually this OEM free-for-all got codified in the ANSI standard.
In the ANSI standard, everybody agreed on what to do below 128,
which was pretty much the same as ASCII, but there were lots of
different ways to handle the characters from 128 and on up,
depending on where you lived.
These different systems were called code pages.
So for example in Israel DOS used a code page called 862, while Greek users
used 737.
</joel>
Not true.
The code pages mentioned are even today called "OEM code pages"; they are not
(and never have been) "ANSI".
The "ANSI code page" for Hebrew would be 1255, and 1253 for Greek.
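To make the OEM/ANSI distinction concrete, here is a quick illustration
(Python codec names used only for convenience): the OEM and "ANSI" Hebrew
code pages give the same letter different byte values.

```python
# The Hebrew letter alef (U+05D0) has different byte values in the
# OEM code page (862) and in the Windows "ANSI" code page (1255).
alef = "\u05d0"
print(alef.encode("cp862").hex())   # 80
print(alef.encode("cp1255").hex())  # e0
```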
And even the so called "ANSI code pages" used by Windows today are not part
of any ANSI standard.
They are modeled after iso-8859. Only one of them (iso-8859-1) was an ANSI
draft (it never made it into a final standard), and the ones used for other
languages have never been considered by ANSI.
Even replacing ANSI with ISO, the above section is still wrong (the OEM code
pages are not in ANSI, and not in ISO)
http://blogs.msdn.com/michkap/archive/2005/02/06/368081.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx
================================
<joel>
Unicode was a brave effort to create a single character set that included
every
reasonable writing system on the planet and some make-believe ones like
Klingon, too.
</joel>
Even today (in 2010), 7 years after this article, Klingon is still not part
of Unicode.
Proposals to encode it have been rejected several times, and it is not on the
roadmap for the foreseeable future (Unicode version 6.0).
================================
<joel>
There is no real limit on the number of letters that Unicode can define and
in fact they
have gone beyond 65,536 so not every unicode letter can really be squeezed
into two bytes,
but that was a myth anyway.
</joel>
There is a very clear limit. The Unicode code points are in the 0-10FFFF
range.
From that big range you can subtract the surrogate areas, reserved
characters, noncharacters, etc.
So the final count is a bit above 1.1 million. A lot, but "no real limit" is
incorrect.
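The arithmetic is easy to check; a quick sketch (Python used only as a
calculator):

```python
total = 0x110000              # code points U+0000..U+10FFFF = 1,114,112
surrogates = 0xE000 - 0xD800  # 2,048 code points reserved for UTF-16 surrogates
noncharacters = 66            # U+FDD0..U+FDEF plus the last two code points
                              # of each of the 17 planes (32 + 34)
assignable = total - surrogates - noncharacters
print(assignable)             # 1111998 -- "a bit above 1.1 million"
```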
http://www.unicode.org/faq/utf_bom.html#gen0
================================
<joel>
The traditional store-it-in-two-byte methods are called UCS-2 (because it has
two bytes) or UTF-16 (because it has 16 bits),
</joel>
Wrong! UTF-16 is surrogate aware and can represent the whole Unicode range,
from U+0000 to U+10FFFF.
UCS-2 is not surrogate aware and it is limited to the U+0000 - U+FFFF range
(or BMP, Basic Multilingual Plane)
Two completely different beasts!
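A quick demonstration of the difference (Python shown for brevity): a
character outside the BMP becomes a surrogate pair in UTF-16, which a plain
UCS-2 implementation cannot interpret as a single character.

```python
# U+1F600 lies outside the BMP, so UTF-16 must use a surrogate pair.
s = "\U0001f600"
print(s.encode("utf-16-be").hex())  # d83dde00 -> high surrogate D83D + low surrogate DE00
# A UCS-2-only implementation sees two separate 16-bit units here,
# neither of which is a valid standalone character.
```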
================================
<joel>
Encodings
</joel>
This whole section conflates two different concepts: "encoding forms" and
"encoding schemes".
The UTFs are encoding forms, and only define ways of representing Unicode
code points as "code units" of certain sizes (1/2/4 bytes). They don't deal
with serializing those units as bytes, so there is nothing to say about
little/big-endian. That is the realm of the "encoding schemes".
Chapter 4 of the standard would be a good read.
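To see where byte order actually enters the picture (sketched in Python):
the encoding form gives you 16-bit code units, and only the encoding scheme
(UTF-16BE, UTF-16LE, or UTF-16 with a BOM) decides how those units become
bytes.

```python
s = "A"  # one 16-bit code unit, 0x0041
print(s.encode("utf-16-be").hex())  # 0041 -- big-endian encoding scheme
print(s.encode("utf-16-le").hex())  # 4100 -- little-endian encoding scheme
# The plain "utf-16" scheme prepends a BOM (U+FEFF) and uses the
# platform's native byte order, so its output varies by machine.
print(s.encode("utf-16").hex())
```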
================================
<joel>
you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C
U+006F) in ASCII,
or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several
hundred encodings
that have been invented so far, with one catch: some of the letters might not
show up!
</joel>
Not true. All characters in "Hello" are available in all OEM and ANSI (in
fact, ISO) code pages.
In fact, a few paragraphs above Joel himself says "Specifically, Hello, which
was U+0048 U+0065 U+006C U+006C U+006F,
will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored
in ASCII, and ANSI,
and every OEM character set on the planet."
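Easy to verify: the bytes for "Hello" are identical in ASCII and in the
OEM/ANSI code pages mentioned (Python codec names used for illustration).

```python
s = "Hello"
for codec in ("ascii", "cp737", "cp862", "cp1255"):
    # All of these code pages agree with ASCII below 128.
    assert s.encode(codec) == b"\x48\x65\x6c\x6c\x6f"
print(s.encode("ascii").hex())  # 48656c6c6f
```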
================================
<joel>
If there's no equivalent for the Unicode code point you're trying to
represent in the encoding
you're trying to represent it in, you usually get a little question mark: ?
or, if you're really
good, a box.
</joel>
The result has nothing to do with how good you are, and it is not
standardized in any way.
In fact, what "the right thing" is depends a lot on what you need to do.
It depends on how you call the code conversion APIs (most of them have flags
specifying what to do if characters can't be represented in the target code
page).
You might get question marks, you might get "best fit" characters, or a
developer-specified character.
Almost never "a box" (in fact, most code pages don't even have a box-like
character).
If you see "a box", it is most likely that the string is still Unicode, not
damaged in any way,
but you are using the wrong font.
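As an example of that developer choice, Python's codecs expose it through the
errors parameter (the Win32 equivalent would be the flags and default-char
arguments of WideCharToMultiByte):

```python
# U+2603 (SNOWMAN) does not exist in the Hebrew ANSI code page 1255.
s = "shalom \u2603"
print(s.encode("cp1255", errors="replace"))  # b'shalom ?'  -- substitute character
print(s.encode("cp1255", errors="ignore"))   # b'shalom '   -- silently dropped
# errors="strict" (the default) raises UnicodeEncodeError instead.
```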
================================
<joel>
UTF 7, 8, 16, and 32 all have the nice property of being able to store any
code point correctly.
</joel>
Actually, UTF-7 can only represent characters in the U+0000 - U+FFFF range
(BMP).
Since Unicode is U+0000 - U+10FFFF, this is definitely not "any" code point.
Slightly related: UTF-7 (unlike UTF-8, 16, 32) is not part of the Unicode
standard, and it is not recommended.
================================
<joel>
we decided to do everything internally in UCS-2 (two byte) Unicode, which is
what Visual Basic,
COM, and Windows NT/2000/XP use as their native string type.
</joel>
At least XP is surrogate-aware, so it uses UTF-16, not UCS-2 (and so does
COM on XP).
================================
<joel>
In C++ code we just declare strings as wchar_t ("wide char") instead of char
and use
the wcs functions instead of the str functions (for example wcscat and wcslen
instead
of strcat and strlen).
</joel>
This is Windows-centric. On most other platforms wchar_t is 4 bytes, so
UTF-32.
And on a few other platforms wchar_t is not even Unicode.
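This is easy to check; Python's ctypes, for instance, reports the platform's
wchar_t size:

```python
import ctypes

# Typically 2 on Windows (UTF-16 code units) and 4 on most
# Unix-like systems (UTF-32).
print(ctypes.sizeof(ctypes.c_wchar))
```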
======================================================