Can you please enumerate the mistakes in that article, or point to a
link that clears them up? He does have a disclaimer about the
elementary nature of the article: ....
But I am interested in knowing what parts of it are mistaken/inaccurate.
Here is a quick list.
I might spend some time to clean it up, add references, and make it into a
blog entry. But for most of them the Unicode standard (especially chapter 4)
is the best reference.
Also, I have only commented on incorrect statements, not on simplifications.
"A bit oversimplified" is ok; incorrect is not.
Some of the inaccuracies are benign (it does not matter for a programmer if
a code page is standard ANSI), but some are bad for developers (mixing UTF-16
and UCS-2, confusion about the Unicode ranges, what happens when you convert
from Unicode to a code page, or what kind of text wchar_t or UTF-7 can
represent).
================================
<joel>
Eventually this OEM free-for-all got codified in the ANSI standard.
In the ANSI standard, everybody agreed on what to do below 128,
which was pretty much the same as ASCII, but there were lots of
different ways to handle the characters from 128 and on up,
depending on where you lived.
These different systems were called code pages.
So for example in Israel DOS used a code page called 862, while Greek users
used 737.
</joel>
Not true.
The code pages mentioned are even today called "OEM code pages"; they are not
(and never have been) "ANSI".
The "ANSI code page" for Hebrew would be 1255, and 1253 for Greek.
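To make the OEM/ANSI distinction concrete, here is a quick illustration
(Python codec names used only for convenience): the OEM and "ANSI" Hebrew
code pages give the same letter different byte values.

```python
# The Hebrew letter alef (U+05D0) has different byte values in the
# OEM code page (862) and in the Windows "ANSI" code page (1255).
alef = "\u05d0"
print(alef.encode("cp862").hex())   # 80
print(alef.encode("cp1255").hex())  # e0
```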
And even the so called "ANSI code pages" used by Windows today are not part
of any ANSI standard.
They are modeled after iso-8859. Only one of them (iso-8859-1) was an ANSI
draft (it never made it into a final standard), and the ones used for other
languages have never been considered by ANSI.
Even replacing ANSI with ISO, the above section is still wrong (the OEM code
pages are not in ANSI, and not in ISO)
http://blogs.msdn.com/michkap/archive/2005/02/06/368081.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx
================================
<joel>
Unicode was a brave effort to create a single character set that included
every
reasonable writing system on the planet and some make-believe ones like
Klingon, too.
</joel>
Even today (in 2010), 7 years after this article, Klingon is still not part
of Unicode.
Proposals to encode it have been rejected several times, and it is not on the
roadmap for the foreseeable future (Unicode version 6.0).
================================
<joel>
There is no real limit on the number of letters that Unicode can define and
in fact they
have gone beyond 65,536 so not every unicode letter can really be squeezed
into two bytes,
but that was a myth anyway.
</joel>
There is a very clear limit. The Unicode code points are in the 0-10FFFF
range.
From that big range you can subtract the surrogate areas, reserved
characters, noncharacters, etc.
So the final count is a bit above 1.1 million. A lot, but "no real limit" is
incorrect.
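The arithmetic is easy to check; a quick sketch (Python used only as a
calculator):

```python
total = 0x110000              # code points U+0000..U+10FFFF = 1,114,112
surrogates = 0xE000 - 0xD800  # 2,048 code points reserved for UTF-16 surrogates
noncharacters = 66            # U+FDD0..U+FDEF plus the last two code points
                              # of each of the 17 planes (32 + 34)
assignable = total - surrogates - noncharacters
print(assignable)             # 1111998 -- "a bit above 1.1 million"
```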
http://www.unicode.org/faq/utf_bom.html#gen0
================================
<joel>
The traditional store-it-in-two-byte methods are called UCS-2 (because it has
two bytes) or UTF-16 (because it has 16 bits),
</joel>
Wrong! UTF-16 is surrogate aware and can represent the whole Unicode range,
from U+0000 to U+10FFFF.
UCS-2 is not surrogate aware and it is limited to the U+0000 - U+FFFF range
(or BMP, Basic Multilingual Plane)
Two completely different beasts!
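A quick demonstration of the difference (Python shown for brevity): a
character outside the BMP becomes a surrogate pair in UTF-16, which a plain
UCS-2 implementation cannot interpret as a single character.

```python
# U+1F600 lies outside the BMP, so UTF-16 must use a surrogate pair.
s = "\U0001f600"
print(s.encode("utf-16-be").hex())  # d83dde00 -> high surrogate D83D + low surrogate DE00
# A UCS-2-only implementation sees two separate 16-bit units here,
# neither of which is a valid standalone character.
```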
================================
<joel>
Encodings
</joel>
This whole section conflates two different concepts: "encoding forms" and
"encoding schemes".
The UTFs are encoding forms, and only define ways of representing Unicode
code points as "code units" of certain sizes (1/2/4 bytes). They don't deal
with serializing those units as bytes, so there is nothing to say about
little/big-endian. That is the realm of the "encoding schemes".
Chapter 4 of the standard would be a good read.
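To see where byte order actually enters the picture (sketched in Python):
the encoding form gives you 16-bit code units, and only the encoding scheme
(UTF-16BE, UTF-16LE, or UTF-16 with a BOM) decides how those units become
bytes.

```python
s = "A"  # one 16-bit code unit, 0x0041
print(s.encode("utf-16-be").hex())  # 0041 -- big-endian encoding scheme
print(s.encode("utf-16-le").hex())  # 4100 -- little-endian encoding scheme
# The plain "utf-16" scheme prepends a BOM (U+FEFF) and uses the
# platform's native byte order, so its output varies by machine.
print(s.encode("utf-16").hex())
```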
================================
<joel>
you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C
U+006F) in ASCII,
or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several
hundred encodings
that have been invented so far, with one catch: some of the letters might not
show up!
</joel>
Not true. All characters in "Hello" are available in all OEM and ANSI (in
fact, ISO) code pages.
In fact, a few paragraphs above Joel himself says "Specifically, Hello, which
was U+0048 U+0065 U+006C U+006C U+006F,
will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored
in ASCII, and ANSI,
and every OEM character set on the planet."
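Easy to verify: the bytes for "Hello" are identical in ASCII and in the
OEM/ANSI code pages mentioned (Python codec names used for illustration).

```python
s = "Hello"
for codec in ("ascii", "cp737", "cp862", "cp1255"):
    # All of these code pages agree with ASCII below 128.
    assert s.encode(codec) == b"\x48\x65\x6c\x6c\x6f"
print(s.encode("ascii").hex())  # 48656c6c6f
```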
================================
<joel>
If there's no equivalent for the Unicode code point you're trying to
represent in the encoding
you're trying to represent it in, you usually get a little question mark: ?
or, if you're really
good, a box.
</joel>
The result has nothing to do with how good you are, and it is not
standardized in any way.
In fact, what "the right thing" is depends a lot on what you need to do.
It depends on how you call the code conversion APIs (most of them have flags
specifying what to do if characters can't be represented in the target code
page).
You might get question marks, you might get "best fit" characters, or a
developer-specified character.
Almost never "a box" (in fact, most code pages don't even have a box-like
character).
If you see "a box", it is most likely that the string is still Unicode, not
damaged in any way,
but you are using the wrong font.
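As an example of that developer choice, Python's codecs expose it through the
errors parameter (the Win32 equivalent would be the flags and default-char
arguments of WideCharToMultiByte):

```python
# U+2603 (SNOWMAN) does not exist in the Hebrew ANSI code page 1255.
s = "shalom \u2603"
print(s.encode("cp1255", errors="replace"))  # b'shalom ?'  -- substitute character
print(s.encode("cp1255", errors="ignore"))   # b'shalom '   -- silently dropped
# errors="strict" (the default) raises UnicodeEncodeError instead.
```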
================================
<joel>
UTF 7, 8, 16, and 32 all have the nice property of being able to store any
code point correctly.
</joel>
Actually, UTF-7 can only represent characters in the U+0000 - U+FFFF range
(BMP).
Since Unicode is U+0000 - U+10FFFF, this is definitely not "any" code point.
Slightly related: UTF-7 (unlike UTF-8, 16, 32) is not part of the Unicode
standard, and it is not recommended.
================================
<joel>
we decided to do everything internally in UCS-2 (two byte) Unicode, which is
what Visual Basic,
COM, and Windows NT/2000/XP use as their native string type.
</joel>
At least XP is surrogate-aware, so it uses UTF-16, not UCS-2 (and so does
COM on XP).
================================
<joel>
In C++ code we just declare strings as wchar_t ("wide char") instead of char
and use
the wcs functions instead of the str functions (for example wcscat and wcslen
instead
of strcat and strlen).
</joel>
This is Windows-centric. On most other platforms wchar_t is 4 bytes, so
UTF-32.
And on a few other platforms wchar_t is not even Unicode.
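This is easy to check; Python's ctypes, for instance, reports the platform's
wchar_t size:

```python
import ctypes

# Typically 2 on Windows (UTF-16 code units) and 4 on most
# Unix-like systems (UTF-32).
print(ctypes.sizeof(ctypes.c_wchar))
```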
======================================================