Unicode, encodings, and Asian languages: need some help.

  • Thread starter: apprentice

apprentice

Hello,

I'm writing a class library that I imagine people from different countries
might be interested in using, so I'm considering what needs to be provided
to support foreign languages, including Asian languages (Chinese, Japanese,
Korean, etc.).

First of all, strings will be passed to my class methods, and depending on
the language (and on the encoding) some of them might contain characters
that require more than a single byte.
Since I have to cycle through each byte composing each char of an input
string, how does .NET guarantee that the string is broken up correctly into
its component chars based on the string's language??? In other words, how
does .NET identify the correct "boundary" for each char (which bytes are part
of each char) based on the string's language??? Also, in which encoding are
strings initially held in memory??? Does this encoding depend on the culture
set for the current thread or does it maybe depend on the encoding for the
system's current ANSI code page??? Is there a way to set the encoding that
.NET should be using for strings so that when cycling through the characters
in the string, bytes are correctly assigned to each char based on the
string's language???


Regards,
Bob Rock
 
All chars and strings are Unicode 16 (or UTF-16): each char requires two
bytes. More specifically, this is not full UTF-16 but the subset called UCS-2,
because it excludes extended characters (those that require three or four
bytes). Of course, you can store and display extended characters in a UCS-2
string, but comparisons (lexically lesser or greater) won't work
correctly (i.e., these extended characters won't be taken into account
correctly).

The notions of culture, localization and code page are mainly taken into
account when .NET must communicate (reading or writing something) with the
external world.
 
Sorry to contradict, but it really is UTF-16.
I think the confusion comes from the old NT, which was indeed UCS-2.
And in fact all of Win 2000 was UTF-16.
As with most things, Windows improved. NT had no clue about UTF-16
(normal, since UTF-16 did not exist at the time :-).
W2K was better, WXP is even better, but still not perfect.

So, .NET is UTF-16. Maybe not perfect, but those are bugs :-)
 
Yet another sorry to contradict, the "Unicode" used in .NET v1.1 is UTF-8
(not sure in .NET v2.0).

See the "Unicode" page on Wikipedia to get a clear idea of the consequences
involved, but the following quote gives you the basic differences:
Unicode defines two mapping methods:

a. the UTF (Unicode Transformation Format) encodings
b. the UCS (Universal Character Set) encodings
The encodings include:

a. UTF-7: a relatively unpopular 7-bit encoding, often considered
obsolete
b. UTF-8: an 8-bit, variable-width encoding
c. UCS-2: a 16-bit, fixed-width encoding that only supports the BMP
d. UTF-16: a 16-bit, variable-width encoding
e. UCS-4 and UTF-32: functionally identical 32-bit fixed-width
encodings
f. UTF-EBCDIC: an unpopular encoding intended for EBCDIC based
mainframe systems
 
I like it when people use posts to provide any answer while carefully avoiding
the questions asked ... and often even end up diverting the
entire thread. I wonder, is it because to become or remain an MCP you must
guarantee a certain number of posts per month .... and any post will
contribute to that number???

Anyway, here I go again.
I'll have strings in various languages (including East Asian languages)
passed to my class methods and I need to run an algorithm on the bytes that
compose each char in the strings (char based on the string's language).

Here are my questions:

1) How does the .NET Framework know how to appropriately assign bytes to
chars??? How does the Framework identify the correct "boundary" for each
char (what bytes are part of each char) based on the string's language???
2) Is there a way to set the encoding that .NET should be using for strings
so that when cycling through the characters in the string (look at the code
below), bytes are correctly assigned to each char based on the string's
language???
3) In what encoding are strings kept in memory??? Does this encoding depend
on the culture set for the current thread or does it maybe depend on the
encoding for the system's current ANSI code page???
4) Having a string in Chinese (simplified or traditional), in Japanese or in
Korean passed into my methods, would the following code be enough to
guarantee that ch always corresponds to a *full* char in the specified
language:

foreach(char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // execute algorithm on byte b
    }
}



Bob Rock





 
Lau Lei Cheong said:
Yet another sorry to contradict, the "Unicode" used in .NET v1.1 is UTF-8
(not sure in .NET v2.0).

No, that's just not true - and nothing that you posted gave any
evidence for it.

From the docs (admittedly for 2.0, but this hasn't changed) for String:

<quote>
Each Unicode character in a string is defined by a Unicode scalar
value, also called a Unicode code point or the ordinal (numeric) value
of the Unicode character. Each code point is encoded using UTF-16
encoding, and the numeric value of each element of the encoding is
represented by a Char object.
</quote>

Similarly from the docs for System.Char:

<quote>
The .NET Framework uses the Char structure to represent Unicode
characters. The Unicode Standard identifies each Unicode character with
a unique 21-bit scalar number called a code point, and defines the UTF-
16 encoding form that specifies how a code point is encoded into a
sequence of one or more 16-bit values. Each 16-bit value ranges from
hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure.
The value of a Char object is its 16-bit numeric (ordinal) value.
</quote>
 
Anyway, here I go again.
I'll have strings in various languages (including East Asian languages)
passed to my class methods and I need to run an algorithm on the bytes that
compose each char in the strings (char based on the string's language).

Here are my questions:

1) How does the .NET Framework know how to appropriately assign bytes to
chars??? How does the Framework identify the correct "boundary" for each
char (what bytes are part of each char) based on the string's language???

Strings don't have languages. All strings are stored in UTF-16.
2) Is there a way to set the encoding that .NET should be using for strings
so that when cycling through the characters in the string (look at the code
below), bytes are correctly assigned to each char based on the string's
language???

The conversion between bytes and strings is performed by the Encoding
classes.
3) In what encoding are strings kept in memory??? Does this encoding depend
on the culture set for the current thread or does it maybe depend on the
encoding for the system's current ANSI code page???

As has been specified, UTF-16.
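
To see the UTF-16 storage for yourself, a minimal sketch like the following can dump the 16-bit values that make up a string. The class and method names are made up for illustration. A character inside the BMP shows up as one char; a character above U+FFFF shows up as two (a surrogate pair), whatever culture or code page the machine uses:

```csharp
using System;

static class Utf16Demo
{
    // Returns the UTF-16 code units of a string as space-separated hex.
    public static string CodeUnits(string text)
    {
        string[] parts = new string[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            parts[i] = ((int)text[i]).ToString("X4");
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        Console.WriteLine(CodeUnits("A"));           // 0041
        Console.WriteLine(CodeUnits("\uD55C"));      // D55C (a Hangul syllable, in the BMP)
        Console.WriteLine(CodeUnits("\U00020000"));  // D840 DC00 (a CJK ideograph outside the BMP)
    }
}
```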
4) Having a string in Chinese (simplified or traditional), in Japanese or in
Korean passed into my methods, would the following code be enough to
guarantee that ch always corresponds to a *full* char in the specified
language:

foreach(char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // execute algorithm on byte b
    }
}

That wouldn't do anything about surrogate characters. If you really
care about those (and I didn't *think* that any natural language
characters were in the surrogate range, although I could be wrong) you
might be interested in my Utf32String class:

http://www.pobox.com/~skeet/csharp/miscutil
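
If surrogates do matter, one possible way to walk a string by whole Unicode code points without a helper library is sketched below. It relies on char.IsHighSurrogate, char.IsLowSurrogate and char.ConvertToUtf32 from the framework; the class and method names are hypothetical, and unpaired surrogates are simply passed through as-is, which may not be what a real library wants:

```csharp
using System;
using System.Collections.Generic;

static class CodePointSketch
{
    // Returns the Unicode code points of a string, joining surrogate pairs.
    public static List<int> CodePoints(string text)
    {
        List<int> result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                result.Add(char.ConvertToUtf32(text[i], text[i + 1]));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                result.Add(text[i]); // BMP char (or a lone surrogate, kept as-is)
            }
        }
        return result;
    }

    static void Main()
    {
        string text = "A\uD55C\U00020000";
        Console.WriteLine(text.Length);             // 4 UTF-16 code units
        Console.WriteLine(CodePoints(text).Count);  // 3 code points
    }
}
```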
 
Hello Jon,

let me try to clarify some of my statements.

Jon Skeet said:
Strings don't have languages. All strings are stored in UTF-16.

You are right. What I was trying to ask is whether the .NET Framework is able
to somehow guess (e.g. through a statistical analysis) the natural language of
a string, thus getting an appropriate Encoding, or whether more simply it gets
the appropriate Encoding instance by doing something like
Encoding.GetEncoding(system_ansi_code_page)??? In fact, if you take a look
at the Encoding class the Default property does exactly that. I guess that
strings are encoded using that default Encoding instance.
The conversion between bytes and strings is performed by the Encoding
classes.

Yes, that I already knew. However, in a cycle such as the following, am I
guaranteed that each char handed to me is exactly a char in the string's
natural language??? I wonder, how can .NET correctly break up the string into
its natural language chars???

foreach(char ch in text.ToCharArray())
{
    // break up ch in bytes
}
As has been specified, UTF-16.

I was hoping that the Framework would somehow allow me to specify the code
page to use to get the correct Encoding instance. That would probably
guarantee that the cycle above behaves as I need (correctly breaking up the
string into its natural language chars).
4) Having a string in Chinese (simplified or traditional), in Japanese or in
Korean passed into my methods, would the following code be enough to
guarantee that ch always corresponds to a *full* char in the specified
language:

foreach(char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // execute algorithm on byte b
    }
}

That wouldn't do anything about surrogate characters. If you really
care about those (and I didn't *think* that any natural language
characters were in the surrogate range, although I could be wrong) you
might be interested in my Utf32String class:

That is exactly the knowledge I'm after. Does any natural language
(especially Asian languages such as Chinese, Japanese, Korean or Vietnamese)
require more than the 2 bytes provided by .NET???

Thanks. I'll take a look at it.
 
apprentice said:
You are right. What I was trying to ask is whether the .NET Framework is able
to somehow guess (e.g. through a statistical analysis) the natural language of
a string, thus getting an appropriate Encoding, or whether more simply it gets
the appropriate Encoding instance by doing something like
Encoding.GetEncoding(system_ansi_code_page)??? In fact, if you take a look
at the Encoding class the Default property does exactly that. I guess that
strings are encoded using that default Encoding instance.

No, strings are always stored internally as UTF-16. All Unicode
characters can be represented in UTF-16, using surrogate pairs for
Unicode characters above U+FFFF. There's no "natural language" of a
string - it's always stored in UTF-16.

Now, if you're talking about converting to and from bytes when (say)
reading from a file, that's a different matter - and it depends on what
API you're using. Most default to a UTF-8 encoding (making
Encoding.Default a really bad name) but allow you to specify an
encoding.

Once the string has been read in, however, there is no trace of which
encoding was used to convert the bytes to chars.
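
A small sketch of that last point, assuming nothing beyond the built-in Encoding.UTF8 and Encoding.Unicode (UTF-16 little-endian) instances: the same text produces different bytes in different encodings, but once decoded the two strings are indistinguishable. The class and method names are invented for the example:

```csharp
using System;
using System.Text;

static class RoundTripSketch
{
    // Decodes bytes with the encoding the caller says they were written in.
    public static string Decode(byte[] data, Encoding encoding)
    {
        return encoding.GetString(data);
    }

    static void Main()
    {
        string original = "\uD55C\uAE00"; // two Hangul syllables, both in the BMP

        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        byte[] utf16 = Encoding.Unicode.GetBytes(original);
        Console.WriteLine(utf8.Length);   // 6 (three bytes per character)
        Console.WriteLine(utf16.Length);  // 4 (two bytes per character)

        string fromUtf8 = Decode(utf8, Encoding.UTF8);
        string fromUtf16 = Decode(utf16, Encoding.Unicode);
        Console.WriteLine(fromUtf8 == fromUtf16); // True: no trace of the source encoding
    }
}
```

Decoding with the wrong encoding is where things go wrong, which is why the encoding has to be known at the point where bytes enter or leave the program.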
Yes, that I already knew. However, in a cycle such as the following, am I
guaranteed that each char handed to me is exactly a char in the string's
natural language??? I wonder, how can .NET correctly break up the string into
its natural language chars???

foreach(char ch in text.ToCharArray())
{
    // break up ch in bytes
}

Again, there is no concept of "natural language char". In your code
snippet (which creates a char array unnecessarily, btw - you can just
use foreach (char ch in text)) each char is a UTF-16 code unit. If you
want to convert that text data into bytes, you need to explicitly use
an encoding.
I was hoping that the Framework would somehow allow me to specify the code
page to use to get the correct Encoding instance. That would probably
guarantee that the cycle above behaves as I need (correctly breaking up the
string into its natural language chars).

It's not at all clear what the ultimate goal is. What is the larger
picture here?
That is exactly the knowledge I'm after. Does any natural language
(especially Asian languages such as Chinese, Japanese, Korean or Vietnamese)
require more than the 2 bytes provided by .NET???

http://www.jbrowse.com/text/ suggests that it should be okay:

<quote>
There are enough code points (without using surrogates, see below) to
represent all the characters commonly in use in Japan, China and Korea
</quote>

http://www.unicode.org/roadmaps/index.html gives a pretty good
indication of what's likely to be in each of the "planes" (BMP, or
plane 0, is what can be handled without surrogates).

In general, http://www.unicode.org is the authority on all these
matters - if you want to know whether a given character is covered,
look there.
 
BTW, the algo I posted was missing the line of code where I get the correct
encoding instance based on the natural language.
Here is the full snippet:


this._encoding = Encoding.GetEncoding(naturalLanguageCodePage);

foreach(char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // algorithm that works on byte b
    }
}



Bob Rock
 
apprentice said:
BTW, the algo I posted was missing the line of code where I get the correct
encoding instance based on the natural language.
Here is the full snippet:

this._encoding = Encoding.GetEncoding(naturalLanguageCodePage);

foreach(char ch in text.ToCharArray())
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // algorithm that works on byte b
    }
}

That's unlikely to be useful - dealing with individual bytes doesn't
make nearly as much sense as dealing with a character at a time. What
is your algorithm meant to do?
 
Ok, let's start over. I'll try my best to be clear in my intent.

1) I will have strings in different languages passed to my classes (examples
of such languages might be Chinese, Japanese, Korean or Vietnamese)

2) I need to operate on the bytes that compose each char. For my algo to
work correctly the string must be broken up correctly into chars ... and I
mean chars as they would be understood in the string's language (e.g.
Chinese, Japanese, Korean or Vietnamese).

3) I imagine that when a string (let's suppose it is in Korean) is passed in
as a parameter to one of my methods, it will be taken into memory encoded in
an encoding based on the system's current ANSI code page (that which is
returned by the Encoding.Default property). I also imagine that when I run a
piece of code such as the following, the string will be broken into chars
based on the encoding ... which might not be the correct one for the
string's language (Korean).

foreach(char ch in myString)
{
    // code that operates on ch
}

4) I thought that if I could specify the correct encoding (as I do below)
the cycle would however work correctly:

Encoding koreanEncoding = Encoding.GetEncoding(codePageForKorean);

5) Not being able to do so, what do I have to expect from the following
code? Will it correctly break up the string into chars whatever language it
might be in (again Chinese, Japanese, Korean or Vietnamese)?

this._encoding = Encoding.GetEncoding(languageCodePage);
foreach(char ch in myString)
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // algorithm that works on byte b
    }
}


All of the above is really to answer a single question: how many bytes do I
have to expect to be used by .NET Framework strings to represent a single
character of an Asian language such as Chinese, Japanese or Korean? I thought
that 2 bytes would do, but then how could several languages be encoded in the
same stream????

I read the section "Overview and Description" of this article
https://www.microsoft.com/globaldev/getWR/steps/wrg_codepage.mspx and I'm
still confused. If 2 bytes are not enough to encode characters from several
Asian languages in the same stream, how many bytes are used to represent a
single character in a .NET string???


Bob Rock
 
apprentice said:
Ok, let's start over. I'll try my best to be clear in my intent.

1) I will have strings in different languages passed to my classes (examples
of such languages might be Chinese, Japanese, Korean or Vietnamese)

Right. All of these will be UTF-16 encoded, as that's what .NET uses
for strings.
2) I need to operate on the bytes that compose each char. For my algo to
work correctly the string must be broken up correctly into chars ... and I
mean chars as they would be understood in the string's language (e.g.
Chinese, Japanese, Korean or Vietnamese).

There are any number of encodings which *can* encode the string,
but there's no such thing as "the string's language". There's no such
thing as "Japanese encoding" or "Korean encoding".

What is this algorithm meant to do, anyway?
3) I imagine that when a string (let's suppose it is in Korean) is passed in
as a parameter to one of my methods, it will be taken into memory encoded in
an encoding based on the system's current ANSI code page (that which is
returned by the Encoding.Default property).

No, that's not true. It will be encoded in UTF-16, but that's mostly
transparent.
I also imagine that when I run a
piece of code such as the following, the string will be broken into chars
based on the encoding ... which might not be the correct one for the
string's language (Korean).

Again, that's not true. You will be given the sequence of UTF-16 code
units which make up the string.
4) I thought that if I could specify the correct encoding (as I do below)
the cycle would however work correctly:

Encoding koreanEncoding = Encoding.GetEncoding(codePageForKorean);

Well, that will allow you to get the string encoded in that particular
encoding, but whether or not that's what you really need, I don't know
- I'd have to know more about what your algorithm is really meant to
do.
5) Not being able to do so, what do I have to expect from the following
code? Will it correctly break up the string into chars whatever language it
might be in (again Chinese, Japanese, Korean or Vietnamese)?

this._encoding = Encoding.GetEncoding(languageCodePage);
foreach(char ch in myString)
{
    byte[] bytes = this._encoding.GetBytes(new char[]{ch});
    foreach(byte b in bytes)
    {
        // algorithm that works on byte b
    }
}

All of the above is really to answer a single question: how many bytes do I
have to expect to be used by .NET Framework strings to represent a single
character of an Asian language such as Chinese, Japanese or Korean? I thought
that 2 bytes would do, but then how could several languages be encoded in the
same stream????

It depends on what encoding you use. A single .NET framework char can
always be represented in 2 bytes, but some Unicode characters are
composed of a surrogate pair - two characters together. Note that with
your code above, you'd get each half of the surrogate pair separately.

However, with UTF-8 not all .NET chars are represented in 2 bytes -
anything above U+07FF is represented as 3 bytes.
I read the section "Overview and Description" of this article
https://www.microsoft.com/globaldev/getWR/steps/wrg_codepage.mspx and I'm
still confused. If 2 bytes are not enough to encode characters from several
Asian languages in the same stream, how many bytes are used to represent a
single character in a .NET string???

I couldn't see anything there saying that 2 bytes aren't enough to
encode Asian language characters. Could you quote the section that
worries you? I suspect you'll find all the natural language characters
are encoded in the BMP so you don't need to worry about surrogates -
but I wouldn't like to swear to it.
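
To make the "it depends on the encoding" point concrete, here is a small sketch that counts the bytes a single character needs in UTF-8 and in UTF-16. In UTF-8, characters up to U+07FF take at most two bytes, the rest of the BMP takes three, and characters above U+FFFF take four; in UTF-16 a BMP character takes two bytes and a surrogate pair takes four. The class and method names are made up for the example:

```csharp
using System;
using System.Text;

static class ByteCountSketch
{
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    static void Main()
    {
        // One character per sample; the last one lies outside the BMP.
        string[] samples = { "A", "\u07FF", "\u0800", "\u4E2D", "\U00020000" };
        foreach (string s in samples)
        {
            Console.WriteLine("U+{0:X4}: UTF-8 {1} bytes, UTF-16 {2} bytes",
                char.ConvertToUtf32(s, 0), Utf8Bytes(s), Utf16Bytes(s));
        }
    }
}
```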
 
What is this algorithm meant to do, anyway?

Well, it is a simple RTF library. For Asian languages the RTF specification
seems to expect 2-byte encoded characters and requires each byte to be
escaped if it is below character code 0x20 or above
0x80. That is why I initially thought of breaking up any Asian-language
string into its component characters, getting the bytes and doing the
escaping if required. But this is really not necessary. Having received the
string, I will get its bytes (based on the correct encoding) and will escape
them as required. In fact I can probably handle these Asian characters using
the \u control word without even having to get to the character bytes.
I couldn't see anything there saying that 2 bytes aren't enough to
encode Asian language characters. Could you quote the section that
worries you? I suspect you'll find all the natural language characters
are encoded in the BMP so you don't need to worry about surrogates -
but I wouldn't like to swear to it.

This is it:

......
Each Asian character is represented by a pair of code points (thus
double-byte). For programming awareness, a set of points are set aside to
represent the first byte of the set and are not valued unless they are
immediately followed by a defined second byte. DBCS meant that you had to
write code that would treat these pair of code points as one, and this still
disallowed the combining of say Japanese and Chinese in the same data
stream, because depending on the codepage the same double-byte code points
represent different characters for the different languages.

In order to allow for the storage of different languages in the same data
stream, Unicode was created. This one "codepage" can represent 64000+
characters and now with the introduction of surrogates it can represent
1,000,000,000+ characters. The use of Unicode in Windows 2000 allows for
easier creation of World-Ready code, because you no longer have to worry
about which codepage you are addressing, nor whether you had to group
character points to represent one character.
......

It looks as if, to handle for example characters coming from different
(Asian) languages, 2 bytes are not enough. So, I imagine that there might be
situations when surrogate pairs are indeed necessary.


Bob Rock
 
There are any number of encodings which *can* encode the string,
but there's no such thing as "the string's language". There's no such
thing as "Japanese encoding" or "Korean encoding".

I never meant to say that there are Japanese or Korean encodings. But still,
the Encoding instances you get from the following 2 statements are not the
same, so there are indeed *Japanese*- and *Korean*-specific encodings:

Encoding jEnc = Encoding.GetEncoding(932); // 932 = Japanese code page
Encoding kEnc = Encoding.GetEncoding(949); // 949 = Korean code page


Bob Rock
 
apprentice said:
Well, it is a simple RTF library. For Asian languages the RTF specification
seems to expect 2-byte encoded characters and requires each byte to be
escaped if it is below character code 0x20 or above
0x80. That is why I initially thought of breaking up any Asian-language
string into its component characters, getting the bytes and doing the
escaping if required. But this is really not necessary. Having received the
string, I will get its bytes (based on the correct encoding) and will escape
them as required. In fact I can probably handle these Asian characters using
the \u control word without even having to get to the character bytes.

I'm not entirely sure what you mean by "character bytes" but that
sounds broadly correct - but you need to make sure that whatever
encoding you use is the one the RTF reader is going to use too. That's
the crucial bit of information.
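
For the \u route, a rough sketch could look like the following. It assumes the RTF convention that \uN takes a signed 16-bit decimal value (so values above 32767 are written as negative numbers) and is followed by one fallback character for readers that don't understand \u; both points should be checked against the RTF specification version you target. The class and method names are hypothetical, each UTF-16 code unit (including each half of a surrogate pair) gets its own \u, and control characters such as line breaks would need their own handling (e.g. \par) in a real writer:

```csharp
using System;
using System.Globalization;
using System.Text;

static class RtfEscapeSketch
{
    // Escapes text for an RTF body without needing any code page.
    public static string Escape(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char ch in text)
        {
            if (ch == '\\' || ch == '{' || ch == '}')
            {
                sb.Append('\\').Append(ch);      // RTF special characters
            }
            else if (ch >= 0x20 && ch < 0x80)
            {
                sb.Append(ch);                   // plain ASCII passes through
            }
            else
            {
                // Assumption: \uN is a signed 16-bit decimal, then a '?' fallback.
                short signed = unchecked((short)ch);
                sb.Append("\\u")
                  .Append(signed.ToString(CultureInfo.InvariantCulture))
                  .Append('?');
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Escape("a{b}"));    // a\{b\}
        Console.WriteLine(Escape("\u4E2D"));  // \u20013?
        Console.WriteLine(Escape("\uD55C"));  // \u-10916?
    }
}
```

Because this works on the UTF-16 values directly, text from several languages can sit in the same document without the caller having to name a code page for each string.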
This is it:

.....
Each Asian character is represented by a pair of code points (thus
double-byte). For programming awareness, a set of points are set aside to
represent the first byte of the set and are not valued unless they are
immediately followed by a defined second byte. DBCS meant that you had to
write code that would treat these pair of code points as one, and this still
disallowed the combining of say Japanese and Chinese in the same data
stream, because depending on the codepage the same double-byte code points
represent different characters for the different languages.

In order to allow for the storage of different languages in the same data
stream, Unicode was created. This one "codepage" can represent 64000+
characters and now with the introduction of surrogates it can represent
1,000,000,000+ characters. The use of Unicode in Windows 2000 allows for
easier creation of World-Ready code, because you no longer have to worry
about which codepage you are addressing, nor whether you had to group
character points to represent one character.
.....

It looks as if, to handle for example characters coming from different
(Asian) languages, 2 bytes are not enough.

I don't see how you infer that from the above. There are certainly
characters for which 2 bytes aren't enough, but I don't see any
indication above that Asian languages fall into that category.
So, I imagine that there might be
situations when surrogate pairs are indeed necessary.

Don't imagine - look at the code charts on http://www.unicode.org
 
Jon Skeet said:
I'm not entirely sure what you mean by "character bytes" but that
sounds broadly correct - but you need to make sure that whatever
encoding you use is the one the RTF reader is going to use too. That's
the crucial bit of information.

A single Asian character is represented using multiple bytes ... those bytes
are what I called "character bytes".
I don't see how you infer that from the above. There are certainly
characters for which 2 bytes aren't enough, but I don't see any
indication above that Asian languages fall into that category.

If I have in the same stream characters coming from several (Asian) languages,
2 bytes are not enough since ... "depending on the codepage the same
double-byte code points represent different characters for the different
languages."
 
apprentice said:
If I have in the same stream characters coming from several (Asian) languages,
2 bytes are not enough since ... "depending on the codepage the same
double-byte code points represent different characters for the different
languages."

In that case you'll need to pick an encoding which contains all the
characters you need. UTF-16 may well be the encoding of choice here.

However, as I said before, the crucial thing is to work out what the
reader is going to expect. Are you actually able to dictate which
encoding is used, is it specified somewhere, or does the reader have to
guess?
 
Jon Skeet said:
In that case you'll need to pick an encoding which contains all the
characters you need. UTF-16 may well be the encoding of choice here.

However, as I said before, the crucial thing is to work out what the
reader is going to expect. Are you actually able to dictate which
encoding is used, is it specified somewhere, or does the reader have to
guess?

The developer will have to specify the correct code page for each string
that he/she inputs so that I can encode the string correctly. I wanted to
support different languages in the same document. I should be able to do it
easily.
 
Jon Skeet said:
No, that's just not true - and nothing that you posted gave any
evidence for it.

From the docs (admittedly for 2.0, but this hasn't changed) for String:

<quote>
Each Unicode character in a string is defined by a Unicode scalar
value, also called a Unicode code point or the ordinal (numeric) value
of the Unicode character. Each code point is encoded using UTF-16
encoding, and the numeric value of each element of the encoding is
represented by a Char object.
</quote>
For the .NET v1.1 documentation, I verified that it says the same as above:
ms-help://MS.MSDNQTR.2005JUL.1033/cpref/html/frlrfSystemStringClassTopic.htm
I admit my mistake.

Somehow I remembered that the default encoding setting in web.config is utf-8,
and all the text files I have to access here are in utf-8, which led me to
believe everything is in utf-8 here unless explicitly stated otherwise.

Sorry for the misinformation.

Actually, I intended to reply to this thread because of another thread about
"copy char[] to byte[]", and somehow I can't find its title or search for it
again. It contained some misunderstandings about Unicode issues, so I went
searching for some references and intended to post back.

Now I seem to have stirred things up. Sorry again.
 