Unicode, encodings, and Asian languages: need some help.

  • Thread starter: apprentice
apprentice said:
The developer will have to specify the correct code page for each string
that he/she inputs so that I may encode the string correctly. I wanted to
support different languages on the same document. I should be able to do it
easily.

But how does that encoding information end up in the file? How does the
RTF file itself specify the encoding?
 
apprentice said:
Well, it is a simple RTF library. For Asian languages the RTF specification
seems to expect 2-byte encoded characters and requires each byte to be
escaped when it is below character code 0x20 or above 0x80. That is why I
initially thought of breaking up any Asian language
string into its composing characters, get the bytes and do the escaping if
required. But this is really not necessary. Having received the string, I
will get its bytes (based on the correct encoding) and will escape them as
required. In fact I can probably handle these asian characters using the \u
control word without even having to get to the character bytes.

This is what Japanese looks like, with the mixture of Unicode and Shift-JIS
required by the spec:

\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea

But this is also valid:
\f1\'93\'fa\'96\'7b\'8c\'ea\par

The condition is that the font number 1 (specified by \f1) has the proper
charset:

{\fonttbl
{\f1\froman\fprq1\fcharset128 MS PGothic;}
}

In fact, this would be a minimal Japanese rtf:

{\rtf1\ansi

{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea
}
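
As a rough illustration of the byte-escaped form above, here is a small
sketch that builds the same minimal document. It is written in Python rather
than the thread's C#, purely so it can be run standalone. The helper names
rtf_escape_bytes and minimal_japanese_rtf are made up for this example, and
cp932 is assumed to be the right codec for \fcharset128.

```python
def rtf_escape_bytes(text, codec="cp932"):
    """Escape text for RTF, writing every non-ASCII character as \\'hh bytes."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)      # RTF syntax characters need a backslash
        elif 0x20 <= ord(ch) < 0x80:
            out.append(ch)             # plain ASCII passes through unchanged
        else:
            # Encode one character at a time so that every byte of a
            # double-byte character is escaped, including trail bytes that
            # happen to look like ASCII (0x7b, '{', in the sample below).
            # Raises UnicodeEncodeError if the codec cannot represent ch.
            out.append("".join("\\'%02x" % b for b in ch.encode(codec)))
    return "".join(out)


def minimal_japanese_rtf(text):
    """Wrap the escaped text in the minimal font table shown above."""
    return ("{\\rtf1\\ansi\n"
            "{\\fonttbl\n"
            "{\\f1\\fcharset128 MS PGothic;}\n"
            "}\n"
            "\\f1 " + rtf_escape_bytes(text, "cp932") + "\n"
            "}")


sample = "\u65e5\u672c\u8a9e"          # the three characters used in the thread
print(rtf_escape_bytes(sample))        # \'93\'fa\'96\'7b\'8c\'ea
print(minimal_japanese_rtf(sample))
```

Note that the codec is picked by the caller, not inferred from the string:
the string itself carries no language information.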

But there are a lot of other useful rtf tags that are important,
like \dbch \langnp1041 \lang1041 and many others.

The official spec is the most important document if you want your own RTF
lib:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp


You can also consider the WinWord Converter SDK:
http://support.microsoft.com/kb/q111716/

And saving a lot of RTF files from Write & Word, then "dissecting"
them in Notepad is the best way to understand how this works.


But is there any good reason not to use a standard RTF control?
You can then serialize to/from it, full Unicode, no need to worry about
internal RTF representation, bytes, etc.
 
Yet another sorry to contradict, the "Unicode" used in .NET v1.1 is UTF-8
(not sure in .NET v2.0).

We don't talk "Unicode" here, we talk about Unicode in .NET context.
And that is UTF-16 in .NET 1.0, 1.1, 2.0, you name it.
End of story.

See the "Unicode" page from Wikipedia to get clear idea about consequence
involved, but the following is a quote to give you basic difference:

And the "Unicode" page from Wikipedia (and especially the "Storage, transfer,
and processing" section) is not very good (to put it mildly :-)

If you want something, go to the official Unicode web site.
 
Now I seems to stir things up. Sorry again.

Checking the facts, then saying "my bad" and pointing ppl to the right
document is something to be praised for, and I don't see it done too often.

Sorry, I just posted a short rebuttal to your post before reading the full
thread. Although true, you can just ignore it :-)

See? We all learn something (technical or not), every day :-)
 
Jon Skeet said:
But how does that encoding information end up in the file? How does the
RTF file itself specify the encoding?

Well, there are RTF control words to specify (1) the ANSI codepage for the
entire document, (2) the char set for each font used in the document and (3)
then there is the encoding of bytes (which I expect require the correct
Encoding class to be used ... and thus again the correct codepage).
 
Hello Mihai,

thank you very much for your answer.
Please allow me to ask you a few things ... I've got no way of testing my
code to see if my assumptions are correct.

Using how many bytes is each japanese char encoded??? From my understanding,
depending on the word, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?

In the first example that you provide (i.e.
\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea) it looks like you are
encoding dbch (double byte chars) using the \u control word and single byte
chars using the \' syntax. Did I get you right?

I don't know if you read the entire thread, but to sum it up, my main worry
in my post is if in a piece of code such as the following, a string of
japanese text would be broken up correctly returning chars meaningful for
the japanese language (bytes in the stream are correctly assigned to each
japanese char):

foreach (char ch in text)
{
    // is ch indeed a japanese char??? I doubt it!
}

My doubts arise from the fact that I have no way of telling the .NET
Framework that the text in the string is japanese. That would mean that I
cannot rely on having an algorithm break up strings in characters according
to the language of the text they contain but I'll have to work at the byte
level, encoding bytes as required by the RTF spec (using the \' syntax).

I'd appreciate if you could shed some light on the above.


Bob Rock
 
apprentice said:
Well, there are RTF control words to specify (1) the ANSI codepage for the
entire document, (2) the char set for each font used in the document and (3)
then there is the encoding of bytes (which I expect require the correct
Encoding class to be used ... and thus again the correct codepage).

In that case, I'd suggest using UTF-16 everywhere - it'll make life
much easier for you.
 
apprentice said:
thank you very much for your answer.
Please allow me to ask you a few things ... I've got no way of testing my
code to see if my assumptions are correct.

Using how many bytes is each japanese char encoded??? From my understanding,
depending on the word, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?

It entirely depends on which encoding you use. In the form of a string,
each UTF-16 code unit takes two bytes. Leaving surrogate pairs out of it
for the moment, that means each character is two bytes.

However, when you convert it to a different encoding, it entirely
depends on what that encoding uses.
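
To make the point about encodings concrete, here is a small sketch in Python
(not the thread's C#, only so that it runs standalone). The function names
encoded_sizes and utf16_code_units are invented for this example. It shows
the same three characters taking a different number of bytes in each
encoding, and a character outside the Basic Multilingual Plane taking two
UTF-16 code units (a surrogate pair).

```python
def encoded_sizes(text, codecs=("utf-16-le", "cp932", "utf-8")):
    """Return the encoded length in bytes of text under each codec."""
    return {codec: len(text.encode(codec)) for codec in codecs}


def utf16_code_units(text):
    """Return the UTF-16 code units of text, which is what a .NET char holds."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


sample = "\u65e5\u672c\u8a9e"            # the sample text from the thread
print(encoded_sizes(sample))             # {'utf-16-le': 6, 'cp932': 6, 'utf-8': 9}
print([hex(u) for u in utf16_code_units(sample)])        # ['0x65e5', '0x672c', '0x8a9e']
print([hex(u) for u in utf16_code_units("\U00020b9f")])  # ['0xd842', '0xdf9f']
```
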
In the first example that you provide (i.e.
\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea) it looks like you are
encoding dbch (double byte chars) using the \u control word and single byte
chars using the \' syntax. Did I get you right?

I don't know if you read the entire thread, but to sum it up, my main worry
in my post is if in a piece of code such as the following, a string of
japanese text would be broken up correctly returning chars meaningful for
the japanese language (bytes in the stream are correctly assigned to each
japanese char):

A string of Japanese text will always be broken up into a sequence of
UTF-16 code units.
My doubts arise from the fact that I have no way of telling the .NET
Framework that the text in the string is japanese.

You don't need to.
That would mean that I
cannot rely on having an algorithm break up strings in characters according
to the language of the text they contain but I'll have to work at the byte
level, encoding bytes as required by the RTF spec (using the \' syntax).

The string representation isn't concerned about the byte level - it's
concerned about the UTF-16 code unit level. If you want to convert
into bytes, *you* provide the encoding, so you can give whichever one
you want.
 
Using how many bytes is each japanese char encoded??? From my
understanding, depending on the word, they are encoded using 1 or 2 bytes,
with precise rules on the valid ranges for the leading and the trailing
bytes of double-byte chars (dbch). Could you please confirm?

Sorry, I made a mistake typing the post ... it should be CHAR not WORD:

Using how many bytes is each japanese char encoded??? From my understanding,
depending on the CHAR, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?
 
apprentice said:
Jon, I posted a small sample C# project to make you understand why I cannot
do what you are suggesting.
Please take a look at it:

http://backslashzero.united.net.kg/TestEncoding.zip

Hope it will clarify things.

Not really. Your code shows you using different encodings - it doesn't
show any restrictions, as far as I can see. Yes, the output is more
verbose - but the ability to encode *any* Unicode character without
having to guess at which encoding might or might not work is worth
that, isn't it?
 
Not really. Your code shows you using different encodings - it doesn't
show any restrictions, as far as I can see. Yes, the output is more
verbose - but the ability to encode *any* Unicode character without
having to guess at which encoding might or might not work is worth
that, isn't it?

Jon, I believe I'm not getting you ... and probably you are not getting my
point either. As you might have seen, the number and even what bytes are
being printed out for the same exact unicode string (the one containing the
japanese text) are different. One of the ways that RTF requires you to
encode such double-byte char texts (texts for example in chinese, japanese,
korean and vietnamese) is to precede each byte's ascii code with a \'.
However, I believe (but I admit it, this is only my belief) the correct
encoding must first be selected before getting the ascii code for the bytes
in the stream because otherwise I might end up generating RTF code that does
not display the wanted text but simply a bunch of rubbish.
 
apprentice said:
Jon, I believe I'm not getting you ... and probably you are not getting my
point either. As you might have seen, the number and even what bytes are
being printed out for the same exact unicode string (the one containing the
japanese text) are different.

Of course they are. There wouldn't be much use in having different
encodings if they all did the same thing, would there? The point of
different encodings is that they take the same text data and represent
it in different binary formats. Think of it in the same kind of way as
image formats - several formats could all take the same picture and
save it in different ways.
One of the ways that RTF requires you to
encode such double-byte char texts (texts for example in chinese, japanese,
korean and vietnamese) is to precede each byte's ascii code with a \'.

There *isn't* an ASCII code for Chinese, Japanese etc characters.
That's why you can't use the ASCII encoding.
However, I believe (but I admit it, this is only my belief) the correct
encoding must first be selected before getting the ascii code for the bytes
in the stream because otherwise I might end up generating RTF code that does
not display the wanted text but simply a bunch of rubbish.

But you said yourself that you can set the codepage for the whole
document. So set it to UTF-16 and use that throughout. Not sure where
the character set to use for the font comes into it, admittedly - you'd
have to read the specs for what that means.

On the other hand, looking at the specs briefly myself, the \uN keyword
seems to cover you fairly reasonably. I'd be tempted to stick to ASCII
and use \uN for every non-ASCII character, just to keep things simple.
That would mean the documents become pretty large, however.
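
A minimal sketch of that "ASCII plus \uN" approach, in Python rather than
the thread's C# so that it runs standalone. The function name
rtf_escape_unicode is made up for this example. It assumes the default \uc1
setting (one fallback character after each \uN), so a single '?' is written
as the fallback. A writer that emits two fallback bytes per character, as in
the Shift-JIS examples earlier in the thread, would need \uc2 in effect.

```python
def rtf_escape_unicode(text):
    """Escape text for RTF using only ASCII: \\uN plus a '?' fallback."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)          # RTF syntax characters
        elif 0x20 <= ord(ch) < 0x80:
            out.append(ch)                 # plain ASCII passes through
        else:
            # Work on UTF-16 code units, as a .NET char would; a character
            # outside the BMP yields two units (a surrogate pair).
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit >= 0x8000:
                    unit -= 0x10000        # \uN takes a signed 16-bit value
                out.append("\\u%d?" % unit)
    return "".join(out)


print(rtf_escape_unicode("\u65e5\u672c\u8a9e"))   # \u26085?\u26412?\u-30050?
```

No code page or language has to be known for this form; the cost is the
larger output mentioned above.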
 
One of the ways that RTF requires you to
There *isn't* an ASCII code for Chinese, Japanese etc characters.
That's why you can't use the ASCII encoding.

You are right. I used the word ascii, but I meant the hex value for a byte.
But you said yourself that you can set the codepage for the whole
document. So set it to UTF-16 and use that throughout. Not sure where
the character set to use for the font comes into it, admittedly - you'd
have to read the specs for what that means.

Yes, I get your point now. Don't know if it would work ... but I'll try it
and submit it to someone who may test the code.
On the other hand, looking at the specs briefly myself, the \uN keyword
seems to cover you fairly reasonably. I'd be tempted to stick to ASCII
and use \uN for every non-ASCII character, just to keep things simple.
That would mean the documents become pretty large, however.

Yes, but either your idea above works, or the .NET framework will probably
break up the text in a string into chars that are simply junk for the
specific language of the text. Hope you now understand why I was trying to
find a way to help the framework break up the text into chars that are
meaningful for the specific language.
 
apprentice said:
You are right. I used the word ascii, but I meant the hex value for a byte.

Right - although "hex value" is unnecessary too, it's really just the
bytes which are important. They're really bytes for the characters,
rather than codes for the bytes, if you see what I mean :)
Yes, I get your point now. Don't know if it would work ... but I'll try it
and submit it to someone who may test the code.
Excellent.


Yes, but either your idea above works, or the .NET framework will probably
break up the text in a string into chars that are simply junk for the
specific language of the text. Hope you now understand why I was trying to
find a way to help the framework break up the text into chars that are
meaningful for the specific language.

But the point is that because .NET uses UTF-16, and UTF-16 can encode
*all* Unicode characters, you shouldn't have a problem. The characters
*can't* be junk for the language of the text, unless the text was
extracted badly to start with - because the characters *are* the text
as far as .NET is concerned.

It sounds like we're definitely making progress though :)
 
Using how many bytes is each japanese char encoded??? From my
understanding,
depending on the word, they are encoded using 1 or 2 bytes, with precise
rules on the valid ranges for the leading and the trailing bytes of
double-byte chars (dbch). Could you please confirm?
True.

In the first example that you provide (i.e.
\u26085\'93\'fa\u26412\'96\'7b\u-30050\'8c\'ea) it looks like you are
encoding dbch (double byte chars) using the \u control word and single byte
chars using the \' syntax. Did I get you right?
True again :-)
My doubts arise from the fact that I have no way of telling the .NET
Framework that the text in the string is japanese.
Well, you have to know. You need to carry that info together with the string.
Where are the strings coming from? Is that info available at some point?

I still don't understand exactly what you need.
Back to my question "is there any good reason not to use a standard RTF
control?"
At what level do you want to work?
Have code producing an rtf "from scratch", no rtf control involved?
I find this a bit tough and probably not worth the effort.
Why not use the standard RTF control? Then you do not need to care
about the internal representation (but you still have to care about the right
fonts).

The same unicode code point looks differently in Japanese/Traditional
Chinese/Simplified Chinese, and you need the proper font for the proper
language.

The font gives a hint to the RTF control for what encoding to use.
See my example:
{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea

This reads: font number 1 using charset 128 is "MS PGothic"
Then \f1 tells that the text following uses font 1.
Charset 128 is SHIFTJIS_CHARSET (WinGDI.h), which means Japanese,
which means code page 932 is used for the bytes.
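
The chain "font charset, then code page, then bytes" can be sketched as a
lookup. This is Python rather than the thread's C#, only so that it runs
standalone. The table and the function name decode_hex_run are invented for
this example. The charset numbers are the WinGDI.h values, and mapping
charset 0 to code page 1252 is an assumption (in a real document that would
depend on the document code page).

```python
import re

# \fcharsetN value -> Python codec for the matching Windows code page.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI_CHARSET (assumed Western European here)
    128: "cp932",   # SHIFTJIS_CHARSET, Japanese
    129: "cp949",   # HANGUL_CHARSET, Korean
    134: "cp936",   # GB2312_CHARSET, Simplified Chinese
    136: "cp950",   # CHINESEBIG5_CHARSET, Traditional Chinese
}


def decode_hex_run(rtf_run, fcharset):
    """Decode a run made only of \\'hh escapes using the font's charset."""
    hex_pairs = re.findall(r"\\'([0-9a-fA-F]{2})", rtf_run)
    raw = bytes(int(h, 16) for h in hex_pairs)
    return raw.decode(FCHARSET_TO_CODEC[fcharset])


run = r"\'93\'fa\'96\'7b\'8c\'ea"
print(decode_hex_run(run, 128) == "\u65e5\u672c\u8a9e")   # True
print(decode_hex_run(run, 0) == "\u65e5\u672c\u8a9e")     # False: wrong charset, rubbish
```

The same six bytes only come out as the intended text when the font's
charset matches the encoding that produced them.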


On the other side, "the bytes" part is only used by old RTF controls.
For new controls you can even use this:
\u26085\'3f\'3f\u26412\'3f\'3f\u-30050\'3f\'3f
(\'3f = question mark)
 
Well, you have to know. You need to carry that info together with the
string.
Where are the strings coming from? Is that info available at some point?

I still don't understand exactly what you need.

I'm writing an RTF library.
Back to my question "is there any good reason not to use a standard RTF
control?"

Yes, it seems it does not work.
On my system (set to use ansi codepage 1252) the code that makes use of the
RichTextBox control does not work.
Take a look at the project at the following location:

http://backslashzero.united.net.kg/JapaneseRTF.zip

On my system, after setting the Text property to the japanese string, I get
the RTF output (Rtf property on the RichTextBox control) and this is
empty!!!
Can anyone explain it???
At what level do you want to work?
Have code producing an rtf "from scratch", no rtf control involved?
I find this a bit tough and probably not worth the effort.
Why not use the standard RTF control? Then you do not need to care
about the internal representation (but you still have to care about the
right
fonts).

The same unicode code point looks differently in Japanese/Traditional
Chinese/Simplified Chinese, and you need the proper font for the proper
language.

Yes, I know.
The font gives a hint to the RTF control for what encoding to use.
See my example:
{\fonttbl
{\f1\fcharset128 MS PGothic;}
}

\f1\'93\'fa\'96\'7b\'8c\'ea

This reads: font number 1 using charset 128 is "MS PGothic"
Then \f1 tells that the text following used font 1.
Charset 128 is SHIFTJIS_CHARSET (WinGDI.h), which means Japanese,
which means 932 used for the bytes.
Ok.



On the other side, "the bytes" part is only used by old RTF controls.
For new controls you can even use this:
\u26085\'3f\'3f\u26412\'3f\'3f\u-30050\'3f\'3f
(\'3f = question mark)

I believe using the bytes is simpler. In the code you sent me you do the
following on the unicode value of a char in the provided text string:

if (unicodeValue >= 0x8000)
    unicodeValue -= 0x10000;

This means that there are certain chars when I just cannot print out the
unicode value for the char BUT I somehow need to do a transformation that is
dependant on the text language (japanese, chinese, etc.). Having to write
language specific code is something I want to avoid. Printing out the bytes
should not require any of these transformations so I'll stick to them.


Bob
 
Back to my question "is there any good reason not to use a standard RTF
Yes, it seems it does not work.
On my system (set to use ansi codepage 1252) the code that makes use of the
RichTextBox control does not work.
As I explained by email, you probably don't have Japanese support installed.

Take a look at the project at the following location:

http://backslashzero.united.net.kg/JapaneseRTF.zip

On my system, after setting the Text property to the japanese string, I get
the RTF output (Rtf property on the RichTextBox control) and this is
empty!!!
Can anyone explain it???
Working on my system.
And I would appreciate if once you ask me something by email, you keep it
there. And not publicly posting code I give you without checking with me
and without the proper credits.

I believe using the bytes is simpler. In the code you sent me you do the
following on the unicode value of a char in the provided text string:

if (unicodeValue >= 0x8000)
    unicodeValue -= 0x10000;

This means that there are certain chars when I just cannot print out the
unicode value for the char BUT I somehow need to do a transformation that
is
dependant on the text language (japanese, chinese, etc.). Having to write
language specific code is something I want to avoid. Printing out the bytes
should not require any of these transformations so I'll stick to them.
Also, as explained by email, this is just a dumb way to cast to a signed
short; nothing in it depends on the language. And you are taking code out of
context and posting it publicly (again, code I sent you privately by email).
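
A quick way to check the "cast to a signed short" reading is to compare the
subtraction against a real 16-bit reinterpretation for every possible code
unit. This sketch is in Python rather than the thread's C#, only so that it
runs standalone, and the two function names are made up for this example.

```python
import struct


def to_signed16_subtract(unit):
    """The two-line form quoted in the thread."""
    if unit >= 0x8000:
        unit -= 0x10000
    return unit


def to_signed16_cast(unit):
    """Reinterpret the same 16 bits as a signed short."""
    return struct.unpack("<h", struct.pack("<H", unit))[0]


# The two agree for every 16-bit value, whatever script it belongs to.
assert all(to_signed16_subtract(u) == to_signed16_cast(u)
           for u in range(0x10000))

print(to_signed16_cast(0x65E5))   # 26085
print(to_signed16_cast(0x8A9E))   # -30050
```

So the transformation is purely numeric: it applies to any code unit at or
above 0x8000 and needs no knowledge of whether the text is Japanese,
Chinese, or anything else.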


In general, it is my pleasure to help. But I think once a discussion moves
to email, it stays in email. And my code sent by email should not be
made public without my permission, and especially without giving credit.

If you are in a rush and I don't answer fast enough, then please
keep the thread on the newsgroups.
I have a day job and am doing this on my own time, usually late at night.
This means I don't answer questions (including email) during the day.
 
And I would appreciate if once you ask me something by email, you keep it
there. And not publicly posting code I give you without checking with me
and without the proper credits.

Credits??? Are you alright??? In that code there are only 3 lines (and they
are indeed only 3) of YOUR code.
 
And I would appreciate if once you ask me something by email, you keep it
Credits??? Are you alright??? In that code there are only 3 lines (and they
are indeed only 3) of YOUR code.

Yes, credits. One line or 5000, it does not matter.
When 3 lines of my code do the job right and replace 40 lines of your code
that do not, then yes, you should give credit.

Programming is about quality, not quantity.
Maybe as an "apprentice" you did not know that.

To close the chapter I will answer here the questions you ask by email.
I don't know why I am doing it, but here it is:

===================
What RTL issues???
RTL = Right To Left, used for scripts like Arabic or Hebrew.
You think there are no issues? Search the specs (from the link I have sent
you) for rtf tags related to this.

You mean that I cannot encode
the text using iso-2022-jp and then print out the bytes preceding the
byte value with \' ?
Yes, this is what I mean.

Could you please explain why?
Read the RTF specs.

Yes, windows 2003. What you are saying leads however to exclude the use of
the RichTextBox control in my library: I may not count on a client system
to have asian symbols installed.
Then you cannot count on a client system to handle Asian languages.
It is unrealistic to ask for Japanese without Japanese support from the OS.
Unless you want to do everything from scratch and carry all the data with
you (like ICU).

===================
 