unicode in windows 2003

Onega

Hi

I created a simple Win32 project (VC2003, Windows 2003 (English))
and do simple painting in the WM_PAINT message. When the project uses the
multi-byte character set, it is OK.
But when I change to UNICODE, some Chinese characters are illegible (I see
sizeof(TCHAR)=2 being displayed). Your idea is welcome.

case WM_PAINT:
    hdc = BeginPaint(hWnd, &ps);
    {
        LPCTSTR smsg = _T("pringÖÐÎÄ");
        TextOut(hdc, 0, 0, smsg, (int)_tcslen(smsg));
        TCHAR buf[256];
        // sizeof yields size_t; cast to int to match the %d format.
        wsprintf(buf, _T("sizeof(TCHAR)=%d"), (int)sizeof(TCHAR));
        TextOut(hdc, 0, 20, buf, (int)_tcslen(buf));
    }
    EndPaint(hWnd, &ps);
    break;

Best Regards
Onega
 
Onega said:
Your idea is welcome.

Why still use Win32 for a new project? Do you need to modify an existing
application?

For new projects I would recommend using a .NET Windows application project.
These are MUCH simpler to use.

kind regards,
Bruno.
 
I would assume that when compiled as Unicode, the characters in your string
literal will each be interpreted as one Unicode character. You might want to
look at using the \x escape sequence.
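
As a rough illustration of that suggestion, here is a minimal sketch of the literal spelled with \x escapes. It assumes the two characters the original string intended are U+4E2D and U+6587; the names kMessage and message_length are made up for this example and do not come from the thread.

```cpp
#include <cstddef>
#include <cwchar>

// "pring" followed by U+4E2D and U+6587, written with \x escapes so the
// wide literal does not depend on the code page in effect at compile time.
// Assumption: these are the two characters the original string intended.
// Caveat: a \x escape absorbs every hex digit that follows it, so a literal
// such as L"\x4E2DA" would need to be split as L"\x4E2D" L"A".
static const wchar_t kMessage[] = L"pring\x4E2D\x6587";

// Number of wide characters in the message (hypothetical helper).
std::size_t message_length() {
    return std::wcslen(kMessage);
}
```

With this spelling the string has 7 wide characters regardless of where it is compiled: five ASCII letters plus the two escaped characters.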
 
Thank you, Ted Miller
\x escape sequence is not friendly.
My code snippet works well under Windows XP. I'd like to know if it is a bug
of Windows 2003 or VC 2003?

Best Regards
Onega

 
Onega said:
Thank you, Ted Miller
\x escape sequence is not friendly.
My code snippet works well under Windows XP. I'd like to know if it is a
bug of Windows 2003 or VC 2003?

None of the above.
It is a bug in your code.

Your string is _T("pringÖÐÎÄ");
Because of the _T, the string will be left as is if the application is
ANSI or will be converted to Unicode if the application is Unicode.

When left as is (ANSI), you will get the byte sequence:
D6 D0 CE C4

When you run this on a Chinese Simplified system,
D6 D0 => will be interpreted as center/middle (Unicode 4E2D)
CE C4 => will be interpreted as literature/culture/writing (Unicode 6587)
(I guess this is what you want)

When you run this on a Chinese Traditional system,
D6 D0 => will be Unicode 7B22 (no clue about meaning)
CE C4 => will be Unicode 6045 (no clue about meaning)
(I guess this is not what you want)

When run on a Russian system you will get Russian characters and so on.
This is the problem with code pages: the same sequence of bytes can represent
different characters in different code pages.

For a Unicode application, when you compile, the string is converted to
Unicode from the code page of your source code, which is assumed to be the
system code page.
If you compile on a US system, the result is the byte sequence
D6 00 D0 00 CE 00 C4 00
representing the Unicode characters U+00D6 U+00D0 U+00CE U+00C4

This will display identically on any system supporting Unicode:
LATIN CAPITAL LETTER O WITH DIAERESIS
LATIN CAPITAL LETTER ETH
LATIN CAPITAL LETTER I WITH CIRCUMFLEX
LATIN CAPITAL LETTER A WITH DIAERESIS
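
To make the two readings of the same bytes concrete, here is a minimal sketch. It is not what the compiler literally runs. The cp1252 decoder is only valid for bytes where cp1252 coincides with Latin-1, and the cp936 decoder is a toy that knows just the two byte pairs discussed above. All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Toy cp1252 decoder. Assumption: input bytes are in 0x00-0x7F or 0xA0-0xFF,
// where cp1252 matches Latin-1 and the byte value equals the code point.
std::vector<unsigned> decode_as_cp1252(const std::string& bytes) {
    std::vector<unsigned> out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Toy cp936 (Simplified Chinese) decoder. It knows only the two double-byte
// sequences from the thread; any other non-ASCII pair becomes U+FFFD.
std::vector<unsigned> decode_as_cp936_toy(const std::string& bytes) {
    std::vector<unsigned> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { out.push_back(lead); continue; }
        unsigned char trail = (i + 1 < bytes.size())
            ? static_cast<unsigned char>(bytes[i + 1]) : 0;
        if (lead == 0xD6 && trail == 0xD0) out.push_back(0x4E2D);
        else if (lead == 0xCE && trail == 0xC4) out.push_back(0x6587);
        else out.push_back(0xFFFD);
        ++i;  // consume the trail byte
    }
    return out;
}

// Formats each code point as four uppercase hex digits.
std::string hex_dump(const std::vector<unsigned>& code_points) {
    std::string s;
    char buf[16];
    for (unsigned cp : code_points) {
        std::snprintf(buf, sizeof buf, "%04X", cp);
        s += buf;
    }
    return s;
}
```

For the bytes D6 D0 CE C4, hex_dump(decode_as_cp1252(...)) gives 00D600D000CE00C4 (four characters), while hex_dump(decode_as_cp936_toy(...)) gives 4E2D6587 (two characters).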

If you compile this on a Simplified Chinese system you get what you want.
The \x escape sequence is not friendly, but it behaves identically on all systems.

All this aside, it is very bad practice to hard-code UI strings in
your code (you have already discovered one of the reasons).


Mihai
 
Hi Mihai N,

Thanks a lot for your informative explanation. I learned a lot from it,
but I still have some doubts on this issue.
According to your theory, it seems that my code snippet should fail on both
Windows XP (English, SP1) and Windows 2003 (English). But it is fine on
Windows XP (English version, default code page: Chinese, region: Chinese),
although I set the default code page and region to Chinese under Windows
2003 too.
I appreciate your help!

TCHAR buf[256];
ZeroMemory(buf, sizeof(buf));
int n = GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_ILANGUAGE, buf, ARRAY_SIZE(buf));
// buf contains the text "0804" under both Windows XP and Windows 2003

Best Regards
Onega
 
Onega said:
According to your theory, it seems that my code snippet should fail on both
Windows XP (English, SP1) and Windows 2003 (English). But it is fine on
Windows XP (English version, default code page: Chinese, region: Chinese),
although I set the default code page and region to Chinese under Windows
2003 too.

OK, maybe this is not the explanation.
Can you please answer some questions? Maybe I can figure it out.
Is the code already compiled, and do you test the same executable on the two
systems? Or do you recompile?
The conversion of the string in the source happens at compile time.
What characters do you see when you run your code on Windows 2003?

Mihai
 
Glad to see your post again.
Your tips are valuable.
I built ANSI and UNICODE versions of the executable under Windows XP; both work
well under Windows 2003. Then I rebuilt under Windows 2003, and only the ANSI
version works well.

my code looks like

case WM_PAINT:
    hdc = BeginPaint(hWnd, &ps);
    {
        LPCTSTR smsg = _T("AÖÐÎÄ");
        int nlen = (int)_tcslen(smsg);
        TextOut(hdc, 0, 0, smsg, nlen);
        TCHAR buf[512];
        // sizeof yields size_t; cast to int to match the %d format.
        wsprintf(buf, _T("sizeof(TCHAR)=%d, strlen = %d,"), (int)sizeof(TCHAR), nlen);
        TextOut(hdc, 0, 20, buf, (int)_tcslen(buf));
        // Dump every character of smsg as hex digits.
        ZeroMemory(buf, sizeof(buf));
        TCHAR nbuf[16];
        for (int ci = 0; ci < nlen; ci++)
        {
            ZeroMemory(nbuf, sizeof(nbuf));
            TCHAR tci = smsg[ci];
            if (sizeof(TCHAR) == 1)
                wsprintf(nbuf, _T("%02X"), tci & 0xff);
            else
                wsprintf(nbuf, _T("%04X"), tci & 0xffff);
            _tcscat(buf, nbuf);
        }
        TextOut(hdc, 0, 40, buf, (int)_tcslen(buf));
    }
    EndPaint(hWnd, &ps);
    break;

The version built under Win2003 gives the following output (I have only run it
under 2003):
ANSI: Chinese is fine, sizeof(TCHAR)=1, strlen=5, 41D6D0CEC4
UNICODE: Chinese isn't fine, sizeof(TCHAR)=2, strlen=5, 004100D600D000CE00C4

The version built under Windows XP gives the following output (run on both XP and
2003):
UNICODE: Chinese is fine, sizeof(TCHAR)=2, strlen=3, 00414E2D6587
ANSI: Chinese is fine, sizeof(TCHAR)=1, strlen=5, 41D6D0CEC4

I think there is something wrong with Windows 2003 or VS.NET 2003.

Best Regards
Onega
 
Onega said:
I built ANSI and UNICODE versions of the executable under Windows XP; both work
well under Windows 2003. Then I rebuilt under Windows 2003, and only the ANSI
version works well.

My guess: the XP you are using for building is Chinese Simplified,
the 2003 is English (or something else using code page 1252).

Onega said:
The version built under Win2003 gives the following output (I have only run
it under 2003):
ANSI: Chinese is fine, sizeof(TCHAR)=1, strlen=5, 41D6D0CEC4
UNICODE: Chinese isn't fine, sizeof(TCHAR)=2, strlen=5, 004100D600D000CE00C4

This matches what I was saying in a previous email: if you compile on a US
system, the result is the byte sequence D6 00 D0 00 CE 00 C4 00.
Note: COMPILE on US system, not RUN on US system.
_T is resolved at compile time.
This also points to the conclusion that you do compile on an English system.

Try to compile it on a Chinese Simplified system.
You can do it on your 2003 system, but you should set both the user
and the system locale to Chinese (PRC), then reboot.

There is nothing wrong with Windows 2003 or Dev. Studio (2003 or older).

But even if this solves the problem, please move the string into the resources.
This is "the right thing" to do.

Quoting Microsoft
"In fact, the C/C++ Language specification says that the source files are
to be written in 7-bit ANSI."

Quoting the standard:
1. The basic source character set consists of 96 characters: the space
character, the control characters representing
horizontal tab, vertical tab, form feed, and newline,
plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

2 The universal-character-name construct provides a way to name
other characters.
hex-quad:
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
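
As an illustration of the universal-character-name construct quoted above, here is a minimal sketch that spells the thread's test string with \u escapes and dumps it the same way the WM_PAINT test does. It assumes a compiler that implements universal-character-names in wide literals; kText and dump_units are names invented for this example.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>

// The thread's test string, with the two non-ASCII characters written as
// universal-character-names instead of raw code-page bytes.
// Assumption: U+4E2D and U+6587 are the intended characters.
static const wchar_t kText[] = L"A\u4E2D\u6587";

// Formats each wide character as four uppercase hex digits.
// Only the low 16 bits are shown, which is enough for these characters.
std::string dump_units(const wchar_t* s) {
    std::string out;
    char buf[8];
    for (; *s != L'\0'; ++s) {
        std::snprintf(buf, sizeof buf, "%04X",
                      static_cast<unsigned>(*s) & 0xFFFFu);
        out += buf;
    }
    return out;
}
```

dump_units(kText) produces 00414E2D6587 with a length of 3, the same values reported for the UNICODE build that displayed correctly.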
 
Hi Mihai N,


Both Windows XP and Windows 2003 I worked with are English version.
At last I got a solution, by putting #pragma setlocale("chs") in the .cpp file.
The idea is from Alexander Grigoriev. I respect your patience with this.
I'll take your advice in future projects. Thanks again!

Best Regards
Onega


 
Hi Onega,

Even if you did solve the problem, please give me one more answer.

Onega said:
Both Windows XP and Windows 2003 I worked with are English version.

The language of the Windows interface does not matter.
What matters for this is the system locale.
Is one of them set to Chinese Simplified system locale?

A quick way to tell, without digging in menus:
- open notepad
- type some chinese characters
- save the file as text (not Unicode or UTF, but ANSI)
- close the file and open it again
Is it still Chinese?

I would guide you through the Control Panel and dialogs, but I
am on Windows 2000 and they did change these things on XP.
 
Hi Mihai N.

Under Windows XP, I wrote "AÖÐÎÄ" in Notepad and saved it to a txt file. When I
reopen it, it is readable; opened in binary mode, it is
41 D6 D0 CE C4
ZeroMemory(buf, sizeof(buf));
GetLocaleInfo(LOCALE_SYSTEM_DEFAULT ,LOCALE_ILANGUAGE,buf,ARRAY_SIZE(buf));
// buf filled by "0804"
ZeroMemory(buf, sizeof(buf));
GetLocaleInfo(LOCALE_USER_DEFAULT ,LOCALE_ILANGUAGE,buf,ARRAY_SIZE(buf));
//buf filled by "0804"
Please wait for the result of Windows 2003.

Best Regards
Onega
 
Hi Mihai N.

Under Windows 2003, I wrote "AÖÐÎÄ" in Notepad and saved it to a txt file. When I
reopen it, it is readable; opened in binary mode, it is
41 D6 D0 CE C4
ZeroMemory(buf, sizeof(buf));
GetLocaleInfo(LOCALE_SYSTEM_DEFAULT ,LOCALE_ILANGUAGE,buf,ARRAY_SIZE(buf));
// buf filled by "0804"
ZeroMemory(buf, sizeof(buf));
GetLocaleInfo(LOCALE_USER_DEFAULT ,LOCALE_ILANGUAGE,buf,ARRAY_SIZE(buf));
//buf filled by "0409"



Best Regards
Onega
 
Hi Onega,

Your results:
XP
GetLocaleInfo(LOCALE_USER_DEFAULT ,LOCALE_ILANGUAGE,buf,ARRAY_SIZE(buf));
//buf filled by "0804"
2003
GetLocaleInfo(LOCALE_USER_DEFAULT ,LOCALE_ILANGUAGE,buf,ARRAY_SIZE(buf));
//buf filled by "0409"

409 = English U.S.!!!
It is a little weird to see that the user default matters, but just give it a
try with both locales set to 804, maybe we can solve the mystery.

I know that #pragma setlocale("chs") solved the problem,
but I usually like to understand WHY something works or not.
Anyway, I will stop bugging you, I can do my own tests,
when I will get my hands on a 2003.
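
For readers unsure how "0804" and "0409" are read, here is a minimal sketch that splits such a hex language identifier into its primary language (low 10 bits) and sublanguage (high 6 bits), the same split the PRIMARYLANGID and SUBLANGID macros perform. The LangId struct and both helper functions are hypothetical, and the name table covers only the two values seen in this thread.

```cpp
#include <cstdlib>
#include <string>

// Decomposed language identifier.
struct LangId {
    unsigned primary;  // low 10 bits
    unsigned sub;      // high 6 bits
};

// Parses a hex string such as "0804", as returned for LOCALE_ILANGUAGE.
LangId parse_langid(const std::string& hex) {
    unsigned long v = std::strtoul(hex.c_str(), nullptr, 16);
    return { static_cast<unsigned>(v & 0x3FF),
             static_cast<unsigned>((v >> 10) & 0x3F) };
}

// Hypothetical helper: names only the two identifiers seen in the thread.
std::string describe_langid(const std::string& hex) {
    const LangId id = parse_langid(hex);
    if (id.primary == 0x09 && id.sub == 0x01) return "English (United States)";
    if (id.primary == 0x04 && id.sub == 0x02) return "Chinese (Simplified, PRC)";
    return "unknown";
}
```

Under these assumptions "0409" splits into primary 0x09, sub 0x01 and "0804" into primary 0x04, sub 0x02, which is why the Windows 2003 user locale reads as English (United States) while the system locale reads as Chinese (Simplified).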
 