Unicode and ASCII code pages confusion on user interfaces

  • Thread starter: Bob Rock

Bob Rock

Hello,

I'm developing an application that users around the world could end up using
and I'd like to understand once and for all how unicode and ASCII impact
especially an application's user interface. What I cannot get is:

1) the text that appears on static text controls (for example on a dialog
box) is exactly the same for everyone independent of the ASCII code page set
on their system???
2) I have seen that most ASCII code pages share the first 128 characters ...
but what happens if in a static control I use characters above 128
decimal??? Could people with a different ASCII code page see characters
different from the intended ones??? Also when parsing input text can I rely
on the fact that, for example, the character with an ASCII value of 169
decimal is the copyright symbol as it is on my system???
3) What controls the fact that users may input unicode text into my input
controls??? Or better what controls the fact that input controls (for
example edit box controls) will accept unicode text as input??? Is it the
fact that the containing dialog box has been created with a unicode version
of CreateDialog/DialogBox??? If an ASCII version of such functions
(CreateDialog/DialogBox) has been used does a Unicode -> ASCII conversion
take place as input is typed into the input controls???
4) How can I control the fact that my application will use the font I want
and not the one that may be set on a system to support Asian languages???


Bob Rock
 
Bob Rock said:
> I'm developing an application that users around the world could end up using
> and I'd like to understand once and for all how unicode and ASCII impact
> especially an application's user interface. What I cannot get is:
>
> 1) the text that appears on static text controls (for example on a dialog
> box) is exactly the same for everyone independent of the ASCII code page set
> on their system???

The bottom 128 ASCII characters are the same regardless of the code page. In
fact, ASCII only defines codes 0-127; the rest depend on the code page in
use.
> 2) I have seen that most ASCII code pages share the first 128 characters ....
> but what happens if in a static control I use characters above 128
> decimal??? Could people with a different ASCII code page see characters
> different from the intended ones???
Yes.

> Also when parsing input text can I rely
> on the fact that, for example, the character with an ASCII value of 169
> decimal is the copyright symbol as it is on my system???
No.
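
As a rough illustration of why byte 169 cannot be relied on, here is a small
Python sketch. Python's built-in codecs stand in for the Windows code pages of
the same number (this is not Win32 code, and the three code pages are just an
arbitrary selection). Many Windows code pages do put the copyright sign at
169, but not all of them do:

```python
# Decode the single byte 169 (0xA9) under a few code pages.
# Python's codecs are used here as stand-ins for the Windows code pages.
RAW = bytes([169])

def decode_169(code_page):
    """Return the character that byte 169 represents in the given code page."""
    return RAW.decode(code_page)

# cp1252 = Western European, cp874 = Thai, cp932 = Japanese
for name in ("cp1252", "cp874", "cp932"):
    ch = decode_169(name)
    print(f"{name}: U+{ord(ch):04X}")
```

Comparing the printed code points shows that the same byte value names
different characters depending on the code page assumed by the reader.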

> 3) What controls the fact that users may input unicode text into my input
> controls??? Or better what controls the fact that input controls (for
> example edit box controls) will accept unicode text as input???

Only Unicode edit controls -- edit controls created using
CreateWindowW("EDIT") -- accept Unicode input. Windows translates input
directed towards ANSI windows into characters in the window's thread's code
page. If there is no match (e.g. I type Cyrillic into an ANSI program on a
Western system) it gives you ???.
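
The "no match" case can be imitated outside Windows. The sketch below is only
an approximation: it uses Python's codecs in place of the conversion Windows
performs, and the helper name to_ansi is made up for this example. Windows may
also apply best-fit substitutions that Python's "replace" handler does not.

```python
def to_ansi(text, code_page="cp1252"):
    """Convert text to a code page, substituting '?' for unmappable characters."""
    return text.encode(code_page, errors="replace")

# Cyrillic letters have no equivalent in the Western code page 1252.
print(to_ansi("Привет"))

# The Cyrillic code page 1251 keeps every letter.
print(to_ansi("Привет", "cp1251").decode("cp1251"))
```

Once a character has been replaced with a question mark, the original cannot
be recovered from the ANSI string.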
> Is it the
> fact that the containing dialog box has been created with a unicode version
> of CreateDialog/DialogBox???

Yes :).
> If an ASCII version of such functions
> (CreateDialog/DialogBox) has been used does a Unicode -> ASCII conversion
> take place as input is typed into the input controls???

Yes :). (I should probably read the rest of the post before replying.)
> 4) How can I control the fact that my application will use the font I want
> and not the one that may be set on a system to support Asian languages???

Well, you could use the MS Shell Dlg name: it is substituted with the
appropriate UI font on the target system. It's used in all the standard OS
dialogs and by property sheet pages everywhere. On Western systems it maps
by default to MS Sans Serif (though I've got mine changed to Tahoma).

I haven't tried this, but I think the standard fonts on non-Western versions
of Windows support the local character set anyway. Maybe a non-Westerner
could answer this?
 
Hi,
> 1) the text that appears on static text controls (for example on a dialog
> box) is exactly the same for everyone independent of the ASCII code page set
> on their system???

ASCII is independent of the system and there is no such thing as an ASCII
code page. Something similar exists, called ASCII extensions: because ASCII
only defines the first 128 characters, the other 128 characters are filled
using locale-specific characters. These extensions are almost limitless.

> 2) I have seen that most ASCII code pages share the first 128 characters ....
> but what happens if in a static control I use characters above 128
> decimal???

Mmm... same as above.

> Could people with a different ASCII code page see characters
> different from the intended ones???

Just with Unicode (or something similar.)

> Also when parsing input text can I rely on the fact that, for example, the
> character with an ASCII value of 169
> decimal is the copyright symbol as it is on my system???

No.

> 3) What controls the fact that users may input unicode text into my input
> controls???

Windows :-)

> Or better what controls the fact that input controls (for
> example edit box controls) will accept unicode text as input???

Windows too.

> Is it the
> fact that the containing dialog box has been created with a unicode version
> of CreateDialog/DialogBox???

If you compile in Unicode, your life will be easier if you intend to support
it, but you can compile without Unicode and 'work it out'.

> If an ASCII version of such functions (CreateDialog/DialogBox) has been
> used does a Unicode -> ASCII conversion
> take place as input is typed into the input controls???

Actually, Windows uses something called a Windows code page, which is
(something like) a subset of Unicode. If you compile with an ASCII version and
people input Unicode (or extended characters of the Windows code page), be
ready to see 2 bytes per character.
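
The "2 bytes per character" remark applies to double-byte code pages such as
the Japanese code page 932; single-byte code pages such as 1252 still use one
byte per character. A small Python sketch, with Python's codecs standing in
for the Windows code pages and a helper name invented for this example, makes
the difference visible:

```python
def bytes_per_char(text, code_page):
    """Return how many bytes each character occupies in the given code page."""
    return [len(ch.encode(code_page)) for ch in text]

print(bytes_per_char("abc", "cp932"))     # ASCII letters in a double-byte code page
print(bytes_per_char("日本語", "cp932"))  # kanji in the same code page
print(bytes_per_char("café", "cp1252"))   # a single-byte code page
```

Code that walks an ANSI string one byte at a time therefore has to be prepared
for characters that span two bytes.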

> 4) How can I control the fact that my application will use the font I want
> and not the one that may be set on a system to support Asian languages???
>
> Bob Rock

Lucas/

PS: I am not trying to start a discussion on the difference between character
and glyph, so please don't.
 
> > 1) the text that appears on static text controls (for example on a
> > dialog
> ASCII is independent of the system and there is no such thing as an ASCII
> code page. Something similar exists, called ASCII extensions: because ASCII
> only defines the first 128 characters, the other 128 characters are filled
> using locale-specific characters. These extensions are almost limitless.

Sorry, I noticed I made a mistake writing ASCII where ANSI should have
appeared.

Bob Rock
 
> 2) I have seen that most ASCII code pages share the first 128 characters


How can I then display in a static text control a character that is not
among the first 128 chars in my own ANSI code page so that everyone
independent of their own code page will see the intended character??? Also,
is there a way to force an application to use a given character set/code
page such as Latin I???

Bob Rock
 
Tim Robinson said:
> I haven't tried this, but I think the standard fonts on non-Western versions
> of Windows support the local character set anyway. Maybe a non-Westerner
> could answer this?

I can't answer it. But on an English version of Windows, try renaming
a file to have Chinese characters. (You can look up the Unicode codes
for Chinese characters using charmap). Usually the default (English)
desktop font won't show them. But if you go to Display Properties >
Appearance, and set the desktop font to one of the massive fonts, then
it displays fine.

> Only Unicode edit controls -- edit controls created using
> CreateWindowW("EDIT") -- accept Unicode input.

I'm unsure about what you wrote. My understanding was that edit
controls are always unicode (at least on NT/Win2k/XP). If you created
them with CreateWindowA, then that tells windows to perform automatic
text translation between multibyte and unicode in the WM_SETTEXT &c.
messages. If you used CreateWindowW then there's no translation.

Hence: I understood that every edit control accepts unicode input,
from the user typing. If the user types something weird, then
WM_GETTEXT will give it back to you as a multibyte string.

> How can I then display in a static text control a character that is not
> among the first 128 chars in my own ansi code page so that everyone
> independent of their own code page will see the intended character?

And I also understood that the static text control was always unicode
internally, but that WM_SETTEXT &c. do automatic translation. Also,
don't .rc files take unicode?

So I reckon you could (1) use SetWindowTextW(..) to put a Unicode
string into the static text control, and (2) ensure that the dialog's
font is one that will be able to display your string. It's a safe bet
that West European characters will be present in every font. But for
things like Arabic or Chinese, set it to the international Arial font.
 
Hello Tim,

Tim said:

I'd rather say that this depends on the "source" where the text comes
from (and on the font support). If the text that is displayed in the
static control is already in the dialog resource as such, then it has been
stored by the resource compiler as a Unicode string. It will therefore
display the same symbol regardless of the code page, provided the font
support is on the system, i.e. the font that is used to display the text
has an entry for that Unicode code point. If the latter is not the case, a
substitute character will be used (a question mark for instance).

However, if the source is e.g. an ASCII text file or a database (from
which you somehow retrieve the string to be displayed) and you call
SetWindowTextA to display the text in the static control, it will be
converted from an ANSI code page into a Unicode string and will be
displayed differently, depending on the code page. The same applies to
strings that are in your string table. Although they are stored as
Unicode, if you call LoadStringA, you will get an ANSI string, and if you
then call SetWindowTextA with this string for the static control, a
conversion will have happened and the text may not look the same on every
system.

For CJK languages an additional twist comes into play: Han unification,
which means that one and the same Unicode code point can be rendered as
different symbols, depending on the font used.

If the text is ANSI, Tim is right. If it is a Unicode string, 169 will
be the copyright symbol, because ISO 8859-1 is identical to the first 256
Unicode code points. (Windows code page 1252 matches ISO 8859-1 except in
the range 128-159, so it agrees at 169 too.)
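
The relationship between ISO 8859-1, code page 1252 and the first 256 Unicode
code points can be checked quickly. This is a minimal Python sketch that
relies on Python's "latin-1" and "cp1252" codecs as stand-ins; the function
name is made up for the example. Latin-1 maps each byte to the code point of
the same number, while code page 1252 departs from it only in a small range:

```python
def cp1252_differences():
    """List byte values that cp1252 decodes differently from Latin-1."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        latin1 = raw.decode("latin-1")
        # A few cp1252 bytes are unassigned in Python's codec; "replace"
        # turns those into U+FFFD instead of raising an error.
        cp1252 = raw.decode("cp1252", errors="replace")
        if latin1 != cp1252:
            diffs.append(b)
    return diffs

# Latin-1 maps every byte value to the Unicode code point of the same number.
assert all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

diffs = cp1252_differences()
print(f"cp1252 differs from Latin-1 for bytes {min(diffs)}..{max(diffs)}")
print(bytes([169]).decode("cp1252") == chr(169))
```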

Folks, correct me if I am wrong.

HTH,
 
I'm working with a team that is finishing a multi-lingual application built
with MS Visual C++ 6.0 for Windows XP Embedded which contains as part of the
package a configuration utility which was designed to run under Windows
9x/NT.

I haven't done any work in this area lately so the writeup below is a bit
rough and some details may not be right but hopefully this will point you in
a good direction.

The first thing to remember about multi-lingual Windows applications is that
Windows 9x does not have the UNICODE API (except for a couple of guaranteed
functions like MessageBox) whereas Windows NT does. Since Windows XP
Embedded is of the NT family and supports native UNICODE from the ground
up, we just turned on UNICODE with a define and used TCHAR everywhere and it
worked out fine.

The problem was with the configuration utility, which had been ported from
Windows 3.x (16-bit Windows) to Win 9x/NT and was not an MFC application, so
everything was direct Windows API calls. It used ANSI rather than UNICODE
since we wanted it to run on both types of Windows, yet it had to create its
configuration information with UNICODE strings.

To make a long story short, we did the following with the utility:
- used WCHAR internally everywhere, supporting UNICODE inside
- compiled without UNICODE turned on
- wrote our own UNICODE layer for the Windows API, which does the
  following:
  . checks if running on Windows NT and, if so, uses the UNICODE API
  . if running on Windows 9x, converts to multibyte and then uses the ANSI
    API

I believe that one of the problems we ran into when we tried to use just the
ANSI API regardless of Windows version was that when running under Windows
NT, the multibyte strings were converted to UNICODE using the Windows system
settings so the resulting output was wrong.
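
The kind of wrong output described here can be reproduced in miniature. The
sketch below is an illustration only: Python's codecs play the role of the
Windows code pages, the Greek sample word and the helper name are chosen for
the example, and code page 1252 stands in for "the Windows system settings"
of a Western machine.

```python
GREEK = "Καλημέρα"

def misread(text, written_as, read_as):
    """Encode text in one code page, then decode the bytes in another."""
    return text.encode(written_as).decode(read_as)

# Matching code pages round-trip cleanly.
print(misread(GREEK, "cp1253", "cp1253"))

# Greek bytes interpreted with the Western code page come out as
# accented Latin letters instead of Greek.
print(misread(GREEK, "cp1253", "cp1252"))
```

The bytes themselves are unchanged in both cases; only the code page used to
interpret them differs.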

We decided we had to write our own UNICODE layer because we were supporting
multiple languages in a locale-independent manner. In other words, someone
in France may be using the utility to set up a Cyrillic or Greek system, so
even though they were in France using French Windows, we had to change the
keyboard and display to the appropriate language.

We did pick and choose API functions so as to minimize the time spent making
the UNICODE layer.

The real bear of an example would be if a South Korean was configuring a
system destined for China and was using Simplified Chinese on a Korean
Windows PC.

We looked at the Microsoft Layer for Unicode (MSLU) and decided it would not
do because of the system code page dependency. It will do conversions
between UNICODE and multibyte and then use the ANSI API functions, but only
with the Windows system settings and not with an arbitrary programmatically
specified code page.

We could not figure out any other way to run on Windows 9x than by doing our
own UNICODE to multibyte conversion, specifying the target code page as a
part of the conversion call, then doing the API call with the multibyte
string.
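
The shape of such a layer can be sketched in a few lines. This is not our
actual code and not Win32: it is a Python approximation in which
call_text_api, fake_wide and fake_ansi are hypothetical names, the two fake
functions stand in for the W and A versions of some API function, and
Python's codecs stand in for the conversion call that takes a target code
page.

```python
def call_text_api(text, is_nt, code_page, wide_api, ansi_api):
    """Route a text call the way the layer described above does.

    wide_api and ansi_api are hypothetical stand-ins for a W/A function pair.
    """
    if is_nt:
        # NT family: pass the Unicode string straight through.
        return wide_api(text)
    # 9x family: convert with the caller's code page, not the system one,
    # then hand the multibyte string to the ANSI function.
    return ansi_api(text.encode(code_page, errors="replace"))

# Stand-in "API functions" that just report what they received.
def fake_wide(s):
    return ("W", s)

def fake_ansi(b):
    return ("A", b)

print(call_text_api("Ελληνικά", True, "cp1253", fake_wide, fake_ansi))
print(call_text_api("Ελληνικά", False, "cp1253", fake_wide, fake_ansi))
```

The point of passing code_page explicitly is that the caller, not the
machine's regional settings, decides how the Unicode text is converted.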

Now since the configuration utility has a multi-lingual interface, we are
using resource DLLs, one per supported language. When the user selects a
specific language, the DLL with the resources for that language is loaded,
the appropriate internal code page is set, the appropriate font family is
set, and the appropriate keyboard is selected if available. With Simplified
Chinese, we use the Microsoft IME (Pinyin, I think, for Simplified Chinese),
which is an ActiveX control on Windows 9x but is built into the OS on
Windows NT if the user has added that keyboard from the NT CD. If I
remember correctly, the UNICODE characters were available from the IME
control on Windows 9x and of course are provided as UNICODE via standard
keyboard messages on Windows NT.

When we pop up dialogs, we send all the controls a WM_SETFONT message to set
the font. You can usually tell if you missed one because you'll see the
black rectangles indicating an unsupported UNICODE character if you're
displaying Simplified Chinese with a font that doesn't contain those
glyphs.

Since resource strings are compiled into the resource file as UNICODE
strings regardless of the Windows version, the UNICODE strings were
available in the resources so all we had to do was load them in and use them
internally as UNICODE since everything was kept as UNICODE until actual I/O
was done.

But then we found that the LoadStringA() function does UNICODE to multibyte
conversion using the system code page, so we had to write our own function
to load strings. We found an example in an MSDN write-up about that.

So it all worked about a year ago, but we haven't tested it lately on Windows
9x, so maybe I should get off this newsgroup and go check it.

Of course it may not matter much anymore since most people who would be
using the utility have probably migrated to Windows XP and the rest should
go ahead and get a copy anyway.

I suggest you spend some time with MSDN browsing about and get a copy of
Petzold's Windows Programming book which has some info about UNICODE.
 