wide character (unicode) and multi-byte character

Thread starter: Guest

Hello everyone,


Wide character and multi-byte character are two popular encoding schemes on
Windows, and wide character uses the Unicode encoding scheme. But I get
confused every time another term -- codepage -- comes up at the same time.

I am even more confused because sometimes a codepage parameter is needed for
wide character conversion, and sometimes it is not. Here are two examples.

A code page is used in WideCharToMultiByte when dealing with Unicode characters:

int WideCharToMultiByte (
    UINT CodePage,
    DWORD dwFlags,
    LPCWSTR lpWideCharStr,
    int cchWideChar,
    LPSTR lpMultiByteStr,
    int cbMultiByte,
    LPCSTR lpDefaultChar,
    LPBOOL lpUsedDefaultChar );

A code page is not used in wcstombs when dealing with Unicode characters:

size_t wcstombs (
    char* mbstr,
    const wchar_t* wcstr,
    size_t count );

My question is, what is codepage (seems my current understanding is not
correct)? Does codepage have anything to do with multi-byte character or only
have relationship with wide character? Could anyone explain the meaning and
relationship between codepage, wide character and multi-byte character?


thanks in advance,
George
 
Thanks Mihai!


It is a very good article and I read through it twice. It solves and
clarifies most of my questions. I still want to let you help to confirm,

1. Unicode, ANSI, UTF-8 and UTF-16 are each a character set or code page, that
is, a mapping between characters and numbers. Is that correct?

2. What is the encoding approach? Does it have a name? I only see B or Q in
the samples in the article to represent the encoding approach.

I think encoding approach is another level of mapping between code page
character number and storage bytes. Is my understanding correct?


regards,
George
 
1. Unicode, ANSI, UTF-8 and UTF-16 are each a character set or code page, that
is, a mapping between characters and numbers. Is that correct?

Unicode = character set (in this discussion, the equivalent of a code page)

UTF-8, UTF-16, UTF-32 = Character Encoding Forms
http://www.unicode.org/glossary/#character_encoding_form

ANSI = in the Windows lingo ANSI is a misnomer, meaning “the default system
code page.” See http://www.mihai-nita.net/article.php?artID=glossary

The Unicode lingo is a bit more complicated (you also have a "Character
Encoding Scheme", etc.), but you probably don't need the whole enchilada
to get a grasp of the basics.

2. What is the encoding approach? Does it have a name? I only see B or Q in
the samples in the article to represent the encoding approach.

B = BASE64, Q = Quoted-Printable
http://www.faqs.org/rfcs/rfc2047.html

I think encoding approach is another level of mapping between code page
character number and storage bytes. Is my understanding correct?

Yes. It is also called "byte serialization."
For normal text in a computer (let's say in code page 1252, Western
European) the mapping from code value to byte is 1:1, direct storage, so the
encoding part is not obvious. The code for 'a' is 0x61 and it is stored
as the byte 61.
This is why many programmers don't "grok" this extra level.
 
Thanks Mihai!


Your reply is very great! I have read some more articles and have two more
questions.

1. Previously I think wide character representation in computer is a
specific encoding (codepage) approach -- like UTF-16, and multi-byte
character representation in computer is another specific encoding (codepage)
approach.

But now, with your help, I think my previous understanding was wrong. Wide
character and multi-byte character are just general terms used on Windows to
represent a mapping between a character and multiple (more than one) bytes. Is
that correct?

What are the differences between multi-byte and wide character? I think they
are both characters represented by more than one byte. Why are they
distinguished on Windows?

2. I am wondering where I can find the mapping table of each codepage (or
encoding)? (how a number is mapped to a character)


regards,
George
 
1. Previously I think wide character representation in computer is a
specific encoding (codepage) approach -- like UTF-16, and multi-byte
character representation in computer is another specific encoding
(codepage) approach.

You are right, they are slightly different approaches.

But now, with your help, I think my previous understanding was wrong. Wide
character and multi-byte character are just general terms used on Windows
to represent a mapping between a character and multiple (more than one)
bytes. Is that correct?

Not quite. It is a bit more complicated.
First, MultiByteToWideChar is not quite a correct name.
It converts from all kinds of code pages, including single-byte ones
(like 1252). So naming it MultiByte... is not quite accurate.
But hey, it is just an API name.

The main difference between a multi-byte character (let's say in Shift-JIS
for Japanese) and a wide character is that the multi-byte one was really
thought of as multi-byte. "Ni" in "Nihon" was "93 FA" in Shift-JIS. These are
really two bytes, and nobody considered them to be a number, 0x93FA or 0xFA93.

It is a bit like numbering systems. In base 10, you think about 12 as being
represented by two digits, 1 and 2. If you switch to base 16, 12 is a single
digit (represented as 'C').

It is a difference in perception, some might say just philosophical.
But sometimes just looking at a problem differently helps solve it.
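
To make the "two bytes versus one number" distinction concrete, here is a minimal C sketch. The helper names read_u16_be and read_u16_le are made up for illustration; only the bytes 93 FA come from the Shift-JIS example above. The byte sequence itself never changes, but the moment you insist on reading it as one 16-bit number you have to pick a byte order, and the value depends on that choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read two bytes as ONE 16-bit number,
   assuming the first byte is the high byte (big-endian). */
uint16_t read_u16_be(const unsigned char *p) {
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: same two bytes, but assuming the first
   byte is the low byte (little-endian). */
uint16_t read_u16_le(const unsigned char *p) {
    assert(p != NULL);
    return (uint16_t)((p[1] << 8) | p[0]);
}

/* "Ni" of "Nihon" in Shift-JIS: a sequence of two bytes, always in this order. */
const unsigned char ni_sjis[2] = { 0x93, 0xFA };
```

Read big-endian, the two bytes give 0x93FA; read little-endian, they give 0xFA93. A multi-byte encoding never asks that question, which is the difference in perception described above.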

What are the differences between multi-byte and wide character? I think they
are both characters represented by more than one byte. Why are they
distinguished on Windows?

They are distinguished on all platforms, not only on Windows.

2. I am wondering where I can find the mapping table of each codepage (or
encoding)? (how a number is mapped to a character)

First, it is better to use the standard APIs.
But if you want to check some code pages, you can take a look here:
ftp://ftp.unicode.org/Public/MAPPINGS/
Or you can download ICU (International Components for Unicode)
http://www.icu-project.org/download/
and take a look, they have lots of tables.
 
Thanks Mihai!


I am wondering what kinds of codepage (encoding) method could be called as
multibyte character codepage (encoding), and what kinds of codepage
(encoding) method could be called as wide character codepage (encoding)?

For example, if we are given a codepage (encoding) name like UTF-7, how
could we make a conclusion whether it is wide character or multibyte
character?

I think UTF-8 is multibyte and UTF-16 is wide character -- in my current
limited knowledge level. But I am not sure about others, could you help to
list some others (like popular ANSI code page?) or identify what is the rule
used to distinguish whether a codepage (encoding) method is multibyte
character or wide character?

I am still confused about why we distinguish multibyte character and wide
character -- because I think wide character is also multibyte character,
since wide character is of 2 bytes -- multiple bytes. :-)

I have performed some self-study, I think on Windows only UTF-16 is wide
character codepage (encoding). Is that correct?


regards,
George
 
George:

On Windows, wide-character means an encoding using 16-bit characters
(unsigned short, or wchar_t). There is only one wide character encoding
in Windows, UTF-16. Most Unicode code points in UTF-16 are just one
16-bit character, but some languages use code points that require two
16-bit characters.

All other encodings used in Windows use 8-bit characters (unsigned char,
or char).

UTF-8 can represent all Unicode code points, using up to four 8-bit
characters.

The "ANSI" code pages in Windows are 8-bit encodings in which (I think)
at most two 8-bit characters are used for each code point. Each code
page can only represent a subset of the Unicode code points, so
different languages require different code pages.

Both UTF-8 and ANSI code pages are MBCS.
 
Most Unicode code points in UTF-16 are just one
16-bit character, but some languages use code points that require two
16-bit characters.

There are no two 16-bit characters. The 16-bit thing is not a character, but
a code unit. For characters above FFFF (which cannot be represented in 16
bits) you use two code units (in the surrogates area).
So, the technically correct statement is "some languages use characters that
require two 16-bit code points" (exactly the other way around :-)

UTF-8 can represent all Unicode code points, using up to four 8-bit
characters.

"up to four 8-bit code units"

Both UTF-8 and ANSI code pages are MBCS.

Not quite.

UTF-8 is not a character set; it is a character encoding scheme for Unicode.
So it is not an MBCS (Multi Byte Character Set).

Also, ANSI code pages that have only 256 values (like 1250, 1251, 1252, etc.)
are SBCS (Single Byte Character Sets).

The only true MBCS ANSI code pages are 932 (Japanese), 936 (Simplified
Chinese), 949 (Korean) and 950 (Traditional Chinese).
All of them use at most 2 bytes, so they are also called DBCS (Double Byte
Character Set). An example of an MBCS that is not DBCS is GB 18030 (which
cannot be an ANSI code page).
 
I am wondering what kinds of codepage (encoding) method could be called as
multibyte character codepage (encoding), and what kinds of codepage
(encoding) method could be called as wide character codepage (encoding)?

To make it simple, in the Windows world wide characters are WCHAR/wchar_t.
That would be UTF-16 (16-bit code units).
So when you call MultiByteToWideChar you will convert anything to UTF-16.
And WideCharToMultiByte will convert UTF-16 to whatever.
That whatever is not always technically a "multi byte character set,"
but you should not care.

For example, if we are given a codepage (encoding) name like UTF-7, how
could we make a conclusion whether it is wide character or multibyte
character?

Although UTF-7 (or UTF-8) is not a code page or a character set, technically
you will use it on the "multibyte side" of MultiByteToWideChar.

But I am not sure about others, could you help to
list some others (like popular ANSI code page?) or identify what is
the rule used to distinguish whether a codepage (encoding) method
is multibyte character or wide character?

- UTF-16 is wide
- UTF-32 would be wide, but is not supported. And if it is ever supported,
  it will probably go on the MultiByte part of the conversion API :-)
- UTF-7 and UTF-8 are not code pages, but work as MBCS for the Windows
  conversion API
- all the rest are SBCS, DBCS, MBCS
  - SBCS (Single Byte Character Set): needs only 1 byte to represent all
    the characters (max 256 characters). Most of the Windows code pages.
  - DBCS (Double Byte Character Set): needs 1 or 2 bytes to
    represent all the characters (more than 256 characters).
    These are used for CCJK: Chinese Simplified (936),
    Chinese Traditional (950), Japanese (932), Korean (949).
  - MBCS (Multi Byte Character Set): needs 1 or more bytes to
    represent all the characters.
    SBCS and DBCS are particular cases of MBCS (because of the
    "1 or more" part). A code page that is MBCS and not SBCS/DBCS is
    GB 18030.
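
As an illustration of what "1 or 2 bytes per character" means in practice, here is a small portable C sketch that counts characters (not bytes) in a code page 932 string. The function names are hypothetical; the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are the ones documented for Windows code page 932. On Windows itself you would normally call IsDBCSLeadByteEx rather than hard-code the ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte ranges of Windows code page 932 (Japanese).
   A byte in one of these ranges starts a two-byte character. */
int is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: count characters in a NUL-terminated
   code page 932 byte string. */
size_t cp932_char_count(const unsigned char *s) {
    size_t count = 0;
    assert(s != NULL);
    while (*s != 0) {
        if (is_cp932_lead_byte(*s) && s[1] != 0) {
            s += 2;   /* lead byte + trail byte form one character */
        } else {
            s += 1;   /* single-byte character (or a truncated lead byte) */
        }
        count++;
    }
    return count;
}
```

For "Nihon" (bytes 93 FA 96 7B) this counts 2 characters in 4 bytes. Note that the trail byte 7B is also the ASCII code for '{', which is why DBCS text cannot be scanned one byte at a time.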

I am still confused about why we distinguish multibyte character and wide
character -- because I think wide character is also multibyte character,
since wide character is of 2 bytes -- multiple bytes. :-)

It is about the design of the thing, not about how many bytes are used.
It is a bit tricky to "grok," but the good news is that you don't need to
grok it in order to use it. The basic rule: in the Windows world the only
wide encoding is UTF-16, all the rest is MBCS.
This is not technically correct, but it works as a general rule.
Unless you care about the philosophical aspects, you should not mind :-)

I have performed some self-study, I think on Windows only UTF-16 is wide
character codepage (encoding). Is that correct?

Yes.
 
Thanks David! Your reply is very clear. I am still confused about just one
concept you are using.

8-bit character or 8-bit encoding, I think you mean in the encoding
approach, the basic unit is 8-bit, and in the encoding approach, more than
one 8-bit basic units could be used. Is that correct?

About 16-bit character or 16-bit encoding, I think you mean the basic unit
is 16-bit.

Is my understanding correct?


regards,
George
 
Thanks Mihai! You are so knowledgeable about codepage! So great to meet with
you here!

One more simple question about your reply,

the coding unit, I think you mean the basic units to represent the number
(or storage, hex) form of a character in computer.

For example, about wide character, the coding unit is 16-bit, so each
character is in a form of multiple 16-bits, for example, 16 bits, 32 bits or
64 bits.

For another example, about multibyte character, the coding unit is 8-bit, so
each character is in a form of multiple 8-bits, for example, 8bits, 16 bits
or 24 bits. -- But on Windows, only UTF-16 is used, means each character is
represented by 16 bits.

I have also read through the resource (link) you recommended before, and I
learned that ANSI codepage is a very special codepage name, which has
different meaning on different locales, for example,

1252 (English)
932 (Japanese), 936 (Simplified Chinese),
949 (Korean) and 950 (Traditional Chinese).

Is my understanding correct?


regards,
George
 
Mihai said:
There are no two 16-bit characters. The 16-bit thing is not a character, but
a code unit. For characters above FFFF (which cannot be represented in 16
bits) you use two code units (in the surrogates area).
So, the technically correct statement is "some languages use characters that
require two 16-bit code points" (exactly the other way around :-)

"up to four 8-bit code units"

Not quite.

UTF-8 is not a character set; it is a character encoding scheme for Unicode.
So it is not an MBCS (Multi Byte Character Set).

Also, ANSI code pages that have only 256 values (like 1250, 1251, 1252, etc.)
are SBCS (Single Byte Character Sets).

The only true MBCS ANSI code pages are 932 (Japanese), 936 (Simplified
Chinese), 949 (Korean) and 950 (Traditional Chinese).
All of them use at most 2 bytes, so they are also called DBCS (Double Byte
Character Set). An example of an MBCS that is not DBCS is GB 18030 (which
cannot be an ANSI code page).

Hi Mihai:

I just knew that I would mess this up, and that someone would have to
correct me. And I just knew it would be you :).

Yes, the word "character" is overloaded. I was using it in the sense of
what you call code unit (char or wchar_t in C/C++). I won't do that any
more.

But what exactly is the other meaning of "character"? Is it the same as
"glyph"? From an abstract point of view, I think there are just two
concepts: "(Unicode) code points" and "code units". Each encoding uses
one or more code units to represent some subset (possibly all) of the
code points. It's just what the code points represent that I'm not quite
sure of.

But anyway, don't you think the correct statement is

"Some languages use Unicode code points that require two 16-bit code
units." ?

Because any 8-bit encoding can be used in WideCharToMultiByte() and
MultiByteToWideChar(), I was thinking that any 8-bit encoding could be
regarded as MBCS. SBCS is a special case where the selected code points
can be represented by a single code unit. DBCS is a special case where
the selected code points can be represented with two code units. UTF-8
is a special case where all the code points can be represented, using up
to four code units. It's the old "Is a square a rectangle" thing.
 
George said:
Thanks David! Your reply is very clear. I am still confused about just one
concept you are using.

8-bit character or 8-bit encoding, I think you mean in the encoding
approach, the basic unit is 8-bit, and in the encoding approach, more than
one 8-bit basic units could be used. Is that correct?

About 16-bit character or 16-bit encoding, I think you mean the basic unit
is 16-bit.

Is my understanding correct?

George:

Yes. What I was calling "character" is perhaps better called "code unit."
 
the coding unit, I think you mean the basic units to represent the number
(or storage, hex) form of a character in computer.

No, the code unit is the smallest piece of data that you can manipulate.
It is not a character. A character can take several code units (up to 4 in
UTF-8, 1 or 2 in UTF-16).

For example, about wide character, the coding unit is 16-bit, so each
character is in a form of multiple 16-bits, for example, 16 bits,
32 bits or 64 bits.

Nope :-)
For 16 bits the encoding is UTF-16. You can have one code unit
(representing code points up to FFFF) or two units for the range
10000-10FFFF. The mechanism is called "surrogates," if you want to search
for it and read more.
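
For readers who want to see the surrogate mechanism rather than search for it, here is a minimal C sketch of UTF-16 encoding. The function name utf16_encode is an assumption made for this example; the arithmetic (subtract 0x10000, split the remaining 20 bits into two halves, add them to D800 and DC00) is the standard UTF-16 rule.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out[] and returns how many were written. */
int utf16_encode(uint32_t cp, uint16_t out[2]) {
    /* Precondition: in range, and not itself a surrogate value. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* fits in one code unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

With this sketch U+020D comes out as the single code unit 020D, while U+1F600 comes out as the pair D83D DE00. In neither case is a code unit "a character" by itself.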

For another example, about multibyte character, the coding unit is 8-bit,
so each character is in a form of multiple 8-bits, for example, 8 bits,
16 bits or 24 bits. -- But on Windows, only UTF-16 is used, means each
character is represented by 16 bits.

Everything in the PC world is a multiple of 8 bits.
But it is not accurate to say this about encodings.
In MBCS characters are not 16 bits. They are 2 bytes.
It is about how things are seen: as one 16-bit unit vs a succession of
8-bit units.
Think 123 vs 1 2 3.
The first one means one hundred twenty three.
The second one is a succession of digits: one two three.
They both take 3 digits, but you think about them differently.

I have also read through the resource (link) you recommended before, and I
learned that ANSI codepage is a very special codepage name, which has
different meaning on different locales, for example,

1252 (English)
932 (Japanese), 936 (Simplified Chinese),
949 (Korean) and 950 (Traditional Chinese).

Is my understanding correct?

Yes.
In the Windows world ANSI code page == default system code page.
It is the same as the code page used by localized non-Unicode versions
of Windows (from Win 3.0 to Win Me).
This is why the XP UI describes it as "Language for non-Unicode programs."
Except for legacy support, Unicode applications should not care about it.


Anyway, I think you got enough dry theory. I would say start working,
and you will grok more once you start hitting problems.
After a while you will hopefully have that "aha, now I get it!" moment :-)
 
But what exactly is the other meaning of "character"? Is it the same as
"glyph"? From an abstract point of view, I think there are just two
concepts: "(Unicode) code points" and "code units". Each encoding uses
one or more code units to represent some subset (possibly all) of the
code points. It's just what the code points represent that I'm not quite
sure of.

I would say: go to http://www.unicode.org/glossary/
A character is what the user perceives as a character.
A user means: the real guy on the street who has no clue about languages.
This is a cultural thing. For some countries the ae ligature (U+00E6)
is one character, for others it is two.
A glyph is a shape (think vectors in a TTF file). The 'a' in Times New Roman
and the 'a' in Arial are two different glyphs.
You can have a glyph representing more than one character (ligatures) or
part of a character (combining marks).
The code units depend on the character encoding scheme. They are 8 bits for
UTF-8, 16 bits for UTF-16 and 32 bits for UTF-32.
The Unicode code points are in the range 0-10FFFF, and do not necessarily
map one code point to one character. They are the real value.
Think numbers: the concept of "eighteen" is one and the same, although
you can represent it as 12h, or 0x12, or 18 or 11000 (bin) or 030 (oct)
Same, the U+020D is the "Latin small letter o with double grave"
You might represent it in UTF-8 or 16, you can even use a Java escape (\u020D)
or HTML &#525; or &#x20D;. It is the same thing.
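
As a worked example of "one code point, several representations," here is a minimal C sketch of UTF-8 encoding. The function name utf8_encode is hypothetical; the bit layout is the standard UTF-8 one (1 to 4 code units of 8 bits each).

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes to out[] and returns how many were written. */
int utf8_encode(uint32_t cp, unsigned char out[4]) {
    /* Precondition: in range, and not a surrogate value. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                      /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

U+020D becomes the two bytes C8 8D in UTF-8, whereas in UTF-16 it is the single code unit 020D. The code point, 0x20D (525 in decimal), is the same in both cases.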

But anyway, don't you think the correct statement is

"Some languages use Unicode code points that require two 16-bit code
units." ?

Right.


I was thinking that any 8-bit encoding could be
regarded as MBCS. SBCS is a special case where the selected code points
can be represented by a single code unit. DBCS is a special case where
the selected code points can be represented with two code units.

You can put it this way, if you want (especially for MultiByteToWideChar).
For me you are not a multi-millionaire if you only have one million :-)
So in my book you are not MBCS if you are only SBCS :-)

UTF-8
is a special case where all the code points can be represented, using up
to four code units. It's the old "Is a square a rectangle" thing.

But UTF-8 is not a code page or a character set. It is more like BASE64
or "quoted printable." But it was easier to squeeze it through
MultiByteToWideChar instead of adding another API.
Imagine the questions: "so if I convert from Japanese code page, I use
MultiByteToWideChar, but if I convert from UTF-8 code page I use
UTF8ToWideChar? Why? Crap design! What do you mean UTF-8 is not code page?"
:-D
 
Think numbers: the concept of "eighteen" is one and the same, although you
can represent it as 12h, or 0x12, or 18 or 11000 (bin) or 030 (oct)

Some of your coding conversions changed your numbers when you weren't
expecting it. Yeah, numbers have a lot in common with characters ^_^
 
Think numbers: the concept of "eighteen" is one and the same, although you
Some of your coding conversions changed your numbers when you weren't
expecting it. Yeah, numbers have a lot in common with characters ^_^

Right :-)
Got the 18 (decimal), considered it hex, and converted that to bin and oct.
Another try: 12h, or 0x12, or 18 or 10010 (bin) or 022 (oct)
:-D
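
For anyone who wants to check the corrected figures, here is a small C sketch that writes a number in an arbitrary base. The helper to_base is hypothetical and only meant to illustrate the "same value, different representations" point from the posts above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write `value` in the given base (2..16) into buf
   as a NUL-terminated string. Returns buf, or NULL if buf is too small. */
char *to_base(unsigned value, unsigned base, char *buf, size_t size) {
    static const char digits[] = "0123456789ABCDEF";
    char tmp[sizeof(unsigned) * 8];   /* enough for base 2 with 8-bit bytes */
    size_t n = 0;
    assert(base >= 2 && base <= 16);
    do {                              /* collect digits, least significant first */
        tmp[n++] = digits[value % base];
        value /= base;
    } while (value != 0);
    if (n + 1 > size) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {  /* reverse into the output buffer */
        buf[i] = tmp[n - 1 - i];
    }
    buf[n] = '\0';
    return buf;
}
```

Calling it with the value 18 gives "12" in base 16, "10010" in base 2 and "22" in base 8, matching the second try above.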
 