GNU gettext

  • Thread starter: Jean-Luc M.
Any idea how much of this software is developed in C#?

What, free and open source software? The numbers are growing. C# is
catching on in the F/OSS world; there are at least two C# compiler
implementations that I am aware of, and two CLR implementations. There
are likely to be more (and more specialized) ones, though the one that
most people use (Mono) is quite flexible and efficient.

Banshee, a portable (and very pleasant) media player written in C#, uses
gettext. Presently Banshee runs on systems that can run Mono, save for
Windows, though last I heard the final changes necessary to run on
Windows were close to being finished. It's licensed under the MIT
license, so you can check it out and look at the source if you'd like.

http://banshee-project.org/about/license/

--- Mike
 
There is no such thing.
Some tables in the OpenType specs are limited to 65,536 entries,
and there are values limited to 16 bits.
Unicode has more than 100,000 characters allocated.

My mistake. I should have said "font families"; you're right, an OTF
file can only hold one Unicode plane. Which brings me to my next
point...
Again, no such thing.

This is something you will invent, because it is not available.
And it will be a quick (and buggy) solution, like the others.

Actually, Gtk (or possibly the text rendering libraries that it uses)
already does this. If the requested Unicode code point is not
available in the font being displayed, the system will find another
similar font (preferably in the same family, so that there is æsthetic
consistency, but if not, then at least within the same class, serif,
sans, etc.) and display the glyph in that font, instead. Gtk+ has done
this for years now.
It is not only about text, it is about full control.
The scroll-bar goes to the left, radio buttons will have the (x) on
the right, and (sometimes) you will have to mirror the bitmaps of the
buttons.

This is easily done on a stock Gtk+ system (even menus and toolbars are
right-aligned and reversed from a normal English layout, when the
application has support for that locale).
Example: in the Arabic IE toolbar the left and right arrows
(back/forward) will have to be mirrored (because in Arabic "forward"
is "to the left") But you don't want to mirror the bitmap with the
the IE logo.

Do you know of such a system, that can automatically know if a bitmap
should be mirrored or not?

Automatically? No. There is no such thing; computers don't
intrinsically understand the content of graphical images. However,
you can certainly indicate alternate images which should be used based
on the rules for the current locale. Software that is inordinately
complex and has a great deal of bitmapped graphical elements in it
which are not standard widgets has to specify what the differences
are. As an example, OpenOffice.org reverses the entire toolbar when
the locale is set to Arabic. I'd imagine that other applications
behave the same way when they are i18n-aware and cover the target
language. That having been said, I don't know that anything even close
to the majority of software has been translated to Arabic. That
doesn't prevent you from entering data in it that way, though; if the
system supports Unicode fully, you'll get very basic automatic
switching between LTR and RTL in a mixed-language document.
 
translation files give you a pointer to the origin of the message so
that you can see where it's used to gain a sense of context within the
program.

gettext was designed by geeks, for geeks.
A professional translator does not care (and does not want) to look
in C files to see where a string is used.
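
For reference, a typical .po catalog entry looks roughly like this; the
"#:" comment is the source pointer in question (the file, line, and
translation shown are only illustrative):

#: src/print-dialog.c:142
msgid "Document Print Settings"
msgstr "Einstellungen für den Dokumentdruck"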

non-breaking whitespace doesn't get displayed unless you ask a
tool to show it to you.

Quite on the contrary. Non-breaking whitespace always shows like
a space. It just prevents a line break there.
You use it (for instance) between Mac OS and X so that you
don't end up with Mac OS
X


Lucky for us, one can use UTF-8 in C source code without an
issue, since C treats a char * as a sequence of bytes. The user

http://www.gnu.org/software/gettext/FAQ.html#nonascii_strings

"Short answer: If you want your program to be useful to other people,
then don't use accented characters (or other non-ASCII characters) in
string literals in the source code."
....
"So, in summary, there is no way to make accented characters in string
literals work in C/C++."


you'd need to write a small utility to help
you to manage the files, but it's certainly doable.
Heck, you could use a padding of NUL bytes
One could also, if they _really_ wanted, actually modify gettext

Anything is fixable.
But a tool that needs so much fixing to do the job it was supposed
to do is a bad tool.

I work with a very niche audience, and
don't have a need for i18n other than ensuring that it'd be easy to do
if the client changes its mind later.

And it shows, sorry to say.

I've learned enough of two other languages to
understand the sorts of situations that are awkward to translate,
and so I avoid them.

This does not scale. When you translate into 30 languages,
the problems start accumulating, especially if the languages
are unrelated.

Take a look at the Unicode standard. It's freely available on the
Internet.

This is why I was asking. You take a look, and you will
discover that those *control characters* have a very clear role.
And that role is not to serve as invisible differentiators in keys.

I'm not sure why you'd count in words, when strings are the important
thing.

Because this is how the size of translatable text is measured.
This is what you pay for. So people in the l10n business
think in "word counts".

True, for this discussion the important thing is indeed the number of
strings. But that does not give a good feeling for how big the thing is.
Imagine I tell you that the temperature is 27 degrees Celsius.
Yes, it is easy to convert to F, but you don't know instantly
if it is pleasant, too cold, or too hot.
(I assume you live in the US and you are used to F, not C.)


Exactly!
And gettext was created in the open source world,
where a geek writes the software, another geek translates it,
and another geek will use it.
Once you get out of that world, things start breaking.
A geek writes the software, a linguist (professional translator)
will localize it, and a total non-geek will use it.
Then small things like gender, case, number, etc. start to look
unprofessional (think "all your base are belong to us" :-)

Every developer should be
well-practiced in applying language elegantly, not just putting it
there.

I agree here.
But for things to go right, it means that developers would also have
to be well-practiced in thinking how something will behave in 30
languages, most of them totally unfamiliar.
Impossible.
This is why you need good libraries/tools.

Updating even 100 language
catalogs can be done in well under a minute using them unless your
catalogs are on a very slow network drive.

Have you ever been involved in translating a medium-size piece of
software into more than 10 languages?

There are lots of g11n, i18n, and l10n experts out there. Many of them
have broad experience in a good number of systems, and most of them
that I have met are native speakers of more than one language (a
variant of English and another language is the most common I've run
into).

Speaking a foreign language (or 2, or 4, or even 10) does not make one
a g11n/i18n/l10n expert. It helps, but it is not enough.
Is my writing so good that I sound like a native English speaker?
I know this is not the case.

You're the first that I've seen actively complain about gettext,
to be honest.

I have friends that are doctors (the medical kind).
And they don't complain to me about crappy drugs or procedures.
So try asking one of your expert friends about it.
See what they say. Maybe even tell them about some of my reasons.

Agreed; it's time to end the thread. You seem to be frustrated.
Please accept my apologies if I've somehow caused that.

That's ok, comes and goes :-)

But you seem to not want to listen.

Do you really think that all these guys using the key-value system
are all idiots? Between the MS guys that created the .rc files,
and the C# localization model, Sun with the Java properties files,
all of them?
Do you really think they are not aware of the "great invention"
that is gettext?

gettext works in the "3 x geek" world. You move out of there,
you need quality. And gettext is not enough anymore.
 
Actually, Gtk (or possibly the text rendering libraries that it uses)
already does this. If the requested Unicode code point is not
available in the font being displayed, the system will find another
similar font (preferably in the same family, so that there is æsthetic
consistency, but if not, then at least within the same class, serif,
sans, etc.) and display the glyph in that font, instead.

As I have explained, code points are not enough to decide on what font to
use. Fonts (and glyphs) can be language-specific. Read my comment again.
Gtk (and Windows, and Mac) can do some educated guessing. But the results
are not always accurate. That is ok for some text input, but not for the
full UI.
 
In very great summary, I think that there is simply something that I
just don't comprehend about the way .NET does it, even after looking at
the documentation and the like and that there is probably quite simply
a large difference at how we view doing things. I'd like to seek to
correct my own apparent deficiency here, and it is to your credit that
I've learned about it.

Apologies about the rather long length of the post, ahead of time, and
further apologies if my apparent inability to "get it" is frustrating.
I am trying to understand what the major differences are as well as
what it is I don't get.

Quite on the contrary. Non-breaking whitespace always shows like
a space. It just prevents a line break there.
You use it (for instance) between Mac OS and X so that you
don't end up with Mac OS
X

Oops. I meant to say zero-width non-breaking whitespace. I missed the
most critical part. :-P
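
A small C# illustration of the difference (the strings are just examples):

char nbsp = '\u00A0';   // NO-BREAK SPACE: rendered like a space, but forbids a line break there
char zwnbsp = '\uFEFF'; // ZERO WIDTH NO-BREAK SPACE: invisible; Unicode now prefers U+2060 WORD JOINER for this use
string title = "Mac OS\u00A0X"; // will never wrap between "OS" and "X"
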
http://www.gnu.org/software/gettext/FAQ.html#nonascii_strings

"Short answer: If you want your program to be useful to other people,
then don't use accented characters (or other non-ASCII characters) in
string literals in the source code."
...
"So, in summary, there is no way to make accented characters in string
literals work in C/C++."

I'd wonder how long it's been since that was written, since no current
system that I am aware of is restricted to ASCII or ISO-8859-xx only
character sets---even embedded devices these days are able to render
Unicode characters pretty easily.
And it shows, sorry to say.

Anyone can be expected to be ignorant in a field in which they're not a
full-time-plus participant---I'm no exception to that rule. That having
been said, I do continually attempt to learn as much as possible to make
future modifications easier, be that internationalizing an application I
write, or me coming back later to modify the application. This
includes a11y, i18n, UI design, development techniques, etc.
Exactly!
And gettext was created in the open source world,
where a geek writes the software, another geek translates it,
and another geek will use it.

I have to disagree on the idea that F/OSS is by geeks and for geeks...
I know many non-geeks that are quite happy to use it. That's really
neither here nor there, though. gettext was designed for the "small
tools" way of thinking---using a suite of small tools that are designed
to interoperate with other tools (known/unknown, past/present/future)
and exceed the original developers' intended design. There are nearly 40
years of utilities out there designed in this way. Some people do
think that the tools are too narrowly aimed, but that's precisely part
of the point: to have very little overlap. This means that
functionality provided by older, already existing tools (e.g.,
mass-editing multiple text files, or generating multiple text files
that fit a pattern, etc.) won't be reimplemented in a newer tool that
follows that philosophy. It's a way of thinking, to be sure, but it's
not restricted to geeks.
Once you get out of that world, things start breaking.
A geek writes the software, a linguist (professional translator)
will localize it, and a total non-geek will use it.
Then small things like gender, case, number, etc. start to look
unprofessional (think "all your base are belong to us" :-)

It'd seem that perhaps---among other things---I don't quite see how
using an arbitrary key to look up a string helps productivity at all.
If you're working on the code for a project, and what you're looking at
is:

string windowTitle = stringsResourceMgr.GetString("dialog.print");

Then it'd seem you have to go and look that up in whatever language
you're working in. Most programmers work in English, so they'd have to
go search the English language resource file and look for the string.
It'd seem that having something like:

string windowTitle = "Document Print Settings";

... would eliminate that right away.
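
For contrast, the gettext pattern keeps the English text in the call
itself; in C# that is typically done through Mono's Mono.Unix.Catalog
wrapper (a sketch, assuming the Mono.Posix assembly is referenced; the
domain name and locale directory are just examples):

using Mono.Unix;

// Once at startup: bind the text domain to its locale directory.
Catalog.Init("myapp", "/usr/share/locale");

// The English string is both the lookup key and the fallback text.
string windowTitle = Catalog.GetString("Document Print Settings");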

The other solution would be to put the string in a comment near where
the string is referenced from the resource it's loaded from, but that
can very easily be out of date since comments are often less maintained
than code. It seems to me that there is more room for error that way
due to it being less straightforward.

However, I'm not sure that I could come up with anything any better
than gettext or the way that .NET handles it natively or the way that
Java handles it natively without a very large, significant amount of
thought. It's not an easy problem to solve, and there may never be an
ideal or perfect situation---and it seems to me that there are
tradeoffs depending on which ones you pick. I suppose what it comes
down to then is deciding what trade-offs are acceptable for a given
project.
I agree here.
But for things to go right, it means that developers would also have
to be well-practiced in thinking how something will behave in 30
languages, most of them totally unfamiliar.
Impossible.
This is why you need good libraries/tools.

Well, maybe just the 5 most widely varied languages. It'd be somewhat
redundant to know both French and Spanish for this purpose, but it'd
help to know, say, Latin (limited word roots available), Esperanto
(same thing, but more adaptable, often used in computerized translation
software as an intermediate language for its "meeting in the middle"
between many languages and constant regularity), French or Spanish, and
one or two other languages that work entirely differently from the
above. As a fallback, having people close by that speak English and
one or more of those other languages would be a decent help, too. But,
there will _always_ be quirks, and there's never going to be a day and
age where translations will always be literal between languages.
Have you ever been involved in translating a medium-size piece of
software into more than 10 languages?

No, but I have been involved in medium-to-large size software wherein
anywhere from hundreds to thousands of files had to be processed in a
similar fashion. The example here was fixing the English keys, in
gettext's case. In that case, you're going to be making the exact same
transformation on every single file, and this is a simple for loop at
the shell, using sed or awk to do the substitution. To make it easier,
the three or so commands can be wrapped into a shell script.
Speaking a foreign language (or 2, or 4, or even 10) does not make one
a g11n/i18n/l10n expert. It helps, but it is not enough.

If I made that implication, I didn't mean to; sorry. Most of the work
of globalizing an application can be done by a reasonably seasoned,
non-arrogant programmer that is willing to learn where s/he must not
make assumptions---e.g., not assume that numbers are always formatted
with commas for thousands separators and decimal points for the
separation between the whole and fractional parts of a number, or
assuming that dates are printed the way they're used to, or that layout
issues will be static. It takes some learning to gain the ability to
do it, and a great deal of practice to stop making the assumptions that
we are inclined to make based on our past (usually somewhat narrow)
life experience.
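
As a small illustration of the kind of assumption in question, the same
number and date come out quite differently depending on the culture (a
.NET sketch; the culture names are just examples):

using System;
using System.Globalization;

double amount = 1234567.89;
DateTime date = new DateTime(2007, 12, 5);

// Same values, different surface forms.
Console.WriteLine(amount.ToString("N2", new CultureInfo("en-US"))); // 1,234,567.89
Console.WriteLine(amount.ToString("N2", new CultureInfo("de-DE"))); // 1.234.567,89
Console.WriteLine(date.ToString("d", new CultureInfo("en-US")));    // 12/5/2007
Console.WriteLine(date.ToString("d", new CultureInfo("de-DE")));    // 05.12.2007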
Is my writing so good that I sound like a native English speaker?
I know this is not the case.

You are correct, but I think it's somewhat beside the point. You are
able to communicate in English, which at the very least gives us some
common ground; you've provoked a lot of thought on my end as well, and
for that I am truly appreciative.
I have friends that are doctors (the medical kind).
And they don't complain to me about crappy drugs or procedures.
So try asking one of your expert friends about it.
See what they say. Maybe even tell them about some of my reasons.

While I've not queried everyone I know, I can say that the only things
that I've heard some of my friends gripe about is having to learn
multiple systems for the same task between different jobs.

Also, while I don't *personally* work in the field of i18n, I've also
learned quite a lot from them about writing programs that are easy to
internationalize, because they do also gripe about software that can't
be internationalized by them because the program is too rigidly
structured and makes too many culture-specific assumptions.
That's ok, comes and goes :-)

But you seem to not want to listen.

No, I do want to listen. I want to more than listen, I want to
comprehend. Admittedly, I still don't fully understand why using an
arbitrary key is better, or how this saves work overall. I see other
issues with the way .NET does it from what I understand of the system
so far---though to be sure, I need to read more about it, despite
having read a lot on it. One such example: using assemblies to hold the
information, instead of a plain text file or generated hash table from
the plain text file.

An application front-end that is written in multiple language
environments but exposes a similar or identical interface can then no
longer share the same data, or must have a central repository of that
data which is fed to multiple code/file generators that work with each
i18n resource management system. Some of the largest companies I've
performed work for have such software, but _do_ use gettext because of
its ubiquity. While I can't name specifics, I can say that one
employer of 150,000+ employees uses gettext in its internal software
used by sales agents around the world for this very reason---they've
adopted it for (nearly) all of their software, including that which
wasn't previously internationalized and has existed since before
gettext was written.
Do you really think that all these guys using the key-value system
are all idiots? Between the MS guys that created the .rc files,
and the C# localization model, Sun with the Java properties files,
all of them?

No; I think that it's a very difficult problem domain where an ideal
solution hasn't yet been discovered. I think that there are many
tradeoffs between the various available systems, such as tying oneself
to a particular system or portability. Now, C# and .NET's way of
handling it is more portable than Java's, since a fully JIT-capable CLR
exists for more platforms (currently, though this may change since the
only fully-compliant Java VM can now be ported by the community),
though still not quite as portable as gettext, which works just
about everywhere imaginable. That having been said, I do know that
Java---at least as of the last time I did any programming in it---didn't
have support for easy pluralization in its own system of
internationalization; I'd call that a decent oversight, myself.
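
gettext's plural handling, for comparison, looks roughly like this
through the same Mono.Unix.Catalog wrapper (a sketch; the message
strings are just examples):

using Mono.Unix;

int copied = 3;

// The catalog's Plural-Forms rule picks the correct form for the target
// language; the two English strings are only the source text and fallback.
string message = String.Format(
    Catalog.GetPluralString("Copied {0} file.", "Copied {0} files.", copied),
    copied);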
Do you really think they are not aware of the "great invention"
that is gettext?

I'm not in for a religious debate; my initial recommendation was based
on the idea that it's more portable between languages, runtime
environments, and operating systems than the .NET standard means of
implementing internationalization. It'd seem that really, none of the
existing systems are perfect. Then again, I know of nothing that is
perfect in every way.

'gettext' was primarily designed for the "small-tools" way of doing
things, much like UNIX was designed, and much like most well-written
modular software is designed. You'd mentioned that gettext was a bad
tool for needing to be "fixed," but I don't think that revising strings
means that it needs to be fixed. On the contrary, I'd think using
arbitrary resource identifiers can potentially lead to translated
strings diverging from each other in meaning, unless there is something
grave that I am missing about how the .NET way of doing it works. I'll
wholeheartedly admit that I have a great deal more functional
experience with gettext than I do with anything else, since I use
software that uses it and sometimes make updates in gettext's resource
files for people when they don't have the technical know-how to do it
(say, they're contributing a single translation, and I happen to be
around to help them get it into a resource file).
gettext works in the "3 x geek" world. You move out of there,
you need quality. And gettext is not enough anymore.

I'm not entirely sure what you mean by the "'3 x geek' world".

I think we'll have to agree to disagree there, however; I'd argue that
the systems I use are typically of a much higher quality than most
proprietary commercial systems that I've used over the years---it's one
of the reasons that I, by and large, don't use proprietary commercial
systems unless it's a very specific requirement for a client. I don't
have time to deal with minutia like software crashes, required reboots,
BSODs, etc.---I need to be able to sit down, get my stuff done, and
go. I think that we at least agree on that, insofar as it probably
applies to both of us.

Given that what I use employs the small-tools philosophy (it means lots
of small, well-tested components as opposed to large, monolithic
time-bombs) it works well for me. Yes, there is the rare time that I
run into an extremely strange issue and need to think of a way to do it,
but 99% of the time, I can do it without writing any new software
whatsoever---I'd define that as quality. The system I use today is
built for end-users, power users and developers alike, and is very
flexible in that fashion---it was designed with the idea of being used
beyond its design, if that statement makes any sense. Being based on
nearly 40 years of time-tested principles of small, extensible tools
has its advantages. There will eventually be a major fundamental shift
in how things are done---possibly even including computers being able
to learn human languages like we can today teach them new programming
languages. Ah, that would be ideal. But we're not there yet.

What I do think is that somehow, for the moment, there's something that
I am not getting about the way .NET handles it. Comparing it with
gettext, it seems to be that either there is possibly something very
fundamental missing from the documentation, or possibly that I am
overlooking something pretty major.

I can at least say with a fair bit of certainty that there is a great
deal of room for growth still in giving application software the
ability to be easily used among a broad spectrum of users around the
globe. I think that an ideal system would be
language/environment/platform agnostic, easy-to-use, and provide the
ability to keep translations in sync with each other.

Incidentally, if you don't mind my asking, what do you think of the
interface for mass translations that Ubuntu uses
(https://translations.launchpad.net/)? The application software that
uses it (which is by and large still only a minimal amount, since
the front-end is relatively new) is already able to handle
locale-specific things like the formatting of dates, times, currency,
etc., and is in most cases requiring translation into at least a
handful of the 268 languages available. I am curious as to your
thoughts on it, though; it appears to provide a combination of methods
to attempt to provide enough information for translation (far more than
Google does for its effort to internationalize its services, for
example) and also appears to support multiple translation libraries
that application software may use.

--- Mike
 
As I have explained, code points are not enough to decide on what
font to use. Fonts (and glyphs) can be language-specific. Read my comment
again. Gtk (and Windows, and Mac) can do some educated guessing.
But the results are not always accurate. That is ok for some text
input, but not for the full UI.

I don't quite get what you mean. All glyphs outside of the private use
areas are well-defined and thus unambiguous. Font switching is done
based on the block of Unicode the character belongs to, and if no font
is present that contains support for that block of characters, it has to
display something that indicates that the character couldn't be found;
Gtk+ will display the hexadecimal code point in a box in that case.

Gtk+ will also not give up if a font claims to support a block of
Unicode, but doesn't have the desired character. It will continue
until it finds the character or determines that no font supports it.
The semantics for the PUAs specified in Unicode 5 are a bit different,
I think, but I can't remember how they work precisely.

If the rendering system selects an invalid result, it's due to a bug in
the font or a bug in the renderer.

--- Mike
 
In very great summary, I think that there is simply something that I
just don't comprehend about the way .NET does it

Oh, I don't argue that the way .NET does it is perfect.

I was just contrasting the two models:
- gettext style: strings hard-coded in sources, the English string is
the key, no translator-friendly context (line number and source file
name are not translator-friendly :-)
- .NET style: strings in a separate file, accessed through keys that are
completely independent, and all stored in a file that also contains
visual information. So the localizer has control and can change
all the extras (coordinates, colors, fonts, alignment, etc.)
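
As a rough illustration of the second model, a localized WinForms .resx
file mixes the text with layout data, so the localizer can adjust both
(the names and values here are made up):

<data name="printButton.Text" xml:space="preserve">
  <value>Drucken...</value>
</data>
<data name="printButton.Size" type="System.Drawing.Size, System.Drawing">
  <value>96, 23</value>
</data>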

And these are, in fact, the two main models, with models like Java
in between: strings accessed by key, UI not accessible to localizers.
Apologies about the rather long length of the post, ahead of time, and
further apologies if my apparent inability to "get it" is frustrating.
I am trying to understand what the major differences are as well as
what it is I don't get.

Sorry, maybe I also had a long day before getting home and answering :-)

I'd wonder how long it's been since that was written, since no current
system that I am aware of is restricted to ASCII or ISO-8859-xx only
character sets

You are right here. But the C world is still behind.
Food for thought: how do you specify the encoding of your C file?
The bytes in your C source are interpreted based on the OS locale
setting. What happens if you are on Linux, set the locale to en_US.UTF8,
and then a Japanese developer with ja_JP.EUC-JP tries to compile the thing?

The answer in the FAQ is still valid: there is no way to *guarantee* it.
You have to require that the build environment is set up in a certain way.
And on Windows you cannot set the system code page to UTF-8.
Plus, changing it requires a reboot. Oops!
(don't take this as a plus for Linux; that model has its own problems)

Anyway, not arguing here. Just explaining that the gettext FAQ is
still correct.

Anyone can be expected to be ignorant in a field in which they're not a
full-time-plus participant---I'm no exception to that rule.

Absolutely. I don't mean I know everything about everything.
It just happens that I know a bit more about i18n. That's all.
And I, too, am trying to learn.


It'd seem that perhaps---among other things---I don't quite see how
using an arbitrary key to look up a string helps productivity at all.

Oh, but it is not about productivity at all!!!
I think this was the main misunderstanding!
It is about the linguistic quality of the result.

In my book, the experience of my user is more important than mine.
If I design a library, I am willing to work 10 times more if this
makes it easier for the developer using my library.
And if a certain functionality would be handy for the developer, but
dangerous/incorrect/whatever for the end-user, then I will not
put that functionality in the library.

So me < developers using my lib < end-user.

.NET sacrifices the developer, gettext sacrifices the end user.
It is a pity that you cannot have both.
I have some ideas on how things might be better, but we'll see...

Well, maybe just the 5 most widely varied languages.

Agree. But that would probably mean something like French, German,
Japanese, Chinese, Arabic :-) Not an easy task for a developer
that also has to care about functionality, security, accessibility,
performance, deadlines, etc. :-)
Here the open source world has it a bit easier: no deadlines,
it is ready when it is ready. And you use it as is, no guarantees.
Until it gets successful. Then you get deadlines, and bad press
for vulnerabilities, and screen-shots in tech magazines making
fun of bad translations...
:-)

you've provoked a lot of thought on my end as well, and
for that I am truly appreciative.

In fact, you did the same for me.
It does not mean I will love gettext :-)
But it might mean that over the coming Christmas vacation I will
maybe put together an article, or maybe write a VS plugin that I
have had in mind for a while. Anyway, get something moving.


having to learn
multiple systems for the same task between different jobs.

Oh, I can understand this.

I still don't fully understand why using an
arbitrary key is better, or how this saves work overall.

It does not save work. Quite on the contrary :-(

One such example: using assemblies to hold the
information, instead of a plain text file or generated hash table from
the plain text file.

You can think of the assembly as a smarter hash table, one that can
also contain images, coordinates, font info, colors, bitmaps, etc.
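
A minimal sketch of what that looks like with the standard
System.Resources API (the base name and resource keys are hypothetical):

using System.Drawing;
using System.Globalization;
using System.Reflection;
using System.Resources;

// Satellite assemblies (MyApp.resources.dll, one per culture) are located
// automatically based on the current UI culture.
ResourceManager resources =
    new ResourceManager("MyApp.MainForm", Assembly.GetExecutingAssembly());

// Strings and serialized objects (bitmaps, sizes, fonts, ...) live side by side.
string title = resources.GetString("printDialog.Title", CultureInfo.CurrentUICulture);
Bitmap backIcon = (Bitmap)resources.GetObject("toolbar.backIcon");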

uses gettext in its internal software
used by sales agents around the world for this very reason

I bet that there is no chance of a screen-shot with a bad translation
ending up in a magazine.
But for other kinds of software this is not a choice.
Think about Microsoft, or Apple, or Adobe, or Amazon.


I think that there are many
tradeoffs between the various available systems

Amen! :-)

I'm not entirely sure what you mean by the "'3 x geek' world".
"geek developer, geek translator, geek user" :-D
(not a standard thing, I know)


I think we'll have to agree to disagree there, however; I'd argue that
the systems I use are typically of a much higher quality than most
proprietary commercial systems that I've used over the years

Oh, no argument about the technical quality of the systems.
I am only talking about translation.
Geeks are usually very good at generating good technical quality
(especially when they are not pressured by deadlines).
I have been using open-source software for a long time.
And I don't use "geek" in a dismissive way. I am one.

What I do think is that somehow, for the moment, there's something that
I am not getting about the way .NET handles it. Comparing it with
gettext, it seems to be that either there is possibly something very
fundamental missing from the documentation

I would agree that the .NET documentation is not quite clear in that area.
In some respects .NET handles this better than Win32 (.rc files), but
in other respects it is worse.

Incidentally, if you don't mind my asking, what do you think of the
interface for mass translations that Ubuntu uses
(https://translations.launchpad.net/)?

I was aware of it, but never used it. So I cannot judge.


locale-specific things like the formatting of dates, times, currency,
etc., and is in most cases requiring translation into at least a
handful of the 268 languages available.

I am almost ready to bet (without looking) that they use ICU.
(International Components for Unicode)
Open source, with a very non-restrictive license, available for C/C++
and Java: http://www.icu-project.org
 
I don't quite get what you mean. All glyphs outside of the private use
areas are well-defined and thus unambiguous.

Glyphs are not well-defined. The glyph is "the shape" of a character.

A code page (coded character set) assigns a number to the letter 'a'.
But it does not care about glyphs.
So Unicode does not care about glyphs.

'a' in Arial, and Times, and Geneva has different glyph shapes,
but the same code point.

Problem is, glyph shapes are culture-sensitive.
See http://www.unicode.org/faq/han_cjk.html#3

It sounds like a mistake and is blamed on "Han unification".
But it is not. You can see it even for plain ASCII.

In the US, 7 has two straight lines.
But in Europe you usually write it with a line in the middle and
the top line wavy: http://en.wikipedia.org/wiki/7_(number)

(the wavy line is mostly used in handwriting, but you can also
see it in some serif fonts, like this
http://farm1.static.flickr.com/11/12866591_949dace7db.jpg?v=0)

9 has a straight line in the US, but a hook in Europe.

Compare:
US: http://www.handbehindtheword.com/SamCov.jpg
Europe: http://www.cis.hut.fi/Opinnot/T-61.231/Harjoitustyot/Digits/img1.gif

If the font you use would show the European style, you would probably
recognize it as 7, but you would say "but this is not the American 7"
or something along these lines.

Exactly the same thing for Japanese/Chinese Simplified/Chinese Traditional.
Only a bit worse (I guess it is more difficult to keep the shapes of
thousands of characters in sync across 13 centuries :-)

So, back to the initial statement: a code point is not enough to determine
a font. You need locale information. You can guess (and the more text you
have, the more accurate the guess), but you cannot determine it for sure.
And if some labels show up with a Chinese font in a Japanese UI, some
users will complain.
 
Michael said:
That's news to me.

Maybe you don't use it, and maybe your business doesn't use it, and
maybe programmers who haven't a clue what GNU software is don't use it,
but that doesn't mean it's not very widely used.

Lots of programmers that know about GNU don't use it.
Virtually all free
and open source software that is internationalized uses gettext.

Absolutely not true.

In fact most of the stuff in Java, .NET and traditional Win32 does
not use it.
And
you'd probably be surprised how much GNU software there is around you.

I know about GNU software.

Arne
 