GNU gettext

Jean-Luc M. · Dec 2, 2008

Hi,

Does anyone have an example of using this translation tool with C#, please?

Thanks,

Michael B. Trausch · Dec 2, 2008

Does anyone have an example of using this translation tool with C#, please?

The manual for GNU gettext (as well as a link to the latest version of
it) can be found on the GNU Web site:

http://www.gnu.org/software/gettext

The manual shows you how to use it with C, Java, and C# (and possibly
more languages than that, I don't know). In any case, it also comes
with the C# code that you need to access gettext translation files.
The hard part isn't using the translations, really, it's creating them
in the first place. The same utilities are used for that, though,
regardless of what (computer programming) language you're using.

--- Mike

Arne Vajhøj · Dec 3, 2008

Jean-Luc M. said:
Does anyone have an example of using this translation tool with C#, please?

I will strongly recommend that you use the .NET way of doing
I18N in a .NET app.

Arne

Michael B. Trausch · Dec 3, 2008

I will strongly recommend that you use the .NET way of doing
I18N in a .NET app.

Might I ask why? What's the strength of .NET's built-in way of i18n
compared to GNU gettext?

--- Mike

Mihai N. · Dec 3, 2008

Might I ask why? What's the strength of .NET's built-in way of i18n
compared to GNU gettext?

I am not sure about .NET's strengths, but I know about gettext's weaknesses.
The major one: the English text is the key.
And that does not work.
In Latin languages (and not only those) one and the same English
word is translated differently for labels/titles (descriptions)
vs. buttons/radio buttons (actions/commands).
Example: "Print" in French is "Impression" or "Imprimer".
Since you can only have one key in the gettext dictionary, and that
is "Print", you cannot translate it in several ways, depending on context.
The same problem arises for many other reasons: "Scan" is translated differently
if it is about scanning the disk vs. scanning a piece of paper,
many things in English can be both verbs and nouns, etc.
Then you have gender, number, case.

So, crappy quality by design, guaranteed.
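
A minimal sketch of the collision described above, written in Python purely for illustration. The dictionary and the `translate` helper are hypothetical stand-ins for a catalog keyed only by the English text; they are not gettext's API.

```python
# A catalog keyed by the English source text, as in a msgid-only lookup.
french = {}
french["Print"] = "Impression"  # wanted for the dialog title
french["Print"] = "Imprimer"    # wanted for the button; replaces the title entry


def translate(catalog, msgid):
    """Return the translation for msgid, or msgid itself if none exists."""
    return catalog.get(msgid, msgid)


# Both call sites pass the same key, so both get the same French word.
print(translate(french, "Print"))
print(len(french))
```

With a single key there is room for only one of the two French words, whichever was stored last.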

Michael B. Trausch · Dec 3, 2008

I am not sure about .NET's strengths, but I know about gettext's weaknesses.
The major one: the English text is the key.
And that does not work.
In Latin languages (and not only those) one and the same English
word is translated differently for labels/titles (descriptions)
vs. buttons/radio buttons (actions/commands).
Example: "Print" in French is "Impression" or "Imprimer".
Since you can only have one key in the gettext dictionary, and that
is "Print", you cannot translate it in several ways, depending on
context. The same problem arises for many other reasons: "Scan" is
translated differently if it is about scanning the disk vs. scanning a
piece of paper, many things in English can be both verbs and nouns, etc.
Then you have gender, number, case.

So, crappy quality by design, guaranteed.

The documentation for GNU gettext does cover these sorts of issues in
the section on preparing program strings for translation. The major
reason for using an English key is that there's no real advantage to
having a surrogate key that would be used in its place. FWIW,
surrogate keys are far overused everywhere in programming these days.

That having been said, translatable strings can have comments
associated with them so that when they're added to translation resource
files, the context is clearly established. This minimizes error when
the translation is going to be made. As with any system, if it's used
improperly, it'll result in less than accurate results.

The largest reasons I can think of to use gettext in a .NET application
include:

(a) because it's vastly multiplatform,
(b) it's highly multilanguage, and
(c) it's much more likely that a translator is going to be familiar
with gettext, especially if working on a free software project
(and thus will be aware of some of the pitfalls that those of us
who only speak English will create for them, usually inadvertently).

I'd use gettext for those reasons under .NET, for the same reason that
I use doxygen for documentation generation (and its associated
syntax of /** ... */ for docu-comments). Though for doxygen, it's
mostly because doxygen's style of docu-comment is much more readable
inline and much more maintainable, IMHO, and particularly in
mixed-language environments.

Of course, all of that having been said, the ultimate reason for doing
something a specific way is to be consistent with others in a group.
If independently writing software, I am of the mindset of doing
whatever works best. If working in a group (or writing for a target
group) then doing things the way that they do them obviously gives one
the upper hand, so to speak.

--- Mike

J-L · Dec 3, 2008

Michael B. Trausch just wrote:

The manual for GNU gettext (as well as a link to the latest version of
it) can be found on the GNU Web site:

http://www.gnu.org/software/gettext

The manual shows you how to use it with C, Java, and C# (and possibly
more languages than that, I don't know). In any case, it also comes
with the C# code that you need to access gettext translation files.
The hard part isn't using the translations, really, it's creating them
in the first place. The same utilities are used for that, though,
regardless of what (computer programming) language you're using.

--- Mike

I've already checked out the manual on the site, but the solution it
suggests is to use .resources and the GettextResourceManager; I can't use
this solution because my program must be integrated within an
already-existing suite which uses gettext with .po and .mo files.

What I am trying to do is to use my .NET program to interface directly
with libintl3.dll from gettext 0.14.

Calling the function in the dll doesn't seem to raise any
errors/exceptions, but doesn't seem able to find the translation (e.g.
in French) either. Perhaps due to an error in calling the dll (any
further initialization needed before performing the call?) or something
to do with the structure of my files in the local sub-folders? Any
input would be greatly appreciated!
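
One thing worth checking is the catalog layout. As far as I know, gettext-style lookups expect the compiled catalog at `<localedir>/<lang>/LC_MESSAGES/<domain>.mo`, and the native C API wants `bindtextdomain()` and `textdomain()` called before `gettext()`. A missing catalog raises no error; the untranslated string simply comes back. The sketch below uses Python's standard `gettext` module only to illustrate that layout and that silent fallback. The domain name `myapp`, the `locale` directory, and the helper names are assumptions, and the Windows DLL may differ in details.

```python
import gettext
import os


def expected_mo_path(localedir, lang, domain):
    """Path where a gettext-style lookup expects the compiled catalog."""
    return os.path.join(localedir, lang, "LC_MESSAGES", domain + ".mo")


def diagnose(domain, localedir, lang):
    """Return (catalog path or None, result of translating "Print")."""
    found = gettext.find(domain, localedir, languages=[lang])
    # fallback=True mirrors the silent behaviour described above: no
    # exception is raised, the msgid is returned untranslated.
    translator = gettext.translation(
        domain, localedir, languages=[lang], fallback=True
    )
    return found, translator.gettext("Print")


if __name__ == "__main__":
    print(expected_mo_path("locale", "fr", "myapp"))
    print(diagnose("myapp", "locale", "fr"))
```

If the first element of the result is `None`, the file is not where the lookup expects it, which would match the "no error, no translation" symptom.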

Michael B. Trausch · Dec 3, 2008

Calling the function in the dll doesn't seem to raise any
errors/exceptions, but doesn't seem able to find the translation
(e.g. in French) either. Perhaps due to an error in calling the dll
(any further initialization needed before performing the call?) or
something to do with the structure of my files in the local
sub-folders? Any input would be greatly appreciated!

Without seeing any of the code, there's very little that I can give you
for input; the only thing that I can tell you is that you need to
invoke the native DLL's entrypoints the same way that the program
you're integrating with does. Do you have the source code to that
program? If so, you'll want to follow the setup, use, and tear-down of
the DLL---since I don't use the Windows version, I don't know what the
differences are from being on a Linux system, if there are any.

--- Mike

Arne Vajhøj · Dec 4, 2008

Michael said:
Might I ask why? What's the strength of .NET's built-in way of i18n
compared to GNU gettext?

That it is what the future maintenance programmer will know.

Arne

Michael B. Trausch · Dec 4, 2008

That it is what the future maintenance programmer will know.

Eh, I suppose. There is a good chance that a programmer that has been
around for a while is going to also have encountered GNU gettext, since
it's overwhelmingly popular outside of the confines of Windows,
though. I think that either is acceptable depending on circumstances,
though I'd personally choose gettext.

--- Mike

Mihai N. · Dec 4, 2008

The documentation for GNU gettext does cover these sorts of issues in
the section on preparing program strings for translation.

I don't see how this solves anything.
It is not that you don't have the info (solved by comments)

It is that you have both situations in the same application.
Imagine a dialog with the title "Print" and a button (in the same dialog)
with the text "Print". And a menu option "Print"

The button must be translated as "Imprimer" and the title "Impression"
It does not help that you have comments.
The message catalog can only have one key, and that is the English string
"Print". How do you map from one to many?

The major
reason for using an English key is that there's no real advantage to
having a surrogate key that would be used in its place.

This is the real advantage:
English
dlgTitlePrint = Print
btnPrint = Print
French
dlgTitlePrint = Impression
btnPrint = Imprimer
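
A small sketch of how such identifier-keyed catalogs could be looked up, in Python for illustration only. The `CATALOGS` structure and the `lookup` helper are hypothetical; the keys and strings are the ones listed above.

```python
# One catalog per language, keyed by identifier rather than by English text.
CATALOGS = {
    "en": {"dlgTitlePrint": "Print", "btnPrint": "Print"},
    "fr": {"dlgTitlePrint": "Impression", "btnPrint": "Imprimer"},
}


def lookup(lang, key):
    """Return the string for key in lang, falling back to English."""
    catalog = CATALOGS.get(lang, {})
    return catalog.get(key, CATALOGS["en"][key])


print(lookup("fr", "dlgTitlePrint"))
print(lookup("fr", "btnPrint"))
print(lookup("de", "btnPrint"))  # no German catalog, so English is used
```

The two English strings are identical, yet each identifier still gets its own French translation.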

FWIW,
surrogate keys are far overused everywhere in programming these days.

They are used (not quite enough) because they solve real problems.
gettext looks like it solves the localization problem for lazy
programmers who don't understand how other languages work.
But it *guarantees* bad-quality results.

Mihai N. · Dec 4, 2008

Does anyone have an example of using this translation tool with C#, please?

Other arguments *against* gettext vs. .NET standard way:
- Many localization tools can show previews of .NET resources,
so a translator can see the dialog or menu to be translated.
This gives better context than any standalone string (even with comments).
- Localization goes beyond plain text.
Many other elements must be changed: fonts, font sizes (for languages like
Chinese, Japanese, Korean), the UI should be mirrored (for languages like
Arabic or Hebrew), sometimes colors or images.
- The size of the forms should change to fit the size of the new text
(very often in length, sometimes in height).
The X-Windows GUI model allows for automatic resizing (with a rich
collection of layout managers).
.NET has something, but it is not that rich. And you will have to design
your forms using those techniques to achieve auto-layout; it does not
come for free.
It is still a good idea to do it if you translate into many languages,
because it saves resizing costs after translation.

All this added to my previous explanations for Michael.

Michael B. Trausch · Dec 4, 2008

I don't see how this solves anything.
It is not that you don't have the info (solved by comments)

It is that you have both situations in the same application.
Imagine a dialog with the title "Print" and a button (in the same
dialog) with the text "Print". And a menu option "Print"

The button must be translated as "Imprimer" and the title "Impression"
It does not help that you have comments.
The message catalog can only have one key, and that is the English
string "Print". How do you map from one to many?

You can differentiate between them by making the key different. Yes,
the key is in English. But if you have words that are used differently
and thus translated differently, you're going to have them tagged in
one way or another. It's very easy to use a non-printing Unicode
character to "tag" such differences, and then mark in the comments
which instance is which. There are other ways you can accomplish the
task, as well---it just takes a little bit of imagination.

The *other* solution would be to use more free-standing text in things
like window titles and menu actions. IMHO, using "Print" for a dialog
title and a menu item is rather silly.

And if you don't like either, then you can use "Print…" as the key
for the menu item (since it ought to have an ellipsis anyway) and
"Print" for the dialog title.

There are three quick solutions, and I am sure that there are many,
many more to choose from given a little bit more thought on the issue.
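
For what it's worth, later gettext releases (0.15 onward, as far as I know, so newer than the 0.14 mentioned earlier in the thread) add a message context, `msgctxt`, for this kind of case. My understanding is that the lookup key then becomes the context and the msgid joined by an EOT byte. The sketch below imitates that idea with a plain dictionary in Python; `make_key`, `pgettext`, and the context names are hypothetical and this is not the real catalog format.

```python
CONTEXT_SEPARATOR = "\x04"  # EOT character joining context and msgid


def make_key(msgid, context=None):
    """Build a lookup key; without a context the key is just the msgid."""
    if context is None:
        return msgid
    return context + CONTEXT_SEPARATOR + msgid


FRENCH = {
    make_key("Print", "dialog title"): "Impression",
    make_key("Print", "button"): "Imprimer",
}


def pgettext(catalog, context, msgid):
    """Look up msgid within a context, falling back to the msgid itself."""
    return catalog.get(make_key(msgid, context), msgid)


print(pgettext(FRENCH, "dialog title", "Print"))
print(pgettext(FRENCH, "button", "Print"))
```

The English text stays readable in the source, while the context plays the role that an invisible "tag" character would otherwise play.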

They are used (not quite enough) because they solve real problems.
gettext looks like it solves the localization problem for lazy
programmers who don't understand how other languages work.
But it *guarantees* bad-quality results.

No, they are _overused_. Very often, they are nothing more than a
shortcut to a solution---an alternative to a better solution that
doesn't require the use of a surrogate key. You find unnecessary
surrogate keys in databases of all sorts today, and why? Because the
person(s) who developed the database didn't bother to use the real
unique key.

Now, for _some_ data tables, yes, you need a surrogate key. Hell,
there are sometimes business practices which dictate the necessity of a
surrogate key. But surrogate keys are still overused far too often.
There are programmers out there that would advocate that every row in
every database should be tagged with a UUID, even though this is
extremely wasteful and most of the time there is a better, natural key
that can be used. However, here we're leaving the field of programming
and getting more into the DBA side of things. A DBA worth his or her
weight in gold won't design a database to use a surrogate key when it
can use a natural one unless there is a _very_ good reason for it (say,
necessary denormalization breaks the natural key across multiple
relations).

--- Mike

Michael B. Trausch · Dec 4, 2008

- Localization goes beyond plain text
Many other elements must be changed: fonts, font sizes (for
languages like Chinese, Japanese, Korean), UI should be mirrored (for
languages like Arabic, or Hebrew), sometimes colors or images.

This problem is solved by using fonts which support the entire Unicode
spectrum, or by using a system that will handle automatic
font-switching for you. If the system has good support for i18n, it'll
automatically handle RTL text (and even LTR/RTL mixed text) as part of
the toolkit. If your toolkit doesn't do that, and you have to work
with these issues, you should probably consider changing toolkits.

- The size of the forms should change to fit the size of the new text

This should be done automatically; UI sizes shouldn't be hard coded.
If they are, there is a problem. The overall layout should be
specified, and a _minimum_ size should probably be specified, but the
user interface should adjust for its environment, and that means that
nothing in a frame should be absolutely positioned except for the menu
bar and the status bar and the scroll bars.

--- Mike

Arne Vajhøj · Dec 5, 2008

Michael said:
Eh, I suppose. There is a good chance that a programmer that has been
around for a while is going to also have encountered GNU gettext, since
it's overwhelmingly popular outside of the confines of Windows,
though.

If I were to guess how many C# programmers have experience
with gettext, I would guess <2%.

It is not a very widely used tool.

Arne

Michael B. Trausch · Dec 5, 2008

If I were to guess how many C# programmers have experience
with gettext, I would guess <2%.

It is not a very widely used tool.

That's news to me.

Maybe you don't use it, and maybe your business doesn't use it, and
maybe programmers who haven't a clue what GNU software is don't use it,
but that doesn't mean it's not very widely used. Virtually all free
and open source software that is internationalized uses gettext. And
you'd probably be surprised how much GNU software there is around you.

--- Mike

Mihai N. · Dec 5, 2008

It's very easy to use a non-printing Unicode
character to "tag" such differences, and then mark in the comments
which instance is which.

Really?
Can you point me to something in the gettext documentation that explains
this technique?
This is what you do in your code, for every localizable string?
Remember: gettext was designed for plain C, the keys are C strings (char*),
so they are not Unicode. Same as the typical C source.

And can you give me a list of such Unicode characters?
How many of them can you give me?
Are they enough to eliminate duplicated keys in a big piece of software?
(let's say 400.000 words)

There are other ways you can accomplish the
task, as well---it just takes a little bit of imagination.

This is a kludge around gettext's bad design.
Exactly my point.

The *other* solution would be to use more free-standing text in things
like window titles and menu actions. IMHO, using "Print" for a dialog
title and a menu item is rather silly.

And if you don't like either, then you can use "Print…" as the key
for the menu item (since it ought to have an ellipsis anyway) and
"Print" for the dialog title.

Thing is, as a developer you have no clue what every language requires.
Titles/buttons was just an example of what can go wrong.

Scan (Scan disk vs Scan paper) is usually translated differently,
because they have different meanings, not because of the context.

English:
New
Spanish:
Nuevo (masculine singular)
Nuevos (masculine plural)
Nueva (feminine singular)
Nuevas (feminine plural)

Every language has its own characteristics.
If you translate your application into 30 languages, you don't want to "fix"
your English keys every time you receive a bug report.
Just think about it: you translate into 30 languages, you get a bug report for
language 31, you change a key and update all 30 language catalogs.

There are three quick solutions, and I am sure that there are many,
many more to choose from given a little bit more thought on the issue.

If they are all as bad as those three, don't bother.

Ok, I will stop here.

I have worked in localization and internationalization for more than 11 years,
and I have seen hundreds of projects from tens of companies.
Stuff translated into tens of languages, on a lot of the platforms
out there (from Win and Mac to Palm OS), using every standard solution,
and quite a few non-standard "quick solutions".

If you listen, fine; if you don't, fine again.
I have nothing to lose.

Mihai N. · Dec 5, 2008

This problem is solved by using fonts which support the entire Unicode
spectrum,

There is no such thing.
Some tables in the OpenType specs are limited to 65.536 entries,
and there are values limited to 16 bits.
Unicode has more than 100.000 characters allocated.

Because of ligatures and complex shaping many unicode code points
need more than one glyph, so a 64K glyph limit means that in fact
the font can cover less than 64K Unicode code points.

Not to mention that the same Unicode code point sometimes requires
different glyphs depending on language.
Because of that you will need separate Japanese, Chinese Simplified,
Chinese Traditional fonts, even if they share a big number of characters.

Try rendering Japanese text in a Simplified Chinese font,
and show it to a native Japanese speaker.

or by using a system that will handle automatic
font-switching for you.

Again, no such thing.

This is something you will invent, because it is not available.
And it will be a quick (and buggy) solution, like the others.

If the system has good support for i18n, it'll
automatically handle RTL text (and even LTR/RTL mixed text) as part of
the toolkit.

It is not only about text, it is about the full control.
The scroll bar goes to the left, radio buttons will have the (x) on the
right, and (sometimes) you will have to mirror the bitmaps of the buttons.

Example: in the Arabic IE toolbar the left and right arrows (back/forward)
will have to be mirrored (because in Arabic "forward" is "to the left")
But you don't want to mirror the bitmap with the IE logo.

Do you know of such a system, that can automatically know if a bitmap
should be mirrored or not?

Mihai N. · Dec 5, 2008

Maybe you don't use it, and maybe your business doesn't use it, and
maybe programmers who haven't a clue what GNU software is don't use it,
but that doesn't mean it's not very widely used. Virtually all free
and open source software that is internationalized uses gettext. And
you'd probably be surprised how much GNU software there is around you.

Any idea how much of this software is developed in C#?

Michael B. Trausch · Dec 5, 2008

Really?
Can you point me to something in the gettext documentation that
explains this technique?

The gettext documentation explains how the keys work. Furthermore, the
translation files give you a pointer to the origin of the message so
that you can see where it's used to gain a sense of context within the
program.

Given how the keys work, and given that the utilities that work with
the files are easily able to handle Unicode (obviously) and that modern
user interfaces are able to handle Unicode, it follows that you can use
some technique such as this. It wouldn't mess with screen readers since
they don't care about non-breaking whitespace, and non-breaking
whitespace doesn't get displayed unless you ask a tool to show it to
you. Now, that means that you might need to take a closer look at the
translation files, or that you'd need to write a small utility to help
you to manage the files, but it's certainly doable. There are plenty
of other ways to mark things like that, too. Heck, you could use a
padding of NUL bytes between the first " and the beginning of the
string or the end of the string and the last ", assuming that the
program that reads them doesn't assume that there are no NUL bytes. I
don't know how you'd do this on Windows applications, but Emacs (as an
example) very easily lets you input arbitrary Unicode characters that
are not on your keyboard. Most software on Linux supports doing that
easily, too; I'd imagine that there is some way of doing this on
Windows, as well.

One could also, if they _really_ wanted, actually modify gettext
slightly to use a surrogate key in the source code. However, that'd
really put a damper on its usability, due to the way strings are
extracted from source code. But that's doable, too. (As an aside, I
didn't need the documentation to come up with that idea.)

This is what you do in your code, for every localizable string?
Remember: gettext was designed for plain C, the keys are C strings
(char*), so they are not Unicode. Same as the typical C source.

No, but that's the great thing about sensible modern systems: They use
UTF-8. Lucky for us, one can use UTF-8 in C source code without an
issue, since C treats a char * as a sequence of bytes. The user
interface will put the bytes together and see them for what they are.
If you need to do things like read multiple transformations of Unicode,
then just use a C library that can work and convert between them.
Furthermore, since UTF-8 is immensely more efficient than any other
transformation of Unicode for _most_ use cases where storage needs are
mostly ASCII characters, it's very natural to use.
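
To make the byte-level point concrete, here is a small sketch in Python (used only for illustration; the variable names are hypothetical). It shows what a C `char *` would hold for the two keys discussed earlier in the thread, assuming the source file is saved as UTF-8.

```python
menu_key = "Print\u2026"  # "Print" followed by U+2026 HORIZONTAL ELLIPSIS
title_key = "Print"

menu_bytes = menu_key.encode("utf-8")
title_bytes = title_key.encode("utf-8")

# ASCII characters take one byte each in UTF-8; the ellipsis takes three.
print(title_bytes, len(title_bytes))
print(menu_bytes, len(menu_bytes))

# The same ASCII text costs two bytes per character in UTF-16.
print(len(title_key.encode("utf-16-le")))
```

The two keys differ as byte sequences, so a byte-oriented lookup treats them as distinct entries.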

Now, if the system were written in Python, I'd be a bit dubious about
using any of those tricks---Python does things really weirdly with
strings and interferes with you all over the place. Also, though, FTR,
while I have worked with projects that actively use gettext, there is
little point to me using it: I work with a very niche audience, and
don't have a need for i18n other than ensuring that it'd be easy to do
if the client changes its mind later. What _I_ do with my software is
I make the strings absolutely clear. I don't transliterate words and
break their meaning by mutilating them, I don't use action verbs
outside of menus, and I try to make sure that the UI is not written in
what you'd see in everyday colloquial English. I am only a fluent
speaker in English, but I've learned enough of two other languages to
understand the sorts of situations that are awkward to translate, and
so I avoid them.

In fact, that's probably the _best_ way to use gettext. Use formal,
well-written, unambiguous English. Then, when adopting it in a
previously non-internationalized source base, spend the time that one
might spend creating surrogate keys instead cleaning up English
language strings.

Now, if that doesn't suit you, that's fine; freedom is for everybody,
and freedom includes choice. Obviously so, since proprietary software
still thrives today. But when it comes to programming, there are a
few things that are important in today's strongly heterogeneous
computing world: (1) simple is better until it's not, and (2) portable
knowledge and portable software systems are among the best tools one
has in today's world, especially when it is open and maintained by
masses of people so that the longevity of the software is ensured as
long as there is someone interested in it.

And can you give me a list of such Unicode characters?

Take a look at the Unicode standard. It's freely available on the
Internet.

How many of them can you give me?
Are they enough to eliminate duplicated keys in a big piece of software?
(let's say 400.000 words)

I'm not sure why you'd count in words, when strings are the important
thing. There are 6,992 strings in GCC (1 string in every approx. 840
lines of code, roughly).

This is a kludge around gettext's bad design.
Exactly my point.

The primary goal of the design was to make it easy to adopt and use in
existing software, since that was (and largely still is) the primary
use case scenario. Most software isn't hooked up with i18n from the
beginning, and some never get hooked up with it at all. The barrier to
entry is pretty light if you have only the need for a translator to run
a program, extract the strings, and make a few modifications to the
source code to gain the use of the translation catalogs.

Thing is, as a developer you have no clue what every language
requires. Titles/buttons was just an example of what can go wrong.

Scan (Scan disk vs Scan paper) is usually translated differently,
because they have different meanings, not because of the context.

Yes; and again, this depends on the language. "Search" is a more
proper verb to use when looking for something on a disk, be that bad
blocks or the file that you think you might've deleted last week by
accident. Scan is appropriate for use with a scanning device, be that
a bar code reader or an optical document scanner. This is a prime
example, really, because that's one thing that many people who speak
and write English on a regular basis fail to contemplate: word choice.

English:
New
Spanish:
Nuevo (masculine singular)
Nuevos (masculine plural)
Nueva (feminine singular)
Nuevas (feminine plural)

Every language has its own characteristics.

Indeed it does. I also think that if you're going to have a new
something, you should know what that new something is; most
applications have "New" as an option in the "File" menu, but "new file"
doesn't make any sense from a usability standpoint---not when you're an
end user and you're using an office suite and what you really want is a
new file containing a new text document or new spreadsheet document.
Menu items in most computer software aren't as clear as they ought to
be for native English speakers, let alone translators. You seem to
want to pinpoint that as a gettext problem: no, it's a developer
problem. Developers have, for years, completely overestimated the
ability of a regular end-user to grasp a user interface. To this
_day_ I get calls from people that are confused about their
software---and there is no language barrier. Every developer should be
well-practiced in applying language elegantly, not just putting it
there. This is a flaw of developers, and unless those developers are somehow
prompted to think about it and trained to deal with that issue, it'll
never go away. Sounds like something someone ought to do, if you ask
me.

If you translate your application into 30 languages, you don't want to
"fix" your English keys every time you receive a bug report.
Just think about it: you translate into 30 languages, you get a bug report
for language 31, you change a key and update all 30 language catalogs.

Which is very easy to do on any reasonably equipped operating
system---it's a very simple search/replace operation. Surely Windows
has something like sed or awk, doesn't it? Updating even 100 language
catalogs can be done in well under a minute using them unless your
catalogs are on a very slow network drive.
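
A rough sketch of that kind of search/replace, in Python rather than sed for illustration. The `rename_msgid` helper is hypothetical and assumes simple single-line `msgid "..."` entries with no escapes; multi-line msgids and the matching change in the source code are not handled, and the renamed entries would still want a translator's review.

```python
import re


def rename_msgid(po_text, old, new):
    """Rename one single-line msgid in the text of a .po catalog."""
    pattern = re.compile(r'^msgid "' + re.escape(old) + r'"$', re.MULTILINE)
    # A function replacement avoids backslash handling in the new text.
    return pattern.sub(lambda match: 'msgid "' + new + '"', po_text)


sample = (
    'msgid "Scan"\n'
    'msgstr "Analyser"\n'
    '\n'
    'msgid "Scan disk"\n'
    'msgstr "Analyser le disque"\n'
)

print(rename_msgid(sample, "Scan", "Search"))
```

Only the exact key is renamed; the longer "Scan disk" entry is left alone because the pattern is anchored to the whole line.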

If they are all as bad as those three, don't bother.

Ok, I will stop here.

I have worked in localization and internationalization for more than 11
years, and I have seen hundreds of projects from tens of companies.
Stuff translated into tens of languages, on a lot of the platforms
out there (from Win and Mac to Palm OS), using every standard
solution, and quite a few non-standard "quick solutions".

There are lots of g11n, i18n, and l10n experts out there. Many of them
have broad experience in a good number of systems, and most of them
that I have met are native speakers of more than one language (a
variant of English and another language is the most common I've run
into). You're the first that I've seen actively complain about gettext,
to be honest. That's fine---everyone has their own opinions about
things.

Personally, I'll take a single, capable, and portable system and use
that over any single-environment system, unless I have extremely strong
reason to do the opposite. In more than 20 years, I've only once had a
really good reason to pick a single-environment system, and even then,
eventually portability was required and that choice came back to take a
chunk out of my ass. As the saying goes, if one hasn't the time to do
it right, they'd best have the time to do it over---and I'll spend a
bit of extra time up front automating things in a portable fashion to
save the great expense later of having to suddenly become portable.
I've never had a problem doing it that way.

If you listen, fine; if you don't, fine again.
I have nothing to lose.

Agreed; it's time to end the thread. You seem to be frustrated.
Please accept my apologies if I've somehow caused that.

--- Mike
