Translate UTF-16 into lower ASCII

  • Thread starter: Bob

Bob

Is there an easy way to translate odd UTF-8/16 characters (like letters
with umlauts, or vowels with accent marks above them) into the closest
'look-alike' lower ascii equivalent (A-Z, a-z)?

This is something that has probably been done, but I can't think of a
good search key for finding the code.
 

There may be a library out there somewhere, but I am sure that it is so
obscure that I can't find it.

Your best bet would be to try to transliterate what you can and drop
what you can't transliterate. A table-based approach would be the only
way I can see being able to do it reasonably. Maybe looking for a list
of transliterations that you could preprocess into a table would be
ideal?

--- Mike
 
System.Text.ASCIIEncoding has some methods to convert/translate chars.
You can find many examples in MSDN.
 
Mike said:
There may be a library out there somewhere, but I am sure that it is so
obscure that I can't find it.

Your best bet would be to try to transliterate what you can and drop
what you can't transliterate. A table-based approach would be the only
way I can see being able to do it reasonably. Maybe looking for a list
of transliterations that you could preprocess into a table would be
ideal?

Very likely that someone has already done this, as there are occasions
when plain 'lower ascii' must be used, like on cell phone keypads. If
someone wanted to enter the name "Andre" on a cell phone, there would
be no access to an E with the accent over it.

Now, to find it...
 
System.Text.ASCIIEncoding has some methods to convert/translate chars.
You can find many examples in MSDN.

Entirely appropriate to hear from someone with two accents in their
name. <G> Good example here, as I wouldn't know how to type your name
as you have it spelled above. And you wouldn't want to drop the two
E's...you'd translate to lower ascii E when necessary.

I presume that you're referring to the Decoder.Convert functions via
ASCIIEncoding classes. I didn't see anything that looked like it would
do this.
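
For reference, a minimal sketch of what the ASCII encoder does with accented
input, assuming the default replacement fallback (it substitutes '?' rather
than transliterating; requires using System; using System.Text):

byte[] bytes = Encoding.ASCII.GetBytes("André");     // the 'é' can't be mapped
Console.WriteLine(Encoding.ASCII.GetString(bytes));  // prints "Andr?", not "Andre"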
 
Bob said:
Very likely that someone has already done this, as there are occasions
when plain 'lower ascii' must be used, like on cell phone keypads. If
someone wanted to enter the name "Andre" on a cell phone, there would
be no access to an E with the accent over it.

Really? I have all umlauts available on my mobile (and it is not a
special or expensive model). It depends on the language setting; if it
is set to English then there are no special characters, of course. Think
about Chinese or Japanese mobiles: they do not have 2000+ tiny keys,
but I guess you can send Chinese text using the keypad somehow...

Michael
 
You can use this code, it works fine:

// Requires: using System; using System.Linq; using System.Text;
public String ToLowerASCII(String s)
{
    // Decompose accented characters (NFD), then drop the combining accent marks.
    return new String(
        s.Normalize(NormalizationForm.FormD).ToCharArray()
         .Where(c =>
             System.Globalization.CharUnicodeInfo.GetUnicodeCategory(c) !=
             System.Globalization.UnicodeCategory.NonSpacingMark)
         .ToArray());
}
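
A quick usage sketch, assuming the usings noted above and that the method is
reachable (e.g. made static); note that despite its name it only strips
accents, it does not change case:

Console.WriteLine(ToLowerASCII("Crème brûlée"));   // prints "Creme brulee"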
 

There is one small issue with this, depending on whether or not you
have non-letter Unicode input to your application: because we're not
using an ASCII encoding directly, this still permits characters with
values greater than 127, which is a no-no if you want pure ASCII.

Now, if .NET provides a general-purpose Unicode transliteration
mechanism, cool. It doesn't seem so, though, so if your input includes
anything but characters that can be stripped of accents, you're still
sending non-ASCII output. To catch that, we need to filter the allowed
characters a bit. Here's a way to do it without using LINQ (requires
"using System.Collections.Generic; using System.Globalization; using
System.Text;"):

===============
public static string StringToAscii(string str) {
    List<char> normalized = new List<char>();

    foreach(char c in
            str.Normalize(NormalizationForm.FormD).ToCharArray()) {
        if(CharUnicodeInfo.GetUnicodeCategory(c) !=
           UnicodeCategory.NonSpacingMark)
            if(c < 127)
                normalized.Add(c);
    }

    return(new String(normalized.ToArray()));
}
===============

This function now strips accents from characters, and doesn't pass any
non-ASCII character through. If your input has null characters or
control characters, I would expect that those would be preserved.

If you want to preserve ALL Unicode input, though, you will still have
to create a translation table method of some sort, such that you can do
things like:

© = (c)
® = (R)
™ = (tm)
ß = ss

And so forth.
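
For instance, here is a rough sketch of what that table-driven pass might look
like, layered on top of StringToAscii (the table entries and the
StringToAsciiWithTable name are just illustrations; a real table would be much
larger):

===============
// Sketch only: map a few known symbols to ASCII strings *before* stripping
// accents, so things like ß aren't silently dropped by the ASCII filter.
static readonly Dictionary<char, string> Translit = new Dictionary<char, string> {
    { '©', "(c)" },
    { '®', "(R)" },
    { '™', "(tm)" },
    { 'ß', "ss" }
};

public static string StringToAsciiWithTable(string str) {
    StringBuilder sb = new StringBuilder();

    foreach(char c in str) {
        string mapped;
        if(Translit.TryGetValue(c, out mapped))
            sb.Append(mapped);   // table hit: use the ASCII replacement
        else
            sb.Append(c);        // keep the character for now
    }

    // Then strip accents and any remaining non-ASCII as before.
    return StringToAscii(sb.ToString());
}
===============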

Look below for a full test program that you can examine, too, the
output of which is:

test> ./ascii.exe
Orig: áéíóú
New: aeiou

Orig: äåé®þüúíóö«áßðfgjhg'¶øœæ©xvbbñ
New: aaeuuiooafgjhg'xvbbn

Orig: ¡²³¤¼¼½¾½‘¾
New:

Orig: ¿©µvæ¢ÆÃVÃÄÉÞÖ
New: vAVAEO

--- Mike

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class EntryPoint {
    public static int Main() {
        string[] tests = {
            "áéíóú",
            "äåé®þüúíóö«áßðfgjhg'¶øœæ©xvbbñ",
            "¡²³¤¼¼½¾½‘¾",
            "¿©µvæ¢ÆÃVÃÄÉÞÖ" };

        foreach(string t in tests) {
            Console.WriteLine("Orig:\t{0}", t);
            Console.WriteLine("New:\t{0}", StringToAscii(t));
            Console.WriteLine();
        }

        return(0);
    }

    public static string StringToAscii(string str) {
        List<char> normalized = new List<char>();

        foreach(char c in
                str.Normalize(NormalizationForm.FormD).ToCharArray()) {
            if(CharUnicodeInfo.GetUnicodeCategory(c) !=
               UnicodeCategory.NonSpacingMark)
                if(c < 127)
                    normalized.Add(c);
        }

        return(new String(normalized.ToArray()));
    }
}
 
if(c < 127)
    normalized.Add(c);

Grr.

And that would be an off-by-one error. :-)

It should be:

if(c <= 127)
    normalized.Add(c);

The error won't lose anything unless you have ASCII DEL in your input,
but still.

--- Mike
 
Mihai said:
But think twice if this is really what you need. I can see no good
reason to strip the accents. You are basically corrupting the text.

Indeed. It is lossy.

Perhaps what is needed is a class that can perform a less lossy
conversion of Unicode to ASCII. There are certain Unicode glyphs that
simply cannot be represented in ASCII, such as the ideographic language
glyphs, but most symbols can be represented:

é => 'e
â => ^a
ĝ => ^g
½ => 1/2
™ => (tm)
© => (c) or (C)
® => (r) or (R)
℗ => (p) or (P)
→ => ->
π => pi
€ => EUR
£ => GBP
– => --
— => ---

And so on... this sort of transliteration can be useful for
transferring texts to ASCII that are written in a language using the
Roman alphabet with additional symbols.

That having been said, I can't see any reason to use ASCII unless there
is something in a line of dependencies that requires it. Even modern
filesystems can handle Unicode characters in filenames (NTFS uses
UTF-16, for example, and most Linux filesystems by default use UTF-8
for encoding of characters). Database systems are now equipped to
handle them, too, and they can be easily handled in most of today's
programming languages, as well.

I do recall seeing some transliteration behavior in one of the
libraries my system uses once upon a time. Someone had tried the idea
of disabling Unicode in the terminal and using transliterated
characters. The idea didn't work out so well, but it was proof that it
could be done. Unfortunately, I can't remember just which library was
doing the transliteration, and so I don't know what the source of
the translation table is, for now.

It'd be a time-consuming task to create such a table, but it wouldn't
be terribly difficult. So, if semi-lossless transliteration of Unicode
into classic ASCII is something that is important for someone, they'll
have to set aside a workweek's worth of time (maybe a little more) and
create a table-driven method to perform the transliteration.

But, I think the correct solution, if it is feasible in any way, would
be to just get rid of the software in the chain that doesn't handle
Unicode characters correctly.

--- Mike
 
Mihai said:
http://blogs.msdn.com/michkap/archive/2005/02/19/376617.aspx
But think twice if this is really what you need.
I can see no good reason to strip the accents.
You are basically corrupting the text.

Thanks for your reply Mihai.

There are a lot of reasons to have that capability:

Voice recognition programs often don't know how to match incoming
speech with extended character sets.

Conventional keyboards, and especially cell phone keypads, are going to be
a problem for entry of odd characters. And following from that, doing string
searches is tough when you don't know how to enter the search key.

Even if you present the list of names as a menu in the latter case, it's
tough to determine how they will sort, so the user would not be able to
count on a certain sequence for centering in on a name in the list.

Stuff like that.
 
Thing is, the accents are not simple "decorations". If a language
requires those accents, then one must use them. It is not acceptable to
remove them just because they are difficult to enter.

In many cases the letter + accent is a different character (with a different
sound, and which sorts differently) than the base letter.

Imagine asking American users to use P instead of R, or O instead of Q,
on the argument that "it is the same thing, just with an extra line".

Voice recognition programs are language specific (you cannot use an English
speech recognition engine to recognize German). So if you use a German
engine, it will be able to deal with the accents required by German.

The sorting is language sensitive. If the language uses accents, then it has
rules on how to sort them.
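
For what it's worth, a minimal sketch of what language-sensitive sorting looks
like in .NET (the French culture name is just an example; requires using System;
using System.Globalization):

string[] names = { "Andre", "André", "Andrew" };
// Sort using that culture's collation rules rather than raw code-point order.
Array.Sort(names, StringComparer.Create(CultureInfo.GetCultureInfo("fr-FR"), false));
foreach(string n in names)
    Console.WriteLine(n);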
 
Mihai said:
Thing is, the accents are not simple "decorations". If a language
requires those accents, then one must use them. It is not acceptable to
remove them just because they are difficult to enter.

In many cases the letter + accent is a different character (with a different
sound, and which sorts differently) than the base letter.

Imagine asking American users to use P instead of R, or O instead of Q,
on the argument that "it is the same thing, just with an extra line".

Not always. Example: If a user wants to find info on Andre Previn via
a cell phone, they won't be able to enter the string if the name has
an accent over the 'e'. Everyone knows how to pronounce 'Andre', with
or without the accent, so it makes sense to filter to lower ascii.
There are many cases like that.

IOW, extended characters often cannot be used. And even if all cell
phones had extended character keys, you'd end up having to guess which
way it was spelled. (I picked that name because I only recently saw it
with the accent. I would not have guessed that was the reason a search
failed.)

This also applies to speech recognition. The French name Andre is
found quite often in English text.

And while typing this, I have no idea how to enter the accented 'e' on
my American keyboard. Probably an odd combination of keystrokes that I
would have to memorize.
 

The US International layout is pretty similar on all platforms that I
have worked with, in particular the layout that uses "dead keys". You
do modify your typing slightly to be able to type letters like é and è,
but (IMHO, YMMV) it's well worth it.

Essentially, `, ', ", ~, and ^ become "dead". When you hit them the
first time, they do nothing. Then you hit the letter they apply to.
So for é, you'd type '+e. To type ', you type '+<SPC> (same with ",
`, ~, and ^). Then you can type things like résumé, or "ça va?".

Also, at least on the 2 or 3 most recent phones I've had, you can type
a key and then get a list of alternates---so you can press "e" and ask
for the alternates list, and you'll see things like ẽ, é, è, ê, and ë.
Given that the world is getting "smaller", it is getting easier to
enter characters that we use even infrequently, partially out of
necessity.

--- Mike
 
Bob said:
Not always. Example: If a user wants to find info on Andre Previn via
a cell phone, they won't be able to enter the string if the name has
an accent over the 'e'.

This means the cell phone is dumb.

Bob said:
Everyone knows how to pronounce 'Andre', with or without the accent, so
it makes sense to filter to lower ascii. There are many cases like that.

Because people have to put up with crappy technologies.

Bob said:
... you'd end up having to guess which way it was spelled. And while
typing this, I have no idea how to enter the accented 'e' on my American
keyboard. Probably an odd combination of keystrokes that I would have to
memorize.

Because you are not a French speaker.

The only decent reason for stripping off the accents is indeed the one you
describe: searching text produced by crappy systems.
But one should never cripple one's own application by stripping accents,
especially without giving the option to the user.

Google can do searches without accents.
But if you enclose the string/word to search in quotes, it will not
ignore them. So the user has full control, with "ignore accents" as the
default. A bit like a search dialog with "ignore case" set by default.
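
In .NET, that kind of user-controlled, accent-insensitive matching can be done
with the culture's CompareInfo. A minimal sketch (the FindIn name and the
ignoreAccents flag are just for illustration; requires using System.Globalization):

public static int FindIn(string text, string key, bool ignoreAccents)
{
    CompareOptions options = CompareOptions.IgnoreCase;
    if (ignoreAccents)
        options |= CompareOptions.IgnoreNonSpace;   // treat 'e' and 'é' as equal

    // Culture-aware substring search; returns -1 if there is no match.
    return CultureInfo.CurrentCulture.CompareInfo.IndexOf(text, key, options);
}

// FindIn("André Previn", "Andre", true) finds a match at index 0;
// with ignoreAccents false it returns -1.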

A bit like Unicode: use it throughout, go to code pages for interaction
with legacy applications.

Like any rule, there are exceptions.
Very few rules are absolute truths and must be blindly followed.
This is why my initial warning: "But think twice if this is really
what you need."
 