ToUpper()

Ornette · Mar 1, 2007

Hello,

I'm trying to convert strings to upper case without the accents. At the moment,
ToUpper converts é to an E with an accent...
I tried setting the English culture (en), but the result is the same...

Any ideas?

Ornette.

Ornette · Mar 1, 2007

OK, in the end I did it like this:

private string ReplaceAccents(string chaine)
{
    // The two strings are parallel: position i in strAccents maps to position i in strNoAccents.
    string strAccents = "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÌÍÎÏìíîïÙÚÛÜùúûüÿÑñÇç";
    string strNoAccents = "AAAAAAaaaaaaOOOOOOooooooEEEEeeeeIIIIiiiiUUUUuuuuyNnCc";

    char[] tAccent = strAccents.ToCharArray();
    char[] tNoAccent = strNoAccents.ToCharArray();

    // Replace each accented character with its unaccented counterpart.
    for (int i = 0; i < strAccents.Length; i++)
    {
        chaine = chaine.Replace(tAccent[i].ToString(), tNoAccent[i].ToString());
    }
    return chaine;
}

I couldn't find anything better, even if it loops a bit for nothing...

Ornette.

Ornette · Mar 1, 2007

Hello,

This is better :

byte[] bString = System.Text.Encoding.GetEncoding(1251).GetBytes(StringAvecAccents);
string stringSansAccent = System.Text.Encoding.ASCII.GetString(bString);

Reference CodePage :
http://www.microsoft.com/globaldev/reference/sbcs/1251.mspx

Ornette.

Jon Skeet [C# MVP] · Mar 1, 2007

Ornette said:
This is better :

byte[] bString =
System.Text.Encoding.GetEncoding(1251).GetBytes(StringAvecAccents);
string stringSansAccent = System.Text.Encoding.ASCII.GetString(bString );

Well, that's assuming that the encoding will find the closest match
letter. It may work now, but there's no guarantee that it will in the
future.

Ornette · Mar 1, 2007

Hello,

So how would you do it?

Ornette.

Jon Skeet [C# MVP] · Mar 1, 2007

Ornette said:
So how would you do it?

The mapping table idea you had before looked best to me, although I
wouldn't quite implement it the same way. I'd have a look up table for
every possible character, where it defaults to the Unicode character,
but for all the accented characters you care about, you specify the
non-accented version.

You'd then call ToCharArray() on the string in question, go through
each character replacing the original with the mapped character, and
then create a new string with the char array.

It does require you to manually map all the accented characters you
care about though.

My guess is that there are libraries around to do this somewhere, but I
don't know of any myself.
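
The lookup table described above might look roughly like this. It is only a sketch: the class name AccentTable is made up for illustration, the two mapping strings are borrowed from the ReplaceAccents post earlier in the thread, and only the characters listed there are mapped. Everything else defaults to itself.

```csharp
using System;

static class AccentTable
{
    // Illustrative mapping only; extend it with whichever characters you care about.
    const string Accented   = "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÌÍÎÏìíîïÙÚÛÜùúûüÿÑñÇç";
    const string Unaccented = "AAAAAAaaaaaaOOOOOOooooooEEEEeeeeIIIIiiiiUUUUuuuuyNnCc";

    // One entry per possible UTF-16 code unit.
    static readonly char[] Table = BuildTable();

    static char[] BuildTable()
    {
        char[] table = new char[char.MaxValue + 1];

        // Default: every character maps to itself.
        for (int c = 0; c < table.Length; c++)
            table[c] = (char)c;

        // Override the accented characters with their unaccented versions.
        for (int i = 0; i < Accented.Length; i++)
            table[Accented[i]] = Unaccented[i];

        return table;
    }

    public static string ReplaceAccents(string input)
    {
        char[] chars = input.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
            chars[i] = Table[chars[i]];
        return new string(chars);
    }
}
```

Calling AccentTable.ReplaceAccents(text).ToUpper() would then give the upper-case, accent-free result for the mapped characters, in a single pass over the string rather than one Replace call per table entry.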

Ornette · Mar 1, 2007

OK, thank you for your point of view.
I really agree.

As for libraries, I didn't find any either.

Have a nice day and thanks again.

Ornette.

Chris Mullins [MVP] · Mar 2, 2007

The closest thing that comes to mind is an RFC called stringprep. There are
a wide variety of stringprep profiles, and while they don't quite do what
you're looking for, they're close. Included in stringprep is a set of
mapping tables for Upper->Lower case conversions. These are (in that
context) called case-foldings, and are found in table B.2. Unfortunately,
they're Upper->Lower, not the other way around.

Stringprep:
http://www.faqs.org/rfcs/rfc3454.html

There are a number of profiles:
[Profile for International Domain Names]
http://www.rfc-editor.org/rfc/rfc3491.txt

[Profile for iSCSI names]
http://tools.ietf.org/html/draft-ietf-ips-iscsi-string-prep-01

[Profile for SASL UserNames & Passwords]
http://www.ietf.org/rfc/rfc4013.txt

[Profile for XMPP Resources]
http://www.xmpp.org/internet-drafts/attic/draft-ietf-xmpp-resourceprep-02.html

There's a C# implementation of this RFC that's part of the libidn library.
There's also a C++ & Java version.
http://www.gnu.org/software/libidn/

We've actually got a full implementation of stringprep as well - it's much
more .Net 2.0 ish than the libidn one, which is just a native C++ app that
was then ported to Java & .Net. It's found in our open-source SoapBox
Framework.

JR · Mar 2, 2007

In a more general way:

There is a Unicode database at

http://www.unicode.org/Public/UNIDATA/

You could do what you want in two steps: decompose the string to base
characters followed by accent (NFKD normalization), then remove the accents.

JR
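
A minimal sketch of those two steps in C#, assuming .NET 2.0 or later. The class and method names are hypothetical; the calls themselves (String.Normalize and CharUnicodeInfo.GetUnicodeCategory) are part of the framework. After decomposition, the accents are separate characters in the NonSpacingMark category, so they can simply be skipped.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AccentStripper
{
    public static string RemoveDiacritics(string input)
    {
        // Step 1: decompose, e.g. "é" becomes "e" followed by U+0301 (combining acute accent).
        string decomposed = input.Normalize(NormalizationForm.FormKD);

        // Step 2: copy everything except the combining marks.
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Unlike a hand-written table, this needs no list of accented characters, but it only helps for characters that Unicode actually decomposes into a base letter plus a mark.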

Chris Mullins · Mar 2, 2007

I hadn't thought of that, but it's certainly an option.

Doing the normalization in .Net 2.0 is easy enough:

string s = "test";
string normalized = s.Normalize(NormalizationForm.FormKC);

Then you can iterate over the normalized string looking for (and
removing) the accents.

--
Chris Mullins

Jon Skeet [C# MVP] · Mar 2, 2007

Chris Mullins said:
I hadn't thought of that, but it's certainly an option.

Doing the normalization in .Net 2.0 is easy enough:

string s = "test";
string normalized = s.Normalize(NormalizationForm.FormKC);

Then you can iterate over the normalized string looking for (and
removing) the accents.

Cool - I hadn't seen that 2.0 had normalization stuff. Fantastic!

Chris Mullins [MVP] · Mar 2, 2007

Yea, .Net 2.0 also added the IDN stuff (which includes the PunyCode
algorithm), which came as a complete surprise to me:

System.Globalization.IdnMapping mapping = new System.Globalization.IdnMapping();
normalized = mapping.GetAscii(normalized);

(just don't run an empty string through the IdnMapping class, or you'll get
an exception)
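
For reference, a small sketch of guarding that call. The wrapper name ToAsciiHost and the sample host name are made up for illustration. Note that IdnMapping encodes non-ASCII characters as Punycode rather than stripping accents, so it does not solve the original question by itself.

```csharp
using System;
using System.Globalization;

static class IdnExample
{
    public static string ToAsciiHost(string host)
    {
        // Skip the empty-string case mentioned above instead of letting GetAscii throw.
        if (string.IsNullOrEmpty(host))
            return host;

        IdnMapping mapping = new IdnMapping();
        return mapping.GetAscii(host);
    }
}
```

With this wrapper, ToAsciiHost("bücher.example") should yield "xn--bcher-kva.example", and an empty string is returned unchanged.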

Now, if only they exposed the BiDirectional stuff...

JR · Mar 2, 2007

You meant NormalizationForm.FormKD.

Looking into it, I see a simpler method: After normalization, use
ASCIIEncoding with DecoderReplacementFallback replacing invalid ASCII
characters (which will be the accents) with the empty string.

JR

Chris Mullins [MVP] · Mar 2, 2007

Oops. Definitely KD!

We want to do the decomposition & make our changes. Form KC would decompose &
then perform a canonical recompose - which would defeat the purpose!

I've never used (or even seen) the DecoderReplacementFallback - that's
another good idea. By now the original poster has probably given up and will
never try any of these solutions, but I think they would very cleanly do the
trick.
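
On the KC-versus-KD point, a quick way to see the difference is to compare string lengths after normalization. This is just an illustrative sketch; the helper name LengthAfter is made up.

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Returns the number of UTF-16 code units in the input after normalizing it.
    public static int LengthAfter(string input, NormalizationForm form)
    {
        return input.Normalize(form).Length;
    }
}
```

For the single character "é" (U+00E9), FormKC leaves one character because it recomposes, while FormKD gives two: the base letter "e" followed by the combining acute accent U+0301. Only the second form exposes the accent as something that can be removed.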

--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins

Chris Mullins [MVP] · Mar 2, 2007

For anyone still paying attention, the complete (and working + tested) code
is:

string s = "áäåãòä:usdBDlGXHHA";
string normalized = s.Normalize(NormalizationForm.FormKD);

Encoding ascii = Encoding.GetEncoding(
"us-ascii",
new EncoderReplacementFallback(string.Empty),
new DecoderReplacementFallback(string.Empty));

byte[] encodedBytes = new byte[ascii.GetByteCount(normalized)];
int numberOfEncodedBytes = ascii.GetBytes(normalized, 0, normalized.Length,
encodedBytes, 0);

string newString = ascii.GetString(encodedBytes).ToUpper();
MessageBox.Show(newString);
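
The same steps could be wrapped into a reusable method, roughly as below. This is a sketch, not the code from the post: the names StringHelper and ToUpperNoAccents are hypothetical, and ToUpperInvariant is used here in place of ToUpper so the result does not depend on the current culture.

```csharp
using System;
using System.Text;

static class StringHelper
{
    public static string ToUpperNoAccents(string input)
    {
        // Decompose accented letters into base letter + combining mark.
        string decomposed = input.Normalize(NormalizationForm.FormKD);

        // ASCII encoding that silently drops anything it cannot represent,
        // which includes the combining marks produced above.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));

        byte[] bytes = ascii.GetBytes(decomposed);
        return ascii.GetString(bytes).ToUpperInvariant();
    }
}
```

One caveat worth knowing: characters that have no Unicode decomposition, such as Ø, are not reduced to a base letter. They are non-ASCII after normalization too, so the empty-string fallback removes them entirely ("Øre" comes out as "RE").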

--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins

Ornette · Mar 3, 2007

Hello,

I've been following the thread, which is very interesting.
I'm thinking about writing a "StringHelper" class to do the job and re-use
it :-)

Thanks a lot for your participation in this subject!

Ornette.

Chris Mullins [MVP] · Mar 3, 2007

I'm sitting in front of my computer, and not feeling much like working (too
busy sneezing and being sick!), so I broke down and turned the solution to
this problem into a little blog entry.

http://www.coversant.com/Default.aspx?tabid=88&EntryID=30

--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins

Ornette · Mar 5, 2007

Hello,

Thank you for this really understandable article (even for French people :-)

Nice & clear !!

Ornette

Cor Ligthert [MVP] · Mar 5, 2007

Chris,

Maybe for your blog: be aware that there is a Microsoft.VisualBasic method,
StrConv(mystring, VbStrConv.Narrow).

However, I saw that it only converts 16-bit Asian languages to 8-bit, not
European languages (Latin characters).

Cor

Chris Mullins said:
I'm sitting in front of my computer, and not feeling much like working
(too busy sneezing and being sick!), so I broke down and turned the
solution to this problem into a little blog entry.

http://www.coversant.com/Default.aspx?tabid=88&EntryID=30

--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins

Ornette said:

Hello,

I following the thread which is very interesting.
I's thinking about wrinting a "StringHelper" class to do the jos and
re-use

Thanks a lot for your participation to this subject !!!

Ornette.

Chris Mullins said:

Opps. Definatly KD!

We want to do the decomposition & make our changes. For KC would
decompose & then perform a canonical recompose - which would defeat the
purpose!

I've never used (or even seen) the DecoderReplacementFallback - that's
another good idea. By now the original poster has probably given up and
will never try any of these solutions, but I think they would very
cleanly do the trick.

--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins

You meant NormalizationForm.FormKD.

Looking into it, I see a simpler method: After normalization, use
ASCIIEncoding with DecoderReplacementFallback replacing invalid ASCII
characters (which will be the accents) with the empty string.
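
A hedged sketch of that approach. The class name AsciiStrip is an
assumption. Note also that when going from string to bytes it is the
encoder-side fallback (EncoderReplacementFallback) that fires, so that
is what this sketch uses rather than DecoderReplacementFallback.

```csharp
using System;
using System.Text;

public static class AsciiStrip
{
    public static string RemoveAccents(string input)
    {
        // Step 1: split accented letters into base letter + combining mark.
        string decomposed = input.Normalize(NormalizationForm.FormKD);

        // Step 2: encode to ASCII, replacing anything that is not ASCII
        // (including the combining marks) with the empty string.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback());

        byte[] bytes = ascii.GetBytes(decomposed);
        return ascii.GetString(bytes);
    }
}
```

Be aware that this silently drops every non-ASCII character, not just
accents: a letter with no decomposition, such as "\u00D8" (Ø), simply
disappears instead of becoming "O".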

JR

Chris Mullins said:
I hadn't thought of that, but it's certainly an option.

Doing the normalization in .Net 2.0 is easy enough:

string s = "test";
string normalized = s.Normalize(NormalizationForm.FormKC);

Then you can iterate over the normalized string looking for (and
removing) the accents.

--
Chris Mullins

In a more general way:

There is a Unicode database at

http://www.unicode.org/Public/UNIDATA/

You could do what you want in two steps: decompose the string to base
characters followed by accent (NFKD normalization), then remove the
accents.
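
The two steps can be sketched as follows, with an upper-casing step
added to match the original question. This is an illustration only:
StringHelper and ToUpperWithoutAccents are hypothetical names, and
treating "accent" as "Unicode non-spacing mark" is an assumption.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringHelper
{
    public static string ToUpperWithoutAccents(string input)
    {
        // Step 1: NFKD decomposition, e.g. "é" becomes "e" + U+0301.
        string decomposed = input.Normalize(NormalizationForm.FormKD);

        // Step 2: copy everything except non-spacing (combining) marks.
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Culture-independent upper-casing of what is left.
        return sb.ToString().ToUpperInvariant();
    }
}
```

StringHelper.ToUpperWithoutAccents("\u00E9t\u00E9") returns "ETE".
Unlike an ASCII round trip, characters that are not accents at all
(Greek or Cyrillic letters, for instance) are kept rather than dropped.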

JR

Chris Mullins [MVP] said:

The closest thing that comes to mind is an RFC called stringprep.
There are a wide variety of stringprep profiles, and while they don't
quite do what you're looking for, they're close. Included in stringprep
is a set of mapping tables for Upper->Lower case conversions. These are
(in that context) called case-foldings, and are found in table B.2.
Unfortunately, they're Upper->Lower, not the other way around.

Stringprep:
http://www.faqs.org/rfcs/rfc3454.html

There are a number of profiles:
[Profile for Internationalized Domain Names]
http://www.rfc-editor.org/rfc/rfc3491.txt

[Profile for iSCSI names]
http://tools.ietf.org/html/draft-ietf-ips-iscsi-string-prep-01

[Profile for SASL UserNames & Passwords]
http://www.ietf.org/rfc/rfc4013.txt

[Profile for XMPP Resources]
http://www.xmpp.org/internet-drafts/attic/draft-ietf-xmpp-resourcepre...

There's a C# implementation of this RFC that's part of the libidn
library. There's also a C++ & Java version.
http://www.gnu.org/software/libidn/

We've actually got a full implementation of stringprep as well - it's
much more .Net 2.0 ish than the libidn one, which is just a native C++
app that was then ported to Java & .Net. It's found in our open-source
SoapBox Framework.

--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins


Chris Mullins · Mar 5, 2007

As was pointed out to me in a blog comment, my original solution to
this problem - using the ASCII encoder to avoid the string iteration -
was pretty silly.

The code in my blog has been updated to use the correct method - the
Unicode Character Info class. More details at:
http://www.coversant.com/Default.aspx?tabid=88&EntryID=30
