UTF-8 and combining diacritical characters

jm · Dec 19, 2008

Hi All,

I'm trying to read a UTF-8 file where diacritics are coded using
combining characters, i.e. the accented character is represented as the
unaccented character followed by the combining accent character.

Example: é is 65 CC 81

It seems that the UTF8 encoding does not handle this:
Dim sr As IO.StreamReader = New System.IO.StreamReader(fname, Encoding.UTF8)

I do not get my accented characters.

Is there another encoding that can cope with such a UTF-8 file?

Thanks,
Jean-Michel

Anthony Jones · Dec 19, 2008

When you view a UTF-8 file as if it were an ANSI file you will often see the
characters you expect, but preceded by other characters.

The reason for this is the way UTF-8 encodes Unicode characters and the
code points in the Unicode range chosen for the more common Latin accented
characters. You should not conclude that the appearance of the expected
character when viewed as ANSI is intended; it's merely a coincidence and is
not significant to the encoding.

The UTF8 encoding definitely does handle UTF-8 correctly.

The problem is that your initial encoding for é is wrong. é is c3 a9 in
UTF-8 encoding. Most, if not all, of the characters in the upper portion of
the ISO-8859-1 set will encode as 2 bytes in UTF-8. You would need to move
further up the Unicode character set to get to the point where 3 bytes are
needed to encode a character in UTF-8.
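
As a quick check of the byte counts mentioned above, here is a minimal C# sketch. The class name Utf8ByteCounts, the helper Hex and the sample characters are illustrative choices only, not anything taken from the file in question. It simply asks Encoding.UTF8 for the bytes of a few characters:

```csharp
using System;
using System.Text;

public static class Utf8ByteCounts
{
    // Returns the UTF-8 bytes of a string as space-separated upper-case hex.
    public static string Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");
    }

    public static void Main()
    {
        // U+0065 'e' is in the ASCII range: 1 byte.
        Console.WriteLine(Hex("e"));        // 65
        // U+00E9, precomposed e-acute, upper half of ISO-8859-1: 2 bytes.
        Console.WriteLine(Hex("\u00E9"));   // C3 A9
        // U+20AC, euro sign, above U+07FF: 3 bytes.
        Console.WriteLine(Hex("\u20AC"));   // E2 82 AC
    }
}
```

Code points U+0080 to U+07FF take 2 bytes in UTF-8, and U+0800 to U+FFFF take 3.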

jm · Dec 19, 2008

Anthony Jones wrote:

The problem is that your initial encoding for é is wrong. é is c3 a9 in
UTF-8 encoding. Most, if not all, of the characters in the upper portion
of the ISO-8859-1 set will encode as 2 bytes in UTF-8. You would need to
move further up the Unicode character set to get to the point where 3 bytes
are needed to encode a character in UTF-8.

Hi,

You say it is wrong, but when I open the file in Word or Firefox,
I get the right character (é) displayed.

Should I understand that Encoding.UTF8 in .NET does not handle combining
characters and that there is no workaround?

Actually, that is all I need to know, so I can tell the people who
created this file that we cannot handle it because of a .NET limitation.

Regards,
Jean-Michel

Hans Kesting · Dec 19, 2008

If you normalize the string to "FormC", then you will get a printable
string:
string s = ...
s = s.Normalize(System.Text.NormalizationForm.FormC);
This will combine the base characters and the non-spacing accents into
accented characters.

Apparently your text was in "FormD".
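
To make this concrete, here is a small C# sketch. The class and method names (NormalizeDemo, DecodeAndCompose) are made up for the example, and it decodes the three bytes from the original question directly instead of reading a real file. It shows that Encoding.UTF8 decodes 65 CC 81 into two chars, 'e' followed by U+0301 (combining acute accent), and that Normalize with FormC then yields the single char U+00E9:

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Decodes UTF-8 bytes, then composes base letters with their combining marks.
    public static string DecodeAndCompose(byte[] utf8Bytes)
    {
        string decoded = Encoding.UTF8.GetString(utf8Bytes);
        return decoded.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        byte[] bytes = { 0x65, 0xCC, 0x81 };   // the bytes from the question

        string decomposed = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(decomposed.Length);  // 2: 'e' + U+0301
        Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormD)); // True

        string composed = DecodeAndCompose(bytes);
        Console.WriteLine(composed.Length);          // 1
        Console.WriteLine(composed == "\u00E9");     // True
    }
}
```

When reading from the file, the same Normalize call can be applied to each non-null line returned by the StreamReader.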

Hans Kesting

jm · Dec 19, 2008

Hans Kesting wrote:

If you normalize the string to "FormC", then you will get a printable
string:
string s = ...
s = s.Normalize(System.Text.NormalizationForm.FormC);
This will combine the base characters and the non-spacing accents into
accented characters.

Apparently your text was in "FormD".

That seems to work OK. I had never heard of this normalize stuff!

Many thanks,
Jean-Michel

jm · Dec 19, 2008

Mark Rae [MVP] wrote:

http://www.google.co.uk/search?sour...1T4GPTB_en-GBGB298GB298&q="C#"+Text+Normalize

Thanks for the link to Google. I had never heard of Google. This is
great. So, if you know you have to normalize a string, you can enter
text and normalize and you get web pages about this.
Very interesting indeed.

Jean-Michel