DB --> XML --> DB

i_robot73 · May 28, 2009

Pretty new to XML, so forgive my ignorance.

I have a string in a DB
'High warning (red) > 125%, Low warning (red) ≤ 12%, Normal
(green) > 12 to ≤ 110%, and High caution (yellow) > 110 to ≤ 125%'

Where, when serialized to XML

'High warning (red) > 125%, Low warning (red) â‰¤ 12%, Normal
(green) > 12 to â‰¤ 110%, and High caution (yellow) > 110 to â‰¤ 125%'

& re-imported into a remote DB, looks like:

'High warning (red) > 125%, Low warning (red) = 12%, Normal
(green) > 12 to = 110%, and High caution (yellow) > 110 to = 125%'

Wondering if there's an easy 'switch' in the writing/reading to keep
correct characters intact.

Thanks again,

David D.

Jeroen Mostert · May 28, 2009

i_robot73 said:
Pretty new to XML, so forgive my ignorance.

I have a string in a DB
'High warning (red) > 125%, Low warning (red) ≤ 12%, Normal
(green) > 12 to ≤ 110%, and High caution (yellow) > 110 to ≤ 125%'

Where, when serialized to XML

'High warning (red) > 125%, Low warning (red) â‰¤ 12%, Normal
(green) > 12 to â‰¤ 110%, and High caution (yellow) > 110 to â‰¤ 125%'

This is UTF-8 reinterpreted as Latin-1, a common mistake. Whatever you're
using to read your XML or whatever you used to write the XML (or both!) is
messing up.
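
To make the mechanism concrete, here is a small Python sketch (Python is my choice for illustration, not something from the original setup). It assumes the misreading is done with Windows-1252, the Windows superset of Latin-1, since that is what yields three visible characters; `repair` is a hypothetical helper name, and it only undoes a single round of misdecoding.

```python
# Illustration only: how "≤" (U+2264) turns into three characters.
original = "Low warning (red) \u2264 12%"

# Correct step: UTF-8 encodes U+2264 as the three bytes E2 89 A4.
utf8_bytes = original.encode("utf-8")

# The mistake: decoding those bytes with a single-byte code page,
# so each byte becomes its own character.
misread = utf8_bytes.decode("cp1252")


def repair(text: str) -> str:
    """Undo one round of UTF-8 read as Windows-1252 (hypothetical helper)."""
    return text.encode("cp1252").decode("utf-8")


print(misread)          # Low warning (red) â‰¤ 12%
print(repair(misread))  # Low warning (red) ≤ 12%
```

Repairing after the fact is a last resort; it is better to find the step that decodes with the wrong encoding and fix it there.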

If you open the XML in Notepad through File -> Open, making sure to set the
encoding to "UTF-8" manually, does it open correctly? If so, your XML is
fine. (The "≤" may be turned into a square if your display font has no glyph
for the character, so take that into account too -- in any case, it should
never look like three characters.)

& re-imported into a remote DB, looks like:

'High warning (red) > 125%, Low warning (red) = 12%, Normal
(green) > 12 to = 110%, and High caution (yellow) > 110 to = 125%'

How are you importing, and to where? You need something that can process
Unicode and store it into an N[VAR]CHAR field. If the underlying field is a
[VAR]CHAR, it will only be able to store whatever the code page of your
database supports -- this is usually Latin-1 or Windows-1252, neither of
which supports "U+2264 LESS-THAN OR EQUAL TO".
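
The code page limitation can be checked outside the database. The following Python sketch only tests whether a string is representable in a given encoding; `can_store` is a made-up helper, and it says nothing about how a particular database substitutes characters it cannot store.

```python
def can_store(text: str, codec: str) -> bool:
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


value = "Low warning (red) \u2264 12%"

for codec in ("latin-1", "cp1252", "utf-8", "utf-16-le"):
    print(codec, can_store(value, codec))
```

The single-byte code pages report False because of the one "≤"; the Unicode encodings report True, which is the situation an N[VAR]CHAR column is meant for.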

Wondering if there's an easy 'switch' in the writing/reading to keep
correct characters intact.

There are no easy cures here; you have to make sure every link in the chain
is using encodings properly. This is usually not as hard as it sounds,
especially if you're using Unicode.
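
As a rough sketch of what "every link in the chain" means for the XML step, here is a round trip using Python's standard xml.etree.ElementTree (requires Python 3.8+ for the `xml_declaration` argument). The element names `row` and `description` are invented for the example, and your own tooling will have its own equivalent of these calls.

```python
import xml.etree.ElementTree as ET

text = "Low warning (red) \u2264 12%"

# Build a tiny document (element names are placeholders).
root = ET.Element("row")
ET.SubElement(root, "description").text = text

# Writing: choose the encoding explicitly and record it in the declaration.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Reading: give the parser the raw bytes so it applies the declared
# encoding itself, instead of decoding to text with a guessed code page.
parsed = ET.fromstring(xml_bytes)
round_tripped = parsed.find("description").text

print(round_tripped == text)  # True
```

The point is that the bytes are never turned into a string by anything other than the XML parser, so there is no place for a wrong code page to sneak in.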
