Sql server to xml string with encoding

  • Thread starter Thread starter Urban Andersson
  • Start date Start date
U

Urban Andersson

I try to read data from a SQL server and convert it to a xml document string
where the data itself is saved with different encodings. I figured out how
to write the correct encoding to the string but it seems that when reading
from the database the data i converted from it's original econding to UTF-8.
So when i write the xml string the result will be garbled. This is what I
have done so far could someone help me to get the data in correct encoding
from the database?

SqlConnection con = new
SqlConnection(ConfigurationSettings.AppSettings["ConnectionString"].ToString
());

SqlDataAdapter da = new SqlDataAdapter("Select * from langlabels where
language='tr'",con);

DataSet ds = new DataSet();

da.Fill(ds);

System.IO.Stream ms = new System.IO.MemoryStream();

ds.WriteXml(ms);

byte[] ba = new byte[ms.Length];

ms.Position=0;

ms.Read(ba,0,(int)ms.Length);

String strXmlDocument =
System.Text.Encoding.GetEncoding("ISO-8859-9").GetString(ba);
 
Urban Andersson said:
I try to read data from a SQL server and convert it to a xml document string
where the data itself is saved with different encodings. I figured out how
to write the correct encoding to the string but it seems that when reading
from the database the data i converted from it's original econding to UTF-8.
So when i write the xml string the result will be garbled. This is what I
have done so far could someone help me to get the data in correct encoding
from the database?

Rather than writing to a MemoryStream, write to a StringWriter. You can
subclass StringWriter if you need an encoding other than UTF-8 to
appear in the XML header.

The database encoding shouldn't matter at all though - what you're
doing wrong in the code you quoted is assuming that WriteXml is using
ISO-8859-1.
 
But the data is stored in ISO-8859-9 in the database. Is it not correct the
to set the stream to ISO-8859-9 so that the data will be converted to that
encoding when writing to the stream instead of UTF-8 wich the data is not?
It seems to me that the WriteXml method writes UTF-8 to the stream instead
of ISO-8859-9? I tried using the StringWriter but i am not able to set
the encoding.
 
Urban Andersson said:
But the data is stored in ISO-8859-9 in the database. Is it not correct the
to set the stream to ISO-8859-9 so that the data will be converted to that
encoding when writing to the stream instead of UTF-8 wich the data is not?
It seems to me that the WriteXml method writes UTF-8 to the stream instead
of ISO-8859-9?

It doesn't matter how the data is stored in the database - by the time
it's been fetched from the database, it's just Unicode text, assuming
you've got it in text fields in the database. If it's a text field, the
data is just text data. It doesn't intrinsically have an encoding -
that's an implementation detail for storage, effectively.

See http://www.pobox.com/~skeet/csharp/unicode.html and
http://www.pobox.com/~skeet/csharp/debuggingunicode.html for more
information.
I tried using the StringWriter but i am not able to set
the encoding.

As I said, you need to subclass StringWriter in order to specify an
encoding. You basically override the Encoding property.
 
So, the problem seems to be that during one of these:

1. When the text is read from the database to the dataset
2. When the text is output from the dataset to the XML

....an erroneous original encoding is assumed (either UTF-8 or ISO-8859-1).
There must be a way to either

1. Tell the dataset that it is reading ISO-8859-9 encoded text from the
database when reading, or
2. Tell the dataset, when outputting XML, that the original data was
ISO-8859-9 encoded, so that it can correctly convert to whatever encoding
the XML should have

How is either of these accomplished?

Urban Andersson said:
But the data is stored in ISO-8859-9 in the database. Is it not correct the
to set the stream to ISO-8859-9 so that the data will be converted to that
encoding when writing to the stream instead of UTF-8 wich the data is not?
It seems to me that the WriteXml method writes UTF-8 to the stream instead
of ISO-8859-9? I tried using the StringWriter but i am not able to set
the encoding.
what
 
Jon Skeet said:
It doesn't matter how the data is stored in the database - by the
time it's been fetched from the database, it's just Unicode text,
assuming you've got it in text fields in the database.

But to convert the database text to Unicode, either the reader or the
database must know how the database data is formatted, right?
 
Per Bolmstedt said:
But to convert the database text to Unicode, either the reader or the
database must know how the database data is formatted, right?

The database itself should know.
 
Per Bolmstedt said:
So, the problem seems to be that during one of these:

1. When the text is read from the database to the dataset
2. When the text is output from the dataset to the XML

...an erroneous original encoding is assumed (either UTF-8 or ISO-8859-1).
There must be a way to either

1. Tell the dataset that it is reading ISO-8859-9 encoded text from the
database when reading, or

The driver should do that automatically, I believe.
2. Tell the dataset, when outputting XML, that the original data was
ISO-8859-9 encoded, so that it can correctly convert to whatever encoding
the XML should have

You don't need to - the XmlWriter (or whatever is being used) will sort
things out automatically. How the original data was encoded is
irrelevant once it's been decoded to Unicode.
How is either of these accomplished?

The other problem (which I've seen before) is that the data might have
been incorrectly written to start with. When you use query analyzer,
does it come up correctly? If not, you've effectively got corrupted
data in the database. There may be a way to fix it, but it depends on
exactly what's happened.

What's been writing to this database to start with?
 
Jon Skeet said:
How the original data was encoded is irrelevant once it's been decoded
to Unicode.

Right, but the problem is precisely this process of decoding to Unicode. The
data in the database is not corrupt, it is 8-bit text "encoded as
ISO-8859-9". So, for example, a LATIN SMALL LETTER DOTLESS I would be
returned as 0xFD.
What's been writing to this database to start with?

Most likely classic ASP, using text entered in an HTML form, submitted by
somebody using ISO-8859-9. How do you mean that the database itself should
know how the text is encoded? We do not specify encodings neither when
writing to nor reading from databases. The table in question contains text
"in many different encodings".
 
Per Bolmstedt said:
Right, but the problem is precisely this process of decoding to Unicode. The
data in the database is not corrupt, it is 8-bit text "encoded as
ISO-8859-9". So, for example, a LATIN SMALL LETTER DOTLESS I would be
returned as 0xFD.

The last twice I saw what the problem I think you're having, people
swore blind that the database wasn't corrupt until they understood the
problem. So, what makes you think that your database doesn't have
corrupt data in? If it's just that your ASP app can read the data out
again, that could well just mean that it's broken in the same way in
both directions.

What happens if you look with query analyser? My guess is you'll see
exactly the same "wrong" characters as you do in your XML.
Most likely classic ASP, using text entered in an HTML form, submitted by
somebody using ISO-8859-9. How do you mean that the database itself should
know how the text is encoded? We do not specify encodings neither when
writing to nor reading from databases. The table in question contains text
"in many different encodings".

The database will have an internal storage encoding for text data, and
that's probably specified somewhere in the database definition (not the
table schema - it's probably database-wide).

Now, if you've supposedly got text of "many different encodings" in
your text fields, you've probably been effectively writing to these
fields as if they're binary fields. That's a Bad Thing, because it
leads to problems such as the ones you've been seeing.

I don't know exactly how ASP and ADO write to text fields, but they've
been involved in the previous problems I've seen as well. (I certainly
hope there's a "right" way of doing things from ASP/ADO, but I don't
know...)
 
Jon Skeet said:
So, what makes you think that your database doesn't have
corrupt data in?

Depending on how you define "corrupt data", it may or may not. If there is
a database-wide encoding specified, and its set to ISO-8859-1 (actually, it
is probably set to Windows-1252), and we store the string "yýlý" in the
database, is that corrupt data because the ý should actually be read as a
"small i without dot"? If so, the only way to keep non-corrupt data in a
database is to convert each string entered to the database-wide encoding,
which means that encoding must be a unicode variant. Is this what you
meant?
Now, if you've supposedly got text of "many different encodings" in
your text fields, you've probably been effectively writing to these
fields as if they're binary fields. That's a Bad Thing, because it
leads to problems such as the ones you've been seeing.

This is indeed the problem. But to return to the original question, is the
only answer to create an intermediary stream between the database and the
dataset, which knows the formatting of the incoming data to be "decoded into
unicode"?
 
Per Bolmstedt said:
Depending on how you define "corrupt data", it may or may not.

I define "corrupt data" to be data that isn't in the appropriate
format, or has lost information.

For instance, suppose you had a numeric field where every value was the
value it was meant to be +1, you could easily reverse the problem and
get the correct value, but the data would still be corrupt in the
database.
If there is
a database-wide encoding specified, and its set to ISO-8859-1 (actually, it
is probably set to Windows-1252)

Not for nvarchar fields - it's more likely to be "Unicode" or "UTF-8"
or something similar. It may not even be configurable for nvarchar
fields. What *is* the type of the field in question? And what happens
when you use query analyzer to look at it?
and we store the string "yýlý" in the
database, is that corrupt data because the ý should actually be read asa
"small i without dot"?
Yes.

If so, the only way to keep non-corrupt data in a
database is to convert each string entered to the database-wide encoding,
which means that encoding must be a unicode variant. Is this what you
meant?

Sort of, yes.
This is indeed the problem. But to return to the original question, is the
only answer to create an intermediary stream between the database and the
dataset, which knows the formatting of the incoming data to be "decoded into
unicode"?

You'll need to convert the value of each field of each row from the
dodgy data into the correct data. Creating an intermediate stream would
be *insanely* difficult because you'd need to understand the database's
network protocol, for starters.

A longer term fix is to fix the existing data in the database and
modify the ASP code to submit the data correctly.
 
How come that the code below works? That must meen that the data is
interpted as ISO-8859-1 by ADO.NET and thus not converting it when writting
it to the stream? But if i change the last line to string strXmlDocument =
System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(ba); the result
returns garbled. Is my asumption correct that the data is interpered as
8859-1 when read from the database?



System.IO.Stream stream = new System.IO.MemoryStream();

XmlTextWriter xmltw = new
XmlTextWriter(stream,System.Text.Encoding.GetEncoding("ISO-8859-1"));

ds.WriteXml(xmltw);

byte[] ba = new byte[stream.Length];

stream.Position=0;

stream.Read(ba,0,(int)stream.Length);

string strXmlDocument =
System.Text.Encoding.GetEncoding("ISO-8859-9").GetString(ba);
 
Jon Skeet said:
It may not even be configurable for nvarchar fields.

The field in question is a varchar, but I don't see how that makes any
difference, since ADO passes the data as if it were binary. This only
affects whether the database string is represented internally as 0xfd, "ý",
"Latin small letter y with acute" or U+00FD (nvarchar?) or whatever, some of
which are corrupt.

When using query analyzer, the "small i without dot" comes out as an ý,
which is definitely what I would suspect would happen.
A longer term fix is to fix the existing data in the database and
modify the ASP code to submit the data correctly.

Of course, but since modifying the .NET code on the reading end is much
easier, what is the best solution for doing that?

One way is to use, in this case, Windows-1252 for reading the data (since
this will cause no modification of the corrupt data) and then using
GetString() on a proper Encoding to convert the data. For example,
GetEncoding("iso-8859-9").GetString(corrupt_binary_data) to convert the ý to
a "small i without dot". But is there no easier way to do this on the
DataSet, the SqlConnection or something else?
 
Urban Andersson said:
How come that the code below works? That must meen that the data is
interpted as ISO-8859-1 by ADO.NET and thus not converting it when writting
it to the stream? But if i change the last line to string strXmlDocument =
System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(ba); the result
returns garbled. Is my asumption correct that the data is interpered as
8859-1 when read from the database?

I think the problem is that the data is interpreted as ISO-8859-1 (or
possibly code page 1252) when it's *written* to the database. Let's
take an imaginary example case: suppose that character 'X' has the same
code in ISO-8859-9 as character 'Y' does in ISO-8859-1.

So, when the input to your ASP is 'X', 'Y' ends up in the database,
assuming that the database is interpreting the SQL statement as being
in ISO-8859-1. That's a big assumption, and I don't know enough about
ADO to know for sure. It's *not* correct to say that the data is
interpreted as 8859-1 when read from the database - it's Unicode data,
and however it's stored and transported, it ends up as the same Unicode
data in the .NET string as it is in the database.

Now, the database -> .NET path is much better defined, IMO, so when
it's fetched, the Unicode string contains character 'Y'. Your code then
finds out what the byte array for the string is in ISO-8859-1, i.e. the
binary data that was in the SQL statement. It then treats it as if it's
a byte array for the string in ISO-8859-9, which gets it back to 'X'.

So that's the way to reverse the damage that the ASP is doing. If you
can, it would be a good idea to take the service down temporarily,
rewrite each field with the correct Unicode data as worked out by doing
this munging (when you're absolutely *sure* it all works!), and bring
it back up with fixed ASPs. You can then avoid doing any more munging,
ever.
 
Per Bolmstedt said:
The field in question is a varchar, but I don't see how that makes any
difference, since ADO passes the data as if it were binary.

It may not make any difference to ADO, but it potentially makes a very
big difference to ADO.NET, and almost certainly affects what characters
can be stored in the field.
This only
affects whether the database string is represented internally as 0xfd, "ý",
"Latin small letter y with acute" or U+00FD (nvarchar?) or whatever, someof
which are corrupt.

Sort of, yes.
When using query analyzer, the "small i without dot" comes out as an ý,
which is definitely what I would suspect would happen.

It's what I'd suspect would happen with corrupt data, yes :) I think it
may be time to start writing a page about this topic...
Of course, but since modifying the .NET code on the reading end is much
easier, what is the best solution for doing that?

Modifying the .NET code on the reading end is easier in the short term,
but it means that any time you use anything *else* to read the database
in the future, the same munging has to be done. Fixing the existing
data and the broken ASP is a *much* better long-term fix, IMO.
One way is to use, in this case, Windows-1252 for reading the data
(since this will cause no modification of the corrupt data) and then using
GetString() on a proper Encoding to convert the data. For example,
GetEncoding("iso-8859-9").GetString(corrupt_binary_data) to convert the ý to
a "small i without dot". But is there no easier way to do this on the
DataSet, the SqlConnection or something else?

No. The reason there isn't an easier way to do this is because you've
got corrupt data, and there's no particular benefit from making life
easy for those with corrupt data, as it pretty much encourages people
to keep using broken code.
 
Jon Skeet said:
I think it may be time to start writing a page about this topic...

That's not a bad idea, because as I see it, one of the biggest differences
between text handling in ASP and ADO vs ASP.NET and ADO.NET is that whereas
the former all but encourages you to treat text as binary, and lets you stay
completely ignorant of any encodings, and doesn't do textual conversion, the
latter requires of you to treat text as encoded glyphs, requires of you to
be aware of the encodings involved, and does a lot of implicit textual
conversion. Even if this binary treatment of text was A Bad Thing with ASP
and ADO, there is a world of difference between the assistance the .NET
framework provides to someone who wants to expliclity manage text encoding
and conversion compared to what was offered to the ASP/ADO developer. (This
difference was certainly the problem in this particular thread.)
 
Per Bolmstedt said:
That's not a bad idea, because as I see it, one of the biggest differences
between text handling in ASP and ADO vs ASP.NET and ADO.NET is that whereas
the former all but encourages you to treat text as binary, and lets you stay
completely ignorant of any encodings, and doesn't do textual conversion, the
latter requires of you to treat text as encoded glyphs, requires of you to
be aware of the encodings involved, and does a lot of implicit textual
conversion.

Quite the opposite, actually. There are implicit conversions *whenever*
you communicate text. However, when things are working properly in
ADO.NET (i.e. when your data isn't corrupted) you *don't* need to know
which encoding to use, because the .NET driver just gets it right.

The only reason you had to know anything about encodings *now* is that
the implicit conversions which were being used in ASP/ADO were screwed
up. Using ADO.NET that would have been much harder to do - there
wouldn't have been any encodings involved in your code at all (in the
database fetching, at least).
Even if this binary treatment of text was A Bad Thing with ASP and
ADO, there is a world of difference between the assistance the .NET
framework provides to someone who wants to expliclity manage text
encoding and conversion compared to what was offered to the ASP/ADO
developer. (This difference was certainly the problem in this
particular thread.)

No - the problem in this particular thread was caused by the fact that
ASP/ADO *didn't* do what .NET did.

Put it this way - two different systems which both handled encodings in
the way that ADO.NET does (i.e. keeping it under the driver layer)
could work toegether easily with no problem. Two different systems both
using the ASP/ADO type model would have had the same problem you saw
today.
 
Back
Top