Not necessarily.
Query Analyser often outputs results, and in fact even submits queries,
based on your various settings, which, if you did the same through ADO,
would give you an error. i.e. in QA, if you have an SP with a datetime
parameter and you call it as MySp @mydate = '2004/04/15' then this should
always work. However, because all my settings in the OS etc. say British
settings, I have to put the date in dd/mm/yyyy order. However, if I did
this with ADO the opposite would be true.
Fair enough - but that wouldn't have the effect of corrupting
nvarchar/ntext strings.
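As an aside, one way to sidestep the date-format question entirely from
.NET is to pass the date as a typed parameter rather than as text. Just a
sketch, borrowing your MySp/@mydate example and assuming an already-open
connection called conn:

    ' Assumes Imports System.Data and Imports System.Data.SqlClient,
    ' and an open SqlConnection called conn (made-up name).
    Dim cmd As New SqlCommand("MySp", conn)
    cmd.CommandType = CommandType.StoredProcedure
    ' The value is a DateTime, so there's no dd/mm/yyyy vs yyyy/mm/dd
    ' ambiguity for regional settings to get involved in.
    cmd.Parameters.Add("@mydate", SqlDbType.DateTime).Value = _
        New DateTime(2004, 4, 15)
    cmd.ExecuteNonQuery()

That way neither QA's settings nor the OS regional settings ever get a
look in.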
Yes I was. This is taking a lot to explain.
As I originally said, I believe that the database is storing UTF-8
encoded data.
The database is storing Unicode strings. How it stores them is
irrelevant (trust me - go along with it for now, hopefully it'll make
sense in a minute).
.NET, I am now told, uses UTF-16.
It does internally - but fundamentally the point is that they're
Unicode strings. They're sequences of characters, not bytes.
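Just to make that concrete - a trivial sketch, using some Greek text
purely as an example (assumes Imports System.Text):

    Dim s As String = "Γειά σου"                        ' 8 characters
    Dim utf8 As Byte() = Encoding.UTF8.GetBytes(s)      ' 15 bytes
    Dim utf16 As Byte() = Encoding.Unicode.GetBytes(s)  ' 16 bytes
    ' s.Length is 8 regardless of which encoding you might later pick.
    ' The string itself is a sequence of characters; bytes only come
    ' into it when you explicitly encode.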
So I got a dataset of the data and tried using virtually the same code
as you, along with a number of other methods, to convert from UTF-8 to
Unicode. However, as I explained (fairly straightforwardly, I thought),
the dataset has already done some conversion on the data behind the
scenes, so by the time my code tries to say:
.....GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))
the value has already been mangled, and hence the decoding I am trying
to do has no effect. Even if the dataset didn't do this, the
CType(xxxxx, String) would have the same mangling effect.
No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
.NET.
The strings you are seeing in .NET are the strings which are in the
database.
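In code terms, getting at the value is nothing more than treating it as
the string it already is - a sketch, with made-up table/column names and
an open connection called conn:

    ' Assumes Imports System.Data and Imports System.Data.SqlClient.
    ' SomeTable/Description are placeholder names.
    Dim da As New SqlDataAdapter("SELECT Description FROM SomeTable", conn)
    Dim ds As New DataSet()
    da.Fill(ds)
    ' The item is already a .NET String, i.e. Unicode characters.
    ' CStr/CType here just unboxes it - it's not an encoding conversion.
    Dim text As String = CStr(ds.Tables(0).Rows(0)(0))

There's no decoding step because there's nothing to decode.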
See my last comment; I can't think of a way to say the same thing yet
again in a different way.
That's because I believe you have an incorrect view of what's going on.
The custom C++ DLL (which I didn't write, because I don't know C++)
gives me a UTF8Codec class which has a Decode method on it. Therefore,
to me, that makes its intentions pretty clear. And since decoding from
UTF-8 to Unicode is exactly what I was trying to do, to me that suggests
it's doing what .NET won't allow me to.
No, it's doing what .NET discourages you from doing, by having a clear
separation of bytes from chars in the first place. That's not to say
it's impossible in .NET though.
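For instance, if you really did have a string whose characters were the
default-code-page interpretation of UTF-8 bytes, you could undo it in
pure .NET along these lines - a sketch, assuming the default ANSI code
page is 1252 (substitute whatever yours really is):

    ' mangled is whatever string you pulled out of the database.
    ' Re-encode it back into the bytes it came from (1252 assumed here),
    ' then decode those bytes as the UTF-8 they really were.
    Dim bytes As Byte() = Encoding.GetEncoding(1252).GetBytes(mangled)
    Dim proper As String = Encoding.UTF8.GetString(bytes)

The point is that each step explicitly says which encoding it's using -
nothing happens behind your back.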
I tried the util on the DB with the Greek in it and all the garbage
instantly worked, hurrah!
Good. I still think it would be wise to understand what's been going on
though.
What .NET is doing wrong, as I have said previously several times and in
several different ways, is:
You've said it in different ways, but I believe you haven't listened to
what I've been suggesting has actually happened.
1) Datasets implicitly convert the data (messing it up), hence you can't
get to the raw bytes.
There *are* no raw bytes - none that you should be bothered with,
anyway. The database logically stores *characters*, not bytes, in
ntext/nvarchar fields.
2) DataReaders only let you get the raw bytes for text and ntext SQL
Server fields, and I needed it to work on NVarChar as well, so that was
out.
See above.
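For what it's worth, with a SqlDataReader you'd just call GetString on
the nvarchar column - a sketch with an assumed query and an already-open
connection called conn (Imports System.Data.SqlClient):

    ' SomeTable/Name are placeholder names.
    Dim cmd As New SqlCommand("SELECT Name FROM SomeTable", conn)
    Dim reader As SqlDataReader = cmd.ExecuteReader()
    While reader.Read()
        ' GetString hands back the Unicode characters directly;
        ' there's no byte-level representation to go looking for.
        Console.WriteLine(reader.GetString(0))
    End While
    reader.Close()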
3) The conversion from UTF-8 to Unicode via the GetBytes() method
requires a STRING. In order to get that string I have to CType() the
database object value to a string. By doing this you would (if it
weren't for (1) above) get a UNICODE string, but it would actually be a
mangled UTF-8 string, hence negating the UTF-8 to Unicode conversion
code.
Phew!
Please reread the article I linked to before - I think you still have
some conceptual problems in distinguishing between an encoded string
(which is a sequence of bytes) and a string which is a sequence of
characters.
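In code, the distinction looks like this (just a sketch, assuming
Imports System.Text):

    Dim s As String = "hello"                           ' a string: characters
    Dim encoded As Byte() = Encoding.UTF8.GetBytes(s)   ' encoded form: bytes
    Dim back As String = Encoding.UTF8.GetString(encoded) ' characters again
    ' GetBytes goes characters -> bytes; GetString goes bytes -> characters.
    ' CType(xxxxx, String) does neither of those - it merely tells the
    ' compiler that the object already *is* a string.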
As a footnote, if the browser was being sent corrupted text from the
database, as you suggest, then how, when it displays it assuming
charset="UTF8", does it manage to reverse the supposed database
corruption and display it correctly?
By looking at the data it's been given, working out that it probably
isn't what was intended, and making a heuristic guess at what's
happened.
Here's one more try at saying what I believe has happened to the
database:
1) Your VB6 app receives a string
2) The C++ DLL encodes that string into a sequence of UTF-8 bytes
3) The ADO layer then thinks that sequence of UTF-8 bytes is *actually*
a sequence of bytes in the default encoding
4) You end up with a sequence of characters in the database, each of
which is the character in the default encoding for a single byte of the
UTF-8 representation of the string
Reading this from anything which *correctly* gets the string from the
database, you end up with too many characters, some of which look
strange.
However, the C++ DLL will basically reverse its operation - so when
fetching the string from ADO, it ends up as a sequence of bytes in the
default encoding, which the C++ DLL then treats as a sequence of bytes
of UTF-8, and decodes back to Unicode characters, giving you back what
you started with.
So essentially, the C++ DLL is applying an incorrect transformation -
but it's applying it in a reversible way, so that you end up with
apparently garbage data in the database, but data which can be
"sensibly" retrieved with the C++ DLL. That doesn't mean that .NET has
done *anything* wrong.
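You can simulate the whole round trip in a few lines, which may make it
more concrete - a sketch, assuming the default ANSI code page is 1252
(substitute whatever your system really uses) and Imports System.Text:

    Dim original As String = "Γειά"      ' step 1: the string the app got
    Dim utf8 As Byte() = Encoding.UTF8.GetBytes(original) ' step 2: DLL encodes
    ' Steps 3/4: ADO treats those UTF-8 bytes as default-code-page
    ' characters (1252 assumed here), so this mangled string is what
    ' actually lands in the database.
    Dim inDatabase As String = Encoding.GetEncoding(1252).GetString(utf8)
    ' Anything reading the database *correctly* (a DataSet, say) sees
    ' inDatabase: too many characters, some of them odd-looking.
    ' The C++ DLL reverses the mistake on the way back out:
    Dim raw As Byte() = Encoding.GetEncoding(1252).GetBytes(inDatabase)
    Dim decoded As String = Encoding.UTF8.GetString(raw)  ' = original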