Convert UTF-8 byte array back to original binary

  • Thread starter: BMaxwell

BMaxwell

Hello all,

We have a bunch of PNG image files stored in a SQL Server 2000 image
column. The app that reads and writes them is written in tcl and
works fine, but they are being converted to UTF-8 format for storage
in the database. I need to write C# code to take the array of bytes I
fetch from the database and get back to the original bytes so I can
re-generate the file as it was.

I have tried creating a UTF8 Decoder, but it didn't seem to help. I'm
new to encoding/decoding and confused about how I can get the original
bytes back from UTF-8/Unicode, etc. I'd appreciate any help. I can
post examples of what my byte streams look like if that would help.

Thanks in advance,
Brad
 
Hi Brad,

You can use Encoding.UTF8.GetBytes(utf8string) to get a byte array from
the string (or any other encoding).
If this doesn't help, some sample code might help.
 
BMaxwell said:
We have a bunch of PNG image files stored in a SQL Server 2000 image
column.

If it's in an image column, I can't see where text is involved at all.
Image columns are fundamentally just blocks of bytes, which is exactly
what you want.
The app that reads and writes them is written in tcl and
works fine, but they are being converted to UTF-8 format for storage
in the database.

Then that's broken. Not every byte array is a valid UTF-8 encoded
string. Don't do it. If you absolutely *have* to convert binary data to
text, use Base64 encoding instead.
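For illustration, a minimal C# sketch of a Base64 round trip (the sample bytes are arbitrary, chosen to include values that can never appear in valid UTF-8):

```csharp
using System;

class Base64RoundTrip
{
    static void Main()
    {
        // Arbitrary binary data, including bytes that are never valid
        // in UTF-8 on their own (0xC0, 0xFF).
        byte[] original = { 0x89, 0x50, 0x4E, 0x47, 0x00, 0xC0, 0xFF };

        // Base64 turns any byte sequence into plain ASCII text...
        string text = Convert.ToBase64String(original);

        // ...and decodes back to exactly the same bytes, every time.
        byte[] restored = Convert.FromBase64String(text);

        Console.WriteLine(text);
        Console.WriteLine(restored.Length == original.Length);
    }
}
```

Unlike a character encoding, Base64 is defined over raw octets, so the round trip is lossless for any input.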

Why is it doing that in the first place?
I need to write C# code to take the array of bytes I
fetch from the database and get back to the original bytes so I can
re-generate the file as it was.

There's no guarantee you'll be able to do that, and you shouldn't be
trying.
 
Jon Skeet said:
If it's in an image column, I can't see where text is involved at all.
Image columns are fundamentally just blocks of bytes, which is exactly
what you want.


Then that's broken. Not every byte array is a valid UTF-8 encoded
string. Don't do it. If you absolutely *have* to convert binary data to
text, use Base64 encoding instead.

Why is it doing that in the first place?


There's no guarantee you'll be able to do that, and you shouldn't be
trying.

Yes, the original process is broken, possibly because the tcl odbc
package isn't handling binary the way we want it to. We're working on
a solution from that side, but the app has been running for a while
and there are a lot of images already stored incorrectly in UTF-8. The
tcl process that retrieves the data can restore it back to its
original bytes, so I had hoped that I could decode the bytes with a
Decoder via C#.

Thanks,
Brad
 
BMaxwell said:
Yes, the original process is broken, possibly because the tcl odbc
package isn't handling binary the way we want it to. We're working on
a solution from that side, but the app has been running for a while
and there are a lot of images already stored incorrectly in UTF-8. The
tcl process that retrieves the data can restore it back to its
original bytes, so I had hoped that I could decode the bytes with a
Decoder via C#.

Has anyone tried to retrieve these incorrectly-stored images yet?

If it's an image column, I really don't understand what conversion has
occurred.

Is there any way you can put in a sample file with the bytes 0-255
using the tcl tool, and see exactly what comes out? It could be that
there's a way of coping, but we'll need to know *exactly* what's
actually happening.
 
Jon Skeet said:
Has anyone tried to retrieve these incorrectly-stored images yet?

If it's an image column, I really don't understand what conversion has
occurred.

Is there any way you can put in a sample file with the bytes 0-255
using the tcl tool, and see exactly what comes out? It could be that
there's a way of coping, but we'll need to know *exactly* what's
actually happening.

Thanks for your help, Jon.

The tcl script can retrieve the data and write the file correctly.

This might help. Here are the first few bytes of one PNG file:

89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 00 00 02 F8 00 00 03 E4
P N G I H D R

and here's what gets stored in the database:

C2 89 50 4E 47 0D 0A 1A 0A C080 C080 C080 0D 49 48 44 52 C080 C080 02 C3 B8
P N G I H D R

Those may be different files, but you get the idea from the header
bytes. The high-bit bytes are encoded and the 00 bytes are replaced
with C080. Is this "standard" UTF-8?
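One way to see where the C2/C3 pairs come from is to treat each raw byte as a character code 0-255 and UTF-8-encode the result; a hedged C# sketch of that (using the header bytes above):

```csharp
using System;
using System.Text;

class EncodingPattern
{
    static void Main()
    {
        // The first file bytes from the post: 89 50 4E 47 ("\x89PNG") plus a NUL.
        byte[] fileBytes = { 0x89, 0x50, 0x4E, 0x47, 0x00 };

        // Treat each byte as a character code 0-255 (ISO-8859-1)...
        string asText = Encoding.GetEncoding(28591).GetString(fileBytes);

        // ...then encode that "text" as UTF-8: 0x89 becomes the pair C2 89,
        // while the ASCII bytes 50 4E 47 pass through unchanged.
        byte[] utf8 = Encoding.UTF8.GetBytes(asText);

        Console.WriteLine(BitConverter.ToString(utf8)); // C2-89-50-4E-47-00
        // Note: standard UTF-8 leaves 0x00 as a single 00 byte. The C080
        // pairs in the stored data are the "overlong" NUL that Tcl (like
        // Java) uses in its internal modified UTF-8 -- which suggests the
        // answer to the question above is "almost, but not quite".
    }
}
```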

Thanks again,
Brad
 
BMaxwell said:
The tcl script can retrieve the data and write the file correctly.

For appropriate values of "correctly" :) (I know what you mean, just
teasing.)
This might help. Here are the first few bytes of one PNG file:

89 50 4E 47 0D 0A 1A 0A 00 00 00 0D 49 48 44 52 00 00 02 F8 00 00 03 E4
P N G I H D R

and here's what gets stored in the database:

C2 89 50 4E 47 0D 0A 1A 0A C080 C080 C080 0D 49 48 44 52 C080 C080 02 C3 B8
P N G I H D R

Those may be different files, but you get the idea from the header
bytes. The high-bit bytes are encoded and the 00 bytes are replaced
with C080. Is this "standard" UTF-8?

It doesn't look quite like standard UTF-8 to me, I'm afraid. In
particular, 0xC0 0x80 is illegal UTF-8. Other than that, however, it
looks like it *might* be just about okay. Given the way UTF-8 works, if
you replace every occurrence of 0xC0 0x80 with just 0x00 (e.g. with a
memory stream), and then call:

string s = Encoding.UTF8.GetString(bytes);
bytes = Encoding.GetEncoding(28591).GetBytes(s);

you *should* get the right result. (That's "should" as in "I think you
will", not strictly speaking "should" as in "the way encodings work" -
the ISO-8859-1 encoding (28591) simply obliterates the top byte,
leaving characters which aren't *actually* in ISO-8859-1 (between 128
and 160) present. Never mind.)
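Putting those steps together, a sketch of the whole repair using the header bytes posted earlier (this assumes the stored data is valid UTF-8 once the C0 80 pairs are fixed up):

```csharp
using System;
using System.IO;
using System.Text;

class Utf8Repair
{
    // Replace each illegal C0 80 pair with a single 00, decode the result
    // as UTF-8, then map each character back to its low byte via ISO-8859-1.
    public static byte[] Restore(byte[] stored)
    {
        var cleaned = new MemoryStream();
        for (int i = 0; i < stored.Length; i++)
        {
            if (stored[i] == 0xC0 && i + 1 < stored.Length && stored[i + 1] == 0x80)
            {
                cleaned.WriteByte(0x00);
                i++; // consume the 0x80 as well
            }
            else
            {
                cleaned.WriteByte(stored[i]);
            }
        }

        string s = Encoding.UTF8.GetString(cleaned.ToArray());
        return Encoding.GetEncoding(28591).GetBytes(s);
    }

    static void Main()
    {
        // Start of the stored PNG from the post.
        byte[] stored = { 0xC2, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A,
                          0xC0, 0x80, 0xC0, 0x80, 0xC0, 0x80, 0x0D, 0x49, 0x48, 0x44, 0x52 };

        Console.WriteLine(BitConverter.ToString(Restore(stored)));
        // 89-50-4E-47-0D-0A-1A-0A-00-00-00-0D-49-48-44-52
    }
}
```

Run against the sample, this yields the original PNG header (89 50 4E 47 ...) with the NUL bytes restored.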

While I think this will work, I'd strongly recommend replacing the tcl
script with all haste, if possible.
 
Jon Skeet said:
For appropriate values of "correctly" :) (I know what you mean, just
teasing.)


It doesn't look quite like standard UTF-8 to me, I'm afraid. In
particular, 0xC0 0x80 is illegal UTF-8. Other than that, however, it
looks like it *might* be just about okay. Given the way UTF-8 works, if
you replace every occurrence of 0xC0 0x80 with just 0x00 (e.g. with a
memory stream), and then call:

string s = Encoding.UTF8.GetString(bytes);
bytes = Encoding.GetEncoding(28591).GetBytes(s);

you *should* get the right result. (That's "should" as in "I think you
will", not strictly speaking "should" as in "the way encodings work" -
the ISO-8859-1 encoding (28591) simply obliterates the top byte,
leaving characters which aren't *actually* in ISO-8859-1 (between 128
and 160) present. Never mind.)

While I think this will work, I'd strongly recommend replacing the tcl
script with all haste, if possible.

That seems to have worked on one test file and gotten close on others.
I appreciate your help a lot, Jon.

FYI, the tcl odbc package did turn out to have a bug in it. It just
so happened that retrieving the "bad" byte stream from the database
and writing it to an output file undid the encoding, so no one had
noticed the problem before.

Thanks again,
Brad
 