Encoding/Codepage: Can't Get There From Here

  • Thread starter: Christopher H. Laco

Christopher H. Laco

Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encodings, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-6 codepage system
wide. As good as it sounds, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-6.
Thus far, I haven't found anything that leads me to believe this is
possible in .NET or that that specific codepage is supported without
coding a custom Encoding class to do the conversion.

The HP Roman-6 codepage is available on the net, so it should be a
matter of mapping the two codepages I would think.

Given the situation, what's the best way to tackle this problem?

-=Chris
 
Christopher H. Laco said:
Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encodings, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-6 codepage system
wide. As good as it sounds, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-6.

I'd suggest that a better way would be to keep the data in Unicode
until you need it in HP Roman-6, and only encode it then. Going via
Encoding.Default is only going to confuse things, IMO.

Thus far, I haven't found anything that leads me to believe this is
possible in .NET or that that specific codepage is supported without
coding a custom Encoding class to do the conversion.

The HP Roman-6 codepage is available on the net, so it should be a
matter of mapping the two codepages I would think.

Given the situation, what's the best way to tackle this problem?

Writing an Encoding isn't that hard, especially for fixed-size
character sets. You might be able to use a lot of the code I've got for
EBCDIC. See
http://www.pobox.com/~skeet/csharp/miscutil
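To make the "fixed-size character set" point concrete, here is a minimal sketch of a table-driven single-byte Encoding. It is not the code from the library linked above. The class name SingleByteTableEncoding, the TableEncodingDemo helper, and the two table entries are assumptions made up for illustration, and the byte positions must be checked against the real HP Roman chart before use.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical table-driven encoding for any "1 byte = 1 character" codepage.
public sealed class SingleByteTableEncoding : Encoding
{
    private const byte Unknown = (byte)'?';
    private readonly char[] byteToChar;   // index = byte value on the wire
    private readonly Dictionary<char, byte> charToByte = new Dictionary<char, byte>();

    public SingleByteTableEncoding(char[] table)
    {
        if (table == null || table.Length != 256)
            throw new ArgumentException("Table must have exactly 256 entries.");
        byteToChar = (char[])table.Clone();
        for (int b = 0; b < 256; b++)
        {
            // If a character appears twice in the table, the first byte wins.
            if (!charToByte.ContainsKey(table[b])) charToByte.Add(table[b], (byte)b);
        }
    }

    // One char always becomes one byte, so every count is trivial.
    public override int GetByteCount(char[] chars, int index, int count) { return count; }
    public override int GetCharCount(byte[] bytes, int index, int count) { return count; }
    public override int GetMaxByteCount(int charCount) { return charCount; }
    public override int GetMaxCharCount(int byteCount) { return byteCount; }

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            byte b;
            // Characters missing from the table are sent as '?'.
            bytes[byteIndex + i] =
                charToByte.TryGetValue(chars[charIndex + i], out b) ? b : Unknown;
        }
        return charCount;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
            chars[charIndex + i] = byteToChar[bytes[byteIndex + i]];
        return byteCount;
    }
}

public static class TableEncodingDemo
{
    // Placeholder table: ASCII maps to itself, the upper half is unassigned
    // except for two illustrative entries (unverified positions).
    public static char[] SampleTable()
    {
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) table[i] = i < 128 ? (char)i : '\uFFFD';
        table[0xE0] = '\u00C1'; // A with acute (assumed position)
        table[0xDF] = '\u00D4'; // O with circumflex (assumed position)
        return table;
    }

    public static void Main()
    {
        Encoding enc = new SingleByteTableEncoding(SampleTable());
        byte[] wire = enc.GetBytes("L\u00C1C\u00D4\u20AC");
        Console.WriteLine(BitConverter.ToString(wire)); // 4C-E0-43-DF-3F
    }
}
```

The euro sign is not in the sample table, so it goes out as '?' (0x3F). Filling in the full 256-entry table from the published chart is the only codepage-specific work.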
 
Jon said:
I'd suggest that a better way would be to keep the data in Unicode
until you need it in HP Roman-6, and only encode it then. Going via
Encoding.Default is only going to confuse things, IMO.

That's pretty much what happens. It's not really stored anywhere
session-wise. I'm just trying to convert it to something the backend can
handle right before I write it to the socket.

Encoding.Default was my first try. I need to do some more digging. I'm
not sure what codepage .NET thinks it is when I get it from
IIS/ASP->COM->Assembly.

Writing an Encoding isn't that hard, especially for fixed-size
character sets. You might be able to use a lot of the code I've got for
EBCDIC. See
http://www.pobox.com/~skeet/csharp/miscutil

Yeah, that's what I was looking at yesterday. :-)

-=Chris
 
To be honest, I still don't know where to start. I'm still a little
confused about how converting from one codepage to another actually happens.

Converting from latin to HP Roman8 is easy since I have the HP Roman 6
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've never had to think of such things
most of the time.

I can just hack a quick latin to hp roman byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage. It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding("mycustom").

-=Chris
 
Christopher H. Laco said:
To be honest, I still don't know where to start. I'm still a little
confused about how converting from one codepage to another actually happens.

You shouldn't need to convert from one codepage to another - you should
only need to convert from a .NET string (which is Unicode) to your
target code page.

Converting from latin to HP Roman8 is easy since I have the HP Roman 6
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've never had to think of such things
most of the time.

I think you're getting hung up about a "source codepage" for no reason.
Are you correctly getting the data as a .NET string? If so, don't worry
about the original source any more.

I can just hack a quick latin to hp roman byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage.

There's no real source codepage when you're converting a .NET string -
it's just Unicode.

It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding("mycustom").

Unfortunately I don't believe you can do that. .NET isn't as pluggable
as it might be in a few places...
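A small sketch of why the "source codepage" stops mattering once the data is a .NET string: the same text decoded from two different byte encodings yields identical strings. The class and method names here are made up for illustration.

```csharp
using System;
using System.Text;

public static class SourceCodepageDemo
{
    // Decodes the same four characters from ISO-8859-1 bytes and from UTF-8
    // bytes, then compares the resulting strings.
    public static bool SameString()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] asLatin1 = { 0x4C, 0xC1, 0x43, 0xD4 };             // ISO-8859-1 bytes
        byte[] asUtf8 = { 0x4C, 0xC3, 0x81, 0x43, 0xC3, 0x94 };   // UTF-8 bytes

        string a = latin1.GetString(asLatin1);
        string b = Encoding.UTF8.GetString(asUtf8);
        return a == b; // both are the same sequence of Unicode chars
    }

    public static void Main()
    {
        Console.WriteLine(SameString()); // True
    }
}
```

Only the final step, string to target bytes, needs to know about the HP Roman codepage.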
 
Jon said:
You shouldn't need to convert from one codepage to another - you should
only need to convert from a .NET string (which is Unicode) to your
target code page.

I think you're getting hung up about a "source codepage" for no reason.
Are you correctly getting the data as a .NET string? If so, don't worry
about the original source any more.

There's no real source codepage when you're converting a .NET string -
it's just Unicode.

Unfortunately I don't believe you can do that. .NET isn't as pluggable
as it might be in a few places...

I hear what you're saying. The string comes from the browser, through
ASP, through COM, into .NET where I have it in a string. So, yes, source
is irrelevant in this case. But for the sake of learning, I'd like to
understand how conversion between two codepages works in general,
regardless of .NET.

Just to recap for my sanity. So I've got a string:
string data = "LÁCÔ";

.NET stores it internally as unicode, but to get it over the wire to a
backend that doesn't understand multi-byte character semantics, I need
to do one of the following:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

The first is bad for obvious reasons; anything above 127 is turned into ?.

The second converts the .NET unicode string variable data into the
default ANSI 1252 on Windows. I can send this over the wire, but it
displays incorrectly on everything on the backend because it's using the
HP Roman8 codepage. Hence the need to convert to the Roman8 codepage
before sending the data.
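The difference between the two calls can be seen by dumping the bytes. This is only a sketch: it uses ISO-8859-1 as a stand-in for Encoding.Default, because the default codepage depends on the machine, and the class name is made up.

```csharp
using System;
using System.Text;

public static class LossyEncodingDemo
{
    // Returns the encoded bytes as a hex string such as "4C-3F".
    public static string Hex(Encoding enc, string text)
    {
        return BitConverter.ToString(enc.GetBytes(text));
    }

    public static void Main()
    {
        string data = "L\u00C1C\u00D4"; // the sample string, written with escapes

        // ASCII has no slot for the accented letters, so they become '?' (0x3F).
        Console.WriteLine(Hex(Encoding.ASCII, data));                      // 4C-3F-43-3F

        // Latin 1 keeps them, but at Latin 1 positions, which are not the
        // positions the HP Roman backend expects.
        Console.WriteLine(Hex(Encoding.GetEncoding("iso-8859-1"), data)); // 4C-C1-43-D4
    }
}
```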

Now, yes, I'm really needing to convert the string from a .NET unicode
string to HP Roman8. That's where I'm lost. I don't know how or where to
begin.

I know I need to subclass System.Text.Encoding, and that's it.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode to do the
conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

The web page in question appears to have been declared as ISO-8859-1
according to the headers and for now, I'll assume the browser is doing
the right thing and sending that encoding back. No special provisions
have been made in the page one way or the other. So .NET just guesses
correctly when converting it from ISO-8859-1 to Unicode for internal
variable storage.

I just don't know where to go from here.

Thanks for the help!
-=Chris
 
Christopher H. Laco said:
I hear what you're saying. The string comes from the browser, through
ASP, through COM, into .NET where I have it in a string. So, yes, source
is irrelevant in this case. But for the sake of learning, I'd like to
understand how conversion between two codepages works in general,
regardless of .NET.

Just to recap for my sanity. So I've got a string:
string data = "LÁCÔ";

.NET stores it internally as unicode, but to get it over the wire to a
backend that doesn't understand multi-byte character semantics, I need
to do one of the following:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

The first is bad for obvious reasons; anything above 127 is turned into ?.

The second converts the .NET unicode string variable data into the
default ANSI 1252 on Windows. I can send this over the wire, but it
displays incorrectly on everything on the backend because it's using the
HP Roman8 codepage. Hence the need to convert to the Roman8 codepage
before sending the data.

Now, yes, I'm really needing to convert the string from a .NET unicode
string to HP Roman8. That's where I'm lost. I don't know how or where to
begin.

I know I need to subclass System.Text.Encoding, and that's it.

Fortunately, that's quite easy - and it's all you need to do.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode to do the
conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Well, Unicode *is* Latin 1 for the first 256 values, so all you've got
to do is:

1) Convert characters which are in Latin 1 to HP Roman8 appropriately.
2) Do "something" (e.g. use the encoded version of '?') with characters
which aren't in HP Roman8.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.

The web page in question appears to have been declared as ISO-8859-1
according to the headers and for now, I'll assume the browser is doing
the right thing and sending that encoding back.

It may not be sending ISO-8859-1 - there's no necessity for a browser
to make a request with the same encoding as the last page it looked at.

No special provisions
have been made in the page one way or the other. So .NET just guesses
correctly when converting it from ISO-8859-1 to Unicode for internal
variable storage.

No - it uses whatever the browser sends. It only has to guess if the
browser doesn't say what encoding to use.

I just don't know where to go from here.

Well, which part of deriving from Encoding are you having trouble with?
As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/

It's got a few optimisations in there which you'll need to understand
in order to read the code, but you probably won't need to do
equivalents yourself.
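The two steps listed earlier in this reply (map the Latin 1 range, substitute '?' for everything else) can be sketched as a plain lookup, independent of any Encoding subclass. The names are made up, and the two table entries are illustrative values that must be verified against the Latin 1 to HP Roman8 chart.

```csharp
using System;

public static class Latin1ToTargetMapper
{
    private const byte Fallback = (byte)'?';

    // Builds a 256-entry lookup indexed by Unicode code point U+0000..U+00FF,
    // which is exactly the Latin 1 range. ASCII maps to itself; every other
    // slot starts out as the fallback until the chart is filled in.
    public static byte[] BuildLookup()
    {
        byte[] lookup = new byte[256];
        for (int i = 0; i < 256; i++) lookup[i] = i < 128 ? (byte)i : Fallback;

        // Illustrative entries only (assumed target positions).
        lookup[0xC1] = 0xE0; // A with acute
        lookup[0xD4] = 0xDF; // O with circumflex
        return lookup;
    }

    public static byte[] Encode(string text, byte[] lookup)
    {
        byte[] result = new byte[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // Step 1: Latin 1 characters go through the table.
            // Step 2: anything outside Latin 1 becomes '?'.
            result[i] = c < 256 ? lookup[c] : Fallback;
        }
        return result;
    }

    public static void Main()
    {
        byte[] wire = Encode("L\u00C1C\u00D4\u20AC", BuildLookup());
        Console.WriteLine(BitConverter.ToString(wire)); // 4C-E0-43-DF-3F
    }
}
```

Because the index is the Unicode code point itself, the Latin 1 column of the chart can be used directly as the source side of the table.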
 
Jon said:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

And that's the crux. There is no appropriate encoding built into .NET,
right? There's nothing in GetEncoding() that is going to help me here.

I just don't yet understand the lines between a subclass of
System.Text.Encoding and the actual conversion code used in
Encoding.Convert...

Fortunately, that's quite easy - and it's all you need to do.

Well, Unicode *is* Latin 1 for the first 256 values, so all you've got
to do is:

1) Convert characters which are in Latin 1 to HP Roman8 appropriately.
2) Do "something" (e.g. use the encoded version of '?') with characters
which aren't in HP Roman8.

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.

It's not that simple in this case. The .NET assembly doing the network
I/O is completely unaware of the browser, ASP or the form post. It's
just given a string from Request.Form("SomeData"). The fact that that
works without much hassle all the way up to this point is a modern
miracle. ;-)

Well, which part of deriving from Encoding are you having trouble with?

Oh, that part where I have to take one byte, map it, and convert it to
another, and how that works by just subclassing Encoding and 'that's all
I have to do'. :-)

As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/

To be honest, it's confusing to me. It's way more than I need to some
extent. It's converting EBCDIC to ASCII using an external codepage
file. Or was that the point of it all?

Time to submit a feature request for .NET 2.5: pluggable codepages so
this isn't necessary. :-)

Thanks,
-=Chris
 
Christopher H. Laco wrote:

OK, the bell just went off. I didn't realize that it was two separate
parts, and I only need to create the dat file *once* and use it in the
encoder via GetEncoding.

I was looking at a more literal (but less flexible) approach like the
other Encoding. stuff (UTF8/ASCII).

-=Chris
 
Christopher H. Laco said:
Jon said:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

And that's the crux. There is no appropriate encoding built into .NET,
right? There's nothing in GetEncoding() that is going to help me here.

Indeed - so as I've been saying, you need to write your own encoding.
You don't need GetEncoding.

I just don't yet understand the lines between a subclass of
System.Text.Encoding and the actual conversion code used in
Encoding.Convert...

Encoding.Convert basically calls GetString or GetChars using the source
encoding to convert to Unicode, then GetBytes using the second encoding
to convert to bytes again.

It's not that simple in this case. The .NET assembly doing the network
I/O is completely unaware of the browser, ASP or the form post. It's
just given a string from Request.Form("SomeData"). The fact that that
works without much hassle all the way up to this point is a modern
miracle. ;-)

Well, if it's given a string rather than an array of bytes, it's
already in the right format.

Oh, that part where I have to take one byte, map it, and convert it to
another, and how that works by just subclassing Encoding and 'that's all
I have to do'. :-)

Yes - it's likely to end up being *very* simple. You don't convert one
byte to another though - you convert a sequence of chars to a sequence
of bytes, or vice versa.

To be honest, it's confusing to me. It's way more than I need to some
extent. It's converting EBCDIC to ASCII using an external codepage
file. Or was that the point of it all?

I think you're looking at the wrong thing. I'm not talking about the
code on that page directly - I'm talking about the EBCDIC library
linked from that page, which gives you an example of how to create your
own Encoding.

Time to submit a feature request for .NET 2.5: pluggable codepages so
this isn't necessary. :-)

Being pluggable wouldn't help you at all - you'd still have to write
your own encoding, and if you've done that you don't need to call
Encoding.GetEncoding at all.
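The description of Encoding.Convert given earlier in this reply can be checked with two built-in encodings. This sketch only shows that the two routes produce the same bytes; the TwoStep helper is a made-up name, and UTF-8 is used as the target purely because it ships with .NET.

```csharp
using System;
using System.Text;

public static class ConvertDemo
{
    // Manual equivalent of Encoding.Convert: source bytes to a Unicode
    // string, then the string to bytes in the destination encoding.
    public static byte[] TwoStep(Encoding src, Encoding dst, byte[] input)
    {
        string unicode = src.GetString(input);
        return dst.GetBytes(unicode);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] input = { 0x4C, 0xC1, 0x43, 0xD4 }; // Latin 1 bytes

        byte[] viaConvert = Encoding.Convert(latin1, Encoding.UTF8, input);
        byte[] viaString = TwoStep(latin1, Encoding.UTF8, input);

        Console.WriteLine(BitConverter.ToString(viaConvert)); // 4C-C3-81-43-C3-94
        Console.WriteLine(BitConverter.ToString(viaString));  // 4C-C3-81-43-C3-94
    }
}
```

A custom Encoding instance can be passed as either argument in the same way, which is why no registration with GetEncoding is needed.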
 
Christopher H. Laco said:
Christopher H. Laco wrote:

OK, the bell just went off. I didn't realize that it was two separate
parts, and I only need to create the dat file *once* and use it in the
encoder via GetEncoding.

You may not need a data file at all - just because I happen to use them
for EBCDIC doesn't mean they necessarily fit what you're doing :)

I was looking at a more literal (but less flexible) approach like the
other Encoding. stuff (UTF8/ASCII).

Not sure what you mean by "more literal" approach, but if you're happy,
that's fine...
 