IE6 SP2: Apostrophe, other chars not rendered properly

  • Thread starter: FUBARinSFO

FUBARinSFO

L’Anatomie de l’image

Above is a snippet of text from a web page with French text on it. It
is supposed to be L'Anatomie de l'image.

I can't get it to render properly in Internet Explorer 6 SP2. The
apostrophe is rendered as some very narrow box, and a second byte as a
space.

1. The page has the attribute 'charset=utf-8' in a meta tag.

2. The specific byte pair underlying this character is 0xC2 0x92

3. As near as I can determine, this is a legitimate if deprecated
representation of an apostrophe in the Latin-1 Supplement character set.
The smart single and double quotes are all in the range c2 9x.

4. There is some mapping ambiguity in this range, with characters that
were associated with Windows 1252 character set, which I don't have as
a listed option in IE.

If someone can explain what is going on here, and how to get the right
code page to display this character (and others like it) I would
appreciate it very much. Or failing that, what transforms/character
filters I need to apply to this sort of page in order to remap the
smart quotes into more normal byte streams.

-- Roy Zider
 
Robear:

The phrase above (L'Anatomie de l'image) is actually rendered
correctly here in this forum, even though when I pasted it into my
original posting it did not.

Here are a couple of links to pages with French text that have this
problem:

http://www.sothebys.com/app/live/lot/LotDetail.jsp?lot_id=159217097
http://www.sothebys.com/app/live/lot/LotDetail.jsp?lot_id=159219006

You might have to log onto sothebys -- but it's free, and once you've
got a sessionID you can get any detail records.

Thanks for your help with this. It may be that IE 6 simply can't
render this code sequence correctly, or that the encoding was reversed
in the source document due to bigendian-littleendian issues. But the
fact that the pasted phrase was rendered correctly in this forum after
I submitted my posting makes me believe there's more to it than a
simple byte-order reversal.

-- Roy Zider
 
FUBARinSFO said:
You might have to log onto sothebys -- but it's free, and once you've
got a sessionID you can get any detail records.

Sorry, but I'm not going to do that. The problem might be caused by a
coding error on the page; if so, only sothebys.com can correct it. In the
meantime, you might want to toggle various options in IE View > Encoding
(including disabling Auto-Select).
 
PA Bear:

Sorry to hear that. I've toggled everything, and presented the actual
code bytes in my original description of the problem. Sorry you can't
help.
 
0xC2 0x92 in UTF-8 is 0x0092 in UTF-16, which is a reserved control
character. The same is true of any combination of 0xC2 followed by a byte
in the range 0x80 through 0x9F. All are control characters and none have
any associated graphic. The page source appears to be a botched
conversion to UTF-8 from Windows 1252, a non-standard Microsoft extension
of ISO 8859-1.

There is nothing you can do to make the page display correctly short of
copying and recoding it. A proper apostrophe would be 0xCA 0xBC (U+02BC),
although many prefer the basic ASCII 0x27. The character that Windows 1252
assigns to 0x92, U+2019, is 0xE2 0x80 0x99 in UTF-8.
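
The diagnosis above can be checked with a short Python sketch. It assumes
the botch was a Windows 1252 byte being read as ISO 8859-1 before the
re-encoding to UTF-8; the thread suggests this but does not confirm it.
The variable names are illustrative only.

```python
import unicodedata

raw = b"\xc2\x92"  # byte pair reported in the page source

# Decoded as UTF-8 (the charset the page declares), the pair is one code point.
ch = raw.decode("utf-8")
print(f"U+{ord(ch):04X}", unicodedata.category(ch))  # U+0092 Cc (a control character)

# Assumed botch: Windows 1252 byte 0x92 read as ISO 8859-1, then encoded as UTF-8.
botched = b"\x92".decode("latin-1").encode("utf-8")
print(botched == raw)  # True

# A conversion that honoured Windows 1252 would yield U+2019 instead.
correct = b"\x92".decode("cp1252").encode("utf-8")
print(correct)  # b'\xe2\x80\x99'
```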


 
Gary:

Yes, you are correct as far as current Unicode is concerned. I've
since researched the issue and found that indeed there were a bunch of
proper Windows characters in that range -- 27, in fact -- as part of
Windows-1252.

There were some conversions floating around for a while, apparently,
which mapped the codes in this range (0x80 - 0x9F) to what may have
been a preliminary version of UTF-8, maybe to preserve some
identifiability with the older Windows-1252 mapping. But in the
subsequent formalization of Unicode, bytes in this range were mapped
as control characters. This particular byte sequence is now one of
two for "private use". Now browsers fail to map anything in that
region, apparently (at least IE6 and FF2).

There is one thing a user can do, perhaps, and that is to generate a
custom code page. I haven't looked into that, but it would seem to be
a natural extension of a browser to handle custom codes like this. But
for the purposes of this exercise, which is a screen scraping
situation, I will simply pass all the source pages through a byte
filter and remap these 27 or so control area byte sequences to their
proper UTF-8 encodings.
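
A minimal sketch of such a filter in Python, assuming the scraped page is
otherwise valid UTF-8 and that every C1 control character in it is a
mis-converted Windows-1252 character. The names build_c1_table,
C1_TO_CP1252 and repair_page are made up for this example.

```python
def build_c1_table():
    """Map code points U+0080..U+009F to their Windows-1252 characters."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Windows-1252;
            # they are left out, so those code points pass through unchanged.
            pass
    return table


C1_TO_CP1252 = build_c1_table()


def repair_page(raw: bytes) -> bytes:
    """Decode UTF-8, remap the C1 control range, and re-encode as UTF-8."""
    text = raw.decode("utf-8")
    return text.translate(C1_TO_CP1252).encode("utf-8")


fixed = repair_page(b"L\xc2\x92Anatomie de l\xc2\x92image")
print(fixed.decode("utf-8"))
```

The table is built from Python's own cp1252 codec rather than typed in by
hand, which is why it ends up with 27 entries.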

Thank you both for your help.

-- Roy Zider
 
The problem is that Microsoft created Windows 1252 without regard to
existing standards, resulting in an encoding that is valid only within a
pure Microsoft environment. The minute you get non-Microsoft systems --
such as virtually all web hosts -- involved, the MS non-standard encoding
starts to cause problems.

0xC2 0x92 is the valid UTF-8 encoding of the code point U+0092. The problem
is that it doesn't designate an apostrophe to any standards-compliant
program. If you're interested in mapping Windows 1252 to Unicode, this
may be helpful:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT.
The Unicode values are given in UTF-16, which is the usual practice in
documentation. Various converters are available, including this page:
http://people.w3.org/rishida/scripts/uniview/conversion.
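
As a rough illustration of the conversion step, here is a sketch that
turns a code point, in the hexadecimal form a mapping table lists it,
into its UTF-8 bytes. It uses Python's built-in cp1252 codec as a
stand-in for the mapping file; the helper name utf8_hex is made up for
this example.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point as space-separated hex."""
    return " ".join(f"0x{b:02X}" for b in chr(code_point).encode("utf-8"))


# The four "smart quote" bytes of Windows 1252 and their Unicode equivalents.
for cp1252_byte in (0x91, 0x92, 0x93, 0x94):
    code_point = ord(bytes([cp1252_byte]).decode("cp1252"))
    print(f"0x{cp1252_byte:02X} -> U+{code_point:04X} -> {utf8_hex(code_point)}")

# For comparison, the control character the page actually contains.
print(utf8_hex(0x0092))  # 0xC2 0x92
```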


 