xml and character codes such as &Eacute;

  • Thread starter: jake

jake

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
jake
 
jake said:
I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;".

That is an entity reference. To not "choke" on that entity reference you
need to declare the entity in the DTD you include in the XML document.
Otherwise the XML is not well-formed and the XML parser will reject it.
Note that DTD support is disabled by default in .NET 2.0 and later, so
you will need to use

    XmlReaderSettings settings = new XmlReaderSettings();
    // On .NET 4.0 and later, ProhibitDtd is obsolete; set
    // settings.DtdProcessing = DtdProcessing.Parse instead.
    settings.ProhibitDtd = false;
    using (XmlReader reader = XmlReader.Create("file.xml", settings))
    {
        ...
    }

if you want to use a DTD declaring the entities the XML uses.
 
XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.
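
To make the distinction concrete, here is a minimal sketch (not the original poster's code) that feeds two small strings to a default XmlReader. The class name PredefinedEntityDemo and the helper ReadText are made up for illustration. The first string uses only the five predefined entities and should parse cleanly; the second uses &Eacute; with no declaration, so the reader is expected to throw an XmlException.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class PredefinedEntityDemo
{
    // Hypothetical helper: concatenates all text nodes of an XML string,
    // using a reader with default settings (no DTD involved).
    public static string ReadText(string xml)
    {
        var text = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // amp, lt, gt, apos and quot need no declaration.
        Console.WriteLine(ReadText("<root>&amp; &lt; &gt; &apos; &quot;</root>"));

        try
        {
            // Eacute is not predefined and is not declared anywhere here.
            ReadText("<root>&Eacute;</root>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Rejected: " + ex.Message);
        }
    }
}
```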

All other named character entities should be declared, either directly
within the XML document, or in the .dtd file specified by the XML
file's DOCTYPE declaration. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:

http://www.w3.org/TR/xhtml1/dtds.html

In particular, if you search for "Eacute" on that page, you'll find
this declaration:

<!ENTITY Eacute "&#201;"> <!-- latin capital letter E with acute, U+00C9 ISOlat1 -->

So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.

Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be a URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be XmlParserContext).
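
The XmlParserContext route could look roughly like the sketch below. It is untested against any particular framework version and rests on a few assumptions: .NET 4.0 or later (for DtdProcessing), a doctype name passed alongside the system ID, and an explicitly assigned XmlUrlResolver so the external .dtd can be fetched on runtimes where the default resolver is null. The class name ParserContextDemo, the helper ReadText, and the temporary file standing in for a real .dtd are all illustrative.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ParserContextDemo
{
    // Hypothetical helper: parses xml and resolves named entities from the
    // DTD file at dtdPath, which is supplied through an XmlParserContext.
    public static string ReadText(string xml, string rootName, string dtdPath)
    {
        var settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;   // DTDs are off by default
        settings.XmlResolver = new XmlUrlResolver();    // assumption: needed to load the external file

        // Arguments: name table, namespace manager, doctype name, public ID,
        // system ID (URI of the .dtd), internal subset, base URI, xml:lang, xml:space.
        var context = new XmlParserContext(
            null, null, rootName, null,
            new Uri(dtdPath).AbsoluteUri, null,
            string.Empty, string.Empty, XmlSpace.None);

        var text = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings, context))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // A throwaway file plays the role of the separately editable .dtd.
        string dtdPath = Path.GetTempFileName();
        try
        {
            File.WriteAllText(dtdPath, "<!ENTITY Eacute \"&#201;\">\n");
            Console.WriteLine(ReadText("<root>&Eacute;cole</root>", "root", dtdPath));
        }
        finally
        {
            File.Delete(dtdPath);
        }
    }
}
```

The XML itself stays untouched in this variant, which suits the case where the input files cannot be changed.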

Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:

<root> ... </root>

Then you can change it as follows:

<!DOCTYPE root [
<!ENTITY Eacute "&#201;">
...
]>
<root>...</root>

Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.
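
A small sketch of the inline-DOCTYPE approach follows, assuming .NET 4.0 or later (on 2.0/3.5 the corresponding switch is ProhibitDtd = false). The class name InternalSubsetDemo and the helper ReadText are invented for illustration; the document string mirrors the DOCTYPE shape shown above with a single entity declared.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class InternalSubsetDemo
{
    // Hypothetical helper: parses xml with DTD processing enabled and
    // concatenates its text nodes (expanded entities included).
    public static string ReadText(string xml)
    {
        var settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;   // allow the DOCTYPE to be read

        var text = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // The entity is declared in the document's own internal subset,
        // so no external file or resolver is involved.
        string xml =
            "<!DOCTYPE root [\n" +
            "<!ENTITY Eacute \"&#201;\">\n" +
            "]>\n" +
            "<root>&Eacute;cole</root>";

        Console.WriteLine(ReadText(xml));
    }
}
```

Enabling DTD processing is best reserved for input you trust, since entity declarations in hostile documents can be abused to inflate the parsed content.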
 
Thank you Martin and Pavel. I understand a little more about it now.
Hoped that xml files would be a shallow wade but "nay" said the
gatekeeper. At least now I can proceed on solid ground. I will most
likely include all the declarations in a separate .DTD that is
independently editable. This way, I can edit the file and add some
expletives without recompiling!
Thank you both again.
jake


 