how to parse XML files with html text in them

  • Thread starter: Keith G Hicks

Keith G Hicks

I'm trying to parse some XML files that contain newspaper articles. Each
file is a separate article. Each element in the file is going to be posted
to a database. I wrote some code previously to read XML files that were laid
out rigidly and had no trouble. But these are not cooperating. They contain
lots of spacing, are not organized nicely line by line, and some of the
elements are going to contain HTML tags (for example, the article itself will
have <p>, <b>, <i> and other formatting tags in it). I need to be able to
read the XML elements into variables that I can post to the database. But my old
code for reading XML is not working in this situation. I've tried some
different examples I found on various sites, but nothing seems to work so
far.

Here is a sample file:

<company_main>
<articles>
<id>
558960
</id>
<location_id>
1
</location_id>
<title>
<p>NY Times counsel</p>
<p>speaks at MSU Law</p>
</title>
<summary>
This is just a bunch of summary information about the article that is in
this file.......
</summary>
<author_id>
1
</author_id>
<text>
<p>
This is<i> paragraph</i> 1 of the article itself. Lorem ipsum dolor sit
amet, consectetur adipiscing elit. Duis nec lorem a tellus pulvinar dapibus.
Proin ut lectus magna. Morbi velit mi, faucibus a malesuada non, vehicula a
leo. Nam dolor elit, adipiscing blandit aliquet non, pellentesque sit amet
justo. Nulla tempor risus in sapien rhoncus mollis. Suspendisse potenti.
Integer vel pulvinar risus.
</p>
<p>
This is<i> paragraph</i> 1 of the article itself. Mauris non dolor erat,
vitae elementum nisl. <b>Sed ac ante ac purus</b> hendrerit tincidunt quis
eget augue. Nam orci mauris, pulvinar vitae faucibus ac, varius quis nunc.
Vestibulum sed feugiat magna.
</p>
<p>
This is<i> paragraph</i> 1 of the article itself. Nam bibendum aliquam
adipiscing. Sed congue rutrum sagittis. Ut neque felis, scelerisque a
adipiscing sit amet, pulvinar sed nisl. Praesent metus tortor, iaculis vitae
tempor at, rhoncus eu felis. Proin luctus, magna sit amet dapibus bibendum,
leo urna semper velit, venenatis dictum quam enim at sem.
</p>
<p>
This is<i> paragraph</i> 1 of the article itself. Proin quis dolor vel
mauris vehicula lobortis in vel nunc. Nullam neque neque, auctor et rutrum
vitae, ultrices in nunc. Sed adipiscing interdum risus et euismod.
</p>
</text>
<date>
10/27/09
</date>
<type>
Published
</type>
<url>
</url>
</articles>
</company_main>

I'm sure it's obvious but I need to read the following:

id
location_id
title
summary
author_id
text
date
type
url

This didn't work (kept finding tags that are not actually XML elements):

    Dim xrdr As New XmlTextReader(textFilesLocation & sArticleToPost)
    xrdr.WhitespaceHandling = WhitespaceHandling.None

    While xrdr.Read()

        If String.Compare(xrdr.Name, "id", True) = 0 Then
            ArticleID = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "location_id", True) = 0 Then
            LocationID = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "title", True) = 0 Then
            ArticleTitle = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "summary", True) = 0 Then
            ArticleSummary = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "text", True) = 0 Then
            ArticleText = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "author_id", True) = 0 Then
            AuthorID = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "date", True) = 0 Then
            ArticleDate = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "type", True) = 0 Then
            ArticleType = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "url", True) = 0 Then
            ArticleURL = Trim(xrdr.ReadElementString())
        End If

    End While

    xrdr.Close()

And this errored as well (error said it found invalid encoding):

    Dim m_xmld As XmlDocument
    Dim m_nodelist As XmlNodeList
    Dim m_node As XmlNode

    'Create the XML Document
    m_xmld = New XmlDocument()

    'Load the Xml file
    m_xmld.Load(textFilesLocation & sArticleToPost)

    'Get the list of name nodes
    m_nodelist = m_xmld.SelectNodes("/company_main/articles")

    'Loop through the nodes
    For Each m_node In m_nodelist

        ArticleID = m_node.Attributes.GetNamedItem("id").Value
        LocationID = m_node.Attributes.GetNamedItem("location_id").Value
        ArticleTitle = m_node.Attributes.GetNamedItem("title").Value

    Next

Any help would be greatly appreciated!

Keith
 
Keith said:
I'm sure it's obvious but I need to read the following:

id
location_id
title
summary
author_id
text
date
type
url

Well, that does not tell us exactly what you want to extract, for
instance from the 'title' element or the 'text' element.
Take an element such as

<title>
<p>NY Times counsel</p>
<p>speaks at MSU Law</p>
</title>

what exactly do you need? The plain text e.g.
"NY Times counselspeaks at MSU Law"
or the markup contained itself e.g.
"<p>NY Times counsel</p>
<p>speaks at MSU Law</p>"
? You will need to clarify that.
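
For what it's worth, in XmlDocument terms those two choices correspond to the InnerText and InnerXml properties. A minimal sketch, assuming the title element from the sample is loaded from an inline string (the module name is made up for illustration):

```vbnet
Imports System
Imports System.Xml

Module InnerTextVersusInnerXml
    Sub Main()
        ' Inline copy of the <title> element from the sample file.
        Dim doc As New XmlDocument()
        doc.LoadXml("<title><p>NY Times counsel</p><p>speaks at MSU Law</p></title>")

        Dim title As XmlElement = doc.DocumentElement

        ' InnerText concatenates the text nodes and drops the tags:
        ' NY Times counselspeaks at MSU Law
        Console.WriteLine(title.InnerText)

        ' InnerXml keeps the child markup as a string:
        ' <p>NY Times counsel</p><p>speaks at MSU Law</p>
        Console.WriteLine(title.InnerXml)
    End Sub
End Module
```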

And this errored as well (error said it found invalid encoding):

Dim m_xmld As XmlDocument
Dim m_nodelist As XmlNodeList
Dim m_node As XmlNode

'Create the XML Document
m_xmld = New XmlDocument()

'Load the Xml file
m_xmld.Load(textFilesLocation & sArticleToPost)

XmlDocument uses XmlReader under the hood to parse the XML, so I am
astonished that you say you could use XmlTextReader explicitly but then
get an error about encoding with XmlDocument.
Please provide more details: the exact error message, the exact
statement that fails (the Load() call?), the position in the XML where the
error occurs if the message gives one, and the excerpt from the XML that
causes the error.
 
Well it's working better. Here's what I have right now:

    Dim articleXMLDoc As XmlDocument
    Dim articleXMLNodeList As XmlNodeList
    Dim articleXMLNode As XmlNode

    'Create the XML Document
    articleXMLDoc = New XmlDocument()

    'Load the Xml file
    articleXMLDoc.Load(textFilesLocation & sArticleToPost)

    'Get the list of name nodes
    articleXMLNodeList = articleXMLDoc.SelectNodes("/company/articles")

    'Loop through the nodes (usually only one per xml file)
    For Each articleXMLNode In articleXMLNodeList

        ArticleID = articleXMLNode.ChildNodes.Item(0).InnerXml
        LocationID = articleXMLNode.ChildNodes.Item(1).InnerXml
        ArticleTitle = articleXMLNode.ChildNodes.Item(2).InnerXml
        ArticleSummary = articleXMLNode.ChildNodes.Item(3).InnerXml
        ArticleText = articleXMLNode.ChildNodes.Item(4).InnerXml
        AuthorID = articleXMLNode.ChildNodes.Item(5).InnerXml
        ArticleDate = articleXMLNode.ChildNodes.Item(6).InnerXml
        ArticleType = articleXMLNode.ChildNodes.Item(7).InnerXml
        ArticleURL = articleXMLNode.ChildNodes.Item(8).InnerXml

    Next

First, to answer one of your questions, I do need the markup, so I'm using
InnerXml.

Second, one of the strings in the ArticleSummary element is as follows:

<summary>By Rachel Beck
AP Business Writer

NEW YORK (AP) ? A theme is emerging from the flood of recent corporate
earnings reports: Cost cuts are boosting profits.
Investors are cheering, but they shouldn?t. Even in these tough times, more
CEOs should be talking</summary>

(looks sloppy but that's how it's coming over to me and I can't control
that)

When this line runs: articleXMLDoc.Load(textFilesLocation & sArticleToPost),
I get the error "Invalid character in the given encoding. Line 14, position
15."

It doesn't like the "p" after NEW YORK (AP). No idea why. I'm guessing it
thinks it's supposed to be <p>, but it might just be a p on its own for some
reason. I will have no way to predict what characters will be in the
article.

The third thing I need to be able to do is to read the elements by tag name and
not by index #. I may not be able to guarantee that the order of the XML
elements inside each articles node will be the same, so index #'s won't
always work.

Keith
 
Never mind on the getting elements by name. I figured that part out:

ArticleID = articleXMLNode("id").InnerXml
LocationID = articleXMLNode("location_id").InnerXml
ArticleTitle = articleXMLNode("title").InnerXml
ArticleSummary = articleXMLNode("summary").InnerXml
ArticleText = articleXMLNode("text").InnerXml
AuthorID = articleXMLNode("author_id").InnerXml
ArticleDate = articleXMLNode("date").InnerXml
ArticleType = articleXMLNode("type").InnerXml
ArticleURL = articleXMLNode("url").InnerXml
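
One caveat with the indexer, sketched here under assumptions: articleXMLNode("name") returns Nothing when the element is missing, so .InnerXml on it would throw a NullReferenceException, and values such as <id> keep the whitespace around their text. ReadChild below is a hypothetical helper name, not part of the framework, and the inline XML is a cut-down version of the sample:

```vbnet
Imports System
Imports System.Xml

Module ReadByName
    ' Hypothetical helper: inner markup of the named child element,
    ' trimmed, or an empty string when the element is absent.
    Function ReadChild(ByVal parent As XmlNode, ByVal name As String) As String
        Dim child As XmlElement = parent(name)   ' Nothing if there is no such child
        If child Is Nothing Then
            Return String.Empty
        End If
        Return child.InnerXml.Trim()
    End Function

    Sub Main()
        Dim doc As New XmlDocument()
        ' Cut-down inline version of the sample file; note there is no <url>.
        doc.LoadXml("<company_main><articles>" & _
                    "<id> 558960 </id>" & _
                    "<title><p>NY Times counsel</p></title>" & _
                    "</articles></company_main>")

        Dim article As XmlNode = doc.SelectSingleNode("/company_main/articles")

        Console.WriteLine(ReadChild(article, "id"))                  ' 558960
        Console.WriteLine(ReadChild(article, "title"))               ' <p>NY Times counsel</p>
        Console.WriteLine(ReadChild(article, "url") = String.Empty)  ' True
    End Sub
End Module
```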

I still need help on the error with the "p"

Keith
 
I think I know what's going on now. The "p" is not a "p". I just noticed
that it turned into a "?" when I pasted it into this post, so I decided to check
it out. The text strings in the XML are coming out of a Mac system. I
encountered this once before. Some of the Mac characters are getting screwed
up when they get onto Windows machines.

I took a look at the files in a hex editor. The characters that are causing
the problems show up as N's with tildes and other such things.
 
XML has strict rules: if there is an encoding problem, then that is a
well-formedness violation. The best approach is to fix the problem at
the source when creating the XML, using an XML API to ensure the
document is properly encoded and has an XML declaration declaring the
encoding used. If you can't do that and nevertheless want to parse the
file with .NET, then you will at least need to find out what encoding has
been used and then, if the .NET framework supports that encoding, instead of
    xmlDocumentInstance.Load(fileName)
you will need to use
    ' Needs Imports System.IO and Imports System.Text
    Using sr As StreamReader = New StreamReader(fileName, Encoding.GetEncoding(nameOrCodePageOfEncodingGoesHere))
        xmlDocumentInstance.Load(sr)
    End Using ' disposes the reader, so no separate sr.Close() is needed
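
As a guess only: N-with-tilde is byte D1 in Windows-1252, and the same byte is an em dash in Mac OS Roman, which would fit the "(AP) ?" spot in the summary. If that is right, the encoding would be code page 10000 ("macintosh"). The sketch below writes a small made-up test file containing that byte and loads it through a StreamReader as described above. The module name and file contents are invented for illustration, and the real encoding still has to be confirmed with whoever produces the files.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module LoadMacRomanXml
    Sub Main()
        ' ASSUMPTION: the files are Mac OS Roman (code page 10000, "macintosh").
        ' On .NET Core / .NET 5+, first call
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Dim macRoman As Encoding = Encoding.GetEncoding(10000)

        ' Made-up test file: byte &HD1 is an em dash in Mac OS Roman
        ' (and N-with-tilde in Windows-1252).
        Dim fileName As String = Path.GetTempFileName()
        Dim head As Byte() = Encoding.ASCII.GetBytes("<summary>NEW YORK (AP) ")
        Dim tail As Byte() = Encoding.ASCII.GetBytes(" A theme</summary>")
        Using fs As New FileStream(fileName, FileMode.Create)
            fs.Write(head, 0, head.Length)
            fs.WriteByte(CByte(&HD1))
            fs.Write(tail, 0, tail.Length)
        End Using

        ' doc.Load(fileName) would fail here: with no XML declaration the
        ' parser assumes UTF-8, and a lone &HD1 byte is not valid UTF-8.
        Dim doc As New XmlDocument()
        Using sr As New StreamReader(fileName, macRoman)
            doc.Load(sr)
        End Using

        ' Check that the byte was decoded to U+2014 (em dash).
        Dim emDash As Char = Convert.ToChar(&H2014)
        Console.WriteLine(doc.DocumentElement.InnerText.IndexOf(emDash) >= 0)

        File.Delete(fileName)
    End Sub
End Module
```

Once loaded this way, the strings in the document are ordinary .NET Unicode strings, so the database layer only needs to store Unicode text.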
 
Keith said:
Here is a sample file:

<company_main>
<articles>
<id>
558960
</id>
<location_id>
1
</location_id>
<title>
<p>NY Times counsel</p>
<p>speaks at MSU Law</p>
</title>

Can't you request to get an XML file where the content is properly
encoded, instead of a file where the HTML is mixed into the XML?

In an XML file where the HTML code is a value in the element, the HTML
would be encoded like this:

<title>
&lt;p&gt;NY Times counsel&lt;/p&gt;
&lt;p&gt;speaks at MSU Law&lt;/p&gt;
</title>

This would mean that you can simply get the value of the title tag,
instead of trying to figure out where the XML ends and the HTML starts.
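
If the supplier ever did deliver it that way, reading it back is straightforward because the parser undoes the escaping. A minimal sketch with an inline string (the module name is made up for illustration):

```vbnet
Imports System
Imports System.Xml

Module EscapedHtmlValue
    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<title>&lt;p&gt;NY Times counsel&lt;/p&gt;</title>")

        ' InnerText returns the unescaped value: <p>NY Times counsel</p>
        Console.WriteLine(doc.DocumentElement.InnerText)

        ' InnerXml returns the escaped form instead:
        ' &lt;p&gt;NY Times counsel&lt;/p&gt;
        Console.WriteLine(doc.DocumentElement.InnerXml)
    End Sub
End Module
```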
 
Yeah. That would be nice! :-) I've tried before with this company. Easier
said than done in this case. The XML is coming out of a proprietary (and
very awkward) db system where they don't like customizing their end at all.
 
I am not sure I understand this - the input file does not seem to have any xml tags in it.

But, for what it is worth - I often have to extract stock market data from mixed XML-HTML files. I use a script like the following.


# Script ExtractNode.txt
var str file, node, input
cat $file > $input
stex -r -c ("^<"+$node+"&\>^]")$input > null
stex -r -c ("[^</"+$node+"&\>^")$input > null
echo $input


Script is in biterscripting. You would call it as


script "ExtractNode.txt" node("title") file("/somefile.extn")


This is the basic concept: the script reads the file "/somefile.extn" into a string variable $input. It then strips off everything before (and including) "<title...>", and then everything after (and including) "</title...>". What remains is the content of the title node, which the script echoes to output.

This is just the basic concept. You will find the documentation for the stex (string extractor) command at http://www.biterscripting.com/helppages/stex.html . You may also find more complex XML-parsing scripts posted around the web.

Randi




In reply to Keith G Hicks, "how to parse XML files with html text in them", 04-Nov-09

Previous Posts In This Thread:

Keith G Hicks: the original question and his follow-ups
Martin Honnen --- MVP XML (http://msmvps.com/blogs/martin_honnen/): reply on the encoding error
Göran Andersson (http://www.guffa.com): reply on escaping the HTML inside the XML