Parse (recovered) corrupt XML files and automatically repair them

Anna · Jan 9, 2010

I want to parse (recovered) corrupt XML files and automatically repair them
for forensic purposes. (Some elements are missing or not properly closed.)
I know the original XML schema.
(When I read the corrupt XML file, an XmlException is raised which indicates
the problem.)
What's the best approach to solve this problem?

I do appreciate any advice.

Anna

Martin Honnen · Jan 9, 2010

Anna said:
I want to parse (recovered) corrupt XML files and automatically repair them
for forensic purposes. (Some elements are missing or not properly closed.)
I know the original XML schema.
(When I read the corrupt XML file, an XmlException is raised which indicates
the problem.)
What's the best approach to solve this problem?

I do appreciate any advice.

If the markup is not well-formed, then I don't think any of the XML APIs
in the .NET Framework will help; they all require well-formed markup.

If you have well-formed markup but elements are missing and you have a
schema then you could try to validate the XML against the schema with
http://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.validate.aspx
(in .NET 3.5) or with
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.validate.aspx
in earlier versions. That will report invalid elements, and using the
schema object model
(http://msdn.microsoft.com/en-us/library/ms255931.aspx) you could try to
find out which elements are missing and add them. It is not an easy task,
however, and certainly not something one or two API calls will do
"automatically"; you will have to write your own code.
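
The first step described above, finding out whether the markup is well-formed and where it breaks, can be sketched in a few lines. This is only an illustration in Python rather than .NET; the function name check_well_formed is made up for this example, and the ParseError position plays roughly the role of the line information carried by an XmlException.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple pointing at the
        # place where the parser gave up.
        line, column = err.position
        return (line, column, str(err))
    return None
```

A result of None means the file can go on to schema validation; anything else has to be repaired at the text level first, because a tree-based API never gets as far as building a tree.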

Anna · Jan 10, 2010

Martin Honnen said:
If the markup is not well-formed, then I don't think any of the XML APIs in
the .NET Framework will help; they all require well-formed markup.

I was afraid of that.
So, any advice on the best approach to solving this problem by writing my
own code?

Anna

Martin Honnen · Jan 10, 2010

Anna said:
I was afraid of that.
So, any advice on the best approach to solving this problem by writing my
own code?

You will need to find out exactly which rules the markup you have
follows, or whether there are any rules at all. The only other
markup language I know is SGML; it allows omitting certain tags and not
quoting certain attribute values, but there are clear rules for how the
parser has to infer elements or find out where an attribute value
ends. There is a .NET implementation of an SGML parser, SgmlReader
(http://developer.mindtouch.com/SgmlReader), which can be used to convert
"HTML tag soup" to XHTML. There is also the HTML Tidy application, which
does the same. So studying the code of such applications can help.
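
To give an idea of what "inferring" omitted end tags can look like, here is a minimal sketch in Python. It is not how SgmlReader or HTML Tidy work internally; it is a naive regex tokenizer with one hypothetical rule (close whatever is still open), and the name close_open_tags is invented for this example. It assumes attribute values contain no '<' or '>' and ignores comments, CDATA sections and tags that are cut off in the middle.

```python
import re

# Matches a start tag, end tag, or self-closing tag; attributes are kept verbatim.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-:]*)([^<>]*?)(/?)>")

def close_open_tags(text):
    """Rebalance tags: close elements left open, drop unmatched end tags."""
    out = []
    stack = []          # names of elements that are currently open
    pos = 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])   # character data, copied unchanged
        pos = m.end()
        closing, name, _attrs, self_closing = m.groups()
        if not closing:
            out.append(m.group(0))
            if not self_closing:
                stack.append(name)
        elif name in stack:
            # Close everything opened since the matching start tag.
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise: an end tag without a start tag, silently dropped.
    out.append(text[pos:])
    while stack:                          # close whatever is still open at EOF
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

For example, close_open_tags("<a><b>x</a>") yields "<a><b>x</b></a>". For forensic work the dropped and inserted tags should probably be logged rather than discarded silently, since each one is a guess about the original content.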

Jesse Houwing · Jan 10, 2010

* Martin Honnen wrote, On 10-1-2010 14:45:

You will need to find out exactly which rules the markup you have
follows, or whether there are any rules at all. The only other
markup language I know is SGML; it allows omitting certain tags and not
quoting certain attribute values, but there are clear rules for how the
parser has to infer elements or find out where an attribute value
ends. There is a .NET implementation of an SGML parser, SgmlReader
(http://developer.mindtouch.com/SgmlReader), which can be used to convert
"HTML tag soup" to XHTML. There is also the HTML Tidy application, which
does the same. So studying the code of such applications can help.

You might also be able to use the HTML Agility Pack. It's pretty
forgiving when it comes to tags, but I'm not sure it'll parse just any
XML-like structure...

See Codeplex.com/HtmlAgilityPack for the download.

Anna · Jan 12, 2010

Thanks, I'll give it a try.

Anna

Martin Honnen said:
You will need to find out exactly which rules the markup you have
follows, or whether there are any rules at all. The only other
markup language I know is SGML; it allows omitting certain tags and not
quoting certain attribute values, but there are clear rules for how the
parser has to infer elements or find out where an attribute value ends.
There is a .NET implementation of an SGML parser, SgmlReader
(http://developer.mindtouch.com/SgmlReader), which can be used to convert
"HTML tag soup" to XHTML. There is also the HTML Tidy application, which
does the same. So studying the code of such applications can help.

Richard.Williams.20 · Jan 21, 2010

I did something like this in the past, but I can't find the code.
Here is what I did.

I defined a template in the form:

m:company
m:department
m:employee
o:salary

This defines the hierarchy of the XML. m: means a mandatory element,
o: means an optional element.

I then parsed the input XML and built a stack of elements, doing the
following as I parsed the file:
- completed incomplete nodes
- ensured that the elements were in the correct hierarchy
- added missing (mandatory) elements with default values

I remember there were some situations where the XML simply could not
be repaired automatically. So this won't be the perfect solution, but
it will be a start. I used biterscripting for easy parsing, stack
building, etc. Check http://www.biterscripting.com/helppages_samplescripts.html
to see if there are any sample scripts you can reuse.
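
Since the original code is lost, the following is only a guess at how the template idea could look, written in Python rather than biterscripting. It assumes each template line is nested inside the line before it (one element per level), that the input is already well-formed, and that added elements get an empty default value. The names parse_template, add_missing and repair are invented for this sketch, and it covers only the "add missing mandatory elements" step, not reordering of misplaced elements.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """\
m:company
m:department
m:employee
o:salary
"""

def parse_template(text):
    """Return [(name, mandatory), ...], outermost level first."""
    levels = []
    for line in text.split():
        flag, name = line.split(":", 1)
        levels.append((name, flag == "m"))
    return levels

def add_missing(element, levels, depth=1, defaults=None):
    """Recursively add mandatory children that the template expects."""
    if depth >= len(levels):
        return
    name, mandatory = levels[depth]
    children = element.findall(name)
    if not children and mandatory:
        child = ET.SubElement(element, name)
        child.text = (defaults or {}).get(name)   # hypothetical default value
        children = [child]
    for child in children:
        add_missing(child, levels, depth + 1, defaults)

def repair(xml_text, template=TEMPLATE):
    """Parse well-formed XML and fill in missing mandatory elements."""
    levels = parse_template(template)
    root = ET.fromstring(xml_text)
    if root.tag != levels[0][0]:
        raise ValueError(f"unexpected root element: {root.tag}")
    add_missing(root, levels)
    return ET.tostring(root, encoding="unicode")
```

Calling repair("<company/>") adds a department containing an employee, while the optional salary is left out. Elements that appear at the wrong level are not moved by this sketch; those are among the cases that may not be repairable automatically.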
