Parsing badly formed XML  
Michael BBB





PostPosted: XML and the .NET Framework, Parsing badly formed XML Top

I am trying to read an HTML document into a System.Xml.XmlDocument object in .NET 2.0. The other option is to read the HTML document into a System.Xml.XmlReader object that I can then traverse. Unfortunately, the HTML document is NOT a well-formed XML document.

.NET 2.0 adds a new class, System.Windows.Forms.WebBrowser, that is evidently capable of parsing HTML documents that are NOT well-formed. It exposes a property that returns a System.Windows.Forms.HtmlDocument, which provides essentially the same DOM capabilities as System.Xml.XmlDocument. This comes at a price, though: the WebBrowser object is fairly CPU intensive, and if my application regularly fetches HTML documents and loads them into a WebBrowser object, it starts to hog the CPU. I would like to avoid this by finding an alternative that lets me keep using System.Net.HttpWebRequest to read the HTML pages.

.NET 2.0 can evidently turn HTML that is NOT well-formed into a traversable DOM object (HtmlDocument) through the WebBrowser object. Can it do this without resorting to this CPU-intensive method?

Many thanks for your help.



Martin Honnen






If you use the .NET 2.0 WebBrowser control, you are using a managed wrapper around the COM-based Microsoft web browser control on which Internet Explorer itself is based.

If you only want to parse HTML, not render it, it might be better to look into libraries that support that. There are at least two such libraries: SgmlReader and HTML Agility Pack.
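
As a rough illustration of the SgmlReader route, the sketch below fetches a page with HttpWebRequest and feeds it through SgmlReader into an ordinary XmlDocument, with no WebBrowser control involved. It assumes the third-party SgmlReader assembly is referenced (namespace Sgml); the class name, the LoadHtml helper, and the example URL are made up for illustration, and the response is assumed to be UTF-8 text.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;
using Sgml; // third-party SgmlReader library, referenced separately (assumption)

class HtmlToXmlSketch
{
    // Hypothetical helper: downloads a page and loads it as an XmlDocument.
    static XmlDocument LoadHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            // SgmlReader is an XmlReader that tolerates HTML that is not well-formed.
            SgmlReader sgml = new SgmlReader();
            sgml.DocType = "HTML";                  // use the built-in HTML DTD
            sgml.CaseFolding = CaseFolding.ToLower; // normalize element names to lower case
            sgml.InputStream = reader;

            XmlDocument doc = new XmlDocument();
            doc.Load(sgml); // XmlDocument.Load accepts any XmlReader
            return doc;
        }
    }

    static void Main()
    {
        // Example URL is a placeholder.
        XmlDocument doc = LoadHtml("http://example.com/");

        // Walk the resulting DOM: print the href of every link that has one.
        foreach (XmlNode node in doc.GetElementsByTagName("a"))
        {
            XmlElement link = node as XmlElement;
            if (link != null && link.HasAttribute("href"))
                Console.WriteLine(link.GetAttribute("href"));
        }
    }
}
```

Once the document is loaded this way, the usual XmlDocument members (SelectNodes, GetElementsByTagName, and so on) are available, so the rest of the application would not need to know the source was HTML.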



 
 
sKIPper76






I would advise against HTML Agility Pack; I don't think it does such a great job. Some of the malformed documents I tried to transform into well-formed XML remained malformed. SgmlReader seems to work well for me. Tidy is another option.