WebBrowser and Returning the raw HTML

  • Thread starter: Craig Francis

Craig Francis

Ok I'm making a fairly simple application. It contains 2
web browsers, the top one is used so that you can view a
website (i.e. one you have created). Every time you load
a page, the HTML which was received is then sent to the
http://validator.w3.org website to validate your HTML /
XHTML.

So far I've got everything to work, even the part where
the HTML is posted to the w3.org website.

But all of the following commands (Browser1 is the main
WebBrowser control) produce a form of HTML for the
document, but all the tags get converted to uppercase and
parts of the document go missing such as the "DOCTYPE"...

Browser1.Document.ToString()
Browser1.Document.documentelement.outerhtml
Browser1.Document.documentelement.innerhtml
Browser1.Document.Body.outerhtml
Browser1.Document.Body.innerhtml
Browser1.Document.All(0).outerhtml
Browser1.Document.All(0).innerhtml
Browser1.Document.All(1).outerhtml
Browser1.Document.All(1).innerhtml
Browser1.Document.All(2).outerhtml
Browser1.Document.All(2).innerhtml


NOTE: The HTML sent to the w3.org website must be exactly
the same as what the server sends otherwise what's the
point in validating it?

Finally, because it will be used on interactive websites
(with a user login), you can't use controls such as the
Inet control to fetch the HTML separately. The main browser
makes its request to the server (which may delete a
record), and then the Inet or Winsock control makes a
second request, which returns a different page
(saying you can't delete the record).
 
Hi Craig

Because the DOCTYPE tag is outside the main document, it is not included
when you retrieve inner and outer HTML. To include the entire file you will
need to use the IPersistStreamInit interface, e.g.

<interface>
Imports System.Runtime.InteropServices

' IPersistStreamInit interface
<ComVisible(True), ComImport(), _
 Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), _
 InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IPersistStreamInit
    Sub GetClassID(ByRef pClassID As Guid)

    <PreserveSig()> Function IsDirty() As Integer
    <PreserveSig()> Function Load(ByVal pstm As UCOMIStream) As Integer
    <PreserveSig()> Function Save(ByVal pstm As UCOMIStream, _
        ByVal fClearDirty As Boolean) As Integer
    <PreserveSig()> Function GetSizeMax(<InAttribute(), Out(), _
        MarshalAs(UnmanagedType.U8)> ByRef pcbSize As Long) As Integer
    <PreserveSig()> Function InitNew() As Integer
End Interface
</interface>

<code>
Dim ips as IPersistStreamInit

ips = DirectCast(Browser1.document, IPersistStreamInit)

ips.Save(strm, False)
</code>

This will save the complete HTML to a stream, which you can turn into a
string.

Regarding the conversion to uppercase, is this actually a problem? The
change of case should not affect the validity of the parsing.

There are also two particular newsgroups which may give further help:

microsoft.public.inetsdk.programming.mshtml_hosting
microsoft.public.inetsdk.programming.webbrowser_ctl

HTH

Charles
 
Thank you for your quick reply.

But is that VB code? I've been using VB5/6 for several
years and that looks slightly C-like. This project is
being written in VB.NET, but I've only just upgraded and
am finding some of these new methods a little strange.

Also RE the tags being changed to uppercase - The reason
I mentioned it was because it shows that the HTML
document is being changed, probably into a form that the
browser can easily understand (and is probably strict XML
even if the input wasn't XML based).

Anyway, thanks for giving me something else to try.

Craig
 
Got it, you put all the

<interface></interface>

before the "Public Class Form1" bit - so the first part
of the form, then the

<code></code>

in the function which returns the HTML code. Well, that
method doesn't bring up any errors apart from what "strm"
should be declared (Dim'ed) as - I've never used a stream before.

But thanks again - this is the most progress I've made in
the past 2 days!
 
Hi Craig

Yes, sorry about that. It's just a habit I have got into to show where code
and stuff begins and ends. Add the following for the stream handling:

<code>
<DllImport("OLE32.DLL")> _
Public Shared Sub CreateStreamOnHGlobal(ByVal hGlobal As IntPtr, _
    ByVal fDelete As Boolean, ByRef stm As UCOMIStream)
    ' LEAVE THIS BLANK - PLACEHOLDER
End Sub

<DllImport("OLE32.DLL")> _
Public Shared Sub GetHGlobalFromStream(ByVal stm As UCOMIStream, _
    ByRef hGlobal As IntPtr)
    ' LEAVE THIS BLANK - PLACEHOLDER
End Sub

Private Function GetStream(ByVal size As Integer) As UCOMIStream

    Dim iptr As IntPtr
    Dim strm As UCOMIStream

    ' Allocate the memory block and wrap it in a COM stream
    iptr = Marshal.AllocHGlobal(size)
    CreateStreamOnHGlobal(iptr, True, strm)

    Return strm

End Function

Private Function StreamToString(ByVal strm As UCOMIStream) As String

    Dim iptr As IntPtr
    Dim s As String

    ' Get the memory behind the stream and read it as an ANSI string
    GetHGlobalFromStream(strm, iptr)
    s = Marshal.PtrToStringAnsi(iptr)

    Return s

End Function
</code>

<code>
Dim strm As UCOMIStream
Dim s As String

' Allocate a reasonably high value!
strm = GetStream(2048)

' Save HTML and convert to a string
ips.Save(strm, False)
s = StreamToString(strm)
</code>

The code above should allow you to get the full HTML. The only issue
with this is the allocation of the stream. IPersistStreamInit.GetSizeMax()
should return a value indicating the size of the stream required, but it
always returns zero. The best way, therefore, is to read the stream a bit at
a time until the buffer is empty, but for simplicity I have just allocated a
stream that should be big enough to take it all in one go. You can make it
bigger, of course, if you need to.
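
For reference, reading the stream "a bit at a time" might look roughly like the sketch below. It is not from the original posts: the module name, the ReadAllFromStream helper and the 4 KB chunk size are made up for illustration, and it assumes the page text is in the system ANSI code page (the same assumption PtrToStringAnsi makes above). It uses UCOMIStream to match the rest of the thread; later framework versions mark that type as obsolete in favour of System.Runtime.InteropServices.ComTypes.IStream.

```vbnet
Imports System.IO
Imports System.Runtime.InteropServices
Imports System.Text

Public Module StreamReading

    ' Hypothetical helper (not from the thread): drains a COM stream in
    ' 4 KB chunks and collects the bytes before decoding them.
    Public Function ReadAllFromStream(ByVal strm As UCOMIStream) As String
        Dim chunk(4095) As Byte
        Dim collected As New MemoryStream()

        ' UCOMIStream.Read reports the byte count through a pointer,
        ' so allocate 4 bytes of unmanaged memory to receive it.
        Dim pcbRead As IntPtr = Marshal.AllocCoTaskMem(4)

        Try
            ' After Save the position is normally at the end of the
            ' stream, so seek back to the start (origin 0 = STREAM_SEEK_SET).
            strm.Seek(0, 0, IntPtr.Zero)

            Do
                Marshal.WriteInt32(pcbRead, 0)
                strm.Read(chunk, chunk.Length, pcbRead)

                Dim bytesRead As Integer = Marshal.ReadInt32(pcbRead)
                If bytesRead = 0 Then Exit Do

                collected.Write(chunk, 0, bytesRead)
            Loop
        Finally
            Marshal.FreeCoTaskMem(pcbRead)
        End Try

        ' Assumption: ANSI text. Decode once at the end so that a
        ' multi-byte character is never split across two chunks.
        Return Encoding.Default.GetString(collected.ToArray())
    End Function

End Module
```

With something like this in place, the call after ips.Save(strm, False) would become s = ReadAllFromStream(strm) instead of s = StreamToString(strm).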

HTH

Charles
 
Hello,

Craig Francis said:
Ok I'm making a fairly simple application. It contains 2
web browsers, the top one is used so that you can view a
website (i.e. one you have created). Every time you load
a page, the HTML which was received is then sent to the
http://validator.w3.org website to validate your HTML /
XHTML.

So far I've got everything to work, even the part where
the HTML is posted to the w3.org website.

But all of the following commands (Browser1 is the main
WebBrowser control) produce a form of HTML for the
document, but all the tags get converted to uppercase and
parts of the document go missing such as the "DOCTYPE"...

I don't really understand why you use the WebBrowser control to download the
web page. Why not use, for example, the 'WebRequest' class?
 
I don't really understand why you use the WebBrowser
control to download the
web page. Why not use, for example, the 'WebRequest'
class?

Because I'm fairly new to VB.NET and wanted a simple
application to create - well, what I thought might be
simple.

Also I've used the WebBrowser control before and it was a
simple way to add a browser to the application where the
user could navigate in exactly the same way as in IE.
 
Charles,
Just a question: I have seen that you always use mshtml.IHTMLDocument2, while
I use mshtml.HTMLDocument.
My understanding is that with it I can access all <tags>, including the src,
innerText and innerHTML etc., per frame page.

What am I missing?
Cor
 
Hi Cor

Long time no speak.

The simple answer is speed. Try the following on an initialised WebBrowser
control and you may be surprised:

<code>
Dim doc As mshtml.HTMLDocument
Dim doc2 As mshtml.IHTMLDocument2
Dim elem As mshtml.IHTMLElement

Dim dt As Date

MsgBox("Start")

' Time 1000 calls through the HTMLDocument class
dt = Now

For i As Integer = 1 To 1000
    doc = DirectCast(AxWebBrowser1.Document, mshtml.HTMLDocument)
    elem = doc.createElement("INPUT")
Next i

MsgBox(Now.Subtract(dt).ToString)

' Time the same 1000 calls through the IHTMLDocument2 interface
dt = Now

For i As Integer = 1 To 1000
    doc2 = DirectCast(AxWebBrowser1.Document, mshtml.IHTMLDocument2)
    elem = doc2.createElement("INPUT")
Next i

MsgBox(Now.Subtract(dt).ToString)
</code>

I used mshtml.HTMLDocument once in the earlier post because New doesn't work
on interfaces of course. But otherwise I use the interfaces. It means a bit
more code to cast to the correct one all the time [long live Option Strict
On], but it's worth it in performance.

Regards

Charles
 
Charles,
I don't have to test it. I can simply believe you (which of course I always
do), and in this case there really was a speed problem.

I wanted to insert your piece of code, and I came across this line:
Dim tagname As String = iDocument.all.item(i).tagName ' voor snelheid

("voor snelheid" is Dutch for "for speed")

I normally try to avoid putting comments in a program, because I find that
needing them means the programming was not done well,
but this was such a stupid construction.

So I am going to change that big routine and try to use IHTMLDocument2 there.

I have not tested it yet, because I use mshtml in a different class from the
WebBrowser. I made that once to work around the slow behaviour of the IDE,
before I discovered that it could be overcome by simply not putting the Imports
statement in the program.

Again thanks a lot

Cor
 
Hi Craig,

Using WebRequest will allow you to get the HTML in the raw.

Have a play with the routine below. It will show you how easy it is to get
a web page.

Regards,
Fergus

<code>
' These two Imports go at the top of the file.
Imports System.IO
Imports System.Net

Public Sub GetThisWebPage(ByVal sUrl As String)
    'What we want.
    Dim oRequest As WebRequest = WebRequest.Create(sUrl)

    'Go get it. The cast assumes an http:// or https:// URL.
    Dim oResponse As HttpWebResponse = _
        DirectCast(oRequest.GetResponse(), HttpWebResponse)

    'Let's see some info about the response.
    Dim S As String = "To: " & sUrl & vbCrLf
    S = S & "From: " & oResponse.ResponseUri.ToString() & vbCrLf & vbCrLf

    S = S & "Headers:" & vbCrLf
    Dim I As Integer
    For I = 0 To oResponse.Headers.Count - 1
        S = S & " <" & oResponse.Headers.Keys(I) & "> "
        S = S & oResponse.Headers.Item(I) & vbCrLf
    Next
    S = S & vbCrLf
    S = S & "Type: " & oResponse.ContentType & vbCrLf
    S = S & "Len: " & oResponse.ContentLength.ToString() & vbCrLf & vbCrLf
    MsgBox(S)

    'Now the data itself.
    Dim oHtmlStream As New StreamReader(oResponse.GetResponseStream())
    Dim sHtml As String = oHtmlStream.ReadToEnd()
    MsgBox(sHtml)

    'Finish
    oHtmlStream.Close()
    oResponse.Close()
End Sub
</code>
 