Parsing HTML without Using the Browser Control
Environment: VB6 SP5, XPPro, IE6
This article shows how to use the HTML parser built into Microsoft Internet Explorer from within your own program.
This is usually easy if you use the browser control, and there are plenty of examples of that on the Internet. When it comes to using the parser without any UI, however, I found nothing written in Visual Basic. All the examples I've seen are in Visual C++ and use interfaces that are not available in Visual Basic.
After days of searching, including trying the .NET platform so that I could use an HTML parser in a Windows NT service, I finally found a way. I don't claim this is the nicest way to do it, but it works like a charm and gives you access to the DOM of the document, which is very useful if you need to parse an HTML document.
Your project must have a reference to the Microsoft HTML Object Library, and Internet Explorer 5 or later is required. Simply copy this code into any function.
Dim objLink As HTMLLinkElement
Dim objMSHTML As New MSHTML.HTMLDocument
Dim objDocument As MSHTML.HTMLDocument

' This function is only available with Internet Explorer 5
Set objDocument = objMSHTML.createDocumentFromUrl(txtURL.Text, _
                                                  vbNullString)

' Tricky, to make the function wait for the document to
' complete, usually the transfer is asynchronous. Note
' that this string might be different if you have another
' language than English for Internet Explorer on the
' machine where the code is executed.
While objDocument.readyState <> "complete"
    DoEvents
Wend

' Source Code
Debug.Print objDocument.documentElement.outerHTML

' Title
Debug.Print "Title : " & objDocument.Title

' Link Collection
For Each objLink In objDocument.links
    lstLinks.AddItem objLink
    Debug.Print "Link: " & objLink
Next
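The wait loop above keeps spinning until readyState becomes "complete", so it never returns if the download stalls or fails. Here is a minimal sketch of a bounded wait. The helper name WaitForDocument and its timeout parameter are hypothetical and not part of the original code.

```vb
' Hypothetical helper (not from the original article): waits for the document
' to finish loading, but gives up after TimeoutSeconds.
' Returns True if readyState reached "complete", False on timeout.
Private Function WaitForDocument(ByVal objDoc As MSHTML.HTMLDocument, _
                                 ByVal TimeoutSeconds As Single) As Boolean
    Dim sngStart As Single
    sngStart = Timer    ' Timer returns the seconds elapsed since midnight

    Do While objDoc.readyState <> "complete"
        DoEvents
        ' Timer wraps to 0 at midnight; shift the start back one day if so
        If Timer < sngStart Then sngStart = sngStart - 86400
        ' Leaving here keeps the default return value of False
        If Timer - sngStart > TimeoutSeconds Then Exit Function
    Loop

    WaitForDocument = True
End Function
```

With this in place, the While/Wend loop could be replaced by a check such as If Not WaitForDocument(objDocument, 30) Then ..., where the 30-second limit is an arbitrary choice.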

Comments
Great example...
Posted by ShaneB on 08/13/2009 07:21am
Great example, but how would you go about doing this when you only want to parse the text inside a <textarea> tag?

Very good .. but i need more ..
Posted by Ranjan.net on 07/24/2008 01:31pm
This 20-line code is very good. One quick question: if I already have the file on disk (cached), can I parse it? How will partial links (e.g. \images\share\full_8789.jpg) be resolved?

Acrux Advanced Html Parser
Posted by Acrux2 on 03/28/2008 06:34am
A good parser that handles real-world messy HTML and even provides an XmlDocument-like structure of the parsed HTML is the Acrux Advanced Html Parser: http://www.acruxsoftware.net/products.html

Strange Error - System.AccessViolationException
Posted by beauner13 on 09/21/2006 04:04pm
This code looks like it would be very useful to me, with just one problem:
When I attempt to use that exact code or any derivation, I get this error:
A first chance exception of type 'System.AccessViolationException' occurred in mscorlib.dll
And when I look at the error message, it tells me that memory could be corrupt elsewhere. I've attempted this line of code by omitting the "http://" portion of the URL, by trying numerous web sites, and with various other arguments in the 2nd parameter, such as "", ControlChars.NullChar and "null". I've also reset my PC and created a brand new application with only that code and get the same results.
I am using VS.NET 2005 w/ .NET framework 2.0
I don't know if it will help, but the details from the exception object are as follows:
System.AccessViolationException was caught
  Message="Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
  Source="mscorlib"
  StackTrace:
    at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
    at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
    at MSHTML_DOM_Practice_1.Form1.createDoc(String URL) in D:\Dev V.2\Misc practice projects\MSHTML DOM Practice 1\MSHTML DOM Practice 1\Form1.vb:line 19
I'm reaching my threshold of frustration and could really use some help!
Thanks, Beau

I would like to connect visual basic to my html page
Posted by Legacy on 02/05/2004 12:00am
Originally posted by: Tracy Knowles
Hello
I'm doing a project for class and I wanted to know if this code would work to link Visual Basic to an HTML page I created.
All I want to do is have a link on an HTML page that goes to a Visual Basic 6.0 program.
I really do need your help if you can help me.

Great stuff, but how
Posted by Legacy on 12/29/2003 12:00am
Originally posted by: Homer
Just what I was looking for. Now I just need to know how to select and activate a button on the page. I'm not sure of the proper lingo because I'm new to any type of web development. The page that I'm opening is on an intranet and displays current data. To see the previous week's data I have to click a back arrow labeled "prior week". How do I do that in code?
I know now that the arrow I'm clicking on is executing a JavaScript. Is there a way to execute that same JavaScript in code using VB/MSHTML? I have tried using the IHTMLElementCollection. With this I can capture the element, but once I have it I don't know how to execute the JavaScript. Is it possible to do that?

Scope issue
Posted by Legacy on 12/19/2003 12:00am
Originally posted by: Robert C
A little issue I found while using this code:
Dim objMSHTML As New MSHTML.HTMLDocument
Dim objDocument As MSHTML.HTMLDocument
Set objDocument = objMSHTML.createDocumentFromUrl(txtURL.Text, vbNullString)
If you want to pass objDocument between functions then objMSHTML must be global in the module; so if you use this in a form initialise objMSHTML in Form_Load and dispose of it in Form_Terminate.
If you used that code in a function then returned the HTMLDocument you opened, the data would be lost - even though the reference passes OK.
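As a rough illustration of the suggestion above, here is a sketch that keeps the parent object at form scope so it outlives any single procedure. The names mobjMSHTML, mobjDocument and OpenDocument are hypothetical. Whether releasing the references in Form_Terminate is enough to avoid leaks is an assumption, not something confirmed in this thread.

```vb
' Form-level variables: the parent object stays alive for as long as the form does
Private mobjMSHTML As MSHTML.HTMLDocument
Private mobjDocument As MSHTML.HTMLDocument

Private Sub Form_Load()
    ' Create the parent document once, when the form loads
    Set mobjMSHTML = New MSHTML.HTMLDocument
End Sub

' Hypothetical helper: starts loading a URL and hands back the document.
' The caller still has to wait for readyState to become "complete".
Private Function OpenDocument(ByVal strURL As String) As MSHTML.HTMLDocument
    Set mobjDocument = mobjMSHTML.createDocumentFromUrl(strURL, vbNullString)
    Set OpenDocument = mobjDocument
End Function

Private Sub Form_Terminate()
    ' Release the child first, then the parent it was created from
    Set mobjDocument = Nothing
    Set mobjMSHTML = Nothing
End Sub
```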

memory leaks can you show an example of form_terminate?
Posted by blackbookcoder on 10/23/2004 12:40am
Memory leaks when I run this. Can you tell us how to use the Form_Terminate sub for this code? Thanks, blackbookcoder

Except MSHTML doesn't produce correct HTML
Posted by Legacy on 12/16/2003 12:00am
Originally posted by: Neil Stansbury
All well and good, except it's a shame MSHTML doesn't produce standards-compliant code. Run document.documentElement.outerHTML past the W3C Validator: it produces so many errors it's not funny.

..Been looking for a good example for a while... THANKS!
Posted by Legacy on 11/12/2003 12:00am
Originally posted by: jlw
..Been looking for a good example for a while... THANKS!

Cool, just made my Messager
Posted by Legacy on 09/04/2003 12:00am
Originally posted by: Steve
I parse an HTML page of online users and no longer need to look at that page; your app tells me who's online. Great, thanks!