Parsing HTML without Using the Browser Control

How to use MSHTML as an HTML parser in Visual Basic without using the browser control.


Environment: VB6 SP5, XPPro, IE6

This article shows how to use the HTML parser built into Microsoft Internet Explorer (MSHTML) from within your own program.

This is usually easy if you use the browser control, and there are plenty of examples on the Internet. When it comes to using the parser without a UI, however, I found nothing written in Visual Basic. All the examples I've seen are in Visual C++ and use interfaces that are not available in Visual Basic.

After days of searching, including trying the .NET platform so that I could use an HTML parser in a Windows NT service, I finally found a way. I don't claim this is the nicest way to do it, but it works like a charm, and it gives you access to the DOM of the HTML document you load, which can be very useful if you need to parse an HTML document.

Your project must have a reference to the Microsoft HTML Object Library, and Internet Explorer 5 or later is required. Simply copy this code into any function.

Dim objLink As HTMLLinkElement
Dim objMSHTML As New MSHTML.HTMLDocument
Dim objDocument As MSHTML.HTMLDocument


' This function is only available with Internet Explorer 5 or later

Set objDocument = objMSHTML.createDocumentFromUrl(txtURL.Text, _
                                                  vbNullString)
    
' The tricky part: the transfer is usually asynchronous, so
' loop until the document reports that it is complete. Note
' that this string might be different if Internet Explorer
' on the machine where the code is executed uses a language
' other than English.

While objDocument.readyState <> "complete"
    DoEvents
Wend

' Source Code

Debug.Print objDocument.documentElement.outerHTML

' Title

Debug.Print "Title : " & objDocument.Title

' Link Collection

For Each objLink In objDocument.links
    lstLinks.AddItem objLink
    Debug.Print "Link:  " & objLink
Next
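
One practical concern with the wait loop above is that it never exits if the document fails to reach the "complete" state. Below is a minimal sketch of a guard against that. It assumes a hypothetical helper named LoadDocument and an arbitrary default timeout of 30 seconds, neither of which is part of the original code. The host document is kept at module level because a reader comment below reports that the returned document loses its data once the object that created it goes out of scope.

```vb
Option Explicit

' Requires a reference to the Microsoft HTML Object Library.
' Module-level host document, kept alive between calls (assumption:
' this addresses the scope issue reported in the reader comments).
Private mHost As MSHTML.HTMLDocument

' Hypothetical helper: returns the loaded document, or Nothing if the
' document is not complete within TimeoutSeconds.
Public Function LoadDocument(ByVal Url As String, _
                             Optional ByVal TimeoutSeconds As Single = 30) _
                             As MSHTML.HTMLDocument
    Dim doc As MSHTML.HTMLDocument
    Dim started As Single
    Dim elapsed As Single

    If mHost Is Nothing Then Set mHost = New MSHTML.HTMLDocument
    Set doc = mHost.createDocumentFromUrl(Url, vbNullString)

    started = Timer
    Do While doc.readyState <> "complete"
        DoEvents
        elapsed = Timer - started
        ' Timer counts seconds since midnight, so correct for rollover.
        If elapsed < 0 Then elapsed = elapsed + 86400
        ' Give up; the function result is still Nothing at this point.
        If elapsed > TimeoutSeconds Then Exit Function
    Loop

    Set LoadDocument = doc
End Function

' Example use; the URL is a placeholder.
Public Sub ShowTitle()
    Dim doc As MSHTML.HTMLDocument
    Set doc = LoadDocument("http://www.example.com/", 10)
    If doc Is Nothing Then
        Debug.Print "Timed out waiting for the document"
    Else
        Debug.Print "Title : " & doc.Title
    End If
End Sub
```

Returning Nothing on timeout lets the caller decide whether to retry or report an error, rather than leaving the program spinning in DoEvents.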



Comments

  • Great example...

    Posted by ShaneB on 08/13/2009 07:21am

    Great example, but how would you go about doing this if you just want to parse the text inside a <textarea> tag?

  • Very good .. but i need more ..

    Posted by Ranjan.net on 07/24/2008 01:31pm

    This 20-line code is very good. One quick question: if I already have the file on disk (cached), can I parse it? How will partial links (e.g. \images\share\full_8789.jpg) be resolved?

  • Acrux Advanced Html Parser

    Posted by Acrux2 on 03/28/2008 06:34am

    A good parser that handles real-world messy HTML and even provides an XmlDocument-like structure of the parsed HTML is the Acrux Advanced Html Parser: http://www.acruxsoftware.net/products.html

  • Strange Error - System.AccessViolationException

    Posted by beauner13 on 09/21/2006 04:04pm

    This code looks like it would be very useful to me, with just one problem:

    When I attempt to use that exact code or any derivation, I get this error:
    A first chance exception of type 'System.AccessViolationException' occurred in mscorlib.dll

    And when I look at the error message, it tells me that memory could be corrupt elsewhere. I've attempted this line of code by omitting the "http://" portion of the URL, by trying numerous web sites, and with various other arguments in the 2nd parameter, such as "", ControlChars.NullChar and "null". I've also reset my PC and created a brand new application with only that code and get the same results.

    I am using VS.NET 2005 w/ .NET framework 2.0

    I don't know if it will help, but the details from the exception object are as follows:

    System.AccessViolationException was caught
      Message="Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
      Source="mscorlib"
      StackTrace:
        at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
        at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
        at MSHTML_DOM_Practice_1.Form1.createDoc(String URL) in D:\Dev V.2\Misc practice projects\MSHTML DOM Practice 1\MSHTML DOM Practice 1\Form1.vb:line 19


    I'm reaching my threshold of frustration and could really use some help!

    Thanks, Beau

    • fix for accessviolation issue

      Posted by sampaths85 on 04/26/2012 03:21am

      This worked for me! http://social.msdn.microsoft.com/forums/en-US/vblanguage/thread/cfbe816a-dc15-4a73-a7fc-8dfbf01d98f0/

  • I would like to connect visual basic to my html page

    Posted by Legacy on 02/05/2004 12:00am

    Originally posted by: Tracy Knowles

    Hello

    I'm doing a project for class and I wanted to know if this code would work to link Visual Basic to an HTML page I created.

    All I want is a link on an HTML page that launches a Visual Basic 6.0 program.

    I really do need your help if you can help me.

  • Great stuff, but how

    Posted by Legacy on 12/29/2003 12:00am

    Originally posted by: Homer

    Just what I was looking for. Now I just need to know how to select and activate a button on the page. I'm not sure of the proper lingo because I'm new to any type of web development. The page that I'm opening is on an intranet and displays current data. To see the previous week's data I have to click a back arrow labeled "prior week". How do I do that in code?

    I know now that the arrow I'm clicking on executes some JavaScript. Is there a way to execute that same JavaScript in code using VB/MSHTML? I have tried using the IHTMLElementCollection. With this I can capture the element, but once I have it I don't know how to execute the JavaScript. Is it possible to do that?

  • Scope issue

    Posted by Legacy on 12/19/2003 12:00am

    Originally posted by: Robert C

    A little issue I found while using this code:

    Dim objMSHTML As New MSHTML.HTMLDocument
    Dim objDocument As MSHTML.HTMLDocument

    Set objDocument = objMSHTML.createDocumentFromUrl(txtURL.Text, vbNullString)

    If you want to pass objDocument between functions then objMSHTML must be global in the module; so if you use this in a form initialise objMSHTML in Form_Load and dispose of it in Form_Terminate.

    If you used that code in a function then returned the HTMLDocument you opened, the data would be lost - even though the reference passes OK.

    • Memory leaks: can you show an example of Form_Terminate?

      Posted by blackbookcoder on 10/23/2004 12:40am

      I get memory leaks when I run this. Can you tell us how to use the Form_Terminate sub for this code? Thanks, blackbookcoder

      • response to memory leak question

        Posted by art on 01/20/2017 07:23am

        Simply set the three objects to Nothing in Form_Unload like this:

        If Not objLink Is Nothing Then Set objLink = Nothing
        If Not objMSHTML Is Nothing Then Set objMSHTML = Nothing
        If Not objDocument Is Nothing Then Set objDocument = Nothing

  • Except MSHTML doesn't produce correct HTML

    Posted by Legacy on 12/16/2003 12:00am

    Originally posted by: Neil Stansbury

    All well and good, except it's a shame MSHTML doesn't produce standards-compliant code. Run document.documentElement.outerHTML past the W3C Validator; it produces so many errors it's not funny.

  • ..Been looking for a good example for a while... THANKS!

    Posted by Legacy on 11/12/2003 12:00am

    Originally posted by: jlw

    ..Been looking for a good example for a while... THANKS!

  • Cool, just made my Messager

    Posted by Legacy on 09/04/2003 12:00am

    Originally posted by: Steve

    I parse an HTML page of online users and no longer need to look at that page; your app tells me who's online. Great, thanks!
