Lightweight HTML Parsing Using MSHTML

Environment: Windows 2000 / Windows ME / IE 5.0+

I have a lot of experience programming MSHTML at a low level, and I often see questions about how to use MSHTML to parse HTML and then access elements through the DOM.

Well, here it is. I use IMarkupServices, which MSHTML provides. There is no need for an IOleClientSite or any sort of embedding. I think this is just about as lightweight as it gets.

In future articles, I will concentrate on reusing MSHTML in other areas of programming, such as using MSHTML as an editor.

This code uses simple COM calls and nothing more. It can be easily adapted for ATL, MFC and VB, among other languages. Please don't ask me to provide samples in other languages. To build it, you need the IE SDK.

/******************************************************************
 * ParseHTML.cpp
 *
 *  ParseHTML: Lightweight UI-less HTML parser using MSHTML
 *
 *  Note: This is for accessing the DOM only. No image download, 
 *        script execution, etc...
 *
 *  8 June 2001 - Asher Kobin (asherk@pobox.com)
 *  
 *  THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY 
 *  OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT 
 *  LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR 
 *  FITNESS FOR A PARTICULAR PURPOSE.
 *
 *******************************************************************/

#include <windows.h>
#include <mshtml.h>

OLECHAR szHTML[] = OLESTR("<HTML><BODY>Hello World!</BODY></HTML>");

int __stdcall WinMain(HINSTANCE hInst, 
                      HINSTANCE hPrev, 
                      LPSTR lpCmdLine, 
                      int nShowCmd)
{
  IHTMLDocument2 *pDoc = NULL;

  CoInitialize(NULL);

  CoCreateInstance(CLSID_HTMLDocument, 
                   NULL, 
                   CLSCTX_INPROC_SERVER, 
                   IID_IHTMLDocument2, 
                   (LPVOID *) &pDoc);

  if (pDoc)
  {
    IPersistStreamInit *pPersist = NULL;

    pDoc->QueryInterface(IID_IPersistStreamInit, 
                       (LPVOID *) &pPersist);

    if (pPersist)
    {
      IMarkupServices *pMS = NULL;
  
      pPersist->InitNew();
      pPersist->Release();

      pDoc->QueryInterface(IID_IMarkupServices, 
                              (LPVOID *) &pMS);

      if (pMS)
      {
        IMarkupContainer *pMC = NULL;
        IMarkupPointer *pMkStart = NULL;
        IMarkupPointer *pMkFinish = NULL;

        pMS->CreateMarkupPointer(&pMkStart);
        pMS->CreateMarkupPointer(&pMkFinish);

        pMS->ParseString(szHTML, 
                         0, 
                         &pMC, 
                         pMkStart, 
                         pMkFinish);

        if (pMC)
        {
          IHTMLDocument2 *pNewDoc = NULL;

          // pNewDoc is an IHTMLDocument2 pointer, so ask for that interface.
          pMC->QueryInterface(IID_IHTMLDocument2, 
                              (LPVOID *) &pNewDoc);

          if (pNewDoc)
          {
            // do anything with pNewDoc, in this case 
            // get the body innerText.

            IHTMLElement *pBody = NULL;
            pNewDoc->get_body(&pBody);

            if (pBody)
            {
              BSTR strText = NULL;

              pBody->get_innerText(&strText);
              pBody->Release();

              SysFreeString(strText);
            }

            pNewDoc->Release();
          }

          pMC->Release();
        }

        if (pMkStart)
          pMkStart->Release();

        if (pMkFinish)
          pMkFinish->Release();

        pMS->Release();
      }
    }

    pDoc->Release();
  }

  CoUninitialize();
  
  return 0;
}

Downloads

None. Source code provided above.


Comments

  • Larger Sample

    Posted by jeffrey@toad.net on 12/17/2006 10:53am

    Hi Asher, In typical Microsoft fashion, the big picture is missed due. I have not been able to locate quality samples (any samples for that matter) near IMarkupContainer. Do you have a larger nugget? I'm very interested in IHTMLElement (see http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/ifaces/ihtmlelement/element.asp) Jeff

    Reply
  • Marvelous!

    Posted by ted_b on 05/23/2006 05:19pm

    Thank you very much for this great example -- it saved me so much time!!!

    Reply
  • On a related subject...IHTMLDOMNode

    Posted by Legacy on 12/18/2003 12:00am

    Originally posted by: David Rigby

    I'm having problems using IHTMLDOMNode in my project. All references to this interface produce 'use of undefined type' errors, even though, looking in the TLH/TLI generated from the import of MSHTML, the definition of the interface is there. I can use other interfaces defined in MSHTML (and appearing in the TLH in an identical way), but it seems that none of the HTMLDOMxxx ones are usable (IHTMLDOMAttribute also produces the same error). I'm sure this is something simple but I really can't see it! Any clues appreciated.

    Thanks

    David

    Reply
  • Delphi Version ( Simpler Technique + Extras )

    Posted by Legacy on 11/16/2003 12:00am

    Originally posted by: Joe James

    ( add to uses clause, MSHTML, ActiveX, ComObj )
    
    

    const
    IID_IPersistStreamInit : TGUID = '{7FD52380-4E07-101B-AE2D-08002B2EC713}';

    procedure TFormMain.FormCreate(Sender: TObject);
    var
    pDoc : IHTMLDocument2;
    pBody : IHTMLElement;
    strText : string;
    szHTML : widestring;
    didInit : boolean;
    begin
    didInit :=Succeeded(CoInitialize(nil));
    szHTML :='<HTML><BODY>Hello World!</BODY></HTML>';
    CoCreateInstance(CLASS_HTMLDocument, nil, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, pDoc);
    if pDoc <> nil then
    begin
    pDoc.Set_designMode('On'); //no script execution
    while not (pDoc.readyState = 'complete') do Application.ProcessMessages;
    pDoc.body.innerHTML :=szHTML;
    pBody :=pDoc.Get_body;
    if pBody <> nil then
    strText :=pBody.Get_innerText else strText :='';
    m.Text :=strText;
    pDoc._Release;
    end;
    if didInit then CoUninitialize();
    end;

    ============== Other Useful Routines ===============


    function GetHTMLSource(Document: IDispatch) : string;
    var
    pStream : IStream;
    pPersist : IPersistStreamInit;
    li,lo : int64;
    stat : STATSTG;
    str : string;
    BytesRead : longint;
    begin
    result :='';
    if SUCCEEDED(CreateStreamOnHGlobal(0, TRUE, pStream)) then
    begin
    if (SUCCEEDED(Document.QueryInterface(IID_IPersistStreamInit, pPersist))) then
    begin
    pPersist.Save(pStream, FALSE);
    li :=0;
    pStream.Seek(li, STREAM_SEEK_SET, lo);
    pStream.Stat(stat, 0);
    SetLength(str,stat.cbSize + 1);
    pStream.Read(@str[1], stat.cbSize, @BytesRead);
    result :=str;
    end;
    end;
    end;

    procedure SetHTMLSource(Document: IDispatch; value: string);
    var
    stm : TMemoryStream;
    psi : IPersistStreamInit;
    sa : TStreamAdapter;
    begin
    stm :=TMemoryStream.Create;
    stm.SetSize(Length(value));
    stm.Seek(0,0);
    stm.Write(value[1],Length(value));
    stm.Seek(0,0);
    sa :=TStreamAdapter.Create(stm, soReference); //if you pass soOwned instead, the stream will be freed for you
    if (SUCCEEDED(Document.QueryInterface(IID_IPersistStreamInit,psi))) then
    psi.Load(sa);
    end;

    Reply
  • Delphi Version

    Posted by Legacy on 11/16/2003 12:00am

    Originally posted by: Joe James

    ( add to uses clause, MSHTML, ActiveX, ComObj )
    
    

    const
    IID_IPersistStreamInit : TGUID = '{7FD52380-4E07-101B-AE2D-08002B2EC713}';

    procedure TFormMain.FormCreate(Sender: TObject);
    var
    pDoc : IHTMLDocument2;
    pNewDoc : IHTMLDocument2;
    pPersist : IPersistStreamInit;
    pMS : IMarkupServices;
    pMC : IMarkupContainer;
    pMkStart : IMarkupPointer;
    pMkFinish : IMarkupPointer;
    pBody : IHTMLElement;
    strText : string;
    szHTML : widestring;
    didInit : boolean;
    begin
    didInit :=Succeeded(CoInitialize(nil));
    szHTML :='<HTML><BODY>Hello World!</BODY></HTML>';
    CoCreateInstance(CLASS_HTMLDocument, nil, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, pDoc);
    if pDoc <> nil then
    begin
    pDoc.QueryInterface(IID_IPersistStreamInit, pPersist);
    if (pPersist <> nil) then
    begin
    pPersist.InitNew;
    pPersist._Release;
    pDoc.QueryInterface(IID_IMarkupServices, pMS);
    if (pMS <> nil) then
    begin
    pMS.CreateMarkupPointer(pMkStart);
    pMS.CreateMarkupPointer(pMkFinish);
    pMS.ParseString(word(szHTML[1]), 0, pMC, pMkStart, pMkFinish);
    if (pMC <> nil) then
    begin
    pMC.QueryInterface(IID_IHTMLDocument, pNewDoc);
    if (pNewDoc <> nil) then
    begin
    // do anything with pNewDoc, in this case
    // get the body innerText.
    pBody :=pNewDoc.Get_body;
    if (pBody <> nil) then
    begin
    strText :=pBody.Get_innerText;
    m.Text :=strText;
    pBody._Release;
    end;
    pNewDoc._Release;
    end;
    pMC._Release;
    end;
    if (pMkStart <> nil) then pMkStart._Release;
    if (pMkFinish <> nil) then pMkFinish._Release;
    pMS._Release;
    end;
    // pPersist was already released right after InitNew, so it is not released again here.
    end;
    pDoc._Release;
    end;
    if didInit then CoUninitialize();
    end;

    Reply
  • $ seems to be reserved to indicate a template

    Posted by Legacy on 03/19/2003 12:00am

    Originally posted by: Peter

    The $ character seems to be used to indicate that the item is a template.

    Reply
  • error LNK2001

    Posted by Legacy on 02/13/2003 12:00am

    Originally posted by: genie

    I created ParseHTML.cpp by copying the source from this site.
    The result is "LIBCD.lib(crt0.obj) : error LNK2001: unresolved external symbol _main".
    What is wrong?

    Reply
  • Another bug with ParseString

    Posted by Legacy on 10/08/2002 12:00am

    Originally posted by: Simon

    When I created an IMarkupContainer using ParseString, the <BODY> element magically lost all its attributes.

    Now I'm using the open, write and close methods to write into IHTMLDocument2 first, and it works.

    Reply
  • 0x80004002 "No such interface supported" Error

    Posted by Legacy on 10/06/2002 12:00am

    Originally posted by: Simon

    When I query IHTMLDocument2 to get IMarkupInterfaces, I get the 0x80004002 error result. Any clues as to what I am doing wrong? I'm using the same code as the sample above in VC++ 6.0 with the Platform and IE6 SDKs. I'm developing a COM server (in a multithreaded apartment).

    Thanks for any clues in advance.

    Reply