Lightweight HTML Parsing Using MSHTML | CodeGuru

Written By
CodeGuru Staff
Sep 17, 2001

Environment: Windows 2000 / Windows ME / IE 5.0+

I have a lot of experience in programming low-level MSHTML and I always see questions on how one can use MSHTML to parse HTML and then access elements via the DOM.

Well, here it is. I use IMarkupServices, which MSHTML provides. There is no need for an IOleClientSite or any sort of embedding. I think this is just about as light as anyone can get.

In future articles, I will concentrate on reusing MSHTML in other aspects of programming, such as using MSHTML as an editor.

This code makes use of simple COM calls and nothing more. It can be easily adapted for ATL, MFC, and VB, among other languages. Please don't ask me to provide samples in other languages. To build this, you need the IE SDK.

/******************************************************************
 * ParseHTML.cpp
 *
 *  ParseHTML: Lightweight UI-less HTML parser using MSHTML
 *
 *  Note: This is for accessing the DOM only. No image download,
 *        script execution, etc…
 *
 *  8 June 2001 – Asher Kobin (asherk@pobox.com)
 *
 *  THIS CODE AND INFORMATION IS PROVIDED “AS IS” WITHOUT WARRANTY
 *  OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
 *  LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
 *  FITNESS FOR A PARTICULAR PURPOSE.
 *
 *******************************************************************/

#include <windows.h>
#include <mshtml.h>
OLECHAR szHTML[] = OLESTR("<HTML><BODY>Hello World!</BODY></HTML>");
int __stdcall WinMain(HINSTANCE hInst,
                      HINSTANCE hPrev,
                      LPSTR lpCmdLine,
                      int nShowCmd)
{
  IHTMLDocument2 *pDoc = NULL;
  CoInitialize(NULL);
  CoCreateInstance(CLSID_HTMLDocument,
                   NULL,
                   CLSCTX_INPROC_SERVER,
                   IID_IHTMLDocument2,
                   (LPVOID *) &pDoc);
  if (pDoc)
  {
    IPersistStreamInit *pPersist = NULL;
    pDoc->QueryInterface(IID_IPersistStreamInit,
                       (LPVOID *) &pPersist);
    if (pPersist)
    {
      IMarkupServices *pMS = NULL;
      pPersist->InitNew();
      pPersist->Release();
      pDoc->QueryInterface(IID_IMarkupServices,
                              (LPVOID *) &pMS);
      if (pMS)
      {
        IMarkupContainer *pMC = NULL;
        IMarkupPointer *pMkStart = NULL;
        IMarkupPointer *pMkFinish = NULL;
        pMS->CreateMarkupPointer(&pMkStart);
        pMS->CreateMarkupPointer(&pMkFinish);
        pMS->ParseString(szHTML,
                         0,
                         &pMC,
                         pMkStart,
                         pMkFinish);
        if (pMC)
        {
          IHTMLDocument2 *pNewDoc = NULL;
          pMC->QueryInterface(IID_IHTMLDocument2,
                              (LPVOID *) &pNewDoc);
          if (pNewDoc)
          {
            // do anything with pNewDoc, in this case
            // get the body innerText.

            IHTMLElement *pBody = NULL;
            pNewDoc->get_body(&pBody);
            if (pBody)
            {
              BSTR strText = NULL;
              pBody->get_innerText(&strText);
              pBody->Release();
              SysFreeString(strText);
            }
            pNewDoc->Release();
          }
          pMC->Release();
        }
        if (pMkStart)
          pMkStart->Release();
        if (pMkFinish)
          pMkFinish->Release();
        pMS->Release();
      }
    }
    pDoc->Release();
  }
  CoUninitialize();
  return 0;
}
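
The listing above releases every interface by hand, which is why the if blocks nest so deeply. When adapting it for ATL, a smart pointer such as CComPtr can take over the Release() calls. The sketch below shows the idea only and is not part of the original sample: FakeUnknown, FakeCreate, ScopedRelease, and DemoScopedRelease are hypothetical names, and FakeUnknown is a stand-in for IUnknown so that the snippet compiles without the IE SDK.

```cpp
#include <cassert>

// Hypothetical stand-in for IUnknown (an assumption for illustration only).
// Real code would use the MSHTML interfaces from the listing instead.
struct FakeUnknown {
    int refs = 1;                      // starts with one reference held by its creator
    int AddRef()  { return ++refs; }
    int Release() { return --refs; }   // a real COM object would destroy itself at 0
};

// Minimal owning pointer: calls Release() once when it goes out of scope.
// ATL's CComPtr plays this role in real projects.
template <typename T>
class ScopedRelease {
public:
    ScopedRelease() = default;
    ~ScopedRelease() { reset(); }
    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;

    // Address to hand to an out-parameter, like &pDoc in the listing.
    T** put() { reset(); return &p_; }
    T* get() const { return p_; }
    T* operator->() const { assert(p_ != nullptr); return p_; }
    explicit operator bool() const { return p_ != nullptr; }

    void reset() {
        if (p_) { p_->Release(); p_ = nullptr; }
    }

private:
    T* p_ = nullptr;
};

// Hypothetical factory that imitates an out-parameter call such as
// QueryInterface: it adds a reference and hands the pointer back.
inline void FakeCreate(FakeUnknown& obj, FakeUnknown** out) {
    obj.AddRef();
    *out = &obj;
}

// Returns the reference count left after the scoped pointer is destroyed.
inline int DemoScopedRelease() {
    FakeUnknown obj;                   // refs == 1
    {
        ScopedRelease<FakeUnknown> p;
        FakeCreate(obj, p.put());      // refs == 2 while p is alive
        if (!p) return -1;             // early return would still release p
    }                                  // destructor calls Release() here
    return obj.refs;
}
```

With a wrapper like this, each pointer in the listing could be declared at the point of use and an early return on failure would not leak a reference. Whether that is worth the extra type depends on how much of MSHTML the program touches.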

Downloads

None. Source code provided above.
