Lightweight HTML Parsing Using MSHTML
I have a lot of experience in programming low-level MSHTML and I always see questions on how one can use MSHTML to parse HTML and then access elements via the DOM.
Well, here it is. I use IMarkupServices, which is provided by MSHTML. There is no need for an IOleClientSite or any sort of embedding. I think this is just about as light as anyone can get.
In future articles, I will concentrate on reusing MSHTML in other aspects of programming, such as using MSHTML as an editor.
This code makes use of simple COM calls and nothing more. It can be easily adapted for ATL, MFC and VB, among others. Please don't ask me to provide samples in other languages. In order to build this, you need the IE SDK.
/******************************************************************
 * ParseHTML.cpp
 *
 * ParseHTML: Lightweight UI-less HTML parser using MSHTML
 *
 * Note: This is for accessing the DOM only. No image download,
 * script execution, etc...
 *
 * 8 June 2001 - Asher Kobin (asherk@pobox.com)
 *
 * THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
 * OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
 * LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
 * FITNESS FOR A PARTICULAR PURPOSE.
 *
 *******************************************************************/

#include <windows.h>
#include <mshtml.h>

OLECHAR szHTML[] = OLESTR("<HTML><BODY>Hello World!</BODY></HTML>");

int __stdcall WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPSTR lpCmdLine, int nShowCmd)
{
    IHTMLDocument2 *pDoc = NULL;

    CoInitialize(NULL);

    // Create a bare document object; no hosting or client site is needed.
    CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
                     IID_IHTMLDocument2, (LPVOID *) &pDoc);

    if (pDoc)
    {
        IPersistStreamInit *pPersist = NULL;

        pDoc->QueryInterface(IID_IPersistStreamInit, (LPVOID *) &pPersist);

        if (pPersist)
        {
            IMarkupServices *pMS = NULL;

            // Initialize the document before asking for markup services.
            pPersist->InitNew();
            pPersist->Release();

            pDoc->QueryInterface(IID_IMarkupServices, (LPVOID *) &pMS);

            if (pMS)
            {
                IMarkupContainer *pMC = NULL;
                IMarkupPointer *pMkStart = NULL;
                IMarkupPointer *pMkFinish = NULL;

                pMS->CreateMarkupPointer(&pMkStart);
                pMS->CreateMarkupPointer(&pMkFinish);

                // Parse the string into a new markup container.
                pMS->ParseString(szHTML, 0, &pMC, pMkStart, pMkFinish);

                if (pMC)
                {
                    IHTMLDocument2 *pNewDoc = NULL;

                    // The container exposes the parsed markup as a document.
                    pMC->QueryInterface(IID_IHTMLDocument2, (LPVOID *) &pNewDoc);

                    if (pNewDoc)
                    {
                        // do anything with pNewDoc, in this case
                        // get the body innerText.
                        IHTMLElement *pBody = NULL;

                        pNewDoc->get_body(&pBody);

                        if (pBody)
                        {
                            BSTR strText = NULL;

                            pBody->get_innerText(&strText);
                            pBody->Release();

                            SysFreeString(strText);
                        }

                        pNewDoc->Release();
                    }

                    pMC->Release();
                }

                if (pMkStart)
                    pMkStart->Release();

                if (pMkFinish)
                    pMkFinish->Release();

                pMS->Release();
            }
        }

        pDoc->Release();
    }

    CoUninitialize();

    return TRUE;
}

Comments
Larger Sample
Posted by jeffrey@toad.net on 12/17/2006 10:53am
Hi Asher, In typical Microsoft fashion, the big picture is missed. I have not been able to locate quality samples (any samples for that matter) near IMarkupContainer. Do you have a larger nugget? I'm very interested in IHTMLElement (see http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/ifaces/ihtmlelement/element.asp)
Jeff
Marvelous!
Posted by ted_b on 05/23/2006 05:19pm
Thank you very much for this great example -- it saved me so much time!!!
On a related subject...IHTMLDOMNode
Posted by Legacy on 12/18/2003 12:00am
Originally posted by: David Rigby
I'm having problems using IHTMLDOMNode in my project. All references to this interface produce 'use of undefined type' errors, even though the definition of the interface is there in the TLH/TLI generated from the import of MSHTML. I can use other interfaces defined in MSHTML (and appearing in the TLH in an identical way), but it seems that none of the HTMLDOMxxx ones are usable (IHTMLDOMAttribute also produces the same error). I'm sure this is something simple but I really can't see it! Any clues appreciated.
Thanks
David
Delphi Version
Posted by Legacy on 11/16/2003 12:00am
Originally posted by: Joe James
Delphi Version ( Simpler Technique + Extras )
Posted by Legacy on 11/16/2003 12:00am
Originally posted by: Joe James
$ seems to be reserved to indicate a template
Posted by Legacy on 03/19/2003 12:00am
Originally posted by: Peter
The $ character seems to be used to indicate that the item is a template.
error LNK2001
Posted by Legacy on 02/13/2003 12:00am
Originally posted by: genie
I created a ParseHTML.cpp file by copying the source from this site.
The result is "LIBCD.lib(crt0.obj) : error LNK2001: unresolved external symbol _main"
What is wrong?
Another bug with ParseString
Posted by Legacy on 10/08/2002 12:00am
Originally posted by: Simon
When I created an IMarkupContainer using ParseString, the <BODY> element magically lost all its attributes.
Now I'm using the open, write and close methods to write into an IHTMLDocument2 first, and it works.
0x80004002 "No such interface supported" Error
Posted by Legacy on 10/06/2002 12:00am
Originally posted by: Simon
When I query IHTMLDocument2 to get IMarkupInterfaces I get the 0x80004002 error result. Any clues as to what I am doing wrong? I'm using the same code as the sample above in VC++ 6.0 with the Platform and IE6 SDK. I'm developing a COM server (in a multithreaded apartment).
Thanks for any clues in advance.
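For reference, 0x80004002 is the standard COM HRESULT that winerror.h names E_NOINTERFACE, which is what QueryInterface returns when the object does not expose the requested interface. The sketch below only shows how such a value breaks down into its severity, facility and code fields; it does not diagnose why the query fails here. The helper names (hr_failed, hr_facility, hr_code, kNoInterface) are made up for illustration and are not part of any SDK. The block uses plain integers so it builds without the Windows headers; in a real project the FAILED, HRESULT_FACILITY and HRESULT_CODE macros from winerror.h do this job.

```cpp
#include <cstdint>

// Documented layout of a 32-bit HRESULT (assumed here):
//   bit 31      severity (1 = failure, 0 = success)
//   bits 16-26  facility (which subsystem raised the error)
//   bits 0-15   code     (the error number within that facility)
// All names below are hypothetical helpers, not SDK functions.
constexpr bool hr_failed(std::uint32_t hr)
{
    return (hr >> 31) != 0;
}

constexpr std::uint32_t hr_facility(std::uint32_t hr)
{
    return (hr >> 16) & 0x7FFu;
}

constexpr std::uint32_t hr_code(std::uint32_t hr)
{
    return hr & 0xFFFFu;
}

// The value reported in the comment; winerror.h calls it E_NOINTERFACE.
constexpr std::uint32_t kNoInterface = 0x80004002u;

// Compile-time checks of the decoded fields.
static_assert(hr_failed(kNoInterface), "severity bit is set");
static_assert(hr_facility(kNoInterface) == 0, "facility 0 (FACILITY_NULL)");
static_assert(hr_code(kNoInterface) == 0x4002, "code 0x4002");
```

Since the sample in the article ignores every return value, checking each QueryInterface result with FAILED() and logging the HRESULT would at least show which call is the one that fails.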