Parsing HTML Documents with the HTML Agility Pack

Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.

The .NET Framework offers several classes for accessing data from a remote website, including the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf and String.Substring, or on regular expressions.
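As a rough illustration of the string-parsing approach described above, the following sketch downloads a page with WebClient and pulls out its title using String.IndexOf and String.Substring. The class name, the ExtractTitle helper, and the example.com URL are placeholders invented for this example, not code from the tutorial.

```csharp
using System;
using System.Net;

class StringParsingSketch
{
    // Hypothetical helper: returns the text between <title> and </title>,
    // or null if either tag is missing.
    public static string ExtractTitle(string html)
    {
        const string openTag = "<title>";
        const string closeTag = "</title>";

        int start = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += openTag.Length;

        int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start).Trim();
    }

    static void Main()
    {
        // Placeholder URL; substitute the page you actually want to fetch.
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.example.com/");
            Console.WriteLine(ExtractTitle(html) ?? "(no title found)");
        }
    }
}
```

Even this small case shows how brittle the approach is: a title tag that carries attributes, or markup that is laid out differently than expected, would need additional special-case code.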

Another option for parsing HTML documents is to use the HTML Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The HTML Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the HTML Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The HTML Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the HTML Agility Pack.)
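A minimal sketch of these ideas follows, assuming the HtmlAgilityPack NuGet package is referenced. The markup, the class name, and the ExtractLinkTargets helper are made up for illustration. The sketch parses an in-memory string so that it runs without network access; to download a live page instead, new HtmlWeb().Load(url) returns an HtmlDocument that can be queried the same way.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class AgilityPackSketch
{
    // Hypothetical helper: returns the href value of every <a> element that has one.
    public static List<string> ExtractLinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var targets = new List<string>();

        // XPath: all <a> elements with an href attribute.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
                targets.Add(link.GetAttributeValue("href", ""));
        }
        return targets;
    }

    static void Main()
    {
        // Made-up markup for the example.
        string html =
            "<html><body>" +
            "<ul id='products'><li>Widget</li><li>Gadget</li></ul>" +
            "<a href='/about'>About</a> <a name='top'>No href here</a>" +
            "</body></html>";

        foreach (string target in ExtractLinkTargets(html))
            Console.WriteLine("Link target: " + target);

        // Walking the DOM: move from a node to its children, and from a child back to its parent.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode list = doc.DocumentNode.SelectSingleNode("//ul[@id='products']");
        if (list != null)
        {
            foreach (HtmlNode child in list.ChildNodes)
            {
                if (child.NodeType == HtmlNodeType.Element)
                    Console.WriteLine("{0}: {1} (parent: {2})",
                        child.Name, child.InnerText, child.ParentNode.Name);
            }
        }
    }
}
```

The null check after SelectNodes matters: iterating the result directly would throw a NullReferenceException on a page with no matching nodes.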

This ASP.NET tutorial shows how to get started using the HTML Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this tutorial. To read the entire article, Parsing HTML Documents with the HTML Agility Pack, follow the link.

About the Author

Scott Mitchell

Scott Mitchell is the Editor, founder, and primary contributor to 4GuysFromRolla.com. In addition to founding 4GuysFromRolla.com, Scott also created ASPFAQs.com and ASPMessageboard.com. He works as a freelance writer, trainer, and consultant and resides in California.

Comments

  • Question : Turkey www.betskor.com

    Posted by arif on 02/19/2013 01:47pm

    Hello. The document I fetch with HtmlAgilityPack contains links. With the code below I can make them open in a separate page, but because those pages do not exist I get an error, and the page that opens shows an error as well. What do you think I need to add to the code below so that I can check the extension?

        // Select every <a> element that has an href attribute.
        var linksThatDoNotOpenInNewWindow = kaynak_burada.DocumentNode.SelectNodes("//a[@href]");
        if (linksThatDoNotOpenInNewWindow != null)
        {
            foreach (var link in linksThatDoNotOpenInNewWindow)
            {
                // Add target="_blank" if it is missing; otherwise overwrite the existing value.
                if (link.Attributes["target"] == null)
                    link.Attributes.Add("target", "_blank");
                else
                    link.Attributes["target"].Value = "_blank";
            }
        }
