Parsing HTML Documents with the HTML Agility Pack | CodeGuru

Written By
CodeGuru Staff
Jan 12, 2011

Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a
price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers
are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser – namely, by making an HTTP
request from code and then parsing and analyzing the returned HTML.

The .NET Framework offers classes for accessing data from a remote website, most notably the
WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a
remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead,
developers commonly fall back on string parsing methods such as String.IndexOf and String.Substring, or
on regular expressions.
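
The string-parsing approach described above might look like the following minimal sketch. The class name TitleScraper and both method names are hypothetical, chosen only for illustration; the sample markup is invented. DownloadMarkup shows how WebClient could supply the HTML, but the demo in Main parses an inline string so that it runs without network access.

```csharp
using System;
using System.Net;

public static class TitleScraper
{
    // Hypothetical helper: pulls down the raw markup for a URL with WebClient.
    // Newer .NET versions flag WebClient as obsolete; it is used here only
    // because it is one of the classes the text names.
    public static string DownloadMarkup(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Returns the text between <title> and </title>, or null if either tag is missing.
    public static string ExtractTitle(string html)
    {
        const string openTag = "<title>";
        const string closeTag = "</title>";

        int start = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += openTag.Length;

        // Search for the closing tag only after the opening tag.
        int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start).Trim();
    }

    public static void Main()
    {
        // Inline sample markup (an assumption for the demo, not a real page).
        string html = "<html><head><title> Example Store </title></head><body></body></html>";
        Console.WriteLine(ExtractTitle(html));
    }
}
```

Even this small example hints at why hand-rolled parsing is fragile: a title tag written with an attribute, such as one carrying a lang value, would no longer match the literal search string.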

Another option for parsing HTML documents is to use the HTML Agility Pack, a free, open-source library designed to
simplify reading from and writing to HTML documents. The HTML Agility Pack constructs a Document Object Model (DOM)
view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a
node to its children, or vice versa. Also, the HTML Agility Pack can return specific nodes in the DOM through the use
of XPath expressions. (The HTML Agility Pack also includes a class for downloading an HTML document from a remote
website; this means you can both download and parse an external web page using the HTML Agility Pack.)
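
As a rough illustration of the DOM and XPath features just described, the sketch below loads markup into an HtmlDocument, selects every anchor that has an href attribute, and steps from each anchor up to its parent node. It assumes the HtmlAgilityPack NuGet package is referenced; the class name LinkLister, the method name DescribeLinks, and the sample markup are hypothetical and not taken from the full tutorial.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class LinkLister
{
    // Returns one "text -> href (inside <parent>)" line per anchor in the markup.
    public static List<string> DescribeLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // XPath: every <a> element, at any depth, that carries an href attribute.
        // SelectNodes returns null rather than an empty collection when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return lines;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            // ParentNode walks from a node up to its parent in the DOM.
            string parent = anchor.ParentNode.Name;
            lines.Add(anchor.InnerText.Trim() + " -> " + href + " (inside <" + parent + ">)");
        }
        return lines;
    }

    public static void Main()
    {
        // Inline sample markup (an assumption for the demo). To fetch a live page
        // instead, the library's downloading class can be used:
        //   HtmlDocument doc = new HtmlWeb().Load("http://example.com/");
        string html = "<ul><li><a href=\"/a\">First</a></li><li><a href=\"/b\">Second</a></li></ul>";

        foreach (string line in DescribeLinks(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Compared with IndexOf and Substring, the XPath query keeps working when attributes are reordered or whitespace changes, because it matches the parsed structure rather than the raw characters.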

This ASP.NET tutorial shows how to get started using the HTML Agility Pack and includes a number of real-world examples that illustrate this library’s utility. A complete, working
demo is available for download at the end of this tutorial. To read the entire article, Parsing HTML Documents with the HTML Agility Pack, follow the link.
