ASP.NET: Post Data Programmatically with "Webscraping"

Screen-scraping was a popular method for slowly converting mainframe applications into applications that would run on PCs. The application would connect to the mainframe, read data from the screen, and re-display it in a Windows-based application. Data entered into the Windows application would then be transmitted back to the mainframe.

If you have a Web-based application that doesn't support Web services, you can do Web-based screen scraping using the HttpWebRequest and HttpWebResponse classes covered in a previous tip. The example in this tip posts a query to the Weather Channel's site and extracts the current temperature from the HTML it sends back. Here's the code you can put into a Web page for testing:

using System;
using System.Data;
using System.Configuration;
using System.Collections;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.Net;
using System.IO;

public partial class PostingData : System.Web.UI.Page
{
  protected void Page_Load(object sender, EventArgs e)
  {
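    // Form data in POST format: name=value pairs separated by "&".
    string outputBuffer = "where=46038";
    byte[] postBytes = System.Text.Encoding.UTF8.GetBytes(outputBuffer);

    // The URL must be a single string literal; it cannot span lines.
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(
      "http://www.weather.com/search/enhanced");
    req.Method = "POST";
    req.ContentType = "application/x-www-form-urlencoded";
    // ContentLength is the number of bytes in the body, not characters.
    req.ContentLength = postBytes.Length;

    // Write the form data into the body of the request.
    using (Stream reqStream = req.GetRequestStream())
    {
      reqStream.Write(postBytes, 0, postBytes.Length);
    }

    // Read the whole response into a string; the using blocks close
    // the response and its stream even if an exception is thrown.
    string buffer;
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
    {
      buffer = sr.ReadToEnd();
    }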

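    // The temperature sits between this opening tag and the next </B>.
    string startTag = "<B CLASS=obsTempTextA>";
    string endTag = "</B>";

    // Ordinal comparison avoids culture-specific casing rules for markup.
    int start = buffer.IndexOf(startTag,
      StringComparison.OrdinalIgnoreCase);
    int end = (start < 0) ? -1 : buffer.IndexOf(endTag,
      start + startTag.Length, StringComparison.OrdinalIgnoreCase);

    if (end < 0)
    {
      // IndexOf returns -1 when the markup is missing or has changed.
      Response.Write("Could not find the current temperature.");
    }
    else
    {
      int valueStart = start + startTag.Length;
      Response.Write("Current temperature in ZIP 46038: "
        + buffer.Substring(valueStart, end - valueStart));
    }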

    // Response.Write(Server.HtmlEncode(buffer));
  }
}

The code starts by creating an HttpWebRequest to weather.com's search URL, which I found by looking at the search form on its home page. Part of the "fun" of webscraping is figuring out everything that has to be sent in a post to get back valid results. In this case, you have to send only one field, named where, containing the ZIP code you want to query. That information is stored in the outputBuffer variable in POST format: each field is written as name=value, and the pairs are separated by ampersands, similar to what you would see in a query string.
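For a form with more than one field, the POST format described above can be built by a small helper. The following is only a sketch and is not part of the original example: the FormEncoder class and its Encode method are hypothetical names, and it assumes that Uri.EscapeDataString is an acceptable way to escape the names and values (it writes a space as %20 rather than +).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not from the original tip): builds a POST body
// in application/x-www-form-urlencoded format.
public static class FormEncoder
{
  public static string Encode(
    IEnumerable<KeyValuePair<string, string>> fields)
  {
    StringBuilder body = new StringBuilder();
    foreach (KeyValuePair<string, string> field in fields)
    {
      // Pairs are joined with "&"; a name and its value with "=".
      if (body.Length > 0)
      {
        body.Append('&');
      }
      // Escape reserved characters so a value containing "&" or "="
      // cannot be mistaken for a separator.
      body.Append(Uri.EscapeDataString(field.Key));
      body.Append('=');
      body.Append(Uri.EscapeDataString(field.Value));
    }
    return body.ToString();
  }
}
```

Passing a single pair with the name where and the value 46038 produces the same where=46038 string that the example hard-codes in outputBuffer.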

Next, the example populates the request with the post information and then requests the response, which has the effect of sending the data to the remote server. It retrieves the information into a string buffer and closes up the response stream.

This, unfortunately, is the tedious part of webscraping: you have to find the information you want in the response buffer. For this page, the response (which you can dump to the page by uncommenting the last line of the code) is 224KB of HTML to search through. However, the data you want sits between a pair of tags that is reasonably easy to find. Using some simple string manipulation, you can extract the value and show it on the screen.
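If you scrape more than one value, the string manipulation can be pulled out into a reusable method. This is a minimal sketch rather than part of the original example: the TagScraper class and its ExtractBetween method are hypothetical names, and returning null when a marker is missing is one possible design, chosen here so the caller can tell that the page layout has changed.

```csharp
using System;

// Hypothetical helper (not from the original tip): returns the text
// between the first startTag and the next endTag, or null if either
// marker cannot be found.
public static class TagScraper
{
  public static string ExtractBetween(
    string html, string startTag, string endTag)
  {
    int start = html.IndexOf(startTag,
      StringComparison.OrdinalIgnoreCase);
    if (start < 0)
    {
      return null; // opening marker not found
    }

    // Begin the search for the end marker after the start marker.
    int valueStart = start + startTag.Length;
    int end = html.IndexOf(endTag, valueStart,
      StringComparison.OrdinalIgnoreCase);
    if (end < 0)
    {
      return null; // closing marker not found
    }

    return html.Substring(valueStart, end - valueStart);
  }
}
```

With the variables from the example, the call would be TagScraper.ExtractBetween(buffer, startTag, endTag), followed by a null check before the result is written to the page.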

As you might guess, this is fairly "fragile" code. If the Weather Channel decides to change its page or the tag you're looking for, the code will fail to find the information it needs. That's one of the major reasons why Web services have become popular. The Weather Channel's page is designed for humans to read, not computers. The Web services that handle weather, on the other hand, send back only the relevant content and not all the formatting found in the page. However, if you don't have another option, webscraping can be a handy tool.

About the Author

Eric Smith is the owner of Northstar Computer Systems, a Web-hosting company based in Indianapolis, Indiana. He is also an MCT and MCSD who has been developing with .NET since 2001. In addition, he has written or contributed to 12 books covering .NET, ASP, and Visual Basic. Send him your questions and feedback via e-mail at questions@techniquescentral.com.



Comments

  • How to *stay* on the destination page?

    Posted by inbugable on 01/08/2007 06:19pm

    What would be the procedure if I wanted to display the content of the destination page in my browser (like a redirect, but with form data)? Since I have a lot of form data, I don't want to use GET with a querystring, but rather a POST with a simulated form.
