Web scraping is a term that is becoming increasingly popular in the development world. It could be because developers always try to make things more convenient for users. At first, I wasn't a big fan of scraping because it can be used to obtain data that was never intended for the user. Today you will create a program to scrape text from a website.
Here is a nice definition of Web Scraping
Figure 1 - Our Design
Not much coding. Be warned, though: that doesn't mean that the code you will learn today will be easy. Nothing worthwhile is ever easy. Now, you're not here for my life lessons or my philosophy. Let's get the show on the road.
Add the following Imports above your Form's class definition:
Imports System.Text
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions
The System.Net namespace provides classes such as WebResponse, which represents the response a server sends back to the calling application, and HttpWebRequest, which lets you build and send a request over HTTP. HTTP stands for Hypertext Transfer Protocol. It is the protocol that browsers and web servers use to exchange web pages and other resources. There are other protocols too, such as FTP (File Transfer Protocol), but that is a topic for another day.
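To make the relationship between these classes a little more concrete, here is a minimal sketch. It assumes a console project, and the RequestDemo module and IsHttpRequest function are hypothetical names used only for illustration. It builds a request object without sending anything over the network and checks which class sits behind it. Note that newer .NET versions flag WebRequest as obsolete in favor of HttpClient, so expect a compiler warning there.

```vb
Imports System
Imports System.Net

Module RequestDemo
    ' Hypothetical helper: creates a request for the URL (nothing is sent yet)
    ' and reports whether the object behind it is an HttpWebRequest.
    Function IsHttpRequest(url As String) As Boolean
        Dim request As WebRequest = WebRequest.Create(url)
        Return TypeOf request Is HttpWebRequest
    End Function

    Sub Main()
        ' For an http:// address, Create hands back an HttpWebRequest.
        Console.WriteLine(IsHttpRequest("http://codeguru.com"))
    End Sub
End Module
```

The request only goes out once GetResponse is called on it, which is why this sketch runs without a network connection.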
The System.IO namespace will give you the StreamReader class, which you can use to read text from any stream of data, such as the body of a web response.
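A StreamReader does not care where its stream comes from. The following sketch, assuming a console project and using the hypothetical names StreamDemo and ReadAll, reads HTML from an in-memory stream in exactly the way you would read it from a web response. The UTF-8 encoding is an assumption. A real page may declare a different one.

```vb
Imports System
Imports System.IO
Imports System.Text

Module StreamDemo
    ' Hypothetical helper: reads every character from a readable stream as UTF-8 text.
    ' Disposing the reader also closes the underlying stream.
    Function ReadAll(source As Stream) As String
        Using reader As New StreamReader(source, Encoding.UTF8)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' A MemoryStream stands in for the stream a web response would provide.
        Dim bytes As Byte() = Encoding.UTF8.GetBytes("<html><body>Hello</body></html>")
        Using ms As New MemoryStream(bytes)
            Console.WriteLine(ReadAll(ms))
        End Using
    End Sub
End Module
```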
Let's put all of these technologies together. Add the following sub to your application:
Private Sub Scrape()
    Try
        Dim strURL As String = "http://codeguru.com"
        Dim strOutput As String = ""
        Dim wrRequest As WebRequest = WebRequest.Create(strURL)

        txtScrape.Text = "Extracting..." & Environment.NewLine

        ' Send the request. The Using blocks close and clean up
        ' the response and the StreamReader automatically
        Using wrResponse As WebResponse = wrRequest.GetResponse()
            Using sr As New StreamReader(wrResponse.GetResponseStream())
                strOutput = sr.ReadToEnd()
            End Using
        End Using

        txtScrape.Text = strOutput ' Raw HTML

        ' Formatting Techniques
        Dim roBlock As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase

        ' Remove Script Tags and their contents
        ' (must run before the general tag removal, or the code inside stays behind as text)
        strOutput = Regex.Replace(strOutput, "<script.*?</script>", "", roBlock)
        ' Remove Stylesheets
        strOutput = Regex.Replace(strOutput, "<style.*?</style>", "", roBlock)
        ' Remove HTML Comments
        strOutput = Regex.Replace(strOutput, "<!--.*?-->", "", RegexOptions.Singleline)
        ' Remove Doctype ( HTML 5 )
        strOutput = Regex.Replace(strOutput, "<![^>]*>", "")
        ' Remove HTML Tags, in upper or lower case
        strOutput = Regex.Replace(strOutput, "</?[a-z][a-z0-9]*[^<>]*>", "", RegexOptions.IgnoreCase)

        txtFormatted.Text = strOutput ' Write Formatted Output To Separate TB
    Catch ex As Exception
        ' Show the error in the form rather than on the console
        MessageBox.Show(ex.Message, "Error")
    End Try
End Sub
Let me break my logic down for you. I created a string to hold the URL (Uniform Resource Locator; in layman's terms, a web address) from which I will be scraping text. In this case it is Codeguru.com. Next, I declared a WebResponse object and created a WebRequest object for that URL. For an http address, Create hands back an HttpWebRequest. Calling GetResponse on the request sends it to the server and returns the server's reply as a WebResponse. A StreamReader then reads the response stream into a string, which at this point holds the page's raw HTML. Now, the tricky part...
Once we have the text, we need to format it appropriately. This is where Regular Expressions come in. If you haven't heard of Regular Expressions before, have a look through this article of mine. Regular Expressions make it easy to find and replace text that matches a pattern. Here I had to account for the doctype, HTML tags, HTML comments, and possible script and CSS style blocks.
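To see the clean-up on its own, without a form or a network connection, here is a small sketch under a few assumptions: a console project, a hypothetical HtmlCleaner module with a StripHtml function, and a made-up sample page. It removes script and style blocks first, because stripping only their tags would leave the JavaScript and CSS behind as text. The final whitespace collapse is an extra step not found in the Scrape sub.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HtmlCleaner
    ' Hypothetical helper: strips markup from an HTML string with regular expressions.
    Function StripHtml(html As String) As String
        Dim blockOptions As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        Dim result As String = html
        ' 1. Script and style blocks go first, together with their contents.
        result = Regex.Replace(result, "<script.*?</script>", "", blockOptions)
        result = Regex.Replace(result, "<style.*?</style>", "", blockOptions)
        ' 2. Comments, which may span several lines.
        result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline)
        ' 3. The doctype declaration.
        result = Regex.Replace(result, "<![^>]*>", "")
        ' 4. Any remaining opening or closing tags, in upper or lower case.
        result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "", RegexOptions.IgnoreCase)
        ' 5. Collapse the whitespace left behind (an addition for readability).
        Return Regex.Replace(result, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim sample As String = "<!DOCTYPE html><html><head><style>p { color: red; }</style>" &
            "<script>var x = 1;</script></head><body><!-- note --><P>Hello <b>world</b></P></body></html>"
        Console.WriteLine(StripHtml(sample)) ' prints: Hello world
    End Sub
End Module
```

Regular expressions are good enough for a quick text dump like this one, but they are not a full HTML parser. Unusual markup, such as a ">" inside an attribute value, can still slip through.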
To finish up, you need to call the Scrape sub. Add this code now:
Private Sub btnExtract_Click(sender As Object, e As EventArgs) Handles btnExtract.Click
    Scrape() 'Scrape Text From URL
End Sub
Very interesting stuff indeed! As you can see, it is very easy to scrape text from websites. All you need is a basic understanding of HTML and VB.NET. If you are interested in downloading images from websites, you can have a look here. Until next time, cheers!