Creating a Web Text Scraper with Visual Basic


Web scraping is a term that is becoming increasingly popular in the development world. It could be because developers always try to make things more and more convenient for users. At first, I wasn't a big fan of scraping because it can be used to obtain data that was never intended for the user. Today you will create a program to scrape text from a website.

Web Scraping

Web scraping is the automated retrieval of a web page and the extraction of information from it.

Our Project

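Open Visual Studio 2012, and create a VB.NET Windows Forms project. Name it anything you like and design it as shown in Figure 1. The code in this article expects a button named btnExtract and two text boxes named txtScrape and txtFormatted.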

Figure 1 - Our Design


There is not much coding. Be warned, though: that doesn't mean the code you will learn today will be easy. Nothing worthwhile is ever easy. Now, you're not here to learn about my life lessons or my philosophy. Let's get the show on the road.

Add the following Imports above your Form's class definition:

Imports System.Text
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions

You will use the System.Text and System.Text.RegularExpressions namespaces to manipulate the HTML tags of the webpage being read. HTML tags have a specific format: each one is enclosed in angle brackets (< and >). Apart from the ordinary HTML tags, a page may also contain embedded scripts (JavaScript or VBScript) as well as CSS (Cascading Style Sheets, which control the formatting of web pages). Server-side languages such as ASP and PHP run on the server, so the page you download contains only their output. We need to take all of these factors into consideration when dealing with web pages.
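To make the idea of matching tags concrete, here is a minimal sketch that strips the tags from a short string. The sample markup and the module name TagDemo are made up for illustration; the pattern is the tag pattern used in the Scrape sub later in this article, with RegexOptions.IgnoreCase added so that upper-case tags also match.

```vb
Imports System.Text.RegularExpressions

Module TagDemo
    Sub Main()
        ' Hypothetical sample markup, not taken from any real page
        Dim html As String = "<p>Hello, <b>world</b>!</p>"

        ' Matches an opening or closing tag: "<", optional "/", a tag name,
        ' then anything up to the closing ">"
        Dim text As String = Regex.Replace(html, "</?[a-z][a-z0-9]*[^<>]*>", "", RegexOptions.IgnoreCase)

        Console.WriteLine(text) ' prints: Hello, world!
    End Sub
End Module
```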

The System.Net namespace provides objects such as WebResponse, which represents the response a server returns to the calling application, and HttpWebRequest, which lets us send a request over HTTP. HTTP stands for Hypertext Transfer Protocol; it is the protocol that browsers and web servers use to exchange web pages and related resources. There are other protocols, such as FTP (File Transfer Protocol), but that is a topic for another day.

The System.IO namespace will give you the StreamReader object, which you can use to read any stream of data.
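As a small illustration of StreamReader on its own, the sketch below reads text from a MemoryStream, which stands in for the network stream used later. The sample bytes and the module name StreamReaderDemo are assumptions made for this example only.

```vb
Imports System.IO
Imports System.Text

Module StreamReaderDemo
    Sub Main()
        ' Hypothetical stand-in for a network stream: UTF-8 bytes held in memory
        Dim bytes As Byte() = Encoding.UTF8.GetBytes("<html><body>Hi</body></html>")

        Using ms As New MemoryStream(bytes)
            Using sr As New StreamReader(ms, Encoding.UTF8)
                ' ReadToEnd returns everything from the current position to the end
                Dim content As String = sr.ReadToEnd()
                Console.WriteLine(content)
            End Using
        End Using
    End Sub
End Module
```

The Using blocks close the reader and the stream automatically, even if an exception is thrown while reading.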

Let's put all of these technologies together. Add the following sub to your application:

    Private Sub Scrape()

        Try

            ' Address of the page to scrape; fill this in before running
            Dim strURL As String = ""

            Dim strOutput As String = ""

            Dim wrRequest As WebRequest = HttpWebRequest.Create(strURL)

            txtScrape.Text = "Extracting..." & Environment.NewLine

            ' GetResponse sends the request; Using cleans up the response and the StreamReader
            Using wrResponse As WebResponse = wrRequest.GetResponse()
                Using sr As New StreamReader(wrResponse.GetResponseStream())
                    strOutput = sr.ReadToEnd()
                End Using
            End Using

            txtScrape.Text = strOutput 'Write Raw HTML To First TB

            'Formatting Techniques

            Dim opts As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase

            ' Remove Script Tags (first, while the tags still mark the script body)
            strOutput = Regex.Replace(strOutput, "<script.*?</script>", "", opts)

            ' Remove Stylesheets
            strOutput = Regex.Replace(strOutput, "<style.*?</style>", "", opts)

            ' Remove HTML Comments (before the Doctype, because both start with <!)
            strOutput = Regex.Replace(strOutput, "<!--.*?-->", "", opts)

            ' Remove Doctype ( HTML 5 )
            strOutput = Regex.Replace(strOutput, "<![^>]*>", "", opts)

            ' Remove HTML Tags
            strOutput = Regex.Replace(strOutput, "</?[a-z][a-z0-9]*[^<>]*>", "", opts)

            txtFormatted.Text = strOutput 'write Formatted Output To Separate TB

        Catch ex As Exception

            MessageBox.Show(ex.Message, "Error")

        End Try

    End Sub

Let me break my logic down for you. I created a string object to hold the URL (Uniform Resource Locator; in layman's terms, a web address) from which I will be scraping text. Next I created a WebResponse object and an HttpWebRequest object. The HttpWebRequest object represents a request to the specified URL. Calling its GetResponse method sends the request and returns the WebResponse object, and a StreamReader then reads the text of the page from the response stream. Now, the tricky part...
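The request and response steps can also be isolated in a small helper, which keeps them separate from the form's controls. This is only a sketch: FetchHtml and FetchDemo are hypothetical names, and example.com is a placeholder address, not the one this article scraped.

```vb
Imports System.IO
Imports System.Net

Module FetchDemo
    ' Hypothetical helper: downloads the body of a page as a string
    Function FetchHtml(url As String) As String
        Dim request As WebRequest = WebRequest.Create(url)

        ' GetResponse sends the request and blocks until the server answers
        Using response As WebResponse = request.GetResponse()
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        ' Placeholder address; replace it with the page you want to read
        Dim html As String = FetchHtml("http://example.com/")
        Console.WriteLine(html.Length)
    End Sub
End Module
```

Because GetResponse blocks, calling a helper like this from a button click freezes the form until the download finishes. That is acceptable for a small demo, but it is worth keeping in mind for slow sites.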

Once we have the text, we need to format it appropriately. This is where Regular Expressions come in. If you haven't heard of Regular Expressions before, have a look through this article of mine. Regular Expressions make it easy to find and replace strings that follow a pattern. Here I also had to compensate for the HTML tags, HTML comments, and possible script and CSS style tags.
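The clean-up can be tried on a small string without touching the network. The sketch below is one possible arrangement, not the only one: it removes script and style blocks before individual tags, since once the tags are gone the script and style patterns have nothing left to match. StripHtml, CleanDemo and the sample markup are hypothetical. The last two steps (decoding entities with WebUtility.HtmlDecode and collapsing whitespace) are additions that the Scrape sub does not perform.

```vb
Imports System.Net
Imports System.Text.RegularExpressions

Module CleanDemo
    ' Hypothetical helper that strips markup from an HTML string
    Function StripHtml(html As String) As String
        Dim opts As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase

        ' Whole script and style blocks go first, while their tags still exist
        html = Regex.Replace(html, "<script.*?</script>", "", opts)
        html = Regex.Replace(html, "<style.*?</style>", "", opts)

        ' Comments before the doctype, because both start with "<!"
        html = Regex.Replace(html, "<!--.*?-->", "", opts)
        html = Regex.Replace(html, "<![^>]*>", "", opts)

        ' Remaining opening and closing tags
        html = Regex.Replace(html, "</?[a-z][a-z0-9]*[^<>]*>", "", opts)

        ' Turn entities such as &amp; back into characters
        html = WebUtility.HtmlDecode(html)

        ' Collapse runs of whitespace left behind by the removed markup
        Return Regex.Replace(html, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Made-up sample page containing a style block, a script and a comment
        Dim sample As String =
            "<!DOCTYPE html><html><head><style>p { color: red; }</style>" &
            "<script>var x = ""<b>"";</script></head>" &
            "<body><!-- note --><p>Fish &amp; chips</p></body></html>"

        Console.WriteLine(StripHtml(sample)) ' prints: Fish & chips
    End Sub
End Module
```

Regular expressions are good enough for quick text extraction like this, but they do not understand nesting, so unusual markup can still slip through.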

To finish up, you need to call the Scrape sub. Add this code now:

    Private Sub btnExtract_Click(sender As Object, e As EventArgs) Handles btnExtract.Click

        Scrape() 'Scrape Text From URL

    End Sub


Very interesting stuff indeed! As you can see, it is very easy to scrape text from websites. All you need is a basic understanding of HTML and VB.NET. If you are interested in downloading images from websites, you can have a look here. Until next time, cheers!


  • Nice!

    Posted by Cindy on 03/29/2017 06:57am

    Never having done this before and working very little with VB, I was able to take your example modify and pull data from HTML EDI response files. Thank you!

  • More help if possible

    Posted by Paul K on 12/01/2016 03:28pm

    How do I now pull in specific data from a website where the data is wrapped in CSS like this" [removed] customer_request = { ticket_id: "AB123456789", create_date:ConvertUTC("1479244195"), How do I get that data into a variable within VBA? Any help would be appreciated

  • newbie-ish

    Posted by Paul K on 11/28/2016 02:49pm

    I am trying to pull in data from a page and the data I am looking for is wrapped in CSS like this: indicator: { city: "City", state: "State"}, contact: { contact_name: "Company Name Here", contact_phone: "617555-1212", contact_email: ""}, Any help on how I can go directly after this data? Thanks

  • Thanking you

    Posted by Divya on 02/01/2015 11:11pm

    IT is very nice for a beginner. Thank you.

  • well done

    Posted by Tom on 06/24/2014 06:05am

    Ok, good job. However, it would be great to have web scraper that would work for booking sites. Example, you want to get a query from car hire website. You pass some query data and get the quote back. It looks a bit more complicated, but there must be a way to do that, as there are loads of similar applications that do this.
