Data Mining with the Yahoo Query Language (YQL)

It never ceases to amaze me the things that you can look up using Yahoo Query Language.

I've covered YQL in this column before, where I presented a desktop application that plotted searched geographic locations on a Google map. YQL, however, has way more to cover, and even more beyond that if you use an authenticated connection and a Yahoo account.

Once you have a Yahoo account, you can manage an online contacts book and social presence API as well as take part in the Yahoo Fantasy Sports leagues. For this post, however, we're going to look at YQL's data mining abilities.

If you point your browser at https://developer.yahoo.com/yql/console/, you should get the following default page:

YQL01
Figure 1: Default Yahoo Query Language Console

You'll see straight away that it has a default query in "show tables" that will, as you can see, present you a list of the available data tables that YQL has to offer, and the level of access needed to get them.

At the bottom of the page, you should see this:

YQL02
Figure 2: The Yahoo YQL rest query for the demo page

This HTTP rest query is all you need to call from your favourite language (in this case, .NET/C#) to reproduce the call you see in the YQL console above.

If you copy and paste this into another browser, such as Firefox, you will see what's in Figure 3.

YQL03
Figure 3: The demo query running in Firefox

You'll see immediately that you get just the data and nothing else. The same applies if you use something like the .NET HttpWebClient to request the same URL.

What, however, if you don't want the result in XML?

Well, that's easy to change. Go back to the YQL console and, near the top, you'll see some different controls, one of which allows you to select JSON as the output format. Change the YQL demo to JSON, then click the blue "Test" button, and you'll see the data change to JSON format.

You'll also see that the rest of the URL at the bottom of the page has changed, too. This now has a "format=json" parameter in it to make sure the data comes back as JSON. If you paste the URL into another browser or browser tab again, you'll see this time you get JSON data.

Getting External Data

By now, you should be able to see that you can do some interesting things with YQL. For example, you can get JSON data showing the weather forecast in your area, or search Flickr for content.

If you look at the list of available data stores down the left, however, you'll see the bottom one is called 'data'. If you expand that, you'll see a number of interesting options.

YQL04
Figure 4: Options available under the data category in YQL

What 'data' allows you to do is to specify a remote data source formatted in one of the formats listed, and then use YQL and its SQL-like syntax to extract data from that source. YQL then will format that data into XML or JSON as requested by your YQL rest call and return the data back to you.

As an example, click the 'rss' option, select 'json' as the format, and reload the test. You should see that you get the RSS feed located at http://news.yahoo.com/rss/topstories sent back to you as a JSON object.

YQL05
Figure 5: Yahoo top stories RSS feed in Firefox

YQL06
Figure 6: The same feed passed through YQL and presented as JSON

Immediately, you can see a use for this, and that's grabbing RSS feeds from other sites, then loading them into .NET objects using something like Json.Net.

It doesn't stop there, however. Pop over to http://support.spatialkey.com/spatialkey-sample-csv-data/ and you'll find they have a few useful CSV data sets available on the site. The real estate transactions demo, http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv, leads you directly to a CSV file.

Go back to the YQL console, and select 'csv' from the 'data' section, then replace the URL in the YQL sample query with the url of the CSV above. Click the blue test button and you should get the CSV data presented to you in YQL as a JSON formatted array (or XML if you chose XML).

YQL07
Figure 7: YQL converts our CSV data to JSON

There are other options you can pass to the query, allowing you to name columns, skip the first row, and various other operations. I personally use this often to present JSON data directly from a PostgreSQL data export operation on a URL endpoint, allowing me to send database data directly to HTML without using any middleware. You also could have a CSV file on a legacy server and use YQL to serve it up to newer .NET clients in a more structured way.

Before I present you with a small but useful .NET example however, here's one last data extraction trick. You'll see in the data list there is an HTML entry. Click that to load the demo into the YQL console.

You'll see the now-familiar 'url' parameter, but you should also see a parameter marked 'xpath'. Let's try a little experiment. Open up Wikipedia's main search page, and perform a search query for the term "programming."

YQL08
Figure 8: Wikipedia 'programming search'

Now let's take a look at the source for this page by pressing 'Ctrl+U':

YQL09
Figure 9: Our search showing the source code

If you scroll to about a third of the way down, you'll find a 'p' tag that marks the start of the search results. If you look closely, you'll see that each result is an 'li' tag containing an 'a' tag with the result in it.

YQL10
Figure 10: Wikipedia search result

If you right-click the actual element in the page (not the source view) and 'inspect element' in Chrome, and then right-click it again in the developer tools, you can copy the 'Xpath' for the element.

YQL11
Figure 11: Copying the Xpath of an element

And that should give you the following:

//*[@id="mw-content-text"]/ul[1]/li[1]/a

Switch back to the YQL console, and you should now be able to plug both the Xpath and URL into the query, and get YQL to return the extracted data as JSON data.

YQL12
Figure 12: Our extracted Wikipedia data in YQL

With a little bit of clever Xpath querying, you could get all the results on the page. You also could use it to extract images from a Twitter stream, grab data tables from a HTML formatted page, or grab product data from an eBay page.

For now, however, fire up Visual Studio and start a console mode program. Use the YQL console to create a RSS data feed and point it to the BBC News UK RSS feed.

In your console mode program, make sure program.cs looks as follows:

using System;
using System.Net;

namespace bbcnewsyql
{
   class Program
   {
      static void Main()
      {
         const string yqlQuery =
            @"https://query.yahooapis.com/v1/public/yql?
            q=select%20* %20from%20rss%20where%20url%
            3D'http%3A%2F%2Ffeeds.
            bbci.co.uk%2Fnews%2Fuk%2Frss.xml'
            &format=json& diagnostics=true&callback=";

         WebClient client = new WebClient();
         string jsonData = client.DownloadString(yqlQuery);

         Console.WriteLine(jsonData);
      }

   }
}

Copy the string from the "rest query" box at the bottom of the YQL console into the 'yqlQuery' string variable in the code. When you run your program, you should get something like Figure 13:

YQL13
Figure 13: One big, ugly JSON string produced by the sample code

What you do with the JSON data at this point is entirely up to you.

Want an API for that, or just a little snippet of code to get you on your way? Let me hear your ideas; leave me a comment below telling me what you'd like to see in this column, or send me a tweet to @shawty_ds and I'll write a future post on the subject.



About the Author

Peter Shaw

As an early adopter of IT back in the late 1970s to early 1980s, I started out with a humble little 1KB Sinclair ZX81 home computer. Within a very short space of time, this small 1KB machine became a 16KB Tandy TRS-80, followed by an Acorn Electron and, eventually, after going through many other different machines, a 4MB, ARM-powered Acorn A5000. After leaving school and getting involved with DOS-based PCs, I went on to train in many different disciplines in the computer networking and communications industries. After returning to university in the mid-1990s and gaining a Bachelor of Science in Computing for Industry, I now run my own consulting business in the northeast of England called Digital Solutions Computer Software, Ltd. I advise clients at both a hardware and software level in many different IT disciplines, covering a wide range of domain-specific knowledge—from mobile communications and networks right through to geographic information systems and banking and finance.

Related Articles

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

Most Popular Programming Stories

More for Developers

RSS Feeds

Thanks for your registration, follow us on our social networks to keep up-to-date