Data scraping: Getting gas data from a Web page

21Oct08

The Internet is a bit like a giant database. There’s information and data strewn about, distributed all over the world, and we run queries on the data using tools like Google.

Many different sites put out constantly updated, useful information, like US gas prices. The problem is that they don’t all supply APIs and XML feeds like the New York Times. This means after we’ve tracked down our data source, we need to find a way to pry that information out of the tangled mess of the HTML document.

There are a few ways to do this, and I’m going to guide you through one that people might call messy and error-prone, but for sites like the AAA gas prices list it works well. Once we’ve untangled this data from the HTML, we have access to some nice evergreen content for our site without ever having to lift a finger (until, that is, they change their site layout). This method is known as “screen scraping,” because it relies on the way the data is structured and displayed on the page.

Oh, and did I forget to mention? We’re doing it in Perl. The code first:


#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

sub download_n_strip {
	my $mech = WWW::Mechanize->new();
	my $url = $_[0];
	my $aggro;

	$mech->get($url);

	my $dump = $mech->content();
	my @dumparr = split(/\n/, $dump);
	my $line = "";

	foreach $line (@dumparr) {
		if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
			$aggro .= "$1, $2, $3, $4, $5\n";
		}
	}	

	return $aggro;
}

my $url = "http://www.fuelgaugereport.com/sbsavg.asp";

print download_n_strip($url);

The first thing you need to do (once Perl itself is installed, if it isn't already) is get the WWW::Mechanize module. The easiest way to do this is with a CPAN installer. Note that if you're on Windows, from now on I'm going to assume you're using Cygwin to avoid the need for multiple sets of instructions.

So, regardless of whether you're using Cygwin, OS X's Terminal.app, or any number of Linux terminal emulators, type cpan WWW::Mechanize. If that fails, try sudo cpan WWW::Mechanize (I had to on OS X). This might take a minute, especially if you have dependencies to install. Make sure to hit enter at all of the prompts asking you to install extra packages, as Mechanize needs these to work properly.
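
If the cpan command gives you trouble, the same install can also be run as a one-liner through the CPAN module itself (an equivalent route, shown here just in case):

perl -MCPAN -e 'install WWW::Mechanize'

Prefix it with sudo if you hit permission errors, just like above.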

Once you’ve done that, open a text editor, and paste the code above into a file called gas.pl. Now, let’s step through it a bit.

The first line tells the system which interpreter to use. In my case, Perl is installed at /usr/bin/perl. You should alter this line to point to wherever your interpreter lives: in your terminal, try typing which perl and hitting enter, and it will tell you where Perl is installed.
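
For example, on a typical Linux or OS X machine the exchange looks something like this (the path on your machine may differ):

$ which perl
/usr/bin/perl

If you get back something else, say /usr/local/bin/perl, make the first line of the script #!/usr/local/bin/perl instead.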

After that, we turn on strict and warnings, two pragmas you should more or less always enable in your own scripts. Then, we tell Perl to load the WWW::Mechanize module.

Once all of that boilerplate is done, we reach the meat of the script: the download_n_strip function. Maybe it’s not the most creative name, but it gets the job done. In Perl you declare functions using the sub keyword.
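
If you haven't written Perl functions before, here's a tiny sketch (the names are made up purely for illustration): arguments arrive in the special @_ array, which is why download_n_strip reads $_[0] to get its first argument.

sub greet {
	my $name = $_[0]; # first argument, the same idiom the real script uses
	return "Hello, $name\n";
}

print greet("world"); # prints "Hello, world"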

It may seem a little rough, but the next few lines can be explained without much fuss. I’m going to annotate the code with comments to help you follow along easily:


sub download_n_strip {
	my $mech = WWW::Mechanize->new(); # Create a new Mechanize instance
	my $url = $_[0]; # Grab the first argument passed to the function
	my $aggro; # Declare the string we'll later return

	$mech->get($url); # Fetch the HTML from the URL

	my $dump = $mech->content(); # Put the HTML into the $dump variable
	my @dumparr = split(/\n/, $dump); # Split the string by lines, storing each resulting line in an array slot

	my $line = ""; # Declare a variable for temporary storage

These are the same setup lines as in the full listing above, just with comments added.

After knocking that out, we have the core of the function: the regular expression.


	foreach $line (@dumparr) {
		if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
			$aggro .= "$1, $2, $3, $4, $5\n";
		}
	}

It might seem a bit complicated, but let’s take a look at one of the lines from the HTML file we’re dealing with:


<tr><td><a href="FLavg.asp">Florida                                                     </a></td><td>$2.953</td><td>$3.106</td><td>$3.164</td><td>$3.671</td></tr>

If you recall, a regular expression is just a way of defining a pattern in text. In this case, we're defining a pattern that matches each row of the gas-price table. If regular expressions are still new to you, it's worth working through a short tutorial on the subject first. Then try lining up each special character in the regular expression with the sample row above to see how everything fits together. If you're still confused, drop a comment and I'll help out.

Basically, the loop goes through each line of the HTML looking for that pattern. When a line matches, whatever falls between the parentheses, (.*?), gets captured. Using .*? (a non-greedy match: grab as little as possible before the next part of the pattern) is a bit of a cop-out. In reality, we should be using character classes or something more precise there, but for our purposes this is a bit easier to understand. Written this way, the regex captures the state name between the link tags and each price between a dollar sign and the closing </td> tag.
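
If you want to convince yourself the pattern does what I claim before pointing it at the live page, you can run it against the Florida row from above (a quick standalone test; the padding inside the link has been shortened here):

#!/usr/bin/perl
use strict;
use warnings;

my $line = '<tr><td><a href="FLavg.asp">Florida   </a></td><td>$2.953</td><td>$3.106</td><td>$3.164</td><td>$3.671</td></tr>';

if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
	print "$1, $2, $3, $4, $5\n"; # prints: Florida, 2.953, 3.106, 3.164, 3.671
}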

The if statement tests the current line against the pattern. When there's a match, we concatenate a new line onto the $aggro variable, which holds all of the data we want to return. For ease of use and portability, everything goes into CSV format.

It's pretty straightforward:

$aggro .= "$1, $2, $3, $4, $5\n";

"$1, $2, $3, $4, $5\n" expands to a string like "Florida, 2.953, 3.106, 3.164, 3.671", where each number corresponds to a hit in the regex, and a newline character is appended to the end so the next string appended will be on a new line.

After that, we return $aggro and all is well.

Once the function is nice and defined, we can call it in the code:


my $url = "http://www.fuelgaugereport.com/sbsavg.asp";

print download_n_strip($url);

Here we store the URL we want to pull the data from, pass it to the function we just defined, and print the result.

Now, save the file, navigate to the directory it's in using the terminal, and run chmod +x gas.pl to make the file executable.

Once you've done that, you can either run it and print the results to the screen by typing ./gas.pl, or send them to a CSV file. In a Unix/Linux environment, you can redirect the output of a command to a file using the > character, so to write the data to an Excel-readable file, type ./gas.pl > gas_prices.csv at the terminal.
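
Open gas_prices.csv afterwards and you'll see one comma-separated row per state, in the same shape as the Florida example above (the actual numbers will be whatever the page showed when you ran the script):

Florida, 2.953, 3.106, 3.164, 3.671

and so on down the list of states.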

You can do some fun stuff with this data now. Load it into Excel, graph it, whatever. Store the data over a couple of months and chart the changes across states. This is part of the fun of using data you find around the Web: you don't have to collect it yourself! Of course, there are caveats; the big one is to make sure you trust your data source.
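
If you do want to build up a few months of history, one low-effort approach (assuming a Unix-like system with cron, and with the paths below standing in for wherever you keep the script and its output) is to let cron run the script once a day and write each day's prices to a date-stamped file:

0 8 * * * /home/you/gas.pl > /home/you/gas-prices/$(date +\%F).csv

The backslash before % matters, since cron treats a bare % specially; each run then leaves a file like 2008-10-21.csv that you can pull into a chart later.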

Let me know if you've found this useful. I know a few news organizations use this exact data set for packages and graphics on their Web sites, and they copy the numbers out by hand! The horror!
