So I was just browsing some code, and I came across a cool module I’d never seen before: glob.

Basically, it has two functions that return either a list or an iterator of file names in a directory, matched using shell-style patterns. So, for example, to print every jpg file in a directory:

import glob
for file in glob.glob("*.jpg"):
    print file

Introduced in Python 2.5, iglob is the other function in the module. It returns an iterator, which means the data isn’t all stored in one buffer or list in memory, but can be read out one at a time.

Take this interpreter session, for example:

>>> import glob
>>> files = glob.glob("*.*")
>>> files
['default.jpg', 'my_generated_image.png', 'playbutton.png', 'playbutton.psd']
>>> files = glob.iglob("*.*")
>>> files
<generator object at 0x827d8>

Notice how iglob hands back a generator instead of a list? Each file name has to be pulled out of the iterator individually, which is useful in circumstances where your glob query matches a huge number of files. The caveat is that you can't go back in an iteration, so you'd best either store a few file names as you go, or be sure you're done with each file before you call .next() again.
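To see the difference outside an interpreter session, here's a small self-contained sketch. It builds its own throwaway directory of empty files to glob against, and it uses Python 3's print function and next() built-in, unlike the 2.x sessions above:

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few empty files to glob against
demo_dir = tempfile.mkdtemp()
for name in ("a.jpg", "b.jpg", "c.png"):
    open(os.path.join(demo_dir, name), "w").close()

pattern = os.path.join(demo_dir, "*.jpg")

jpgs = glob.glob(pattern)    # a plain list, built all at once in memory
lazy = glob.iglob(pattern)   # an iterator; names come out one at a time

print(sorted(os.path.basename(p) for p in jpgs))    # ['a.jpg', 'b.jpg']
print(os.path.basename(next(lazy)))                 # one name at a time
```

Note that the .png file never shows up, since the pattern only asks for jpgs.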

Anyway, if you need to get a list of files that follow a pattern like that, give it a shot. It’s pretty nifty.


So unfortunately, the guy who developed libgmail has stopped maintaining it, and the version you have is most likely defunct. HOWEVER: there is an updated version of libgmail. Share and enjoy.

I was working on a project once where I needed to download some Google Alert e-mails from a Gmail account and automatically download and organize the data linked from the e-mails. It may or may not sound complicated depending on who you are, but I found a little module called libgmail to be invaluable.

It’s basically an API that, as far as I can tell, interacts with the Gmail Web interface.

You may be asking yourself what benefit this has over, say, accessing your Gmail using IMAP or POP3. I'll tell you: queries. See, in the Gmail Web interface, you can run queries on your e-mail in the search box, such as from:hilary to get all e-mail correspondence between you and anyone named Hilary. You can do that with domain names as well (from:gmail.com, to get anyone who has sent you mail from a Gmail address). You can also combine those with things like is:unread, is:starred, label:inbox or label:work. You get the picture: you can have very specific views of your data, and we can tap into this power from Python with libgmail.

So let’s go through a basic script. Once you’ve downloaded and installed libgmail (you must install the mechanize module as a prerequisite), open up your Python interpreter and type this:

>>> import libgmail
>>> ga = libgmail.GmailAccount("", "PASSWORD")
>>> ga.login()
>>> threads = ga.getMessagesByQuery("label:work is:unread", allPages=True)
>>> len(threads)

Fair warning: with allPages=True in the getMessagesByQuery call, this could take a while. If you just want to test whether it works, set allPages=False. The len(threads) line should print out the number of unread e-mail threads you've filed under a "work" label, just to see how much of a slacker you're really being. I get a fairly high number, and I'm sure you will too.

Now, to see the subjects of the last 10 threads you’ve not read in your work e-mail, type this at your console:

>>> for thread in threads[0:10]:
...     print "%s (%s)" % (thread.subject, len(thread))

This should print out something like:

<b>tuesday's page schedule</b> (1)
<b>Final Notice -- Free Picture Frame Offer</b> (1)
...


What the previous code did was loop through a slice of the first 10 threads, printing out the subject and the number of e-mails in each thread (unread e-mails are bold in Gmail, hence the <b></b> tag pairs). A "thread" in Gmail is a collection of related messages.

In any case, this is just a small sliver of what you can do with libgmail. I suggest you download the latest version and documentation to figure out a little more of what it can do. The downloads even include sample scripts to give you a hand.

One of Python’s many strengths is its large library of modules.

OK, so you may not be able to fly (well, maybe with that medicine…), but you can import antigravity now. It just opens up a Web browser with the xkcd comic in it.

In any case, what does this have to do with processing a bunch of images in Python? Well, there's a handy module called PIL, the Python Imaging Library, that can do a number of interesting things to image files using just code. You can blur an image, resize it, save it as a different file type and more.

Unfortunately, most installs of Python don't come with PIL installed by default. To remedy this, first download the source file, unzip it (you can use the command tar xvf Imaging-1.1.6.tar.gz on most *nix systems), change to the directory (cd Imaging-1.1.6) and run sudo python setup.py install.

If all goes according to plan, when you go into the Python interactive interpreter and type import Image, you shouldn’t get an error.

The Coderholic blog has a great post on how to do batch image processing using PIL. Check it out.

If you're like me and you take a lot of screenshots using OS X's built-in Grab application, you can see how you could write a script to process all of the resulting TIFF files into a JPEG format. Try it out for yourselves; I'll post the code for how to do it in a day or so if nobody else does.

In my previous post I mentioned a cool little tool called Yahoo! Pipes, but didn’t really explain what it was. Pipes is a tool much like Automator in that it lets you create workflows to deal with data instead of writing code.

It’s not that I don’t like writing code, but sometimes it’s a lot easier to get things done when Yahoo! takes care of the hosting of your script and provides output options like RSS, ATOM and KML (for mapping).

It’s not always the best way to go — to be honest, I don’t like Yahoo!’s geocoder — but in a pinch, Pipes can be a life-saver. Hell, even if you have the time to write a script it might be worthwhile to see what you can get done with a little drag-and-drop magic.

Webmonkey provides a really great overview of how to get started with Pipes, guiding you through filtering, sorting and merging various RSS feeds. If you work through their tutorial and play with it a bit, you should have a pretty decent idea of the power Yahoo! is giving you.

I actually use it to get a combined RSS feed for all of the Alligator's individual sections. With the click of a button, I can get that output as JSON, PHP and even an iGoogle gadget! Pretty nifty, right?

Sometimes I like to point out that there are tools around that will let you accomplish tasks you'd normally have to script. Interestingly enough, Google Docs comes with a set of functions (importHTML, importData, importXML and importFeed) that will allow you to grab data on the Web and put it into your spreadsheet.

The blog has a good post on how you can use importHtml to pull in a table from Wikipedia. It even goes so far as to pump the resulting spreadsheet into Yahoo! Pipes to map the data. If you don't know what Yahoo! Pipes is, it's basically another service that allows you to manipulate data without writing code. I'll be sure to cover it in a blog post coming soon.
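If you want to try it without the tutorial, the formula looks something like this in a spreadsheet cell (the Wikipedia URL and the table index here are just an example; the index picks which table on the page you want):

```
=IMPORTHTML("http://en.wikipedia.org/wiki/List_of_countries_by_population", "table", 1)
```

Drop that into cell A1 and the whole table spills into the sheet.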

In any case, being able to scrape text using Google Docs is a great way to get moving quickly on data-based projects.

Has anyone done a project using this method? I’m working on one right now at the Alligator. I’ll update this post when I get it up and running.

The Internet is a bit like a giant database. There’s information and data strewn about, distributed all over the world, and we run queries on the data using tools like Google.

Many different sites put out constantly updated, useful information, like US gas prices. The problem is that they don't all supply APIs and XML feeds the way the New York Times does. This means that after we've tracked down our data source, we need to find a way to pry the information out of the tangled mess of an HTML document.

There are a few ways to do this, and I’m going to guide you through one that people might call messy and error-prone, but for sites like the AAA gas prices list it works well. Once we’ve untangled this data from the HTML, we have access to some nice evergreen content for our site without ever having to lift a finger (until, that is, they change their site layout). This method is known as “screen scraping,” because it relies on the way the data is structured and displayed on the page.

Oh, and did I forget to mention? We’re doing it in Perl. The code first:


#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

sub download_n_strip {
	my $mech = WWW::Mechanize->new();
	my $url = $_[0];
	my $aggro;

	$mech->get($url);

	my $dump = $mech->content();
	my @dumparr = split(/\n/, $dump);
	my $line = "";

	foreach $line (@dumparr) {
		if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
			$aggro .= "$1, $2, $3, $4, $5\n";
		}
	}

	return $aggro;
}

my $url = "";

print download_n_strip($url);

The first thing you need to do (if you haven't already installed Perl) is to get the WWW::Mechanize module. The easiest way to do this is with a CPAN installer. Note that if you're on Windows, from now on I'm going to assume you're using Cygwin to avoid the need for multiple sets of instructions.

So, regardless of whether you're using Cygwin, OS X's Terminal, or any number of Linux terminal emulators, type cpan install WWW::Mechanize. If that fails, try sudo cpan install WWW::Mechanize (I had to on OS X). This might take a minute, especially if you have dependencies to install. Make sure to hit enter at all of the prompts asking you to install extra packages, as Mechanize needs these to work properly.

Once you've done that, open a text editor and paste the code above into a file; I'll call mine gas.pl. Now, let's step through it a bit.

The first line points to the interpreter to use. In my case, Perl is installed at /usr/bin/perl. You should alter this line to point to wherever your interpreter is; in your terminal, try typing which perl and hitting enter. It should let you know where it's installed.

After that, we set some interpreter options that you should more-or-less always set for your personal scripts. Then, we tell it to load the WWW::Mechanize module.

Once all of that boilerplate is done, we reach the meat of the script: the download_n_strip function. Maybe it’s not the most creative name, but it gets the job done. In Perl you declare functions using the sub keyword.

It may seem a little rough, but the next few lines can be explained without much fuss. I’m going to annotate the code with comments to help you follow along easily:

sub download_n_strip {
my $mech = WWW::Mechanize->new(); #Create a new Mechanize instance
my $url = $_[0]; #Grab the first variable passed to the function
my $aggro; #declare the string we'll later return

$mech->get($url); #Gets the HTML from the url

my $dump = $mech->content(); #puts the HTML into the $dump variable
my @dumparr = split(/\n/, $dump); #Splits the string by lines, stores each resulting line in an array slot.

my $line = ""; #declare a variable for temporary storage.

Sorry for the switch in code style, but the code box was too large to add these comments.

After knocking that out, we have the core of the function: the regular expression.

	foreach $line (@dumparr) {
		if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
			$aggro .= "$1, $2, $3, $4, $5\n";
		}
	}
It might seem a bit complicated, but let’s take a look at one of the lines from the HTML file we’re dealing with:

<tr><td><a href="FLavg.asp">Florida                                                     </a></td><td>$2.953</td><td>$3.106</td><td>$3.164</td><td>$3.671</td></tr>

If you recall, regular expressions are just a way of defining a pattern in text. In this case, we're defining a pattern that matches each row of the gas table. If you need a refresher, work through a fairly painless tutorial on the subject, then try to match up each special character in the regular expression with the sample row above and see how everything interacts. If you're still confused, drop a comment and I'll help out.

Basically, the loop goes through each line of text in the HTML looking for that pattern. When it finds it, it grabs anything between the parentheses, (.*?). Now, using the .*? pattern (match anything up until the following character in the regex) is a bit of a cop-out. In reality, we should be using character classes or something similar there, but for our purposes this might be a bit easier to understand. The way we do it, the regex will try to take in all of the data between the tags we’ve defined, excluding dollar signs.

In that if statement, we test the current line against the pattern; when there is a positive match, we concatenate a new line onto the $aggro variable, which is holding all of the data we want to return. For ease of use and portability, I have everything going into a CSV format.

It's pretty straightforward:

$aggro .= "$1, $2, $3, $4, $5\n";

"$1, $2, $3, $4, $5\n" expands to a string like "Florida, 2.953, 3.106, 3.164, 3.671", where each number corresponds to a hit in the regex, and a newline character is appended to the end so the next string appended will be on a new line.
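If the Perl syntax is throwing you, here's the same match transliterated into Python's re module and run against the sample row from above. This is just for comparison, not part of the original script:

```python
import re

# The sample table row from the gas prices page, trimmed for readability
row = ('<tr><td><a href="FLavg.asp">Florida     </a></td>'
       '<td>$2.953</td><td>$3.106</td><td>$3.164</td><td>$3.671</td></tr>')

# Same pattern as the Perl version: capture the state name, then the
# four dollar amounts, leaving the $ signs outside the capture groups
pattern = re.compile(
    r'<a href=".*?">(.*?)\s+</a>.*?'
    r'\$(.*?)</td>.*?\$(.*?)</td>.*?\$(.*?)</td>.*?\$(.*?)</td>')

match = pattern.search(row)
csv_line = ", ".join(match.groups())
print(csv_line)  # Florida, 2.953, 3.106, 3.164, 3.671
```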

After that, we return $aggro and all is well.

Once the function is nice and defined, we can call it in the code:

my $url = "";

print download_n_strip($url);

Here we store the URL we want to pull the data from, pass it to the function we just defined and then print the result.

Now, save the file, navigate to the directory it's in using the terminal and run chmod +x gas.pl to make the file executable.

Once you've done that, you can either run it and print to the screen by typing ./gas.pl, or send the output to a CSV file by redirecting it. In a Unix/Linux environment, you can redirect the output of a command to a file using the > character, so to write the data to an Excel-readable file, type ./gas.pl > gas_prices.csv at the terminal.

You can do some fun stuff with this data now. Load it into an Excel file, graph it, whatever. Store the data over a couple of months and chart the changes across states. This is part of the fun of using data you find around the Web: you don't have to collect it! Of course, there are caveats; namely, make sure you trust your data source.

Let me know if you've found this useful. I know a few news organizations use this exact data set for packages and graphics on their Web sites, and they copy it out by hand. The horror!

I work with a lot of files at the Alligator, mostly because our CMS is awful, but we'd end up having a ton of files either way.

In any case, I find that a lot of the time, the photo department doesn't name the files how I want them: they should be prefixed with yymmdd, where the day is the day we're producing the paper, not the day the paper comes out. Why it's like that is a long story, but in any case, I need to rename files en masse a lot.

I’ve tried searching for easy ways of doing this, and it generally comes down to two ways: the command line, or Automator. Once I discovered you could create Finder plugins with Automator, I knew I had my answer. Of course, you won’t always need to rename files the same way, so the big question was whether I could be prompted to input the text I want to replace.

Fortunately, Automator can take input from the user at run time. Check out my workflow:

What this does is grab whatever Finder items you have selected (on your Desktop or in a folder), ask you what text you want to replace, and ask what you want to replace it with. The magic here is selecting "options" on the "Rename Finder Items" pane and clicking "Show this action when the workflow runs."

Amazingly, that exact pane pops up as a dialog box, meaning you can find and replace whatever you want, whenever you want. To finish up, click "File," then "Save As Plugin…" and save it as a Finder plugin.

Now, when you want to rename a bunch of files that are misnamed in the same way, select them, right-click (or control-click), go to “More,” “Automator,” and select your plugin. You should then get something like this:

Voila! If you’ve done all of the steps right, you should be renaming tons of files in no time. Don’t thank me, thank Apple for taking some of the pain out of scripting their operating system.
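And if you'd rather take the command-line route I mentioned earlier, a find-and-replace rename is only a few lines of Python. This sketch is my own, not the Automator workflow, and the file names and date strings are just examples:

```python
import os
import tempfile

def rename_all(directory, find, replace):
    """Rename every file in directory, swapping find for replace in its name."""
    for name in os.listdir(directory):
        if find in name:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, name.replace(find, replace)))

# Demo: a throwaway folder of photos stamped with the wrong date prefix
demo = tempfile.mkdtemp()
for name in ("080914_cops.jpg", "080914_sports.jpg", "logo.png"):
    open(os.path.join(demo, name), "w").close()

rename_all(demo, "080914", "080913")
print(sorted(os.listdir(demo)))
# ['080913_cops.jpg', '080913_sports.jpg', 'logo.png']
```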