So I was just browsing some code, and I came across a cool module I’d never seen before: glob.
Basically, it has two functions that either return a list or an iterator of files in a directory using shell pattern matching. So, for example, to return a list of all jpg files in a directory:
import glob
for file in glob.glob("*.jpg"):
    print file
Introduced in Python 2.5, iglob is the other function in the module. It returns an iterator, which means the data isn’t all stored in one buffer or list in memory, but can be read out one at a time.
Take this interpreter session, for example:
>>> import glob
>>> files = glob.glob("*.*")
>>> files
['default.jpg', 'my_generated_image.png', 'piltest.py', 'playbutton.png', 'playbutton.psd']
>>> files = glob.iglob("*.*")
>>> files
<generator object at 0x827d8>
>>> files.next()
'default.jpg'
>>> files.next()
'my_generated_image.png'
Notice how each file name had to be popped out of the iterator individually? That’s useful in circumstances where you might have a huge number of files as a result of your glob query. The caveat is that you can’t go back in an iteration, so you’d best either store a few file names as you go, or be sure you’re done with the file before you call .next().
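For anyone following along on Python 3, here is a minimal sketch of the same glob/iglob comparison. It assumes Python 3, where the built-in next() replaces the .next() method used in the session above; the temporary folder and file names are made up purely so the example is self-contained.

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few empty files so the sketch is
# self-contained; the file names are made up for illustration.
workdir = tempfile.mkdtemp()
for name in ("a.jpg", "b.jpg", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

pattern = os.path.join(workdir, "*.jpg")

# glob.glob() builds the whole list up front.
all_jpgs = sorted(os.path.basename(p) for p in glob.glob(pattern))

# glob.iglob() hands back an iterator; next() pulls one match at a time.
matches = glob.iglob(pattern)
first = os.path.basename(next(matches))
second = os.path.basename(next(matches))

# The iterator is now exhausted: asking again yields the default value
# instead of starting over.
third = next(matches, None)

print(all_jpgs)
print(sorted([first, second]), third)
```

The order in which matches come back isn't guaranteed, which is why the sketch sorts before printing.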
Anyway, if you need to get a list of files that follow a pattern like that, give it a shot. It’s pretty nifty.
Filed under: Info, Tutorial | 1 Comment
Tags: files, information, python
Scripting Gmail: A short example
I was working on a project once where I needed to download some Google Alert e-mails from a Gmail account and automatically download and organize the data linked from the e-mails. It may or may not sound complicated depending on who you are, but I found a little module called libgmail to be invaluable.
It’s basically an API that, as far as I can tell, interacts with the Gmail Web interface.
You may be asking yourself what benefit this has over, say, accessing your Gmail using IMAP or POP3. I’ll tell you: queries. See, in the Gmail Web interface, you can run queries on your e-mail in the search box, such as from:hilary to get all e-mail sent to you by anyone named Hilary. You can do that with domain names as well (from:@gmail.com) to get anyone who has sent you mail from a Gmail address. You can also combine those with things like is:unread, is:starred, label:inbox or label:work. You get the picture: you can have very specific views of your data, and we can tap into this power using Python with libgmail.
So let’s go through a basic script. Once you’ve downloaded and installed libgmail (you must install the mechanize module as a prerequisite), open up your Python interpreter and type this:
>>> import libgmail
>>> ga = libgmail.GmailAccount("YOURACCT@gmail.com", "PASSWORD")
>>> ga.login()
>>> threads = ga.getMessagesByQuery("label:work is:unread", allPages=True)
>>> len(threads)
This should print out the number of unread e-mail threads you’ve filtered to a “work” label, just to see how much of a slacker you’re really being. I get a fairly high number, and I’m sure you will too. Fair warning: with allPages=True in the getMessagesByQuery call, it could take a while. If you just want to test whether this works, set allPages=False.
Now, to see the subjects of the last 10 threads you’ve not read in your work e-mail, type this at your console:
>>> for thread in threads[0:10]:
...     print "%s (%s)" % (thread.subject, len(thread))
This should print out something like:
<b>tuesday's page schedule</b> (1)
<b>Final Notice -- Free Picture Frame Offer</b> (1)
...
etc.
What the previous code did was loop through a slice of the first 10 threads, printing out the subject and the number of e-mails in each thread. (Unread e-mails are bold in Gmail, hence the <b></b> tag pairs.) A “thread” in Gmail is a collection of related messages.
In any case, this is just a small sliver of what you can do with libgmail. I suggest you download the latest version and documentation to figure out a little more of what it can do. The downloads even include sample scripts to give you a hand.
Filed under: Tutorial | 1 Comment
Tags: gmail, libgmail, python, Tutorial
One of Python’s many strengths is its large library of modules.

OK, so you may not be able to fly (well, maybe with that medicine…), but you can import antigravity now. It just opens up a Web browser showing the xkcd comic that joked about it.
In any case, what does this have to do with processing a bunch of images in Python? Well, there’s a handy module called PIL — the Python Imaging Library, that can do a number of interesting things to image files using just code. You can blur an image, resize it, save it as a different file type and more.
Unfortunately, most installs of Python don’t come with PIL installed by default. To remedy this, first download the source file, unpack it (you can use the command tar xvzf Imaging-1.1.6.tar.gz on most *nix systems), change to the directory (cd Imaging-1.1.6) and run sudo python setup.py install.
If all goes according to plan, when you go into the Python interactive interpreter and type import Image, you shouldn’t get an error.
The Coderholic blog has a great post on how to do batch image processing using PIL. Check it out.
If you’re like me and you take a lot of screenshots using OSX’s built-in Grab application, you can see how you could write a script to process all of the resulting TIFF files into a JPEG format. Try it out for yourselves; I’ll post the code for how to do it in a day or so if nobody else does.
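In the meantime, here is a rough sketch of how such a TIFF-to-JPEG script might look. It is not the promised code, just one way to do it under a few assumptions: it targets Python 3 and Pillow, the maintained fork of PIL, which is imported as from PIL import Image (the original PIL 1.1.6 used a plain import Image). The function names jpeg_path_for and convert_folder are made up for this example.

```python
import os


def jpeg_path_for(tiff_path):
    """Return the output name for a TIFF: same base name, .jpg extension."""
    base, _ext = os.path.splitext(tiff_path)
    return base + ".jpg"


def convert_folder(folder):
    """Convert every .tif/.tiff file in `folder` to a JPEG alongside it."""
    # Imported here so the helper above works even without Pillow installed.
    from PIL import Image

    converted = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith((".tif", ".tiff")):
            continue
        source = os.path.join(folder, name)
        target = jpeg_path_for(source)
        # JPEG has no alpha channel, so flatten to RGB before saving.
        Image.open(source).convert("RGB").save(target, "JPEG")
        converted.append(target)
    return converted
```

Calling convert_folder on the folder where your screenshots land would leave a .jpg next to each TIFF; the originals are not deleted.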
Filed under: Info, Tutorial | Leave a Comment
Tags: batch, images, PIL, python
In my previous post I mentioned a cool little tool called Yahoo! Pipes, but didn’t really explain what it was. Pipes is a tool much like Automator in that it lets you create workflows to deal with data instead of writing code.
It’s not that I don’t like writing code, but sometimes it’s a lot easier to get things done when Yahoo! takes care of the hosting of your script and provides output options like RSS, ATOM and KML (for mapping).
It’s not always the best way to go — to be honest, I don’t like Yahoo!’s geocoder — but in a pinch, Pipes can be a life-saver. Hell, even if you have the time to write a script it might be worthwhile to see what you can get done with a little drag-and-drop magic.
Webmonkey provides a really great overview of how to get started with Pipes, guiding you through filtering, sorting and merging various RSS feeds. If you work through their tutorial and play with it a bit, you should have a pretty decent idea of the power Yahoo! is giving you.
I actually use it to get a combined RSS feed for all of the Alligator’s individual sections. With the click of a button, I can get that outputted as JSON, PHP and even an iGoogle gadget! Pretty nifty, right?
Filed under: Info | Leave a Comment
Tags: Yahoo! Pipes
Sometimes I like to point out that there are tools around that will let you accomplish tasks you’d normally have to script. Interestingly enough, Google Docs comes with a set of functions — importHTML, importData, importXML and importFeed — that will allow you to grab data on the Web and put it into your spreadsheet.
The OUseful.info blog has a good post on how you can use importHTML to pull in a table from Wikipedia. It even goes so far as to pump the resulting spreadsheet into Yahoo! Pipes to map the data. If you don’t know what Yahoo! Pipes is, it’s basically another service that allows you to manipulate data without writing code. I’ll be sure to cover it in a blog post coming soon.
In any case, being able to scrape text using Google Docs is a great way to get moving quickly on data-based projects.
Has anyone done a project using this method? I’m working on one right now at the Alligator. I’ll update this post when I get it up and running.
Filed under: Info | 1 Comment
Tags: data scraping, google docs, scraping
The Internet is a bit like a giant database. There’s information and data strewn about, distributed all over the world, and we run queries on the data using tools like Google.
Many different sites put out constantly updated, useful information, like US gas prices. The problem is that they don’t all supply APIs and XML feeds like the New York Times. This means after we’ve tracked down our data source, we need to find a way to pry that information out of the tangled mess of the HTML document.
There are a few ways to do this, and I’m going to guide you through one that people might call messy and error-prone, but for sites like the AAA gas prices list it works well. Once we’ve untangled this data from the HTML, we have access to some nice evergreen content for our site without ever having to lift a finger (until, that is, they change their site layout). This method is known as “screen scraping,” because it relies on the way the data is structured and displayed on the page.
Oh, and did I forget to mention? We’re doing it in Perl. The code first:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
sub download_n_strip {
    my $mech = WWW::Mechanize->new();
    my $url = $_[0];
    my $aggro;
    $mech->get($url);
    my $dump = $mech->content();
    my @dumparr = split(/\n/, $dump);
    my $line = "";
    foreach $line (@dumparr) {
        if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
            $aggro .= "$1, $2, $3, $4, $5\n";
        }
    }
    return $aggro;
}
my $url = "http://www.fuelgaugereport.com/sbsavg.asp";
print download_n_strip($url);
The first thing you need to do (once you have Perl installed) is get the WWW::Mechanize module. The easiest way to do this is with a CPAN installer. Note that if you’re on Windows, from now on I’m going to assume you’re using cygwin to avoid the need for multiple sets of instructions.
So, regardless of whether you’re using cygwin, OS X’s Terminal.app, or any number of Linux terminal emulators, type cpan WWW::Mechanize. If that fails, try sudo cpan WWW::Mechanize (I had to on OS X). This might take a minute, especially if you have dependencies to install. Make sure to hit enter at all of the prompts asking you to install extra packages, as Mechanize needs these to work properly.
Once you’ve done that, open a text editor, and paste the code above into a file called gas.pl. Now, let’s step through it a bit.
The first line points to the interpreter to use. In my case, Perl is installed at /usr/bin/perl. You should alter this line to point to wherever your interpreter is. In your terminal, try typing which perl and hitting enter; it should let you know where it’s installed.
After that, we set some interpreter options that you should more-or-less always set for your personal scripts. Then, we tell it to load the WWW::Mechanize module.
Once all of that boilerplate is done, we reach the meat of the script: the download_n_strip function. Maybe it’s not the most creative name, but it gets the job done. In Perl you declare functions using the sub keyword.
It may seem a little rough, but the next few lines can be explained without much fuss. I’m going to annotate the code with comments to help you follow along easily:
sub download_n_strip {
    my $mech = WWW::Mechanize->new();   # Create a new Mechanize instance
    my $url = $_[0];                    # Grab the first variable passed to the function
    my $aggro;                          # Declare the string we'll later return
    $mech->get($url);                   # Get the HTML from the URL
    my $dump = $mech->content();        # Put the HTML into the $dump variable
    my @dumparr = split(/\n/, $dump);   # Split the string by lines, storing each line in an array slot
    my $line = "";                      # Declare a variable for temporary storage
Sorry for the switch in code style, but the code box was too large to add these comments.
After knocking that out, we have the core of the function: the regular expression.
    foreach $line (@dumparr) {
        if($line =~ m/<a href=".*?">(.*?)\s+<\/a>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>.*?\$(.*?)<\/td>/m) {
            $aggro .= "$1, $2, $3, $4, $5\n";
        }
    }
It might seem a bit complicated, but let’s take a look at one of the lines from the HTML file we’re dealing with:
<tr><td><a href="FLavg.asp">Florida </a></td><td>$2.953</td><td>$3.106</td><td>$3.164</td><td>$3.671</td></tr>
If you recall, regular expressions are just a way of defining a pattern in text. In this case, we’re defining a pattern that matches each row of this gas table. You might need another fairly painless tutorial on the subject, so check it out. Try to match each special character in the regular expression up and see how everything interacts. If you’re still confused, drop a comment and I’ll help out.
Basically, the loop goes through each line of text in the HTML looking for that pattern. When it finds it, it grabs anything between the parentheses, (.*?). Now, using the .*? pattern (match anything up until the following character in the regex) is a bit of a cop-out. In reality, we should be using character classes or something similar there, but for our purposes this might be a bit easier to understand. The way we do it, the regex will try to take in all of the data between the tags we’ve defined, excluding dollar signs.
In that if statement, where we test the current line against the pattern, when there is a positive match we concatenate a new line onto the $aggro variable, which is holding all of the data we want to return. For ease of use and portability, I have everything going into a CSV format.
It seems pretty straight-forward:
$aggro .= "$1, $2, $3, $4, $5\n";
"$1, $2, $3, $4, $5\n" expands to a string like "Florida, 2.953, 3.106, 3.164, 3.671", where each number corresponds to a hit in the regex, and a newline character is appended to the end so the next string appended will be on a new line.
After that, we return $aggro and all is well.
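If Perl's regex syntax is getting in the way, it may help to see the same matching step in Python. This is only a sketch of the idea, not a port of the whole script: it skips the download entirely and runs the pattern against the sample Florida row shown earlier. The names ROW_PATTERN and rows_to_csv are made up for this example.

```python
import re

# Same pattern as the Perl version, written as a Python raw string.
ROW_PATTERN = re.compile(
    r'<a href=".*?">(.*?)\s+</a>'
    r'.*?\$(.*?)</td>.*?\$(.*?)</td>.*?\$(.*?)</td>.*?\$(.*?)</td>'
)


def rows_to_csv(html):
    """Return one "state, price, price, price, price" line per matching row."""
    lines = []
    for line in html.split("\n"):
        match = ROW_PATTERN.search(line)
        if match:
            # groups() holds the five captures, like $1 through $5 in Perl.
            lines.append(", ".join(match.groups()))
    return "\n".join(lines)


sample = (
    '<tr><td><a href="FLavg.asp">Florida </a></td><td>$2.953</td>'
    '<td>$3.106</td><td>$3.164</td><td>$3.671</td></tr>'
)
print(rows_to_csv(sample))  # Florida, 2.953, 3.106, 3.164, 3.671
```

Lines that don't look like a table row simply don't match, so they never make it into the output.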
Once the function is nice and defined, we can call it in the code:
my $url = "http://www.fuelgaugereport.com/sbsavg.asp";
print download_n_strip($url);
Here we store the URL we want to pull the data from, pass it to the function we just defined and then print the result.
Now, save the file, navigate to the directory it’s in using the terminal and chmod +x gas.pl to make the file executable.
Once you’ve done that, you can either run it and output to the screen by typing ./gas.pl, or send the output to a CSV file. In a Unix/Linux environment, you can redirect the output of a command to a file using the > character, so to output the data to an Excel-readable file, type ./gas.pl > gas_prices.csv at the terminal.
You can do some fun stuff with this data now. Load it into an Excel file, graph it, whatever. Store the data over a couple of months and chart the changes across states. This is part of the fun of using data you find around the Web: you don’t have to collect it! Of course, there are caveats with that. Namely, make sure you trust your data source.
Let me know if you’ve found this useful. I know a few news organizations use this exact data set for packages and graphics on their Web sites, and they copy them out by hand! The horror!
Filed under: Tutorial | Leave a Comment
Tags: data, perl, scraping
More Automator: Batch renaming
I work with a lot of files at the Alligator, mostly because our CMS is clunky, but we’d end up having a ton of files either way.
In any case, I find that a lot of times, the photo department doesn’t end up naming the files how I want them: they should be prefixed with yymmdd, where the day is the day we’re producing the paper, not the day the paper comes out. Why it’s like that is a long story, but in any case, I find that I need to rename files en masse a lot.
I’ve tried searching for easy ways of doing this, and it generally comes down to two ways: the command line, or Automator. Once I discovered you could create Finder plugins with Automator, I knew I had my answer. Of course, you won’t always need to rename files the same way, so the big question was whether I could be prompted to input the text I want to replace.
Fortunately, Automator can take input from the user at run time. Check out my workflow:
What this does is get whatever Finder items you have selected (on your Desktop or in a folder), then ask you what text you want to replace and what you want to replace it with. The magic here is selecting “options” on the “Rename Finder Items” pane and clicking “Show this action when the workflow runs.”
Amazingly, that exact pane pops up as a dialog box, meaning you can find and replace whatever you want, whenever you want. To keep the workflow around, click “File,” then “Save As Plugin…,” and save it as a Finder plugin.
Now, when you want to rename a bunch of files that are misnamed in the same way, select them, right-click (or control-click), go to “More,” “Automator,” and select your plugin. You should then get something like this:
Voila! If you’ve done all of the steps right, you should be renaming tons of files in no time. Don’t thank me, thank Apple for taking some of the pain out of scripting their operating system.
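If you'd rather go the command-line route mentioned earlier, the same find-and-replace rename can be sketched in a few lines of Python. This is a minimal sketch assuming Python 3; the function name rename_replacing, the demo file names and the date prefixes are all made up for illustration.

```python
import os
import tempfile


def rename_replacing(folder, find, replace):
    """Rename files in `folder`, swapping `find` for `replace` in each name."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        if find not in name:
            continue
        new_name = name.replace(find, replace)
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        renamed.append((name, new_name))
    return renamed


# Demo in a throwaway folder; the file names and date prefixes are made up.
demo = tempfile.mkdtemp()
for name in ("081103_football.jpg", "081103_rally.jpg", "readme.txt"):
    open(os.path.join(demo, name), "w").close()

changes = rename_replacing(demo, "081103", "081102")
print(changes)
print(sorted(os.listdir(demo)))
```

Unlike the Automator dialog, this doesn't ask before acting, and it doesn't check whether the new name is already taken, so try it on a copy of your files first.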
Filed under: Info, Tutorial | Leave a Comment
Tags: Automator, OSX
In my last post, I said that the Perl, Python and Ruby interpreters were great alternatives to a calculator. I even provided some examples of how you could do nested calculations in the interpreters.
What I failed to both notice and remember is how Python (and apparently Ruby) handle integer arithmetic. Even in my examples, the answer comes out differently for Python and Perl, and I failed to notice. I also used a different example for Ruby, which I shouldn’t have done for consistency’s sake.
Here’s the problem — try this in either Python or Ruby, and you’ll see what happened:
>>> (5 + (2 * 80))/2
82
The answer, of course, should be 82.5. Why is it giving us the wrong answer? It’s not because Python can’t do the math, it’s because we’re telling it to do the math wrong.
Consider:
cranleigh:~ kschwen$ irb
>> (5.0 + (2 * 80))/2
=> 82.5
cranleigh:~ kschwen$ python
>>> (5.0 + (2 * 80))/2
82.5
So, what changed? Before, we were telling Python and Ruby to take an integer and divide it by another integer (in this case, 165 was being divided by 2). These languages take their cues from C, an older programming language that makes a distinction between integer math and decimal/floating-point math.
The code that I originally posted comes out roughly to this in C:
#include <stdio.h>
int main (void) {
    int x = 165/2;
    printf("%i", x);
    return 0;
}
This will output 82 just like Python did, because we’re telling it to do the math with integers. If we change it to this, however, things change:
#include <stdio.h>
int main (void) {
    float x = 165.0 / 2;
    printf("%f", x);
    return 0;
}
Here, we’re explicitly telling the compiler to use floating-point math, which will provide us with a more accurate answer.
This digs down to the foundations of the languages. In Python, you don’t have to explicitly give your variables a type as you do in C. If you’ll notice in the C code, when we were doing division of integers, we stored it in an “int,” which means a variable meant to hold an integer. In the code with the correct answer, we had to tell it we wanted floating-point math.
Python takes care of this for us, but sometimes it can burn you. It takes a look at your numbers — 165 and 2 — and figures that if you’re using two integers, you want an integer in return.
If you add a decimal point to even one of the numbers, you’ve turned it into a float, and Python will return a decimal number, even if there’s just a zero after the decimal:
>>> 80.0 + 5
85.0
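One note for readers on newer interpreters: the sessions above are Python 2. In Python 3 the / operator always does true division, and the old integer behavior lives in a separate // operator. A quick sketch, assuming Python 3:

```python
# Python 3 sketch of the same calculation.
total = 5 + (2 * 80)          # 165, an int

floor_result = total // 2     # floor division: the old integer behavior
true_result = total / 2       # true division: always returns a float

print(floor_result)           # 82
print(true_result)            # 82.5
print(type(80.0 + 5).__name__)  # float
```

On Python 2 you can opt in to the same behavior by putting from __future__ import division at the top of your script.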
Searching for an answer, I stumbled across a blog with a good look at reasonable output of floating-point numbers. If you’re still confused and want another perspective, I suggest you take a look.
Not having to type your variables can often save you a headache, but you have to make sure you’re not introducing a new one in the process. Sorry if I confused you with the last post.
If you’re wondering why Perl gives the correct answer, it’s because Perl’s division operator does floating-point math by default; you only get integer division if you explicitly ask for it (for example, with use integer). This takes slightly longer to execute, which is why C, a language meant to be lean and fast, doesn’t do it, and languages that follow C’s lead will have the same quirks.
Filed under: Info | 3 Comments
Tags: float, math, python
Interpreters as calculators
Note: This post contains errors. I’m quite sorry, but see my correction post on the difference between floating-point and integer math for details.
Sometimes when you’re sitting at a computer you need to hammer out a quick calculation. Despite not being a huge math fan, I find this to be true more often than I’d like.
The problem is, I find calculators to be fairly limited and annoying machines to work with. If you make a mistake, you have to go back and do all of those steps over again. As your math gets more complicated, this tends to get more and more aggravating.
What does this have to do with scripting? Well, if you have a Perl, Python or Ruby interpreter on hand, you can hammer out calculations quickly using just the command line.
Perl
With Perl, just fire up a command line and type your calculations (prefaced by the word “print”) in single-quotes after “perl -e”:
kschwen$ perl -e 'print sqrt((5 + (2 * 80))/2) . "\n"'
9.08295106229247
I think it’s pretty nifty because you can see your whole equation mapped out. The ‘. “\n”‘ at the end forces the answer to print on its own line, otherwise the number would run into the beginning of your command prompt.
Perl is widely known for its command-line scripting abilities. Savvy system administrators know that executing a “perl -e” can save a ton of time when there’s work to be done. I’ve actually found a great resource on perl one-liners if you’re interested.
Python
In Python, the easiest way is to just type “python” at the prompt, and do your calculations in the interactive interpreter:
kschwen$ python
>>> from math import sqrt
>>> sqrt((5 + (2 * 80))/2)
9.0553851381374173
I like doing it this way better, as you don’t have to add print statements or force a newline at the end. Also, since you’re in the interpreter, you can import new modules from the math library to do different things as you please.
Ruby
I know I haven’t really talked about Ruby, but that’s because I don’t use it. However, Ruby is interesting because you can accomplish this two ways:
kschwen$ ruby -e 'print 5 *5; print "\n";'
25
Or:
kschwen$ irb
>> 5 * 5
=> 25
From what I know, this is because Ruby has two interpreters. The interactive one, “irb”, is like Python’s interpreter — you can play with the language, executing different statements in your session. The actual “ruby” program interprets your script files, but can also be given the -e switch like perl’s interpreter, allowing you to execute one-liners on the command prompt.
If that’s inaccurate, please feel free to correct me, as I am not a ruby person.
In any case, doing math in a scripting language’s interpreter is an interesting and simple way to both play with a language and get some serious math done.
Filed under: Info, Tutorial | 1 Comment
Tags: perl, python, ruby
After last week’s mention of AppleScript I want to move onto something even easier and more fun to toy with on OSX. Under your applications folder, you should find a program called Automator.
It’s basically an abstraction layer over AppleScript. Just about anything you can do with Automator can be done by coding AppleScript, but it’s easier to drag some actions and set some options sometimes.
In Automator, you create workflows, which are basically just step-by-step graphical representations of programming functions. You can do things like rename files, create disk images, extract PDF text, get RSS feeds, find iCal items and more. You can even declare and use variables, run outside scripts and use system variables. Some of this stuff — even with proper knowledge of the libraries involved — would take a fair bit of programming skill to pull off.
A fairly simple example is one I use to create archives of files and folders.
Once you’ve created a workflow to match this (by taking action items from the left-hand Automator pane and dragging them to the right side), go to the File menu and save it as a plugin. Select Finder plugin, and give it a name like “Dated backup.”
Now, go to a folder and select some files you’d like to create a dated backup of. Right-click, go to “More” at the bottom of the menu, then go to the “Automator” item and select your plugin. Give it a second.
If all went well, a new file should appear in the folder you’re currently in, containing an archive of the file(s) you selected. It’s a useful workflow that showcases a cool part of Automator, namely how it allows you to manipulate files and folders using Finder and your script as a context-menu plugin.
There’s a bit of a lag, but I believe it’s the actual compression happening, and not an indication that Automator workflows are slower than most scripts. Even if there is some additional execution time, you’ve saved a good chunk of development time.
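For comparison, here is roughly the same idea as a script. It's a minimal sketch assuming Python 3, not a translation of the Automator workflow: the function name dated_backup and the backup-YYYY-MM-DD.zip naming are made up for this example, and it only handles plain files, not folders.

```python
import datetime
import os
import tempfile
import zipfile


def dated_backup(paths, dest_folder, today=None):
    """Zip `paths` into dest_folder/backup-YYYY-MM-DD.zip and return its path."""
    today = today or datetime.date.today()
    archive = os.path.join(dest_folder, "backup-%s.zip" % today.isoformat())
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # Store each file under its bare name rather than its full path.
            zf.write(path, arcname=os.path.basename(path))
    return archive


# Demo with made-up files in a throwaway folder.
folder = tempfile.mkdtemp()
selected = []
for name in ("story.txt", "photo.jpg"):
    path = os.path.join(folder, name)
    with open(path, "w") as handle:
        handle.write("placeholder")
    selected.append(path)

archive = dated_backup(selected, folder, today=datetime.date(2008, 11, 3))
print(os.path.basename(archive))  # backup-2008-11-03.zip
```

It works, but notice how much more there is to type than dragging a few actions into a workflow.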
Tool around with Automator a bit though. I bet you’ll be impressed. I even ran across a guy who uses Automator to move tabs from Opera to Firefox. In his post, he shows off a cool trick where you can record your mouse and keyboard inputs for playback in your script, for when items aren’t nicely AppleScriptable.
Filed under: Info | 1 Comment
Tags: Automator, OSX