Parsing XML Feeds: Ridiculously straight-forward examples

23Sep08

When I first started with Python, I noticed that it had a built-in utility for parsing XML. After using regular expressions to rip through XML files as chunks of structured text (not a fun experience), I thought it would be an interesting idea to attempt it in Python using the built-in minidom parser. As a student of online journalism, I know a lot of data can be found in XML, including data from the National Weather Service. The ability to automate the fetching of data using XML and some scripting is very cool, and insanely useful if you have the right feed.

The test feed I used, and our test feed here, is one of the most frequently updated XML feeds I can think of: the Twitter public timeline. This XML feed updates about once per minute with the most recent posts to Twitter from all over the world. I decided to parse a Twitter feed and display people’s names and tweets, just to see how easy it would be.

As always, code first:

from urllib2 import urlopen
from xml.dom import minidom

feed = urlopen("http://twitter.com/statuses/public_timeline.xml")

doc = minidom.parse(feed)

#Get all doc elements matching a given tag
names = doc.getElementsByTagName("screen_name") #All screen_name elements
updates = doc.getElementsByTagName("text") #All text elements

tweets = zip(names, updates)
for tweeter_node, tweet_node in tweets:
    tweeter = tweeter_node.childNodes[0].nodeValue
    tweet = tweet_node.childNodes[0].nodeValue
    print "%s: %s" % (tweeter, tweet)

Astute readers will see that now we’re using the urllib2 library instead of urllib. The reason is that urllib2’s urlopen() function returns a file-like object, which lets us treat a URL like a local file handle instead of saving a copy of it locally first.
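
To see why a file-like object is all minidom needs, here is a minimal sketch that swaps the network call for an in-memory stream. It is written for Python 3 (the listings in this post are Python 2), and the sample XML and the name fake_feed are made up for illustration.

```python
import io
from xml.dom import minidom

# Made-up XML standing in for a downloaded feed (an assumption for this sketch).
SAMPLE = b"<statuses><status><text>hello</text></status></statuses>"

# io.BytesIO has a read() method, much like the handle urlopen() hands back.
fake_feed = io.BytesIO(SAMPLE)

# minidom.parse() accepts a filename or any open file-like object.
doc = minidom.parse(fake_feed)
root_tag = doc.documentElement.tagName
print(root_tag)
```

Because parse() only needs something it can read from, the same parsing code should work whether the data comes from a URL, a file on disk, or a test fixture like this one.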

Our next step is to use the parse() function of minidom. This function takes a handle to a file and returns a minidom object with the XML data structured and accessible through its methods. In XML, data is set between tags, such as <name>Ken Schwencke</name>. Using minidom, we can get a list of the elements with a given tag name, such as name, by calling the getElementsByTagName() method of the object returned from parse() earlier.
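
As a small illustration of getElementsByTagName(), the sketch below parses an inline string instead of a live feed. It is written for Python 3, and the people wrapper tag is an assumption added only so the sample is well-formed XML.

```python
from xml.dom import minidom

# Made-up sample document; only the <name> tag comes from the text above.
SAMPLE = """<people>
  <name>Ken Schwencke</name>
  <name>Adam Wynn</name>
</people>"""

# parseString() works like parse(), but takes the XML text directly.
doc = minidom.parseString(SAMPLE)

# Every <name> element in the document, in document order.
name_nodes = doc.getElementsByTagName("name")
print(len(name_nodes))
print(name_nodes[0].toxml())
```

Note that the result is a list of element nodes, not a list of strings. Getting at the text inside each element is a separate step.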

So we do this to the screen_name and text tags in the Twitter feed in order to grab all of the tweets and tweeters in the file.

We’re stuck with an odd problem now, though: there’s a one-to-one relationship between each element in the “names” and “updates” lists, so how do we iterate through them both at the same time? We need to combine them into one list and iterate through that.

Python’s built-in zip() function comes in handy here. It takes the corresponding elements of separate lists and “zips” them together into one. For example, if we had two lists of names that had a one-to-one relationship:

>>> first_name = ("Ken", "Adam")
>>> last_name = ("Schwencke", "Wynn")
>>> zip(first_name, last_name)
[('Ken', 'Schwencke'), ('Adam', 'Wynn')]

As you see, the zip() function combined the proper first and last names into matching tuples, all contained within one larger list.

Of course, the first thing we do after zipping the lists into one is split it back up in the for loop. Now that each element in the tweets list corresponds to a matching names/updates pair, we can iterate through the list.
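
The zip-then-unpack pattern can be tried on its own with plain lists. This is a sketch for Python 3, where zip() returns an iterator rather than a list, and the sample strings are invented.

```python
names = ["Ken", "Adam"]
updates = ["parsing XML", "reading feeds"]

# Each pass through the loop unpacks one (name, update) pair.
lines = []
for name, update in zip(names, updates):
    lines.append("%s: %s" % (name, update))

print("\n".join(lines))

# zip() stops at the shorter input, so an unmatched item is silently dropped.
short = list(zip(["a", "b", "c"], [1, 2]))
print(short)
```

The second half is worth keeping in mind with real feeds: if one list comes up short, zip() will not complain, and any gap in the middle of a list would pair names with the wrong updates.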

Here’s where the magic happens, as far as getting data is concerned:

tweeter = tweeter_node.childNodes[0].nodeValue
tweet = tweet_node.childNodes[0].nodeValue

Since the Twitter feed is fairly simple, the elements we’re looking at don’t have any child elements. That is, the only thing between matching screen_name tags is the screen name itself; there are no tags nested between them. The same goes for all text tags. If there were more, the parsing would get more complicated, but this is a “ridiculously straight-forward example.”

So we take the first child, which is the text node holding that content, and access its nodeValue. This is the actual data between the XML tags. Now it’s just a matter of printing out the relevant data:
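
A quick way to check what childNodes[0] refers to is to poke at a tiny document. The sketch below is for Python 3; the user and url tags and the helper text_of() are hypothetical, added only to show what happens when an element is empty.

```python
from xml.dom import minidom

# Made-up sample: one element with text, one empty element.
doc = minidom.parseString(
    "<user><screen_name>kschwen</screen_name><url></url></user>"
)

def text_of(element):
    """Return the text inside an element, or "" if it holds no text node."""
    first = element.firstChild
    if first is None or first.nodeType != first.TEXT_NODE:
        return ""
    return first.nodeValue

name_node = doc.getElementsByTagName("screen_name")[0]
empty_node = doc.getElementsByTagName("url")[0]

print(name_node.nodeValue)                # an element has no value of its own
print(name_node.childNodes[0].nodeValue)  # the text lives in its first child
print(repr(text_of(empty_node)))          # childNodes[0] would raise IndexError here
```

A guard like text_of() is one option if a feed might contain empty tags; for a feed where every tag is known to hold text, indexing childNodes[0] directly is fine.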

print "%s: %s" % (tweeter, tweet)

A “%s” inside of a string is Python shorthand for “a string value will go here later.” The % that follows the string means that we’re passing a tuple of values for Python to plug into that string, in order. In this case, I want the “tweeter” (the name from the screen_name XML tags) followed by the “tweet” itself (culled from the text XML tags).
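
The formatting step can be tried without any XML at all. This sketch uses invented values and Python 3’s print() function; the % operator itself behaves the same way as in the listing.

```python
tweeter = "kschwen"
tweet = "Parsing XML feeds"

# Values are substituted left to right, one per %s.
line = "%s: %s" % (tweeter, tweet)
print(line)

# %d is the placeholder for an integer.
count_line = "%d tweets from %s" % (20, tweeter)
print(count_line)
```

The number of placeholders has to match the number of values in the tuple, or Python raises a TypeError.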

That’s it! You’ve just parsed your first XML feed in Python.

Since I promised multiple examples, here’s another. Get the last published weather information from your nearest airport, or other weather-monitoring station:


from urllib2 import urlopen
from xml.dom import minidom

#Feed for the Gainesville airport.
feed = urlopen("http://www.weather.gov/xml/current_obs/KGNV.xml")

doc = minidom.parse(feed)

loc = doc.getElementsByTagName("location")
temp_f = doc.getElementsByTagName("temperature_string")
time = doc.getElementsByTagName("observation_time")

location = loc[0].childNodes[0].nodeValue
temperature = temp_f[0].childNodes[0].nodeValue
date = time[0].childNodes[0].nodeValue

print "It is %s at %s. %s" % (temperature, location, date)

Find the station nearest you and plug its feed URL into the urlopen() call.
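
To make swapping stations easier, the lookups can be wrapped in small functions. The following is a sketch for Python 3 under a few assumptions: the helper names feed_url() and read_observation() are hypothetical, the sample document and its values are invented (only the tag names come from the listing above), and it assumes other stations follow the same URL pattern as KGNV.

```python
from xml.dom import minidom

# Invented sample data so the sketch runs without a network connection.
SAMPLE = """<current_observation>
  <location>Gainesville Regional Airport, FL</location>
  <observation_time>Last Updated on Sep 23, 2:53 pm EDT</observation_time>
  <temperature_string>84 F (29 C)</temperature_string>
</current_observation>"""

TAGS = ("location", "temperature_string", "observation_time")

def feed_url(station_id):
    """Build a feed URL, assuming every station uses the KGNV pattern."""
    return "http://www.weather.gov/xml/current_obs/%s.xml" % station_id

def read_observation(xml_text, tags=TAGS):
    """Map each tag name to the text of its first element, or None if absent."""
    doc = minidom.parseString(xml_text)
    result = {}
    for tag in tags:
        nodes = doc.getElementsByTagName(tag)
        if nodes and nodes[0].firstChild is not None:
            result[tag] = nodes[0].firstChild.nodeValue
        else:
            result[tag] = None
    return result

obs = read_observation(SAMPLE)
message = "It is %s at %s. %s" % (
    obs["temperature_string"], obs["location"], obs["observation_time"])
print(feed_url("KGNV"))
print(message)
```

With a live feed, the text passed to read_observation() would come from reading the handle that urlopen() returns for feed_url() and your station’s ID.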


2 Responses to “Parsing XML Feeds: Ridiculously straight-forward examples”

  1. Todd Fiske

    Hi Ken,

    I like your series of Python posts. I’m going to use your weather retrieval code as part of a script to create “dynamic” desktop wallpaper with current information, sort of like a low-end Samurize.

    A tip: instead of having people edit “YOUR-USER-NAME” in your scripts, you could include a little snippet based on something like this:

    # PathTest.py
    # use win32api to expand environment strings

    import win32api

    sEnv = r"%USERPROFILE%\Desktop"
    sExp = win32api.ExpandEnvironmentStrings(sEnv)

    print "[%s] -> [%s]" % (sEnv, sExp)

    Then it would work without changes for most users.

    Todd

  2. schwanksta

    Hi Todd,

    That is a very good tip; however, I want to try to make most code as platform-agnostic as possible wherever I can.

    To be honest, I don’t like hard-coding file names into scripts. I only do so in examples to keep things simple. I may do a blog post later on ways to avoid that, though — loading config files and fetching environment variables, mainly. For example,


    >>> import os
    >>> os.getlogin()
    'kschwen'

    works on OSX and Linux.

