First contact: Downloading from the Internet

19Sep08

Most people, when they decide they want to learn how to program or script, probably want to do something involving the Internet. At the very least, it’s a good way to show off the power of a scripting language like Python. You might be floored by how easy it is to download a file. When I came across this post on fetching a URL and downloading it to a file, a little light bulb went off above my head.

Let’s make something useful.

If you’ll reference the first post on automatically saving the clipboard in Windows, you might see where this is heading. Then again, maybe not. So let’s get to it.

We’re actually creating two scripts this time. The first is a modified version of the auto-save script, which will allow you to save a list of links to a file on your desktop (or wherever), called links.txt. The second, when run, will parse the links.txt file and download all of the files from the Internet. Once more, I’ll start with the full code for the first script:


import win32clipboard as w

w.OpenClipboard()
d=w.GetClipboardData(w.CF_TEXT)
w.CloseClipboard()

f = file("C:/Documents and Settings/YOUR-USER-NAME/Desktop/links.txt", "a")
f.write(d + "\n")  # add a newline so each saved link lands on its own line
f.close()

You should refer to the first post if you need help understanding this. A few changes: first, we no longer need to import datetime, since we don’t have to name the file with the current date and time. The second is the line where we open the file:


f = file("C:/Documents and Settings/YOUR-USER-NAME/Desktop/links.txt", "a")

I’m using the file() function here because I came across some information that, apparently, open() is an alias for file(). It’s a matter of preference, but I’d rather use the real function. The other change here, besides the different filename, is the “a” at the end. Previously, we used “w” because we were writing a new file; “a” stands for “append,” and will create the file if it doesn’t exist and otherwise let us keep adding new information to the end of it.
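If you want to see the difference between the two modes in isolation, here’s a quick sketch. The file path is just a throwaway temp file for demonstration, not part of the scripts above:

```python
import os
import tempfile

# A throwaway file, purely for demonstrating "w" versus "a".
path = os.path.join(tempfile.gettempdir(), "mode_demo.txt")

# "w" truncates: every open wipes the file and starts over.
f = open(path, "w")
f.write("first\n")
f.close()
f = open(path, "w")
f.write("second\n")
f.close()
f = open(path)
assert f.read() == "second\n"  # "first" is gone
f.close()

# "a" appends: each open adds to the end, creating the file if needed.
os.remove(path)
f = open(path, "a")
f.write("first\n")
f.close()
f = open(path, "a")
f.write("second\n")
f.close()
f = open(path)
assert f.read() == "first\nsecond\n"  # both lines survive
f.close()

os.remove(path)
```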

Here comes the second script:


from urllib import urlretrieve

f = file("C:/Documents and Settings/YOUR-USER-NAME/Desktop/links.txt", "r")
for n, link in enumerate(f.readlines()):
    # strip() trims the trailing newline that readlines() leaves on each link
    urlretrieve(link.strip(), "C:/Documents and Settings/YOUR-USER-NAME/Desktop/" + str(n) + ".html")
f.close()

That’s it. Five lines of code. Let’s break it down.

from urllib import urlretrieve

As before, this just gives us access to the urlretrieve() function in the urllib module. urlretrieve() downloads a URL to a temporary location unless you pass it a file path to save to, but we’ll get to that in a minute.

f = file("C:/Documents and Settings/YOUR-USER-NAME/Desktop/links.txt", "r")
for n, link in enumerate(f.readlines()):

The first line we should be familiar with by now; this time we pass file() (or open()) an “r” because we wish to read from a file.

The next line looks a little tricky, though. It’s a Python for loop, which lets you step through the items in a list (or another sequence) one at a time. The variables “n” and “link” store where we are in the list and the item at that position, respectively. Where is this list coming from? Well, that explanation will come in the next post (I had to split it up because it was veering off too much into Python syntax and data types).

Suffice it to say for now that enumerate() takes a list of some sort and hands back two values on each pass through the loop: a counter that increases by one each time (starting at 0) and the item that was originally at that position in the list. This way, we can keep track of where we are while looping. I only use it for naming the files here, but it can be useful in other ways.
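Here’s a tiny, self-contained illustration of what enumerate() hands you (the links are made up for the example):

```python
# A stand-in for the lines we'd read out of links.txt.
links = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

pairs = []
for n, link in enumerate(links):
    pairs.append((n, link))

# n counts up from 0; link is whatever sat at that position in the list.
assert pairs == [(0, "http://example.com/a"),
                 (1, "http://example.com/b"),
                 (2, "http://example.com/c")]
```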

Let’s move on though, shall we?
    urlretrieve(link.strip(), "C:/Documents and Settings/YOUR-USER-NAME/Desktop/" + str(n) + ".html")

Note the spacing before the function call. Python is whitespace-sensitive: the indentation tells it that urlretrieve() is within the scope of the for loop, so that line runs once for every link in the file.

In any case, this line just downloads the URL stored in the link variable (one of the lines from the file) and saves it to your desktop. The str(n) part converts “n,” the variable holding the current position in the list, to a string so we can use it as part of the file name.

After that, we simply f.close() the file to be good programmers. We don’t need to indent the file closing because we only want that executed once, when all of the looping is done with.

That’s it. Save the file somewhere, call it something like “autodownload.py,” and double-click it whenever you’ve stored up some things in links.txt you want to cache locally. Feel free to create a directory somewhere to store the files in, and tell the script to download things there. No need to clutter the desktop.

Now, you might catch something here: what happens if you get a new set of links and download them? Won’t the enumerating start over again, causing the earlier cached files to be overwritten? Good catch. If you want to plan for this sort of thing, you’ll need to create an md5 hash of each URL and save the file under that name instead.

An md5 hash turns any string into a fixed-length string of characters that is, for all practical purposes, unique to its input. It’s not much more work: just add import md5 to the top of the file, and replace the str(n) code with md5.new(link).hexdigest(). Now your filenames should never collide, unless you’re repeatedly downloading the same URL, in which case you probably want them to overwrite.

That leaves us with:

from urllib import urlretrieve
import md5

f = file("C:/Documents and Settings/YOUR-USER-NAME/Desktop/links.txt", "r")
for link in f.readlines():
    urlretrieve(link.strip(), "C:/Documents and Settings/YOUR-USER-NAME/Desktop/" + md5.new(link).hexdigest() + ".html")
f.close()

Note that I got rid of the “n” variable and the enumerate() function, because they were only there for naming the files.
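If you want to convince yourself the hashing scheme behaves as advertised, here’s a quick check. (One caveat for anyone on a newer Python: the standalone md5 module was folded into hashlib as of Python 2.5, so this sketch uses hashlib instead; the URLs are invented for the example.)

```python
import hashlib

# The same URL always hashes to the same name...
a1 = hashlib.md5(b"http://example.com/a").hexdigest()
a2 = hashlib.md5(b"http://example.com/a").hexdigest()
assert a1 == a2

# ...different URLs get different names...
b = hashlib.md5(b"http://example.com/b").hexdigest()
assert a1 != b

# ...and every digest is a fixed 32-character hex string,
# so the filenames stay tidy no matter how long the URL is.
assert len(a1) == len(b) == 32
```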

Toy with the code a bit, see what you can get it to do. Let me know how it works out for you! Check back in a day or two for the explanation of the enumerate() function.
