William J. Turkel and Alan MacEachern, The Programming Historian, 1st ed. NiCHE: Network in Canadian History & Environment (2007-08).
Making use of your ability to do close reading
From now on, you will be seeing more and more samples of code. Try to get into the habit of reading each one closely, the way that you would read a particularly important primary source. If there is something in the code that you haven't seen before or don't understand, try to make an explicit hypothesis about how it must work. Sometimes your hypothesis will be correct, and sometimes it won't, but it is much easier to make progress if you are mindful about your own assumptions. This is also the stance that you will need to take when you begin to debug code that doesn't work. One of the advantages that historians have when they turn to programming is that they are already in the habit of interrogating sources rather than taking them at face value.
Sending information to text files
In a previous section, you saw how to send information to the "Command Output" pane of Komodo Edit by using Python's print command.
print 'hello world'
The Python programming language is object-oriented. That is to say that it is constructed around a special kind of entity, an object, which contains both data and a number of methods for accessing and processing that data. In the example above, we see one kind of object, the string hello world. A string object is a sequence of characters; we'll learn more about string methods soon. Print is a command that prints objects in textual form.
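To make the idea of an object a little more concrete, here is a small sketch of a string being asked to do things with its own data. The variable names are made up for illustration, and print is written with parentheses only so that the sketch also runs on later versions of Python.

```python
# A string is an object: it holds data (the characters) and has methods.
greeting = 'hello world'

# A method is called by putting a dot after the object.
shouted = greeting.upper()    # a new string in capital letters
words = greeting.split()      # a list of the words in the string
length = len(greeting)        # the number of characters, the space included

print(shouted)
print(words)
print(length)
```

Note that upper and split do not change greeting itself; each one hands back a new object.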
You will use print like this in cases where you want to create information that you are going to act on right away. Sometimes, however, you will be creating information that you want to save, to send to someone else, or to use as input for further processing by another program or set of programs. In these cases you will want to send information to files on your hard drive rather than to the "Command Output" pane. Enter the following program into Komodo Edit and save it as file-output.py.
# file-output.py
f = open('helloworld.txt','w')
f.write('hello world')
f.close()
In this program f is a file object, open is the built-in function that creates it, and write and close are file methods. In the call to open, 'helloworld.txt' is the name of the file that you are going to create, and the 'w' parameter says that you are opening the file to write to it. Note that both the file name and the parameter are strings in this case. Your program writes the message (another string) to the file and then closes the file. (For more information about these statements, see the section on File Objects in the Python Library Reference.)
Double-click on your "Run Python" button to execute the program. Although nothing will be printed to the "Command Output" pane, you will see a status message that says
`/usr/bin/python file-output.py` returned 0.
on the Mac, or
'C:\Python25\Python.exe file-output.py' returned 0.
on Windows. This means that your program executed successfully. If you use File->Open->File in Komodo Edit, you can open the file helloworld.txt. It should contain your one-line message:
hello world
Since text files include a minimal amount of formatting information, they tend to be small, easy to exchange between different platforms (i.e., from Windows to Linux or Mac or vice versa), and easy to send from one computer program to another. They can usually also be read by people with a text editor like Komodo Edit.
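The write method sends exactly the characters you give it and nothing more, so a program that writes several lines has to supply the line breaks itself. Here is a minimal sketch along the lines of file-output.py. The file name two-lines.txt and the second message are made up for illustration.

```python
# write-lines.py (hypothetical file name)
# write() does not add a line break for you, so each line ends with '\n'.
f = open('two-lines.txt', 'w')    # 'w' creates the file, or empties an existing one
f.write('hello world\n')
f.write('hello again\n')
f.close()                         # closing makes sure everything is saved to disk
```

If you open two-lines.txt in a text editor afterwards, you should see the two messages on separate lines.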
Getting information from text files
Python also has statements which allow you to get information from files. Type the following program into Komodo Edit and save it as file-input.py. When you double-click "Run Python" to execute it, it will open the text file, read the one-line message from it, and print the message to the "Command Output" pane.
# file-input.py
f = open('helloworld.txt','r')
message = f.read()
print message
f.close()
In this case, the 'r' parameter is used to indicate that you are opening a file to read from it. Read is another file method. The contents of the file (the one-line message) are copied into message, which is a string, and then the print command is used to send the contents of message to the "Command Output" pane.
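If you want to convince yourself that what comes back from read is an ordinary string, you can write and read in the same program. The sketch below is one way to do that, assuming you are happy to have helloworld.txt overwritten. The file name round-trip.py is made up, and print uses parentheses so that the code also runs on later versions of Python.

```python
# round-trip.py (hypothetical file name)
# First create the file, so this sketch does not depend on an earlier program.
f = open('helloworld.txt', 'w')
f.write('hello world')
f.close()

# Now open the same file for reading and copy its contents into a string.
f = open('helloworld.txt', 'r')
message = f.read()
f.close()

print(message)
print(len(message))    # the string has as many characters as were written
```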
Splitting code into modules and functions
You often find that you want to re-use a particular set of statements, usually because you have a task that you need to do over and over. Suppose, for example, that you keep all of your bibliographic references in Zotero and you have a tag to indicate which ones you need to get on your next trip to the library. It would be useful to have a program that selected only those tagged items and sorted them by call number (so you don't have to waste time wandering from one part of the library to the next when you're retrieving them). Since this is part of your research practice, you'll want to be able to re-run this program before each trip to the library. A program, in other words, is a mechanism for bundling a collection of statements together to facilitate re-use. Zotero itself is a bundle of useful statements, as is Firefox.
When programs are small, they are typically stored in a single file. When you want to run one of your programs, you can simply send the file to the interpreter. As programs become larger, it makes sense to split them into separate files known as modules. In essence, this modularization allows programmers to re-use code for tasks that they have to do over and over. Below, for example, you'll see that commands for working with web pages have been put into a separate Python module. Python has a special import statement that allows one program to gain access to the contents of another program file. (As you work through the examples below, make sure that you understand the difference between loading a data file and importing a program file.)
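The difference between importing a program file and loading a data file can be seen in a few lines. This is only a sketch: it uses the os module from Python's standard library as the imported code, and the data file name and the path are made up for illustration. As elsewhere, print is given parentheses so the code also runs on later versions of Python.

```python
# Importing a program file: os is a module of Python code from the standard library.
import os

# Loading a data file: the contents arrive as a string, not as code.
f = open('sample-data.txt', 'w')    # hypothetical data file, created here
f.write('hello world')
f.close()

f = open('sample-data.txt', 'r')
data = f.read()
f.close()

# basename is a function defined inside the imported module.
name = os.path.basename('/home/historian/sample-data.txt')    # hypothetical path
print(name)
print(data)
```

After the import you can call the module's functions with the dot notation. After the read you simply have a string to work with.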
At a finer level of detail, programs are mostly composed of routines that are powerful and general-purpose enough to be reused. These are known as functions, and Python has mechanisms that allow you to define new functions. Let's work through a very simple example of a function and a module. Suppose you want to create a general purpose function for greeting people. Copy the following function definition into Komodo Edit and save it as greet.py. This file is your module.
# greet.py
def greetEntity(x):
    print "hello " + x
Note that indentation is very important in Python. The blank space before the print statement tells the interpreter that it is part of the function being defined. You will learn more about this as we go along; for now, make sure to keep indentation the way we show it.
Now you can create another program that imports code from your module and makes use of it. Copy this code to Komodo Edit and save it as using-greet.py. This file is your program.
# using-greet.py
import greet
greet.greetEntity("everybody")
greet.greetEntity("programming historian")
You can run your using-greet.py program with the Run Python command that you created in Komodo Edit. Note that you do not have to run your module... just the program that calls it. (Note that from this example and the previous ones, you might infer that strings in Python can be delimited with single or double quotes. That is true.) If all went well, you should see
hello everybody
hello programming historian
in the command output pane of Komodo Edit.
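A function does not have to print its result. It can hand the result back to whatever code called it, which makes the function easier to re-use. The following is a sketch of that variation, not part of greet.py; the name makeGreeting is made up for illustration, and print is written with parentheses so the code also runs on later versions of Python.

```python
# greet-return.py (hypothetical file name)
def makeGreeting(x):
    # Build the greeting and hand it back to the caller instead of printing it.
    return "hello " + x

greeting = makeGreeting("everybody")
print(greeting)

# Because the result is a string, the caller can keep working with it.
print(makeGreeting("programming historian").upper())
```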
You can think of the granularity of code in two ways:
Top-down. If you think of all the things that you want to use a computer for, you can decompose the problem into recurring sub-problems. You need to work with files (operating system), documents (word processor), numbers (spreadsheet), data (database), pictures (image processing program), web pages (browser) and so on. A particular program will need to be able to open, manipulate, and store files. You may want the ability to check your spelling in documents, e-mail or presentations. In order to check spelling, you need some kind of dictionary and the ability to look up each word in it. Looking up words involves being able to compare them character-by-character, and so on. Each task can be partitioned into smaller ones.
Bottom-up. Suppose you start with a simple task, like adding two numbers together (a+b). Once you know how to do that, you can generalize to adding any number of numbers together: (a+b)+c = (a+b+c). From addition you can get multiplication: (a*3) = (a+a+a). Being able to add numbers is so useful that it recurs constantly. Your operating system will need addition to determine how much file space is left on your hard drive. Your word processor will need it to keep track of word counts and page numbers. Your spreadsheet will need to do a lot of addition. Useful building blocks can be combined and recombined at every level of complexity.
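The bottom-up view can be sketched in code. Here a function that adds two numbers becomes the building block for adding a whole list of numbers and for multiplying. The function names are made up for illustration, and Python of course already has + and * built in, so this shows the idea rather than something you would need to write yourself.

```python
def add(a, b):
    # The basic building block: add two numbers.
    return a + b

def addAll(numbers):
    # Add any number of numbers by using add over and over: ((a+b)+c)+...
    total = 0
    for n in numbers:
        total = add(total, n)
    return total

def multiply(a, times):
    # a*3 is a+a+a, so multiplication can be built from repeated addition.
    total = 0
    for i in range(times):
        total = add(total, a)
    return total

print(addAll([1, 2, 3]))
print(multiply(4, 3))
```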
About URLs
A web page is a file that is stored on another computer, a machine known as a web server. When you 'go to' a web page, what is actually happening is that your computer, the client, sends a request to the server out over the network, and the server replies by sending a copy of the page back to your machine. One way to get to a web page with your browser is to follow a link from somewhere else. You also have the ability, of course, to paste or type a Uniform Resource Locator (URL) into your browser's address bar. The URL tells your browser where to find an online resource by specifying the server, directory and name of the file to be retrieved, as well as the kind of protocol that the server and your browser will agree to use while exchanging information (like HTTP, the Hypertext Transfer Protocol). The basic structure of a URL is
protocol://host:port/path?query
Let's look at a few examples.
http://niche-canada.org
The most basic kind of URL simply specifies the protocol and host. If you give this URL to your browser, it will return the main page of the NiCHE website. The default assumption is that the main page in a given directory will be named index, usually index.html. The NiCHE website is written in a different language than HTML, however, so the name of the main page is index.php. (PHP is another web programming language. If you'd like to learn more about it, there is a W3 Schools tutorial.)
The URL can also include an optional port number. Without getting into too much detail at this point, the network protocol that underlies the exchange of information on the internet allows computers to connect in different ways. Port numbers are used to distinguish these different kinds of connection. Since the default port for HTTP is 80, the following URL is equivalent to the previous one.
http://niche-canada.org:80
As you know, there are usually many web pages on a given website. These are stored in directories on the server, and you can specify the path to a particular page. The table of contents for this book has the following URL. Note that we don't need to specify the filename.
http://niche-canada.org/programming-historian/
Finally, some web pages allow you to enter queries. The NiCHE website, for example, is laid out in such a way that you can request a particular page within it by using a query string. The following URL will take you to the main page for the NiCHE Digital Infrastructure.
http://niche-canada.org/?q=node/12
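Python can take a URL apart into the same pieces. The sketch below uses the urlparse function from the standard library on the four example URLs above. The try/except around the import is an assumption made so that the code runs both on Python 2, where the function lives in the urlparse module, and on later versions, where it lives in urllib.parse. For the same reason print is written with parentheses.

```python
try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # the same function in later versions

examples = [
    'http://niche-canada.org',
    'http://niche-canada.org:80',
    'http://niche-canada.org/programming-historian/',
    'http://niche-canada.org/?q=node/12',
]

for url in examples:
    parts = urlparse(url)
    print(url)
    print('  protocol: ' + parts.scheme)
    print('  host:     ' + parts.hostname)
    print('  port:     ' + str(parts.port))    # None when the URL leaves it out
    print('  path:     ' + parts.path)
    print('  query:    ' + parts.query)
```

Note that urlparse only reports what is actually written in the URL. It does not fill in the default port, so the port is None for every example except the second.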
Opening URLs with Python
In order to be able to automatically harvest and process web pages, you're going to need to be able to open URLs with your own programs. The Python language includes a number of standard ways to do this.
As an example, let's work with the kind of file that you might encounter while doing historical research. Say you're interested in Adam Dollard Des Ormeaux (1635-60), a controversial figure in Canadian historiography. With Google, it's easy to locate his biographical entry in the online Dictionary of Canadian Biography.
IMPORTANT NOTE: The DCB website was updated at the end of June 2008, and can no longer be used for the example code we have here. As a temporary solution, we've changed our code to link to a few files on the NiCHE server that have the same formatting as the DCB site used to have. When we get a chance, we'll rewrite the sections so they are compatible with the new online DCB. In the meantime, please e-mail us if you find something that doesn't work!
Adam Dollard Des Ormeaux's biography in DCB
The URL for the main entry is (i.e., used to be... just play along)
http://www.biographi.ca/EN/ShowBio.asp?BioId=34298
By studying the URL we can learn a few things. First, the DCB website uses Microsoft Active Server Pages (ASP) and it's possible to retrieve individual biographies by making use of the query string. Each is apparently given a unique 5-digit ID number. From the presence of EN in the path (presumably standing for the English language entries), we can infer there might be a corresponding French-language entry at
http://www.biographi.ca/FR/ShowBio.asp?BioId=34298
In fact, this inference is correct. Looking at the webpage, we also notice that there is a printable version of the page.
Printable biography of Adam Dollard Des Ormeaux in DCB
Its URL is
http://www.biographi.ca/EN/ShowBioPrintable.asp?BioId=34298
When you are processing web resources automatically, it is often a good idea to work with printable versions, as they tend to have less formatting.
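Once you have spotted the pattern in a set of URLs, you can have a program build them for you. The following sketch puts the three DCB URLs above together from their parts. The function name bioUrl is made up for illustration, and the function only builds strings; it does not check that a page exists at the address (and, as noted above, these particular addresses no longer work). The code is written so that it runs on Python 2 and on later versions.

```python
def bioUrl(language, bioId, printable=False):
    # Choose the page name seen in the example URLs.
    page = 'ShowBio.asp'
    if printable:
        page = 'ShowBioPrintable.asp'
    # str() turns the ID number into a string so that it can be joined with +.
    return 'http://www.biographi.ca/' + language + '/' + page + '?BioId=' + str(bioId)

print(bioUrl('EN', 34298))
print(bioUrl('FR', 34298))
print(bioUrl('EN', 34298, printable=True))
```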
Now let's try opening the printable version of the page. Copy the following program into Komodo Edit and save it as open-html.py. When you execute it, it will open the biography file, read its contents into a Python string called html and then print the first three hundred characters of the string to the "Command Output" pane. Use the View->Page Source command in Firefox to verify that the HTML source of the page is the same as the source that your program retrieved. (See the Python library reference to learn more about urllib2.)
# open-html.py
import urllib2

# note that we have to grab a copy of the old page since the new DCB website doesn't work the way it used to
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'

response = urllib2.urlopen(url)
html = response.read()

print html[0:300]
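The expression html[0:300] is called a slice. It asks for the characters of the string starting at position 0 and stopping just before position 300. You can try slicing without going online at all. In the sketch below the string is a short made-up page, not the DCB biography, and print is written with parentheses so the code also runs on later versions of Python.

```python
# A short made-up page stands in for the one that urllib2 would fetch.
html = '<html><body><p>A short made-up page.</p></body></html>'

first = html[0:6]      # characters 0, 1, 2, 3, 4 and 5
same = html[:6]        # leaving out the start means "from the beginning"
lots = html[0:300]     # asking for more than there is just gives the whole string

print(first)
print(same)
print(len(lots))
```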
Saving a local copy of a web page
Given what you already know about writing to files, it is quite easy to modify the above program so that it writes the contents of the html string to a local file rather than the "Command Output" pane. Copy the following program into Komodo Edit, save it as save-html.py and execute it. Using the File->Open File command in Firefox, open the local file that it creates (dcb-34298.html) to confirm that your saved copy is the same as the online copy.
# save-html.py
import urllib2

# note that we have to grab a copy of the old page since the new DCB website doesn't work the way it used to
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'

response = urllib2.urlopen(url)
html = response.read()

f = open('dcb-34298.html', 'w')
f.write(html)
f.close()
So, if you can save a single file this easily, could you write a program to download a bunch of files? Could you step through biography ID numbers, for example, and make your own copies of a whole bunch of them? Yep. We'll get there soon.
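One small piece of such a program can already be sketched: working out which URL to ask for and what to call the local copy. The code below does only that bookkeeping and downloads nothing. The function names are made up, and so are the second and third ID numbers, which may not correspond to any file on the NiCHE server. As in other sketches, print has parentheses so the code also runs on later versions of Python.

```python
# Only 34298 comes from the example above; the other two IDs are made up.
ids = [34298, 34299, 34300]

def pageUrl(bioId):
    # Follow the pattern of the URL used in save-html.py.
    return 'http://niche-canada.org/files/dcb/dcb-' + str(bioId) + '.html'

def localName(url):
    # The local file name is whatever follows the last slash in the URL.
    return url.split('/')[-1]

for bioId in ids:
    url = pageUrl(bioId)
    print(url + ' -> ' + localName(url))
```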
Suggested Readings
Lutz, Learning Python
Ch. 4: Introducing Python Object Types
