William J. Turkel and Alan MacEachern, The Programming Historian, 1st ed. NiCHE: Network in Canadian History & Environment (2007-08).
Putting new information where you can use it
At this point, you've started to learn how to use Python to download online sources and extract information from them automatically. Remember that your ultimate goal is to incorporate programming seamlessly into your historical practice. Since you are already using Firefox and Zotero to find and keep track of your sources, it also makes sense to use these programs to keep track of any new information that you create. The easiest way to do this is to have your Python programs output local web pages that you can read in Firefox and index and annotate with Zotero. We turn to that now, starting with a discussion of some more of the things that you can do with Python strings.
Python string formatting
Python includes a special formatting operator that allows you to interpolate one string in another one. It is represented by a percent sign. Open a Python shell and try the following examples.
frame = 'This is a %s'
print frame
-> This is a %s
print frame % 'cat'
-> This is a cat
print frame % 'dog'
-> This is a dog
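There is also a form which allows you to interpolate several strings into another one. The strings to be embedded are supplied together in parentheses (a tuple), in the order in which they should appear.
frame2 = 'These are %s and %s'
print frame2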
-> These are %s and %s
print frame2 % ('cats', 'dogs')
-> These are cats and dogs
In these examples, a %s in one string indicates that another string is going to be embedded at that point. There is a range of other string formatting codes, most of which allow you to embed numbers in strings in various formats, like %i for integer, %f for floating-point decimal, and so on. We will introduce these later as necessary.
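As a brief preview of the numeric codes, here is a minimal sketch. The variable names and values are illustrative assumptions, not part of the lesson. It uses print with parentheses around a single value, which works in the Python 2 shell used here and also in Python 3.

```python
# Sketch of the numeric formatting codes mentioned above.
# All names and values here are illustrative, not from the lesson.
count = 42
ratio = 3.0 / 50

line1 = 'There are %i words' % count                  # %i embeds an integer
line2 = 'Ratio: %f' % ratio                           # %f embeds a float (six decimal places by default)
line3 = 'Ratio: %.2f' % ratio                         # .2 limits the float to two decimal places
line4 = '%s appears %i times' % ('dollard', count)    # codes can be mixed in one tuple
line5 = '%i%% of hits' % 6                            # %% produces a literal percent sign

for line in (line1, line2, line3, line4, line5):
    print(line)
```

Note that a literal percent sign inside a format string has to be written as %%, since a single % introduces a formatting code.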
Creating HTML output
One of the more powerful ideas in computer science is that something that is code from one perspective can be seen as data from another. It's possible, in other words, to write programs that manipulate other programs. The Python interpreter is one example. What we're going to do next is combine Python files, multiline block strings and simple HTML tags to create a Python program which outputs an HTML file. Note that we are writing to a file with an .html extension rather than a .txt extension.
# write-html.py

f = open('helloworld.html','w')

message = """<html>
<head></head>
<body>Hello World!</body>
</html>"""

f.write(message)
f.close()
Save this program as write-html.py and execute it. Use File->Open->File in Komodo Edit to open helloworld.html to verify that your program actually created the file. It should look something like this:
[Figure: "Hello World" HTML source generated by Python program]
Now go to your Firefox browser and choose File->New Tab, go to the tab, and choose File->Open File. Select helloworld.html. You should now be able to see your message in the browser.
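The two ideas introduced so far, string formatting and writing HTML files, can be combined. The following is a minimal sketch rather than part of the lesson's code: the names page and greeting and the file name helloworld-sketch.html are assumptions chosen for illustration, and the with statement assumes Python 2.6 or later.

```python
# Sketch: fill an HTML frame with the % operator, then write it out.
# 'page', 'greeting' and the file name are illustrative assumptions.
page = """<html>
<head></head>
<body>%s</body>
</html>"""

greeting = 'Hello World!'

# The with statement closes the file automatically,
# even if an error occurs while writing.
with open('helloworld-sketch.html', 'w') as f:
    f.write(page % greeting)

# Read the file back to confirm what was written.
with open('helloworld-sketch.html') as f:
    contents = f.read()

print(contents)
```

Keeping the HTML frame separate from the message means that the same frame can be reused for any string you want to display.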
Sending HTML output to Firefox
We automatically created an HTML file, but then we had to leave Komodo Edit and go to Firefox to open the file in a new tab. Wouldn't it be cool to have our Python program include that final step? Enter the code below into Komodo Edit and save it as write-html-2.py. When you execute it, it should create your HTML file and then automatically open it in a new tab in Firefox. Sweet!
# write-html-2.py

import webbrowser

f = open('helloworld.html','w')

message = """<html>
<head></head>
<body>Hello World!</body>
</html>"""

f.write(message)
f.close()

webbrowser.open_new_tab('helloworld.html')
N.B. Some people couldn't get this to work on Mac OS X (because of the different ways that the various operating systems handle web browsers). If you are getting a bunch of "MacOS.Error -673" messages, you have to comment out two lines in the function wrapStringInHTML defined below.
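One thing that may be worth trying first: the documentation for Python's webbrowser module notes that passing a bare file name, as write-html-2.py does, is neither supported nor portable. The sketch below converts the file name into a file: URL before handing it to the browser. The helper name fileNameToUrl is hypothetical, and we have not confirmed that this avoids the Mac OS X error described above.

```python
import os

try:
    from urllib.request import pathname2url   # Python 3
except ImportError:
    from urllib import pathname2url           # Python 2

def fileNameToUrl(filename):
    # Hypothetical helper: turn a local file name into a file: URL
    # built from the file's absolute path.
    return 'file:' + pathname2url(os.path.abspath(filename))

url = fileNameToUrl('helloworld.html')
print(url)

# To open the page in a new browser tab, uncomment the next two lines:
# import webbrowser
# webbrowser.open_new_tab(url)
```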
Self-documenting data files
The distinction between data and metadata is crucial to information science. Metadata are data about data. This concept should already be very familiar to you, even if you haven't heard the term before. Consider a traditional book. If we take the text of the book to be the data, there are a number of other characteristics which are associated with that text, but which may or may not be explicitly printed in the book. The title of the work, the author, the publisher, and the place and date of publication are metadata that are typically printed in the work. The place and date of writing, the name of the copy editor, Library of Congress cataloging data, and the name of the font used to typeset the book are sometimes printed in it. The person who purchased a particular copy may or may not write their name in the book. If the book belongs in the collection of a library, that library will keep additional metadata, only some of which will be physically attached to the book. The record of borrowing, for example, is usually kept in some kind of database and linked to the book by a unique identifier. Libraries, archives and museums all have elaborate systems in place to generate and keep track of metadata.
When you're working with digital data, it is a good idea to incorporate metadata into your files whenever possible. In later sections, we will work with the Extensible Markup Language (XML), which is ideal for this purpose. For now, however, we need to develop a few basic strategies for making our data files self-documenting.
Python comments
You've already seen one example of this: the first line of write-html.py. In Python, any text that begins with a hash mark and runs to the end of the line (outside of a string) is known as a comment and is ignored by the Python interpreter. Comments are intended to allow programmers to communicate with one another. In a larger sense, programs themselves are typically written and formatted in a way that makes it easier for programmers to communicate with one another. Code that is closer to the requirements of the machine is referred to as low-level; code that is closer to natural language is high-level. One of the benefits of using a language like Python is that it is very high-level, making it easier for us to communicate with you (at some cost in terms of computational efficiency).
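Here is a small sketch of comments used as metadata for a program. The file name word-count-sketch.py and the function countWords are made up for illustration and are not part of the dh.py module.

```python
# word-count-sketch.py
# Purpose: count the words in a short string.
# (This file name and function are illustrative, not part of dh.py.)

def countWords(text):
    # split() with no arguments breaks the string on whitespace
    words = text.split()
    return len(words)   # a comment can also follow code on the same line

print(countWords('this is a test'))
```

The interpreter ignores every comment above; they are there only for the person reading the file.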
Building an HTML wrapper
You've just learned how to embed a message like "Hello World!" in HTML tags, write the result to a file and open it automatically in the browser. A program that puts formatting codes around something so that it can be used by another program is called a wrapper. What we're going to do now is develop an HTML wrapper for the output of our code that computes word frequencies.
Let's bundle some of the code that we've already written into functions. One of these will take a URL and return a string of lowercase text from the web page. Copy this into the dh.py module.
# Given a URL, return string of lowercase text from page.

def webPageToText(url):
    import urllib2
    response = urllib2.urlopen(url)
    html = response.read()
    # replace HTML non-breaking space entities with ordinary spaces
    text = stripTags(html).replace('&nbsp;', ' ')
    return text.lower()
We're also going to want a function that takes a string of any sort and makes it the body of an HTML file which is opened automatically in Firefox. This function should include some basic metadata, like the time and date that it was created and the name of the program that created it. Study the following code carefully, then copy it into the dh.py module.
N.B. If you are using Mac OS X and you were unable to run the program write-html-2.py above, then you have to comment out two lines in the following program by putting a hash mark in front of each one:
...
# from webbrowser import open_new_tab
...
# open_new_tab(filename)
...
Once you've made the changes, you can copy the code into the dh.py module. Please e-mail us if this fix doesn't work for you.
# Given name of calling program, a url and a string to wrap,
# output string in HTML body with basic metadata
# and open in Firefox tab.

def wrapStringInHTML(program, url, body):
    import datetime
    from webbrowser import open_new_tab

    now = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
    filename = program + '.html'
    f = open(filename,'w')

    wrapper = """<html>
    <head>
    <title>%s output - %s</title>
    </head>
    <body><p>URL: <a href=\"%s\">%s</a></p><p>%s</p></body>
    </html>"""

    whole = wrapper % (program, now, url, url, body)
    f.write(whole)
    f.close()

    open_new_tab(filename)
Note that this function makes use of the string formatting operator that you learned about. It also calls the Python datetime library to determine the current time and date. This metadata, along with the name of the program that called the function, is stored in the HTML title tag. The HTML file that is created has the same name as the Python program that creates it, but with an .html extension rather than a .py one.
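To see what the metadata looks like, here is a minimal sketch of the timestamp and file name steps on their own. It uses a fixed date and time (an assumption made so that the result is repeatable) where wrapStringInHTML uses the current moment.

```python
import datetime

# A fixed moment is used here so the result is repeatable;
# wrapStringInHTML calls datetime.datetime.today() instead.
moment = datetime.datetime(2008, 1, 7, 14, 30, 5)

# %Y%m%d-%H%M%S means year, month, day, a hyphen, then hour, minute, second
stamp = moment.strftime("%Y%m%d-%H%M%S")

program = 'html-to-freq-3'      # name passed in by the calling program
filename = program + '.html'    # output file takes its name from the program
title = '%s output - %s' % (program, stamp)

print(stamp)
print(filename)
print(title)
```

Because the timestamp runs from the largest unit (year) to the smallest (second), sorting such stamps alphabetically also sorts them chronologically.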
Putting it all together
Now we can create another version of our program to compute frequencies. Instead of sending its output to the "Command Output" pane in Komodo, it sends it to an HTML file which is opened in a new Firefox tab. From there, the program's output can be added easily to Zotero. Copy the following code to Komodo Edit, save it as html-to-freq-3.py and execute it, to confirm that it works as expected.
# html-to-freq-3.py
import dh
# create sorted dictionary of word-frequency pairs
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
text = dh.webPageToText(url)
fullwordlist = dh.stripNonAlphaNum(text)
wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
dictionary = dh.wordListToFreqDict(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
# compile dictionary into string and wrap with HTML
outstring = ""
for s in sorteddict:
    outstring += str(s)
    outstring += "<br />"

dh.wrapStringInHTML("html-to-freq-3", url, outstring)
Note that we interspersed our word-frequency pairs with the HTML break tag, which acts as a newline. If all went well, you should see the same word frequencies that you computed in the last section.
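If you want to see exactly what string the loop builds without downloading anything, the following sketch may help. The list sample is made up and stands in for sorteddict. The second version, which uses the string method join, is an alternative offered here for comparison; it is not the form used in the lesson.

```python
# Sketch with made-up (frequency, word) pairs standing in for sorteddict.
sample = [(5, 'dollard'), (3, 'iroquois'), (2, 'sault')]

# Loop version, as in html-to-freq-3.py
outstring = ""
for s in sample:
    outstring += str(s)
    outstring += "<br />"

# Alternative version: build each piece, then join them with an empty delimiter
joined = "".join([str(s) + "<br />" for s in sample])

print(outstring)
print(outstring == joined)
```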
Using word frequencies to refine a Google search
Let's go through one more cycle of refinement. Start by doing a Google search for "dollard" and counting the number of hits on the first five pages that actually refer to Adam Dollard Des Ormeaux. When we tried this on 7 Jan 2008, we ended up with three out of fifty, or six percent.
Now try doing a Google search for "dollard iroquois long sault enemy" and counting the number of hits that refer to Adam Dollard Des Ormeaux. When we tried this on 7 Jan 2008, we ended up with fifty out of fifty, or one hundred percent. So by using the words that are most characteristic of this text, we can easily find others like it. Wouldn't it be great to do this automatically?
Look at the URL of the Google search that you just did. It should begin with something like
http://www.google.com/search?q=dollard+iroquois+long+sault+enemy
Suppose we choose some small number n. If we take the top n keywords from our word frequency list, we can construct a query like this automatically and build it into a link that we display on our wrapped results page.
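The idea can be sketched in a few lines. The list sample below is made up and stands in for the sorted word-frequency list; the frequencies are invented for illustration.

```python
# Made-up (frequency, word) pairs standing in for the sorted frequency list
sample = [(12, 'dollard'), (9, 'iroquois'), (7, 'long'),
          (6, 'sault'), (5, 'enemy'), (4, 'french')]

n = 5

# Keep only the word from each of the first n pairs
keywords = [word for (freq, word) in sample[:n]]

# Google separates search terms with plus signs in the query string
query = 'http://www.google.com/search?q=' + '+'.join(keywords)

print(query)
```

Changing n changes how specific the search is: a larger n adds more of the text's characteristic words to the query.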
The basic form of a hyperlink in HTML is
<a href="URL">link text</a>
We want to build up the URL for a Google search automatically, then embed it in an HTML a tag like the one above. Study the following function then add it to the dh.py module.
# Given a list of keywords and a link name, return an
# HTML link to a Google search for those terms.

def keywordListToGoogleSearchLink(keywords, linkname):
    gsearch = '<a style=\"text-decoration:none\" '
    gsearch += 'href=\"http://www.google.com/search?q='
    gsearch += '+'.join(keywords)
    gsearch += '\">'
    gsearch += linkname
    gsearch += '</a>'
    return gsearch
Note that we've added a bit of inline CSS to the HTML a tag to prevent the browser from underlining hyperlinks automatically. We'll learn more about CSS (Cascading Style Sheets) later; for now this will make the output of the next couple of programs that we write more legible.
(There is one thing about this code that is somewhat counterintuitive. In order to create a string from a list, you call a string method join on a string consisting of the delimiter that you want to use between list elements. The delimiter is a plus sign in our case, since we're building the query string of a URL. Many people expect join to be a list method, but it isn't.)
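A short sketch of join in action follows. It also shows an optional refinement that the function above does not include: escaping each keyword with quote_plus from Python's standard urllib library, on the assumption that a keyword might contain a space or a character such as an ampersand that has a special meaning in a URL. The keyword list is made up for illustration.

```python
try:
    from urllib.parse import quote_plus   # Python 3
except ImportError:
    from urllib import quote_plus         # Python 2

# Made-up keywords, two of which would cause trouble in a URL
keywords = ['long sault', 'dollard', 'q&a']

# join is called on the delimiter string, not on the list
plain = '+'.join(keywords)

# Escaping each keyword first keeps spaces and '&' from breaking the query
escaped = '+'.join([quote_plus(k) for k in keywords])

print(plain)
print(escaped)
```

For the single lowercase words produced by our frequency code, the two versions give the same result, so the simpler form in the function above is sufficient there.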
You can test the keywordListToGoogleSearchLink function in a Python shell if you'd like. Copy the function definition, paste it into the shell and press Enter. Then you can do something like the following:
testwords = ('this', 'is', 'a', 'test')
print keywordListToGoogleSearchLink(testwords, "Do Google Search")
-> <a style="text-decoration:none" href="http://www.google.com/search?q=this+is+a+test">Do Google Search</a>
Now we can revise our code to include this automatically-constructed Google search link. Copy the following to Komodo Edit, save it as html-to-freq-4.py and execute it.
# html-to-freq-4.py

import dh

# create sorted dictionary of word-frequency pairs
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
text = dh.webPageToText(url)
fullwordlist = dh.stripNonAlphaNum(text)
wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
dictionary = dh.wordListToFreqDict(wordlist)
sorteddict = dh.sortFreqDict(dictionary)

# create Google search link
keywords = []
for k in sorteddict[0:5]:
    keywords.append(str(k[1]))
gsearch = dh.keywordListToGoogleSearchLink(keywords, 'Google Search n=5')

# compile dictionary into string and wrap with HTML
outstring = gsearch + "<br /><br />"
for s in sorteddict:
    outstring += str(s)
    outstring += "<br />"

dh.wrapStringInHTML("html-to-freq-4", url, outstring)
When you try this program, you will see that there is now a hyperlink in the output that you can follow to submit the refined search to Google automatically. As an exercise, try modifying your script to process the biography of the explorer Pierre-Esprit Radisson (1640-1710). You can see that the ability to automatically generate refined searches can make it much easier to find things that are relevant to your work.
Suggested Readings
Lutz, Learning Python
Re-read and review Chs. 1-17
