
Harvesting Links and Downloading Pages

The idea of text mining

The programs that we've written so far take a single web page or text as their input and do some processing that would be time-consuming to do by hand. Since historians are used to reading a lot, this may have seemed like a questionable exercise... wouldn't it be faster to read Dollard's biography, for example, than spend time writing programs to manipulate it? Probably. The real payoff doesn't come until you have enough text that it would take a (very) long time to read or skim through it. In this section we will start to make the transition to working with collections rather than individual texts. When you have a group of texts, you can do a number of different kinds of analysis which fall under the general heading of text mining. These include

We will work through a number of text mining projects. To get started, we need to create a potentially interesting but manageable collection to work with.

Selecting a group of biographies

IMPORTANT: As of June 2008, the online DCB site no longer looks or works the way that it used to. The code in this section won't work, although the basic technique is still valid. As a temporary solution, we've put some files with the old formatting on the NiCHE server. Using techniques you learned earlier, save a copy of the following files to a directory called iroquois on your own machine. At that point, you should read through this section to get what you can out of it before moving on to the next one.

NiCHE copy of DCB iroquois files

In our work with Dollard's biography, we repeatedly came across references to Iroquois people. Let's assume that we want to use our programming ability to learn more about them, at least as portrayed in the Dictionary of Canadian Biography. Start by going to the Advanced Search page. For the "Date Range of Death," choose 1000-1700 (Volume I). When you click on this link, some JavaScript code reloads the page with the parameter set. If you look on the right-hand side of the page, you will see that there are 592 biographies in Volume I. Type iroquois into the search box and press Go. This will return 167 biographies on 11 separate pages. It is going to be easier for us to automatically download these entries if we can get them all onto the same page. Fortunately, there is a "Show all results" link. Click it. If all went well, you should have a web page of biographies from Volume I of the DCB that contain the word 'iroquois'. It should look something like the following.

Links to bios containing "iroquois" in Volume 1 of the Dictionary of Canadian Biography

In Firefox, choose File->Save Page As and save this page as dcb-v01-iroquois.html using the "Web Page, HTML only" option. Now you should be able to use File->Open File in Firefox to make sure that you got a copy of the page. It should look like this:

Local copy of the DCB v.1 biographies containing "iroquois"

Note that your local copy of the page is missing the images that the online version used for formatting. If you had saved a complete copy of the page, rather than HTML only, all of those images would have been downloaded to a directory on your machine. You don't need them, however.

Extracting hyperlinks with Beautiful Soup

Our next task is to extract all of the hyperlinks from the saved copy of the web page, so we can then write a routine to download each biography automatically. Recall that HTML has a hierarchical structure. The problem of taking apart a structured representation in an orderly way is known as parsing. When the rules by which the structure was created are well-known and inflexible, parsing is easier. When there are a lot of exceptions or errors, parsing becomes more difficult and programmers sometimes turn to scraping instead. Rather than taking the structure apart, scraping relies on regular expression pattern matching to pull meaningful strings out of an undifferentiated mass. We're going to do a little of both.
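To make the distinction concrete, here is a minimal sketch that does a little of both on a tiny, made-up fragment of HTML. It is written for Python 3 and uses only the standard library (html.parser and re) rather than Beautiful Soup. The names SAMPLE_HTML and LinkCollector are our own inventions for the example.

```python
import re
from html.parser import HTMLParser

# A made-up fragment in the style of the search results page (an assumption).
SAMPLE_HTML = '''<html><body>
<a name="TOP"></a>
<a href="../FR/index.html">FR</a>
<a href="ShowBio.asp?BioId=34298&query=iroquois">DOLLARD DES ORMEAUX, ADAM</a>
<a href="ShowBio.asp?BioId=34590&query=iroquois">PIESKARET, Simon</a>
</body></html>'''


class LinkCollector(HTMLParser):
    """Parsing: walk the tag structure and keep the href of every a tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
parsed_hrefs = collector.hrefs

# Scraping: ignore the structure and pull out strings that match a pattern.
scraped_links = re.findall(r'ShowBio\.asp\?BioId=\d{5}', SAMPLE_HTML)

print(parsed_hrefs)
print(scraped_links)
```

The parser returns every hyperlink, including the navigation link that we don't care about. The regular expression returns only the strings that look like biography links, but it knows nothing about which tag they came from.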

Later you'll learn more about how to create your own parsers. For right now, we're going to use a Python library called Beautiful Soup. If you haven't already installed it, download this package to your machine and save it in the C:\Python25\Lib directory.

To parse out all of the HTML a tags, we first load our local copy of the web page into a string:

# load search results from saved file into string
searchresultfile = 'dcb-v01-iroquois.html'
f = open(searchresultfile, 'r')
searchresulthtml = f.read()
f.close()

We then call Beautiful Soup to extract all of the a tags into a list. This version of the Python import statement allows us to load the part of the library that we need to parse HTML, without loading the part that parses XML.

from BeautifulSoup import BeautifulSoup
 
# parse search results file to extract hyperlinks
searchresultsoup = BeautifulSoup(searchresulthtml)
linklist = searchresultsoup.findAll('a')
 
for link in linklist: print link

If you were to copy this code to Komodo and execute it, you would find that there are a number of links in the page that aren't of interest to us, in addition to ones that are:

<a name="TOP"></a>
<a href="../FR/index.html">
<img name="topnav_e_r1_c1" src="images/nav/topnav_e_r1_c1.jpg"
width="89" height="14" border="0" alt="Français" /></a>
<a href="mailto:WebServices@lac-bac.gc.ca?Subject=www.biographi.ca">
<img name="topnav_e_r1_c3" src="images/nav/topnav_e_r1_c3.jpg"
width="89" height="14" border="0" alt="Contact Us" /></a>
...
<a href="ShowBio.asp?BioId=34298&query=iroquois">DOLLARD DES ORMEAUX, ADAM</a>
<a href="ShowBio.asp?BioId=34146&query=iroquois">ANNAOTAHA, Étienne</a>
<a href="ShowBio.asp?BioId=34590&query=iroquois">PIESKARET, Simon</a>
...

The links that we want to process all have the form

<a href="ShowBio.asp?BioId=   BIOID   &query=iroquois">   BIONAME   </a>

where BIOID is a five digit number and BIONAME is the person whose biography it is.

Scraping with regular expressions

We now want to go through each link in linklist and use a regular expression to see if it matches the form shown above. Let's start by trying to match a five digit number. Open a Python shell so you can try the following expressions. Note that \d matches a single digit. Adding a number in curly braces after a pattern matches that many copies of it. So \d{5} matches a string of five digits in a row. The search method finds a match if one exists.

import re
digitpattern = re.compile(r'\d{5}')
print digitpattern.search('abc')

-> None

 

print digitpattern.search('123')

-> None

 

print digitpattern.search('123456')

-> <_sre.SRE_Match object at 0x0632F5D0>

Note the weird return value when it does find a match. What we really want to return is the match itself. For that, we use the group(0) method. This will make more sense in a minute.

print digitpattern.search('123456').group(0)

-> 12345

Remember that regular expressions find matching patterns in a larger string. When you want to return the part that matches, but not any of the extraneous material, you put parentheses around the matching part. These matching parts are known as groups. Study the following examples, keeping in mind that .* stands for zero or more copies of any single character. What would each of the two expressions return if teststring were 'junkjunkjunk'? Try this in the shell to make sure you understand what is going on.

digitpattern2 = re.compile(r'.*(\d{5}).*')
teststring = 'junk12345junk'
print digitpattern2.search(teststring).group(0)

-> junk12345junk

 

print digitpattern2.search(teststring).group(1)

-> 12345

Grouping a regular expression with parentheses like this allows us to indicate which part of a matching string is important to us. To match the whole thing, we use group(0). The first part that is in parentheses is group(1), the second part group(2), and so on. Since we can have multiple groups, we are able to match both the BIOID and BIONAME. (Note that the double quotation marks within our test strings are preceded by backslashes. Because the strings are delimited by single quotes, these escapes are optional, but they are harmless.)

linkpattern = re.compile(r'(\d{5}).*>(.*)<', re.UNICODE)
print linkpattern.search('<a name=\"TOP\"></a>')

-> None

 

testurl = '<a href=\"ShowBio.asp?BioId=34590&query=iroquois\">PIESKARET, Simon</a>'
print linkpattern.search(testurl).group(1)

-> 34590

 

print linkpattern.search(testurl).group(2)

-> PIESKARET, Simon
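One thing to keep in mind when combining .* with groups is that .* is greedy: it grabs as many characters as it can while still letting the rest of the pattern match. The following sketch (runnable in Python 3) uses a made-up test string with seven digits to show how this affects what ends up in group(1). Adding a question mark, as in .*?, makes the pattern match as few characters as possible instead.

```python
import re

# A made-up test string with seven digits in a row (an assumption).
teststring = 'junk1234567junk'

greedy = re.compile(r'.*(\d{5}).*')   # leading .* takes as much as it can
lazy = re.compile(r'.*?(\d{5}).*')    # leading .*? takes as little as it can

greedy_digits = greedy.search(teststring).group(1)
lazy_digits = lazy.search(teststring).group(1)

print(greedy_digits)  # -> 34567
print(lazy_digits)    # -> 12345
```

Our linkpattern is not affected by this, because each BIOID in the search results is exactly five digits long.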

If you'd like to learn more about Python regular expressions, A. M. Kuchling has written a good tutorial.

Working with accented characters

If you only work with English-language sources, you usually don't have to deal with accented characters. People who work with sources in languages that use non-Latin alphabets or non-alphabetic writing systems will have to know more about how to represent these characters. Our sources include some French characters, so we need to make sure to represent them in a uniform way. You can generalize our routine to include other characters from the latin-1 or utf-8 character sets as necessary. If you will need to do this on a regular basis, you should spend some time now getting more familiar with Unicode and read the section on Unicode strings in the Python tutorial. There is also a very useful reference on Computing with Accents, Symbols and Foreign Scripts from Penn State.

Our task is complicated by the fact that HTML and Unicode provide different ways to represent accented characters, and our source mixes and matches the two. The following routine converts a string to lowercase and then maps each accented character from the French language to its lowercase Unicode equivalent. Copy it to the dh.py module.

# Given a string containing French accented characters
# in Unicode or HTML, return normalized lowercase.
 
def normalizeFrenchAccents(str):
    newstr = unicode(str, 'utf-8').encode('latin-1', 'replace')
    newstr = newstr.lower()
    newstr = newstr.replace('&rsquo;', '\'')
    newstr = newstr.replace('\xc0', '\xe0') # a grave
    newstr = newstr.replace('&agrave;', '\xe0') # a grave
    newstr = newstr.replace('\xc2', '\xe2') # a circumflex
    newstr = newstr.replace('&acirc;', '\xe2') # a circumflex
    newstr = newstr.replace('\xc4', '\xe4') # a diaeresis
    newstr = newstr.replace('&auml;', '\xe4') # a diaeresis
    newstr = newstr.replace('\xc6', '\xe6') # ae ligature
    newstr = newstr.replace('&aelig;', '\xe6') # ae ligature
    newstr = newstr.replace('\xc8', '\xe8') # e grave
    newstr = newstr.replace('&egrave;', '\xe8') # e grave
    newstr = newstr.replace('\xc9', '\xe9') # e acute
    newstr = newstr.replace('&eacute;', '\xe9') # e acute
    newstr = newstr.replace('\xca', '\xea') # e circumflex
    newstr = newstr.replace('&ecirc;', '\xea') # e circumflex
    newstr = newstr.replace('\xcb', '\xeb') # e diaeresis
    newstr = newstr.replace('&euml;', '\xeb') # e diaeresis
    newstr = newstr.replace('\xce', '\xee') # i circumflex
    newstr = newstr.replace('&icirc;', '\xee') # i circumflex
    newstr = newstr.replace('\xcf', '\xef') # i diaeresis
    newstr = newstr.replace('&iuml;', '\xef') # i diaeresis
    newstr = newstr.replace('\xd4', '\xf4') # o circumflex
    newstr = newstr.replace('&ocirc;', '\xf4') # o circumflex
    newstr = newstr.replace('&oelig;', 'oe') # oe ligature
    newstr = newstr.replace('\xd9', '\xf9') # u grave
    newstr = newstr.replace('&ugrave;', '\xf9') # u grave
    newstr = newstr.replace('\xdb', '\xfb') # u circumflex
    newstr = newstr.replace('&ucirc;', '\xfb') # u circumflex
    newstr = newstr.replace('\xdc', '\xfc') # u diaeresis
    newstr = newstr.replace('&uuml;', '\xfc') # u diaeresis
    newstr = newstr.replace('\xc7', '\xe7') # c cedilla
    newstr = newstr.replace('&ccedil;', '\xe7') # c cedilla
    newstr = newstr.replace('&yuml;', '\xff') # y diaeresis
    return newstr
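For comparison, here is a rough sketch of how a similar normalization might look in Python 3, where strings are Unicode by default and the standard library's html.unescape function converts HTML entities to characters. It is not a drop-in replacement for the routine above: it returns a Unicode string rather than latin-1 bytes, and it decodes every named entity rather than just the French ones. The function name normalize_accents is our own.

```python
import html


def normalize_accents(text):
    """Return text in lowercase with HTML entities decoded (Python 3 sketch)."""
    # Turn entities such as &eacute; or &Eacute; into the characters they name.
    text = html.unescape(text)
    # In Python 3, lower() also handles accented capitals such as \xc9.
    text = text.lower()
    # Mirror two choices made by the routine above: a plain apostrophe for
    # the right single quotation mark, and 'oe' for the oe ligature.
    text = text.replace('\u2019', "'")
    text = text.replace('\u0153', 'oe')
    return text


print(normalize_accents('ANNAOTAHA, &Eacute;tienne'))
```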

Some helper functions

Given what we know, we can write some code to extract BIOID-BIONAME pairs to a dictionary.

# extract dictionary of bioid-name pairs
linkpattern = re.compile(r'(\d{5}).*\>(.*)\<', re.UNICODE)
biodict = {}
for i in linklist:
    matchinglink = linkpattern.search(str(i))
    if matchinglink:
        bioid = matchinglink.group(1)
        bioname = matchinglink.group(2)
        biodict[bioid] = dh.normalizeFrenchAccents(bioname)
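To see what ends up in the dictionary, you can try the same loop on a few of the links shown earlier. This sketch (runnable in Python 3) uses plain strings in place of the Beautiful Soup tag objects and, as a stand-in for dh.normalizeFrenchAccents, simply converts each name to lowercase. The names sample_links and sample_biodict are ours.

```python
import re

# Three links copied from the search results, as plain strings.
sample_links = [
    '<a name="TOP"></a>',
    '<a href="ShowBio.asp?BioId=34298&query=iroquois">DOLLARD DES ORMEAUX, ADAM</a>',
    '<a href="ShowBio.asp?BioId=34590&query=iroquois">PIESKARET, Simon</a>',
]

linkpattern = re.compile(r'(\d{5}).*>(.*)<')

sample_biodict = {}
for link in sample_links:
    matchinglink = linkpattern.search(link)
    if matchinglink:
        bioid = matchinglink.group(1)
        bioname = matchinglink.group(2)
        # lower() stands in for dh.normalizeFrenchAccents in this sketch
        sample_biodict[bioid] = bioname.lower()

print(sample_biodict)
```

The first link has no five-digit number in it, so it doesn't match and is left out of the dictionary.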

We are also going to want to be able to do a few things with our local file system. We're going to need a separate directory to store our downloaded files in, and if it doesn't exist, we're going to have to create it. The file system is part of the operating system on your computer. Python includes an os library to access it.

import os
 
# make directory to store downloaded pages if one doesn't exist
if not os.path.exists('iroquois'): os.mkdir('iroquois')
 
Before downloading a file, we will also want to be able to see if it already exists...
 
# b is one of the BIOIDs from biodict
outfile = 'iroquois/dcb-' + str(b) + '.html'
if not os.path.isfile(outfile):
    # outfile doesn't already exist, so this is where we download it
    pass
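If you would like to try these file system calls without touching your real iroquois directory, the following sketch (runnable in Python 3) does the same checks inside a temporary scratch directory. The scratch directory and the BIOID used in the file name are assumptions made for the example.

```python
import os
import tempfile

# Work inside a scratch directory so nothing real is overwritten.
base = tempfile.mkdtemp()
target_dir = os.path.join(base, 'iroquois')

# Make the directory if it doesn't exist yet.
if not os.path.exists(target_dir):
    os.mkdir(target_dir)

outfile = os.path.join(target_dir, 'dcb-34590.html')

# Before we write anything, the file is not there.
existed_before = os.path.isfile(outfile)

f = open(outfile, 'w')
f.write('<html></html>')
f.close()

# After writing, the same check succeeds.
exists_after = os.path.isfile(outfile)

print(existed_before)
print(exists_after)
```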

We will want to introduce a time delay into our program, so that it waits for a while before trying to download the next file. Since programs can request pages from a web server much faster than human users, it is considered polite to separate automatic requests with a time delay. Python includes a time library which gives access to a number of different timing functions.

import time
 
# pause for two seconds
time.sleep(2)

Finally, we will be keeping track of our program's progress by sending messages to the "Command Output" pane of Komodo. Instead of sending one character at a time to the output, the Python interpreter uses a buffering strategy. It puts the characters in a temporary holding space until the space is full, then sends them all to the output at once. This is more efficient, but it means that you don't have immediate feedback about what your program is up to. Fortunately, the Python sys module gives us the ability to flush a buffer whenever we want.

import sys
 
# send feedback to the "Command Output" pane immediately
print "File already downloaded"
sys.stdout.flush()
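The delay and the flush fit together naturally in a loop. Here is a small sketch (runnable in Python 3) that reports on each item, flushes the output so that the message appears right away, and then waits before moving on. The function name process_politely and its delay parameter are our own; the real downloading step is left out.

```python
import sys
import time


def process_politely(items, delay=2):
    """Report on each item, flush the output, then pause for delay seconds."""
    handled = []
    for item in items:
        print("Processing bioid: " + str(item))
        # Without this, the messages might only show up once the buffer fills.
        sys.stdout.flush()
        handled.append(item)
        # Wait before the next item, as we would between web requests.
        time.sleep(delay)
    return handled


# A short delay keeps the demonstration quick.
process_politely(['34298', '34590'], delay=0.1)
```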

Putting it all together

Our program will perform the following tasks:

  1. Load a number of Python modules
  2. Load the search results from a local saved file into a string
  3. Parse the search results to extract all HTML a tags
  4. Scrape the BIOID-BIONAME pairs from the a tags and put them in a dictionary
  5. Make a directory to store the local copies of the files if one doesn't already exist
  6. For each BIOID, check to see if the file already exists
    1. If not, download it and save a local copy then wait two seconds
    2. Otherwise print a message that the file has already been downloaded
    3. Flush the "Command Output" buffer so the user gets immediate feedback
  7. Create a page of links to local copies of the biographies and open in Firefox

Here is the code that accomplishes these tasks.

# get-iroquois-bios.py
 
import dh
import re, os, sys, time, urllib2
from BeautifulSoup import BeautifulSoup
 
# load search results from saved file into string
searchresultfile = 'dcb-v01-iroquois.html'
f = open(searchresultfile, 'r')
searchresulthtml = f.read()
f.close()
 
# parse search results file to extract hyperlinks
searchresultsoup = BeautifulSoup(searchresulthtml)
linklist = searchresultsoup.findAll('a')
 
# extract dictionary of bioid-name pairs
linkpattern = re.compile(r'(\d{5}).*\>(.*)\<', re.UNICODE)
biodict = {}
for i in linklist:
    matchinglink = linkpattern.search(str(i))
    if matchinglink:
        bioid = matchinglink.group(1)
        bioname = matchinglink.group(2)
        biodict[bioid] = dh.normalizeFrenchAccents(bioname)
 
# make directory to store downloaded pages if one doesn't exist
if not os.path.exists('iroquois'): os.mkdir('iroquois')
 
# download a local copy of each bio
urlprefix = 'http://www.biographi.ca/EN/ShowBioPrintable.asp?BioId='
for b in biodict:
    print "Processing bioid: " + str(b)
    url = urlprefix + str(b)
    outfile = 'iroquois/dcb-' + str(b) + '.html'
    if not os.path.isfile(outfile):
        response = urllib2.urlopen(url)
        html = response.read()
        f = open(outfile, 'w')
        f.write(html)
        f.close()
        time.sleep(2)
    else:
        print "File already downloaded"
    sys.stdout.flush()
 
# create a page of links to local copies
outstring = ''
for b in biodict:
    outfile = 'dcb-' + str(b) + '.html'
    outstring += dh.undecoratedHyperlink('iroquois/'+outfile, str(b))
    outstring += '&nbsp;' * 4
    outstring += biodict[b]
    outstring += "<br />"
dh.wrapStringInHTML("get-iroquois-bios", searchresultfile, outstring)

Copy the code to Komodo, save it as get-iroquois-bios.py and execute it. Sometimes the server will choke for some reason and your program will halt with an error message. You can simply re-run it. It will skip over the biographies that it has already downloaded and carry on where it stopped. If a downloaded file is garbled for some reason, you can delete the file and rerun this program. We've designed it to take into account the fact that things sometimes go wrong when you're automatically harvesting online data.
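You can convince yourself that re-running really does pick up where the program stopped by simulating a flaky server. In this sketch (runnable in Python 3), nothing is fetched from the web: the fetch function is a made-up stand-in that fails on its second call, and the files are written to a temporary scratch directory. The names download_missing and make_flaky_fetch are our own.

```python
import os
import tempfile


def download_missing(bioids, outdir, fetch):
    """Save a page for each bioid that has no local file yet."""
    newly_saved = []
    for b in bioids:
        outfile = os.path.join(outdir, 'dcb-' + str(b) + '.html')
        if not os.path.isfile(outfile):
            html = fetch(b)  # may raise an error, halting the run
            f = open(outfile, 'w')
            f.write(html)
            f.close()
            newly_saved.append(b)
    return newly_saved


def make_flaky_fetch():
    """Return a pretend fetch function that fails on its second call."""
    calls = {'count': 0}

    def fetch(b):
        calls['count'] += 1
        if calls['count'] == 2:
            raise IOError('simulated server error')
        return '<html>bio ' + str(b) + '</html>'

    return fetch


outdir = tempfile.mkdtemp()
fetch = make_flaky_fetch()
bioids = ['34298', '34146', '34590']

# The first run halts partway through, as a real run sometimes will.
try:
    download_missing(bioids, outdir, fetch)
except IOError:
    pass

# The second run skips the file that was saved and fetches the rest.
second_run = download_missing(bioids, outdir, fetch)
print(second_run)
```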

If all goes well, the program should download the 167 biographies from volume 1 of the DCB to a directory called iroquois on your local disk. It will also open a page in Firefox with links to each of the downloaded files. You're now ready to learn how to index a collection of documents.