William J. Turkel and Alan MacEachern, The Programming Historian, 1st ed. NiCHE: Network in Canadian History & Environment (2007-08).
Useful measures of a text
In a previous section, you wrote a Python program called html-to-list-1.py which downloaded a web page, stripped out the HTML formatting and metadata and returned a list of 'words', like the one shown below.
['Dictionary', 'of', 'Canadian', 'Biography', 'DOLLARD', 'DES', 'ORMEAUX', '(called', 'Daulat', 'in', 'his', 'death', 'certificate', 'and', 'Daulac', 'by', 'some', 'historians),', 'ADAM,', 'soldier,', '\x93garrison', 'commander', 'of', 'the', 'fort', 'of', 'Ville-Marie', '[Montreal]\x94;', 'b.', '1635,', 'killed', 'by', 'the', 'Iroquois', 'at', 'the', 'Long', 'Sault', 'in', 'May 1660.', '\xa0\xa0\xa0\xa0\xa0', 'Nothing', 'is', 'known', 'of', 'Dollard\x92s', 'activities', 'prior', 'to', 'his', 'arrival', 'in', 'Canada', 'except', 'that', '\x93he', 'had', 'held', 'some', 'commands', 'in', 'the', 'armies', 'of', 'France.\x94', 'Having', 'come', 'to', 'Montreal', 'as', 'a', 'volunteer,', 'very', 'probably', 'in', '1658,', 'he', 'continued', 'his', 'military', 'career', 'there.', 'In', '1659', 'and', '1660', 'he', 'was', 'described', 'as', 'an', '\x93officer\x94', 'or', '\x93garrison', 'commander', 'of', 'the', 'fort', 'of', 'Ville-Marie,\x94', 'a', 'title', 'that', 'he', 'shared', 'with', 'Pierre', 'Picot\xe9', 'de', 'Belestre.', 'We', 'do', 'not', 'however', 'know', 'what', 'his', 'particular', 'responsibility', 'was.']
By itself, this ability doesn't buy us much because we already know how to read. We can use the text, however, to do things that aren't usually possible without special software. We're going to start by computing the frequencies of words and other linguistic units, a classic measure of a text.
Cleaning up the list
It is clear that our list is going to need some cleaning up before we can use it to count frequencies. For one thing, we won't want the frequencies of words to depend on capitalization: "Dollard" and "DOLLARD" should count as the same word. Typically words are folded to lowercase when counting frequencies, so we'll do that using the string method lower.
print('Hello WORLD'.lower())
-> hello world
There are assorted punctuation marks that will throw off the frequency counts if they are left in. We want "soldier," to be counted as "soldier" and "[Montreal]" as "Montreal", of course. Looking through the output we also find " " which is an HTML ampersand character code for a non-breaking space. Using another string method, we can replace that code with a blank space, as in the following.
print('hello world')
-> hello world
print('hello world'.replace(' ',' '))
-> hello world
There are also a number of accented French characters which are represented with Unicode strings like "\xe9" (which stands for "é"). We'll learn more about working with Unicode characters later; for now we'll leave them as they are.
At this point, we might look through a number of other DCB entries and a wide range of other potential sources to make sure that there aren't other special characters that are going to cause problems later. We might also try to anticipate situations where we don't want to get rid of punctuation (e.g., distinguishing dollar amounts like "$1629" from dates, or recognizing that "1629-40" has a different meaning than "1629 40".) This is what professional programmers get paid to do: try to think of everything that might go wrong and deal with it in advance.
We're going to take a different approach. Our main goal is to develop techniques that a working historian can use during the research process. This means that we will almost always prefer approximately correct solutions that can be developed quickly. So rather than taking the time now to make our program robust in the face of exceptions, we're simply going to get rid of anything that isn't an accented or unaccented letter or an Arabic numeral. Programming is typically a process of stepwise refinement. You start with a problem and part of a solution, and then you keep refining your solution until you have something that works better.
Our first use of regular expressions
In order to eliminate special characters, we're going to make use of a very powerful mechanism called regular expressions. Regular expressions are provided by many programming languages in a range of different forms. To do what we want to do right now, we have to import the Python regular expression library and compile a pattern that matches anything that isn't an alphanumeric character. Copy the following function and paste it into the dh.py module.
# Given a text string, remove all non-alphanumeric # characters (using Unicode definition of alphanumeric). def stripNonAlphaNum(text): import re return re.compile(r'\W+', re.UNICODE).split(text)
The regular expression in the above code is the material inside the string, in other words \W+. The \W is shorthand for the class of non-alphanumeric characters. In a Python regular expression, the plus sign matches one or more copies of a given character. The re.UNICODE tells the interpreter that we want to include characters from the world's other languages in our definition of 'alphanumeric', as well as the A to Z, a to z and 0 to 9 of English. Regular expressions have to be compiled before they can be used, which is what the rest of the statement does. Don't worry about understanding the compilation part right now.
When we refine our html-to-list program, it now looks like this:
# html-to-list-2.py import urllib2 import dh url = 'http://niche.uwo.ca/programming-historian/dcb/dcb-34298.html' response = urllib2.urlopen(url) html = response.read() text = dh.stripTags(html).replace(' ', ' ') wordlist = dh.stripNonAlphaNum(text.lower()) print wordlist[0:500]
When you execute the program and look through its output in the "Command Output" pane, you'll see that it has done a pretty good job. As expected, it has left accented characters as codes, so words like "Picoté" appear as "picot\xe9". It has split hyphenated forms like "Ville-Marie" into two words and turned the possessive "s" into a separate word by losing the apostrophe. But it is a good enough approximation to what we want that we should move on to counting frequencies before attempting to make it better. (If you work with sources in more than one language, you need to learn more about the Unicode standard and about Python support for Unicode.)
Python dictionaries
Both strings and lists are sequentially ordered, which means that you can access their contents by using an index, a number that starts at 0. If you have a list containing strings, you can use a pair of indexes to first access a particular string, and then a particular character within that string. Study the examples below.
s = 'hello world' print s[0]
-> h
print s[1]
-> e
m = ['hello', 'world'] print m[0]
-> hello
print m[1]
-> world
print m[0][1]
-> e
print m[1][0]
-> w
To keep track of frequencies, we're going to need another type of Python object, a dictionary. The dictionary is an unordered collection of objects. That means that you can't use an index to retrieve elements from it. You can, however, look them up by using a key (hence the name 'dictionary'). Study the following example.
d = {'world': 1, 'hello': 0} print d['hello']
-> 0
print d['world']
-> 1
print d.keys()
-> ['world', 'hello']
Note that you use curly braces to define a dictionary, but square brackets to access things within it. The keys operation returns a list of keys that are defined in the dictionary.
Counting word frequencies
Now we want to count the frequency of each word in our list. You've already seen that it is easy to process a list by using a for loop. Try saving and executing the following example.
# count-list-items-1.py wordstring = 'it was the best of times it was the worst of times ' wordstring += 'it was the age of wisdom it was the age of foolishness' wordlist = wordstring.split() wordfreq = [] for word in wordlist: wordfreq.append(wordlist.count(word)) print "String\n" + wordstring +"\n" print "List\n" + str(wordlist) + "\n" print "Frequencies\n" + str(wordfreq) + "\n" print "Pairs\n" + str(zip(wordlist, wordfreq))
You should get something like this:
String
it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
List
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was',
'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age',
'of', 'wisdom', 'it', 'was', 'the', 'age', 'of',
'foolishness']
Frequencies
[4, 4, 4, 1, 4, 2, 4, 4, 4, 1, 4, 2, 4, 4, 4, 2, 4, 1, 4,
4, 4, 2, 4, 1]
Pairs
[('it', 4), ('was', 4), ('the', 4), ('best', 1), ('of', 4),
('times', 2), ('it', 4), ('was', 4), ('the', 4),
('worst', 1), ('of', 4), ('times', 2), ('it', 4),
('was', 4), ('the', 4), ('age', 2), ('of', 4),
('wisdom', 1), ('it', 4), ('was', 4), ('the', 4),
('age', 2), ('of', 4), ('foolishness', 1)]
In the preceding program, we start with a string and split it into a list, as we've done before. We then go through each word in the list, and count the number of times that word appears in the whole list, and add the count to another list of word frequencies. Using the zip operation, we are able to match the first word of the word list with the first number in the frequency list, the second word and second frequency, and so on. We end up with a list of word and frequency pairs. The str statement converts any object to a string so that it can be printed.
Python also includes a very convenient tool called a list comprehension, which can be used to do the same thing as the for loop more economically.
# count-list-items-2.py wordstring = 'it was the best of times it was the worst of times ' wordstring += 'it was the age of wisdom it was the age of foolishness' wordlist = wordstring.split() wordfreq = [wordlist.count(w) for w in wordlist] print "String\n" + wordstring +"\n" print "List\n" + str(wordlist) + "\n" print "Frequencies\n" + str(wordfreq) + "\n" print "Pairs\n" + str(zip(wordlist, wordfreq))
At this point we have a list of pairs, where each pair contains a word and its frequency. Note that this list is redundant. If "the" occurs 500 times, then this list contains five hundred copies of the pair ('the', 500). This list is also ordered by the words in the original text. We can solve both problems by converting it into a dictionary. Then all we have to do is print out the dictionary in order from the most to the least commonly occurring item.
From HTML to a dictionary of word-frequency pairs
6 Apr 2008. Mac tested to here.
Building on what we have so far, we want a function that can convert a list of words into a dictionary of word-frequency pairs. The only new command that we will need is dict, which makes a dictionary from a list of pairs. Copy the following and add it to the dh.py module.
# Given a list of words, return a dictionary of # word-frequency pairs. def wordListToFreqDict(wordlist): wordfreq = [wordlist.count(p) for p in wordlist] return dict(zip(wordlist,wordfreq))
We are also going to want a function that can sort a dictionary of word-frequency pairs by descending frequency. Copy this and add it to the dh.py module, too.
# Sort a dictionary of word-frequency pairs in # order of descending frequency. def sortFreqDict(freqdict): aux = [(freqdict[key], key) for key in freqdict] aux.sort() aux.reverse() return aux
We can now write a program which takes a URL and returns word-frequency pairs for the web page, sorted in order of descending frequency. Copy the following program into Komodo Edit, save it as html-to-freq.py and execute it. Study the program and the output carefully before continuing.
# html-to-freq.py import urllib2 import dh # note that we are using a copy of the old DCB page again url = 'http://niche-canada.org/files/dcb/dcb-34298.html' response = urllib2.urlopen(url) html = response.read() text = dh.stripTags(html).replace(' ', ' ') wordlist = dh.stripNonAlphaNum(text.lower()) dictionary = dh.wordListToFreqDict(wordlist) sorteddict = dh.sortFreqDict(dictionary) for s in sorteddict: print str(s)
Removing stop words
When we look at the output of our html-to-freq.py program, we see that a lot of the most frequent words in the text are function words like "the", "of", "to" and "and".
(647, 'the') (310, 'of') (273, 'to') (202, 'and') (171, 'in') (134, 'a') (118, 'that') (91, 'dollard') (78, 'was') (78, 'their') (75, 'were') (72, 'they') (71, 'his')
These words are usually the most common in any English language text, so they don't tell us much that is distinctive about Dollard's biography. In general, we are more interested in finding the words that will help us differentiate this text from texts that are about different subjects. So we're going to filter out the common function words. Words that are ignored like this are known as stop words. We're going to use the following list, adapted from one posted online by computer scientists at Glasgow. Copy it and put it at the beginning of the dh.py library that you are building.
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards'] stopwords += ['again', 'against', 'all', 'almost', 'alone', 'along'] stopwords += ['already', 'also', 'although', 'always', 'am', 'among'] stopwords += ['amongst', 'amoungst', 'amount', 'an', 'and', 'another'] stopwords += ['any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere'] stopwords += ['are', 'around', 'as', 'at', 'back', 'be', 'became'] stopwords += ['because', 'become', 'becomes', 'becoming', 'been'] stopwords += ['before', 'beforehand', 'behind', 'being', 'below'] stopwords += ['beside', 'besides', 'between', 'beyond', 'bill', 'both'] stopwords += ['bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant'] stopwords += ['co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de'] stopwords += ['describe', 'detail', 'did', 'do', 'done', 'down', 'due'] stopwords += ['during', 'each', 'eg', 'eight', 'either', 'eleven', 'else'] stopwords += ['elsewhere', 'empty', 'enough', 'etc', 'even', 'ever'] stopwords += ['every', 'everyone', 'everything', 'everywhere', 'except'] stopwords += ['few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first'] stopwords += ['five', 'for', 'former', 'formerly', 'forty', 'found'] stopwords += ['four', 'from', 'front', 'full', 'further', 'get', 'give'] stopwords += ['go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her'] stopwords += ['here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers'] stopwords += ['herself', 'him', 'himself', 'his', 'how', 'however'] stopwords += ['hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed'] stopwords += ['interest', 'into', 'is', 'it', 'its', 'itself', 'keep'] stopwords += ['last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made'] stopwords += ['many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine'] stopwords += ['more', 'moreover', 'most', 'mostly', 'move', 'much'] stopwords += ['must', 'my', 'myself', 'name', 'namely', 'neither', 'never'] stopwords += ['nevertheless', 'next', 'nine', 'no', 'nobody', 'none'] stopwords += ['noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of'] stopwords += ['off', 'often', 'on','once', 'one', 'only', 'onto', 'or'] stopwords += ['other', 'others', 'otherwise', 'our', 'ours', 'ourselves'] stopwords += ['out', 'over', 'own', 'part', 'per', 'perhaps', 'please'] stopwords += ['put', 'rather', 're', 's', 'same', 'see', 'seem', 'seemed'] stopwords += ['seeming', 'seems', 'serious', 'several', 'she', 'should'] stopwords += ['show', 'side', 'since', 'sincere', 'six', 'sixty', 'so'] stopwords += ['some', 'somehow', 'someone', 'something', 'sometime'] stopwords += ['sometimes', 'somewhere', 'still', 'such', 'system', 'take'] stopwords += ['ten', 'than', 'that', 'the', 'their', 'them', 'themselves'] stopwords += ['then', 'thence', 'there', 'thereafter', 'thereby'] stopwords += ['therefore', 'therein', 'thereupon', 'these', 'they'] stopwords += ['thick', 'thin', 'third', 'this', 'those', 'though', 'three'] stopwords += ['three', 'through', 'throughout', 'thru', 'thus', 'to'] stopwords += ['together', 'too', 'top', 'toward', 'towards', 'twelve'] stopwords += ['twenty', 'two', 'un', 'under', 'until', 'up', 'upon'] stopwords += ['us', 'very', 'via', 'was', 'we', 'well', 'were', 'what'] stopwords += ['whatever', 'when', 'whence', 'whenever', 'where'] stopwords += ['whereafter', 'whereas', 'whereby', 'wherein', 'whereupon'] stopwords += ['wherever', 'whether', 'which', 'while', 'whither', 'who'] stopwords += ['whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with'] stopwords += ['within', 'without', 'would', 'yet', 'you', 'your'] stopwords += ['yours', 'yourself', 'yourselves']
Now getting rid of the stop words in a list is as easy as using another list comprehension. Add this function to the dh.py module, too.
# Given a list of words, remove any that are # in a list of stop words. def removeStopwords(wordlist, stopwords): return [w for w in wordlist if w not in stopwords]
Putting it all together
Now we have everything we need to determine word frequencies for web pages. Copy the following to Komodo Edit, save it as html-to-freq-2.py and execute it.
# html-to-freq-2.py import urllib2 import dh # old copy of the page again url = 'http://niche-canada.org/files/dcb/dcb-34298.html' response = urllib2.urlopen(url) html = response.read() text = dh.stripTags(html).replace(' ', ' ') fullwordlist = dh.stripNonAlphaNum(text.lower()) wordlist = dh.removeStopwords(fullwordlist, dh.stopwords) dictionary = dh.wordListToFreqDict(wordlist) sorteddict = dh.sortFreqDict(dictionary) for s in sorteddict: print str(s)
If all went well, your output should look like this:
(91, 'dollard') (64, 'iroquois') (33, 'long') (27, 'sault') (24, 'enemy') (24, '1660') (20, 'time') (20, 'seventeen') (20, 'french') (19, 'new') (19, 'montreal') (19, 'army') (18, 'hurons') (18, 'fort') (17, 'france') (15, 'men') (14, 'marie') (14, 'companions') ...
Suggested Readings
Lutz, Learning Python
Ch. 9: Tuples, Files, and Everything Else
Ch. 11: Assignment, Expressions, and print
Ch. 12: if Tests
Ch. 13: while and for Loops
