
Tag Clouds

Visualizing term frequency

Web 2.0 has made the tag cloud a ubiquitous form of text visualization. To create a tag cloud for a text, you first remove stop words, then take a couple dozen of the most frequently occurring terms, alphabetize them, and render each so that its font size is proportional to the number of times that it occurs in the text. The result typically looks something like the following:

Chirag Mehta's US Presidential Speeches Tag Cloud

This example is taken from Chirag Mehta's US Presidential Speeches Tag Cloud website. The site shows tag clouds for various historic presidential speeches, beginning in 1776 and running through George W. Bush's State of the Union addresses. You move through speeches by adjusting a horizontal time-line slider along the top. The display responds by providing a tag cloud for each speech. As you move back and forth through time, you can see different words become more or less prominent. Mehta also uses color to indicate the relative newness of a given term, fading from white to brown over time. Take a few moments to familiarize yourself with Mehta's website, to get some idea of the potential of this kind of visualization. We're going to start by visualizing the term frequencies of a single text; later you'll learn how to extend the method to synchronous or diachronous collections of texts.

Mapping one range onto another

You already did most of the work that you'll need to make a tag cloud visualization when you learned how to compute word frequencies. We start by deciding how many terms we want to have in our cloud. Usually this will be somewhere on the order of 30 to 100. You want to show enough terms to capture the most interesting aspects of your text, but not so many that the distinctive features are obscured by noise. Once the size of the tag cloud is determined, you take that many elements from the top of the sorted dictionary of word-frequency pairs. Suppose we're working with Dollard's biography and we've decided that we want the size of our tag cloud to be 100 elements. Then we take the top 100 items from our dictionary. The most frequent term, 'dollard', occurs 91 times. The least frequent term occurs 6 times. Subtracting one from the other, we have a frequency range of 85.

cloudsize = 100
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize - 1][0]
freqrange = maxfreq - minfreq
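
To make the selection step concrete, here is a minimal sketch that uses a small, made-up list of (frequency, word) pairs in place of the real sorteddict. The toy data and the names toy_sorteddict and top_terms are assumptions for illustration only. The sketch reads the minimum frequency from the last term that is actually shown in the cloud.

```python
# Toy stand-in for sorteddict: (frequency, word) pairs, most frequent first.
# These values are invented for illustration, not taken from the biography.
toy_sorteddict = [(9, 'fort'), (7, 'river'), (4, 'canoe'), (2, 'winter'), (1, 'snow')]

cloudsize = 3
top_terms = toy_sorteddict[:cloudsize]   # the terms that will appear in the cloud
maxfreq = top_terms[0][0]                # frequency of the most frequent term
minfreq = top_terms[-1][0]               # frequency of the last term shown
freqrange = maxfreq - minfreq

print(top_terms)
print(freqrange)
```

With these toy values the cloud holds three terms and the frequency range is 9 - 4 = 5.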

Now we need to map this range onto a range of possible font sizes. We've chosen 24 pixels as our smallest font and 54 as our largest, giving us a range of 30 font sizes. So we need to map 85 different frequencies down to 30 different fonts. We could work with these exact values, but it makes more sense to come up with a general formula.

Let's start with the ends of our ranges. We want a frequency of 6 to be mapped to a font size of 24, or, more generally, we want minfreq to be mapped to minfont. Likewise, we want maxfreq to be mapped to maxfont.

Now consider the second-most frequent term, 'iroquois', which occurs 64 times in Dollard's biography. We first need to determine what proportion of the way it is between minfreq and maxfreq. In this case, it is (64-6)/85 = 58/85 = 0.68235. In other words, 'iroquois' occurs about 68% of the way between the least frequent term and 'dollard'. More generally, let's define a frequency scalingfactor for term k as:

scalingfactor = (kfreq - minfreq) / float(freqrange)

(In Python 2, dividing one integer by another discards the fractional part, so we use the float function to tell Python that we want a floating point result.)
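
The arithmetic above can be checked with a short sketch. The function name scaling_factor and the handling of a zero frequency range are assumptions added for illustration; the frequencies are the ones quoted in the text.

```python
def scaling_factor(kfreq, minfreq, maxfreq):
    """Return how far kfreq lies between minfreq and maxfreq, as a float."""
    freqrange = maxfreq - minfreq
    if freqrange == 0:
        # Assumption: if every term has the same frequency, treat them all as maximal.
        return 1.0
    return (kfreq - minfreq) / float(freqrange)

# Figures from the text: 'dollard' occurs 91 times, 'iroquois' 64 times,
# and the least frequent term in the cloud 6 times.
print(round(scaling_factor(64, 6, 91), 5))   # 0.68235
print(scaling_factor(6, 6, 91))              # 0.0
print(scaling_factor(91, 6, 91))             # 1.0
```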

That deals with the range on the frequency side of our mapping. On the font side, we will use a similar logic:

import math
minfont = 24
maxfont = 54
fontrange = maxfont - minfont
fontsize = int(minfont + math.floor(fontrange * scalingfactor))

The floor function in the math module rounds a floating point number down to the nearest integer.
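
As a quick check of the font side of the mapping, the following sketch wraps the same formula in a function and applies it to the two ends of the range and to 'iroquois'. The function name font_size and its default arguments are assumptions made for this example.

```python
import math

def font_size(scalingfactor, minfont=24, maxfont=54):
    """Map a scaling factor between 0 and 1 onto a whole-number font size."""
    fontrange = maxfont - minfont
    return int(minfont + math.floor(fontrange * scalingfactor))

print(font_size(0.0))         # 24, the least frequent term
print(font_size(58 / 85.0))   # 44, 'iroquois'
print(font_size(1.0))         # 54, 'dollard'
```

For 'iroquois', 30 * 0.68235 is about 20.47, which floor rounds down to 20, giving 24 + 20 = 44 pixels.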

A little bit of CSS

In the early days of the web, HTML tags specified both what something was (e.g., a title or a paragraph) and how something should look (e.g., italics or boldface). This made it difficult to write web pages that would look good and be usable on a variety of different computers. A very large font might make a good headline on a wide screen, but not fit on a narrow one. A page that looks good in color might be illegible when printed on a laser printer. To get a sense of the problem, compare the Google home page with the mobile version of the page that is designed to be used on devices like cellphones. The obvious solution was to separate the content of a page from its form. HTML tags are now generally used to specify what something is, and Cascading Style Sheets (CSS) are used to specify how it should look under various conditions.

You've already seen one example of CSS, when we used the style property of an HTML a tag to indicate that we didn't want the browser to underline hyperlinks. Putting CSS into HTML tags like this is known as inline CSS. It partly defeats the original purpose of separating form and content, but it keeps things simple for now and will make it easier for us to separate them later on.

<a style="text-decoration:none" href=" ... "> ... </a>

We can bundle this into a function that returns the HTML a tag as a string, given a url and link name. Add the following function to the dh.py module.

# Given a url and link name, return a string containing
# HTML and inline CSS for an undecorated hyperlink.
 
def undecoratedHyperlink(url, linkname):
    astr = """<a
style=\"text-decoration:none\" href=\"%s\">%s</a>
"""
    return astr % (url, linkname)

Given the undecoratedHyperlink function, we can rewrite our keywordListToGoogleSearchLink function, too. Replace the old version in dh.py with the following version:

# Given a list of keywords and a link name, return an
# HTML link to a Google search for those terms.
 
def keywordListToGoogleSearchLink(keywords, linkname):
    url = 'http://www.google.com/search?q='
    url += '+'.join(keywords)
    gsearch = undecoratedHyperlink(url, linkname)
    return gsearch
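
Joining keywords with '+' works for plain words, but a keyword that contains a space or an ampersand would produce a malformed query string. One possible refinement, sketched here as an assumption rather than as part of dh.py, is to percent-encode each keyword first. The helper name encoded_google_search_url and the sample keywords are hypothetical; quote_plus is a standard library function (in urllib.parse under Python 3, in urllib under Python 2).

```python
try:
    from urllib.parse import quote_plus   # Python 3
except ImportError:
    from urllib import quote_plus         # Python 2

def encoded_google_search_url(keywords):
    """Build a Google search URL, percent-encoding each keyword (hypothetical helper)."""
    url = 'http://www.google.com/search?q='
    url += '+'.join(quote_plus(k) for k in keywords)
    return url

search_url = encoded_google_search_url(['dollard', 'long sault', 'q&a'])
print(search_url)
```

The resulting string could then be passed to undecoratedHyperlink in the same way as the unencoded URL.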

In order to apply a style to a small region of your web page, you usually use a span tag. For example, if you wanted to print a word in red in HTML, you could use either of the following expressions. You can test these two HTML expressions online with the W3 Schools TryIt editor.

this is in <span style="color: red;">red</span><br />
this is also in <span style="color: rgb(255,0,0);">red</span>

The first example uses a predefined color name. The second uses an RGB function that defines a color in terms of how much red, green and blue it contains, each on a scale ranging from 0 to 255. It is also possible to change font sizes using inline CSS. Try copying the following example into the HTML editor:

this <span style="font-size:8px;">word</span> is in an 8 pixel font<br />
this <span style="font-size:10px;">word</span> is in a 10 pixel font<br />
this <span style="font-size:12px;">word</span> is in a 12 pixel font<br />
this <span style="font-size:18px;">word</span> is in an 18 pixel font<br />
this <span style="font-size:24px;">word</span> is in a 24 pixel font<br />
this <span style="font-size:36px;">word</span> is in a 36 pixel font

To use a particular style for a large region of your page, you use the HTML div tag. Think of div like a generalization of a paragraph. In the program that we're developing, we are going to use span tags for each term in the tag cloud, but the whole cloud will be sitting inside of a div. Try the following code in the HTML editor. The CSS properties set the width of the div to 560 pixels, set the background color to a very light grey, put a solid one-pixel grey border around it, and center any text within it. As long as you end each property-value pair with a semi-colon, you can include as many as you want in the style.

<div style="width: 560px;
background-color: rgb(250,250,250);
border: 1px grey solid;
text-align: center;">
This is a test.
</div>

Functions to write HTML divs and spans

We now have enough background information to write some Python functions that will automatically create HTML div and span tags. Copy the following function and add it to dh.py.

# Given the body of a div and an optional string of
# property-value pairs, return string containing HTML
# and inline CSS for default div.
 
def defaultCSSDiv(divbody, opt=''):
    divstr = """<div style=\"
width: 560px;
background-color: rgb(250,250,250);
border: 1px grey solid;
text-align: center;
%s\">%s</div>
"""
    return divstr % (opt, divbody)

If you copy this function to a Python shell and execute it, you can see that it allows you to add additional properties to a default div.

print(defaultCSSDiv('This is a test', 'font-size: 24px;'))

-> <div style=" width: 560px;
background-color: rgb(250,250,250);
border: 1px grey solid;
text-align: center;
font-size: 24px;">This is a test</div>

We will also bundle up our code to create a scaled font. Copy the following function into the dh.py module.

# Given the body of a span and a scaling factor, return
# string containing HTML span with scaled font size.
 
def scaledFontSizeSpan(body, scalingfactor):
    import math
    minfont = 24
    maxfont = 54
    fontrange = maxfont - minfont
    fontsize = int(minfont + math.floor(fontrange * scalingfactor))
    spanstr = '<span style=\"font-size:%spx;\">%s</span>'
    return spanstr % (str(fontsize), body)
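
If you want to experiment with other font ranges, one possible variant is sketched below. It is not part of dh.py: the name scaled_font_size_span, the keyword arguments, and the decision to clamp out-of-range scaling factors are all assumptions made for illustration, and the labels in the calls are sample values.

```python
import math

def scaled_font_size_span(body, scalingfactor, minfont=24, maxfont=54):
    """Variant of scaledFontSizeSpan with an adjustable font range (hypothetical)."""
    # Assumption: scaling factors outside 0..1 are clamped rather than rejected.
    scalingfactor = max(0.0, min(1.0, scalingfactor))
    fontrange = maxfont - minfont
    fontsize = int(minfont + math.floor(fontrange * scalingfactor))
    return '<span style="font-size:%dpx;">%s</span>' % (fontsize, body)

print(scaled_font_size_span('dollard', 1.0))
print(scaled_font_size_span('iroquois', 0.5))
print(scaled_font_size_span('iroquois', 0.5, minfont=10, maxfont=20))
```

With the default range a scaling factor of 0.5 gives a 39 pixel font; with a range of 10 to 20 pixels it gives 15.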

Other dimensions for visualization

Given our scaledFontSizeSpan function, we now have the ability to map any scaling factor between 0 and 1 onto a range of font sizes that we've chosen. Font size is not the only variable that we could use for visualization, of course. Any range of potentially interesting differences in our data can be mapped onto any other range of perceptually salient properties. It is quite easy, for example, to create a function that adjusts the lightness or darkness of a greyscale font as well as the font size. Study the following and add it to dh.py.

# Given the body of a span and a scaling factor, return
# string containing HTML span with scaled font size and
# darkness of greyscale adjusted.
 
def scaledFontShadeSpan(body, scalingfactor):
    import math
    minfont = 24
    maxfont = 54
    fontrange = maxfont - minfont
    fontsize = int(minfont + math.floor(fontrange * scalingfactor))
    fontcolor = int(200 - math.ceil(200 * scalingfactor))
    spanstr = """<span style=\"font-size:%spx;
color: rgb(%d,%d,%d);
\">%s</span>
"""
    return spanstr % (str(fontsize), fontcolor, fontcolor, fontcolor, body)

When rgb is given three equal parameters, it returns a value between black (0, 0, 0) and white (255, 255, 255). In this case, we start with a relatively light grey (200, 200, 200) and subtract larger and larger values from it as the scaling factor increases. When the scaling factor is 1, our font color is black. Recall that the %d formatting character allows us to interpolate an integer into a string.
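
To see how the shade changes, this small sketch isolates the greyscale calculation and prints the resulting rgb value for a few scaling factors. The helper name grey_level is an assumption; the formula is the one used in scaledFontShadeSpan.

```python
import math

def grey_level(scalingfactor):
    """Return the grey component used for a given scaling factor."""
    return int(200 - math.ceil(200 * scalingfactor))

for sf in (0.0, 0.25, 0.5, 1.0):
    level = grey_level(sf)
    print('%.2f -> rgb(%d,%d,%d)' % (sf, level, level, level))
```

The component falls from 200 at a scaling factor of 0, through 150 and 100, to 0 (black) at a scaling factor of 1.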

It is also possible to adjust each of the red, green and blue color components independently. The following function creates a heat map, shading from a cool blue when the scaling factor is 0, to a hot red when it is 1. Make sure that you understand how the code works, then add it to the dh.py module.

# Given the body of a span and a scaling factor, return
# string containing HTML span with scaled font size and
# shading from cool blue to hot red.
 
def scaledFontHeatmapSpan(body, scalingfactor):
    import math
    minfont = 24
    maxfont = 54
    fontrange = maxfont - minfont
    fontsize = int(minfont + math.floor(fontrange * scalingfactor))
    fontcolor = int(250 - math.ceil(250 * scalingfactor))
    spanstr = """<span style=\"font-size:%spx;
color: rgb(%d,0,%d);
\">%s</span>
"""
    return spanstr % (str(fontsize), 250-fontcolor, fontcolor, body)
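
The color calculation can be examined on its own with a short sketch. The helper name heat_color is an assumption; the arithmetic follows scaledFontHeatmapSpan, where the blue component shrinks and the red component grows as the scaling factor rises.

```python
import math

def heat_color(scalingfactor):
    """Return the (red, green, blue) triple used for a given scaling factor."""
    blue = int(250 - math.ceil(250 * scalingfactor))
    red = 250 - blue
    return (red, 0, blue)

for sf in (0.0, 0.5, 1.0):
    print('%.1f -> rgb(%d,%d,%d)' % ((sf,) + heat_color(sf)))
```

A scaling factor of 0 gives pure blue (0, 0, 250), 0.5 gives an even purple (125, 0, 125), and 1 gives pure red (250, 0, 0).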

Putting it all together

We can now write a program that makes use of these functions (and ones that we've written previously) to create a tag cloud for Dollard's biography. First add the following to dh.py:

# Given a dictionary of frequency-word pairs sorted
# in order of descending frequency, re-sort so it is
# in alphabetical order by word.
 
def reSortFreqDictAlpha(sorteddict):
    import operator
    aux = [pair for pair in sorteddict]
    aux.sort(key=operator.itemgetter(1))
    return aux
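
The re-sorting step can be tried on a toy list. The sketch below is an equivalent formulation that uses the built-in sorted function, which returns a new list and leaves its argument untouched. The name resort_alpha and the sample pairs are assumptions for illustration.

```python
import operator

def resort_alpha(sorteddict):
    """Return a new list of (frequency, word) pairs ordered alphabetically by word."""
    return sorted(sorteddict, key=operator.itemgetter(1))

by_freq = [(9, 'fort'), (7, 'river'), (4, 'canoe')]
print(resort_alpha(by_freq))   # [(4, 'canoe'), (9, 'fort'), (7, 'river')]
print(by_freq)                 # the original order is unchanged
```

The key function operator.itemgetter(1) picks out the second element of each pair, the word, so the frequencies travel along with their words but play no part in the ordering.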

Now copy the following code to Komodo Edit, save it as html-to-tag-cloud.py and execute it.

# html-to-tag-cloud.py
 
import dh
 
# create sorted dictionary of word-frequency pairs
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
text = dh.webPageToText(url)
fullwordlist = dh.stripNonAlphaNum(text)
wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
dictionary = dh.wordListToFreqDict(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
 
# create tag cloud and open in Firefox
cloudsize = 100
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize - 1][0]
freqrange = maxfreq - minfreq
outstring = ''
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
for k in resorteddict:
    kfreq = k[0]
    klabel = k[1]
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    outstring += ' ' + dh.scaledFontSizeSpan(klabel, scalingfactor) + ' '
dh.wrapStringInHTML("html-to-tag-cloud", url, dh.defaultCSSDiv(outstring))

If you substitute dh.scaledFontShadeSpan or dh.scaledFontHeatmapSpan for dh.scaledFontSizeSpan you can create any of three different tag cloud visualizations like the ones shown below. You can also modify the code to change the range of font sizes, the typeface, the background color of the div, the width of the border, and anything else you can think of.

Three Tag Cloud Visualizations
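
If you want to try the whole pipeline without dh.py or a network connection, the following self-contained sketch builds a small tag cloud from a string. It is a simplified stand-in, not the program above: the function name tag_cloud_html, the tiny stop word list, the sample sentence, and the use of collections.Counter from the standard library are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop word list, far shorter than a real one.
TOY_STOPWORDS = set(['the', 'a', 'of', 'and', 'to', 'in'])

def tag_cloud_html(text, cloudsize=5, minfont=24, maxfont=54):
    """Return an HTML div holding a tag cloud for the most frequent words in text."""
    words = [w for w in re.findall(r'[a-z0-9]+', text.lower())
             if w not in TOY_STOPWORDS]
    top = Counter(words).most_common(cloudsize)   # (word, frequency) pairs
    if not top:
        return '<div></div>'
    maxfreq = top[0][1]
    minfreq = top[-1][1]
    freqrange = maxfreq - minfreq
    spans = []
    for word, freq in sorted(top):                # alphabetical order by word
        # Assumption: if all frequencies are equal, use the largest font.
        sf = (freq - minfreq) / float(freqrange) if freqrange else 1.0
        size = int(minfont + math.floor((maxfont - minfont) * sf))
        spans.append('<span style="font-size:%dpx;">%s</span>' % (size, word))
    return '<div style="width: 560px; text-align: center;">%s</div>' % ' '.join(spans)

sample = 'the fort and the river; the fort in winter, the fort of the river'
cloud = tag_cloud_html(sample, cloudsize=3)
print(cloud)
```

In the sample sentence 'fort' occurs three times, 'river' twice and 'winter' once, so the three spans come out at 54, 39 and 24 pixels respectively. You can paste the printed HTML into the HTML editor to see the result.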

Combining the tag cloud with KWIC

We're going to make one final refinement to this program, combining it with our ability to generate keyword in context (KWIC) displays. We want to create a web page with a tag cloud at the top and a number of KWIC displays in alphabetical order. When you click on a particular term, you're taken down the page to the KWIC display for that term. There is a link you can click to return to the tag cloud at the top. The code for this program is listed below. Much of it will be familiar, but it makes use of a few new techniques. Copy it to Komodo Edit, save as html-to-tag-cloud-kwic.py and execute it.

# html-to-tag-cloud-kwic.py
 
import dh
 
# create sorted dictionary of word-frequency pairs
url = 'http://niche-canada.org/files/dcb/dcb-34298.html'
text = dh.webPageToText(url)
fullwordlist = dh.stripNonAlphaNum(text)
wordlist = dh.removeStopwords(fullwordlist, dh.stopwords)
dictionary = dh.wordListToFreqDict(wordlist)
sorteddict = dh.sortFreqDict(dictionary)
 
# create dictionary of n-grams
n = 7
paddinglist = ('# ' * (n//2)).split()
fullwordlist[:0] = paddinglist
fullwordlist.extend(paddinglist)
ngrams = dh.getNGrams(fullwordlist, n)
worddict = dh.nGramsToKWICDict(ngrams)
 
# create tag cloud
cloudsize = 40
maxfreq = sorteddict[0][0]
minfreq = sorteddict[cloudsize - 1][0]
freqrange = maxfreq - minfreq
tempstring = ''
resorteddict = dh.reSortFreqDictAlpha(sorteddict[:cloudsize])
for k in resorteddict:
    kfreq = k[0]
    klabel = dh.undecoratedHyperlink('#'+k[1], k[1])
    scalingfactor = (kfreq - minfreq) / float(freqrange)
    tempstring += dh.scaledFontSizeSpan(klabel, scalingfactor)
outstring = dh.defaultCSSDiv(tempstring) + '<br />'

# create KWIC listings for each item
for k in resorteddict:
    klabel = k[1]
    tempstring = ''
    tempstring += '<a name=\"%s\">%s</a> ' % (klabel, klabel)
    tempstring += dh.undecoratedHyperlink('#', '[back]')
    outstring += dh.defaultCSSDiv(tempstring, opt='font-size : 24px;')
    outstring += '<p><pre>'
    for t in worddict[klabel]:
        outstring += dh.prettyPrintKWIC(t)
        outstring += '<br />'
    outstring += '</pre></p>'
 
# open in Firefox
dh.wrapStringInHTML("html-to-tag-cloud-kwic", url, outstring)

The first section of the code creates a sorted dictionary of word-frequency pairs, as you've done before. In the second section, we need to add a list containing (n//2) padding characters (hash marks) to the beginning and end of our full list of words before creating a dictionary of n-grams. We use two new commands to do this. The first replaces the empty slice before the beginning of the list with the list of padding characters. The second extends the word list by adding the list of padding characters to the end of it. To experiment with this, open a Python shell and try the following:

testlist = 'this is a test'.split()
print(testlist)

-> ['this', 'is', 'a', 'test']

testlist[:0] = ['i', 'say']
print(testlist)

-> ['i', 'say', 'this', 'is', 'a', 'test']

print(testlist.extend(['is', 'it', 'not']))

-> None

print(testlist)

-> ['i', 'say', 'this', 'is', 'a', 'test', 'is', 'it', 'not']

Note that the extend method changes the list, but doesn't return a value (hence the 'None'). This is called changing a list in place.
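
The same two operations can be applied to the padding itself. The sketch below assumes that the padding is built as a list of hash-mark strings by splitting the repeated string, and uses a short toy word list in place of fullwordlist. It also shows why a list is needed: a bare string assigned to a slice is inserted one character at a time, spaces included.

```python
n = 7
paddinglist = ('# ' * (n // 2)).split()   # ['#', '#', '#']

words = 'this is a test'.split()
words[:0] = paddinglist        # insert the padding before the first word, in place
words.extend(paddinglist)      # append the padding after the last word, in place
print(words)

# Pitfall: assigning a string to a slice inserts its individual characters.
chars = ['x']
chars[:0] = '# # '
print(chars)
```

The first print shows three hash marks on either side of the four words; the second shows hash marks and spaces as separate list items in front of 'x'.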

In the third section we create a tag cloud. Each term in the cloud has a hyperlink that looks like this:

<a style="text-decoration:none" href="#1660">1660</a>

Putting a hash mark in front of a string creates a relative link, a link to a place within the current web page. (Don't confuse this with our use of the hash mark as a padding character above. The two are completely unrelated.) Each of these links has a corresponding named anchor farther down the page. The anchor looks like this:

<a name="1660">1660</a>

We also automatically create a number of links which take you back to the top of the page. Each of them looks like this:

<a style="text-decoration:none" href="#">[back]</a>

At this point it is probably a good idea to spend a few minutes studying the HTML file that this program created automatically. It is named html-to-tag-cloud-kwic.html.