How to Write a Zotero Translator:
A Practical Beginners Guide for Humanists

By: Adam Crymble

 

Chapter 11: XPath Objects (Containers)


This Chapter

XPaths (the container)

In Chapter 5 you learned how to use Solvent to create nice concise XPaths (directions) to point to specific content on a webpage. You now have enough background in JavaScript to learn how to make the containers that will hold all the information in those nodes.

The XPath that was created in Chapter 5 with Solvent doesn't actually do anything; it's just directions. You're now going to learn how to use the XPath with a simple JavaScript Function to actually GET the information the XPath points to and store it.

For the most part, using XPath containers is very standardized. Very few lines of code will change from one XPath to the next.

You've practiced declaring Simple Variables in Chapter 6; now let's take a look at the most complex-looking Variable declaration you're going to deal with while writing a translator: an XPath Object. Use Solvent to get the XPath (directions) for the first column on the Sample Page — the one that contains the headings. Then shorten the XPath as much as possible. If you forget how to do this, check back to Chapter 5.

You are going to use this shortened XPath to declare an XPath Object named "myXPathObject".

Example 11.1

var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);

This may look extremely complex, but there are actually only two items here that will ever change when you declare these types of Variables:

The first is the name, "myXPathObject". This could have been any legal variable name.

The second is the XPath (directions), '//td[1]' in this case. Everything else, for your purposes, will always remain the same.

If the XPath (directions) is quite long, the code may look cleaner if you put the directions in a Variable and use that Variable in the Object declaration.

For example:

Example 11.2

var myXPath = '//div[@id="Content"]/div/table[@class="Bibrec"]/tbody/tr/td[1][@class="Label"]';

var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);

Notice that when you replace the directions in the Object declaration with a Variable containing the directions, you remove the quotation marks from the Object declaration. Instead, the single quotes go around the directions in the Simple Variable declaration, since in essence the XPath directions are a String.

Example 11.3

var myXPath = '//div[@id="Content"]/div/table[@class="Bibrec"]/tbody/tr/td[1][@class="Label"]';

var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);

Assuming your XPath directions were correct, you have now created an Object that will capture all the nodes that match your XPath when this line of code is executed by Scaffold.

Namespace

As a human reader, you can use context to differentiate between words that look the same but have different meanings. For example, "I watch my watch while I'm on watch." Computers can't differentiate based on context, and XPaths can run into this problem with namespaces, so you have to add a few lines of JavaScript code to ensure no conflict occurs. You can read more about namespaces at W3Schools.

Ensure the following block of code is included, exactly as it appears, at the top of every Function in which you have an XPath (container). I'll remind you again later when you start assembling your translator.

Example 11.4

var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
    if (prefix == 'x') return namespace; else return null;
} : null;
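To see what this block produces, the sketch below runs the same logic against stand-ins for "doc". These stand-ins are plain objects carrying only the documentElement.namespaceURI property, an assumption made so the code can run outside Scaffold. The names makeResolver, xhtmlDoc and plainDoc are hypothetical, and console.log stands in for Zotero.debug().

```javascript
// Same logic as Example 11.4, wrapped in a function (hypothetical name)
// so it can be tried on more than one document.
function makeResolver(doc) {
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == 'x') return namespace; else return null;
    } : null;
    return nsResolver;
}

// Stand-ins for "doc": plain objects, not real web pages.
var xhtmlDoc = { documentElement: { namespaceURI: 'http://www.w3.org/1999/xhtml' } };
var plainDoc = { documentElement: { namespaceURI: null } };

var resolverA = makeResolver(xhtmlDoc);
var resolverB = makeResolver(plainDoc);

console.log(resolverA('x'));   // the page's namespace
console.log(resolverA('y'));   // null: only the prefix 'x' is recognized
console.log(resolverB);        // null: the page has no namespace, so no resolver is made
```

When the page declares no namespace, nsResolver is simply null, which is why the same block can be pasted into every Function unchanged.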

Practice

In Scaffold, declare a Function "detectWeb" and pass it the Arguments (doc, url). Then add the namespace code to prevent conflicts, and finally the XPath Object declaration. It should all look like this:

Example 11.5

function detectWeb(doc, url) {
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == "x") return namespace; else return null;
    } : null;

    var myXPath = '//td[1]';
    var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);
}

Recall that in Chapter 10 I mentioned that "detectWeb" does not need to be called, unlike other Functions.

Execute the code above and it will run automatically. Assuming Firefox and Scaffold are looking at the Sample 1 page, your myXPathObject now contains several pieces of information. However, unlike other Variables, you cannot simply use Zotero.debug() to see what an XPath Object contains. There are a couple of steps you must first take.

The first thing to know is that unlike other JavaScript Objects or Arrays, you cannot look at any item in the XPath object you like; you have to cycle through them in order, using the iterateNext() XPath Method. Once you have cycled past an item, it is gone. This is without a doubt inconvenient. The best solution is to use a Loop to save each item in the XPath to another JavaScript Object or Array.
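This one-way behavior can be sketched with a stand-in object. The helper makeFakeXPathObject is hypothetical and not part of Zotero or Firefox; it only imitates the iterateNext() Method so the code runs without a webpage. The texts are three of the headings from the Sample 1 page, and console.log stands in for Zotero.debug().

```javascript
// Hypothetical stand-in for an XPath Object: hands out each node once, in order,
// then returns null, which is how iterateNext() signals that nothing is left.
function makeFakeXPathObject(texts) {
    var position = 0;
    return {
        iterateNext: function() {
            if (position >= texts.length) return null;
            return { textContent: texts[position++] };
        }
    };
}

var myXPathObject = makeFakeXPathObject(['Title:', 'Principal Author:', 'Imprint:']);

var first = myXPathObject.iterateNext().textContent;    // 'Title:'
var second = myXPathObject.iterateNext().textContent;   // 'Principal Author:'
var third = myXPathObject.iterateNext().textContent;    // 'Imprint:'
var fourth = myXPathObject.iterateNext();               // null: every item has been used up

console.log(first, second, third, fourth);
```

There is no way to step back to 'Title:' once it has been passed, which is why copying each item into an Array or Object as you go is worthwhile.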

Before you attempt to save the contents to another Variable, the second thing to know is that you must decide what information you want from the XPath Object. Each node it captures carries its text and, if the node has them, its link and image source. Since Zotero mainly deals with bibliographic information, 99% of the time you will want the text. To get it, append .textContent after .iterateNext(). If you were interested in the link, you would change .textContent to .href. Likewise, you could use .src if what you were looking for was the URL of an image. This can be useful if the only consistent way to distinguish between an entry for a book and an entry for an audio recording is an icon that appears somewhere on the page.
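A minimal sketch of the three choices follows. It uses plain objects as stand-ins for the nodes an XPath Object would return, an assumption made so the code runs without a page. The text and URLs are invented for illustration, and the rule that an icon URL containing "book" means a book is likewise hypothetical.

```javascript
// Hypothetical nodes: a link and an image. Real nodes come from iterateNext().
var linkNode = { textContent: 'View record', href: 'http://example.org/record/1' };
var imageNode = { textContent: '', src: 'http://example.org/icons/book.gif' };

var text = linkNode.textContent;   // the visible text
var link = linkNode.href;          // where the link points
var icon = imageNode.src;          // the URL of the image

// An icon URL can separate item types when nothing else on the page does.
var itemType = (icon.indexOf('book') != -1) ? 'book' : 'audioRecording';

console.log(text, link, icon, itemType);
```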

Here are a few examples for you to try that illustrate how to extract information from XPath Objects.

Single Item

If you know your XPath only contains one item or that you are only interested in the first item, you can extract the information in the same line in which you declare the XPath Object:

Example 11.6

var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext().textContent;
Zotero.debug(myXPathObject);

myXPathObject is now equivalent to a Simple Variable holding, in this case, "Title: ". This is because you have performed the method iterateNext() in the Variable declaration and asked for the text held within that node using textContent.
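If the XPath matches nothing, iterateNext() returns null, and asking null for its textContent raises an error. A more cautious version splits the line in two and checks the node first. The sketch below uses a stand-in object in place of the real result of doc.evaluate() so that it runs outside Scaffold; emptyXPathObject, firstNode and firstText are hypothetical names.

```javascript
// Stand-in for an XPath Object whose XPath matched no nodes (assumption for illustration).
var emptyXPathObject = {
    iterateNext: function() { return null; }
};

// Take the first node, then read its text only if a node was actually found.
var firstNode = emptyXPathObject.iterateNext();
var firstText = firstNode ? firstNode.textContent : null;

console.log(firstText);   // null, rather than an error
```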

While Loops

This is the best way to extract multiple items from the same XPath Object. Add the following after the XPath Object declaration that was originally provided in Example 11.5:

Example 11.7

var items = new Array();
var headers;

while (headers = myXPathObject.iterateNext()) {
    items.push(headers.textContent);
}
Zotero.debug(items);

This While Loop uses a Simple Variable, "headers", to hold each subsequent item in the XPath Object. As long as there is another item, the Loop will continue to run. Inside the Loop, the textContent of the item is pushed into the Array "items". Once the Loop is completed, Zotero.debug() will show the list of items now held in the "items" Array. This Array is much easier to manipulate than the XPath Object itself.

For Loops

A For Loop needs more setup than a While Loop, but it can often give you more control. It requires an extra step: adding another Variable that stores the number of items in the XPath Object and then acts as the limit for the counter in the For Loop.

Example 11.8

var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);
var counter = doc.evaluate('count (//td[1])', doc, nsResolver, XPathResult.ANY_TYPE, null);
Zotero.debug(counter.numberValue);

var items = new Array();

for (var i = 0; i < counter.numberValue; i++) {
    items.push(myXPathObject.iterateNext().textContent);
}

Zotero.debug(items);

This should give you the exact same result as the While Loop. Notice that in Examples 11.7 and 11.8, iterateNext() and textContent can be used on the same line of code, or split into two different lines depending on need. When using "count", the quoted string must contain the full text of the XPath. Writing the name of a Variable (such as "myXPath") inside the quotation marks will cause the process to return a count of 0 rather than the true value.
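The count of 0 arises when the Variable's name is written inside the quotation marks, because XPath reads it as the name of an element to look for, not as a JavaScript Variable. One way around this, sketched here as a suggestion rather than the method used elsewhere in this guide, is to join the Variable into the count expression with the + operator so that its contents end up inside the string. The names wrongExpression and countExpression are hypothetical.

```javascript
var myXPath = '//td[1]';

// The variable name inside the quotes is read by XPath as an element called "myXPath".
var wrongExpression = 'count(myXPath)';

// Joining the strings places the actual directions inside count().
var countExpression = 'count(' + myXPath + ')';

console.log(wrongExpression);   // count(myXPath)
console.log(countExpression);   // count(//td[1])

// In Scaffold, the joined string would then be passed to doc.evaluate() as in Example 11.8:
// var counter = doc.evaluate(countExpression, doc, nsResolver, XPathResult.ANY_TYPE, null);
```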

Into an Object

As mentioned in Chapter 6, it is best to use Objects to organize your bibliographic data, since this allows you to give each piece of data a descriptive name. In this example, the XPath Object contains those descriptive names, so it makes sense to use these as the headers of the Object. This is only a slight change from using an Array. Since the While Loop is the cleanest way to extract data from an XPath Object, we'll use a While Loop.

Example 11.9

var items = new Object();
var headers;

while (headers = myXPathObject.iterateNext()) {
    items[headers.textContent] = '';
}
Zotero.debug(items);

This will give you a JavaScript Object, "items", that holds no information yet but has labeled fields that can be filled in later. In this case, the debug should look like this:

Example 11.10

12:00:00 'Title:' => ""
'Principal Author:' => ""
'Imprint:' => ""
'Subjects:' => ""
'' => ""
'ISBN-10:' => ""
'Collection:' => ""
'Pages:' => ""

And that looks like half of a bibliographic table to me.

Practice

Use Solvent to capture the rest of the information on the Sample 1 page. Create an XPath Object that will hold that information and then use a While Loop to transfer it to an Array.

Matching the 2 XPath Objects Together

After completing the practice problem, you should now have two XPath Objects: one containing the headers on the Sample 1 page, and one containing the bibliographic information. You can combine these into a well-formatted JavaScript Object using the same principles as in Example 11.9.

Example 11.11

var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);

var items = new Object();
var headers;
var contents;

while (headers = myXPathObject.iterateNext()) {
    contents = myXPathObject2.iterateNext().textContent;
    items[headers.textContent] = contents;
}
Zotero.debug(items);

This should debug as follows:

12:00:00 'Title:' => "Method and Meaning in Canadian Environmental History"
'Principal Author:' => " Alan MacEachern; William J. Turkel "
'Imprint:' => "
[Toronto:
Nelson Canada,
2009] "
'Subjects:' => "Environment"
'' => "Tables."
'ISBN-10:' => "0176441166"
'Collection:' => "None"
'Pages:' => "
573
"

Congratulations, you have now extracted data from a webpage and saved it into a well-organized JavaScript Object. This is a major step in understanding how to create a Zotero translator. What remains is essentially a process of cleaning up the data and saving it to Zotero. You may notice that some information from the webpage is missing: specifically, "History" and "Methodology", which should appear under "Subjects:". You will learn how to fix this problem in Chapter 16 when creating a real Zotero translator.
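As a small preview of that cleanup, the sketch below removes the stray spaces and line breaks visible in the debug output above. It is only one possible approach, not necessarily the one used in later chapters. The helper cleanText is a hypothetical name, the sample values are copied from the debug output, and console.log stands in for Zotero.debug().

```javascript
// Hypothetical helper: collapse runs of whitespace (including line breaks)
// into single spaces, then strip the space left at either end.
function cleanText(text) {
    return text.replace(/\s+/g, ' ').replace(/^ | $/g, '');
}

// A few values as they appear in the debug output above.
var items = {
    'Principal Author:': ' Alan MacEachern; William J. Turkel ',
    'Imprint:': '\n[Toronto:\nNelson Canada,\n2009] ',
    'Pages:': '\n573\n'
};

// Copy each field into a new Object, cleaning the value on the way.
var cleaned = new Object();
for (var header in items) {
    cleaned[header] = cleanText(items[header]);
}

console.log(cleaned);
```

The headers are left untouched here; only the values are cleaned, so the fields keep the names they were given on the page.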

Practice

Create two XPath Objects of your own using Solvent on a website of your choosing and save the contents of those XPath Objects to a new Object. It is best to look for a page that contains a table; archives and libraries are a good bet.

What you should understand before moving on

Further Reading