How to Write a Zotero Translator:
A Practical Beginners Guide for Humanists

By: Adam Crymble

 

Chapter 15: Scraping the Search Results Page: doWeb Function

This Chapter

This chapter builds upon the translator started in Chapter 13. If you have already created the translator you can find it again by opening Scaffold and clicking on the "Load From Database" icon. You can then scroll through the translators until you find yours.

If your site has a search results page with the same domain as the rest of the pages on the site, this chapter will explain how to scrape the titles and links off this page so users can select which items they would like to save. If your site does not have one of these pages, skip ahead to the last section, "No / Improper Search Results Page". All work in this chapter should be done in the "Code" tab of Scaffold.

Scraping the Search Results Page (Tutorial)

The search results page is often the easiest to scrape, since very few elements change from translator to translator. A little bit of quick success is always a good way to start a project. This section also contains the structure upon which the rest of the translator is built.

This code will be put in a Function named "doWeb". Like "detectWeb", this name is mandatory for Zotero. Once completed, it is this Function that creates the pop-up box allowing users to select any or all items to save to Zotero. Luckily for us, the pop-up itself is handled behind the scenes by other parts of the program.

This Function involves six basic steps.

Step 1: Declare a function "doWeb"

The Function name is important in this case. Since you will be using XPaths, you also need to include the namespace code. In addition, there are a few Variables you are going to need in a moment, so you might as well declare them now. I've given these Variables the names "articles", "items" and "nextTitle" respectively, but you can give them any legal name you like.

Example 15.1

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;
}

Step 2: Check the Current Page to see if it is Search Results

Use an If Statement to check if the viewer is looking at a search results page.

The detectWeb Function can already do this for us. Rather than rewrite the code, simply "call" the detectWeb Function and use an If Statement to test if the return value was "multiple". The Function "doWeb" has no idea what "detectWeb" has detected. The two Functions run independently of one another. This is why we must rerun the detectWeb Function. Don't forget to include the Arguments (doc, url) when you call detectWeb.

Example 15.2

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {

	}
}

Step 3: Use an XPath to Grab Titles & Links

Use Solvent to create an XPath that points to the article title. In the Sample, the title and link to the article are located in the same node. This allows you to grab both with only one XPath.

Shorten the XPath as much as possible. Do not worry if the XPath would also capture nodes on your "book" page, or on any other page. The If Statement created in the last step ensures this code only runs when the viewer is looking at the search results page.

The XPath in this example should be:

Example 15.3

'//td[2]/a';

The "a" at the end of the XPath tells you that this node includes a link.

Use this XPath to declare an XPath Object, "titles", which will hold the nodes, and add the declaration inside the If Statement.

Example 15.4

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {
		var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
	}
}

Step 4: Use a While Loop to Save the Items to an Object

Currently, the titles and corresponding links are stuck in an XPath Object. As learned in Chapter 11, you can only get those out in order, and only one at a time. This step puts them into the "items" Object to make them easier to work with. This is not strictly necessary as you can save directly into Zotero, but it is good practice for later when you go to scrape individual entries.

This While Loop should appear directly after the XPath Object declaration, inside the If Statement.

Example 15.5

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {
		var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
		while (nextTitle = titles.iterateNext()) {
			items[nextTitle.href] = nextTitle.textContent;
		}
	}
}

This Loop gives the Variable "nextTitle" the value of the next item being held in our "titles" XPath Object. This Loop will run as long as the XPath contains another item; once you get to the end, it will stop. This makes it very useful since we want to save each item from the XPath into Zotero. On the first iteration through the While Loop, the contents would be the first title and link. Because we did not specify whether we want the link (href) or the text (textContent), the "nextTitle" will automatically contain both. You can add Zotero.debug() inside or after the While Loop if you would like to see how the titles and links are put into the items Object.
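For example, a quick way to watch the Loop at work is to add a debug line inside it. This is purely for inspection, not a required part of the translator, and can be removed once you are satisfied:

while (nextTitle = titles.iterateNext()) {
	items[nextTitle.href] = nextTitle.textContent;
	// For inspection only: prints each link and title pair to the Scaffold output pane
	Zotero.debug(nextTitle.href + " : " + nextTitle.textContent);
}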

The items Object will follow the basic pattern:

Example 15.6

items["URL 1"] = "Title 1";

Zotero has a predefined Method that you must use to pass the information in the items Object to the program. Zotero uses this Method to compile the list that appears when a user clicks on the folder icon in the address bar; the Method itself is part of the behind-the-scenes code. To get the information into the program, add this line of code after the While Loop has closed, but still inside the If Statement:

Example 15.7

items = Zotero.selectItems(items);

Save your work and click Execute. Assuming everything went well, the "Test Frame" should read "Translation Successful" and a popup window should appear with a list of the titles and checkboxes. If this did not appear, make sure Scaffold is currently looking at the Search Results page.

Step 5: Save the contents of that Object into Zotero.

Zotero expects the links to be provided in an Array, not an Object. Therefore, we have to put the links into the Array "articles" that we created. To do this, we use a For Loop, still inside the If Statement, since all of this should run only if the current page is a search results page.

Example 15.8

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {
		var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
		while (nextTitle = titles.iterateNext()) {
			items[nextTitle.href] = nextTitle.textContent;
		}
		items = Zotero.selectItems(items);
		for (var i in items) {
			articles.push(i);
		}
	}
}

This completes the code needed for the search results page. It saves a list of the URLs of all the items that the user has asked Zotero to save.
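If you would like to confirm this, a purely illustrative check is to add Zotero.debug(articles); just after the For Loop. The output should resemble the following (the URLs here are invented):

// Hypothetical debug output after the user ticks two checkboxes:
// ["http://www.example.com/entry1.html", "http://www.example.com/entry2.html"]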

A few more lines do the same thing for the single entry pages. Add an Else clause to the If Statement; it will run for all other pages on the site, which in this case means the single entry pages. There is no need for an XPath here: on a single entry page it is the current URL you are concerned with, so it can be placed directly into the articles Array.

Example 15.9

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {
		var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
		while (nextTitle = titles.iterateNext()) {
			items[nextTitle.href] = nextTitle.textContent;
		}
		items = Zotero.selectItems(items);
		for (var i in items) {
			articles.push(i);
		}
	} else {
		articles = [url];
	}
}

Step 6: Tell Zotero you are done.

The last step for this stage is to call some built-in functions that tell Zotero the translator is complete. It is not actually complete, since you have not scraped the bibliographic information yet, but without all of the necessary structural elements the translator will crash when you test it, so it is best to put them in now.

Example 15.10

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {
		var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
		while (nextTitle = titles.iterateNext()) {
			items[nextTitle.href] = nextTitle.textContent;
		}
		items = Zotero.selectItems(items);
		for (var i in items) {
			articles.push(i);
		}
	} else {
		articles = [url];
	}

	Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
	Zotero.wait();
}

"scrape" is what you will call the Function created in the next chapter. This is the Function that will collect, organize and store the bibliographic information. You can call this Function whatever you like, however if you do decide to change it, make sure you change the "scrape" Argument in the Zotero.Utilities.processDocuments line above to reflect the new name you choose.

Save your work and relaunch Firefox and Scaffold. You should now be able to click on the Folder Icon in the address bar on the Search Results page and select from the items.
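If you run the translator before writing the scrape Function and Scaffold reports an error because "scrape" does not yet exist, one option is to add a temporary stub like the one below. It is only a placeholder so the processDocuments line has something to call; the next chapter replaces it with the real Function.

// Temporary placeholder; the real scrape Function is written in the next chapter
function scrape(doc, url) {
	Zotero.debug("scrape was called for: " + doc.location.href);
}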

Once you have this working, if you are only following the tutorial you may now move on to the next chapter.

The doWeb Template

The following code works for the vast majority of websites, so you can use it as a template. For websites with a search results page on which a single XPath can capture both the title and the link of an item, there is only one line in the whole Function that ever needs changing: the XPath (the directions).

Example 15.11

function doWeb(doc, url) {
	var namespace = doc.documentElement.namespaceURI;
	var nsResolver = namespace ? function(prefix) {
		if (prefix == 'x') return namespace; else return null;
	} : null;

	var articles = new Array();
	var items = new Object();
	var nextTitle;

	if (detectWeb(doc, url) == "multiple") {
		var myXPath = xxxxx;
		var titles = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);
		while (nextTitle = titles.iterateNext()) {
			items[nextTitle.href] = nextTitle.textContent;
		}
		items = Zotero.selectItems(items);
		for (var i in items) {
			articles.push(i);
		}
	} else {
		articles = [url];
	}

	Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
	Zotero.wait();
}

Sites that do not meet the criteria often only need one or two additional lines of code. This will be discussed below.

Scraping the Search Results Page (Advanced)

Not all pages are created as cleanly as the Sample 1 page. Here are some solutions to common problems when trying to scrape a search results page.

The Title and Link to the Single Entry Page are not stored in the same node

If you were not lucky enough to have a search result page that let you grab all the necessary information in one XPath, you might need to use a second XPath.

Some pages are formatted to give the title and the link in separate locations. For example:

Example 15.12

a) My Article
This article is about dogs that chase cats. [read full text].

This is not terribly difficult to account for. You can still use the six steps above, except instead of one XPath, you will use two; one captures the title, and one captures the link embedded in [read full text]. Use Solvent to get the XPaths and instead of creating one XPath Object, create two.
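Assuming, purely for illustration, that Solvent gave you '//p/b' for the titles and '//p/a' for the [read full text] links, the two declarations might look like this (use the XPaths Solvent generates for your own page):

// Hypothetical XPaths; replace them with the ones Solvent gives you
var titles = doc.evaluate('//p/b', doc, nsResolver, XPathResult.ANY_TYPE, null);
var link = doc.evaluate('//p/a', doc, nsResolver, XPathResult.ANY_TYPE, null);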

When you get to step four, change the While Loop to save from both XPath Objects rather than from only one. (This assumes you have named your second XPath Object "link"; if not, change "link" to whatever name you have used instead.)

Example 15.13

var nextLink;
while (nextTitle = titles.iterateNext()) {
	nextLink = link.iterateNext();
	items[nextLink.href] = nextTitle.textContent;
}

This essentially does the same thing as the first example. Use Zotero.debug() to see the contents of each Variable as the While Loop iterates so that you can make sure the information is going to where you expect.

The search results page links to articles with different domains

Some sites split their content across several different domains, which they market separately. The Toronto Star, for instance, spreads its content across five domains, all of which are searchable from the Toronto Star search page. This is convenient for their customers but inconvenient when writing a translator. As mentioned, you cannot cross domains using JavaScript, for security reasons. If your search results page includes items from different domains, the easiest solution is to leave them off the list given to users. If you do not, users who try to save those items will get an error message and will likely become frustrated with Zotero.

The easiest way to leave the items off is to use an If Statement just before putting the entries into the "items" Object.

Example 15.14

if (detectWeb(doc, url) == "multiple") {
	var myXPath = xxxxx;
	var titles = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);
	while (nextTitle = titles.iterateNext()) {
		if (nextTitle.href.match(xxx)) {
			items[nextTitle.href] = nextTitle.textContent;
		}
	}
	items = Zotero.selectItems(items);
	for (var i in items) {
		articles.push(i);
	}
}

Replace the xxx with the target you used on the Scaffold Metadata tab. This way, if the link doesn't match the proper domain, Zotero will skip over it. It's not a perfect solution, but the only other way around this problem is to create separate translators for each domain found on the search results page, and have Zotero call the appropriate translator as required. This is much more work and is exactly why Google and Yahoo do not yet have Zotero translators.
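To make the domain check concrete: if the single entry pages all lived at a hypothetical www.example.com, the filter might look like this.

// "example.com" is a stand-in; use the target from your Metadata tab
if (nextTitle.href.match("example.com")) {
	items[nextTitle.href] = nextTitle.textContent;
}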

The title includes unwanted information

If the title includes extra characters, you can easily clean them away. The title has already been saved into a Variable; in the examples above it is nextTitle.textContent. You can use String Methods (Chapter 7) and RegExps (Chapter 12) to change the title to anything you like. Use an If Statement to check whether a title needs changing, or simply change all titles, depending on your needs. Do this just before saving the String into the "items" Object. A sketch of one approach appears below; more examples of cleaning Strings of unwanted characters can be found in the next chapter.
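As an illustration, suppose each title ended with the unwanted text " [read full text]". One way to strip it, just before saving into the items Object, would be the following (the pattern is hypothetical; adjust it to whatever clutter your own titles contain):

// Hypothetical clean-up: remove a trailing " [read full text]" plus any stray whitespace
var cleanTitle = nextTitle.textContent.replace(/\s*\[read full text\]\s*$/, "");
items[nextTitle.href] = cleanTitle;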

No / Improper Search Results Page

If your site does not have a search option, or the search results page uses a different domain from the single entry pages, you will likely have to forgo the "multiple" option for your translator. To do this, simply remove all the code from the "doWeb" template that refers to the search results pages, and make sure that your detectWeb Function does not have a "multiple" option.

Example 15.15

function doWeb(doc, url) {
	var articles = new Array();

	articles = [url];

	Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
	Zotero.wait();
}

Save your work, relaunch Firefox and Scaffold and move on to the next chapter.