Chapter 14: Scaffold DetectWeb Tab

This Chapter

DetectWeb
Determining the Content Types (Tutorial)
Determining the Content Types (Advanced)
The Pages You Don't Want to Scrape
Filling in detectWeb (Tutorial)
Filling in detectWeb (Advanced)
Further Reading

This chapter builds upon the translator started in Chapter 13. If you have already created the translator you can find it again by opening Scaffold and clicking on the "Load From Database" icon. You can then scroll through the translators until you find yours.

DetectWeb

Click on the Detect Code tab in your Scaffold window. It should be blank. This is where you will do your JavaScript coding. The code in this section of the translator will run as soon as the program has determined that the "target" from the last section matches the current page. Zotero determines what type of content is contained on that page (whether it is a book, journal article, search results page, etc.) based on a series of If Statements constructed by you. The proper icon (a book, newspaper, document, folder and so on) is then placed in the address bar for users to click on to grab the citation.

But, before you start writing code, you need to figure out exactly what you want to write.

Determining the Content Types (Tutorial)

The Sample Pages in this tutorial only contain books and a search results page. Technically, this is only one content type (book), but it requires two icons: a book icon for the individual book entry and a folder icon for the search results page, which allows users to save any or all entries at the same time.

To write this section you will have to determine, in plain English, what conclusively tells you, as a human reader, whether you are looking at an individual entry or a search results page. These criteria can be anything as long as they are consistently true (or false) across all pages. In the case of the Sample Pages there are quite a few good clues.

On the single entry, I note the following that is not true for the search results page:

The title of the page (found at the top) includes "Single Item" on all pages
The URL contains a substring "sample"
The title of the book appears in green font, made by an <h1> tag
The table containing bibliographic data always contains a field labeled "Title: "
The attribute connected to the table node in the XPath is: [@class="Bibrec"]

On the search results page, I note the following:

The title at the top of the page includes "Search Results"
The URL contains a substring "searchresults"
The heading Search Results: "Environmental History" is made up of an <h2> tag
A subheading appears which includes "Displaying results" in an <h3> tag
The attribute connected to the table node in the XPath is: [@class="thick"]

There are a few more consistent differences, but that's more than enough to differentiate the two types. If you are doing the Sample Tutorial, feel free to skip down to "Filling in detectWeb". If you are working on a translator for a different site, here are a few more tips:

Determining the Content Types (Advanced)

Before you start anything, poke around your database extensively and write down a list of all types of content you find. Be thorough or your users who discover a content type you missed will have problems using your translator. Many repositories contain multiple content types such as books, audio recordings, video recordings, journal articles and blog posts. You may be tempted only to concern yourself with the most prevalent content type, but to ensure maximum usability, this is discouraged.

Once you have completed your list, jot down some notes beside each content type that tell you, as a human reader, how you can tell each apart. The fact that you were able to differentiate the content types to make the original list means there must be a few things that tipped you off. Here are a few places to look that are often fruitful:

Title of the Page
A character or String in the title of the page (top left corner of the Firefox window). Often the long dash ("—") or ("|") characters are useful in this case. Maybe you notice that on an individual journal article the long dash always appears as the tenth character in the document title. Write this down. This often works well for search results pages and is very easy to make use of when coding.
URL
A substring in the URL. Does "article" always appear on your single newspaper article pages? If so, this would be good to note. The URL can be very useful to distinguish multiple types. However, it can be difficult to ensure that it is consistent, especially for URLs that end in .jsp or .php or for URLs that contain long strings of seemingly garbled characters. URLs are very easy to use when coding.
In some cases, the URL is made up of elements that show us where we are in the site. For example, the Toronto Star website:

The Homepage:
http://www.thestar.com

The News Section:
http://www.thestar.com/News

A News Subsection:
http://www.thestar.com/News/GTA

An article in that subsection:
http://www.thestar.com/News/GTA/article/482613

Each step away from the homepage adds an easy to understand segment to the URL. By knowing the way the site generates its URL, you can easily make an If Statement that will check whether you are looking at an article.
A Headline
Does the headline only appear in <h3> tags on the search results page? Perhaps it always contains the String "Results: ". Using headlines if you can is one of the better techniques for differentiating content and is fairly easy to use when coding.
An Attribute
Use Solvent to grab some XPaths. Web designers add Attributes (i.e. colours, fonts, styles) to nodes to determine how they look. Perhaps you will notice an Attribute connected to a node that only shows up on your video recording page. Chances are if the node looks unique to your eye there's an Attribute to be found. If you're struggling to find an Attribute, try taking a look at any menus there might be. Often items will be bolded or in a different colour to reflect where you are in the site. Generally this is highly consistent. This only requires a few extra lines when coding.
An Image
Is there a little book icon on the pages that contain book entries? Is it always there? This can serve as an indicator and if it's available it's by and large highly consistent. This requires a few extra lines when coding.
A Link
This could be a "Next Entry" link, or a link that says "Click here to see full record." Either way, as long as it's consistent and can help you differentiate between content types, or search results pages, it's good. This requires a few extra lines when coding.
An XPath
If an XPath will only pick up nodes on a certain type of content page, this is a good criterion. This is particularly helpful for search results pages which are generally formatted considerably differently than single entry pages. Try creating an XPath that will grab all the titles found on a search results page. This requires a few extra lines when coding.
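
As a rough illustration of the URL approach, the check below tests whether a URL ends in an "article" segment followed by a number. The function name isArticleUrl and the exact pattern are my own assumptions, based only on the four Toronto Star URLs shown above; a real site may need a looser or stricter pattern.

```javascript
// Hypothetical helper: the name and the pattern are assumptions drawn
// from the four example URLs, not from any documented URL scheme.
function isArticleUrl(url) {
    // True when the URL ends with "/article/" followed by digits.
    return /\/article\/\d+$/.test(url);
}

console.log(isArticleUrl("http://www.thestar.com/News/GTA/article/482613")); // true
console.log(isArticleUrl("http://www.thestar.com/News/GTA"));                // false
```

The same test could later sit inside an If Statement in detectWeb.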

In 99% of websites, there IS a consistent difference. You may just have to think outside the box a little to find it. Ask someone else around you if you are stuck; you might just be overlooking something obvious.

Before moving on, you should have a list of all content types that can be found in your repository, along with criteria, written in plain English, which can definitely distinguish between the pages containing each content type. You certainly don't have to check every single entry in a database, but do check enough that you can be confident you are choosing reliable criteria. Twenty pages are plenty.

Tip:

If your site isn't quite so obvious with its title, that doesn't mean it doesn't include some useful characteristics. The Toronto Star's newspaper article titles don't say anything as clear as "Search Results", but they do follow a particular format.

"TheStar.com | World | Italian priest to organize 'beauty contest' for nuns"

No, I didn't make that title up. And at first glance, it may not look useful. "Article" doesn't appear, and we can't just use "World" as our criterion because the front page of the "World" section also contains this name in its title, and we can't scrape that. However, the title does include two "|" characters; something that is unique to articles.

Take a look at the titles on your webpage. Do they provide enough information to use? Be careful when using the title as your criterion, as it is not always the most consistently formatted aspect of a webpage. If you have another option to choose, it is probably best to take it.
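
Counting the "|" characters can be sketched like this. The helper names are hypothetical, and the section front page title used for comparison is a guess at what such a title might look like, not one copied from the site.

```javascript
// Hypothetical helper: counts how many "|" characters a title contains.
function countPipes(title) {
    // Splitting on "|" yields one more piece than there are separators.
    return title.split("|").length - 1;
}

// Assumption from the tip above: article titles contain exactly two "|".
function looksLikeArticleTitle(title) {
    return countPipes(title) === 2;
}

var articleTitle = "TheStar.com | World | Italian priest to organize 'beauty contest' for nuns";
var sectionTitle = "TheStar.com | World"; // assumed format of a section front page

console.log(looksLikeArticleTitle(articleTitle)); // true
console.log(looksLikeArticleTitle(sectionTitle)); // false
```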

The Pages You Don't Want to Scrape

Not all pages should be recognized by Zotero. You don't want your users to be able to scrape a splash page, or a contact form since there isn't likely any bibliographic information on the page. Therefore, seek out these pages on your site and make sure that the criteria you wrote down for each entry type will not also be true for the pages you do not want Zotero to use. Your list should contain criteria for each content type that will only be true (or false) when looking at that particular type of entry.

If the only criterion you had that distinguished your book pages from your journal article pages would also cause Zotero to consider the contact page a "book," fear not; there is a solution. Make a similar list that helps you distinguish between the book page and the contact page. Chances are this won't be difficult. You can then combine the two criteria later when you write the If Statement.

Example 14.1

if (doc.title.match("—") && !url.match("contact")) {
...
}

This will check if the title matches a long dash and the URL does not match the String "contact".
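
If you want to try this combined check outside Scaffold, a small stand-in object is enough. Everything below (the function name, the page titles and the example.com URLs) is made up for illustration; only the condition itself comes from Example 14.1.

```javascript
// "\u2014" is the long dash, written as an escape so the character
// survives whatever text encoding the file is saved in.
function looksLikeItemPage(doc, url) {
    return Boolean(doc.title.match("\u2014")) && !url.match("contact");
}

// Stand-in objects: only the title property is needed here.
var itemPage = { title: "Some Book \u2014 Example Library" };
var contactPage = { title: "Contact Us \u2014 Example Library" };

console.log(looksLikeItemPage(itemPage, "http://example.com/record/12"));  // true
console.log(looksLikeItemPage(contactPage, "http://example.com/contact")); // false
```

Both titles contain the long dash, so it is the URL check alone that keeps the contact page out.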

Filling in detectWeb (Tutorial)

In Chapter 10 you learned that Functions can be given any legal Variable name; however, translators are part of a much larger body of code that is kept out of sight. For this to work, Zotero must have certain pieces of information formatted in specific ways. The detectWeb Function is one of these. Therefore, you must create a Function named "detectWeb".

Example 14.2

function detectWeb(doc, url) {

}

Now, take the handwritten list of criteria you made in the last section and choose one that will be used in the Function to determine what type of page the user is viewing.

I am going to use the first item: the title contains either "Single Item" or "Search Results". You can choose another if you would like practice creating an If Statement of your own.

Using the String Methods from Chapter 7 (and possibly the RegExps from Chapter 12), turn your criteria into an If / Else Statement and add it to the Function.

Example 14.3

function detectWeb(doc, url) {
    if (doc.title.match("Single Item")) {

    } else if (doc.title.match("Search Results")) {

    }
}

Finally, you must tell Zotero what to do if either of those conditions proves true. You do this by returning a value.

Example 14.4

function detectWeb(doc, url) {
    if (doc.title.match("Single Item")) {
        return "book";
    } else if (doc.title.match("Search Results")) {
        return "multiple";
    }
}

This "return" is the value that is passed to Zotero if the above Statement is true and ultimately determines which Icon to display for the user to click on. The terms to use are part of a standard list, based on all the entry types Zotero can create. For a search results page, the correct term is "multiple"; for all other entry types, create the term using camelCase. For a book, you would write return "book"; and for a magazine article, return "magazineArticle";
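
To see these return values without Scaffold, you can call the Function yourself with stand-in objects. This is only a sketch: the three page titles are assumptions about what the Sample Pages might show, and a real doc object carries far more than a title.

```javascript
function detectWeb(doc, url) {
    if (doc.title.match("Single Item")) {
        return "book";
    } else if (doc.title.match("Search Results")) {
        return "multiple";
    }
    // Neither condition matched: nothing is returned, so no icon appears.
}

// Stand-in objects providing only the property detectWeb reads.
var singlePage = { title: "Sample Pages: Single Item" };
var resultsPage = { title: "Sample Pages: Search Results" };
var otherPage = { title: "Sample Pages: About" };

console.log(detectWeb(singlePage, "http://example.com/sample"));         // book
console.log(detectWeb(resultsPage, "http://example.com/searchresults")); // multiple
console.log(detectWeb(otherPage, "http://example.com/about"));           // undefined
```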

You can see all the entry types by opening Zotero in your browser, clicking on one of your saved entries (or creating a new one) and clicking on the drop-down box just under "View Snapshot."

Save your entry and click Execute. Depending on the page you are viewing, the test frame should show:

Example 14.5

12:00:00 detectCode returned type "multiple"

Or

12:00:00 detectCode returned type "book"

If you would like to see the Icon in the address bar, you will likely have to relaunch Firefox.

Filling in detectWeb (Advanced)

Select one criterion from your handwritten list for each content type present on your site and turn them into If / Else Statements. It is best to use If / Else Statements because they make the order of the checks explicit: the Function returns at the first condition that matches, so put your most specific checks first. (A series of individual If Statements that each store a value in a Variable, rather than returning straight away, would instead leave you with the last match.) If you notice when you execute your detectWeb function that an incorrect value is being returned, chances are one of your If Statements is not specific enough.
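
The difference between "first match" and "last match" can be sketched as follows, assuming the individual If Statements store their result in a Variable instead of returning immediately. The function names and the sample title are invented for this illustration.

```javascript
// Individual If Statements that assign to a variable: every test runs,
// so a later match overwrites an earlier one.
function lastMatchWins(title) {
    var type;
    if (title.match("Results")) { type = "multiple"; }
    if (title.match("Book")) { type = "book"; }
    return type;
}

// If / Else with returns: testing stops at the first condition that matches.
function firstMatchWins(title) {
    if (title.match("Results")) {
        return "multiple";
    } else if (title.match("Book")) {
        return "book";
    }
}

var title = "Book Search Results"; // matches both criteria

console.log(lastMatchWins(title));  // book
console.log(firstMatchWins(title)); // multiple
```

A title that satisfies two loosely written criteria is exactly the case where an unexpected type comes back, which is why tightening the criteria is usually the real fix.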

If you are having trouble, make sure you consult the String Method and RegExp chapters. Don't be afraid to use negative checks; for instance if something is NOT true, you can use ! along with your String Method.

Using the Title or the URL

If you are using the title of the page, you do so like this:

Example 14.6

if (doc.title...

If you are using the URL:

Example 14.7

if (url. ...

or

if (doc.location.href...

You can then add the relevant String Method and Arguments.

Using XPaths as Criteria

The rest of the time you will likely have to use an XPath Object. If this is the case it is always wise to use the namespace code.

Example 14.8

function detectWeb(doc, url) {
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == 'x') return namespace; else return null;
    } : null;

}

If you are testing for the presence of a node on the page, do so like this:

Example 14.9

if (doc.evaluate(myXPath, doc, null, XPathResult.ANY_TYPE, null).iterateNext()) {
    return...
}

This will test as true if the node found at myXPath exists on the current page.

Using the Contents of an XPath as Criteria

If you need to check what is contained in a particular node, simply add another method to the If Statement.

Example 14.10

if (doc.evaluate(myXPath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent.match("Search Results")) {
    return...
}

This will return true if the first node captured by this XPath Object matches "Search Results". If you need the 3rd node, try refining your XPath or following the steps outlined in Chapter 11 to extract XPath content to another Variable, then use that Variable in the If Statement instead. Don't forget you can substitute href or src for textContent if it is a link or an image that is contained in the XPath Object, and you can use RegExps as your Arguments for String Methods.

Example 14.11

if (doc.evaluate(myXPath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().src.match(/\d\daudio\.jpg/)) {
    return...
}

This would check if the URL location of the image found at myXPath matched two digits followed by "audio.jpg".
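
One thing to watch with checks like these: iterateNext() gives back null when the XPath finds nothing, and asking null for its textContent or src stops the Function with an error. A cautious version might fetch the node first and test it before reading from it. The sketch below is not part of Zotero; firstNode, isAudioPage and fakeDoc are hypothetical names, and fakeDoc only imitates the small part of a real document that the check touches so that the idea can be run outside a browser.

```javascript
// Outside a browser XPathResult does not exist, so fall back to a stand-in.
var XPathConstants = (typeof XPathResult !== "undefined") ? XPathResult : { ANY_TYPE: 0 };

// Hypothetical helper: the first node matched by xpath, or null if none.
function firstNode(doc, xpath) {
    return doc.evaluate(xpath, doc, null, XPathConstants.ANY_TYPE, null).iterateNext();
}

function isAudioPage(doc, xpath) {
    var node = firstNode(doc, xpath);
    // Test the node before reading src; the escaped dot matches a literal ".".
    return node !== null && /\d\daudio\.jpg/.test(node.src);
}

// Testing aid only: a fake document whose evaluate() yields the given nodes.
function fakeDoc(nodes) {
    return {
        evaluate: function () {
            var i = 0;
            return {
                iterateNext: function () {
                    return i < nodes.length ? nodes[i++] : null;
                }
            };
        }
    };
}

console.log(isAudioPage(fakeDoc([{ src: "http://example.com/img/12audio.jpg" }]), "//img")); // true
console.log(isAudioPage(fakeDoc([]), "//img"));                                              // false
```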

Multiple Criteria for one Item Type

Sometimes you need multiple criteria to ensure an item type will be properly isolated. The Glenbow Museum translator is an excellent example of how to do this:

Example 14.12

function detectWeb(doc, url) {
    if (doc.title.match("Library Main Catalogue Search Results") && doc.location.href.match("GET_RECORD")) {
        return "book";
    } else if (doc.title.match("Library Map Collection Search Results") && doc.location.href.match("GET_RECORD")) {
        return "map";
    } else if (doc.title.match("Library Main Catalogue Search Results") && !(doc.location.href.match("GET_RECORD"))) {
        return "multiple";
    } else if (doc.title.match("Map Collection Search Results") && !(doc.location.href.match("GET_RECORD"))) {
        return "multiple";
    }
}

In plain English, each branch checks whether the title matches x AND the URL does (or does not) match y.

You can also use the || operator to check if one OR the other statement is true.

Example 14.13

function detectWeb(doc, url) {
    if (doc.title == "Item Record" || doc.title == "Notice") {
        return "book";
    }
}

Search Results Pages

If your website can be searched and the search results page has a URL with the same domain as the rest of the pages on the site, as well as links to the articles found in the search, then you need to include a "multiple" content type in your detectWeb Function. If not, you'll have to skip this stage — or treat it like you would a contact form or other unscrapable page.

Testing Across Entries

Once you have a detectWeb Function which you believe works for all content types on your site, spend five minutes searching the site and ensuring that the correct content icon always appears. Make sure that pages which do not contain bibliographical data return nothing.

If neither "true" nor "false" appears in the bottom window when you click the Execute button your statement has not worked. Check your spelling and syntax — especially if you get an error. If you are having trouble, work on one entry type at a time and get that working before you move on. Often the Search Results page is the easiest.

Totally stuck? Make liberal use of Zotero.debug() to find out what is working and what is not. You can have more than one Zotero.debug() in your code at once. You can comment out a line of code by surrounding it with /* and */ so that the program will ignore it. This is particularly helpful when debugging to isolate problem code. You should be able to debug and figure out the exact line of code where something goes awry, since debug will tell you the value of a variable exactly at the point in the code where the debug appears. If the value is not what you expect it to be, something before that point went wrong.
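
A debugging pass might look something like this. Inside a translator Zotero.debug() is already available; the stand-in below exists only so the sketch can run outside Zotero, and the names Z and debugLog, the page title and the URL are all invented for the example.

```javascript
// Stand-in for running outside Zotero: collects and prints each message.
var debugLog = [];
var Z = (typeof Zotero !== "undefined") ? Zotero : {
    debug: function (msg) {
        debugLog.push(String(msg));
        console.log(msg);
    }
};

function detectWeb(doc, url) {
    Z.debug("title is: " + doc.title); // what are we actually matching against?
    Z.debug("url is: " + url);
    if (doc.title.match("Single Item")) {
        Z.debug("matched the single item branch");
        return "book";
    }
    /* Commented out while the first branch is being debugged:
    else if (doc.title.match("Search Results")) {
        return "multiple";
    }
    */
}

console.log(detectWeb({ title: "Sample Pages: Single Item" }, "http://example.com/sample")); // book
```

If the first two messages show a title or URL you did not expect, the problem lies in your criteria rather than in your syntax.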

Still stuck? Walk away for an hour or two and then come back to look at it afresh.

Go back through your String Method or XPath section to make sure you are creating the elements properly. Perhaps take a look at other String Methods that might better isolate the criteria in question. Copy and paste the code into Komodo Edit to make sure your syntax is correct. If all else fails, try posting your code and a question on the Zotero forums (last resort). The more practice you have, the easier it will be; the first one is always the most difficult.

Note: After creating the content criteria you will have to restart Firefox and Scaffold before any Icons will appear in the address window.

Make sure you save your work (you should have been doing this constantly) and move on to the "Code" tab. Congratulations, you have created your first working, useful Function!

Keep in mind that you need not use the match() Method. Perhaps you have noticed that the 10th character in the title is always a "|" on newspaper articles. In that case, your function would look like this:

Example 14.14

function detectWeb(doc, url) {
    // indexOf() counts from zero, so the 10th character is at index 9
    if (doc.title.indexOf("|") == 9) {
        return "newspaperArticle";
    }
}
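
One detail worth checking when you count characters: indexOf() starts counting at zero, so the tenth character of a String sits at index 9, and a String that does not contain the character at all gives -1. A quick sketch with a made-up title (the title and the helper name are assumptions for illustration):

```javascript
var title = "The Star | Toronto news"; // invented title; "|" is its 10th character

console.log(title.indexOf("|")); // 9
console.log(title.charAt(9));    // |

// Hypothetical helper wrapping the position test.
function pipeIsTenthCharacter(text) {
    return text.indexOf("|") === 9;
}

console.log(pipeIsTenthCharacter(title));            // true
console.log(pipeIsTenthCharacter("No pipe in here")); // false
```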

Another Example

You can find dozens of examples of working detectWeb Functions by using Scaffold. Click on the "Load from Database" button and load up a few different translators. See how they have done it and see if you can follow along with what they have created.

Here is an example of a detectWeb Function for a website that contains three different content types that were distinguished by a graphic which appears on the page, either of a paintbrush, a book or a CD. Since each graphic had a URL containing a descriptive word that allowed consistent distinction between the images, I was able to use this as my identifying criterion.

Example 14.15

function detectWeb(doc, url) {
    // Namespace resolver for pages that declare an XML namespace
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == "x") return namespace; else return null;
    } : null;

    // The image whose URL identifies the content type
    var xPath = '//td[2]/a/img';

    if (doc.evaluate(xPath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().src.match("paint")) {
        return "artwork";
    } else if (doc.evaluate(xPath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().src.match("book")) {
        return "book";
    } else if (doc.evaluate(xPath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().src.match("disc")) {
        return "audioRecording";
    }
}
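
Each branch above runs the same XPath query again. One possible tidy-up, offered only as a sketch, is to read the image URL once and look the keyword up in a small table. The names TYPE_BY_KEYWORD and typeFromImageSrc are hypothetical, as are the example.com image URLs; inside detectWeb you would pass in the src of the node found by the XPath.

```javascript
// Hypothetical lookup table: keyword in the image URL, then the item type.
var TYPE_BY_KEYWORD = [
    ["paint", "artwork"],
    ["book", "book"],
    ["disc", "audioRecording"]
];

function typeFromImageSrc(src) {
    for (var i = 0; i < TYPE_BY_KEYWORD.length; i++) {
        if (src.indexOf(TYPE_BY_KEYWORD[i][0]) !== -1) {
            return TYPE_BY_KEYWORD[i][1];
        }
    }
    // No keyword matched: return nothing so that no icon is shown.
}

console.log(typeFromImageSrc("http://example.com/images/paintbrush.gif")); // artwork
console.log(typeFromImageSrc("http://example.com/images/disc.gif"));       // audioRecording
console.log(typeFromImageSrc("http://example.com/images/logo.gif"));       // undefined
```

Adding a fourth content type then means adding one row to the table rather than another branch.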