Chapter 16: Scraping the Individual Entry Page: Scrape Function

This Chapter

Scraping the Individual Entry Page (Tutorial)
The Finished DoWeb Function
Submit the Translator

This chapter builds upon the translator started in Chapter 13. If you have already created the translator you can find it again by opening Scaffold and clicking on the "Load From Database" icon. You can then scroll through the translators until you find yours.

Scraping the Individual Entry Page (Tutorial)

This part of the lesson will take the bibliographic items from the Sample 1 page, organize the data and then save it into Zotero. This is the last step in coding a translator.

Strictly speaking, you could put all the remaining code into the "doWeb" Function, but this tutorial uses a more modular approach. Keeping the new code in its own Function makes it less likely that you will break parts of your code that already work.

If you are not already at the Sample 1 page with Scaffold open and your translator loaded, do so now.

Above "doWeb" declare a new Function "scrape" and pass it the Arguments doc and url. You will be using XPaths in this function so don't forget to include the namespace code.

Example 16.1

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;
}

You could also place this code below the "doWeb" Function. JavaScript reads all of the Function declarations in a file before it runs any of them, so in this case the order of your Functions is not important. If you prefer to put it underneath, feel free to.

There is one other step required for setup. You must declare a new Zotero.Item(). This Method, created specifically for use with Zotero, will create the file into which everything is saved. It is a good idea to declare this just after the namespace code. In Sample 1, the Argument for this Method is "book" since our bibliographic entry is a book. This could be any Zotero entry type, written using camelCase. To see the possible entry types, open Zotero and click on a saved entry (or create a new one). Click on the drop down menu just under the "View Snapshot" button and you can see a list of all entry types.

Example 16.2

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item("book");
}

This "newItem" works much like an Object. The descriptor contains the name of the bibliographic field (see a Zotero entry for a complete list). The contents are the piece of information you want to save. You can see how this works by saving the URL of the entry in the URL field.

Example 16.3

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item("book");
newItem.url = doc.location.href;
}

On the right side of the declaration, doc.location.href is the URL of the current page. On the left side, notice that instead of using newItem[url] the example uses newItem.url . This is known as "dot notation" and you have been using it every time you have used a string method. You do not need to understand the specifics. Just remember when adding an item to Zotero's newItem Variable (or whatever you have named it), you use dot notation rather than square brackets.
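If you would like to see the two notations side by side, the following sketch can be run outside Zotero. It uses a plain Object called "fakeItem", which is a hypothetical stand-in and not a real Zotero.Item, so treat it only as an illustration of how JavaScript reads each form.

```javascript
// "fakeItem" is a hypothetical stand-in for newItem; it is a plain Object,
// not a real Zotero.Item.
var fakeItem = {};

// Dot notation: the field name is written directly after the dot.
fakeItem.url = "/member-projects/zotero-guide/sample1.html";

// Square brackets with a quoted String reach the same field.
var viaBrackets = fakeItem["url"];

// Square brackets with a Variable use the Variable's contents as the field name.
var fieldName = "title";
fakeItem[fieldName] = "abc";

console.log(viaBrackets);    // /member-projects/zotero-guide/sample1.html
console.log(fakeItem.title); // abc
```

Notice that newItem[url] without quotes would use the contents of the "url" Variable as the field name, which is not what is wanted here.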

Example 16.4

newItem.title = "abc";
newItem.abstractNote = "something interesting";

To make the translator work, add newItem.complete(); just inside the bottom of the "scrape" Function. Regardless of what other code you add to this Function, the newItem.complete(); should be the last line.

Example 16.5

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item("book");
newItem.url = doc.location.href;

newItem.complete();
}

Save your work and click the execute button. You should see this:

Example 16.6

12:00:00 Returned item:
'itemType' => "book"
'creators' ...
'notes' ...
'tags' ...
'seeAlso' ...
'attachments' ...
'url' => "/member-projects/zotero-guide/sample1.html"
'repository' => "How to Write a Zotero Translator"
'complete' => function(...){...}

12:00:00 Translation successful

Congratulations, you have successfully created a program that will save a URL to Zotero. Now it's time to add the rest of the data.

Capturing the Bibliographic Data

Not all the information is as easy to snag as the url, so you will have to use some XPath Objects. Add the XPath declarations and the Loop created in Chapter 11 to save both the bibliographic headers and the bibliographic information to an Object. In the example I have named this Object "items".

Example 16.7

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item("book");
newItem.url = doc.location.href;

var items = new Object();
var headers;
var contents;

var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);

while (headers = myXPathObject.iterateNext()) {
contents = myXPathObject2.iterateNext().textContent;
items[headers.textContent]=contents;
}
Zotero.debug(items);

newItem.complete();

}

The debug output should read:

Example 16.8

12:00:00 'Title:' => "Method and Meaning in Canadian Environmental History"
'Principal Author:' => "
Alan MacEachern; William J. Turkel
"
'Imprint:' => "
[Toronto:
Nelson Canada,
2009]
"
'Subjects:' => "Environment"
'' => "Tables."
'ISBN-10:' => "0176441166"
'Collection:' => "None"
'Pages:' => "
573
"

You should notice one flaw in this. Under Subjects: only Environment and Tables appear. These are the first and last of the four items. What happened? Consider how the While Loop and Object Variables work. Values are placed into the "items" Object one at a time. When the code gets to "Subjects:" it saves "Environment":

Example 16.9

items["Subjects:"] = "Environment"

Next, "History" is saved into items[""] since the table contains no string at the next position.

Example 16.10

items[""] = "History"

On the following iteration, Methodology is saved into items[""]. The program has written over "History" because the code told it to. On the last iteration, Methodology is replaced with "Tables".
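The overwriting can be reproduced without Zotero. The sketch below assumes a small Array named "rows" (a hypothetical stand-in for the two table columns on Sample 1) and fills an Object the same way the While Loop does.

```javascript
// Hypothetical rows standing in for the header and contents columns.
var rows = [
  ["Subjects:", "Environment"],
  ["", "History"],
  ["", "Methodology"],
  ["", "Tables."]
];

var items = {};
for (var i = 0; i < rows.length; i++) {
  // Every blank header reuses the key "", so the earlier value is replaced.
  items[rows[i][0]] = rows[i][1];
}

console.log(items["Subjects:"]);        // Environment
console.log(items[""]);                 // Tables.
console.log(Object.keys(items).length); // 2
```

Four values went in, but only two descriptors exist, so only two values survive.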

This can be fixed with a simple If Statement that checks for an empty String and changes it to something unique.

Example 16.11

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item("book");
newItem.url = doc.location.href;

var items = new Object();
var headers;
var contents;
var blankCell = "temp";
var headersTemp;

var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);

while (headers = myXPathObject.iterateNext()) {
headersTemp=headers.textContent;
if (!headersTemp.match(/\w/)) {
headersTemp = blankCell;
blankCell = blankCell + "1";
}

contents = myXPathObject2.iterateNext().textContent;
items[headersTemp]=contents;
}
Zotero.debug(items);

newItem.complete();

}

This will only do anything if the table contains a blank cell in the left hand column. The match() String Method checks whether "headersTemp" contains a word character; if it does not, the cell is treated as empty. Each empty cell is then given a unique placeholder name: temp, temp1, temp11, and so on.
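To watch the placeholder names being generated, here is a minimal sketch that runs on its own. The "headerCells" Array is made up for illustration; the If Statement is the same one used in the While Loop.

```javascript
// Hypothetical header cells, as they might come from the left-hand column.
var headerCells = ["Subjects:", "", " ", "\n", "ISBN-10:"];

var blankCell = "temp";
var keys = [];
for (var i = 0; i < headerCells.length; i++) {
  var headersTemp = headerCells[i];
  // No word character means the cell is empty or holds only whitespace.
  if (!headersTemp.match(/\w/)) {
    headersTemp = blankCell;
    blankCell = blankCell + "1"; // make the next placeholder different
  }
  keys.push(headersTemp);
}

console.log(keys.join(", ")); // Subjects:, temp, temp1, temp11, ISBN-10:
```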

The debug output shows you what this has done.

Example 16.12

12:00:00 'Title:' => "Method and Meaning in Canadian Environmental History"
'Principal Author:' => "
Alan MacEachern; William J. Turkel
"
'Imprint:' => "
[Toronto:
Nelson Canada,
2009]
"
'Subjects:' => "Environment"
'temp' => "History"
'temp1' => "Methodology"
'temp11' => "Tables."
'ISBN-10:' => "0176441166"
'Collection:' => "None"
'Pages:' => "
573
"

Removing Whitespace

You now have all the bibliographic information. Before you save it into Zotero, some of it needs to be cleaned up. One thing you may notice is that there is excessive whitespace around some of the data. For instance, "Pages: ", "Imprint: " and "Principal Author: " all contain whitespace before and after the entry. Likewise, "Principal Author: " contains a space in it; as mentioned in Chapter 4, Object descriptors cannot contain spaces. If you try to use Zotero.debug() on "Principal Author:" you will get:

===>undefined<===(undefined).

Both of these problems can be solved using String Methods and RegExps.

First, remove all the whitespace from the before and after the content. The easiest combination of Methods and RegExps to use to remove whitespace is the replace() Method and a RegExp that checks the beginning and end of a String for whitespace.

Example 16.13

replace(/^\s*|\s*$/g, '');

In plain English this reads: if there are zero or more whitespace characters at the beginning or the end of the String, replace them with nothing (remove them). Do this for all matches, not just the first match. This can be done after the While Loop has finished, but it would require another Loop. To save time, simply add this Method to the line of code where "contents" is changed in each iteration.

Example 16.14

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item("book");
newItem.url = doc.location.href;

var items = new Object();
var headers;
var contents;
var blankCell = "temp";
var headersTemp;
var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);

while (headers = myXPathObject.iterateNext()) {
headersTemp=headers.textContent;
if (!headersTemp.match(/\w/)) {
headersTemp = blankCell;
blankCell = blankCell + "1";
}
contents = myXPathObject2.iterateNext().textContent.replace(/^\s*|\s*$/g, '');
items[headersTemp]=contents;
}
Zotero.debug(items);

newItem.complete();

}

Note, however, that this will not remove spaces between words in an entry, only spaces at the beginning or end of the String.

In Chapter 7 you learned that Methods like this will act in essence like an If Statement. If the String in question contains the whitespace characters, they will be removed. If it doesn't then the Method will remove zero instances of whitespace. Therefore you do not have to worry about Strings getting changed that didn't need to be, such as the title.
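A quick way to convince yourself of this is to run the RegExp on two Strings outside Zotero. In this sketch the "pages" value imitates the whitespace seen in the debug output, and "trimPattern" is just a name chosen here for the RegExp.

```javascript
var trimPattern = /^\s*|\s*$/g;

// One String with surrounding whitespace, one without.
var pages = "\n573\n";
var title = "Method and Meaning in Canadian Environmental History";

var cleanPages = pages.replace(trimPattern, "");
var cleanTitle = title.replace(trimPattern, "");

console.log(JSON.stringify(cleanPages)); // "573"
console.log(cleanTitle === title);       // true: spaces between words are kept
```

Modern JavaScript also offers the built-in trim() String Method, which gives the same result for these two Strings; the tutorial keeps the RegExp form so you can see what is being matched.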

Since you also need to remove the spaces from "headersTemp" wherever they exist, you can use a similar Method. This time the Method is added to the line of code where the "items" Object is populated.

Example 16.15

while (headers = myXPathObject.iterateNext()) {
headersTemp=headers.textContent;
if (!headersTemp.match(/\w/)) {
headersTemp = blankCell;
blankCell = blankCell + "1";
}

contents = myXPathObject2.iterateNext().textContent.replace(/^\s*|\s*$/g, '');
items[headersTemp.replace(/\s+/g, '')]=contents;
}
Zotero.debug(items);

newItem.complete();

Using these two replace Methods in your While Loop when creating the "items" Object is a good way to ensure your data doesn't have unwanted whitespace which can cause errors. Feel free to do this on all future translators you make. Make sure you save your work before continuing.
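The effect on the descriptors can be checked with a short standalone sketch. The "rawHeader" value below is an assumption about what the page returns (a space inside and a trailing space); your own debug output may differ slightly.

```javascript
var items = {};
// Assumed raw header text: one space inside, one trailing space.
var rawHeader = "Principal Author: ";
var contents = "Alan MacEachern; William J. Turkel";

// Remove every run of whitespace, wherever it occurs in the header.
var key = rawHeader.replace(/\s+/g, "");
items[key] = contents;

console.log(key);                        // PrincipalAuthor:
console.log(items["PrincipalAuthor:"]);  // Alan MacEachern; William J. Turkel
console.log(items["Principal Author:"]); // undefined
```

From this point on, every lookup must use the descriptor without spaces.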

Reformatting the Authors

Authors are added to Zotero a bit differently than most types of data.

There are two things that must be done to the pair of Authors found in items["PrincipalAuthor:"].

First, the names of multiple authors must be split up and cleaned of unwanted characters. Second, each author must be "pushed" into the newItem.creators field of the Zotero entry. Unlike entry fields such as the title, Zotero allows for more than one author or creator.

Before anything else, you must check to see if your data contains any authors at all. To do this, use an If Statement after the While Loop which created your "items" Object:

Example 16.16

if (items["PrincipalAuthor:"]) {

}

This checks if items["PrincipalAuthor:"] has been declared. It would have been automatically declared when your While Loop put something into it. This If Statement merely prevents an error that would occur if you attempted to change the contents of items["PrincipalAuthor:"] but it did not exist on that page. This is not particularly likely to occur with authors, but in many repositories it is not uncommon for some entries to not contain data for all fields. It is quite common to find an "abstract" field for one entry and none for another. By using an If Statement to check for the presence of this field, the code needed to fix a problem will run only when needed.

On the Sample 1 page, the two authors are connected with a semi-colon and a single space. However, on Sample 2 there is only one author and no semi-colon. This means there are two cases for which the code must account. This will require an If / Else statement within the first If Statement to check whether or not the String containing the names of the authors has a semi-colon followed by a space.

Example 16.17

if (items["PrincipalAuthor:"]) {
var author = items["PrincipalAuthor:"];
if (author.match("; ")) {

} else {

}
}

The contents of items["PrincipalAuthor:"] have been copied to a Simple Variable "author". Working on the copy keeps the original value in items["PrincipalAuthor:"] unchanged while you test the code, until you are sure you no longer need it.

In this case, if the String matches a semi-colon followed by a space, you need to split it. Recall that the split() method will convert the String to an Array and will remove the characters used as the Arguments in the method: in this case, the semi-colon and the trailing space.

Example 16.18

if (items["PrincipalAuthor:"]) {
var author = items["PrincipalAuthor:"];
if (author.match("; ")) {
var authors = author.split("; ");
Zotero.debug(authors);
} else {

}
}

Make sure you are looking at Sample 1 and execute your code. Your debug should show:

Example 16.19

12:00:00 '0' => "Alan MacEachern"
'1' => "William J. Turkel"

Since there are no more unwanted characters in this case, you can now save the item into Zotero. If there had been additional unwanted characters, more String Methods would have been required to get the data in the desired format.

Example 16.20

if (items["PrincipalAuthor:"]) {
var author = items["PrincipalAuthor:"];
if (author.match("; ")) {
var authors = author.split("; ");
for (var i in authors) {
newItem.creators.push(Zotero.Utilities.cleanAuthor(authors[i], "author"));
}
} else {

}
}

This For Loop runs once for each item in "authors", which should be the same as the number of authors for the book. The line of code within the For Loop looks confusing, but it's really not too bad and is made up of pre-defined Zotero code that accepts Arguments. The only things that change from translator to translator are the Arguments. "authors[i]" is one element of the "authors" Array that was defined in the line above the For Loop; it points to each author successively as the Loop progresses. "author" tells Zotero what type of creator the person is. The choices here are "author", "editor", "contributor", "seriesEditor" and "translator".

The last step here is to fill in the "Else" part of the If / Else statement. This is what you want Zotero to do if the author String does not contain a semi-colon and space. In other words: for single authors. This code is almost identical but a little less complicated than that for multiple authors.

Example 16.21

if (items["PrincipalAuthor:"]) {
var author = items["PrincipalAuthor:"];
if (author.match("; ")) {
var authors = author.split("; ");
for (var i in authors) {
newItem.creators.push(Zotero.Utilities.cleanAuthor(authors[i], "author"));
}
} else {
newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
}
}

The only difference here is that the Variable containing the single author's name is the "author" Variable rather than "authors[i]" since the "authors" Array was defined within an If Statement that will only run for pages containing multiple authors.
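The splitting logic can be tried on its own before involving Zotero. The sketch below wraps it in a helper named "splitAuthors", which is hypothetical and not part of the translator; it returns the names instead of pushing them into newItem.creators, and the single-author String is invented for the test.

```javascript
// Hypothetical helper: returns an Array of names for one or many authors.
function splitAuthors(author) {
  if (author.match("; ")) {
    // Several authors: split on the semi-colon and the space after it.
    return author.split("; ");
  }
  // A single author: wrap the String in an Array so callers can loop either way.
  return [author];
}

var two = splitAuthors("Alan MacEachern; William J. Turkel");
var one = splitAuthors("Alan MacEachern");

console.log(two.length, two[1]); // 2 William J. Turkel
console.log(one.length, one[0]); // 1 Alan MacEachern

// Splitting on ";" alone leaves the space attached to the second name.
var loose = "Alan MacEachern; William J. Turkel".split(";");
console.log(JSON.stringify(loose[1])); // " William J. Turkel"
```

Each name returned this way would then be passed to Zotero.Utilities.cleanAuthor() inside the translator, exactly as in Example 16.21.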

Save your work and execute the code. Your translator should now save the URL and all the authors. If your page contained authors formatted in a different way, the advanced section might have the information you need to figure out how to sort them.

Cleaning the Imprint Fields

It is not uncommon for repositories to list the publisher, place and date all in one field. In Sample 1, this is labeled "Imprint:". For this information to be properly saved into three different Zotero fields, you have to do some rearranging and cleaning of the data.

Currently the contents of items["Imprint:"] looks like this:

Example 16.22

12:00:00 [Toronto:
Nelson Canada,
2009]

It needs to look like this:

Example 16.23

newItem.place = "Toronto";
newItem.publisher = "Nelson Canada";
newItem.date = "2009";

This is going to require some creative use of String Methods. Start by creating an If Statement that checks if the Imprint data was present on the page. This is exactly the same as was done for "PrincipalAuthor:". This can go after the code you wrote to save the authors into Zotero, though it does not matter where you put this If Statement as long as it is after the While Loop which populated the "items" Object, before newItem.complete(), and not inside a block of code used to clean another piece of data.

Example 16.24

if (items["Imprint:"]) {

}

Next, the data needs to be broken into three parts in a way that it will be consistently formatted. Looking at the data, you may notice that the Place appears before a colon. Likewise, you may notice that the date appears after the last comma. (In this case there is only one comma in the string, but commas are common. Best be safe and go with the last comma and avoid possible problems with multiple matches.) You can use these criteria to split up the string using the substr(), indexOf() and lastIndexOf() Methods. There are other ways to do this, but the following is quite robust.

This example will be done in three parts so you can see each piece of data being formatted and saved. But first, there is still far too much whitespace in the Imprint data. Run Zotero.debug(items["Imprint:"]); to see what I mean. This can be quickly removed with a replace() String Method and a RegExp that matches two or more whitespace characters in succession.

Example 16.25

if (items["Imprint:"]) {
items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
Zotero.debug(items["Imprint:"]);
}

The debug should return:

Example 16.26

12:00:00 [Toronto:Nelson Canada,2009]

Place

Now the "place" will be stored, since it is at the front of the String and easy to access. First calculate the index of the colon in the long items["Imprint:"] String. Then save a substring of that String to newItem.place, which is the Zotero "place" field. The substring starts at index position 1 (thereby skipping the "[" character at the start of the String) and is colonLoc - 1 characters long. Here that is 7 characters, so it ends at index position 7, one before the colon.

Example 16.27

if (items["Imprint:"]) {
items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
Zotero.debug(items["Imprint:"]);

if (items["Imprint:"].match(":")) {

var colonLoc = items["Imprint:"].indexOf(":");
Zotero.debug(colonLoc);

newItem.place = items["Imprint:"].substr(1, colonLoc-1);
Zotero.debug(newItem.place);
}
}

The debug of this should show:

Example 16.28

12:00:00 [Toronto:Nelson Canada,2009]
12:00:00 ===>8<===(number)
12:00:00 Toronto

Date

Next, isolate the date from the long String. Getting the date next is easiest since it requires the least reformatting. But first, remove the extra Zotero.debug()s from your code so that you do not get bombarded with too many debug messages.

Example 16.29

if (items["Imprint:"]) {
items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
Zotero.debug(items["Imprint:"]);

if (items["Imprint:"].match(":")) {
var colonLoc = items["Imprint:"].indexOf(":");
newItem.place = items["Imprint:"].substr(1, colonLoc -1);

var commaLoc = items["Imprint:"].lastIndexOf(",");
Zotero.debug(commaLoc);

var date1 =items["Imprint:"].substr(commaLoc + 1);

newItem.date = date1.substr(0, date1.length-1);
Zotero.debug(newItem.date);
}
}

The debug of this should show:

Example 16.30

12:00:00 [Toronto:Nelson Canada,2009]
12:00:00 ===>22<===(number)
12:00:00 2009

This works much like the "place" code above, except instead of getting a substring from the start of the original String, you use the position of the last comma to get a String that contains: "2009]". The last step removes the "]" before saving the date into Zotero. It would also have been possible to remove the right square bracket in the second line by changing it to:

Example 16.31

var date1 =items["Imprint:"].substr(commaLoc + 1, 4);

This would have resulted in the exact same String: "2009". However, if you came across an entry that showed the date as "2008-09", this fixed-length technique would only return "2008". By using the technique in Example 16.29, this problem is avoided.
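The difference between the two techniques shows up as soon as the date is longer than four characters. This sketch uses an invented Imprint value with a date range and compares the fixed-length substr() from Example 16.31 with the two-step version from Example 16.29.

```javascript
// Hypothetical Imprint value with a date range instead of a single year.
var imprint = "[Toronto:Nelson Canada,2008-09]";
var commaLoc = imprint.lastIndexOf(",");

// Fixed length of four characters (the approach in Example 16.31).
var fixedLength = imprint.substr(commaLoc + 1, 4);

// Everything after the comma, minus the closing bracket (Example 16.29).
var date1 = imprint.substr(commaLoc + 1);
var untilBracket = date1.substr(0, date1.length - 1);

console.log(fixedLength);  // 2008
console.log(untilBracket); // 2008-09
```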

Publisher

The publisher is collected last because it is in the middle and in order to get to it you need to know where it starts and where it ends. The previous two steps have provided this information in the form of the "colonLoc" and "commaLoc" variables. Don't forget to remove unneeded Zotero.debug()s from your code to make it easier to follow. Then use the "colonLoc" and "commaLoc" variables to isolate the Publisher name with a substr() String Method.

Example 16.32

if (items["Imprint:"]) {
items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
Zotero.debug(items["Imprint:"]);

if (items["Imprint:"].match(":")) {
var colonLoc = items["Imprint:"].indexOf(":");
newItem.place = items["Imprint:"].substr(1, colonLoc -1);

var commaLoc = items["Imprint:"].lastIndexOf(",");
var date1 =items["Imprint:"].substr(commaLoc + 1);
newItem.date = date1.substr(0, date1.length-1);

newItem.publisher = items["Imprint:"].substr(colonLoc+1, commaLoc-colonLoc-1);
Zotero.debug(newItem.publisher);
}
}

The debug of this should show:

Example 16.33

17:40:59 [Toronto:Nelson Canada,2009]
17:40:59 Nelson Canada

To see what is really going on, count the number of characters (including whitespace) in the original String. The newItem.publisher was created from a substring of the original String, starting at position 9 (colonLoc + 1) and taking 13 characters (commaLoc - colonLoc - 1). So you have reused Variables you had already defined. Since you have used relative positions (calculated for each page) rather than absolute positions (i.e., 4), this should work for any page that stores the Imprint data in the format: [Place: Publisher, Date]
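If counting characters by hand is tedious, the arithmetic can be printed instead. This is only a worked check of the numbers for the Sample 1 String; "start" and "count" are names picked for this sketch.

```javascript
var imprint = "[Toronto:Nelson Canada,2009]";

var colonLoc = imprint.indexOf(":");     // position of the colon
var commaLoc = imprint.lastIndexOf(","); // position of the last comma

var start = colonLoc + 1;            // first character of the publisher
var count = commaLoc - colonLoc - 1; // number of characters between colon and comma

console.log(colonLoc, commaLoc);           // 8 22
console.log(start, count);                 // 9 13
console.log(imprint.substr(start, count)); // Nelson Canada
```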

Else

The code above uses an If Statement to check for a colon. But what if there was a data entry mistake, or perhaps a slightly differently formatted Imprint: that you didn't come across? By adding a simple Else to the original If Statement, you can create a catch-all that isn't perfect but that will still save the data in a way that should create a proper citation (in most cases).

Example 16.34

if (items["Imprint:"]) {
items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
Zotero.debug(items["Imprint:"]);

if (items["Imprint:"].match(":")) {
var colonLoc = items["Imprint:"].indexOf(":");
newItem.place = items["Imprint:"].substr(1, colonLoc -1);

var commaLoc = items["Imprint:"].lastIndexOf(",");
var date1 =items["Imprint:"].substr(commaLoc + 1);
newItem.date = date1.substr(0, date1.length-1);

newItem.publisher = items["Imprint:"].substr(colonLoc+1, commaLoc-colonLoc-1);
} else {
newItem.publisher = items["Imprint:"];
}
}

Your users might not get a perfect citation, but at least they won't have to look up the entry again later when they realize the place, publisher, and date weren't saved at all. If you've got this section working, remove all of the Zotero.debug()s.
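To test both branches without loading a page, the Imprint logic can be lifted into a standalone sketch. The "parseImprint" helper below is hypothetical: it returns a plain Object instead of writing to newItem, and the second input String is invented to trigger the Else branch.

```javascript
// Hypothetical helper; returns a plain Object rather than writing to newItem.
function parseImprint(raw) {
  var result = {};
  // Remove runs of two or more whitespace characters, as in Example 16.25.
  var imprint = raw.replace(/\s\s+/g, "");

  if (imprint.match(":")) {
    var colonLoc = imprint.indexOf(":");
    var commaLoc = imprint.lastIndexOf(",");
    var date1 = imprint.substr(commaLoc + 1);

    result.place = imprint.substr(1, colonLoc - 1);
    result.publisher = imprint.substr(colonLoc + 1, commaLoc - colonLoc - 1);
    result.date = date1.substr(0, date1.length - 1);
  } else {
    // Catch-all: keep the whole String so nothing is lost.
    result.publisher = imprint;
  }
  return result;
}

var normal = parseImprint("[Toronto:\n    Nelson Canada,\n    2009]");
var odd = parseImprint("Nelson Canada 2009");

console.log(normal.place, "|", normal.publisher, "|", normal.date);
// Toronto | Nelson Canada | 2009
console.log(odd.publisher); // Nelson Canada 2009
```

The sketch assumes the input has already had its leading and trailing whitespace removed, as the While Loop does for "contents".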

Adding Tags (the Subjects: Field)

Tags are helpful when you have a large collection of Zotero entries. All entries that share the same tag can quickly be sorted and isolated. If you were studying sharks, you could tag all entries related to sharks and not worry about losing track of them.

As with authors, you can create multiple tags for a single entry. Therefore, it is easiest to put all the items you want to turn into a tag in an Array, and then save those items one at a time into Zotero.

Before the first While Loop declare an Array named "tagsContent" to hold this information. Then, add the following code to the While Loop to save the contents of the blank subject fields to your Array.

Example 16.35

var tagsContent = new Array();
while (headers = myXPathObject.iterateNext()) {
headersTemp=headers.textContent;
if (!headersTemp.match(/\w/)) {
headersTemp = blankCell;
blankCell = blankCell + "1";
}
contents = myXPathObject2.iterateNext().textContent.replace(/^\s*|\s*$/g, '');

if (headersTemp.match("temp")) {
tagsContent.push(contents);
}
Zotero.debug(tagsContent);

items[headersTemp.replace(/\s+/g, '')]=contents;
}

Now, before you save the items into Zotero, you must also put the contents of items["Subjects:"] into the Array if it exists, since it too should be a tag. Do this after the While Loop. You could do this below the Imprint code, or any place that is not within an If Statement or Loop designed for other use. Don't forget to use an If Statement to make sure items["Subjects:"] has been defined.

Example 16.36

if (items["Subjects:"]) {
tagsContent.push(items["Subjects:"]);
}

Now you can use a For Loop to save each item into Zotero.

Example 16.37

if (items["Subjects:"]) {
tagsContent.push(items["Subjects:"]);
}

for (var i = 0; i < tagsContent.length; i++) {
newItem.tags[i] = tagsContent[i];
}
Zotero.debug(newItem.tags);

Your entry should now be saving the URL, both authors, place, publisher, date and four (4) tags.
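The order in which the tags arrive can be checked with a short sketch. Here "fakeItem" is a plain Object with an empty "tags" Array, standing in for newItem, and "tagsContent" starts with the three values the While Loop would have collected from the blank header rows.

```javascript
// "fakeItem" is a hypothetical stand-in for newItem.
var fakeItem = { tags: [] };
var items = { "Subjects:": "Environment" };

// Values already collected from the rows with blank headers.
var tagsContent = ["History", "Methodology", "Tables."];

// The labeled subject is added last, after the While Loop has finished.
if (items["Subjects:"]) {
  tagsContent.push(items["Subjects:"]);
}

for (var i = 0; i < tagsContent.length; i++) {
  fakeItem.tags[i] = tagsContent[i];
}

console.log(fakeItem.tags.join(", ")); // History, Methodology, Tables., Environment
```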

Saving the Rest

Take a look at what's left to save:

Example 16.38

12:00:00 'Title:' => "Method and Meaning in Canadian Environmental History"
'ISBN-10:' => "0176441166"
'Collection:' => "None"
'Pages:' => "573"

There are four items and all of them are already well formatted. If they had needed more tweaking, you could use If Statements and String Methods as with the other data before saving them into Zotero. Now you will learn a great Function that makes saving well formatted data a piece of cake. It works for any website where you have already cleaned the data so feel free to reuse this one.

Declare a new Function outside of both "scrape" and "doWeb". I have called this Function "associateData".

Example 16.39

function associateData (newItem, items, field, zoteroField) {
if (items[field]) {
newItem[zoteroField] = items[field];
}
}

This Function takes four Arguments and is a reorganizer. It takes the data from your "items" Object and puts it into the format used to save into Zotero. This is useful because you will want to do this over and over again. Rather than rewrite multiple If Statements you can simply call this Function for each type of possible data that your webpage might have. If the data does not appear, nothing will happen. If it does, it will be saved to Zotero.

Just above the newItem.complete() line, call this Function four times; once for each remaining item type.

Example 16.40

associateData (newItem, items, "Title:", "title");
associateData (newItem, items, "ISBN-10:", "ISBN");
associateData (newItem, items, "Collection:", "extra");
associateData (newItem, items, "Pages:", "pages");

newItem.complete();

You are almost done. One important thing to remember is the ONLY required field for a Zotero entry is the "Title" field. If this field is left blank users will receive an error when they attempt to save an item. Since this field is so important, you should create a failsafe way to make it impossible for a blank title to be saved. Put this at the top of the "scrape" Function, right under the line where you set the URL of the entry. That way, a temporary title is in place and will be used in the case where there is no title, but if the entry contains a real title, that temporary title will be overwritten.

Example 16.41

function scrape(doc, url) {
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

var newItem = new Zotero.Item('book');
newItem.url = doc.location.href;
newItem.title = "No Title Found";
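The way the temporary title and "associateData" work together can be seen in a sketch that runs outside Zotero. "fakeItem" is a plain Object standing in for newItem, and the "items" Object deliberately has no "Title:" entry so the failsafe is exercised.

```javascript
function associateData(newItem, items, field, zoteroField) {
  // Copy the value only when the page actually supplied it.
  if (items[field]) {
    newItem[zoteroField] = items[field];
  }
}

// "fakeItem" is a hypothetical stand-in for newItem, with the temporary title set.
var fakeItem = { title: "No Title Found" };

// Scraped data with no "Title:" entry.
var items = { "ISBN-10:": "0176441166", "Pages:": "573" };

associateData(fakeItem, items, "Title:", "title"); // missing: title is left alone
associateData(fakeItem, items, "ISBN-10:", "ISBN");
associateData(fakeItem, items, "Pages:", "pages");

console.log(fakeItem.title); // No Title Found
console.log(fakeItem.ISBN);  // 0176441166
console.log(fakeItem.pages); // 573
```

Had "items" contained a "Title:" entry, the first call would have replaced the temporary title with the real one.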

One final point: Zotero automatically saves the "repository" field with the name of your translator. If this makes sense, it is fine to leave this. However, if your translator is called "Example Database Entry" or something similar, you might want to change the repository. You do not need to scrape anything to do this; you can just add it directly. Put it anywhere you like, as long as it is outside any Loops or If Statements.

Example 16.42

newItem.repository = "NiCHE";

Save your work and relaunch Firefox. Now it's time to try it out for real. Go to the Sample 1 page and click on the book icon. If you have followed along properly Zotero should save an accurate entry for the page. Try it out on Sample 2 as well. Try it out on the Search Results Page. Try it lots. Try it on every page that it should scrape. Does it work?

Congratulations! You have finished the tutorial section of the guide. You should now have a working translator. So that others can benefit from this guide, I would ask that you do not submit this translator to others. It ruins the challenge if the Sample Pages are already Zotero enabled.

In the next chapter you will learn some more tricks for more difficult, but common translating problems.

The Finished DoWeb Tab

There are lots of ways to accomplish the same thing with code. If your translator looks different but works consistently, that's great.

Example 16.43

function doWeb(doc, url) {
//namespace code
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

//variable declarations
var articles = new Array();
var items = new Object();
var nextTitle;

//If Statement checks if page is a Search Result, then saves requested Items
if (detectWeb(doc, url) == "multiple") {
var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
while (nextTitle = titles.iterateNext()) {
items[nextTitle.href] = nextTitle.textContent;
}
items = Zotero.selectItems(items);
for (var i in items) {
articles.push(i);
}
} else {
//saves single page items
articles = [url];
}

//Tells Zotero to process everything. Calls the "scrape" function to do the dirty work.
Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
Zotero.wait();
//Translator is FINISHED after running this line. Note: code doesn't run from top to bottom only.
}

//The function used to save well formatted data to Zotero
function associateData (newItem, items, field, zoteroField) {
if (items[field]) {
newItem[zoteroField] = items[field];
}
}

function scrape(doc, url) {
//namespace code
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

//variable declarations
var newItem = new Zotero.Item('book');
newItem.url = doc.location.href;
newItem.title = "No Title Found";

var items = new Object();
var headers;
var contents;
var blankCell = "temp";
var headersTemp;
var tagsContent = new Array();

var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);

//While Loop to populate "items" Object and save tags to an Array.
while (headers = myXPathObject.iterateNext()) {

headersTemp = headers.textContent;
if (!headersTemp.match(/\w/)) {
headersTemp = blankCell;
blankCell = blankCell + "1";
}

contents = myXPathObject2.iterateNext().textContent;
if (headersTemp.match("temp")) {
tagsContent.push(contents);
}

items[headersTemp.replace(/\s+/g, '')]=contents.replace(/^\s*|\s*$/g, '');
}

//Formatting and saving "Author" field
if (items["PrincipalAuthor:"]) {
var author = items["PrincipalAuthor:"];
if (author.match("; ")) {
var authors = author.split("; ");
for (var i in authors) {
newItem.creators.push(Zotero.Utilities.cleanAuthor(authors[i], "author"));
}
} else {
newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
}
}

//Formatting and saving "Imprint" fields
if (items["Imprint:"]) {
items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');

if (items["Imprint:"].match(":")) {
var colonLoc = items["Imprint:"].indexOf(":");
newItem.place = items["Imprint:"].substr(1, colonLoc-1);

var commaLoc = items["Imprint:"].lastIndexOf(",");
var date1 =items["Imprint:"].substr(commaLoc + 1);
newItem.date = date1.substr(0, date1.length-1);
newItem.publisher = items["Imprint:"].substr(colonLoc+1, commaLoc-colonLoc-1);
} else {
newItem.publisher = items["Imprint:"];
}
}

//Saving the tags to Zotero
if (items["Subjects:"]) {
tagsContent.push(items["Subjects:"]);
}
for (var i = 0; i < tagsContent.length; i++) {
newItem.tags[i] = tagsContent[i];
}

//Associating and saving the well formatted data to Zotero
associateData (newItem, items, "Title:", "title");
associateData (newItem, items, "ISBN-10:", "ISBN");
associateData (newItem, items, "Collection:", "extra");
associateData (newItem, items, "Pages:", "pages");

newItem.repository = "NiCHE";

//Scrape is COMPLETE!
newItem.complete();
}

Submit the Translator to Zotero

If you were successful and made a working translator that you have thoroughly tested, you can submit it to the Zotero team. If you don't submit it, no one will be able to use it but you, since it is currently saved locally on your computer. To submit a translator, open it in Scaffold and click on the "Save to Clipboard" Icon at the top of the window. You can then paste the entire translator into a word document and save it. Post that file to Zotero's Google Group, which is where they prefer to receive translators. It probably wouldn't hurt to advertise your contribution on the Zotero Forums as well.
