Chapter 17: Common Problems when Scraping an Individual Entry Page

This Chapter

Saving to the "abstract" or "Loc. in Archive" fields
Common and useful RegExps
Common Author reformatting issues
The site has more than one language
Multiple content types possible. Gelling it all together.
Tips for Finding Good Sites to Translate.

Saving to the "abstract" or "Loc. in Archive" fields

For whatever reason, these two fields don't follow the same pattern as other fields. This was likely an oversight made when Zotero was first coded. Use the following pattern to save to these fields:

newItem.abstractNote = "content here";
newItem.archiveLocation = "content here";

Common and Useful RegExps

You can create a RegExp to single out anything imaginable in Strings. However, some are more common than others. Here are a few that you might find particularly useful.

(/^\s+/g, ''); //finds white space at the start of a string.
(/\s*$/g, ''); //finds white space at the end of a string.
(/^\s*|\s*$/g, ''); //finds white space at the beginning or end of a string.
(/\s+/g, ''); //finds any run of white space (spaces, tabs, new lines).
(/\s\s+/g, ''); //finds any run of two or more white space characters side by side.
(/\d+/g, ''); //finds any run of digits.
(/\d\d\d\d+/g, ''); //finds any run of four or more digits. Useful for the year in a date.
(/\W+/g, ''); //finds any run of characters other than letters, digits and underscores.
(/\n+/g, ''); //finds any new line characters.
(/[\[\]]+/g, ''); //finds any square bracket characters.
(/;+/g, ''); //finds any semicolons.

You can use these as building blocks to create RegExps of your own that meet your site's specific needs. RegExps can be tricky. If you are having a lot of trouble, try posting a question on a message board. Sometimes you will have better results by using a String rather than a RegExp. If you decide to do this, make sure what you come up with will work across all pages, not just the first one you work on.

x.replace(": ", '');

rather than

x.replace(/\:\s/,'');
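
The building blocks above can be combined into a small helper. What follows is only a minimal sketch: the function name "cleanField" and the sample text are made up for illustration and are not part of Zotero. It also shows a detail that is easy to miss when choosing between a String and a RegExp: replace() with a String pattern changes only the first match, while a RegExp with the "g" flag changes every match.

```javascript
// Hypothetical helper (not a Zotero function): tidy a scraped field value.
function cleanField(text) {
  return text
    .replace(/\s+/g, ' ')        // collapse each run of white space (including new lines) into one space
    .replace(/^\s+|\s+$/g, '');  // strip white space from both ends
}

var raw = "  Title:  The\n  Great   Gatsby  ";
var cleaned = cleanField(raw);   // "Title: The Great Gatsby"

// A String pattern replaces only the first match...
var firstOnly = "a: b: c".replace(": ", '');     // "ab: c"
// ...while a RegExp with the g flag replaces every match.
var everyMatch = "a: b: c".replace(/:\s/g, '');  // "abc"
```

If a field on your site can contain the pattern more than once, the RegExp form is probably the safer choice.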

Reformatting Author Names: Common Problems

Websites seem to have a million different ways to write the names of their authors. This often requires reformatting on your part. Here are a few common problems and their solutions.

Name all in Caps

If the website posts the name of the author ALL IN CAPS, you need to convert this to a more suitable format before saving into Zotero, else your users will have problems with their saved citations.

This requires cleaning using a couple of Loops and some String Methods.

var authorName = "ADAM CRYMBLE";
var words = authorName.split(/\s+/); // split on any run of white space
var authorFixed = '';

for (var i = 0; i < words.length; i++) {
// capitalize the first letter and lower-case the rest of the word
words[i] = words[i].charAt(0).toUpperCase() + words[i].substr(1).toLowerCase();
Zotero.debug(words[i]);
authorFixed = authorFixed + words[i] + ' ';
Zotero.debug('authorFixed = ' + authorFixed);
}
newItem.creators.push(Zotero.Utilities.cleanAuthor(authorFixed, 'author'));

This code splits the author's name into an Array containing one item for each word in the name. A For Loop then takes each word individually, capitalizes the first letter and prints the rest in lower case. This reformatted word is then saved to the end of "authorFixed." Once all the words have been reformatted the author is then saved into Zotero.
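
The same idea can be wrapped in a reusable function. The sketch below is an illustration only: "fixCapsName" is a hypothetical name, and the handling of hyphenated names and doubled spaces is an assumption about what a site might throw at you. It only recognizes the letters a to z at the start of a word, so a name beginning with an accented letter would need extra handling.

```javascript
// Hypothetical helper (not a Zotero function): convert "ADAM CRYMBLE" to "Adam Crymble".
function fixCapsName(name) {
  return name
    .toLowerCase()
    // capitalize a letter at the start of the string, or after white space or a hyphen
    .replace(/(^|[\s-])([a-z])/g, function (match, separator, letter) {
      return separator + letter.toUpperCase();
    })
    .replace(/\s+/g, ' ')        // collapse doubled spaces
    .replace(/^\s+|\s+$/g, '');  // strip white space from both ends
}

var simple = fixCapsName("ADAM CRYMBLE");
var hyphenated = fixCapsName("JEAN-PIERRE  DUPONT");
```

The returned String could then be passed to Zotero.Utilities.cleanAuthor() in the same way as "authorFixed" above.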

Last Name is Before First Name

What do you do if your author's name is "Crymble, Adam"? Reorder the words before saving. In this case, split the name at the comma, then rebuild it in a Simple Variable by working backwards from the last item in the Array, so that the first name comes first.

var authorName = "Crymble, Adam";
var words = authorName.split(", ");
var authorFixed = '';
for (var i = words.length - 1; i > -1; i--) {
authorFixed = authorFixed + words[i] + ' ';
}
newItem.creators.push(Zotero.Utilities.cleanAuthor(authorFixed, 'author'));

Use Zotero.debug() at various points in this block of code to see what is contained in each Variable at a given moment. This will help you understand exactly what is happening at each stage so you can tailor the code to your specific needs.
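
If some pages on your site list names as "Last, First" and others as "First Last", a slightly more defensive version may help. This is a sketch under assumptions: "reorderName" is a hypothetical name, and it assumes that a name without a comma is already in the right order and that only the first comma matters.

```javascript
// Hypothetical helper (not a Zotero function): turn "Last, First" into "First Last".
function reorderName(name) {
  var trim = function (s) { return s.replace(/^\s+|\s+$/g, ''); };
  var commaAt = name.indexOf(',');
  if (commaAt === -1) {
    return trim(name);  // no comma: assume the name is already "First Last"
  }
  var last = trim(name.substring(0, commaAt));
  var first = trim(name.substring(commaAt + 1));
  return trim(first + ' ' + last);
}

var reordered = reorderName("Crymble, Adam");
var untouched = reorderName("Adam Crymble");
```

Because the function splits at the comma itself rather than at ", ", it also copes with "Crymble,Adam" where the space is missing.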

Manually Entering First and Last Name

There is an alternative to the standard way of entering an author's name into Zotero:

newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));

If you need to be able to enter the first or last name separately, you can do so like this:

newItem.creators.push({lastName: x, firstName: y , creatorType: "author"});

The variables "x" and "y" would hold the values you want put into the lastName and firstName fields respectively. This can be particularly helpful for authors whose last name is more than one word long. When Zotero comes across a name with more than two words, it assumes only the last word is the surname. In cases like "Peter Van der Meer" it does this:

'creators' ...
'0' ...
'firstName' => "Peter Van der"
'lastName' => "Meer"
'creatorType' => "author"

Zotero has assumed Peter has a middle name: "Van der". When the citation is created, Peter will be going by "Meer, Peter Van der". If this isn't what you want, you can use an If Statement to check for cases like Peter and then save them using the manual method.

var authorName = "Peter Van der Meer";

if (authorName.match("Van ")) {
var w = authorName.indexOf("Van ");
var x = authorName.substr(w); // last name: "Van der Meer"
var y = authorName.substr(0, w-1); // first name: "Peter"
newItem.creators.push({lastName: x, firstName: y , creatorType: "author"});
}

You could use a similar technique for other situations like this.
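
One way to generalize this is to keep a list of surname particles and walk backwards through the name. The following is a rough sketch, not Zotero's own behavior: "splitName" and the "PARTICLES" list are hypothetical, and the list would need adjusting to the names that actually appear on your site. It will guess wrong for anyone whose middle name happens to be on the list.

```javascript
// Hypothetical list of surname particles, in lower case; extend it for your site.
var PARTICLES = ['van', 'von', 'de', 'der', 'den', 'del', 'della', 'di', 'la', 'le'];

// Hypothetical helper: build a creator Object with the particles kept in the last name.
function splitName(fullName) {
  var words = fullName.replace(/^\s+|\s+$/g, '').split(/\s+/);
  // Start by assuming the surname is only the last word...
  var start = words.length - 1;
  // ...then move backwards while the word before it is a particle,
  // always leaving at least one word for the first name.
  while (start > 1 && PARTICLES.indexOf(words[start - 1].toLowerCase()) !== -1) {
    start--;
  }
  return {
    firstName: words.slice(0, start).join(' '),
    lastName: words.slice(start).join(' '),
    creatorType: 'author'
  };
}

var peter = splitName("Peter Van der Meer");
var adam = splitName("Adam Crymble");
```

The Object returned has the same shape as the one used in the manual method, so it could be passed straight to newItem.creators.push().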

The Site Has Versions in More than One Language

This is common for repositories created in Canada or Europe, where there is often more than one official language. Often you will be able to toggle between the English and other-language versions of the site while looking at the single entry page. This might involve clicking a flag or a "version francaise" link. Making your translator bilingual or multilingual is easier than you might think. As long as only the language changes between versions of the site and the format stays the same, you can use If Statements to convert the headers to English, then translate the site as normal. The easiest place to do this is in the While Loop where you populate the "items" Object. (This explanation builds upon the template used in the Sample Page tutorial. If you are having difficulty following along, check out Chapter 16 to see how the template was created.)

while (headers = myXPathObject.iterateNext()) {

headersTemp = headers.textContent.replace(/\s+/g, '');
contents = myXPathObject2.iterateNext().textContent.replace(/^\s*|\s*$/g, '');

if (headersTemp == "titre:") {
headersTemp = "title:";
} else if (headersTemp == "AuteurPrincipal:") {
headersTemp = "PrincipalAuthor:";
}

items[headersTemp]=contents;
}

Continue to add Else If to the If Statement for each case where the spelling differs between English and the foreign language version of the site. Assuming your translator worked in English, it should now work for the foreign language version as well. If you do not see the proper Icon in the address bar when viewing the foreign language version of the page, your "target" on the "Meta Data" tab of Scaffold and possibly your "detectWeb" Function will have to be adjusted so that the foreign versions are picked up by Zotero.
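
Once the list of headers grows, a lookup Object can be easier to maintain than a long chain of Else If. This is a sketch only: "headerTranslations" and "toEnglishHeader" are hypothetical names, and the two entries are the ones from the example above, with white space already removed.

```javascript
// Hypothetical lookup from French headers to the English headers the translator expects.
var headerTranslations = {
  "titre:": "title:",
  "AuteurPrincipal:": "PrincipalAuthor:"
};

// Return the English header if a translation is listed; otherwise return the header unchanged.
function toEnglishHeader(header) {
  if (Object.prototype.hasOwnProperty.call(headerTranslations, header)) {
    return headerTranslations[header];
  }
  return header;
}

var translated = toEnglishHeader("titre:");
var alreadyEnglish = toEnglishHeader("title:");
```

Inside the While Loop, the If Statement could then be replaced with a single line such as headersTemp = toEnglishHeader(headersTemp);, and supporting a new header only means adding one entry to the Object.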

Site has more than one Content Type

If your repository has more than one content type that the "detectWeb" Function can distinguish between, you must make a few quick changes to your "scrape" Function. In the Sample Page tutorial, you created a "book" item type:

var newItem = new Zotero.Item('book');

If your page has multiple content types, you must use an If / Else If statement to create the proper Zotero.Item for the current page.

if (detectWeb(doc, url) == "book") {
var newItem = new Zotero.Item("book");
} else if (detectWeb(doc, url) == "audioRecording") {
var newItem = new Zotero.Item("audioRecording");
} else if (detectWeb(doc, url) == "videoRecording") {
var newItem = new Zotero.Item("videoRecording");
} else if (detectWeb(doc, url) == "newspaperArticle") {
var newItem = new Zotero.Item("newspaperArticle");
}

This will open up the proper fields for that entry type. You may have noticed that a "book" entry in Zotero does not have a "Publication" field, but a "journalArticle" does. Having the correct entry type also ensures users get the proper citation style. Note that if the same kind of data is formatted differently on different entry types, you will have to use If Statements in the "scrape" Function to perform the correct actions for the entry type being saved.

For example, if the "book" pages display the title all in CAPITAL LETTERS and the "audioRecording" pages display the title all in lower case letters, use an If Statement to check which entry type you are currently saving, then add the appropriate code for each.

if (detectWeb(doc, url) == "book") {
//insert code to reformat title from "CAPS" to "Proper Format"
} else if (detectWeb(doc, url) == "audioRecording") {
//insert code to reformat title from "lower case" to "Proper Format"
}

This will have to be done for all cases where format changes across entry type.
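
As an illustration of the title example, the sketch below keeps the per-type decision in one place. The names "toProperFormat" and "formatTitle" are hypothetical, and it assumes that every item type other than "book" and "audioRecording" already displays its title correctly. Capitalizing every word is a crude rule (it produces "The Art Of War"), so treat it as a starting point rather than a finished formatter.

```javascript
// Hypothetical helper: capitalize the first letter of each word and lower-case the rest.
function toProperFormat(title) {
  return title.toLowerCase().replace(/(^|\s)([a-z])/g, function (match, space, letter) {
    return space + letter.toUpperCase();
  });
}

// Hypothetical helper: reformat the title only for the item types that need it.
function formatTitle(itemType, rawTitle) {
  if (itemType == "book" || itemType == "audioRecording") {
    // lower-casing first means one routine handles both ALL CAPS and all lower case
    return toProperFormat(rawTitle);
  }
  return rawTitle;  // other item types are assumed to be formatted correctly already
}

var bookTitle = formatTitle("book", "THE GREAT GATSBY");
var audioTitle = formatTitle("audioRecording", "the great gatsby");
var newsTitle = formatTitle("newspaperArticle", "Local news today");
```

In the "scrape" Function, you could store the result of detectWeb(doc, url) in a variable once and pass that variable as the first argument, rather than calling detectWeb() in every If Statement.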

Finding a Good Site to Scrape

Because a Zotero "scraper" relies heavily on a website having a consistent format, choosing the site you want to translate carefully will save you a lot of stress. Just because a site has several content types or formats does not mean you cannot translate it; it may just mean that you will need a lot more code. And just because a site looks new does not mean it will be easy to translate. For your sanity's sake, your first translator should probably adhere to most, if not all, of the following criteria. You can start to translate more difficult sites once you've got the hang of things.

The website must be a database, that is, it must contain many entries of a finite number of content types. An archive, newspaper, journal repository or blog are examples of these. Your bank's website is not.
You must be able to view a single entry page (for example, an entry in Amazon or a newspaper article, as opposed to the front page of a newspaper website that contains many articles). This is ok: http://www.thestar.com/article/552008 . This is not: http://www.thestar.com/ .
Single entries must be consistently formatted. They should contain the same types of information which can be found in the same DOM nodes, across multiple entries. For instance, if the title appears in <h3> tags in one entry, it should do so for all entries. The more variation you encounter, the more "cases" or if statements you will need to incorporate to make sure the data is in the correct format for Zotero.
This single entry page must have a stable and different URL from the other pages on the site. If the URL ends with .jsp, chances are the URL is created dynamically and if you were to cut and paste that URL into another browser window it would not return the same entry.
You must be able to search the contents of the site with a search box, and the search results page must contain a list of results with a link to each individual entry on the list. For example: http://www.thestar.com/search?&q=environment&r= .
The URL of the search page must start with the same characters as the URL for the single entry pages. For example: http://mywebsite.com/search and http://mywebsite.com/article is good. http://search.mywebsite.com/ and http://mywebsite.com/article is bad. There is a work-around for sites like this, but it's a bit too complicated for a first try.
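
The last criterion matters because the "target" of a translator is a regular expression that is tested against the URL of the page. A quick way to see the difference is to try a candidate target against a few sample URLs. The sketch below uses the made-up mywebsite.com addresses from the list above; the target itself is a hypothetical example, not one taken from a real translator.

```javascript
// Hypothetical target for the example site: matches both search and single entry pages.
var target = /^http:\/\/mywebsite\.com\/(search|article)/;

var sampleUrls = [
  "http://mywebsite.com/search",    // search page on the same host
  "http://mywebsite.com/article",   // single entry page
  "http://search.mywebsite.com/"    // search page on a different host
];

// Test each URL against the target.
var results = sampleUrls.map(function (url) {
  return target.test(url);
});
// results is [true, true, false]: the third URL is not picked up by this target.
```

A target that also covers the third URL is possible, but it has to account for two different hosts, which is why sites like that make a harder first project.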

If you do not have a choice about which website you need to translate, translating it may still be possible. One solution might be to find another website that has the same problem and already has a translator built for it, then take a look at that translator. You can do this by opening Scaffold (under Tools in Firefox) and clicking on the "Load from Database" button in the top left corner.

If you are the administrator of the site in question and your site fails to meet more than one of these criteria, you might consider making some changes to your site or adding a metadata system, discussed in Chapter 1.

Newspapers generally make excellent first attempts. Consider writing a translator for your local paper before moving on to journal repositories.

Good Luck!