Chapter 13: Scaffold Metadata Tab

This Chapter

Filling in the Metadata Tab (Tutorial)
Filling in the Metadata Tab (Advanced)
Further Reading

From now on, each chapter will have sections for those who are following along with the tutorial for the Sample Pages, and for those who are writing their own translator for another site. These are marked (Tutorial) and (Advanced) respectively. You can choose to read either or both as you see fit.

Filling in the Metadata Tab (Tutorial)

This is the easiest part.

Open the Sample 1 page in a new window and launch Scaffold. Ensure you do not have a translator loaded.

On the Metadata Tab, fill in the fields as follows:

Translator ID

This is a series of letters and numbers that is unique to each translator. Every time you launch Scaffold you should get a different ID. You can also get a new ID by clicking the "Generate" button to the right of the field.

Do not use the ID of a translator already saved to the Zotero database, as this will overwrite a working translator.

Clicking "Generate" when a translator is already loaded will not clear the other fields in Scaffold. If you are starting a new translator, it is easiest to re-launch Scaffold and an empty entry will appear for you to use.

Label

This is the title of the site or the content management system that you are writing the translator for. Be as descriptive as possible so others can tell which site your translator was written for. Zotero does not need the label in order to run the translator, so there is no wrong answer; however, humans who are looking through their list of translators will appreciate a straightforward label that relates to the site for which it was built.

In this case, the Sample Pages are part of "How to Write a Zotero Translator" so use that as your label.

Creator

Type your name here.

Target

In most cases, the part of the website's URL up to and including .com goes in the Target. The Sample Pages are not set up quite this way, so you will use:

/member-projects/zotero-guide

When a viewer goes to a website, Zotero checks the current URL against its database of targets. If one of the targets matches the current URL, Zotero knows to run that specific translator.

You could try to use http://niche-canada.org/ as your target, but the NiCHE website has many different sections that are not formatted in the same way as this guide, so the translator would not run correctly on all pages of the website. It's best to make the target short, but long enough to point only to relevant pages. Making it too long can also cause problems. For example, using /member-projects/zotero-guide/sample1.html would only work for that single page, and therefore sample2 could not be translated.

Once you have entered the target, click the "Test Regex" button in Scaffold. If you have created a good target, you should see "true" appear below in the test frame. Test your target on multiple entries to make sure it is consistently working.
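
If you would like to see the idea behind "Test Regex" outside of Scaffold, the minimal sketch below treats each candidate target as a JavaScript RegExp and tests it against a few URLs. The helper name targetMatches, the full Sample Page URLs (the NiCHE domain joined to the paths above) and the /about page are assumptions made for illustration. How Zotero applies the target internally may differ in detail.

```javascript
// Assumed full URLs for the Sample Pages (NiCHE domain + the paths from the text).
const sample1 = "http://niche-canada.org/member-projects/zotero-guide/sample1.html";
const sample2 = "http://niche-canada.org/member-projects/zotero-guide/sample2.html";
// Hypothetical page elsewhere on the NiCHE site.
const otherPage = "http://niche-canada.org/about";

// Hypothetical helper: treat the target as a RegExp and test it against a URL.
function targetMatches(target, url) {
  return new RegExp(target).test(url);
}

const goodTarget = "/member-projects/zotero-guide";
const tooBroad = "http://niche-canada.org/";
const tooNarrow = "/member-projects/zotero-guide/sample1.html";

console.log(targetMatches(goodTarget, sample1));   // true
console.log(targetMatches(goodTarget, sample2));   // true
console.log(targetMatches(goodTarget, otherPage)); // false
console.log(targetMatches(tooBroad, otherPage));   // true: would run on unrelated pages
console.log(targetMatches(tooNarrow, sample2));    // false: sample2 is left out
```

Only the middle-length target returns true for both Sample Pages while leaving the rest of the site alone.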

Make sure you save your work, and then move on to the DetectWeb tab.

Filling in the Metadata Tab (Advanced)

There are a few things you might need to keep in mind when creating a translator for another website. These occur in the Label and Target fields.

Label

It is possible to write a single translator that works for many sites, as long as all the sites share the same formatting or metadata systems. If you plan to create a translator for many sites, choose a label that reflects the type of content system you are translating, for example "Library (SIRSI)" for the SIRSI Library catalogue system.

Target

Your target can be any combination of alphanumeric characters, or in cases where you need something more complicated, you can also use RegExps in combination with alphanumeric characters.

This is helpful for some major sites, which have slightly different but identically formatted sites for various countries. In the US, http://www.google.com is the URL for Google. However, in Canada users are automatically redirected to http://www.google.ca when they type the .com address. If you were to use the .com version as your target, users outside the US would not be able to use your translator. A RegExp can correct this problem. In a RegExp, the . symbol matches any single character and the * symbol repeats whatever comes before it any number of times, so together .* acts as a wildcard. By ending the target with .* instead of .com, all versions of Google should be recognized by Zotero:

Example 13.1

http://www.google.*
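
To check a target like this one by hand, you can try it as a JavaScript RegExp. The sketch below is only an illustration: the test URLs other than .com and .ca are assumptions, and the stricter pattern at the end is one possible tightening rather than something the guide requires. Note that every unescaped . in the target matches any character, not just a literal dot.

```javascript
// The target from Example 13.1, used as a RegExp.
const googleTarget = new RegExp("http://www.google.*");

console.log(googleTarget.test("http://www.google.com")); // true
console.log(googleTarget.test("http://www.google.ca"));  // true
console.log(googleTarget.test("http://www.google.co.uk/search?q=zotero")); // true

// Side effect of the unescaped dots: "." also matches the "x" here.
console.log(googleTarget.test("http://wwwxgoogle.com")); // true

// A stricter variant (an assumption, not from the guide): escape the literal
// dots with a backslash and anchor the pattern to the start of the URL.
const stricterTarget = new RegExp("^http://www\\.google\\.");
console.log(stricterTarget.test("http://www.google.ca"));  // true
console.log(stricterTarget.test("http://wwwxgoogle.com")); // false
```

For most sites the loose version is good enough; the stricter one matters only if a look-alike URL could plausibly turn up.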

Similarly, some repositories have versions in multiple languages. If your site offers this service to users, make sure that your target is general enough to pick up all versions of the site. Common parts of the URL to watch out for in this case are /EN, /en, /eng, /english or something similar. If you navigate to a foreign language version of the site, you will likely find that the URL has remained identical, except that the /EN has been replaced by a /FR for French, /SP for Spanish, or /DE for German. By using .* in your target in place of the language code, you can make your translator useful to many more people.

If possible, just make the target shorter. You only need to make it long enough for Zotero to recognize it as unique.
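
As a rough illustration of the language case, the sketch below compares two ways of writing such a target. The site www.example.org and its /record/ paths are hypothetical, and the second pattern (listing the language codes explicitly) is an alternative offered as an assumption, not something the guide prescribes.

```javascript
// Hypothetical URLs for the same record in three languages, plus a help page.
const english = "http://www.example.org/en/record/42";
const french = "http://www.example.org/fr/record/42";
const german = "http://www.example.org/de/record/42";
const helpPage = "http://www.example.org/help/record/42";

// Wildcard in place of the language code: matches any text in that position.
const anyLanguage = new RegExp("^http://www\\.example\\.org/.*/record/");

// Alternative: list only the language codes you expect.
const listedLanguages = new RegExp("^http://www\\.example\\.org/(en|fr|sp|de)/record/");

console.log(anyLanguage.test(english), anyLanguage.test(french), anyLanguage.test(german)); // true true true
console.log(anyLanguage.test(helpPage));     // true: the wildcard accepts "help" too
console.log(listedLanguages.test(french));   // true
console.log(listedLanguages.test(helpPage)); // false
```

The wildcard is shorter and picks up languages you did not think of; the explicit list is safer if other pages share the same URL shape.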

Example 13.2

Instead of
http://www.canadiana.org/ECO/?Language=en
try
http://www.canadiana.org

Not all pages on the website in question need to return "true" for your target, but all pages that contain information you want to scrape must. A common problem is sites that have different domains for search results pages and regular pages.

Example 13.3

http://search.mydatabase.com/abc123/ //search results URL
http://www.mydatabase.com/abc123/ //individual entry URL

Unfortunately, websites that use this system have put up one of the few roadblocks that has no good way around it. For security reasons, JavaScript running on a page from one domain cannot access a server with a different domain name. This restriction, known as the same-origin policy, was a decision made by the powers that control JavaScript and is totally out of your hands. In the above example, though the URLs look similar to us, they are considered to be in different domains. This generally affects only search results pages, and you should still be able to write a translator that will scrape individual entries.
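
To see why the two URLs in Example 13.3 count as different domains, you can compare their origins (scheme, host name and port) with JavaScript's standard URL object. This is a small sketch for illustration only; the helper name sameOrigin is an assumption, and it says nothing about how Zotero itself performs the check.

```javascript
// The two URLs from Example 13.3.
const searchUrl = new URL("http://search.mydatabase.com/abc123/");
const entryUrl = new URL("http://www.mydatabase.com/abc123/");

// Hypothetical helper: two URLs share an origin only if scheme, host and port all agree.
function sameOrigin(a, b) {
  return a.origin === b.origin;
}

console.log(searchUrl.origin); // http://search.mydatabase.com
console.log(entryUrl.origin);  // http://www.mydatabase.com
console.log(sameOrigin(searchUrl, entryUrl)); // false: the host names differ
console.log(sameOrigin(entryUrl, new URL("http://www.mydatabase.com/xyz789/"))); // true
```

The paths are identical, but because the host names differ the two pages are treated as strangers to one another.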

The only real solution to the above problem is quite advanced. I won't get into it in detail, but if you find that you must have this information, you will have to use HTTP GET to request the entire HTML of the page, save that into a variable, then parse the content in which you are interested. Chances are, if you know what this means, you didn't need this guide to begin with. If you don't know what this means, resign yourself to making a translator that will scrape individual entries only, or select a different database to translate.
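
For the curious, the general shape of that approach can be sketched in a few lines. Everything here is an assumption made for illustration: httpGet is a hypothetical stand-in for whatever HTTP helper your environment provides, fakeGet is a stub so the sketch runs without a network connection, and pulling out the page title is just an example of "the content in which you are interested".

```javascript
// Pull the text of the <title> element out of a string of HTML.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// httpGet is a hypothetical stand-in: any function that takes a URL and
// returns a promise for the page's HTML as a string.
async function scrapeTitle(url, httpGet) {
  const html = await httpGet(url); // save the entire page into a variable
  return extractTitle(html);       // then parse out the part of interest
}

// Stub used here so the sketch runs offline; a real helper would fetch the page.
const fakeGet = async (url) =>
  "<html><head><title> Search results for abc123 </title></head><body></body></html>";

scrapeTitle("http://search.mydatabase.com/abc123/", fakeGet)
  .then((title) => console.log(title)); // prints "Search results for abc123"
```

A real translator would need more careful parsing than a single RegExp, which is part of why this route is considered advanced.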

You can load translators made by others into Scaffold to see how they have used RegExps to ensure the best possible target.

If you get a "false" message when you try to "Test Regex", double-check your typing. Even a small mistake will cause an error. Try shortening your target to include only the parts of the URL that appear on ALL pages. What you need is the smallest part of the URL that is shared by all the pages you want to translate and is unique to your site. http://www.thestar.com/News/GTA/article/482613 will test as "true" if you are currently looking at the article about Water Bottles, but it is too specific to work on any other Toronto Star article. Keep it short and keep it simple.
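
The Toronto Star case can be tried out the same way. In the sketch below, the second article URL and the shortened target are assumptions chosen for illustration; the right shortened target for a real site depends on which of its pages you want to translate.

```javascript
// The article URL from the text, plus a hypothetical second article.
const currentArticle = "http://www.thestar.com/News/GTA/article/482613";
const otherArticle = "http://www.thestar.com/News/Ontario/article/123456"; // hypothetical

// Using the full article URL as the target: too specific.
const tooSpecific = new RegExp("http://www.thestar.com/News/GTA/article/482613");

// One possible shortened target (an assumption): just the site address.
const shortened = new RegExp("http://www.thestar.com/");

console.log(tooSpecific.test(currentArticle)); // true
console.log(tooSpecific.test(otherArticle));   // false
console.log(shortened.test(currentArticle));   // true
console.log(shortened.test(otherArticle));     // true
```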
