How to Write a Zotero Translator:
A Practical Beginners Guide for Humanists

By: Adam Crymble


Chapter 12: Regular Expressions


This Chapter

Regular Expressions (RegExp) are the last major hurdle you will have to jump over before writing a Zotero translator. RegExps are the greatest invention since the wheel. They make it infinitely easier to sort through the data you are going to scrape and to clean it up so that it's in the correct format to be saved into Zotero.

If you wanted to remove all the letters from a to z from a String, you could use 26 replace() String Methods, one for each letter; or you could use a single replace() with a RegExp for an Argument to achieve the same thing. You can search for things such as a carriage return, or any digit, or a letter between a and g, or any alphanumeric or non-alphanumeric character. Even better, you can specify whether you want to look only at the start of a String, only at the end of a String, for one instance, or for every instance of something. These tasks would be much more difficult without RegExps.
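
As a minimal sketch of the idea above (the variable names and the sample String are made up for illustration), a single replace() with a RegExp can strip every lowercase letter at once:

```javascript
// Hypothetical scraped value; any String would do.
var messy = "a1b2c3 x9y8z7";

// [a-z] matches any one lowercase letter. The g flag repeats the match
// across the whole String instead of stopping at the first hit.
var digitsOnly = messy.replace(/[a-z]/g, "");

// digitsOnly is now "123 987"
```

Without the g flag, only the first lowercase letter would be removed.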

Regular Expressions

It is a good idea to familiarize yourself with RegExps as you will come across them. Being able to recognize them and differentiate them from XPaths, HTML and other JavaScript is a skill that will save you some frustration.

RegExps use special characters to do some pretty handy things that would otherwise take several lines of code to accomplish. For example, you can check for the first instance of any lowercase letter between j and p by including /[j-p]/ as an Argument in a String Method.

Example 12.1

var x = "I like ducks";
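
Example 12.1 only declares the String. One possible continuation, assuming the match() String Method is used to run the check described above (the variable name firstMatch is illustrative), might look like this:

```javascript
var x = "I like ducks";

// match() without the g flag returns an array whose first item is the
// first matching substring, or null when nothing matches.
var firstMatch = x.match(/[j-p]/);

// The capital "I" is skipped because the range is lowercase only,
// so firstMatch[0] is "l" (from "like").
```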

Here are some of the most common building blocks for RegExps.

/  / RegExps are written between two slashes. This is akin to the <> found in HTML or the " " which always surround Strings.
[h-p] Finds any single character within the range you define in the brackets. Note that ranges are case-sensitive: upper case A is not between a and z.
[^h-p] Finds any character not within the range defined in the brackets.
\w Any word character (a-z, A-Z, 0-9, _ )
\W Any non-word character (everything else)
\s Any whitespace character
\d Any digit (0-9)
^ Anchors the match to the start of the String (outside brackets only; inside brackets it means "not", as shown above)
$ Anchors the match to the end of the String
+ Looks for one or more occurrences of the item just before it
* Looks for zero or more occurrences of the item just before it
| Matches either the pattern on its left or the pattern on its right
g Global match: finds all matches, not just the first instance. Always appears at the end of the RegExp, after the closing slash.
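
To see a few of the building blocks listed above in action, here is a short sketch. The sample String and variable names are invented for illustration, and test() is a standard RegExp Method (not covered in this chapter) that simply returns true or false.

```javascript
var sample = "  Page 42 of 100  ";

var hasDigit = /\d/.test(sample);               // true: the String contains a digit
var allNumbers = sample.match(/\d+/g);          // ["42", "100"]
var firstWord = sample.match(/\w+/)[0];         // "Page"
var leadingSpace = sample.match(/^\s+/)[0];     // "  " (two spaces)
var noVowels = sample.replace(/[aeiou]/g, "");  // "  Pg 42 f 100  "
var startsWithPage = /^Page/.test(sample);      // false: the String starts with spaces
```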


Explain the following RegExps in plain English and decide which of the Strings satisfy the matches. More than one may be correct.

Example 12.2

A) var x = y.match(/^\s+/g);
 y =

  1. "a simple string"
  2. "a simple  string   "
  3. " a simple string"
  4. "asimplestring"

B) var x = y.match(/^\s*|\s*$/g);
 y =

  1. "a simple string"
  2. "a simple string  "
  3. " a simple string"
  4. "asimplestring"

C) var x = y.match(/\d\d\d\d|\;+/g);
 y =

  1. "193, James St."
  2. "I like rain; I like snow"
  3. "The winter of 1983 was very cold"
  4. "The summer of 1997 was hot; too hot!"
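
Once you have worked out your answers, you may want to check them by running the RegExps yourself. The sketch below is one way to do that; tryMatches is a hypothetical helper written for this exercise, not part of Zotero or JavaScript. It is shown with exercise C, but the RegExp and Strings from A or B can be swapped in.

```javascript
// Hypothetical helper: runs one RegExp against each candidate String and
// collects what match() returns (an array of matches, or null for none).
function tryMatches(regexp, candidates) {
  var results = [];
  for (var i = 0; i < candidates.length; i++) {
    results.push(candidates[i].match(regexp));
  }
  return results;
}

// Exercise C from Example 12.2
var resultsC = tryMatches(/\d\d\d\d|\;+/g, [
  "193, James St.",
  "I like rain; I like snow",
  "The winter of 1983 was very cold",
  "The summer of 1997 was hot; too hot!"
]);

// Print one line per candidate String so the results can be compared
// with your own answers.
for (var n = 0; n < resultsC.length; n++) {
  console.log((n + 1) + ". " + JSON.stringify(resultsC[n]));
}
```

Be aware that a RegExp built only from * parts, as in exercise B, can match an empty stretch of the String, so match() may return empty Strings rather than null.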

One thing you should remember about RegExps is that when writing Zotero translators, a RegExp will almost always be the best way to match something to a known pattern or to clean away unwanted characters from Strings so that you can save the data into Zotero. This makes them particularly handy to use in conjunction with the replace() Method.

For example, if you scrape a piece of bibliographic information and debug it to find it looks like this:

Example 12.3

var x = "                     346                       ";

You can get rid of the whitespace like this:

Example 12.4

x = x.replace(/^\s*|\s*$/g, '');

If you ever need to remove all the whitespace from the start and end of a String, you can always come back here and snag this piece of code. You can also use the same replace() on any String, even ones without any whitespace at all. Because the RegExp looks for zero or more whitespace characters, in a String like "Adam" it finds no whitespace to remove and the String is left unchanged. This is extremely handy when some entries have unwanted characters and others do not, such as in the case of multiple authors of a work, separated by a new line or semicolon. It acts much as an If Statement would, only making the change if necessary.
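
To illustrate the point about entries that may or may not need cleaning, here is a minimal sketch. The helper name trimWhitespace and the sample author String are made up for illustration; they are not part of Zotero. The split() String Method used here is standard JavaScript and accepts a RegExp as its Argument.

```javascript
// Hypothetical helper wrapping the replace() from Example 12.4.
function trimWhitespace(text) {
  return text.replace(/^\s*|\s*$/g, '');
}

var padded = trimWhitespace("        346        ");  // "346"
var clean = trimWhitespace("Adam");                  // "Adam", unchanged

// Hypothetical scraped author list, separated by semicolons and a new line.
var rawAuthors = "Smith, Jane;  Jones, Bob\n Lee, Ann ";

// Split on either a semicolon or a new line, then trim each piece.
var authors = rawAuthors.split(/[;\n]/);
for (var i = 0; i < authors.length; i++) {
  authors[i] = trimWhitespace(authors[i]);
}
// authors is now ["Smith, Jane", "Jones, Bob", "Lee, Ann"]
```

The same helper is applied to every piece, whether or not that piece actually carries stray whitespace.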

RegExps may seem confusing the first few times you try to use them, so keep a good reference sheet handy and don't be afraid to post questions to a JavaScript forum if you get stuck. This is exactly the type of thing you will be able to get help with.

What You Should Understand Before Moving On

Further Reading