Fixing APA citations from Pandoc with stringr

Pandoc is awesome. It’s the universal translator for plain-text documents. I especially like that it can do inline citations. I write @Jones2005 proved aliens exist and pandoc produces “Jones (2005) proved aliens exist”.

But it doesn’t quite do APA style citations correctly. A citation like @SimpsonFlanders2006 found... renders as “Simpson & Flanders (2006) found…”. Inline citations are not supposed to have an ampersand. It should be “Simpson and Flanders (2006) found…”.

In the grand scheme of writing and revising, these errors are tedious low-level stuff. But I have colleagues who will read a draft of a manuscript and write unnecessary comments about how to cite stuff in APA. And the problem is just subtle and pervasive enough that it doesn’t make sense to manually fix the citations each time I generate my manuscript. My current project has 15 of these ill-formatted citations. That number is just big enough to make manual corrections an error-prone process: it is easy to miss 1 in 15.

Find and replace

I wrote a quick R function that replaces all those inlined ampersands with “and”s.

library("stringr")

fix_inline_citations <- function(text) {

Let’s assume that an inline citation ends with an author’s last name followed by a parenthesized year: SomeKindOfName (2001). We encode these assumptions into regular expression patterns, prefixed with re_.

The year is pretty easy. If it looks weird, it’s because I prefer to escape special punctuation like ( using brackets like [(]. Otherwise, a year is just four digits: \\d{4}.

  re_inline_year <- "[(]\\d{4}[)]"

What’s in a name? Here we have to stick our necks out a little bit more about our assumptions. I’m going to assume a last name is any combination of letters, hyphens and spaces (spaces needed for von Name).

  re_author <- "[[:alpha:]- ]+"
  re_author_year <- paste(re_author, re_inline_year)

We define the ampersand, including the space on either side of it.

  re_ampersand <- " & "

Lookaround, lookaround. Our last regular expression trick is positive lookahead. Suppose we want just the word “hot” from the larger word “hotdog”. Using just hot would match too many things, like the “hot” in “hoth”. Using hotdog would match the whole word “hotdog”, which is more than we asked for. Lookaround patterns allow us to impose more constraints on a pattern. In the “hotdog” example, positive lookahead hot(?=dog) says find “hot” if it precedes “dog”.

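We use positive lookahead to find only the ampersands followed by an author name and a year. Because the lookahead checks the following text without including it in the match, only the ampersand and its surrounding spaces are matched. We replace the strings that match this pattern with “and”s, and the name and year are left untouched.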

  re_ampersand_author_year <- sprintf("%s(?=%s)", re_ampersand, re_author_year)  
  str_replace_all(text, re_ampersand_author_year, " and ")
}
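
Here is a small sketch of the “hotdog” lookahead example on its own, in case the idea is easier to see outside the citation function. The words vector is made up for illustration. The point is that the lookahead restricts which “hot” matches, but “dog” never becomes part of the match.

```r
library("stringr")

# Made-up test words for the lookahead demo
words <- c("hotdog", "hoth", "hot")

# Only the "hot" that is followed by "dog" is detected
str_detect(words, "hot(?=dog)")
#> [1]  TRUE FALSE FALSE

# The match itself is just "hot"; the lookahead text is not included
str_extract(words, "hot(?=dog)")
#> [1] "hot" NA    NA

# So a replacement changes "hot" and leaves "dog" alone
str_replace_all("hotdog hoth", "hot(?=dog)", "corn")
#> [1] "corndog hoth"
```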

We can now test our function on a variety of names that it should and should not fix.

do_fix <- c(
  "Jones & Name (2005) found...",
  "Jones & Hyphen-Name (2005) found...",
  "Jones & Space Name (2005) found...",
  "Marge, Maggie, & Lisa (2005) found...")

fix_inline_citations(do_fix)
#> [1] "Jones and Name (2005) found..."         
#> [2] "Jones and Hyphen-Name (2005) found..."  
#> [3] "Jones and Space Name (2005) found..."   
#> [4] "Marge, Maggie, and Lisa (2005) found..."

do_not_fix <- c(
  "...have been found (Jones & Name, 2005)",
  "...have been found (Jones & Hyphen-Name, 2005)",
  "...have been found (Jones & Space Name, 2005)",
  "...have been found (Marge, Maggie, & Lisa, 2005)")  

fix_inline_citations(do_not_fix)
#> [1] "...have been found (Jones & Name, 2005)"         
#> [2] "...have been found (Jones & Hyphen-Name, 2005)"  
#> [3] "...have been found (Jones & Space Name, 2005)"   
#> [4] "...have been found (Marge, Maggie, & Lisa, 2005)"

By the way, our final regular expression re_ampersand_author_year is  & (?=[[:alpha:]- ]+ [(]\d{4}[)]), with a space on each side of the ampersand. It’s not very readable or comprehensible in that form, so that’s why we built it up step by step from easier sub-patterns like re_author and re_inline_year. (Which is a micro-example of the strategy of managing complexity by combining/composing simpler primitives.)
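
To see how the pieces compose, the sub-patterns can be assembled outside the function and printed. This sketch uses the same names as the function body and only base R. Note that cat() shows the pattern as the regex engine sees it, with a single backslash before the d.

```r
# Same sub-patterns as inside fix_inline_citations()
re_inline_year <- "[(]\\d{4}[)]"
re_author <- "[[:alpha:]- ]+"

# paste() joins the two pieces with a single space by default
re_author_year <- paste(re_author, re_inline_year)
re_ampersand <- " & "

# Wrap the author-year pattern in a positive lookahead
re_ampersand_author_year <- sprintf("%s(?=%s)", re_ampersand, re_author_year)

cat(re_ampersand_author_year)
#>  & (?=[[:alpha:]- ]+ [(]\d{4}[)])
```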

Steps towards production

These are complications that arose as I tried to use the function on my actual manuscript:

Placing it in a build pipeline. My text starts with an RMarkdown file that is knitted into a markdown file and rendered into other formats by pandoc. Because this function post-processes output from pandoc, I can’t just hit the “Knit” button in RStudio. I had to make a separate script to do rmarkdown::render to convert my .Rmd file into a .md file which can then be processed by this function.

Don’t fix too much. When pandoc does your references for you, it also does a bibliography section. But it would be wrong to fix the ampersands there. So I have to do a bit of fussing around by finding the line "## References" and processing just the text up until that line.
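
A minimal sketch of that fussing around, assuming the bibliography heading is exactly the line "## References". The helper fix_body_only and the sample lines are hypothetical, not the code from my actual script, and the function is restated in compact form so the example runs on its own.

```r
library("stringr")

# Compact restatement of the function built earlier in the post
fix_inline_citations <- function(text) {
  str_replace_all(text, " & (?=[[:alpha:]- ]+ [(]\\d{4}[)])", " and ")
}

# Hypothetical helper: fix only the lines before the bibliography heading
fix_body_only <- function(lines, heading = "## References") {
  ref_line <- which(lines == heading)
  # No heading found: treat the whole document as body text
  if (length(ref_line) == 0) {
    return(fix_inline_citations(lines))
  }
  body <- seq_len(ref_line[1] - 1)
  lines[body] <- fix_inline_citations(lines[body])
  lines
}

# Made-up document: the reference entry has a title that would be "fixed"
md_lines <- c(
  "Simpson & Flanders (2006) found...",
  "",
  "## References",
  "",
  "Jones, A. (2007). A reply to Simpson & Flanders (2006). Some Journal."
)

fix_body_only(md_lines)
#> [1] "Simpson and Flanders (2006) found..."
#> [2] ""
#> [3] "## References"
#> [4] ""
#> [5] "Jones, A. (2007). A reply to Simpson & Flanders (2006). Some Journal."
```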

Accounting for encoding. I use readr::read_lines and stringi::stri_write_lines to read and write the text file to preserve the encoding of characters. (readr just released its own write_lines today actually, so I can’t vouch for it yet.)
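
Roughly, the read-fix-write step could look like the sketch below. The wrapper fix_markdown_file and the file names are hypothetical, and temporary files stand in for the real manuscript. It assumes the readr and stringi packages are installed.

```r
library("stringr")

# Compact restatement of the function built earlier in the post
fix_inline_citations <- function(text) {
  str_replace_all(text, " & (?=[[:alpha:]- ]+ [(]\\d{4}[)])", " and ")
}

# Hypothetical wrapper: read a markdown file, fix it, write the result
fix_markdown_file <- function(path_in, path_out) {
  lines <- readr::read_lines(path_in)
  stringi::stri_write_lines(fix_inline_citations(lines), path_out)
  invisible(path_out)
}

# In the real pipeline the .md file would come from something like
# rmarkdown::render("manuscript.Rmd", output_format = "md_document"),
# where "manuscript.Rmd" is a made-up file name. Temporary files stand in here.
md_in <- tempfile(fileext = ".md")
md_out <- tempfile(fileext = ".md")
stringi::stri_write_lines("Müller & Flanders (2006) found...", md_in)

fix_markdown_file(md_in, md_out)
readr::read_lines(md_out)
#> [1] "Müller and Flanders (2006) found..."
```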

False matches are still possible. Suppose I’m citing a publication by an organization, like Johnson & Johnson, where that ampersand is part of the name. That citation would wrongly be corrected. I have yet to face that issue in practice though.
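
For what it’s worth, here is the false match in action, plus one possible workaround that I have not needed yet. The helper fix_except and its placeholder scheme are hypothetical: it masks a list of known organization names, runs the fix, and then restores the names.

```r
library("stringr")

# Compact restatement of the function built earlier in the post
fix_inline_citations <- function(text) {
  str_replace_all(text, " & (?=[[:alpha:]- ]+ [(]\\d{4}[)])", " and ")
}

# The false match: the ampersand is part of the organization's name
fix_inline_citations("Johnson & Johnson (2005) reported...")
#> [1] "Johnson and Johnson (2005) reported..."

# Hypothetical workaround: hide protected names behind placeholders,
# fix the remaining citations, then put the names back
fix_except <- function(text, protected) {
  placeholders <- sprintf("<<protected%d>>", seq_along(protected))
  for (i in seq_along(protected)) {
    text <- str_replace_all(text, fixed(protected[i]), placeholders[i])
  }
  text <- fix_inline_citations(text)
  for (i in seq_along(protected)) {
    text <- str_replace_all(text, fixed(placeholders[i]), protected[i])
  }
  text
}

fix_except(
  "Johnson & Johnson (2005) and Simpson & Flanders (2006) agree",
  protected = "Johnson & Johnson"
)
#> [1] "Johnson & Johnson (2005) and Simpson and Flanders (2006) agree"
```

The obvious cost is that the list of protected names has to be maintained by hand.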