Partway through her LTER Postdoc at SESYNC, ecologist Meghan Avolio ran into trouble manipulating her data on plant communities with dplyr functions. I had encouraged Meghan to modularize her scripts by writing functions for common steps in her pipeline (such as converting count data into rank-abundance curves). “You’ll love writing functions!” I said wrongly.
Meghan quickly ran up against the trickery of non-standard evaluation (NSE) as employed by most of the dplyr functions. In calls to mutate, filter, group_by and friends, column names are supplied without quotation marks. An expression like mutate(df, rank = rank(abund)) fails when abund is not literally a column in the data frame. And how could you possibly know the column names in a function that handles arbitrary data frames?
NSE can be glimpsed elsewhere in R – it’s what lets you leave package names unquoted in the command library(dplyr). Without NSE, that command would fail with the message Error: object 'dplyr' not found. After all, nothing called dplyr exists in the global environment. If the unquoted dplyr in library(dplyr) doesn’t exist in the global environment, how can this expression achieve the same result as the standard evaluation call library("dplyr")? The absence of quotation marks is the proverbial tip of an iceberg, so buckle up your crampons and read on.
As a contrived example of when the savvy dplyr user needs to know about NSE, suppose you want to write a function that does for arbitrary data frames what the following does for plots (the script at the end of this post will create plots):
plots %>%
filter(Plot == 2) %>%
summarize(avg = mean(Cover))
avg
1 2.25
An incorrect attempt at such a function is:
level_mean_nope <- function(data, factor, level, value) {
data %>%
filter(factor == level) %>%
summarize(avg = mean(value))
}
The intent above is to keep rows that have the value of level in whichever column is named factor, and subsequently apply the function mean to whichever column is named value. But try it and be warned:
level_mean_nope(plots, 'Plot', 2, 'Cover')
Warning in mean.default(value): argument is not numeric or logical:
returning NA
avg
1 NA
What’s worse than the warning is that the answer given, NA, is wrong! Without the summarize part, the result would still not be what you’d expect, and there would be no warning at all.
By the way, have you read the Programming with dplyr vignette? Did it help? Yeah, me neither … but this stuff is hard to understand!
Let’s go back to library(dplyr) and talk about R’s lazy evaluation mechanism, or more specifically, about promise objects. Very briefly, when you use a function, its arguments are not evaluated before handing off to the function’s internals. The call to library does not try to evaluate dplyr before starting through the code within the library function – the interpreter is “lazy”. If I were a lazy interpreter, I’d accept any note you handed me and promise to translate it when the time comes. Moreover, I’d promise to read it in the context it was given, so references in the note to ‘this’ or ‘those’ would be to things present when you gave me the note. By using the stored context, R functions appear to evaluate the function’s arguments in a standard way … it just won’t unless absolutely necessary (hence “lazy”).
Technically, when the R interpreter encounters a function call, each argument gets embedded in an object representing itself as an expression (i.e. a bit of code) along with a pointer to the environment in which the function was called. This promise object can evaluate the expression in that environment when needed, so it conveys the correct value like any normal variable would. But you can perform sneaky tricks with a promise object, too. The library(dplyr) command examines the expression dplyr and infers the string "dplyr", without ever attempting to evaluate the expression (which would fail in the global environment!). All this to save you the trouble of typing quotation marks.
Like the library function, filter and summarize look at the embedded expression rather than instruct the lazy interpreter to get on with evaluating it. You can do the same, but should you?
“Yes, I can!”
You too can write a NSE function using substitute, a function that will modify code itself.
Come again, now? The purpose of substitute(expr) is to modify expr using variables found in the current environment. Within a function, the environment includes the promise objects created from each argument, and substitute(expr) will replace any code in expr referencing a promise object with the promised expression (still unevaluated!).
level_mean_nse <- function(data, factor, level, value) {
only_this_level <- substitute(factor == level)
mean_of_value <- substitute(mean(value))
data %>%
filter(!!only_this_level) %>%
summarize(avg = !!mean_of_value)
}
Focus on the second line, where substitute gets the expression factor == level. The first part of that expression will be substituted for the expression embedded in the promise object factor, whatever it may be when level_mean_nse gets used. The same happens for level, and the result is stored as only_this_level. A couple lines below that, notice the !! in the call to filter, and again in summarize. Since we are implementing our own NSE, we have to bypass the dplyr mechanism for lazy evaluation, and that’s exactly what the !! prefix does. Note you could also use rlang’s “quosure”, but c’mon this isn’t lolcode.
Now try using your function with dplyr style unquoted arguments:
level_mean_nse(plots, Plot, 2, Cover)
avg
1 2.25
See how it works? The application of filter ended up equivalent to filter(plots, Plot == 2). Sound good enough? Feel free to stop reading if you want to skip the lecture.
“No, you shouldn’t.”
When you program with substitute (or quo) you are writing code to write code to do what you want. Whenever possible, just write code to do what you want. That way, the code is more readable for your collaborators, yourself two weeks from now, and the consumers of your open source contribution. The code is also easier to develop and debug. You cannot, for example, get the body of level_mean_nse working first and wrap it up in a function() {...} block last, because substitute only works right inside a function call.
You should know how to avoid NSE altogether by treating column names as what they are, strings. Newer versions of dplyr that import rlang provide a .data object to get around NSE. The .data object is something like the original data but as processed “so far”. Subset .data in the normal way, referencing columns by name, within any of the dplyr verbs:
data <- plots
factor <- 'Plot'
level <- 2
value <- 'Cover'
data %>%
filter(.data[[factor]] == level) %>%
summarise(avg = mean(.data[[value]]))
avg
1 2.25
With that working, wrap it up in level_mean <- function(data, factor, level, value) {...} and you are good to go.
tl;dr
library(dplyr)
plots <- read.csv(textConnection("
Plot, Genus, Species, Cover
1, androp, gerar, 4
1, schiza, scopa, 1
1, panicu, virga, 6
2, androp, gerar, 3
2, panicu, virga, 4
2, sorgha, nutan, 1
2, sporob, compo, 1
"), strip.white = TRUE)
# goal: a function to do the following for arbitrary data frames
plots %>%
filter(Plot == 2) %>%
summarize(avg = mean(Cover))
# this won't work
level_mean_nope <- function(data, factor, level, value) {
data %>%
filter(factor == level) %>%
summarize(avg = mean(value))
}
level_mean_nope(plots, Plot, 2, Cover)
# you could write your own NSE function
level_mean_nse <- function(data, factor, level, value) {
only_this_level <- substitute(factor == level)
mean_of_value <- substitute(mean(value))
data %>%
filter(!!only_this_level) %>%
summarize(avg = !!mean_of_value)
}
level_mean_nse(plots, Plot, 2, Cover)
# you should avoid NSE altogether
level_mean <- function(data, factor, level, value) {
data %>%
filter(.data[[factor]] == level) %>%
summarize(avg = mean(.data[[value]]))
}
level_mean(plots, 'Plot', 2, 'Cover')