A few weeks ago, I had a bad debugging session. The code was just not doing what I expected, and I went down a lot of dead ends trying to fix or simplify things. I could not get the problem to happen in a reproducible example (reprex) or interactively (in RStudio). Eventually, the most minimal example of the problem completely broke my mental model for how the code should work.
The problem had to do with names and what they mean. select() is a function that lives in both the MASS package and the dplyr package, and I always intend for select() to point to dplyr::select(). But sometimes a statistics package will load MASS and overwrite select() to point to MASS::select(). And in this case, my attempts to use select() in a source()-ed file kept reverting to MASS::select() instead of dplyr::select(). A tweet from the session shows the minimal example and my wracked brain. (I will describe the example in more detail below.)
i'm dry heaving here wtf is going pic.twitter.com/KIeRJT6kwY
— tj mahr (@tjmahr) July 21, 2021
Here's what happens:
- I explicitly assign select to dplyr::select().
- I make a function f() that prints the environment of select (where the name/function is defined), store the function in a .R text file and source() in the text file. (source() runs the code in an R script.)
- I print the value of select and see that it is indeed from the dplyr environment.
- I call my function, and it says that select is actually in the MASS package.
- I check the value of select, and it reports the dplyr environment once again.
A similar problem using functions
This problem only happened while knitting one of my analysis notebooks (which was a clue). Right now, it's proving difficult for me to write examples of this problem for this blog post, so I'm going to show the source of the problem using functions.
First, let's set things up so that select belongs to the MASS package. We are also going to use the conflicted package, which normally prevents package name conflicts from happening. This part isn't necessary or helpful; I just want to illustrate that this is not a simple name conflict problem.
library(conflicted)
library(MASS)
environment(select)
#> <environment: namespace:MASS>
We are going to make a function that does what my original code example tried to do:
- set select to dplyr explicitly
- source() in a file that gives the environment of select
- return the environment of select, both using the source()-ed function and directly.
source_in_my_code <- function(...) {
  # set dplyr select
  select <- dplyr::select

  # write a script to temporary file
  temp_script <- tempfile(fileext = ".R")
  my_code <- "
  f <- function() environment(select)
  "
  writeLines(my_code, temp_script)

  # run the script
  source(temp_script, ...)

  list(
    source_select_environment = f(),
    function_select_environment = environment(select)
  )
}
default_results <- source_in_my_code()
What do you think the select environment should be? dplyr, right? That's what select means everywhere else inside of the function. source() is just like dropping in some R code and running it, right? That's what I thought.
default_results
#> $source_select_environment
#> <environment: namespace:MASS>
#>
#> $function_select_environment
#> <environment: namespace:dplyr>
No, it's the MASS environment.
Local and parent environments
In order to understand what's happening, let's first note that R works by evaluating expressions in an environment. The environment defines the values of names. If a name is not found in an environment, R searches the parent environment for the name (or the parent's parent, and so on). This idea is illustrated beautifully in Advanced R using diagrams.
For an analogy, you might think of environments as looking up someone in an office, a building directory, then an area directory:
I like the multi-company building analogy. If you want to call Jim, first you look in your company directory. If there isn't a Jim there, you look in the all-building maintenance dir. If not there, you look in the city services dir. You don't look in another company-specific dir
— Brenton Wiernik (@bmwiernik) April 27, 2021
Here is a small example showing a local function environment, its parent environment and how a name will take different values depending on the context.
where_am_i <- "outside of the function"
where_are_you <- "outside of the function too"
where_is_everyone <- function() {
  where_am_i <- "inside of the function"
  list(
    where_am_i = where_am_i,
    where_are_you = where_are_you
  )
}
where_am_i
#> [1] "outside of the function"
where_is_everyone()
#> $where_am_i
#> [1] "inside of the function"
#>
#> $where_are_you
#> [1] "outside of the function too"
where_am_i
#> [1] "outside of the function"
Outside of the function, where_am_i is "outside of the function", but in the body of the function, it is defined as "inside of the function". The variable where_are_you is only defined outside of the function (as "outside of the function too"), so the function has to search for the variable in its parent environment.
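The lookup described above can be checked directly with base R tools. This is only a small sketch: the function name inspect_lookup() and the variable here are made up for illustration, and the parent_is_global result assumes the code is run at the top level of a session.

```r
where_are_you <- "outside of the function too"

# Hypothetical helper: reports how `where_are_you` is found from inside a function
inspect_lookup <- function() {
  where_am_i <- "inside of the function"
  here <- environment()  # the local environment of this function call
  list(
    # is the name defined in the local environment itself? (no parent search)
    defined_locally = exists("where_are_you", envir = here, inherits = FALSE),
    # is the parent of the local environment the global environment?
    parent_is_global = identical(parent.env(here), globalenv()),
    # a normal lookup falls back to the parent environment
    found_via_parent = get("where_are_you", envir = here)
  )
}

result <- inspect_lookup()
result$defined_locally
#> [1] FALSE
result$found_via_parent
#> [1] "outside of the function too"
```

The name is not in the function's own environment, so R finds it one level up.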
"parent" environment suggests a family metaphor. if you cant find what a symbol means, ask a parent.
— tj mahr (@tjmahr) April 27, 2021
Locally sourced R code
Reading the documentation for source(), we find the solution to the original problem:
Arguments
local: TRUE, FALSE or an environment, determining where the parsed expressions are evaluated. FALSE (the default) corresponds to the user's workspace (the global environment) and TRUE to the environment from which source is called.
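By default, the code evaluated by source() runs in the global environment, that is, "outside" of the body of the function. The code breaks out of the function environment and runs in the top-level environment. As a result, f() is created in the global environment. When f() looks up select, it searches the global environment and then the attached packages (where MASS::select() lives), and it never sees the select defined inside source_in_my_code().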
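One way to see where a sourced function ends up is to ask R directly. The sketch below uses made-up names (wrapper() and sourced_fn) and only base R. Note that it leaves sourced_fn behind in the global environment, which is exactly the behavior being illustrated.

```r
# Hypothetical script that defines a single function
script <- tempfile(fileext = ".R")
writeLines("sourced_fn <- function() NULL", script)

wrapper <- function() {
  source(script)  # default: local = FALSE
  list(
    # was the function created inside wrapper()'s own environment?
    is_local = exists("sourced_fn", envir = environment(), inherits = FALSE),
    # or was it created in the global environment?
    is_global = exists("sourced_fn", envir = globalenv(), inherits = FALSE),
    # name lookups inside sourced_fn() start from its enclosing environment
    encloser_is_global = identical(environment(sourced_fn), globalenv())
  )
}

res <- wrapper()
res$is_local
#> [1] FALSE
res$is_global
#> [1] TRUE
res$encloser_is_global
#> [1] TRUE
```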
My mental model for source() was completely wrong. source() is not like dropping in the R code from a file and running it. It is more like pausing everything that you're doing in your current context, backing out to the highest level context, running that code, and then resuming what you're doing.
Fortunately, if we ask source to run locally (local = TRUE), select has the same environment inside the function and in the code run using source().
# I defined the function so it could pass arguments to source()
source_in_my_code(local = TRUE)
#> $source_select_environment
#> <environment: namespace:dplyr>
#>
#> $function_select_environment
#> <environment: namespace:dplyr>
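The same contrast shows up without any packages involved. Here is a minimal sketch, assuming it is run at the top level of a session so that greeting lives in the global environment. The names script, greeting, seen and run_script() are all invented for this illustration.

```r
# Hypothetical script: it records whichever `greeting` it can see
script <- tempfile(fileext = ".R")
writeLines("seen <- greeting", script)

greeting <- "global"

run_script <- function(local) {
  greeting <- "local"
  source(script, local = local)
  seen
}

global_result <- run_script(local = FALSE)
global_result
#> [1] "global"

local_result <- run_script(local = TRUE)
local_result
#> [1] "local"
```

With local = FALSE, the script never sees the greeting defined inside run_script(), and it also leaves a seen variable behind in the global environment.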
When we're using source() as one of the first few lines of an R script, the default global environment for source() doesn't really matter. But in contexts like the function example or code stored in a custom knitr/RMarkdown setup (my original problem), this difference is a problem. Therefore, in the future, I'm going to abide by the motto Keep it locally sourced. This way fits my mental model for source() as something that drops in R code and runs it in place.
And by the way, yes, even though I cited Advanced R above, I clearly did not do all of the exercises:
- Carefully read the documentation for source(). What environment does it use by default? What if you supply local = TRUE? How do you provide a custom environment?
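For the last question in that exercise, the documentation quoted earlier says local can also be an environment. A small sketch of that option, with a made-up script and an environment I am calling sandbox:

```r
# Hypothetical script that defines one variable
script <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script)

# A fresh environment to hold whatever the script creates
sandbox <- new.env()
source(script, local = sandbox)

exists("answer", envir = sandbox, inherits = FALSE)
#> [1] TRUE
sandbox$answer
#> [1] 42
```

Everything the script defines lands in sandbox, so nothing is added to the calling environment or the global environment.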
Last knitted on 2022-05-27. Source code on GitHub. [1]

[1] .session_info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/Chicago
#>  date     2022-05-27
#>  pandoc   NA
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  cachem        1.0.6      2021-08-19 [1] CRAN (R 4.2.0)
#>  cli           3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  conflicted  * 1.1.0      2021-11-26 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2      2021-12-20 [1] CRAN (R 4.2.0)
#>  dplyr         1.0.9      2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  emo           0.0.0.9000 2022-05-25 [1] Github (hadley/emo@3f03b11)
#>  evaluate      0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.2      2022-01-31 [1] CRAN (R 4.2.0)
#>  git2r         0.30.1     2022-03-16 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  here          1.0.1      2020-12-13 [1] CRAN (R 4.2.0)
#>  knitr       * 1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0      2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS        * 7.3-56     2022-03-23 [2] CRAN (R 4.2.0)
#>  memoise       2.0.1      2021-11-26 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  ragg          1.2.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2      2022-03-04 [1] CRAN (R 4.2.0)
#>  rprojroot     2.0.3      2022-04-02 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6      2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.2.0)
#>  systemfonts   1.0.4      2022-02-11 [1] CRAN (R 4.2.0)
#>  textshaping   0.3.6      2021-10-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7      2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  xfun          0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>
#>  [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────