Unevaluated expressions

Updated on Tuesday, January 30, 2024 09:44 AM

TJ Mahr

Notes to self and other noodlings. Main site.

January 2024

Source: 2024-01-19.Rmd

One-off knitr engine for blockquoting things

Here’s how easy it is to create a knitr language engine:

# create a custom language block that blockquotes its contents
knitr::knit_engines$set("blockquote" = function(options) {
  if (isFALSE(options$echo)) return()
  
  output <- paste0("> ", options$code, collapse = "\n")
  
  if (!is.null(options$source_url)) {
    output <- sprintf("%s\n[[source]](%s)", output, options$source_url)
  }
  
  output
})

Writing a chunk like:

```{blockquote}
#| source_url: https://tjmahr.com
Look at this **markdown**!
  
Here is a nested blockquote:
  
  > Cool!

Yes.
```

Will yield the following output:

Look at this markdown!

Here is a nested blockquote:

Cool!

Yes. [source]

I wanted to make this a hook function but couldn’t figure it out.

Simulation for a probability paradox

This paradox is a flavor of the Sleeping Beauty paradox or the Boy or Girl paradox.

You are given an urn containing 100 balls; n of them are red, and 100-n are green, where n is chosen uniformly at random in [0, 100]. You take a random ball out of the urn—it’s red—and discard it. The next ball you pick (out of the 99 remaining) is: [poll omitted] [source]

And here is some code I wrote to run a simulation to get the unexpected solution.

f <- function(n = 100) {
  n_red <- sample.int(n, 1)
  p1 <- n_red / n
  p2 <- (n_red - 1) / (n - 1)
  draw <- sample(c("r", "g"), 1, prob = c(p1, 1 - p1))
  if (draw == "g") {
    "skip"
  } else {
    sample(c("r", "g"), 1, prob = c(p2, 1 - p2))
  }
}
counts <- replicate(100000, f()) |> 
  table()
counts
#> 
#>     g     r  skip 
#> 16808 33780 49412

# p(color | first red)
counts[1:2] |> proportions()
#> 
#>         g         r 
#> 0.3322527 0.6677473
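
As a check on the simulation, here is a quick exact calculation (my own, not from the source thread): average over the possible urn compositions, with the number of red balls uniform on 1 to 100 to match sample.int() above.

n_red <- 1:100
p_first_red <- n_red / 100
p_second_red_given_first <- (n_red - 1) / 99

# P(second red | first red), averaging over urn compositions
sum(p_first_red * p_second_red_given_first) / sum(p_first_red)
#> [1] 0.6666667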

Sentence similarity

I read an article I had bookmarked for a while. Suppose you have a word embedding (a vector) for each word, or suppose you have a sentence embedding.

The Cohere embedding assigns a vector of length 4096 (i.e., a list of 4096 numbers) to each sentence. Furthermore, the multilingual embedding does this for sentences in more than 100 languages. In this way, the sentence “Hello, how are you?” and its corresponding French translation, “Bonjour, comment ça va?” will be assigned very similar numbers, as they have the same semantic meaning. [article]

They have toy data where some movies are assigned vectors:

df_movies <- data.frame(
  row.names = c("you've got mail", "rush hour", "rush hour 2", "taken"),
  action = c(0, 6, 7, 7),
  comedy = c(5, 5, 4, 0)
) 
movies <- df_movies |> as.matrix()
movies
#>                 action comedy
#> you've got mail      0      5
#> rush hour            6      5
#> rush hour 2          7      4
#> taken                7      0

One option for similarity is the dot product of the vectors. (Why?)

# the two rush hour movies
sum(movies[2, ] * movies[3, ])
#> [1] 62

# all at once
dot_products <- movies %*% t(movies)

Check this out. as.dist() will clean up the redundant output, but it discards the diagonal. Suggestion from StackOverflow.

as.dist(dot_products)
#>             you've got mail rush hour rush hour 2
#> rush hour                25                      
#> rush hour 2              20        62            
#> taken                     0        42          49

Another option is cosine similarity. They plot the points on a coordinate space and observe:

[…] we want a measure of similarity that is high for sentences that are close to each other, and low for sentences that are far away from each other. [Euclidean] Distance does the exact opposite. So in order to tweak this metric, let’s look at the angle between the rays from the origin (the point with coordinates [0,0]), and each sentence. Notice that this angle is small if the points are close to each other, and large if the points are far away from each other. Now we need the help of another function, the cosine. The cosine of angles close to zero is close to 1 […] [article]

According to Wikipedia, we can derive the cosine similarity from the dot-product:

\[ \begin{aligned} \mathbf{A}\cdot\mathbf{B}&=\left\|\mathbf{A}\right\|\left\|\mathbf{B}\right\|\cos\theta \\ {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} &= \cos(\theta) \\ \|\mathbf{m}\| &:= \sqrt{\mathbf{m} \cdot \mathbf{m}} ~~~~~\text{(Euclidean norm)} \end{aligned} \]

We have a matrix with all of the dot products ready to go, so we need a matrix of all norms multiplied together.

dot_products
#>                 you've got mail rush hour rush hour 2 taken
#> you've got mail              25        25          20     0
#> rush hour                    25        61          62    42
#> rush hour 2                  20        62          65    49
#> taken                         0        42          49    49

norms <- sqrt(rowSums(movies ^ 2))
norm_products <- norms %*% t(norms)

(dot_products / norm_products) |> as.dist() |> round(3)
#>             you've got mail rush hour rush hour 2
#> rush hour             0.640                      
#> rush hour 2           0.496     0.985            
#> taken                 0.000     0.768       0.868

This value is cosine similarity. If we use 1 - similarity, it’s cosine distance. That would be more appropriate for anything piped into as.dist().

Still, two points could be very different distances from the origin, but they could have an angle of 0.
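
A tiny made-up example of that last point: these two points are far apart in Euclidean terms, but the angle between them is 0, so their cosine similarity is 1.

a <- c(1, 1)
b <- c(10, 10)

# dot product over the product of the norms
sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))
#> [1] 1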

How I got targets to handle a bunch of bootstrapping

Here is a summary:

My bootstraps would instantly fill up my computer’s memory during a build. So now I dump stuff out of memory with "transient". (You can do this on a by-target level too.) I also keep running the build if one target/branch fails:

tar_option_set(
  error = "null",
  memory = "transient"
)

I also split a single target into smaller ones. I declare a global partition value like

N_PARTITIONS <- 80

Then I used tarchetypes::tar_group_by() to tell targets to split a giant dataframe into groups that can be dynamically branched over. Here is some actual code in the _targets.R file. The first target defines 2000 bootstrap replicates to be split into 80 partitions/branches, and the second fits the models on each partition/branch.

  tar_group_by(
    straps,
    data_iwpm |>
      filter(!is.na(mean_wpm)) |>
      prepare_bootstrap_data() |>
      bootstrap_by_slpg(
        times = n_boot,
        col_child_id = subject_id,
        seed = 20220621
      ) |>
      mutate(partition = assign_partition_numbers(id, N_PARTITIONS)),
    partition
  ),
  tar_target(
    models_intelligibility,
    fit_bootstrap_models(straps, "mean_intel", "intelligibility", 3, 2),
    pattern = map(straps),
    format = "qs"
  ),
# btw the targets here themselves are built serially bc fit_bootstrap_models()
# and friends use furrr to fit the models in each branch in parallel

But to be honest, as I’m looking at this code, I’m realizing that this “make a giant thing and split it into branches” approach could instead have been built as “make a small table of branch IDs and grow them within each branch”. An analogy would be reading in a dozen files at once and then splitting up the work so each file is processed separately, versus taking a dozen filenames and handling each file from its filename separately.
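
Here is a rough sketch (not the real pipeline) of what that alternative could look like: branch over a small table of partition IDs and generate each partition’s bootstrap replicates inside its branch. make_straps_for_partition() is a made-up helper standing in for the data preparation above.

list(
  tar_target(partitions, seq_len(N_PARTITIONS)),
  tar_target(
    straps,
    # hypothetical helper: prepares data_iwpm and draws only the bootstrap
    # replicates that belong to this partition (seed handling omitted)
    make_straps_for_partition(data_iwpm, partition = partitions),
    pattern = map(partitions)
  ),
  tar_target(
    models_intelligibility,
    fit_bootstrap_models(straps, "mean_intel", "intelligibility", 3, 2),
    pattern = map(straps),
    format = "qs"
  )
)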

I have to use tarchetypes::tar_group_by() instead of tarchetypes::tar_group_count() because I have multiple datasets that I am bootstrapping and they need to be kept together within a branch. Like, to split the following into two branches, I would want bootstrap_number 1 and 2 in one branch and 3 in a separate branch so that I can compare a, b and c within each bootstrap.

data_set,bootstrap_number
a,1
b,1
c,1
a,2
b,2
c,2
a,3
b,3
c,3
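
assign_partition_numbers() isn’t shown in this note; a minimal version that keeps every row of a given bootstrap number in the same partition might look like this (my sketch, not the real helper):

assign_partition_numbers <- function(ids, n_partitions) {
  # rows that share an id share a partition; partitions cycle 1, 2, ..., n
  groups <- match(ids, unique(ids))
  ((groups - 1) %% n_partitions) + 1
}

assign_partition_numbers(c(1, 1, 1, 2, 2, 2, 3, 3, 3), n_partitions = 2)
#> [1] 1 1 1 2 2 2 1 1 1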

For serializing the data, I used format = "qs" on models or lists of models and format = "fst_tbl" on plain, inoffensive data tables.

I also started using targets::tar_read(blah, n) where n is some selection of branches like 1 or c(1, 2, 4), to read a subset of the overall data. This speeds up interactive development by giving me a smaller subset of data to write code around without having to load gigantic objects into R.

targets + crew

In targets, by default, targets are dispatched to be built one at a time, but we can use crew so that multiple workers can build targets in parallel. Here is the code to add to _targets.R to get 2 workers.

targets::tar_option_set(
  controller = crew::crew_controller_local(workers = 2)
)

And then you can view how much work each worker did. (I’m hard-coding the returns to show what happened with a full site build.)

targets::tar_crew()
#>   controller               worker seconds targets
#>   <chr>                     <int>   <dbl>   <int>
#> 1 8f28e966dd6adaf60895eb01      1    49.4      32
#> 2 8f28e966dd6adaf60895eb01      2    25.1       9

MFA note: don’t touch the solver

My colleague and I independently ran into an issue where the Montreal Forced Aligner installation with conda seemed to take forever during the solving environment stage. We both tried to fix this problem by changing the default solver. This change did speed up the installation, but then the MFA could not run at all. The correct thing to do is to do nothing: just use the default solver and wait it out.

Revelation while reading withr 3.0.0 release

I was reading the withr 3.0.0 release notes and noticed a pleasing symmetry between the with_ and local_ functions.

The local_ functions set a temporary value within some local scope, like a local() or a function().

local({
  withr::local_language("fr")
  plot[[1]]
})
#> Error in plot[[1]]: objet de type 'closure' non indiçable
plot[[1]]
#> Error in plot[[1]]: object of type 'closure' is not subsettable

The with_ functions set a temporary value around some block of code.

withr::with_language("fr", {
  plot[[1]]
})
#> Error in plot[[1]]: objet de type 'closure' non indiçable
plot[[1]]
#> Error in plot[[1]]: object of type 'closure' is not subsettable

So where are we working from?

Oh, and the notes for withr 3.0.0 indicate that I should keep an eye on how source() interacts with withr.

A silly function

I got sick of writing null-guard clauses like:

my_cool_function <- function(data, ...) {
  if (is.null(data)) { 
    return(NULL)
  }
  
  #  rest of the function
  nrow(data)
}

So why can’t I just make that a higher-order function:

nullsafe_map <- function(x, f, ...) {
  if (is.null(x)) NULL else f(x, ...)
}

mtcars |> 
  nullsafe_map(my_cool_function) |> 
  nullsafe_map(log)
#> [1] 3.465736

NULL |> 
  nullsafe_map(my_cool_function) |> 
  nullsafe_map(log)
#> NULL

It feels like I reinvented something vaguely monad-y here, and I posted the question: “What have I reinvented here?” Tan suggested “purrr::map_if() with .p = Negate(is.null)?” which is cool because I had forgotten about purrr::map_if().
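
For the record, here is roughly how Tan’s suggestion plays out. map_if() works over the elements of a list and applies the function only where the predicate is TRUE (my own toy example):

list(mtcars, NULL) |> 
  purrr::map_if(Negate(is.null), nrow)
#> [[1]]
#> [1] 32
#> 
#> [[2]]
#> NULL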

The pipe placeholder works better now

Andrew Heiss pointed out to me that the pipe placeholder has gotten a lot better since R 4.3.0.

quote(
  model |> _$coef |> _[1]
) 
#> model$coef[1]

Previously, I had been working around these limitations with things like getElement() or by making my own pipe-friendly wrapper functions like:

vec_element <- function(xs, i) {
  xs[[i]]
}

vec_index <- function(xs, i) {
  xs[i]
}

vec_index_into <- function(i, xs) {
  xs[i]
}

vec_replace_na <- function(xs, replacement) {
  xs[is.na(xs)] <- replacement
  xs
}

vec_remove_na <- function(xs) {
  xs[!is.na(xs)]
}

vec_set_names <- function(xs, ns) {
  names(xs) <- ns
  xs
}

There were two other functions alongside these helpers that weren’t meant to solve any specific piping problem. I came to adore them when I was doing some Advent of Code puzzles for fun in base R.

vec_which_value <- function(xs, value, negate = FALSE) {
  (xs %in% value) |>
    xor(negate) |>
    which()
}

vec_which_name <- function(xs, name, negate = FALSE) {
  vec_which_value(names(xs), name, negate)
}

letters |> 
  vec_which_value(c("a", "e", "i", "o", "u"), negate = TRUE) |> 
  vec_index_into(letters)
#>  [1] "b" "c" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "v" "w" "x"
#> [20] "y" "z"

purrr::pluck() is better than I had realized

From ?purrr::pluck():

[pluck()] also accepts arbitrary accessor functions, i.e. functions that take an object and return some internal piece. […] Compare: accessor(x[[1]])$foo to pluck(x, 1, accessor, "foo").
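
A quick illustration of that pattern with a made-up example, using coef() as the accessor in the middle of the pluck path:

fit <- list(model = lm(mpg ~ wt, data = mtcars))

# same as coef(fit[["model"]])[["wt"]]
purrr::pluck(fit, "model", coef, "wt")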

Mutual recursion demo

I toyed with the experimental Tailcall feature in R-devel. I can’t actually run this code when rendering the notebook, but the basic idea is that sys.nframe() shows the height of the stack when the recursion bottoms out, and it is much smaller in the Tailcall examples.

is_even <- function(n) {
  if (n == 0) list(TRUE, sys.nframe()) else Tailcall(is_odd, n - 1)
}
is_odd <- function(n) {
  if (n == 0) list(FALSE, sys.nframe()) else Tailcall(is_even, n - 1)
}
is_odd(30)

naive_is_even <- function(n) {
  if (n == 0) list(TRUE, sys.nframe()) else naive_is_odd(n - 1)
}
naive_is_odd <- function(n) {
  if (n == 0) list(FALSE, sys.nframe()) else naive_is_even(n - 1)
}
naive_is_odd(30)

Speaking of tail calls, how to write an “iterative” recursive function

I revisited SICP during the Christmas break. I don’t know why this stuck out to me this time—it’s a chapter 1 topic, I have definitely seen it—but SICP differentiates between two kinds of recursive functions. Here is the example in R with factorial. First, let’s make sure that multiplication grows the stack by making it a function call instead of whatever kind of primitive * is.

`%times%` <- function(a, b) a * b

fact1 <- function(n) {
  if (n == 0) {
    message("nframe of fact1: ", sys.nframe())
    1
  } else {
    n %times% fact1(n - 1)
  }
}

fact2 <- function(n) {
  fact_iter <- function(n, accumulator) {
    if (n == 0) {
      message("nframe of fact2: ", sys.nframe())
      1 %times% accumulator
    } else {
      accumulator <- n %times% accumulator
      n <- n - 1
      fact_iter(n, accumulator)
    }
  }
  fact_iter(n, 1)
}

message("nframe of this code block: ", sys.nframe())
#> nframe of this code block: 39
fact1(10)
#> nframe of fact1: 60
#> [1] 3628800
fact2(10)
#> nframe of fact2: 51
#> [1] 3628800

The difference is that n %times% fact1(n - 1) in fact1() builds up a giant chain of deferred multiplications but fact_iter(n - 1, n %times% accumulator) in fact2() does not.

The SICP authors describe the difference between the two like so:

Consider the first process. […] The expansion occurs as the process builds up a chain of deferred operations (in this case, a chain of multiplications). […] This type of process, characterized by a chain of deferred operations, is called a recursive process. Carrying out this process requires that the interpreter keep track of the operations to be performed later on. In the computation of \(n!\), the length of the chain of deferred multiplications, and hence the amount of information needed to keep track of it, grows linearly with \(n\) (is proportional to \(n\)), just like the number of steps. […]

[In the second process, at] each step, all we need to keep track of, for any \(n\), are the current values of the variables [n and accumulator in my code]. We call this an iterative process. In general, an iterative process is one whose state can be summarized by a fixed number of state variables, together with a fixed rule that describes how the state variables should be updated as the process moves from state to state and an (optional) end test that specifies conditions under which the process should terminate.

With something like Fibonacci, where two recursively defined things are added together, the total number of additions grows exponentially.
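
For completeness, here is the Fibonacci version of the two styles (my own sketch, following the SICP treatment): the tree-recursive version defers an exploding number of additions, while the iterative version only carries two state variables.

fib_tree <- function(n) {
  if (n < 2) n else fib_tree(n - 1) + fib_tree(n - 2)
}

fib_iter <- function(n) {
  go <- function(a, b, count) {
    if (count == 0) b else go(a + b, a, count - 1)
  }
  go(1, 0, n)
}

fib_tree(10)
#> [1] 55
fib_iter(10)
#> [1] 55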

May 2023

Source: 2023-05-04.Rmd

Gompertz functions

The Gompertz function “describes growth as being slowest at the start and end of a given time period. The right-side or future value asymptote of the function is approached much more gradually by the curve than the left-side or lower valued asymptote.”

\[ \begin{aligned} f(t) &= a\mathrm{e}^{-b\mathrm{e}^{-ct}} \\ a&: \textrm{asymptote} \\ b&: \textrm{translation left or right} \\ c&: \textrm{growth factor} \end{aligned} \]

So we have two different growth rates but a single parameter relating to growth rate. R’s implementation (stats::SSgompertz()) “simplifies” the double exponential into Asym * exp(-b2 * b3 ^ x).

The scaling factor c or b3 makes sense in the original Gompertz equation. I couldn’t understand what the R version was doing because it was changing the base of exponentiation.
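
The two parameterizations line up once you note that b3 ^ x is just exp(-c * x) with b3 = exp(-c). A quick check of that (mine, not from the documentation):

asym <- 1
b <- 2
rate <- 0.5
x <- seq(-2, 2, by = 0.5)

all.equal(
  asym * exp(-b * exp(-rate * x)),  # original Gompertz form
  asym * exp(-b * exp(-rate) ^ x)   # SSgompertz-style form with b3 = exp(-rate)
)
#> [1] TRUE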

library(tidyverse)

gompertz <- function(xs, asym, b2, b3) {
  asym * exp(-b2 * exp(-b3 * xs))
}

# Make a quick ggplot2 layer of the gompertz function
stat_gompertz <- function(focus, asym, b2, b3, ...) {
  # show what i typed, not the values
  b2 <- rlang::enexpr(b2)
  asym <- rlang::enexpr(asym)
  b3 <- rlang::enexpr(b3)
  labs <- list(
    asym = rlang::expr_deparse(asym), 
    b2 = rlang::expr_deparse(b2), 
    b3 = rlang::expr_deparse(b3)
  )
  rlang::dots_list(asym, b2, b3, .named = TRUE)
  color <- sprintf("%s = %s", focus, labs[[focus]])
  stat_function(
    aes(color = color),
    fun = gompertz, 
    args = rlang::inject(list(asym = !! asym, b2 = !! b2, b3 = !! b3)),
    ...
  )
}

ggplot() + 
  xlim(-5, 5) +
  stat_gompertz("b3", 1, 1, 1) +
  stat_gompertz("b3", 1, 1, 0) +
  stat_gompertz("b3", 1, 1, .1) + 
  stat_gompertz("b3", 1, 1, -.5) +
  ggtitle(
      expression(asym %*% exp(-b[2] %*% exp(-b[3] %*% x)))
  )

We have this weird feature where the curves for \(b_2 = 1/2\) and \(b_2 = 2\) are the same distance from the \(b_2 = 1\) curve, and the distance between the \(b_2 = 2, 3, 4\) curves gets smaller.

ggplot() + 
  xlim(-5, 5) +
  stat_gompertz("b2", 1, 1/2, 1) +
  stat_gompertz("b2", 1, 2, 1) +
  stat_gompertz("b2", 1, 3, 1) +
  stat_gompertz("b2", 1, 4, 1) +
  stat_gompertz("b2", 1, 5, 1) +
  stat_gompertz("b2", 1, 1, 1) + 
  ggtitle(
    expression(asymptote %*% exp(-b[2] %*% exp(-b[3] %*% x)))
  ) +
  labs(
    y = "gompertz(x)",
    color = expression(b[2]),
    subtitle = expression(paste(asymptote == 1, ", ", b[3] == 1))
  )

But the horizontal spacing of the lines is more regular when I pass in exp() expressions.

ggplot() + 
  xlim(-5, 5) +
  stat_gompertz("b2", 1, exp(-3), 1) +
  stat_gompertz("b2", 1, exp(-2), 1) +
  stat_gompertz("b2", 1, exp(-1), 1) +
  stat_gompertz("b2", 1, exp(0), 1) +
  stat_gompertz("b2", 1, exp(1), 1) +
  stat_gompertz("b2", 1, exp(2), 1) +
  stat_gompertz("b2", 1, exp(3), 1) +
  ggtitle(
    expression(asymptote %*% exp(-b[2] %*% exp(-b[3] %*% x)))
  ) +
  labs(
    y = "gompertz(x)",
    color = expression(b[2]),
    subtitle = expression(paste(asymptote == 1, ", ", b[3] == 1))
  )

Suppose I want to set priors for intelligibility for this function. We want the age of steepest growth to be around age 4–5.

library(tidyverse)

gompertz <- function(xs, asym, b2, b3) {
  asym * exp(-b2 * exp(-b3 * xs))
}

asym <- 1
b2 <- rnorm(50, 5, 2)
b3 <- 1
gompertz_grid <- function(xs, asym, b2, b3, expand = TRUE) {
  if (expand) {
    d <- expand.grid(asym = asym, b2 = b2, b3 = b3) |> 
      tibble::rowid_to_column(".draw")
  } else {
    d <- data.frame(asym = asym, b2 = b2, b3 = b3) |> 
      tibble::rowid_to_column(".draw")
  }
  
  expand.grid(.draw = d$.draw, x = xs) |> 
    dplyr::left_join(d, by = ".draw", relationship = "many-to-many") |> 
    dplyr::mutate(
      y = gompertz(x, asym, b2, b3)
    )
}


xs <- 0:192
x <- (xs / 12)

d <- gompertz_grid(
  x, 
  asym = plogis(rnorm(500, 0, 1.5)),
  b2 = rlnorm(500, 8, 1),
  b3 = rgamma(500, 2, 1),
  expand = FALSE
)

d_draw <- d |> 
  distinct(.draw, asym, b2, b3) |> 
  mutate(
    steepest_growth = log(b2) / b3,
    `log(b2)` = log(b2)
  ) |> 
  select(-b2) |> 
  tidyr::pivot_longer(-.draw)

library(patchwork)
ggplot(d) + 
  aes(x = x, y = y) + 
  geom_line(aes(group = .draw), alpha = .1) +
  xlim(0, 18) 
    
ggplot(d_draw) + 
  aes(x = value) + 
  geom_histogram() +
  facet_wrap("name", scales = "free_x")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

b2 and b3 jointly determine the age of steepest growth, so I can’t put a simple prior on it.
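
The steepest_growth = log(b2) / b3 column above is the curve’s inflection point, where growth is fastest. Here is a quick numerical check of that formula (my own), reusing the gompertz() function defined above:

local({
  b2 <- 5
  b3 <- 1
  xs <- seq(0, 10, by = 0.001)
  slopes <- diff(gompertz(xs, 1, b2, b3)) / diff(xs)
  # the x with the steepest slope should be near log(b2) / b3
  c(numerical = xs[which.max(slopes)], analytic = log(b2) / b3)
})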

Postscript from 2024. I never did use this function to model intelligibility.

December 2022

Source: 2022-12-01.Rmd

duckdb

duckdb is awesome. I used it in a project to read all of the control files (TSV files) in a folder into a database with the following code. There is a lot to unpack here, but the main ideas are annotated in the comments:

library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(dbdir = "data-raw/data.duckdb"))

# Generates a SQL query from all the control files in the
# working directory
run_control_file_query <- function() {
  tbl(con, "read_csv_auto('*Control*File-*.txt', FILENAME = TRUE)") |>
    # this all gets converted to SQL statements that will be executed by the
    # database, so there is a mix of R functions and duckdb functions here.
    select(
      control_file = filename,
      repeat_no = Repeat,
      sentence = Sentence,
      response = Response,
      s_word = SWord,
      m_word = MWord,
      m_word_a = MWordA,
      phonetic_sentence = `Phonetic Sentence`,
      phonetic_response = `Phonetic Response`,
      audio_file = File,
    ) |>
    mutate(
      control_file_number = control_file |>
        regexp_extract('\\d{7}\\d+'),
      item = audio_file |>
        toupper() |>
        # fix files with a subscript letter (S7T10a)
        regexp_replace("[ABCDE].WAV", ".WAV") |>
        # fix files with an extra 0(WT010)
        regexp_replace("0(\\d\\d).WAV", "\\1.WAV") |>
        regexp_extract('(S|S\\d|W)(T)\\d\\d.WAV') |>
        left(-4L),
      tocs_type = ifelse(
        substr(item, 1L, 1L) == "W",
        "single-word",
        "multiword"
      )
    )
}

# Create a table from the first batch of files
withr::with_dir("../files-to-import/2022-10-27-batch-1/", {
  DBI::dbRemoveTable(con, "control_files")
  run_control_file_query() |>
    compute(name = "control_files", temporary = FALSE)
})

September 2022

Source: 2022-06-06.Rmd

The BIC

I was wondering about the so-called Bayesian information criterion (BIC) today. What’s the deal with it? Both the AIC and BIC have the form:

\[\text{penalty}\times k\ \text{parameters} - 2\times \text{model log-likelihood}\]

For the AIC, the penalty is 2, and for the BIC, the penalty is \(\ln(n\ \text{observations})\). If we note that ln(20) ≈ 2.996 (the penalty for 20 observations) and that ln(n) exceeds 2 once n ≥ 8, then the BIC penalty is larger than the AIC penalty in practice.
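
A quick check of that general form with a small linear model (my own example); the by-hand values should match AIC() and BIC():

m <- lm(mpg ~ wt, data = mtcars)
ll <- as.numeric(logLik(m))
k <- attr(logLik(m), "df")  # parameter count, including sigma
n <- nobs(m)

c(by_hand = 2 * k - 2 * ll, AIC = AIC(m))
c(by_hand = log(n) * k - 2 * ll, BIC = BIC(m))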

See Aho, Derryberry, & Peterson (2014): The AIC prioritizes out-of-sample prediction (asymptotic efficiency). BIC is an approximation of a Bayesian hypothesis test (so, asymptotic consistency). They note that the BIC is ideal when “(1) only a few potential hypotheses are considered and (2) one of the hypotheses is (essentially) correct,” and the AIC worldview is for when there are “(1) numerous hypotheses and (2) the conviction that all of them are to differing degrees wrong.”

June 2022

Source: 2022-06-06.Rmd

Getting downlit to work on this notebook

notestar, which I use to create this notebook, produces an HTML file. I can then post-process this HTML file using downlit::downlit_html_path() so that R function calls are automatically linked to documentation pages. This feature is great, but out of the box, it ignores the syntax highlighting of the output document. The solution, it turns out, is buried in ?rmarkdown::html_document (look at how nicely that is autolinked). I make sure that the following css file is included in the build process, so that links are underlined and the style defers to the default syntax highlighting coloring.

code a:any-link {
 color: inherit; /* use colour from syntax highlighting */
 text-decoration: underline;
 text-decoration-color: #ccc;
}

Windows Terminal settings for Miniconda

Here is how to add the Miniconda prompt to Windows Terminal. Analogous steps probably will work for an Anaconda prompt.

First, we need to know where the Miniconda prompt lives.

Find the shortcut for the Miniconda prompt (for example, in the Start Menu), open its properties, and look at the Target field. For my Miniconda shortcut, I have:

%windir%\System32\cmd.exe "/K" C:\Users\trist\miniconda3\Scripts\activate.bat C:\Users\trist\miniconda3

We don’t need to hardcode my user profile path, and I don’t think that last path is necessary, so this can simplify to:

%windir%\System32\cmd.exe /K %USERPROFILE%\miniconda3\Scripts\activate.bat

In Windows Terminal, create a terminal profile:

Interactive setup

We can use the Windows Terminal app to set up the Miniconda prompt. Things to change include the profile name, the command line, and (optionally) the icon and starting directory.

JSON

Instead of the interactive setup, we can open the JSON settings file (Settings > Open JSON file) and modify/paste in the settings. Here are mine. The guid field was created by Windows Terminal, so we should use the one it provides for us.

{
  "commandline": "%windir%\\System32\\cmd.exe /K %USERPROFILE%\\miniconda3\\Scripts\\activate.bat",
  "guid": "{2679ff34-f6b9-5fcd-9b81-08b50df61bae}",
  "icon": "\ud83d\udc0d",
  "name": "Miniconda",
  "startingDirectory": null
}

April, May 2022

Source: 2022-04-02.Rmd

.Rprofile contents

I am now adding the following to my .Rprofile file. By default, renv::restore() will undo the entire restore attempt if any of the package installs fails. (It wants the whole transaction to succeed.) This option undoes that default so that successful package installs are kept.

options(renv.config.install.transactional = FALSE)

My full .Rprofile tends to be something like:

source("renv/activate.R")
options(
  repos = c(
    tjmahr = 'https://tjmahr.r-universe.dev',
    # stan = "https://mc-stan.org/r-packages/",
    getOption("repos")
  )
)
options(WrapRmd.width = 72)
options(styler.addins_style_transformer = "grkstyle::grk_style_transformer()")
options(renv.config.install.transactional = FALSE)

Note that I put the renv activation first because it does some work with the repos option and I don’t want to get in the way of it.

R 4.2.0 alpha notes

These are some notes about workarounds etc. for the devel/alpha version of R 4.2.0 that I have been using on Windows for the past few weeks.

February 2022

Source: 2022-02-01.Rmd

Phonotactic probability

I asked the PhonCorps Slack what the state of the art was for computing phonotactic probability. I said I was using the IPhod database with SUBTLEXus frequencies. Some response:

R code to parse the KU phonotactic probability calculator output

# Parse the output of https://calculator.ku.edu/phonotactic/about
# in base R

# TJ Mahr

test_lines <- c(
  "b@d - English", "0.0512 0.0794 0.0380", "0.0059 0.0024", "1.1686 1.0083",
  "bcl - English", "0.0512 0.0165 0.0737", "0.0008 0.0035", "1.1414 1.0043",
  "bi - English", "0.0512 0.0318", "0.0022", "1.0830 1.0022", 
  "bIg - English", "0.0512 0.0962 0.0179", "0.0041 0.0035", "1.1653 1.0076", 
  "but - English", "0.0512 0.0221 0.0660", "0.0012 0.0027", "1.1393 1.0039", 
  "bW - English", "0.0512 0.0097", "0.0006", "1.0609 1.0006", 
  "bO - English", "0.0512 0.0034", "0.0003", "1.0546 1.0003", 
  "b^di - English", 
  "0.0512 0.0392 0.0380 0.0432",
  "0.0034 0.0010 0.0037", 
  "1.1716 1.0081", 
  "b^s - English", "0.0512 0.0392 0.0788", "0.0034 0.0039", "1.1692 1.0073", 
  "Cu - English", "0.0089 0.0221", "0.0002", "1.0310 1.0002"
  )

split_ku_chunks <- function(lines) {
  starts <- grep(" - ", lines)
  ends <- c(starts[-1] - 1, length(lines))
  Map(seq, starts, ends) |>
    lapply(function(i) lines[i])
}

parse_ku_chunk <- function(lines) {
  # How the output is documented in the paper.

  # Vitevitch, M.S. & Luce, P.A. (2004) A web-based interface to calculate
  # phonotactic probability for words and nonwords in English. Behavior Research
  # Methods, Instruments, and Computers, 36, 481-487.

  # "The output of the Phonotactic Probability Calculator
  # (which appears in the field on the right side of your
  # browser) typically consists of four lines of information
  # for each item entered in the left field. The first line con-
  # tains the phonemic transcription you originally entered.
  # The second line contains the position-specific probabil-
  # ity for each segment in the item entered by the user. If the
  # (non)word you entered contains more than seven pho-
  # nemes, the position-specific probabilities for each pho-
  # neme in the item will wrap around to the next line.
  #
  # "The third line (assuming the entry is seven phonemes
  # or less) contains the position-specific biphone probabil-
  # ities for each of the biphones in the word."
  pronunciation <- lines[1] |>
    strsplit(" - ") |>
    unlist() |>
    utils::head(1)

  phones <- pronunciation |>
    strsplit("")  |>
    unlist()

  biphones <- paste0(phones[-length(phones)], phones[-1])

  values <- lines[-1] |>
    paste0(collapse = " ") |>
    strsplit(" ") |>
    unlist() |>
    as.numeric()

  i_position_probs <- seq(1, length(phones))
  i_biphone_probs <- seq(1, length(biphones)) + length(phones)

  positional <- data.frame(
    pronunciation = pronunciation,
    class = "positional_prob",
    token = phones,
    value = values[i_position_probs]
  )

  biphone <- data.frame(
    pronunciation = pronunciation,
    class = "biphone_prob",
    token = biphones,
    value = values[i_biphone_probs]
  )

  rbind(positional, biphone)
}

do_call_rbind <- function(xs, ...) do.call(rbind, xs, ...)

test_lines |>
  split_ku_chunks() |>
  lapply(parse_ku_chunk) |>
  do_call_rbind()
#>    pronunciation           class token  value
#> 1            b@d positional_prob     b 0.0512
#> 2            b@d positional_prob     @ 0.0794
#> 3            b@d positional_prob     d 0.0380
#> 4            b@d    biphone_prob    b@ 0.0059
#> 5            b@d    biphone_prob    @d 0.0024
#> 6            bcl positional_prob     b 0.0512
#> 7            bcl positional_prob     c 0.0165
#> 8            bcl positional_prob     l 0.0737
#> 9            bcl    biphone_prob    bc 0.0008
#> 10           bcl    biphone_prob    cl 0.0035
#> 11            bi positional_prob     b 0.0512
#> 12            bi positional_prob     i 0.0318
#> 13            bi    biphone_prob    bi 0.0022
#> 14           bIg positional_prob     b 0.0512
#> 15           bIg positional_prob     I 0.0962
#> 16           bIg positional_prob     g 0.0179
#> 17           bIg    biphone_prob    bI 0.0041
#> 18           bIg    biphone_prob    Ig 0.0035
#> 19           but positional_prob     b 0.0512
#> 20           but positional_prob     u 0.0221
#> 21           but positional_prob     t 0.0660
#> 22           but    biphone_prob    bu 0.0012
#> 23           but    biphone_prob    ut 0.0027
#> 24            bW positional_prob     b 0.0512
#> 25            bW positional_prob     W 0.0097
#> 26            bW    biphone_prob    bW 0.0006
#> 27            bO positional_prob     b 0.0512
#> 28            bO positional_prob     O 0.0034
#> 29            bO    biphone_prob    bO 0.0003
#> 30          b^di positional_prob     b 0.0512
#> 31          b^di positional_prob     ^ 0.0392
#> 32          b^di positional_prob     d 0.0380
#> 33          b^di positional_prob     i 0.0432
#> 34          b^di    biphone_prob    b^ 0.0034
#> 35          b^di    biphone_prob    ^d 0.0010
#> 36          b^di    biphone_prob    di 0.0037
#> 37           b^s positional_prob     b 0.0512
#> 38           b^s positional_prob     ^ 0.0392
#> 39           b^s positional_prob     s 0.0788
#> 40           b^s    biphone_prob    b^ 0.0034
#> 41           b^s    biphone_prob    ^s 0.0039
#> 42            Cu positional_prob     C 0.0089
#> 43            Cu positional_prob     u 0.0221
#> 44            Cu    biphone_prob    Cu 0.0002

Some old random walk code

I had some old code that asked what a 12-step random walk looked like if it started at the ceiling and could only go left or stay put on the first move.

par(mar = c(4, 4, 1, 1))
set.seed(20220209)

take_one_step <- function(x, y, ceiling = 0) {
  this_step <- if (x == ceiling) {
    sample(c(-1, 0), 1)
  } else {
    sample(c(-1, 0, 1), 1)
  }
  x + this_step
}

no_ceiling <- replicate(
  30000,
  Reduce(function(x, y) take_one_step(x, y, ceiling = 1000), 0:12, init = 0)
)

yes_ceiling <- replicate(
  30000,
  Reduce(function(x, y) take_one_step(x, y, ceiling = 0), 0:12, init = 0)
)

library(patchwork)
p1 <- wrap_elements(
  full = ~ plot(table(no_ceiling), lwd = 4)
) 
p2 <- wrap_elements(
  full = ~ plot(table(yes_ceiling), lwd = 4)
)

p1 + p2

That ain’t a truncated Gaussian. 👀

January 2022

Source: 2022-01-21.Rmd

I have decided to get rid of “how to” from heading names. I’m thinking about creating a separate links/bookmarks section.

New interesting R packages.

Easy way to sample from a multinomial.

sample() takes a prob argument, so you can skip rmultinom entirely!

sample(n_comp, n, prob = mixture_probs, replace = TRUE)

— David Robinson (@drob) October 27, 2021
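
To make the quoted one-liner concrete, here is a runnable version with made-up values for the mixture:

n_comp <- 3
n <- 10
mixture_probs <- c(0.2, 0.3, 0.5)

# draw 10 component labels from a 3-component mixture
sample(n_comp, n, prob = mixture_probs, replace = TRUE)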

Conditionally provide tibbles. The newsletter by rOpenSci highlights this trick by the palmerpenguins package.

delayedAssign("penguins", {
  if (requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(readRDS("data/penguins.rds"))
  } else {
    readRDS("data/penguins.rds")
  }
})

delayedAssign("penguins_raw", {
  if (requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(readRDS("data/penguins_raw.rds"))
  } else {
    readRDS("data/penguins_raw.rds")
  }
})

This rOpenSci newsletter also detailed how to get new words past the package description spellchecker.

Manually delete a facet cell

Suppose you have some ragged data.

library(tidyverse)
data <- tibble(
  id = letters[1:14],
  group = rep(c("g1", "g2", "g3"), times = c(5, 4, 5)),
  data = lapply(1:14, function(x) {
    # just making some data
    n_points <- sample(c(2:5), size = 1) 
    ages <- sort(sample(c(10:24), size = n_points))
    data.frame(age = ages, score = rbeta(n_points, 1, 4))
  })
) %>% 
  tidyr::unnest(data)

ggplot(data, aes(x = age, y = score)) + 
  geom_point() + 
  geom_line(aes(color = group)) + 
  facet_wrap("id", ncol = 5)

We would like to have one row per group. So we actually would want four panels on the middle row. If we add an extra panel to fill that space and then remove it, we can achieve this goal.

First add the dummy panel.

data <- data %>% 
  tibble::add_row(id = "i2", group = "g2", age = median(data$age), score = 0) %>% 
  # two points so that geom_line() doesn't warn
  tibble::add_row(id = "i2", group = "g2", age = median(data$age), score = 0)

ggplot(data, aes(x = age, y = score)) + 
  geom_point(na.rm = TRUE) + 
  geom_line(aes(color = group), na.rm = TRUE) + 
  facet_wrap("id", ncol = 5) 

My overly complicated approach was to manually delete the grob using the following. Finding the pattern name is very tedious because you basically have to guess the strip and panel numbers.

p <- last_plot()
g <- ggplotGrob(p)
# guess and check
pattern <- "panel-4-3|strip-t-5-2"
to_drop <- str_which(g$layout$name, pattern)
to_keep_names <- str_subset(g$layout$name, pattern, negate = TRUE)
# delete grob data, clean up layout table
g$grobs[to_drop] <- NULL
g$layout <- g$layout[g$layout$name %in% to_keep_names, ]
grid::grid.newpage()
grid::grid.draw(g)

Although I am going to give a better solution, I did write this entry precisely to have this code here for my reference.

A better approach is to blank the cell with a ribbon. We have to eliminate strip titles from the theme for this approach to work.

ggplot(data, aes(x = age, y = score)) + 
  geom_point(na.rm = TRUE) + 
  geom_line(aes(color = group), na.rm = TRUE) + 
  geom_ribbon(
    aes(ymin = -Inf, ymax = Inf),
    data = tibble(
      id = "i2", 
      age = c(-Inf, Inf), 
      group = "g2", 
      score = NA_real_
    ),
    fill = "white",
    color = "white"
  ) +
  facet_wrap("id", ncol = 5) +
  theme(strip.text.x = element_blank())

Brenton on Twitter suggested numbering IDs within groups and using facet_grid(). That works, and it also can give labeled rows.

data <- data %>% 
  group_by(group) %>% 
  mutate(col_num = match(id, unique(id))) %>% 
  ungroup()

ggplot(data, aes(x = age, y = score)) + 
  geom_point(na.rm = TRUE) + 
  geom_line(aes(color = group), na.rm = TRUE) + 
  geom_ribbon(
    aes(ymin = -Inf, ymax = Inf),
    data = tibble(
      col_num = 5,
      age = c(-Inf, Inf), 
      group = "g2", 
      score = NA_real_
    ),
    fill = "white",
    color = "white"
  ) +
  facet_grid(group ~ col_num) +
  theme(strip.text.x = element_blank())

Compute spline predictions manually from GAMLSS

My goal is to write down some reusable code for how to get spline predictions for unobserved x values by hand from a GAMLSS model.

Problem setup

Here are 98 random points from a subrange of the intelligibility dataset.

library(tidyverse)
suppressMessages(library(gamlss))
select <- dplyr::select

data_intel <- tibble::tribble(
   ~id, ~age_months, ~mean_intelligibility,
    1L,          44,                  0.74,
    2L,          49,                 0.779,
    3L,          56,                 0.686,
    4L,          51,                 0.675,
    5L,          42,                 0.785,
    6L,          42,                 0.708,
    7L,          38,                 0.557,
    8L,          44,                 0.756,
    9L,          56,                  0.64,
   10L,          50,                 0.767,
   11L,          48,                 0.858,
   12L,          53,                  0.94,
   13L,          49,                 0.673,
   14L,          56,                 0.928,
   15L,          54,                 0.827,
   16L,          51,                 0.868,
   17L,          53,                 0.784,
   18L,          50,                 0.908,
   19L,          37,                 0.458,
   20L,          37,                 0.184,
   21L,          51,                 0.688,
   22L,          53,                 0.753,
   23L,          43,                 0.802,
   24L,          47,                 0.841,
   25L,          45,                 0.523,
   26L,          37,                 0.288,
   27L,          42,                 0.672,
   28L,          50,                 0.777,
   29L,          54,                 0.749,
   30L,          53,                 0.901,
   31L,          55,                 0.806,
   32L,          45,                 0.646,
   33L,          45,                 0.689,
   34L,          45,                 0.806,
   35L,          46,                 0.859,
   36L,          36,                 0.557,
   37L,          38,                 0.606,
   38L,          50,                 0.804,
   39L,          44,                 0.853,
   40L,          51,                 0.747,
   41L,          37,                 0.659,
   42L,          48,                 0.883,
   43L,          45,                 0.798,
   44L,          39,                 0.483,
   45L,          59,                 0.976,
   46L,          58,                 0.786,
   48L,          48,                 0.775,
   49L,          47,                 0.896,
   50L,          38,                 0.921,
   51L,          41,                 0.686,
   53L,          54,                  0.72,
   54L,          51,                 0.965,
   55L,          44,                 0.777,
   56L,          42,                  0.53,
   57L,          36,                 0.707,
   58L,          60,                 0.908,
   59L,          37,                 0.566,
   60L,          51,                 0.598,
   61L,          58,                 0.938,
   62L,          54,                 0.925,
   63L,          47,                 0.757,
   64L,          42,                  0.63,
   65L,          45,                 0.764,
   66L,          59,                  0.85,
   67L,          47,                 0.786,
   68L,          38,                 0.781,
   69L,          44,                 0.929,
   70L,          36,                 0.356,
   71L,          55,                 0.802,
   72L,          54,                 0.864,
   73L,          43,                 0.864,
   74L,          49,                 0.848,
   75L,          38,                 0.574,
   76L,          53,                 0.916,
   77L,          41,                  0.85,
   78L,          59,                 0.851,
   79L,          46,                 0.444,
   80L,          55,                 0.778,
   81L,          54,                 0.931,
   82L,          50,                  0.32,
   83L,          58,                 0.855,
   84L,          39,                  0.64,
   85L,          50,                 0.727,
   86L,          50,                 0.857,
   87L,          39,                 0.731,
   88L,          46,                 0.624,
   89L,          53,                 0.951,
   90L,          46,                 0.873,
   91L,          47,                 0.897,
   92L,          46,                 0.748,
   93L,          59,                 0.936,
   94L,          50,                 0.881,
   95L,          43,                 0.815,
   96L,          44,                 0.768,
   97L,          43,                 0.789,
   98L,          50,                 0.728,
   99L,          38,                 0.794,
  100L,          44,                 0.509
)
data_intel <- data_intel %>% 
  arrange(age_months)

But note that we are missing data from three ages (x values).

ggplot(data_intel) + 
  aes(x = age_months, y = mean_intelligibility) + 
  geom_point() +
  geom_vline(xintercept = c(40, 52, 57), linetype = "dashed") +
  scale_x_continuous(breaks = c(36, 48, 60))
Intelligibility data but there are no observations at 40, 52, and 57 months

Here is how we would fit a beta-regression model as we did in the growth curve paper.

fit_beta_gamlss <- function(data, mu_df = 3, sigma_df = 2) {
  BE <- gamlss.dist::BE
  gamlss.control <- gamlss::gamlss.control
  ns <- splines::ns

  model <- wisclabmisc::mem_gamlss(
    mean_intelligibility ~ ns(age_months, df = mu_df),
    sigma.formula = ~ ns(age_months, df = sigma_df),
    family = BE(),
    data = data,
    control = gamlss.control(trace = FALSE)
  )

  model$.user$mu_df <- mu_df
  model$.user$sigma_df <- sigma_df

  model$.user$mu_basis <- ns(data$age_months, mu_df)
  model$.user$sigma_basis <- ns(data$age_months, sigma_df)

  model
}

model <- fit_beta_gamlss(data_intel, mu_df = 3, sigma_df = 2)

Then we can get centiles pretty easily for the existing x values. But note the three gaps on the percentile line.

centiles <- data_intel %>% 
  select(age_months) %>% 
  wisclabmisc::predict_centiles(model)

ggplot(data_intel) + 
  aes(x = age_months, y = mean_intelligibility) + 
  geom_point() +
  geom_point(
    aes(y = c10), 
    data = centiles, 
    color = "skyblue3",
    size = 10,
    shape = "-"
  ) + 
  scale_x_continuous(breaks = c(36, 48, 60))
Intelligibility data with points from the 10th percentile line

If we ask GAMLSS to give us centiles for those gaps—herein lies the problem—we might get a warning message like the following:

centiles2 <- data_intel %>% 
  select(age_months) %>% 
  tidyr::expand(age_months = tidyr::full_seq(age_months, 1)) %>% 
  wisclabmisc::predict_centiles(model)
#> Warning in predict.gamlss(obj, what = "mu", newdata = newx, type = "response", : There is a discrepancy  between the original and the re-fit 
#>  used to achieve 'safe' predictions 
#> 

I don’t want to mess around in unsafe territory with GAMLSS. (What if they change how they handle this problem in the future?) So I would like to estimate values for those missing x values from the spline by hand.

Problem solution

Use predict() on the spline object to get the values of the basis functions at the new x values. I stashed the original spline object inside of the model.

mu_basis <- model$.user$mu_basis
ages_to_predict <- data_intel %>% 
  tidyr::expand(age_months = tidyr::full_seq(age_months, 1))

mu_basis2 <- predict(mu_basis, newx = ages_to_predict$age_months)

library(patchwork)
tjmisc::ggmatplot(mu_basis) +
  tjmisc::ggmatplot(mu_basis2)
Basis functions for observed data (left) and the age range (right). The full age range is a little bit smoother than the observed data.

We do the following to get model parameters at each age. This code is the point of this note.

mu_basis <- model$.user$mu_basis
sigma_basis <- model$.user$sigma_basis
newx <- ages_to_predict$age_months

mu_basis2 <- cbind(1, predict(mu_basis, newx = newx))
sigma_basis2 <- cbind(1, predict(sigma_basis, newx = newx))

manual_centile <- tibble::tibble(
  age_months = newx,
  # multiply coefficients and undo link function
  mu = plogis(mu_basis2 %*% coef(model, what = "mu"))[, 1],
  sigma = plogis(sigma_basis2 %*% coef(model, what = "sigma"))[, 1],
  # now take centiles
  c10 = gamlss.dist::qBE(.1, mu, sigma)
)
manual_centile
#> # A tibble: 25 × 4
#>    age_months    mu sigma   c10
#>         <dbl> <dbl> <dbl> <dbl>
#>  1         36 0.522 0.341 0.294
#>  2         37 0.560 0.334 0.336
#>  3         38 0.596 0.328 0.378
#>  4         39 0.630 0.321 0.418
#>  5         40 0.661 0.315 0.456
#>  6         41 0.688 0.309 0.491
#>  7         42 0.712 0.304 0.522
#>  8         43 0.731 0.298 0.548
#>  9         44 0.745 0.293 0.569
#> 10         45 0.755 0.288 0.584
#> # ℹ 15 more rows

Let’s look at our manually computed 10th percentile versus the package-provided one.

ggplot(data_intel) + 
  aes(x = age_months, y = mean_intelligibility) + 
  geom_point() +
  geom_point(
    aes(y = c10), 
    data = manual_centile, 
    color = "orange",
    size = 10,
    shape = "+"
  ) + 
  geom_point(
    aes(y = c10), 
    data = centiles, 
    color = "skyblue3",
    size = 10,
    shape = "-"
  ) + 
  scale_x_continuous(breaks = c(36, 48, 60))
Intelligibility data with points from the 10th percentile line provided by centiles.pred() (blue hyphens) and computed by hand (orange pluses).

Count number of function calls in some R code

Do data analysis on getParseData() result.

code <- quote({
  check_out <- my(fancy(code))
  here_is <- some(fancy(code))
})

parse_data <- getParseData(parse(text = deparse(code), keep.source = TRUE))
head(parse_data)
#>    line1 col1 line2 col2 id parent       token terminal      text
#> 55     1    1     4    1 55      0        expr    FALSE          
#> 1      1    1     1    1  1     55         '{'     TRUE         {
#> 25     2    5     2   32 25     55        expr    FALSE          
#> 3      2    5     2   13  3      5      SYMBOL     TRUE check_out
#> 5      2    5     2   13  5     25        expr    FALSE          
#> 4      2   15     2   16  4     25 LEFT_ASSIGN     TRUE        <-
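
To actually count the function calls, filter the parse data down to the function-call tokens (a small addition of mine):

calls <- parse_data[parse_data$token == "SYMBOL_FUNCTION_CALL", "text"]
table(calls)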

Count tokens using results of all.names() and all.vars().

names <- all.names(code)
vars <- all.vars(code)

# Just the non-variable tokens
names[!names %in% vars]
#> [1] "{"     "<-"    "my"    "fancy" "<-"    "some"  "fancy"

December 2021

Source: 2021-12-07.Rmd

Quick hits

rlang provides done() to break out of a loop.

This paper uses by-word entropy as a measure of intelligibility. https://doi.org/10.1017/S0305000921000714

Leslie matrix. Advent of Code day 6 (simulating population growth) can be done with matrix multiplication on a Leslie matrix which is a transition matrix on age counts:

The Leslie matrix is a square matrix with the same number of rows and columns as the population vector has elements. The (i,j)th cell in the matrix indicates how many individuals will be in the age class i at the next time step for each individual in stage j.
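
Here is a sketch of that idea for the 2021 day-6 puzzle (lanternfish), where the state is a vector of counts for timer values 0 through 8. This is my own sketch, not code from the puzzle.

# transition matrix: timers 1-8 shift down by one; timer-0 fish reset to
# timer 6 and also spawn a new fish with timer 8
step_matrix <- matrix(0, nrow = 9, ncol = 9)
step_matrix[cbind(1:8, 2:9)] <- 1
step_matrix[7, 1] <- 1
step_matrix[9, 1] <- 1

# counts of fish by timer value for the example input 3,4,3,1,2
counts <- tabulate(c(3, 4, 3, 1, 2) + 1, nbins = 9)

for (day in 1:80) counts <- step_matrix %*% counts
sum(counts)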

Monads in one sentence.

A monad is the minimum amount of structure needed to overload function composition in a way that “performs an extra computation” on the intermediate value. – https://www.youtube.com/watch?v=Nq-q2USYetQ&feature=youtu.be

November 2021

Source: 2021-11-22.Rmd

Quick hits

Base graphics in knitr. I learned while porting over some old notes that par() options are not preserved between knitr chunks unless opts_knit$get("global.par") is TRUE. It’s FALSE by default. No funny business using old <- par(...)

Sampling a swimming pool depth. Jorbs on his Slay the Spire stream commented on problems with big data. He used a nice analogy for sampling bias. Imagine you have a swimming pool and we can’t see the bottom of it. We want to measure the average depth of it. No amount of big data will accurately estimate the depth if all the samples come from the shallow end of the pool. But if you smartly sample the pool with a grid of points, then you can come up with an accurate estimate from relatively few points. This example is nice because it can illustrate accuracy versus precision.

The analogy also funnily resembles an analogy used for Hamiltonian Monte Carlo: a puck skating around a smooth curved surface. If the pool is the posterior distribution and the depth is the negative log posterior density, then we want to explore that pool too.


Paul recommended using Frank Harrell’s rms::orm() function for ordinal regression.

eval() is stack inconsistent (whatever that means)

base::eval() and rlang::eval_bare() do things differently.

Here are the examples from ?eval_bare()

# eval_bare() works just like base::eval() but you have to create
# the evaluation environment yourself:
eval_bare(quote(foo), env(foo = "bar"))
#> [1] "bar"

# eval() has different evaluation semantics than eval_bare(). It
# can return from the supplied environment even if it's an
# environment that is not on the call stack (i.e. because you've
# created it yourself). The following would trigger an error with
# eval_bare():
ret <- quote(return("foo"))
eval(ret, env())
#> [1] "foo"
# eval_bare(ret, env())  # "no function to return from" error

# Another feature of eval() is that you can control surrounding loops:
bail <- quote(break)
while (TRUE) {
  eval(bail)
  # eval_bare(bail)  # "no loop for break/next" error
}

# To explore the consequences of stack inconsistent semantics, let's
# create a function that evaluates `parent.frame()` deep in the call
# stack, in an environment corresponding to a frame in the middle of
# the stack. For consistency with R's lazy evaluation semantics, we'd
# expect to get the caller of that frame as result:
fn <- function(eval_fn) {
  list(
    returned_env = middle(eval_fn),
    actual_env = current_env()
  )
}
middle <- function(eval_fn) {
  deep(eval_fn, current_env())
}
deep <- function(eval_fn, eval_env) {
  expr <- quote(parent.frame())
  eval_fn(expr, eval_env)
}

# With eval_bare(), we do get the expected environment:
fn(rlang::eval_bare)
#> $returned_env
#> <environment: 0x55cac3595df8>
#> 
#> $actual_env
#> <environment: 0x55cac3595df8>
#> 

# But that's not the case with base::eval():
fn(base::eval)
#> $returned_env
#> <environment: 0x55cac1cc7980>
#> 
#> $actual_env
#> <environment: 0x55cac1cc2170>
#> 

The Gaussian KDE is a sum of baby Gaussians?!

Today I learned from here that the Gaussian KDE is the sum of a bunch of little Gaussian curves. Here look:

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

d <- penguins %>% 
  filter(!is.na(bill_length_mm)) %>% 
  sample_n(size = 20)

# grid of xs
x <- seq(30, 63, length.out = 200)

# compute density of xs using each observed value
# as the mean.
l <- lapply(
  d$bill_length_mm,
  function(m) {
    data.frame(
      x = x,
      y = dnorm(x, mean = m, sd = bw.nrd0(d$bill_length_mm)) 
    )
  }
) 
dl <- bind_rows(l, .id = "obs")

# plot them, their sum, and the default density curve
ggplot(d) + 
  aes(x = bill_length_mm) + 
  geom_density(aes(color = "geom_density()"), size = 2, show.legend = FALSE) +
  geom_rug() +
  geom_line(
    aes(x = x, y = y / nrow(d), group = obs),
    data = dl, alpha = .1
  ) + 
  stat_summary(
    aes(x = x, y = y / nrow(d), color = "sum of pointwise densities"),
    data = dl, 
    # alpha = .1, 
    geom = "line",
    fun = "sum", 
    # color = "orange"
  ) +
  scale_color_manual(values = c("black", "orange")) +
  labs(color = NULL)
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

The rOpenSci newsletter linked to a mailing list thread about how to stash data inside of an R package.

Duncan Murdoch suggests using a function factory to create closures that can access data in the parent environment.

Your local() call can create several functions and return them in a list; then just those functions have access to the local variables. For example,

createFns <- function() {

   .fooInfo <- NULL

   fn1 <- function (...) { ... }
   fn2 <- function (...) { ... }

   list(fn1 = fn1, fn2 = fn2)
}

fns <- createFns()
fn1 <- fns$fn1
fn2 <- fns$fn2

Now fn1 and fn2 are functions that can see .fooInfo, and nobody else can (without going through contortions).

(I had to fix the code to create a real function factory.)

Then Martin Mächler replies to say yup, this is what we do in base R.

Note that the above approach has been how nls() has been implemented for R … a very long time ago {before R 1.0.0}

e.g. from example(nls):

DNase1 <- subset(DNase, Run == 1)
fm1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
str(fm1$m)
#> List of 16
#>  $ resid     :function ()  
#>  $ fitted    :function ()  
#>  $ formula   :function ()  
#>  $ deviance  :function ()  
#>  $ lhs       :function ()  
#>  $ gradient  :function ()  
#>  $ conv      :function ()  
#>  $ incr      :function ()  
#>  $ setVarying:function (vary = rep_len(TRUE, np))  
#>  $ setPars   :function (newPars)  
#>  $ getPars   :function ()  
#>  $ getAllPars:function ()  
#>  $ getEnv    :function ()  
#>  $ trace     :function ()  
#>  $ Rmat      :function ()  
#>  $ predict   :function (newdata = list(), qr = FALSE)  
#>  - attr(*, "class")= chr "nlsModel"

# so 16 functions, all sharing the *same* environment very
# efficiently and nicely

# this is *the* environment for the fitted model:
fmE <- environment(fm1$m[[1]])
ls.str(fmE)
#> convCrit : function ()  
#> dev :  num 0.00479
#> env : <environment: 0x0000028e34566e48> 
#> form : Class 'formula'  language density ~ SSlogis(log(conc), Asym, xmid, scal)
#> getPars : function ()  
#> getPars.noVarying : function ()  
#> getPars.varying : function ()  
#> getRHS : function ()  
#> getRHS.noVarying : function ()  
#> getRHS.varying : function ()  
#> gradCall :  language attr(ans, "gradient")[c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,  TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE| __truncated__ ...
#> ind : List of 3
#>  $ Asym: int 1
#>  $ xmid: int 2
#>  $ scal: int 3
#> internalPars :  Named num [1:3] 2.35 1.48 1.04
#> lhs :  num [1:16] 0.017 0.018 0.121 0.124 0.206 0.215 0.377 0.374 0.614 0.609 ...
#> nDcentral :  logi FALSE
#> npar :  int 3
#> QR : List of 4
#>  $ qr   : num [1:16, 1:3] -1.5221 0.0086 0.0314 0.0314 0.0584 ...
#>  $ rank : int 3
#>  $ qraux: num [1:3] 1.01 1.03 1.28
#>  $ pivot: int [1:3] 1 2 3
#> resid :  num [1:16] -0.01368 -0.01268 0.00895 0.01195 -0.00258 ...
#> rhs :  num [1:16] 0.0307 0.0307 0.1121 0.1121 0.2086 ...
#> scaleOffset :  num 0
#> setPars : function (newPars)  
#> setPars.noVarying : function (newPars)  
#> setPars.varying : function (newPars)  
#> upper :  NULL
#> useParams :  logi [1:3] TRUE TRUE TRUE
#> wts :  num [1:16] 1 1 1 1 1 1 1 1 1 1 ...

so the environment “contains” the functions themselves (but quite a few more things) and for an environment that means it only has pointers to the same function objects which are also in fm1$m.

So, there has been a nice convincing and important example on how to do this - inside R for more than two decennia.

Final emphasis was added by me.

OneNote archive

Source: 2021-11-01-archival-notes.Rmd

Here is a bunch of pre-2021 items from OneNote and elsewhere.

Added variable plots (2019-05-14)

AV plots by hand and by car::avPlot()

par(mar = c(4, 4, 1, 1))

m <- lm(mpg ~ wt + hp + am, mtcars)
m_y1 <- lm(mpg ~ hp + am, mtcars)
m_x1 <- lm(wt ~ hp + am, mtcars)
m_partial <- lm(resid(m_y1) ~ resid(m_x1))

library(patchwork)
p1 <- wrap_elements(
  full = ~ plot(resid(m_x1), resid(m_y1))
) 
p2 <- wrap_elements(
  full = ~ car::avPlot(m, "wt")
)

p1 + p2
all.equal(residuals(m_partial), residuals(m))
#> [1] TRUE

coef(m)["wt"]
#>        wt 
#> -2.878575
coef(m_partial)[2]
#> resid(m_x1) 
#>   -2.878575

What about interactions?

R 3.6.0 released! (2019-04-26)

Highlights for me

Standardized gamma distributions (2019-04-18)

A gamma-distributed variable times a positive constant is still gamma-distributed. Paul and I used this fact to standardize results from different models onto a single “z” scale. Here’s a made-up example.
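But first, a quick sanity check of that scaling fact (my own sketch, not part of the original example): multiplying a Gamma(shape, scale) draw by a constant c just multiplies the scale parameter by c, which shows up directly in the quantile function.

p <- c(0.25, 0.5, 0.75)
all.equal(
  qgamma(p, shape = 20, scale = 2 * 10),
  10 * qgamma(p, shape = 20, scale = 2)
)
#> [1] TRUE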

Simulate some gamma-distributed data.

library(patchwork)
set.seed(20211122)
par(mar = c(4, 4, 1, 1))

y <- rgamma(n = 100, shape = 20)
y2 <- rgamma(n = 100, shape = 200)
wrap_elements(full = ~ hist(y)) + 
  wrap_elements(full = ~ hist(y2))

Fit a GLM

get_shape <- function(model) { 
  1 / summary(model)[["dispersion"]]
}

m <- glm(y ~ 1, Gamma(link = "identity"))
m2 <- glm(y2 ~ 1, Gamma(link = "identity"))

get_shape(m)
#> [1] 21.09197
get_shape(m2)
#> [1] 178.6259

The two z values (each response divided by its fitted mean) end up on a similar scale: both should behave like gamma draws with mean 1, i.e., shape k and scale 1/k, which is what the second pair of QQ plots below checks.

par(mar = c(4, 4, 1, 1))
r <- residuals(m)
r2 <- residuals(m2)

z <- y / fitted(m)
z2 <- y2 / fitted(m2)

wrap_elements(
  full = ~ car::qqPlot(y, distribution = "gamma", shape = get_shape(m))
) + 
  wrap_elements(
    full = ~ car::qqPlot(y2, distribution = "gamma", shape = get_shape(m2))
  )

wrap_elements(
  full = ~ car::qqPlot(
    z, 
    distribution = "gamma", 
    shape = get_shape(m), 
    scale = 1 / get_shape(m))
) + 
  wrap_elements(
    full = ~ car::qqPlot(
      z2, 
      distribution = "gamma", 
      shape = get_shape(m2), 
      scale = 1 / get_shape(m2))
  )

Jaccard similarity (2019-03-14)

Jaccard similarity is the size of the intersection of A and B divided by the size of the union of A and B.

Suppose a listener transcribes a few sentences twice. These are probes for intra-listener transcription reliability. Suppose that one of these sentences is 6 words long. If the listener typed 8 unique words across the two transcriptions, and 4 of them appear in both, then the Jaccard similarity is 4 / 8 = 0.5. We then take the weighted mean of the Jaccard scores, using sentence length as the weight, so this sentence would have a weight of 6.
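Here's a toy sketch of that calculation (made-up words and weights, not the real reliability data):

jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# two transcriptions of the same 6-word sentence:
# 8 unique words across both, 4 shared
t1 <- c("the", "big", "dog", "ran", "home", "fast")
t2 <- c("the", "big", "dog", "ran", "away", "quickly")
jaccard(t1, t2)
#> [1] 0.5

# weighted mean over sentences, weighting by sentence length
scores <- c(0.5, 0.8)
lengths <- c(6, 10)
weighted.mean(scores, lengths)
#> [1] 0.6875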

Asked Twitter if they knew any IRR tutorials.

Phylogenetic regression (2019-02-26)

Listened to bits of a phylogenetic regression lecture.

multicomp (2019-02-25)

Rolling list of bookmarks (2019-10-01)

dtool: Manage scientific data https://dtool.readthedocs.io/en/latest/

A very first introduction to Hamiltonian Monte Carlo https://blogs.rstudio.com/tensorflow/posts/2019-10-03-intro-to-hmc/

Some things you maybe didn’t know about linear regression https://ryxcommar.com/2019/09/06/some-things-you-maybe-didnt-know-about-linear-regression/

dbx database tools for R https://github.com/ankane/dbx

Dash for R https://medium.com/@plotlygraphs/announcing-dash-for-r-82dce99bae13

How to interpret F-statistic https://stats.stackexchange.com/questions/12398/how-to-interpret-f-and-p-value-in-anova

The origin of statistically significant https://www.johndcook.com/blog/2008/11/17/origin-of-statistically-significant/

tidymv: Tidy Model Visualisation for Generalised Additive Models https://cran.r-project.org/web/packages/tidymv/index.html

Step-by-step examples of building publication-quality figures in ggplot2 https://github.com/clauswilke/practical_ggplot2

From data to viz https://www.data-to-viz.com/

Shapley model explanation https://github.com/slundberg/shap

JavaScript versus Data Science https://software-tools-in-javascript.github.io/js-vs-ds/en/

Penalized likelihood estimation https://modernstatisticalworkflow.blogspot.com/2017/11/what-is-likelihood-anyway.html

UTF-8 everywhere https://utf8everywhere.org/

Unicode programming https://begriffs.com/posts/2019-05-23-unicode-icu.html

R package to simulate colorblindness https://github.com/clauswilke/colorblindr

Data version control https://dvc.org/

Email tips https://twitter.com/LucyStats/status/1131285346455625734?s=20

colorcet library (python) https://colorcet.pyviz.org/

HCL wizard http://hclwizard.org/hclwizard/

Coloring for colorblindness. Has 8 palettes of color pairs https://davidmathlogic.com/colorblind/

5 things to consider when creating your CSS style guide by (malimirkeccita?) https://medium.com/p/5-things-to-consider-when-creating-your-css-style-guide-7b85fa70039d

Tesseract OCR engine for R https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html

Lua filters for rmarkdown documents https://github.com/crsh/rmdfiltr

An NIH Rmd template https://github.com/tgerke/nih-rmd-template

Commit message guide https://github.com/RomuloOliveira/commit-messages-guide

Linear regression diagnostic plots in ggplot2 https://github.com/yeukyul/lindia

A graphical introduction to dynamic programming https://avikdas.com/2019/04/15/a-graphical-introduction-to-dynamic-programming.html

Why software projects take longer than you think https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html

Automatic statistical reporting https://github.com/easystats/report

Multilevel models and CSD https://pubs.asha.org/doi/pdf/10.1044/2018_JSLHR-S-18-0075

Map of cognitive science http://www.riedlanna.com/cognitivesciencemap.html

An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data https://www.nature.com/articles/s41467-019-09785-8

Common statistical tests are linear models (or: how to teach stats) https://lindeloev.github.io/tests-as-linear/

Monte Carlo sampling does not “explore” the posterior https://statmodeling.stat.columbia.edu/2019/03/25/mcmc-does-not-explore-posterior/

How to develop the five skills that will make you a great analyst https://mode.com/blog/how-to-develop-the-five-soft-skills-that-will-make-you-a-great-analyst

Confidence intervals are a ring toss https://twitter.com/epiellie/status/1073385427317465089

Mathematics for Machine Learning https://mml-book.github.io/

20 Tips for Senior Thesis Writers http://hwpi.harvard.edu/files/complit/files/twenty_tips_for_senior_thesis_writers_revised_august_2012.pdf

Comparing common analysis strategies for repeated measures data http://eshinjolly.com/2019/02/18/rep_measures/

Cosine similarity, Pearson correlation, and OLS coefficients https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

qqplotr is a nice package for plotting qqplots https://cran.r-project.org/web/packages/qqplotr/index.html

Multidimensional item response theory https://github.com/philchalmers/mirt

Aki’s tutorials/materials on model selection https://github.com/avehtari/modelselection_tutorial

An Introvert’s Guide to Conferences https://laderast.github.io/2018/05/17/a-introvert-s-survival-guide-to-conferences/

Best practice guidance for linear mixed-effects models in psychological science https://psyarxiv.com/h3duq/

Viewing matrices and probabilities as graphs https://www.math3ma.com/blog/matrices-probability-graphs

Cross-validation for hierarchical models https://avehtari.github.io/modelselection/rats_kcv.html

All of Aki’s tutorials https://avehtari.github.io/modelselection/

User-friendly p values http://thenode.biologists.com/user-friendly-p-values/research/

Iodide is a Javascript notebook https://alpha.iodide.io/

Interesting question about what to do when a transformation changes the “test” of a highest-density interval. https://discourse.mc-stan.org/t/exponentiation-or-transformation-of-point-estimates/7848

Some ways to rethink statistical rules https://allendowney.blogspot.com/2015/12/many-rules-of-statistics-are-wrong.html

Toward a Principled Bayesian Workflow https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html

Stumbled across an article on mixed models and effect sizes: https://www.journalofcognition.org/articles/10.5334/joc.10/

I had these quotes in the old notes.

In so complex a thing as human nature, we must consider, it is hard to find rules without exception. — George Eliot

If any one faculty of our nature may be called more wonderful than the rest, I do think it is memory…The memory is sometimes so retentive, so serviceable, so obedient; at others, so bewildered and so weak. We are, to be sure, a miracle every way; but our powers of recollecting and of forgetting do seem peculiarly past finding out. — Jane Austen

Colophon

Notebook built using notestar, a notebook system that uses targets to create a bookdown document. Theme is water.css as provided by the cleanrmd package.

References