Unevaluated expressions
Updated on Tuesday, January 30, 2024 09:44 AM
- January 2024
  - One-off knitr engine for blockquoting things
  - Simulation for a probability paradox
  - Sentence similarity
  - How I got targets to handle a bunch of bootstrapping
  - targets + crew
  - MFA note: don’t touch the solver
  - Revelation while reading withr 3.0.0 release
  - A silly function
  - The pipe placeholder works better now
  - purrr::pluck() is better than I had realized
  - Mutual recursion demo
  - Speaking of tail calls, how to write an “iterative” recursive function
- May 2023
- December 2022
- September 2022
- June 2022
- April, May 2022
- February 2022
- January 2022
- December 2021
- November 2021
- OneNote archive
- Colophon
- References
Notes to self and other noodlings. Main site.
January 2024
Source: 2024-01-19.Rmd
One-off knitr engine for blockquoting things
Here’s how easy it is to create a knitr language engine:
# create a custom language block that blockquotes its contents
knitr::knit_engines$set("blockquote" = function(options) {
  if (isFALSE(options$echo)) return()
  output <- paste0("> ", options$code, collapse = "\n")
  if (!is.null(options$source_url)) {
    output <- sprintf("%s\n[[source]](%s)", output, options$source_url)
  }
  output
})
Writing a chunk like:
```{blockquote}
#| source_url: https://tjmahr.com
Look at this **markdown**!

Here is a nested blockquote:

> Cool!

Yes.
```
Will yield the following output:
Look at this markdown!
Here is a nested blockquote:
Cool!
Yes. [source]
I wanted to make this a hook function but couldn’t figure it out.
Simulation for a probability paradox
This paradox is a flavor of the Sleeping Beauty paradox or the Boy or Girl paradox.
You are given an urn containing 100 balls; n of them are red, and 100-n are green, where n is chosen uniformly at random in [0, 100]. You take a random ball out of the urn—it’s red—and discard it. The next ball you pick (out of the 99 remaining) is: [poll omitted] [source]
And here is some code I wrote to run a simulation to get the unexpected solution.
f <- function(n = 100) {
  n_red <- sample.int(n, 1)
  p1 <- n_red / n
  p2 <- (n_red - 1) / (n - 1)
  draw <- sample(c("r", "g"), 1, prob = c(p1, 1 - p1))
  if (draw == "g") {
    "skip"
  } else {
    sample(c("r", "g"), 1, prob = c(p2, 1 - p2))
  }
}
counts <- replicate(100000, f()) |>
table()
counts
#>
#> g r skip
#> 16808 33780 49412
# p(color | first red)
counts[1:2] |> proportions()
#>
#> g r
#> 0.3322527 0.6677473
Sentence similarity
I read an article I had bookmarked for a while. Suppose you have a word embedding (a vector) for each word, or suppose you have a sentence embedding.
The Cohere embedding assigns a vector of length 4096 (i.e., a list of 4096 numbers) to each sentence. Furthermore, the multilingual embedding does this for sentences in more than 100 languages. In this way, the sentence “Hello, how are you?” and its corresponding French translation, “Bonjour, comment ça va?” will be assigned very similar numbers, as they have the same semantic meaning. [article]
They have toy data where some movies are assigned vectors:
df_movies <- data.frame(
row.names = c("you've got mail", "rush hour", "rush hour 2", "taken"),
action = c(0, 6, 7, 7),
comedy = c(5, 5, 4, 0)
)
movies <- df_movies |> as.matrix()
movies
#> action comedy
#> you've got mail 0 5
#> rush hour 6 5
#> rush hour 2 7 4
#> taken 7 0
One option for similarity is the dot product of the vectors. (Why?)
# the two rush hour movies
sum(movies[2, ] * movies[3, ])
#> [1] 62
# all at once
dot_products <- movies %*% t(movies)
Check this out. as.dist() will clean up the redundant output, but it discards the diagonal. Suggestion from StackOverflow.
as.dist(dot_products)
#> you've got mail rush hour rush hour 2
#> rush hour 25
#> rush hour 2 20 62
#> taken 0 42 49
Another option is cosine similarity. They plot the points on a coordinate space and observe:
[…] we want a measure of similarity that is high for sentences that are close to each other, and low for sentences that are far away from each other. [Euclidean] Distance does the exact opposite. So in order to tweak this metric, let’s look at the angle between the rays from the origin (the point with coordinates [0,0]), and each sentence. Notice that this angle is small if the points are close to each other, and large if the points are far away from each other. Now we need the help of another function, the cosine. The cosine of angles close to zero is close to 1 […] [article]
According to Wikipedia, we can derive the cosine similarity from the dot-product:
\[ \begin{aligned} \mathbf{A}\cdot\mathbf{B}&=\left\|\mathbf{A}\right\|\left\|\mathbf{B}\right\|\cos\theta \\ {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} &= \cos(\theta) \\ \|\mathbf{m}\| &:= \sqrt{\mathbf{m} \cdot \mathbf{m}} ~~~~~\text{(Euclidean norm)} \end{aligned} \]
We have a matrix with all of the dot products ready to go, so we need a matrix of all norms multiplied together.
dot_products
#> you've got mail rush hour rush hour 2 taken
#> you've got mail 25 25 20 0
#> rush hour 25 61 62 42
#> rush hour 2 20 62 65 49
#> taken 0 42 49 49
norms <- sqrt(rowSums(movies ^ 2))
norm_products <- norms %*% t(norms)
(dot_products / norm_products) |> as.dist() |> round(3)
#> you've got mail rush hour rush hour 2
#> rush hour 0.640
#> rush hour 2 0.496 0.985
#> taken 0.000 0.768 0.868
This value is cosine similarity. If we use 1 - similarity, it’s cosine distance. That would be more appropriate for anything piped into as.dist().
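For example, flipping those similarities into cosine distances is a one-liner with the same objects:

(1 - dot_products / norm_products) |> as.dist() |> round(3)

The most similar pair, the two rush hour movies, ends up with the smallest distance (1 - 0.985 = 0.015).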
Still, two points could be very different distances from the origin, but they could have an angle of 0.
How I got targets to handle a bunch of bootstrapping
Here is a summary:
- transient memory
- not letting errors on one branch stop the whole pipeline
- tarchetypes::tar_group_by()
- format = "fst_tbl" or format = "qs"
- tar_read(..., 1)
My bootstraps would instantly fill up my computer’s memory during a build. So now I dump stuff out of memory with "transient". (You can do this on a by-target level too.) I also keep running the build if one target/branch fails:
tar_option_set(
error = "null",
memory = "transient"
)
I also split a single target into smaller ones. I declare a global partition value like

N_PARTITIONS <- 80

Then I used tarchetypes::tar_group_by() to tell targets to split a giant dataframe into groups that can be dynamically branched over. Here is some actual code in the _targets.R file. The first target defines 2000 bootstrap replicates to be split into 80 partitions/branches, and the second fits the models on each partition/branch.
tar_group_by(
  straps,
  data_iwpm |>
    filter(!is.na(mean_wpm)) |>
    prepare_bootstrap_data() |>
    bootstrap_by_slpg(
      times = n_boot,
      col_child_id = subject_id,
      seed = 20220621
    ) |>
    mutate(partition = assign_partition_numbers(id, N_PARTITIONS)),
  partition
),
tar_target(
  models_intelligibility,
  fit_bootstrap_models(straps, "mean_intel", "intelligibility", 3, 2),
  pattern = map(straps),
  format = "qs"
),
# btw the targets here themselves are built serially bc fit_bootstrap_models()
# and friends use furrr to fit the models in each branch in parallel
But to be honest, as I’m looking at this code, I’m realizing that this “make a giant thing and split it into branches” approach could have been built instead as “make a small table of branch IDs and grow them within each branch”. An analogy would be reading in a dozen files at once and then splitting up the work so each file is processed separately, versus taking a dozen filenames and handling each filename separately.
I have to use tarchetypes::tar_group_by() instead of tarchetypes::tar_group_count() because I have multiple datasets that I am bootstrapping and they need to be kept together within a branch. Like, to split the following into two branches, I would want bootstrap_number 1 and 2 in one branch and 3 in a separate branch, so that I can compare a, b and c within each bootstrap.
| data_set | bootstrap_number |
|----------|------------------|
| a        | 1                |
| b        | 1                |
| c        | 1                |
| a        | 2                |
| b        | 2                |
| c        | 2                |
| a        | 3                |
| b        | 3                |
| c        | 3                |
For serializing the data, I used format = "qs" on models or lists of models and format = "fst_tbl" on plain, inoffensive data tables.

I also started using targets::tar_read(blah, n), where n is some selection of branches like 1 or c(1, 2, 4), to read a subset of the overall data. This speeds up interactive development by giving me a smaller subset of data to write code around without having to load gigantic objects into R.
targets + crew
In targets, by default, targets are dispatched to be built one at a
time, but we can use crew so that multiple workers can build targets in
parallel. Here is the code to add to _targets.R
to get 2
workers.
targets::tar_option_set(
controller = crew::crew_controller_local(workers = 2)
)
And then you can view how much work each worker did. (I’m hard-coding the returns to show what happened with a full site build.)
targets::tar_crew()
#> controller worker seconds targets
#> <chr> <int> <dbl> <int>
#> 1 8f28e966dd6adaf60895eb01 1 49.4 32
#> 2 8f28e966dd6adaf60895eb01 2 25.1 9
MFA note: don’t touch the solver
My colleague and I independently ran into an issue where the Montreal Forced Aligner installation with conda seemed to take forever during the solving-environment stage. We both tried to fix this problem by changing the default solver. That change did make the installation go faster, but then the MFA could not run at all. The correct thing to do is nothing: just use the default solver and wait it out.
Revelation while reading withr 3.0.0 release
I was reading the withr 3.0.0 release notes and noticed a pleasing symmetry between the with_ and local_ functions.

The local_ functions set a temporary value within some local scope, like a local() or a function().
local({
withr::local_language("fr")
plot[[1]]
})
#> Error in plot[[1]]: objet de type 'closure' non indiçable
plot[[1]]
#> Error in plot[[1]]: object of type 'closure' is not subsettable
The with_
functions set a temporary value around some
block of code.
withr::with_language("fr", {
plot[[1]]
})
#> Error in plot[[1]]: objet de type 'closure' non indiçable
plot[[1]]
#> Error in plot[[1]]: object of type 'closure' is not subsettable
So where are we working from?

- local_: within the code (scope)
- with_: around the code
Oh, and the notes for withr 3.0.0 indicate that I should keep an eye on how source() interacts with withr.
A silly function
I got sick of writing null-guard clauses like:
my_cool_function <- function(data, ...) {
  if (is.null(data)) {
    return(NULL)
  }
  # rest of the function
  nrow(data)
}
So why can’t I just make that a higher order function:
nullsafe_map <- function(x, f, ...) {
  if (is.null(x)) NULL else f(x, ...)
}
mtcars |>
nullsafe_map(my_cool_function) |>
nullsafe_map(log)
#> [1] 3.465736
NULL |>
nullsafe_map(my_cool_function) |>
nullsafe_map(log)
#> NULL
It feels like I reinvented something vaguely monad-y here, and I posted the question: “What have I reinvented here?” Tan suggested “purrr::map_if() with .p = Negate(is.null)?”, which is cool because I had forgotten about purrr::map_if().
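A quick sketch of that suggestion (my own toy example, not Tan’s): map_if() applies the function only where the predicate is TRUE and passes the other elements through untouched, so NULLs survive the pipeline.

library(purrr)
list(mtcars, NULL) |>
  map_if(.p = Negate(is.null), .f = nrow)
#> [[1]]
#> [1] 32
#>
#> [[2]]
#> NULL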
The pipe placeholder works better now
Andrew Heiss pointed out to me that the pipe placeholder has gotten a lot better since R 4.3.0.
quote(
model |> _$coef |> _[1]
)
#> model$coef[1]
Previously, I had been circumventing these functions with things like
getElement()
or making my own pipe-friendly wrapper
functions like:
vec_element <- function(xs, i) {
xs[[i]]
}
vec_index <- function(xs, i) {
xs[i]
}
vec_index_into <- function(i, xs) {
xs[i]
}
vec_replace_na <- function(xs, replacement) {
xs[is.na(xs)] <- replacement
xs
}
vec_remove_na <- function(xs) {
xs[!is.na(xs)]
}
vec_set_names <- function(xs, ns) {
names(xs) <- ns
xs
}
There were two other functions alongside these helpers that weren’t meant to solve any specific piping problem. I came to adore them when I was doing some Advent of Code puzzles for fun in base R.
vec_which_value <- function(xs, value, negate = FALSE) {
(xs %in% value) |>
xor(negate) |>
which()
}
vec_which_name <- function(xs, name, negate = FALSE) {
vec_which_value(names(xs), name, negate)
}
letters |>
vec_which_value(c("a", "e", "i", "o", "u"), negate = TRUE) |>
vec_index_into(letters)
#> [1] "b" "c" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "v" "w" "x"
#> [20] "y" "z"
purrr::pluck() is better than I had realized

From ?purrr::pluck():

[pluck()] also accepts arbitrary accessor functions, i.e. functions that take an object and return some internal piece. […] Compare: accessor(x[[1]])$foo to pluck(x, 1, accessor, "foo").
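A minimal sketch of what that buys you (my own toy example, not from the help page). The accessor here is coef(), and the final "wt" is looked up inside whatever coef() returns:

library(purrr)
models <- list(lm(mpg ~ wt, data = mtcars))

# accessor style
coef(models[[1]])[["wt"]]
#> [1] -5.344472

# pluck style: positional index, then accessor function, then name
pluck(models, 1, coef, "wt")
#> [1] -5.344472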
Mutual recursion demo
I toyed with the experimental Tailcall feature in R-devel. I can’t actually run this code when rendering the notebook, but the basic idea is that sys.nframe() shows the height of the stack when the recursion bottoms out, and it is much smaller in the Tailcall examples.
is_even <- function(n) {
if (n == 0) list(TRUE, sys.nframe()) else Tailcall(is_odd, n - 1)
}
is_odd <- function(n) {
if (n == 0) list(FALSE, sys.nframe()) else Tailcall(is_even, n - 1)
}
is_odd(30)
naive_is_even <- function(n) {
if (n == 0) list(TRUE, sys.nframe()) else naive_is_odd(n - 1)
}
naive_is_odd <- function(n) {
if (n == 0) list(FALSE, sys.nframe()) else naive_is_even(n - 1)
}
naive_is_odd(30)
Speaking of tail calls, how to write an “iterative” recursive function
I revisited SICP during the Christmas break. I don’t know
why this stuck out to me this time—it’s a chapter 1 topic, I have
definitely seen it—but SICP differentiates between two kinds of
recursive functions. Here is the example in R with factorial. First,
let’s make sure that multiplication grows the stack by making it a
function call instead of whatever kind of primitive *
is.
`%times%` <- function(a, b) a * b
fact1 <- function(n) {
if (n == 0) {
message("nframe of fact1: ", sys.nframe())
1
} else {
n %times% fact1(n - 1)
}
}
fact2 <- function(n) {
fact_iter <- function(n, accumulator) {
if (n == 0) {
message("nframe of fact2: ", sys.nframe())
1 %times% accumulator
} else {
accumulator <- n %times% accumulator
n <- n - 1
fact_iter(n, accumulator)
}
}
fact_iter(n, 1)
}
message("nframe of this code block: ", sys.nframe())
#> nframe of this code block: 39
fact1(10)
#> nframe of fact1: 60
#> [1] 3628800
fact2(10)
#> nframe of fact2: 51
#> [1] 3628800
The difference is that n %times% fact1(n - 1) in fact1() grows into a giant chain of deferred multiplications but fact_iter(n - 1, n %times% accumulator) in fact2() does not.
The SICP authors describe the difference between the two like so:
Consider the first process. […] The expansion occurs as the process builds up a chain of deferred operations (in this case, a chain of multiplications). […] This type of process, characterized by a chain of deferred operations, is called a recursive process. Carrying out this process requires that the interpreter keep track of the operations to be performed later on. In the computation of \(n!\), the length of the chain of deferred multiplications, and hence the amount of information needed to keep track of it, grows linearly with \(n\) (is proportional to \(n\)), just like the number of steps. […]
[In the second process, at] each step, all we need to keep track of, for any \(n\), are the current values of the variables [n and accumulator in my code]. We call this an iterative process. In general, an iterative process is one whose state can be summarized by a fixed number of state variables, together with a fixed rule that describes how the state variables should be updated as the process moves from state to state and an (optional) end test that specifies conditions under which the process should terminate.
With something like Fibonacci, where two recursively defined things are added together, the number of deferred additions grows exponentially.
May 2023
Source: 2023-05-04.Rmd
Gompertz functions
The Gompertz function “describes growth as being slowest at the start and end of a given time period. The right-side or future value asymptote of the function is approached much more gradually by the curve than the left-side or lower valued asymptote.”
\[ f(t) = a\mathrm{e}^{-b\mathrm{e}^{-ct}} \\ a: \textrm{asymptote} \\ b: \textrm{translation left or right} \\ c: \textrm{growth factor} \\ \]
So we have two different growth rates but a single parameter relating to growth rate. R’s implementation (stats::SSgompertz()) “simplifies” the double exponential into Asym * exp(-b2 * b3 ^ x).
The scaling factor c or b3 makes sense in the original Gompertz equation. I couldn’t understand what the R version was doing because it was changing the base of exponentiation.
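For my own bookkeeping, here is how the two forms line up (my reasoning, not something from the R docs): setting \(b_3 = \mathrm{e}^{-c}\) makes the change of base disappear, because \(b_3^{x} = \mathrm{e}^{-cx}\).

\[ \mathrm{Asym}\cdot\mathrm{e}^{-b_2 b_3^{x}} = a\mathrm{e}^{-b\mathrm{e}^{-cx}} \quad\text{when}\quad \mathrm{Asym} = a,\ b_2 = b,\ b_3 = \mathrm{e}^{-c} \]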
library(tidyverse)
gompertz <- function(xs, asym, b2, b3) {
asym * exp(-b2 * exp(-b3 * xs))
}
# Make a quick ggplot2 layer of the gompertz function
stat_gompertz <- function(focus, asym, b2, b3, ...) {
# show what i typed, not the values
b2 <- rlang::enexpr(b2)
asym <- rlang::enexpr(asym)
b3 <- rlang::enexpr(b3)
labs <- list(
asym = rlang::expr_deparse(asym),
b2 = rlang::expr_deparse(b2),
b3 = rlang::expr_deparse(b3)
)
rlang::dots_list(asym, b2, b3, .named = TRUE)
color <- sprintf("%s = %s", focus, labs[[focus]])
stat_function(
aes(color = color),
fun = gompertz,
args = rlang::inject(list(asym = !! asym, b2 = !! b2, b3 = !! b3)),
...
)
}
ggplot() +
xlim(-5, 5) +
stat_gompertz("b3", 1, 1, 1) +
stat_gompertz("b3", 1, 1, 0) +
stat_gompertz("b3", 1, 1, .1) +
stat_gompertz("b3", 1, 1, -.5) +
ggtitle(
expression(asym %*% exp(-b[2] %*% exp(-b[3] %*% x)))
)
We have this weird feature where \(b = 1/2\) and \(b = 2\) are the same distance from \(b = 1\), and the distance between the \(b = 2, 3, 4\) gets smaller.
ggplot() +
xlim(-5, 5) +
stat_gompertz("b2", 1, 1/2, 1) +
stat_gompertz("b2", 1, 2, 1) +
stat_gompertz("b2", 1, 3, 1) +
stat_gompertz("b2", 1, 4, 1) +
stat_gompertz("b2", 1, 5, 1) +
stat_gompertz("b2", 1, 1, 1) +
ggtitle(
expression(asymptote %*% exp(-b[2] %*% exp(-b[3] %*% x)))
) +
labs(
y = "gompertz(x)",
color = expression(b[2]),
subtitle = expression(paste(asymptote == 1, ", ", b[3] == 1))
)
But the horizontal spacing of the lines is more regular when I pass
in exp()
expressions.
ggplot() +
xlim(-5, 5) +
stat_gompertz("b2", 1, exp(-3), 1) +
stat_gompertz("b2", 1, exp(-2), 1) +
stat_gompertz("b2", 1, exp(-1), 1) +
stat_gompertz("b2", 1, exp(0), 1) +
stat_gompertz("b2", 1, exp(1), 1) +
stat_gompertz("b2", 1, exp(2), 1) +
stat_gompertz("b2", 1, exp(3), 1) +
ggtitle(
expression(asymptote %*% exp(-b[2] %*% exp(-b[3] %*% x)))
) +
labs(
y = "gompertz(x)",
color = expression(b[2]),
subtitle = expression(paste(asymptote == 1, ", ", b[3] == 1))
)
Suppose I want to set priors for intelligibility for this function. We want the age of steepest growth to be around age 4–5.
library(tidyverse)
gompertz <- function(xs, asym, b2, b3) {
asym * exp(-b2 * exp(-b3 * xs))
}
asym <- 1
b2 <- rnorm(50, 5, 2)
b3 <- 1
gompertz_grid <- function(xs, asym, b2, b3, expand = TRUE) {
if (expand) {
d <- expand.grid(asym = asym, b2 = b2, b3 = b3) |>
tibble::rowid_to_column(".draw")
} else {
d <- data.frame(asym = asym, b2 = b2, b3 = b3) |>
tibble::rowid_to_column(".draw")
}
expand.grid(.draw = d$.draw, x = xs) |>
dplyr::left_join(d, by = ".draw", relationship = "many-to-many") |>
dplyr::mutate(
y = gompertz(x, asym, b2, b3)
)
}
xs <- 0:192
x <- (xs / 12)
d <- gompertz_grid(
x,
asym = plogis(rnorm(500, 0, 1.5)),
b2 = rlnorm(500, 8, 1),
b3 = rgamma(500, 2, 1),
expand = FALSE
)
d_draw <- d |>
distinct(.draw, asym, b2, b3) |>
mutate(
steepest_growth = log(b2) / b3,
`log(b2)` = log(b2)
) |>
select(-b2) |>
tidyr::pivot_longer(-.draw)
library(patchwork)
ggplot(d) +
aes(x = x, y = y) +
geom_line(aes(group = .draw), alpha = .1) +
xlim(0, 18)
ggplot(d_draw) +
aes(x = value) +
geom_histogram() +
facet_wrap("name", scales = "free_x")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
b2 and b3 jointly determine the age of steepest growth, so I can’t put a simple prior on it.
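For reference, the log(b2) / b3 value used above for the age of steepest growth comes from setting the second derivative of the Gompertz function to zero (a quick derivation of my own, not from the original note):

\[ f(t) = a\mathrm{e}^{-b_2\mathrm{e}^{-b_3 t}}, \qquad f''(t) = 0 \;\;\text{at}\;\; b_2\mathrm{e}^{-b_3 t} = 1, \;\;\text{i.e.}\;\; t = \frac{\ln b_2}{b_3} \]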
Postscript from 2024. I never did use this function to model intelligibility.
December 2022
Source: 2022-12-01.Rmd
duckdb
duckdb is awesome. I used it in a project to read all of the control files (tsv files) in a folder into a database with the following code. There is a lot to unpack here, but here are the main ideas:

- dplyr is used to generate the queries that are sent to the database.
- The functions that start with regexp_ are duckdb functions, not R functions, but that’s not a problem because the code that will be run is SQL code in the database.
- Normally, tbl() is used to specify a table in a database, but here, the database table is constructed from the tsv files using read_csv_auto() with wildcard characters.
- The code in run_control_file_query() produces a SQL query. The line with compute() is the one that actually has the database execute the query and store the results in a database table.
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(dbdir = "data-raw/data.duckdb"))
# Generates a SQL query from all the control files in the
# working directory
run_control_file_query <- function() {
tbl(con, "read_csv_auto('*Control*File-*.txt', FILENAME = TRUE)") |>
# this all gets converted to SQL statements that will be executed by the
# database, so there is a mix of R functions and duckdb functions here.
select(
control_file = filename,
repeat_no = Repeat,
sentence = Sentence,
response = Response,
s_word = SWord,
m_word = MWord,
m_word_a = MWordA,
phonetic_sentence = `Phonetic Sentence`,
phonetic_response = `Phonetic Response`,
audio_file = File,
) |>
mutate(
control_file_number = control_file |>
regexp_extract('\\d{7}\\d+'),
item = audio_file |>
toupper() |>
# fix files with a subscript letter (S7T10a)
regexp_replace("[ABCDE].WAV", ".WAV") |>
# fix files with an extra 0(WT010)
regexp_replace("0(\\d\\d).WAV", "\\1.WAV") |>
regexp_extract('(S|S\\d|W)(T)\\d\\d.WAV') |>
left(-4L),
tocs_type = ifelse(
substr(item, 1L, 1L) == "W",
"single-word",
"multiword"
)
)
}
# Create a table from the first batch of files
withr::with_dir("../files-to-import/2022-10-27-batch-1/", {
DBI::dbRemoveTable(con, "control_files")
run_control_file_query() |>
compute(name = "control_files", temporary = FALSE)
})
September 2022
Source: 2022-06-06.Rmd
The BIC
I was wondering about the so-called Bayesian information criterion (BIC) today. What’s the deal with it? Both the AIC and BIC have the form:
\[\text{penalty}\times k\ \text{parameters} - 2\times \text{model log-likelihood}\]
For the AIC, the penalty is 2, and for the BIC, the penalty is \(\ln(n\ \text{observations})\). If we note that \(\ln(20) = 2.996\) (the penalty for 20 observations), then the BIC penalty is always larger than the AIC penalty in practice.
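As a sanity check on that formula (a throwaway example of mine, not part of the original note), computing the criteria by hand matches stats::AIC() and stats::BIC():

# penalty * k parameters - 2 * log-likelihood, by hand vs. the built-ins
m <- lm(mpg ~ wt, data = mtcars)
k <- attr(logLik(m), "df")  # number of parameters (intercept, slope, sigma)
n <- nobs(m)
ll <- as.numeric(logLik(m))
c(aic_by_hand = 2 * k - 2 * ll, aic = AIC(m))
c(bic_by_hand = log(n) * k - 2 * ll, bic = BIC(m))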
See Aho, Derryberry, & Peterson (2014): The AIC prioritizes out-of-sample prediction (asymptotic efficiency). BIC is an approximation of a Bayesian hypothesis test (so, asymptotic consistency). They note that the BIC is ideal when “(1) only a few potential hypotheses are considered and (2) one of the hypotheses is (essentially) correct,” and the AIC worldview is for when there are “(1) numerous hypotheses and (2) the conviction that all of them are to differing degrees wrong.”
June 2022
Source: 2022-06-06.Rmd
Getting downlit to work on this notebook
notestar, which I use to create this notebook, produces an HTML file.
I can then post-process this HTML file using
downlit::downlit_html_path()
so that R function calls are
automatically linked to documentation pages. This feature is great, but
out of the box, it ignores the syntax highlighting of the output
document. The solution, it turns out, is buried in
?rmarkdown::html_document
(look at how nicely that is
autolinked). I make sure that the following css file is included in the
build process, so that links are underlined and the style defers to the
default syntax highlighting coloring.
code a:any-link {
  color: inherit; /* use colour from syntax highlighting */
  text-decoration: underline;
  text-decoration-color: #ccc;
}
Windows Terminal settings for Miniconda
Here is how to add the Miniconda prompt to Windows Terminal. Analogous steps probably will work for an Anaconda prompt.
First, we need to know where the Miniconda prompt lives.
Find the desired Anaconda/Miniconda prompt in the Start Menu. Right click > Open file location
Select the shortcut for the Anaconda/Miniconda prompt. Right click > Properties.
Look at the target field. For my Miniconda shortcut, I have:
%windir%\System32\cmd.exe "/K" C:\Users\trist\miniconda3\Scripts\activate.bat C:\Users\trist\miniconda3
We don’t need to hardcode in my user profile path. I also don’t think that last path is necessary, so this can simplify to:
%windir%\System32\cmd.exe /K %USERPROFILE%\miniconda3\Scripts\activate.bat
In Windows Terminal, create a terminal profile:
- open Settings > Add a new profile > New empty profile.
Interactive setup
We can use the Windows Terminal app to set up the miniconda prompt. Things to change include:
- Name
- Command line: Use the target field from above. In my case, I paste
in
%windir%\System32\cmd.exe /K %USERPROFILE%\miniconda3\Scripts\activate.bat
- Starting directory: I select “Use parent process directory”.
- Icon: I use the snake emoji 🐍 but this page has a hint about using a .ico file.
JSON
Instead of the interactive setup, we can open the JSON settings file
(Settings > Open JSON file) and modify/paste in the settings. Here
are my settings. Here the guid
field was created by Windows
Terminal so we should use the one it provides for us.
{
"commandline": "%windir%\\System32\\cmd.exe /K %USERPROFILE%\\miniconda3\\Scripts\\activate.bat",
"guid": "{2679ff34-f6b9-5fcd-9b81-08b50df61bae}",
"icon": "\ud83d\udc0d",
"name": "Miniconda",
"startingDirectory": null
}
April, May 2022
Source: 2022-04-02.Rmd
.Rprofile contents
I am now adding the following to my .Rprofile file. By default, renv::restore() will undo the entire restore attempt if any of the package installs fails. (It wants the whole transaction to succeed.) The following option undoes that default so that successful package installs are kept.
options(renv.config.install.transactional = FALSE)
My full .Rprofile tends to be something like:
source("renv/activate.R")
options(
repos = c(
tjmahr = 'https://tjmahr.r-universe.dev',
# stan = "https://mc-stan.org/r-packages/",
getOption("repos")
)
)
options(WrapRmd.width = 72)
options(styler.addins_style_transformer = "grkstyle::grk_style_transformer()")
options(renv.config.install.transactional = FALSE)
Note that I put the renv activation first because it does some work with the repos option and I don’t want to get in the way of it.
R 4.2.0 alpha notes
These are some notes about workarounds etc. for the devel/alpha version of R 4.2.0 that I have been using on Windows for the past few weeks.
- To fix the text shadow in the syntax highlighting in the docs in RStudio, one workaround is to replace the prism.css file in the folder C:\Program Files\R\R-4.2.0alpha\doc\html with a modified css file. I just download the Tomorrow Night theme from https://prismjs.com/ and use that one because it doesn’t have a shadow by default.
- For a clean break, I removed Rtools40 and removed any use of it in the PATH. Rtools42/R 4.2.0 do not depend on the path.
- Rtools42 adds Rtools42 Bash to the start menu. I used pacman -Syuu to update the Msys2 stuff in Rtools. If I run this again, it will update other packages.
- Here is the documentation page for Rtools42: https://cran.r-project.org/bin/windows/base/howto-R-4.2.html
February 2022
Source: 2022-02-01.Rmd
Phonotactic probability
I asked the PhonCorps Slack what the state of the art was for computing phonotactic probability. I said I was using the IPhod database with SUBTLEXus frequencies. Some responses:

- It’s not too hard to compute probabilities by hand if you have a pronunciation dictionary.
- KU Phonotactic Probability Calculator.
- One person advised that the KU calculator uses the Webster Pocket Dictionary, which makes it a subset of the corpus used by IPhod.
- For neighborhood density, CLEARPOND and the English Lexicon Project were mentioned.
R code to parse the KU phonotactic probability calculator output
# Parse the output of https://calculator.ku.edu/phonotactic/about
# in base R
# TJ Mahr
test_lines <- c(
"b@d - English", "0.0512 0.0794 0.0380", "0.0059 0.0024", "1.1686 1.0083",
"bcl - English", "0.0512 0.0165 0.0737", "0.0008 0.0035", "1.1414 1.0043",
"bi - English", "0.0512 0.0318", "0.0022", "1.0830 1.0022",
"bIg - English", "0.0512 0.0962 0.0179", "0.0041 0.0035", "1.1653 1.0076",
"but - English", "0.0512 0.0221 0.0660", "0.0012 0.0027", "1.1393 1.0039",
"bW - English", "0.0512 0.0097", "0.0006", "1.0609 1.0006",
"bO - English", "0.0512 0.0034", "0.0003", "1.0546 1.0003",
"b^di - English",
"0.0512 0.0392 0.0380 0.0432",
"0.0034 0.0010 0.0037",
"1.1716 1.0081",
"b^s - English", "0.0512 0.0392 0.0788", "0.0034 0.0039", "1.1692 1.0073",
"Cu - English", "0.0089 0.0221", "0.0002", "1.0310 1.0002"
)
split_ku_chunks <- function(lines) {
starts <- grep(" - ", lines)
ends <- c(starts[-1] - 1, length(lines))
Map(seq, starts, ends) |>
lapply(function(i) lines[i])
}
parse_ku_chunk <- function(lines) {
# How the output is documented in the paper.
# Vitevitch, M.S. & Luce, P.A. (2004) A web-based interface to calculate
# phonotactic probability for words and nonwords in English. Behavior Research
# Methods, Instruments, and Computers, 36, 481-487.
# "The output of the Phonotactic Probability Calculator
# (which appears in the field on the right side of your
# browser) typically consists of four lines of information
# for each item entered in the left field. The first line con-
# tains the phonemic transcription you originally entered.
# The second line contains the position-specific probabil-
# ity for each segment in the item entered by the user. If the
# (non)word you entered contains more than seven pho-
# nemes, the position-specific probabilities for each pho-
# neme in the item will wrap around to the next line.
#
# "The third line (assuming the entry is seven phonemes
# or less) contains the position-specific biphone probabil-
# ities for each of the biphones in the word."
pronunciation <- lines[1] |>
strsplit(" - ") |>
unlist() |>
utils::head(1)
phones <- pronunciation |>
strsplit("") |>
unlist()
biphones <- paste0(phones[-length(phones)], phones[-1])
values <- lines[-1] |>
paste0(collapse = " ") |>
strsplit(" ") |>
unlist() |>
as.numeric()
i_position_probs <- seq(1, length(phones))
i_biphone_probs <- seq(1, length(biphones)) + length(phones)
positional <- data.frame(
pronunciation = pronunciation,
class = "positional_prob",
token = phones,
value = values[i_position_probs]
)
biphone <- data.frame(
pronunciation = pronunciation,
class = "biphone_prob",
token = biphones,
value = values[i_biphone_probs]
)
rbind(positional, biphone)
}
do_call_rbind <- function(xs, ...) do.call(rbind, xs, ...)
test_lines |>
split_ku_chunks() |>
lapply(parse_ku_chunk) |>
do_call_rbind()
#> pronunciation class token value
#> 1 b@d positional_prob b 0.0512
#> 2 b@d positional_prob @ 0.0794
#> 3 b@d positional_prob d 0.0380
#> 4 b@d biphone_prob b@ 0.0059
#> 5 b@d biphone_prob @d 0.0024
#> 6 bcl positional_prob b 0.0512
#> 7 bcl positional_prob c 0.0165
#> 8 bcl positional_prob l 0.0737
#> 9 bcl biphone_prob bc 0.0008
#> 10 bcl biphone_prob cl 0.0035
#> 11 bi positional_prob b 0.0512
#> 12 bi positional_prob i 0.0318
#> 13 bi biphone_prob bi 0.0022
#> 14 bIg positional_prob b 0.0512
#> 15 bIg positional_prob I 0.0962
#> 16 bIg positional_prob g 0.0179
#> 17 bIg biphone_prob bI 0.0041
#> 18 bIg biphone_prob Ig 0.0035
#> 19 but positional_prob b 0.0512
#> 20 but positional_prob u 0.0221
#> 21 but positional_prob t 0.0660
#> 22 but biphone_prob bu 0.0012
#> 23 but biphone_prob ut 0.0027
#> 24 bW positional_prob b 0.0512
#> 25 bW positional_prob W 0.0097
#> 26 bW biphone_prob bW 0.0006
#> 27 bO positional_prob b 0.0512
#> 28 bO positional_prob O 0.0034
#> 29 bO biphone_prob bO 0.0003
#> 30 b^di positional_prob b 0.0512
#> 31 b^di positional_prob ^ 0.0392
#> 32 b^di positional_prob d 0.0380
#> 33 b^di positional_prob i 0.0432
#> 34 b^di biphone_prob b^ 0.0034
#> 35 b^di biphone_prob ^d 0.0010
#> 36 b^di biphone_prob di 0.0037
#> 37 b^s positional_prob b 0.0512
#> 38 b^s positional_prob ^ 0.0392
#> 39 b^s positional_prob s 0.0788
#> 40 b^s biphone_prob b^ 0.0034
#> 41 b^s biphone_prob ^s 0.0039
#> 42 Cu positional_prob C 0.0089
#> 43 Cu positional_prob u 0.0221
#> 44 Cu biphone_prob Cu 0.0002
Some old random walk code
I had some old code that asked what a 12-step random walk looked like if it started at the ceiling and could only go left or stay put on the first move.
par(mar = c(4, 4, 1, 1))
set.seed(20220209)
take_one_step <- function(x, y, ceiling = 0) {
  this_step <- if (x == ceiling) {
    sample(c(-1, 0), 1)
  } else {
    sample(c(-1, 0, 1), 1)
  }
  x + this_step
}
no_ceiling <- replicate(
30000,
Reduce(function(x, y) take_one_step(x, y, ceiling = 1000), 0:12, init = 0)
)
yes_ceiling <- replicate(
30000,
Reduce(function(x, y) take_one_step(x, y, ceiling = 0), 0:12, init = 0)
)
library(patchwork)
p1 <- wrap_elements(
full = ~ plot(table(no_ceiling), lwd = 4)
)
p2 <- wrap_elements(
full = ~ plot(table(yes_ceiling), lwd = 4)
)
p1 + p2
That ain’t a truncated Gaussian. 👀
January 2022
Source: 2022-01-21.Rmd
I have decided to get rid of “how to” from heading names. I’m thinking about creating a separate links/bookmarks section.
New interesting R packages.
- ggdensity An R package for interpretable visualizations of density
estimates https://github.com/jamesotto852/ggdensity
- the README has a good figure showing a multivariate normal distribution will mess up the density for a bimodal region
- khroma 🎨 Colour Schemes for Scientific Data Visualization https://github.com/tesselle/khroma
- geomtextpath Create curved text paths in ggplot2 https://github.com/AllanCameron/geomtextpath
Easy way to sample from a multinomial.
sample() takes a prob argument, so you can skip rmultinom entirely!
— David Robinson (@drob) October 27, 2021
sample(n_comp, n, prob = mixture_probs, replace = TRUE)
Conditionally provide tibbles. The newsletter by rOpenSci highlights this trick by the palmerpenguins package.
delayedAssign("penguins", {
if (requireNamespace("tibble", quietly = TRUE)) {
tibble::as_tibble(readRDS("data/penguins.rds"))
} else {
readRDS("data/penguins.rds")
}
})
delayedAssign("penguins_raw", {
if (requireNamespace("tibble", quietly = TRUE)) {
tibble::as_tibble(readRDS("data/penguins_raw.rds"))
} else {
readRDS("data/penguins_raw.rds")
}
})
This rOpenSci newsletter also detailed how to get new words past the package description spellchecker.
Manually delete a facet cell
Suppose you have some ragged data.
library(tidyverse)
data <- tibble(
id = letters[1:14],
group = rep(c("g1", "g2", "g3"), times = c(5, 4, 5)),
data = lapply(1:14, function(x) {
# just making some data
n_points <- sample(c(2:5), size = 1)
ages <- sort(sample(c(10:24), size = n_points))
data.frame(age = ages, score = rbeta(n_points, 1, 4))
})
) %>%
tidyr::unnest(data)
ggplot(data, aes(x = age, y = score)) +
geom_point() +
geom_line(aes(color = group)) +
facet_wrap("id", ncol = 5)
We would like to have one row per group. So we actually would want four panels on the middle row. If we add an extra panel to fill that space and then remove it, we can achieve this goal.
First add the dummy panel.
data <- data %>%
tibble::add_row(id = "i2", group = "g2", age = median(data$age), score = 0) %>%
# two points so that geom_line() doesn't warn
tibble::add_row(id = "i2", group = "g2", age = median(data$age), score = 0)
ggplot(data, aes(x = age, y = score)) +
geom_point(na.rm = TRUE) +
geom_line(aes(color = group), na.rm = TRUE) +
facet_wrap("id", ncol = 5)
My overly complicated approach was to manually
delete the grob using the following. Finding the pattern
name is very tedious because you basically have to guess the strip and
panel numbers.
p <- last_plot()
g <- ggplotGrob(p)
# guess and check
pattern <- "panel-4-3|strip-t-5-2"
to_drop <- str_which(g$layout$name, pattern)
to_keep_names <- str_subset(g$layout$name, pattern, negate = TRUE)
# delete grob data, clean up layout table
g$grobs[to_drop] <- NULL
g$layout <- g$layout[g$layout$name %in% to_keep_names, ]
grid::grid.newpage()
grid::grid.draw(g)
Although I am going to give a better solution, I did write this entry precisely to have this code here for my reference.
A better approach is to blank the cell with a ribbon. We have to eliminate strip titles from the theme for this approach to work.
ggplot(data, aes(x = age, y = score)) +
geom_point(na.rm = TRUE) +
geom_line(aes(color = group), na.rm = TRUE) +
geom_ribbon(
aes(ymin = -Inf, ymax = Inf),
data = tibble(
id = "i2",
age = c(-Inf, Inf),
group = "g2",
score = NA_real_
),
fill = "white",
color = "white"
) +
facet_wrap("id", ncol = 5) +
theme(strip.text.x = element_blank())
Brenton
on Twitter suggested numbering IDs within groups and using
facet_grid()
. That works, and it also can give labeled
rows.
data <- data %>%
group_by(group) %>%
mutate(col_num = match(id, unique(id))) %>%
ungroup()
ggplot(data, aes(x = age, y = score)) +
geom_point(na.rm = TRUE) +
geom_line(aes(color = group), na.rm = TRUE) +
geom_ribbon(
aes(ymin = -Inf, ymax = Inf),
data = tibble(
col_num = 5,
age = c(-Inf, Inf),
group = "g2",
score = NA_real_
),
fill = "white",
color = "white"
) +
facet_grid(group ~ col_num) +
theme(strip.text.x = element_blank())
Compute spline predictions manually from GAMLSS
My goal is to write down some reusable code for how to get spline predictions for unobserved x values by hand from a GAMLSS model.
Problem setup
Here are 98 random points from a subrange of the intelligibility dataset.
library(tidyverse)
suppressMessages(library(gamlss))
select <- dplyr::select
data_intel <- tibble::tribble(
~id, ~age_months, ~mean_intelligibility,
1L, 44, 0.74,
2L, 49, 0.779,
3L, 56, 0.686,
4L, 51, 0.675,
5L, 42, 0.785,
6L, 42, 0.708,
7L, 38, 0.557,
8L, 44, 0.756,
9L, 56, 0.64,
10L, 50, 0.767,
11L, 48, 0.858,
12L, 53, 0.94,
13L, 49, 0.673,
14L, 56, 0.928,
15L, 54, 0.827,
16L, 51, 0.868,
17L, 53, 0.784,
18L, 50, 0.908,
19L, 37, 0.458,
20L, 37, 0.184,
21L, 51, 0.688,
22L, 53, 0.753,
23L, 43, 0.802,
24L, 47, 0.841,
25L, 45, 0.523,
26L, 37, 0.288,
27L, 42, 0.672,
28L, 50, 0.777,
29L, 54, 0.749,
30L, 53, 0.901,
31L, 55, 0.806,
32L, 45, 0.646,
33L, 45, 0.689,
34L, 45, 0.806,
35L, 46, 0.859,
36L, 36, 0.557,
37L, 38, 0.606,
38L, 50, 0.804,
39L, 44, 0.853,
40L, 51, 0.747,
41L, 37, 0.659,
42L, 48, 0.883,
43L, 45, 0.798,
44L, 39, 0.483,
45L, 59, 0.976,
46L, 58, 0.786,
48L, 48, 0.775,
49L, 47, 0.896,
50L, 38, 0.921,
51L, 41, 0.686,
53L, 54, 0.72,
54L, 51, 0.965,
55L, 44, 0.777,
56L, 42, 0.53,
57L, 36, 0.707,
58L, 60, 0.908,
59L, 37, 0.566,
60L, 51, 0.598,
61L, 58, 0.938,
62L, 54, 0.925,
63L, 47, 0.757,
64L, 42, 0.63,
65L, 45, 0.764,
66L, 59, 0.85,
67L, 47, 0.786,
68L, 38, 0.781,
69L, 44, 0.929,
70L, 36, 0.356,
71L, 55, 0.802,
72L, 54, 0.864,
73L, 43, 0.864,
74L, 49, 0.848,
75L, 38, 0.574,
76L, 53, 0.916,
77L, 41, 0.85,
78L, 59, 0.851,
79L, 46, 0.444,
80L, 55, 0.778,
81L, 54, 0.931,
82L, 50, 0.32,
83L, 58, 0.855,
84L, 39, 0.64,
85L, 50, 0.727,
86L, 50, 0.857,
87L, 39, 0.731,
88L, 46, 0.624,
89L, 53, 0.951,
90L, 46, 0.873,
91L, 47, 0.897,
92L, 46, 0.748,
93L, 59, 0.936,
94L, 50, 0.881,
95L, 43, 0.815,
96L, 44, 0.768,
97L, 43, 0.789,
98L, 50, 0.728,
99L, 38, 0.794,
100L, 44, 0.509
)
data_intel <- data_intel %>%
arrange(age_months)
But note that we are missing data from three ages (x values).
ggplot(data_intel) +
aes(x = age_months, y = mean_intelligibility) +
geom_point() +
geom_vline(xintercept = c(40, 52, 57), linetype = "dashed") +
scale_x_continuous(breaks = c(36, 48, 60))
Here is how we would fit a beta-regression model as we did in the growth curve paper.
fit_beta_gamlss <- function(data, mu_df = 3, sigma_df = 2) {
BE <- gamlss.dist::BE
gamlss.control <- gamlss::gamlss.control
ns <- splines::ns
model <- wisclabmisc::mem_gamlss(
mean_intelligibility ~ ns(age_months, df = mu_df),
sigma.formula = ~ ns(age_months, df = sigma_df),
family = BE(),
data = data,
control = gamlss.control(trace = FALSE)
)
model$.user$mu_df <- mu_df
model$.user$sigma_df <- sigma_df
model$.user$mu_basis <- ns(data$age_months, mu_df)
model$.user$sigma_basis <- ns(data$age_months, sigma_df)
model
}
model <- fit_beta_gamlss(data_intel, mu_df = 3, sigma_df = 2)
Then we can get centiles pretty easily for the existing x values. But note the three gaps on the percentile line.
centiles <- data_intel %>%
select(age_months) %>%
wisclabmisc::predict_centiles(model)
ggplot(data_intel) +
aes(x = age_months, y = mean_intelligibility) +
geom_point() +
geom_point(
aes(y = c10),
data = centiles,
color = "skyblue3",
size = 10,
shape = "-"
) +
scale_x_continuous(breaks = c(36, 48, 60))
If we ask GAMLSS to give us centiles for those gaps—herein lies the problem—we might get a warning message like the following:
centiles2 <- data_intel %>%
select(age_months) %>%
tidyr::expand(age_months = tidyr::full_seq(age_months, 1)) %>%
wisclabmisc::predict_centiles(model)
#> Warning in predict.gamlss(obj, what = "mu", newdata = newx, type = "response", : There is a discrepancy between the original and the re-fit
#> used to achieve 'safe' predictions
#>
I don’t want to mess around in unsafe territory with GAMLSS. (What if they change how they handle this problem in the future?) So I would like to estimate values for those missing x values from the spline by hand.
Problem solution
Use predict()
on the spline object to get the values of
the basis functions at the new x values. I stashed the original spline
object inside of the model.
mu_basis <- model$.user$mu_basis
ages_to_predict <- data_intel %>%
tidyr::expand(age_months = tidyr::full_seq(age_months, 1))
mu_basis2 <- predict(mu_basis, newx = ages_to_predict$age_months)
library(patchwork)
tjmisc::ggmatplot(mu_basis) +
tjmisc::ggmatplot(mu_basis2)
We do the following to get model parameters at each age. This code is the point of this note.
mu_basis <- model$.user$mu_basis
sigma_basis <- model$.user$sigma_basis
newx <- ages_to_predict$age_months
mu_basis2 <- cbind(1, predict(mu_basis, newx = newx))
sigma_basis2 <- cbind(1, predict(sigma_basis, newx = newx))
manual_centile <- tibble::tibble(
age_months = newx,
# multiply coefficients and undo link function
mu = plogis(mu_basis2 %*% coef(model, what = "mu"))[, 1],
sigma = plogis(sigma_basis2 %*% coef(model, what = "sigma"))[, 1],
# now take centiles
c10 = gamlss.dist::qBE(.1, mu, sigma)
)
manual_centile
#> # A tibble: 25 × 4
#> age_months mu sigma c10
#> <dbl> <dbl> <dbl> <dbl>
#> 1 36 0.522 0.341 0.294
#> 2 37 0.560 0.334 0.336
#> 3 38 0.596 0.328 0.378
#> 4 39 0.630 0.321 0.418
#> 5 40 0.661 0.315 0.456
#> 6 41 0.688 0.309 0.491
#> 7 42 0.712 0.304 0.522
#> 8 43 0.731 0.298 0.548
#> 9 44 0.745 0.293 0.569
#> 10 45 0.755 0.288 0.584
#> # ℹ 15 more rows
Let’s look at our manually computed 10th percentile versus the package-provided one.
ggplot(data_intel) +
aes(x = age_months, y = mean_intelligibility) +
geom_point() +
geom_point(
aes(y = c10),
data = manual_centile,
color = "orange",
size = 10,
shape = "+"
) +
geom_point(
aes(y = c10),
data = centiles,
color = "skyblue3",
size = 10,
shape = "-"
) +
scale_x_continuous(breaks = c(36, 48, 60))
Count number of function calls in some R code
Do data analysis on the getParseData() result.
code <- quote({
check_out <- my(fancy(code))
here_is <- some(fancy(code))
})
parse_data <- getParseData(parse(text = deparse(code), keep.source = TRUE))
head(parse_data)
#> line1 col1 line2 col2 id parent token terminal text
#> 55 1 1 4 1 55 0 expr FALSE
#> 1 1 1 1 1 1 55 '{' TRUE {
#> 25 2 5 2 32 25 55 expr FALSE
#> 3 2 5 2 13 3 5 SYMBOL TRUE check_out
#> 5 2 5 2 13 5 25 expr FALSE
#> 4 2 15 2 16 4 25 LEFT_ASSIGN TRUE <-
Count tokens using results of all.names() and all.vars().
names <- all.names(code)
vars <- all.vars(code)
# Just the non-variable tokens
names[!names %in% vars]
#> [1] "{" "<-" "my" "fancy" "<-" "some" "fancy"
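The parse data can also do the actual counting: function calls show up as SYMBOL_FUNCTION_CALL tokens, so a table of those gives the call counts (a small sketch using the parse_data object from above).

calls <- parse_data$text[parse_data$token == "SYMBOL_FUNCTION_CALL"]
table(calls)
# here: fancy is called twice, my and some once each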
December 2021
Source: 2021-12-07.Rmd
Quick hits
rlang provides done()
to break out of a loop.
This paper uses by-word entropy as a measure of intelligibility. https://doi.org/10.1017/S0305000921000714
Leslie matrix. Advent of Code day 6 (simulating population growth) can be done with matrix multiplication on a Leslie matrix which is a transition matrix on age counts:
The Leslie matrix is a square matrix with the same number of rows and columns as the population vector has elements. The (i,j)th cell in the matrix indicates how many individuals will be in the age class i at the next time step for each individual in stage j.
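A minimal sketch of that idea in R. The timers and the final count of 5934 come from the public Advent of Code 2021 day 6 example; the matrix construction is mine, not code from the original note.

# Leslie-style transition matrix for the day 6 fish timers (0 through 8).
# Column j says where a fish with timer j - 1 ends up on the next day.
L <- matrix(0, nrow = 9, ncol = 9)
L[cbind(1:8, 2:9)] <- 1  # timer t becomes t - 1
L[7, 1] <- 1             # a timer-0 fish resets to 6...
L[9, 1] <- 1             # ...and spawns a new fish with timer 8

counts <- tabulate(c(3, 4, 3, 1, 2) + 1, nbins = 9)  # example starting timers
for (day in 1:80) counts <- as.vector(L %*% counts)
sum(counts)
#> [1] 5934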
Monads in one sentence.
A monad is the minimum amount of structure needed to overload function composition in a way that “performs an extra computation” on the intermediate value. – https://www.youtube.com/watch?v=Nq-q2USYetQ&feature=youtu.be
November 2021
Source: 2021-11-22.Rmd
Quick hits
Base graphics in knitr. I learned while porting over some old notes that par() options are not preserved between knitr chunks unless opts_knit$get("global.par") is TRUE. It’s FALSE by default. No funny business using old <- par(...).
Sampling a swimming pool depth. Jorbs on his Slay the Spire stream commented on problems with big data. He used a nice analogy for sampling bias. Imagine you have a swimming pool and we can’t see the bottom of it. We want to measure the average depth of it. No amount of big data will accurately estimate the depth if all the samples come from the shallow end of the pool. But if you smartly sample the pool with a grid of points, then you can come up with an accurate estimate from relatively few points. This example is nice because it can illustrate accuracy versus precision.

The analogy also funnily resembles an analogy used for Hamiltonian Monte Carlo: a puck skating around a smooth curved surface. If the pool is the posterior distribution and the depth is the negative log posterior density, then we want to explore that pool too.
Paul recommended using Frank Harrell’s rms::orm()
function for ordinal regression.
eval() is stack inconsistent (whatever that means)

base::eval() and rlang::eval_bare() do things differently.
Here are the examples from ?eval_bare()
# eval_bare() works just like base::eval() but you have to create
# the evaluation environment yourself:
eval_bare(quote(foo), env(foo = "bar"))
#> [1] "bar"
# eval() has different evaluation semantics than eval_bare(). It
# can return from the supplied environment even if its an
# environment that is not on the call stack (i.e. because you've
# created it yourself). The following would trigger an error with
# eval_bare():
ret <- quote(return("foo"))
eval(ret, env())
#> [1] "foo"
# eval_bare(ret, env()) # "no function to return from" error
# Another feature of eval() is that you can control surround loops:
bail <- quote(break)
while (TRUE) {
eval(bail)
# eval_bare(bail) # "no loop for break/next" error
}
# To explore the consequences of stack inconsistent semantics, let's
# create a function that evaluates `parent.frame()` deep in the call
# stack, in an environment corresponding to a frame in the middle of
# the stack. For consistency with R's lazy evaluation semantics, we'd
# expect to get the caller of that frame as result:
fn <- function(eval_fn) {
list(
returned_env = middle(eval_fn),
actual_env = current_env()
)
}
middle <- function(eval_fn) {
deep(eval_fn, current_env())
}
deep <- function(eval_fn, eval_env) {
expr <- quote(parent.frame())
eval_fn(expr, eval_env)
}
# With eval_bare(), we do get the expected environment:
fn(rlang::eval_bare)
#> $returned_env
#> <environment: 0x55cac3595df8>
#>
#> $actual_env
#> <environment: 0x55cac3595df8>
#>
# But that's not the case with base::eval():
fn(base::eval)
#> $returned_env
#> <environment: 0x55cac1cc7980>
#>
#> $actual_env
#> <environment: 0x55cac1cc2170>
#>
The Gaussian KDE is a sum of baby Gaussians?!
Today I learned from here that the Gaussian KDE is the sum of a bunch of little Gaussian curves. Here look:
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
d <- penguins %>%
filter(!is.na(bill_length_mm)) %>%
sample_n(size = 20)
# grid of xs
x <- seq(30, 63, length.out = 200)
# compute density of xs using each observed value
# as the mean.
l <- lapply(
d$bill_length_mm,
function(m) {
data.frame(
x = x,
y = dnorm(x, mean = m, sd = bw.nrd0(d$bill_length_mm))
)
}
)
dl <- bind_rows(l, .id = "obs")
# plot them, their sum, and the default density curve
ggplot(d) +
aes(x = bill_length_mm) +
geom_density(aes(color = "geom_density()"), size = 2, show.legend = FALSE) +
geom_rug() +
geom_line(
aes(x = x, y = y / nrow(d), group = obs),
data = dl, alpha = .1
) +
stat_summary(
aes(x = x, y = y / nrow(d), color = "sum of pointwise densities"),
data = dl,
# alpha = .1,
geom = "line",
fun = "sum",
# color = "orange"
) +
scale_color_manual(values = c("black", "orange")) +
labs(color = NULL)
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
Using closures to make 16 related functions
The R Open Sci newsletter linked to a mailing list thread about how to stash data inside of an R package.
Duncan Murdoch suggests using a function factory to create closures that can access data in the parent environment.
Your local() call can create several functions and return them in a list; then just those functions have access to the local variables. For example,

createFns <- function() {
  .fooInfo <- NULL
  fn1 <- function (...) { ... }
  fn2 <- function (...) { ... }
  list(fn1 = fn1, fn2 = fn2)
}
fns <- createFns()
fn1 <- fns$fn1
fn2 <- fns$fn2

Now fn1 and fn2 are functions that can see .fooInfo, and nobody else can (without going through contortions).
(I had to fix the code to create a real function factory.)
Then Martin Mächler replies to say yup, this is what we do in base R.
Note that the above approach has been how nls() has been implemented for R … a very long time ago {before R 1.0.0}

e.g. from example(nls):

DNase1 <- subset(DNase, Run == 1)
fm1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
str(fm1$m)
#> List of 16
#> $ resid :function ()
#> $ fitted :function ()
#> $ formula :function ()
#> $ deviance :function ()
#> $ lhs :function ()
#> $ gradient :function ()
#> $ conv :function ()
#> $ incr :function ()
#> $ setVarying:function (vary = rep_len(TRUE, np))
#> $ setPars :function (newPars)
#> $ getPars :function ()
#> $ getAllPars:function ()
#> $ getEnv :function ()
#> $ trace :function ()
#> $ Rmat :function ()
#> $ predict :function (newdata = list(), qr = FALSE)
#> - attr(*, "class")= chr "nlsModel"

# so 16 functions, all sharing the *same* environment very
# efficiently and nicely

# this is *the* environment for the fitted model:
fmE <- environment(fm1$m[[1]])
ls.str(fmE)
#> convCrit : function ()
#> dev : num 0.00479
#> env : <environment: 0x0000028e34566e48>
#> form : Class 'formula' language density ~ SSlogis(log(conc), Asym, xmid, scal)
#> getPars : function ()
#> getPars.noVarying : function ()
#> getPars.varying : function ()
#> getRHS : function ()
#> getRHS.noVarying : function ()
#> getRHS.varying : function ()
#> gradCall : language attr(ans, "gradient")[c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE| __truncated__ ...
#> ind : List of 3
#> $ Asym: int 1
#> $ xmid: int 2
#> $ scal: int 3
#> internalPars : Named num [1:3] 2.35 1.48 1.04
#> lhs : num [1:16] 0.017 0.018 0.121 0.124 0.206 0.215 0.377 0.374 0.614 0.609 ...
#> nDcentral : logi FALSE
#> npar : int 3
#> QR : List of 4
#> $ qr : num [1:16, 1:3] -1.5221 0.0086 0.0314 0.0314 0.0584 ...
#> $ rank : int 3
#> $ qraux: num [1:3] 1.01 1.03 1.28
#> $ pivot: int [1:3] 1 2 3
#> resid : num [1:16] -0.01368 -0.01268 0.00895 0.01195 -0.00258 ...
#> rhs : num [1:16] 0.0307 0.0307 0.1121 0.1121 0.2086 ...
#> scaleOffset : num 0
#> setPars : function (newPars)
#> setPars.noVarying : function (newPars)
#> setPars.varying : function (newPars)
#> upper : NULL
#> useParams : logi [1:3] TRUE TRUE TRUE
#> wts : num [1:16] 1 1 1 1 1 1 1 1 1 1 ...

so the environment “contains” the functions themselves (but quite a few more things) and for an environment that means it only has pointers to the same function objects which are also in fm1$m.

So, there has been a nice convincing and important example on how to do this - inside R for more than two decennia.
Final emphasis was added by me.
OneNote archive
Source: 2021-11-01-archival-notes.Rmd
Here is a bunch of pre-2021 items from OneNote and elsewhere.
Added variable plots (2019-05-14)
- Let m = y ~ x1 + ... be the full regression model.
- Let m_y1 = y ~ ... (all of the predictors except x1).
- Let m_x1 = x1 ~ ... (regress x1 on those same predictors).
- Let m_partial = resid(m_y1) ~ resid(m_x1).
- The plot of resid(m_x1) versus resid(m_y1) is the added variable plot. It shows “x1 | others” and “y | others”.
- Residuals in the final m_partial model will be the same as the residuals from the full model m.
- The coefficient of x1 in the final m_partial model will be the same as the coefficient from the full model m.
AV plots by hand and by car::avPlot()
par(mar = c(4, 4, 1, 1))
m <- lm(mpg ~ wt + hp + am, mtcars)
m_y1 <- lm(mpg ~ hp + am, mtcars)
m_x1 <- lm(wt ~ hp + am, mtcars)
m_partial <- lm(resid(m_y1) ~ resid(m_x1))
library(patchwork)
p1 <- wrap_elements(
full = ~ plot(resid(m_x1), resid(m_y1))
)
p2 <- wrap_elements(
full = ~ car::avPlot(m, "wt")
)
p1 + p2
all.equal(residuals(m_partial), residuals(m))
#> [1] TRUE
coef(m)["wt"]
#> wt
#> -2.878575
coef(m_partial)[2]
#> resid(m_x1)
#> -2.878575
What about interactions?
R 3.6.0 released! (2019-04-26)
Highlights for me
- New sample() implementation
- New function asplit() allows splitting an array or matrix by its margins.
- Functions mentioned that I didn’t know about: lengths(), trimws(), extendrange(), convertColor(), strwidth()
Standardized gamma distributions (2019-04-18)
A gamma times a positive non-zero constant is still a gamma. Paul and I used this fact to standardize results from different models onto a single “z” scale. Here’s a made-up example.
Simulate some gamma-distributed data.
library(patchwork)
set.seed(20211122)
par(mar = c(4, 4, 1, 1))
y <- rgamma(n = 100, shape = 20)
y2 <- rgamma(n = 100, shape = 200)
wrap_elements(full = ~ hist(y)) +
wrap_elements(full = ~ hist(y2))
Fit a GLM
get_shape <- function(model) {
1 / summary(model)[["dispersion"]]
}
m <- glm(y ~ 1, Gamma(link = "identity"))
m2 <- glm(y2 ~ 1, Gamma(link = "identity"))
get_shape(m)
#> [1] 21.09197
get_shape(m2)
#> [1] 178.6259
The two z
values have similar scales.
par(mar = c(4, 4, 1, 1))
r <- residuals(m)
r2 <- residuals(m2)
z <- y / fitted(m)
z2 <- y2 / fitted(m2)
wrap_elements(
full = ~ car::qqPlot(y, distribution = "gamma", shape = get_shape(m))
) +
wrap_elements(
full = ~ car::qqPlot(y2, distribution = "gamma", shape = get_shape(m2))
)
wrap_elements(
full = ~ car::qqPlot(
z,
distribution = "gamma",
shape = get_shape(m),
scale = 1 / get_shape(m))
) +
wrap_elements(
full = ~ car::qqPlot(
z2,
distribution = "gamma",
shape = get_shape(m2),
scale = 1 / get_shape(m2))
)
Jaccard similarity (2019-03-14)
Jaccard similarity is the size of intersection of A and B divided by size of union of A and B.
Suppose a listener types a few sentences twice. These are probes for intra-listener transcription reliability. Suppose that one of these sentences is 6 words long. If the listener typed 8 unique words across the two transcriptions, and 4 of them appear in both, then they have a Jaccard similarity of 4 / 8 = 0.5. Then we take the weighted mean of the Jaccard scores. We use the sentence length as the weight, so this sentence would have a weight of 6.
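A minimal sketch of that computation (the transcriptions here are made up, chosen so the counts match the 4 / 8 example above):

jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

transcription_1 <- c("the", "dog", "ran", "to", "the", "park", "fast")
transcription_2 <- c("the", "big", "dog", "ran", "a", "park")

# 4 shared words out of 8 unique words overall
jaccard(transcription_1, transcription_2)
#> [1] 0.5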
Asked twitter if they knew any IRR tutorials
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
Phylogenetic regression (2019-02-26)
Listened to bits of phylogenetic regression lecture.
- You can model a simple linear regression as a multivariate (normal) regression (see the sketch after this list).
- Make the covariance matrix the identity matrix and multiply it by the error term sigma.
- In this formulation, you can swap out the identity correlation matrix for something estimated using a correlation/distance matrix. If the distances were ages, you can have units with similar ages have correlated errors. Now you have a Gaussian process regression.
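A minimal sketch of the first two bullets (my own check; it assumes the mvtnorm package): the log-likelihood of an ordinary lm() fit equals a single multivariate-normal density whose covariance is the identity matrix scaled by sigma-squared.

# ordinary regression as one multivariate normal draw
m <- lm(mpg ~ wt, data = mtcars)
n <- nobs(m)
sigma2 <- sum(residuals(m)^2) / n  # ML estimate of the error variance
mvtnorm::dmvnorm(
  mtcars$mpg,
  mean = fitted(m),
  sigma = sigma2 * diag(n),  # identity matrix scaled by sigma^2
  log = TRUE
)
logLik(m)  # same number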
multicomp (2019-02-25)
- We are using the multcomp package and glht() to test hypotheses from fitted models. Never used this package before. Something to learn.
- glht() will compute a group difference (like asymptote of SMI-LCT vs SMI-LCI) from a fitted model and give you a standard error, z statistic and p value for that difference.
- I can get very similar results by sampling the multivariate normal distribution of the model coefficients/variance-covariance matrix and computing the group differences from the samples, using the standard deviation of the samples to get the standard error (sketched below).
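A sketch of that sampling shortcut (the model and contrast are invented for illustration; MASS::mvrnorm() draws the coefficient samples):

# sample coefficient vectors from the model's approximate sampling distribution
m <- lm(mpg ~ factor(cyl), data = mtcars)
draws <- MASS::mvrnorm(10000, mu = coef(m), Sigma = vcov(m))

# a group difference: the 6-cylinder effect minus the 8-cylinder effect
diffs <- draws[, "factor(cyl)6"] - draws[, "factor(cyl)8"]
c(estimate = mean(diffs), se = sd(diffs))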
Rolling list of bookmarks (2019-10-01)
dtool: Manage scientific data https://dtool.readthedocs.io/en/latest/
A very first introduction to Hamiltonian Monte Carlo https://blogs.rstudio.com/tensorflow/posts/2019-10-03-intro-to-hmc/
Some things you maybe didn’t know about linear regression https://ryxcommar.com/2019/09/06/some-things-you-maybe-didnt-know-about-linear-regression/
dbx database tools for R https://github.com/ankane/dbx
Dash for R https://medium.com/@plotlygraphs/announcing-dash-for-r-82dce99bae13
How to interpret F-statistic https://stats.stackexchange.com/questions/12398/how-to-interpret-f-and-p-value-in-anova
The origin of statistically significant https://www.johndcook.com/blog/2008/11/17/origin-of-statistically-significant/
tidymv: Tidy Model Visualisation for Generalised Additive Models https://cran.r-project.org/web/packages/tidymv/index.html
Step-by-step examples of building publication-quality figures in ggplot2 https://github.com/clauswilke/practical_ggplot2
From data to viz https://www.data-to-viz.com/
Shapley model explanation https://github.com/slundberg/shap
JavaScript versus Data Science https://software-tools-in-javascript.github.io/js-vs-ds/en/
Penalized likelihood estimation https://modernstatisticalworkflow.blogspot.com/2017/11/what-is-likelihood-anyway.html
UTF-8 everywhere https://utf8everywhere.org/
Unicode programming https://begriffs.com/posts/2019-05-23-unicode-icu.html
R package to simulate colorblindness https://github.com/clauswilke/colorblindr
Data version control https://dvc.org/
Email tips https://twitter.com/LucyStats/status/1131285346455625734?s=20
colorcet library (python) https://colorcet.pyviz.org/
HCL wizard http://hclwizard.org/hclwizard/
Coloring for colorblindness. Has 8 palettes of color pairs https://davidmathlogic.com/colorblind/
5 things to consider when creating your CSS style guide by @malimirkeccita https://medium.com/p/5-things-to-consider-when-creating-your-css-style-guide-7b85fa70039d
Tesseract OCR engine for R https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
Lua filters for rmarkdown documents https://github.com/crsh/rmdfiltr
An NIH Rmd template https://github.com/tgerke/nih-rmd-template
Commit message guide https://github.com/RomuloOliveira/commit-messages-guide
Linear regression diagnostic plots in ggplot2 https://github.com/yeukyul/lindia
A graphical introduction to dynamic programming https://avikdas.com/2019/04/15/a-graphical-introduction-to-dynamic-programming.html
Why software projects take longer than you think https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html
Automatic statistical reporting https://github.com/easystats/report
Multilevel models and CSD https://pubs.asha.org/doi/pdf/10.1044/2018_JSLHR-S-18-0075
Map of cognitive science http://www.riedlanna.com/cognitivesciencemap.html
An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data https://www.nature.com/articles/s41467-019-09785-8
Common statistical tests are linear models (or: how to teach stats) https://lindeloev.github.io/tests-as-linear/
Monte Carlo sampling does not “explore” the posterior https://statmodeling.stat.columbia.edu/2019/03/25/mcmc-does-not-explore-posterior/
How to develop the five skills that will make you a great analyst https://mode.com/blog/how-to-develop-the-five-soft-skills-that-will-make-you-a-great-analyst
Confidence intervals are a ring toss https://twitter.com/epiellie/status/1073385427317465089
Mathematics for Machine Learning https://mml-book.github.io/
20 Tips for Senior Thesis Writers http://hwpi.harvard.edu/files/complit/files/twenty_tips_for_senior_thesis_writers_revised_august_2012.pdf
Comparing common analysis strategies for repeated measures data http://eshinjolly.com/2019/02/18/rep_measures/
Cosine similarity, Pearson correlation, and OLS coefficients https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/
qqplotr is a nice package for plotting qqplots https://cran.r-project.org/web/packages/qqplotr/index.html
Multidimensional item response theory https://github.com/philchalmers/mirt
Aki’s tutorials/materials on model selection https://github.com/avehtari/modelselection_tutorial
An Introverts Guide to Conferences https://laderast.github.io/2018/05/17/a-introvert-s-survival-guide-to-conferences/
Best practice guidance for linear mixed-effects models in psychological science https://psyarxiv.com/h3duq/
Viewing matrices and probabilities as graphs https://www.math3ma.com/blog/matrices-probability-graphs
Cross-validation for hierarchical models https://avehtari.github.io/modelselection/rats_kcv.html
All of Aki’s tutorials https://avehtari.github.io/modelselection/
User-friendly p values http://thenode.biologists.com/user-friendly-p-values/research/
Iodide is a Javascript notebook https://alpha.iodide.io/
Interesting question about what do when transformation changes the “test” of a highest-density interval. https://discourse.mc-stan.org/t/exponentiation-or-transformation-of-point-estimates/7848
Some ways to rethink statistical rules https://allendowney.blogspot.com/2015/12/many-rules-of-statistics-are-wrong.html
Toward a Principled Bayesian Workflow https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html
Stumbled across an article on mixed models and effect sizes: https://www.journalofcognition.org/articles/10.5334/joc.10/
I had these quotes in the old notes
In so complex a thing as human nature, we must consider, it is hard to find rules without exception. — George Eliot
If any one faculty of our nature may be called more wonderful than the rest, I do think it is memory…The memory is sometimes so retentive, so serviceable, so obedient; at others, so bewildered and so weak. We are, to be sure, a miracle every way; but our powers of recollecting and of forgetting do seem peculiarly past finding out. — Jane Austen
Colophon
Notebook built using notestar, a notebook system that uses targets to create a bookdown document. Theme is water.css as provided by the cleanrmd package.