An example notebook created using notestar
Updated on Tuesday, March 01, 2022 01:56 PM
- Apr. 12, 2021 (Mixed models)
- Apr. 15, 2021 (Data exploration)
- Apr. 14, 2021 (About the notestar demo)
- References
Source code for the demo is available here: https://github.com/tjmahr/notestar-demo
Apr. 12, 2021 (Mixed models)
Source: 2021-04-12-models.Rmd
We fit two mixed models on the sleepstudy dataset using brms.
One with random intercepts:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
#> ✓ tibble 3.1.6 ✓ dplyr 1.0.8
#> ✓ tidyr 1.2.0 ✓ stringr 1.4.0
#> ✓ readr 2.1.2 ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.16.3). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#>
#> Attaching package: 'brms'
#> The following object is masked from 'package:stats':
#>
#> ar
model_1 <- targets::tar_read(model_1)
#> Warning in file_diff_dbl(chr): NAs introduced by coercion
model_2 <- targets::tar_read(model_2)
#> Warning in file_diff_dbl(chr): NAs introduced by coercion
model_1
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: Reaction ~ Days + (1 | Subject)
#> Data: data (Number of observations: 180)
#> Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
#> total post-warmup draws = 4000
#>
#> Group-Level Effects:
#> ~Subject (Number of levels: 18)
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(Intercept) 39.20 7.35 27.30 56.13 1.00 798 1339
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 251.59 10.20 231.61 272.66 1.00 585 1159
#> Days 10.46 0.82 8.89 12.01 1.00 3901 2748
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 31.23 1.73 28.11 34.84 1.00 3480 3041
#>
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
And one with random slopes.
model_2
#> Family: gaussian
#> Links: mu = identity; sigma = identity
#> Formula: Reaction ~ Days + (Days | Subject)
#> Data: data (Number of observations: 180)
#> Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
#> total post-warmup draws = 4000
#>
#> Group-Level Effects:
#> ~Subject (Number of levels: 18)
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(Intercept) 26.95 6.97 15.55 43.18 1.00 1835 2362
#> sd(Days) 6.62 1.55 4.11 10.12 1.00 1394 2015
#> cor(Intercept,Days) 0.07 0.30 -0.49 0.66 1.00 1082 1815
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 251.37 7.31 236.65 265.89 1.00 2050 2477
#> Days 10.44 1.73 7.06 13.83 1.00 1412 2051
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 25.92 1.56 23.16 29.24 1.00 3671 2728
#>
#> Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
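Both models are read from the targets cache, so the fitting code does not appear in this note. As a rough sketch, the brm() calls inside the pipeline presumably look something like the following; the formulas match the summaries above, but the chain settings are assumptions rather than values read from the pipeline.
# Hypothetical sketch of the pipeline's fitting step (settings are guesses)
model_1 <- brm(
  Reaction ~ Days + (1 | Subject),
  data = sleep_data,
  chains = 4
)
model_2 <- brm(
  Reaction ~ Days + (Days | Subject),
  data = sleep_data,
  chains = 4
)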
We might compare the overall (population-level) effects and the subject-level variance parameters across the two models, dropping the individual subjects' estimates.
summary_1 <- model_1 %>%
  posterior_summary() %>%
  as_tibble(rownames = "parameter") %>%
  mutate(model = "model_1")
summary_2 <- model_2 %>%
  posterior_summary() %>%
  as_tibble(rownames = "parameter") %>%
  mutate(model = "model_2")
head(summary_2)
#> # A tibble: 6 × 6
#> parameter Estimate Est.Error Q2.5 Q97.5 model
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 b_Intercept 251. 7.31 237. 266. model_2
#> 2 b_Days 10.4 1.73 7.06 13.8 model_2
#> 3 sd_Subject__Intercept 26.9 6.97 15.6 43.2 model_2
#> 4 sd_Subject__Days 6.62 1.55 4.11 10.1 model_2
#> 5 cor_Subject__Intercept__Days 0.0746 0.300 -0.489 0.659 model_2
#> 6 sigma 25.9 1.56 23.2 29.2 model_2
summary <- bind_rows(summary_1, summary_2) %>%
  filter(!stringr::str_detect(parameter, "^r_Subject"))
ggplot(summary) +
aes(x = Estimate, y = model) +
geom_pointrange(aes(xmin = Q2.5, xmax = Q97.5), position = position_dodge()) +
facet_wrap("parameter", scales = "free_x", ncol = 2)
#> Warning: Width not defined. Set with `position_dodge(width = ?)`
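The warning about an undefined width is harmless, but one way to silence it is to give position_dodge() an explicit width (the 0.5 below is an arbitrary choice):
# Same plot with an explicit dodge width, which avoids the warning
ggplot(summary) +
  aes(x = Estimate, y = model) +
  geom_pointrange(
    aes(xmin = Q2.5, xmax = Q97.5),
    position = position_dodge(width = 0.5)
  ) +
  facet_wrap("parameter", scales = "free_x", ncol = 2)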
Model comparison
As part of the build pipeline, we computed a leave-one-out information criterion (LOOIC) and a LOO-adjusted Bayesian R-squared statistic. Let’s look at the R2 first.
loo_R2(model_1)
#> Estimate Est.Error Q2.5 Q97.5
#> R2 0.6621241 0.04764755 0.5606662 0.7428179
loo_R2(model_2)
#> Estimate Est.Error Q2.5 Q97.5
#> R2 0.7457722 0.05443348 0.6222576 0.8316524
This comparison would favor the second model.
Let’s compare the LOOIC values next. The second model has the lower (better) value.
loo(model_1)
#>
#> Computed from 4000 by 180 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -884.7 14.2
#> p_loo 19.1 3.2
#> looic 1769.3 28.4
#> ------
#> Monte Carlo SE of elpd_loo is 0.1.
#>
#> Pareto k diagnostic values:
#> Count Pct. Min. n_eff
#> (-Inf, 0.5] (good) 179 99.4% 714
#> (0.5, 0.7] (ok) 1 0.6% 189
#> (0.7, 1] (bad) 0 0.0% <NA>
#> (1, Inf) (very bad) 0 0.0% <NA>
#>
#> All Pareto k estimates are ok (k < 0.7).
#> See help('pareto-k-diagnostic') for details.
loo(model_2)
#>
#> Computed from 4000 by 180 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -861.0 22.3
#> p_loo 34.3 8.6
#> looic 1722.0 44.6
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#>
#> Pareto k diagnostic values:
#> Count Pct. Min. n_eff
#> (-Inf, 0.5] (good) 172 95.6% 835
#> (0.5, 0.7] (ok) 5 2.8% 1046
#> (0.7, 1] (bad) 2 1.1% 25
#> (1, Inf) (very bad) 1 0.6% 32
#> See help('pareto-k-diagnostic') for details.
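The second model has a few observations with high Pareto k values (two bad, one very bad), which makes its PSIS-LOO approximation less trustworthy. One option in brms is to refit the model once per problematic observation via the reloo argument, though this is computationally expensive:
# Refit model_2 once for each problematic observation to compute
# exact leave-one-out values for those points (slow)
loo(model_2, reloo = TRUE)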
We can also report the difference in expected predictive accuracy between the two models directly.
loo_compare(model_1, model_2)
#> elpd_diff se_diff
#> model_2 0.0 0.0
#> model_1 -23.7 11.7
These values would have to be multiplied by -2 to be on the deviance scale.
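For example, as a quick back-of-the-envelope computation with the values reported above (note the standard error scales by 2, not -2):
# Put the elpd difference on the deviance scale
elpd_diff <- -23.7
se_diff <- 11.7
c(deviance_diff = -2 * elpd_diff, se = 2 * se_diff)
#> deviance_diff            se
#>          47.4          23.4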
Apr. 15, 2021 (Data exploration)
Source: 2021-04-11-data-exploration.Rmd
Let’s read in our cached dataset.
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
#> ✓ tibble 3.1.6 ✓ dplyr 1.0.8
#> ✓ tidyr 1.2.0 ✓ stringr 1.4.0
#> ✓ readr 2.1.2 ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
data <- targets::tar_read(sleep_data)
#> Warning in file_diff_dbl(chr): NAs introduced by coercion
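The cached dataset is produced by a target defined in the pipeline's _targets.R file. A hypothetical sketch of that definition is below; the actual target in the demo repository may differ. (The Subject column prints as a double later in this note, so the sketch converts lme4's factor to numeric.)
# Hypothetical target definition for the cached sleep_data target
library(targets)
list(
  tar_target(
    sleep_data,
    dplyr::mutate(
      tibble::as_tibble(lme4::sleepstudy),
      Subject = as.numeric(as.character(Subject))
    )
  )
)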
This is the sleepstudy dataset (Belenky et al., 2003) from the lme4 package (Bates, Mächler, Bolker, & Walker, 2015). It has repeated measurements of reaction times nested in participants.
head(data)
#> # A tibble: 6 × 3
#> Reaction Days Subject
#> <dbl> <dbl> <dbl>
#> 1 250. 0 308
#> 2 259. 1 308
#> 3 251. 2 308
#> 4 321. 3 308
#> 5 357. 4 308
#> 6 415. 5 308
count(data, Subject)
#> # A tibble: 18 × 2
#> Subject n
#> <dbl> <int>
#> 1 308 10
#> 2 309 10
#> 3 310 10
#> 4 330 10
#> 5 331 10
#> 6 332 10
#> 7 333 10
#> 8 334 10
#> 9 335 10
#> 10 337 10
#> 11 349 10
#> 12 350 10
#> 13 351 10
#> 14 352 10
#> 15 369 10
#> 16 370 10
#> 17 371 10
#> 18 372 10
count(data, Days)
#> # A tibble: 10 × 2
#> Days n
#> <dbl> <int>
#> 1 0 18
#> 2 1 18
#> 3 2 18
#> 4 3 18
#> 5 4 18
#> 6 5 18
#> 7 6 18
#> 8 7 18
#> 9 8 18
#> 10 9 18
We can plot all the participants’ data.
ggplot(data) +
aes(x = Days, y = Reaction) +
geom_line(aes(group = Subject)) +
stat_smooth(se = FALSE, method = "lm", size = 2, formula = y ~ x)
Apr. 14, 2021 (About the notestar demo)
Source: 2021-04-10-about-demo.Rmd
This note is a small demo. It should illustrate using targets to fit a model and using notestar to write up some data notes and modeling notes. It should feature an example of using pandoc citations.
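For example, a pandoc citation in one of the Rmd source files might look like the line below, where the citation keys are hypothetical stand-ins for whatever keys the demo's bibliography file actually uses. Pandoc turns [@key] into a parenthetical citation such as "(Bates et al., 2015)" and collects the cited works in the References section.
This is the sleepstudy dataset [@belenky2003] from the lme4 package [@bates2015lme4].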