Here’s the problem: I have some data with nested time series. Lots of them. It’s like there’s many, many little datasets inside my data. There are too many groups to plot all of the time series at once, so I just want to preview a handful of them.
For a working example, suppose we want to visualize the top 50 American female baby names over time. I start by adding up the total number of births for each name, finding the overall top 50 most populous names, and then keeping just the time series from those top names.
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
babynames <- babynames::babynames %>%
filter(sex == "F")
top50 <- babynames %>%
group_by(name) %>%
summarise(total = sum(n)) %>%
top_n(50, total)
# keep just rows in babynames that match a row in top50
top_names <- babynames %>%
semi_join(top50, by = "name")
Hmm, so what does this look like?
ggplot(top_names) +
aes(x = year, y = n) +
geom_line() +
facet_wrap("name")
Aaack, I can’t read anything! Can’t I just see a few of them?
This is a problem I face frequently, so frequently that I wrote a helper
function to handle this problem: sample_n_of()
. This is not a very
clever name, but it works. Below I call the function from my personal R
package and plot just the data from four names.
# For reproducible blogging
set.seed(20180524)
top_names %>%
tjmisc::sample_n_of(4, name) %>%
ggplot() +
aes(x = year, y = n) +
geom_line() +
facet_wrap("name")
In this post, I walk through how this function works. It’s not very complicated: It relies on some light tidy evaluation plus one obscure dplyr function.
Working through the function
As usual, let’s start by sketching out the function we want to write:
sample_n_of <- function(data, size, ...) {
# quote the dots
dots <- quos(...)
# ...now make things happen...
}
where size
are the number of groups to sample and ...
are the
columns names that define the groups. We use quos(...)
to capture and
quote those column names. (As I wrote
before, quotation is
how we bottle up R code so we can deploy it for later.)
For interactive testing, suppose our dataset are the time series from the top 50 names and we want data from a sample of 5 names. In this case, the values for the arguments would be:
data <- top_names
size <- 5
dots <- quos(name)
A natural way to think about this problem is that we want to sample
subgroups of the dataframe. First, we create a grouped version of the
dataframe using group_by()
. The function group_by()
also takes a
...
argument where the dots are typically names of columns in the
dataframe. We want to take the names inside of our dots
, unquote them
and plug them in to where the ...
goes in group_by()
. This is what
the tidy evaluation world calls
splicing.
Think of splicing as doing this:
# Demo function that counts the number of arguments in the dots
count_args <- function(...) length(quos(...))
example_dots <- quos(var1, var2, var2)
# Splicing turns the first form into the second one
count_args(!!! example_dots)
#> [1] 3
count_args(var1, var2, var2)
#> [1] 3
So, we create a grouped dataframe by splicing our dots into the
group_by()
function.
grouped <- data %>%
group_by(!!! dots)
There is a helper function buried in dplyr called group_indices()
which returns the grouping index for each row in a grouped dataframe.
grouped %>%
tibble::add_column(group_index = group_indices(grouped))
#> # A tibble: 6,507 × 6
#> # Groups: name [50]
#> year sex name n prop group_index
#> <dbl> <chr> <chr> <int> <dbl> <int>
#> 1 1880 F Mary 7065 0.0724 33
#> 2 1880 F Anna 2604 0.0267 5
#> 3 1880 F Emma 2003 0.0205 19
#> 4 1880 F Elizabeth 1939 0.0199 17
#> 5 1880 F Margaret 1578 0.0162 32
#> 6 1880 F Alice 1414 0.0145 1
#> 7 1880 F Sarah 1288 0.0132 45
#> 8 1880 F Laura 1012 0.0104 29
#> 9 1880 F Catherine 688 0.00705 11
#> 10 1880 F Helen 636 0.00652 22
#> # … with 6,497 more rows
We can randomly sample five of the group indices and keep the rows for just those groups.
unique_groups <- unique(group_indices(grouped))
sampled_groups <- sample(unique_groups, size)
sampled_groups
#> [1] 34 25 30 19 32
subset_of_the_data <- data %>%
filter(group_indices(grouped) %in% sampled_groups)
subset_of_the_data
#> # A tibble: 684 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 F Emma 2003 0.0205
#> 2 1880 F Margaret 1578 0.0162
#> 3 1880 F Melissa 33 0.000338
#> 4 1880 F Linda 27 0.000277
#> 5 1881 F Emma 2034 0.0206
#> 6 1881 F Margaret 1658 0.0168
#> 7 1881 F Melissa 40 0.000405
#> 8 1881 F Linda 38 0.000384
#> 9 1881 F Karen 6 0.0000607
#> 10 1882 F Emma 2303 0.0199
#> # … with 674 more rows
# Confirm that only five names are in the dataset
subset_of_the_data %>%
distinct(name)
#> # A tibble: 5 × 1
#> name
#> <chr>
#> 1 Emma
#> 2 Margaret
#> 3 Melissa
#> 4 Linda
#> 5 Karen
Putting these steps together, we get:
sample_n_of <- function(data, size, ...) {
dots <- quos(...)
group_ids <- data %>%
group_by(!!! dots) %>%
group_indices()
sampled_groups <- sample(unique(group_ids), size)
data %>%
filter(group_ids %in% sampled_groups)
}
We can test that the function works as we might expect. Sampling 10 names returns the data for 10 names.
ten_names <- top_names %>%
sample_n_of(10, name) %>%
print()
#> # A tibble: 1,271 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 F Margaret 1578 0.0162
#> 2 1880 F Alice 1414 0.0145
#> 3 1880 F Rebecca 236 0.00242
#> 4 1880 F Ruth 234 0.00240
#> 5 1880 F Betty 117 0.00120
#> 6 1880 F Samantha 21 0.000215
#> 7 1881 F Margaret 1658 0.0168
#> 8 1881 F Alice 1308 0.0132
#> 9 1881 F Ruth 275 0.00278
#> 10 1881 F Rebecca 226 0.00229
#> # … with 1,261 more rows
ten_names %>%
distinct(name)
#> # A tibble: 10 × 1
#> name
#> <chr>
#> 1 Margaret
#> 2 Alice
#> 3 Rebecca
#> 4 Ruth
#> 5 Betty
#> 6 Samantha
#> 7 Patricia
#> 8 Lisa
#> 9 Brenda
#> 10 Nicole
We can sample based on multiple columns too. Ten combinations of names and years should return just ten rows.
top_names %>%
sample_n_of(10, name, year)
#> # A tibble: 10 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1882 F Evelyn 125 0.00108
#> 2 1887 F Sarah 1436 0.00924
#> 3 1890 F Elizabeth 3112 0.0154
#> 4 1899 F Amanda 326 0.00132
#> 5 1911 F Shirley 362 0.000819
#> 6 1939 F Margaret 14952 0.0132
#> 7 1946 F Samantha 33 0.0000205
#> 8 1956 F Mary 61750 0.0300
#> 9 1961 F Lisa 42702 0.0206
#> 10 2000 F Helen 890 0.000446
Next steps
There are a few tweaks we could make to this function. For example, in my package’s version, I warn the user when the number of groups is too large.
too_many <- top_names %>%
tjmisc::sample_n_of(100, name)
#> Warning: Sample size (100) is larger than number of groups (50). Using size =
#> 50.
My version also randomly samples n of the rows when there are no grouping variables provided.
top_names %>%
tjmisc::sample_n_of(2)
#> # A tibble: 2 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1882 F Donna 19 0.000164
#> 2 1930 F Shirley 14776 0.0127
One open question is how to handle data that’s already grouped. The function we wrote above fails.
top_names %>%
group_by(name) %>%
sample_n_of(2, year)
#> Error in `filter()`:
#> ! Problem while computing `..1 = group_ids %in% sampled_groups`.
#> ✖ Input `..1` must be of size 138 or 1, not size 6507.
#> ℹ The error occurred in group 1: name = "Alice".
Is this a problem?
Here I think failure is okay because what do we think should happen? It’s not obvious. It should randomly choose 2 of the years for each name. Should it be the same two years? Then this should be fine.
top_names %>%
sample_n_of(2, year)
#> # A tibble: 90 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1895 F Mary 13446 0.0544
#> 2 1895 F Anna 5950 0.0241
#> 3 1895 F Helen 4023 0.0163
#> 4 1895 F Margaret 3931 0.0159
#> 5 1895 F Elizabeth 3603 0.0146
#> 6 1895 F Ruth 3551 0.0144
#> 7 1895 F Emma 2952 0.0119
#> 8 1895 F Alice 2457 0.00994
#> 9 1895 F Frances 1834 0.00742
#> 10 1895 F Sarah 1777 0.00719
#> # … with 80 more rows
Or, should those two years be randomly selected for each name? Then, we
should let do()
handle that. do()
takes some code that returns a
dataframe, applies it to each group, and returns the combined result.
top_names %>%
group_by(name) %>%
do(sample_n_of(., 2, year))
#> # A tibble: 100 × 5
#> # Groups: name [50]
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1889 F Alice 2145 0.0113
#> 2 1983 F Alice 699 0.000391
#> 3 1902 F Amanda 301 0.00107
#> 4 1997 F Amanda 12242 0.00641
#> 5 1913 F Amy 387 0.000591
#> 6 2013 F Amy 2233 0.00116
#> 7 1933 F Angela 577 0.000552
#> 8 1936 F Angela 595 0.000552
#> 9 1881 F Anna 2698 0.0273
#> 10 1965 F Anna 3921 0.00215
#> # … with 90 more rows
I think raising an error and forcing the user to clarify their code is a better than choosing one of these options and not doing what the user expects.
Last knitted on 2022-05-27. Source code on GitHub.1
-
.session_info #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.0 (2022-04-22 ucrt) #> os Windows 10 x64 (build 22000) #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate English_United States.utf8 #> ctype English_United States.utf8 #> tz America/Chicago #> date 2022-05-27 #> pandoc NA #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0) #> babynames 1.0.1 2021-04-12 [1] CRAN (R 4.2.0) #> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0) #> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0) #> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0) #> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0) #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0) #> dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0) #> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0) #> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0) #> farver 2.1.0 2021-02-28 [1] CRAN (R 4.2.0) #> generics 0.1.2 2022-01-31 [1] CRAN (R 4.2.0) #> ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0) #> git2r 0.30.1 2022-03-16 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0) #> here 1.0.1 2020-12-13 [1] CRAN (R 4.2.0) #> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0) #> knitr * 1.39 2022-04-26 [1] CRAN (R 4.2.0) #> labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0) #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> ragg 1.2.2 2022-02-21 [1] CRAN (R 4.2.0) #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0) #> rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0) #> scales 1.2.0 2022-04-13 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0) #> systemfonts 1.0.4 2022-02-11 [1] CRAN (R 4.2.0) #> textshaping 0.3.6 2021-10-13 [1] CRAN (R 4.2.0) #> tibble 3.1.7 2022-05-03 [1] CRAN (R 4.2.0) #> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0) #> tjmisc 0.0.0.9000 2022-03-21 [1] Github (tjmahr/tjmisc@6724405) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0) #> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0) #> #> [1] C:/Users/Tristan/AppData/Local/R/win-library/4.2 #> [2] C:/Program Files/R/R-4.2.0/library #> #> ──────────────────────────────────────────────────────────────────────────────
Leave a comment