Recode values with character subsetting

Do you ever have to recode many values at once? It’s a frequent chore when preparing data. For example, suppose we had to replace state abbreviations with the full names:

abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

You could write several nested ifelse() calls.

ifelse(abbs == "AL", "Alabama", 
       ifelse(abbs == "AK", "Alaska", 
              ifelse(abbs == "AZ", "Arizona",

Actually, never mind! That gets out of hand very quickly.

case_when() is nice, especially when the replacement rules are more complex than 1-to-1 matching.

dplyr::case_when(
  # Syntax: logical test ~ value to use when test is TRUE
  abbs == "AL" ~ "Alabama",
  abbs == "AK" ~ "Alaska",
  abbs == "AZ" ~ "Arizona",
  abbs == "WI" ~ "Wisconsin",
  # a fallback/default value
  TRUE ~ "No match"
  # 2023-08-09: Alternatively, use `.default`
  # .default = "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"

Update: Another dplyr tool [Aug. 9, 2023]

case_match() was added in dplyr 1.1.0 in January 2023. It uses a formula syntax like case_when(), but 1) it takes an input vector and 2) it matches on values instead of logical expressions. Let’s pretend that some users always write "WIS" for "Wisconsin". Then we can handle both spellings with c("WI", "WIS"):

dplyr::case_match(
  c(abbs, "WIS"),
  # Syntax: values_to_check ~ value_to_use
  "AL" ~ "Alabama",
  "AK" ~ "Alaska",
  "AZ" ~ "Arizona",
  c("WI", "WIS") ~ "Wisconsin", 
  .default = "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match" 
#> [7] "Wisconsin"

We could also use one of my very favorite R tricks: Character subsetting. We create a named vector where the names are the data we have and the values are the data we want. I use the mnemonic old_value = new_value. In this case, we make a lookup table like so:

lookup <- c(
  # Syntax: name = value
  "AL" = "Alabama",
  "AK" = "Alaska",
  "AZ" = "Arizona",
  "WI" = "Wisconsin"
)

For example, subsetting with the string "AL" will retrieve the value with the name "AL".

lookup["AL"]
#>        AL 
#> "Alabama"

With a vector of names, we can look up the values all at once.

lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA> 
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
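
One way to think about what lookup[abbs] is doing: for each element, R finds the position of the first name that matches and returns the value stored there. Here is a rough equivalent sketched with base::match(). The names positions and by_position are just mine for illustration, and this is a mental model rather than a claim about R's internals.

```r
lookup <- c("AL" = "Alabama", "AK" = "Alaska", "AZ" = "Arizona", "WI" = "Wisconsin")
abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

# Position of each abbreviation among the lookup names (NA if absent)
positions <- match(abbs, names(lookup))
positions
#> [1]  1  2  3  3  4 NA

# Index the unnamed values by position
by_position <- unname(lookup)[positions]
by_position
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" NA
```

The unmatched "WS" has no position, so it comes back as NA in both approaches.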

If the names and the replacement values are stored in vectors, we can construct the lookup table programmatically using stats::setNames(). In our case, the datasets package provides vectors with state names and state abbreviations.

full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#>           AL           AK           AZ           AR           CA           CO 
#>    "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California"   "Colorado"

full_lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA> 
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA

One complication is that character subsetting yields NA when the lookup table doesn’t have a matching name. That’s what’s happening above with the invalid abbreviation "WS". We can fix this by replacing the NA values with some default value.

matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"
matches
#>          AL          AK          AZ          AZ          WI        <NA> 
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"  "No match"

Finally, to clean away any traces of the matching process, we can unname() the results.

unname(matches)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
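
If you find yourself repeating these three steps (look up, fill in a default, unname), they fit in a small helper. This is only a sketch: recode_lookup() and its default argument are names I made up here, not functions from any package. One assumption to flag: any NA values stored in the lookup table itself would also be replaced by the default.

```r
# Hypothetical helper: recode `x` using a named lookup vector,
# falling back to `default` when a value has no matching name
recode_lookup <- function(x, lookup, default = NA_character_) {
  matches <- unname(lookup[x])
  matches[is.na(matches)] <- default
  matches
}

full_lookup <- setNames(datasets::state.name, datasets::state.abb)
abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")

recode_lookup(abbs, full_lookup, default = "No match")
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
```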

Many-to-one lookup tables

By the way, lookup tables can be many-to-one. That is, different names can retrieve the same value. For example, many-to-one matching lets us handle data that has synonymous names and differences in capitalization.

lookup <- c(
  "python" = "Python", 
  "r" = "R", 
  "node" = "Javascript", 
  "js" = "Javascript", 
  "javascript" = "Javascript"
)

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")

# Use tolower() to normalize the language names so 
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#>           js           js         node            r       python            r 
#> "Javascript" "Javascript" "Javascript"          "R"     "Python"          "R" 
#>   javascript 
#> "Javascript"
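
When there are many synonyms, it can be easier to write the table the other way around (one target value, several synonyms) and then flip it into a lookup vector. Here is a minimal sketch of that idea, assuming the synonyms live in a named list. The synonyms object is my own invention for this example.

```r
# Each list name is the value we want; each element holds its synonyms
synonyms <- list(
  "Python" = "python",
  "R" = "r",
  "Javascript" = c("node", "js", "javascript")
)

# Repeat each target once per synonym, then name the targets by synonym
lookup <- setNames(
  rep(names(synonyms), lengths(synonyms)),
  unlist(synonyms, use.names = FALSE)
)
lookup
#>       python            r         node           js   javascript 
#>     "Python"          "R" "Javascript" "Javascript" "Javascript"
```

The result follows the same old_value = new_value layout as the hand-written table, so lookup[tolower(languages)] works unchanged.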

Character by character string replacement

I’m motivated to write about character subsetting today because I used it in a Stack Overflow answer. Here is my paraphrasing of the problem.

Let’s say I have a long character string, and I’d like to use stringr::str_replace_all to replace certain letters with others. According to the documentation, str_replace_all can take a named vector and replaces the name with the value. That works fine for 1 replacement, but for multiple, it seems to do the replacements iteratively, so that one replacement can replace another one.

library(tidyverse)
text_string = "developer"

# This works fine
text_string |>
  str_replace_all(c(e ="X")) 
#> [1] "dXvXlopXr"

# But this is not what I want
text_string |>
  str_replace_all(c(e ="p", p = "e"))
#> [1] "develoeer"

# Desired result would be "dpvploepr"

The iterative behavior here is that str_replace_all("developer", c(e ="p", p = "e")) first replaces e with p (yielding "dpvploppr") and then it applies the second rule on the output of the first rule, replacing p with e (yielding "develoeer").
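
We can imitate that two-pass behavior in base R to see the intermediate string. This is just a sketch of the sequence of replacements using gsub(); it is not meant to describe how stringr implements str_replace_all() internally, and step_1 and step_2 are names I chose for illustration.

```r
# Apply the two rules one after the other, as separate passes
step_1 <- gsub("e", "p", "developer", fixed = TRUE)
step_1
#> [1] "dpvploppr"

# The second pass cannot tell original p's from the ones created in step 1
step_2 <- gsub("p", "e", step_1, fixed = TRUE)
step_2
#> [1] "develoeer"
```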

When I read this question, the replacement rules looked a lot like the lookup tables that I use in character subsetting, so I presented a function that handles this problem with character subsetting.

Let’s work through the question’s example. First, let’s break the string into characters.

input <- "developer"
rules <- c(e = "p", p = "e")

chars <- unlist(strsplit(input, ""))
chars
#> [1] "d" "e" "v" "e" "l" "o" "p" "e" "r"

To avoid the issue of NAs, we create default rules so that every character in the input is replaced by itself.

unique_chars <- unique(chars)
complete_rules <- setNames(unique_chars, unique_chars)
complete_rules
#>   d   e   v   l   o   p   r 
#> "d" "e" "v" "l" "o" "p" "r"

Now, we overwrite the default rules with the specific ones we are interested in.

# Find the default rules that share a name with one of the real rules
# and overwrite them with the real rules.
complete_rules[names(rules)] <- rules
complete_rules
#>   d   e   v   l   o   p   r 
#> "d" "p" "v" "l" "o" "e" "r"

Then a lookup with character subsetting effectively applies all the replacement rules at once. We glue the characters back together to finish the transformation.

replaced <- unname(complete_rules[chars])
paste0(replaced, collapse = "")
#> [1] "dpvploepr"

Here is everything combined into a single function, with some additional steps needed to handle multiple strings at once.

str_replace_chars <- function(string, rules) {
  # Expand rules to replace characters with themselves 
  # if those characters do not have a replacement rule
  chars <- unique(unlist(strsplit(string, "")))
  complete_rules <- setNames(chars, chars)
  complete_rules[names(rules)] <- rules

  # Split each string into characters, replace and unsplit
  for (string_i in seq_along(string)) {
    chars_i <- unlist(strsplit(string[string_i], ""))
    string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
  }
  string
}

rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

str_replace_chars(strings, rules)
#> [1] "XeelicXtion" "dpvploepr"
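
For this particular case, where every rule swaps one single character for another single character, base R’s chartr() does the same kind of simultaneous translation. Here is a sketch that builds its old and new arguments from the same rules vector. It assumes every name and every value in rules is exactly one character long. The character-subsetting approach above is the more flexible one because its replacement values can be longer strings.

```r
rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")

# chartr(old, new, x) translates characters in one pass:
# the i-th character of `old` becomes the i-th character of `new`
old <- paste0(names(rules), collapse = "")
new <- paste0(rules, collapse = "")

chartr(old, new, strings)
#> [1] "XeelicXtion" "dpvploepr"
```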

Session info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/Chicago
 date     2023-08-09
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
 quarto   1.3.353

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) source
 dplyr         1.1.2   2023-04-20 CRAN (R 4.3.0)

──────────────────────────────────────────────────────────────────────────────