abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")
Do you ever have to recode many values at once? It’s a frequent chore when preparing data. For example, suppose we had to replace state abbreviations with the full names:
You could write several ifelse()
statements.
ifelse(abbs == "AL", "Alabama",
ifelse(abbs == "AK", "Alaska",
ifelse(abbs == "AZ", "Arizona",
Actually, never mind! That gets out of hand very quickly.
case_when()
is nice, especially when the replacement rules are more complex than 1-to-1 matching.
dplyr::case_when(
# Syntax: logical test ~ value to use when test is TRUE
abbs == "AL" ~ "Alabama",
abbs == "AK" ~ "Alaska",
abbs == "AZ" ~ "Arizona",
abbs == "WI" ~ "Wisconsin",
# a fallback/default value
TRUE ~ "No match"
# 2023-08-09: Alternatively, use `.default`
# .default = "No match"
)
#> [1] "Alabama" "Alaska" "Arizona" "Arizona" "Wisconsin" "No match"
case_match()
was added in dplyr 1.1.0 in January 2023, and it uses a formula syntax like case_when()
but 1) it takes an input vector and 2) uses values instead of logical expressions. Let’s pretend that some users always write "WIS"
for "Wisconsin"
. Then, we can handle both with c("WI", "WIS")
:
dplyr::case_match(
c(abbs, "WIS"),
# Syntax: values_to_check ~ value_to_use
"AL" ~ "Alabama",
"AK" ~ "Alaska",
"AZ" ~ "Arizona",
c("WI", "WIS") ~ "Wisconsin",
.default = "No match"
)
#> [1] "Alabama" "Alaska" "Arizona" "Arizona" "Wisconsin" "No match"
#> [7] "Wisconsin"
We could also use one of my very favorite R tricks: Character subsetting. We create a named vector where the names are the data we have and the values are the data we want. I use the mnemonic old_value = new_value
. In this case, we make a lookup table like so:
lookup <- c(
# Syntax: name = value
"AL" = "Alabama",
"AK" = "Alaska",
"AZ" = "Arizona",
"WI" = "Wisconsin"
)
For example, subsetting with the string "AL"
will retrieve the value with the name "AL"
.
lookup["AL"]
#> AL
#> "Alabama"
With a vector of names, we can look up the values all at once.
lookup[abbs]
#> AL AK AZ AZ WI <NA>
#> "Alabama" "Alaska" "Arizona" "Arizona" "Wisconsin" NA
If the names and the replacement values are stored in vectors, we can construct the lookup table programmatically using stats::setNames()
. In our case, the datasets
package provides vectors with state names and state abbreviations.
full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#> AL AK AZ AR CA CO
#> "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado"
full_lookup[abbs]
#> AL AK AZ AZ WI <NA>
#> "Alabama" "Alaska" "Arizona" "Arizona" "Wisconsin" NA
One complication is that the character subsetting yields NA
when the lookup table doesn’t have a matching name. That’s what’s happening above with the illegal abbreviation "WS"
. We can fix this by replacing the NA
values with some default value.
matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"
matches
#> AL AK AZ AZ WI <NA>
#> "Alabama" "Alaska" "Arizona" "Arizona" "Wisconsin" "No match"
Finally, to clean away any traces of the matching process, we can unname()
the results.
unname(matches)
#> [1] "Alabama" "Alaska" "Arizona" "Arizona" "Wisconsin" "No match"
Many-to-one lookup tables
By the way, the lookup tables can be many-to-one. That is, different names can retrieve the same value. For example, we can handle this example that has synonymous names and differences in capitalization with many-to-one matching.
lookup <- c(
"python" = "Python",
"r" = "R",
"node" = "Javascript",
"js" = "Javascript",
"javascript" = "Javascript"
)
languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")
# Use tolower() to normalize the language names so
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#> js js node r python r
#> "Javascript" "Javascript" "Javascript" "R" "Python" "R"
#> javascript
#> "Javascript"
Character by character string replacement
I’m motivated to write about character subsetting today because I used it in a Stack Overflow answer. Here is my paraphrasing of the problem.
Let’s say I have a long character string, and I’d like to use
stringr::str_replace_all
to replace certain letters with others. According to the documentation,str_replace_all
can take a named vector and replaces the name with the value. That works fine for 1 replacement, but for multiple, it seems to do the replacements iteratively, so that one replacement can replace another one.library(tidyverse) text_string = "developer" # This works fine text_string |> str_replace_all(c(e ="X")) #> [1] "dXvXlopXr" # But this is not what I want text_string |> str_replace_all(c(e ="p", p = "e")) #> [1] "develoeer" # Desired result would be "dpvploepr"
The iterative behavior here is that str_replace_all("developer", c(e ="p", p = "e"))
first replaces e
with p
(yielding "dpvploppr"
) and then it applies the second rule on the output of the first rule, replacing p
with e
(yielding "develoeer"
).
When I read this question, the replacement rules looked a lot like the lookup tables that I use in character subsetting so I presented a function that handles this problem by using character subsetting.
Let’s work through the question’s example. First, let’s break the string into characters.
To avoid the issue of NA
s, we create default rules so that every character in the input is replaced by itself.
Now, we overwrite the default rules with the specific ones we are interested in.
# Find rules with the names as the real rules.
# Replace them with the real rules.
complete_rules[names(rules)] <- rules
complete_rules
#> d e v l o p r
#> "d" "p" "v" "l" "o" "e" "r"
Then lookup with character subsetting will effectively apply all the replacement rules. We glue the characters back together again to finish the transformation
Here is everything combined into a single function, with some additional steps needed to handle multiple strings at once.
str_replace_chars <- function(string, rules) {
# Expand rules to replace characters with themselves
# if those characters do not have a replacement rule
chars <- unique(unlist(strsplit(string, "")))
complete_rules <- setNames(chars, chars)
complete_rules[names(rules)] <- rules
# Split each string into characters, replace and unsplit
for (string_i in seq_along(string)) {
chars_i <- unlist(strsplit(string[string_i], ""))
string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
}
string
}
rules <- c(a = "X", p = "e", e = "p")
strings <- c("application", "developer")
str_replace_chars(strings, rules)
#> [1] "XeelicXtion" "dpvploepr"