Run (scaled) k-means on a dataset. — fit

Observations are scale()-ed before clustering.

Usage

fit_kmeans(data, k, vars, args_kmeans = list())

Arguments

data: a dataframe
k: number of clusters to create
vars: variable selection for clustering. Select multiple variables with c(), e.g., c(x, y). The selection supports tidyselect semantics tidyselect::select_helpers, e.g., c(x, starts_with("mean_").
args_kmeans: additional arguments passed to stats::kmeans().

Value

the original data but augmented with additional columns for clustering details. including .kmeans_cluster (cluster number of each observation, as a factor) and .kmeans_k (selected number of clusters).

Cluster-level information is also included. For example, suppose that we cluster using the variable x. Then the output will have a column .kmeans_x giving the cluster mean for x and .kmeans_rank_x giving the cluster labels reordered using the cluster means for x. The column .kmeans_sort contains the cluster sorted using the first principal component of the scaled variables. All columns of cluster indices are a factor() so that they can be plotted as discrete variables.

Details

Note that each variable is scaled() before clustering and then cluster means are unscaled to match the original data scale.

This function provides the original kmeans labels as .kmeans_cluster but other alternative labeling based on different sortings of the data. These are provided in order to deal with label-swapping in Bayesian models. See bootstrapping example below.

Examples

data_kmeans <- fit_kmeans(mtcars, 3, c(mpg, wt, hp))

library(ggplot2)
ggplot(data_kmeans) +
  aes(x = wt, y = mpg) +
  geom_point(aes(color = .kmeans_cluster))


ggplot(data_kmeans) +
  aes(x = wt, y = mpg) +
  geom_point(aes(color = .kmeans_rank_wt))


# Example of label swapping
set.seed(123)
data_boots <- lapply(
  1:10,
  function(x) {
    rows <- sample(seq_len(nrow(mtcars)), replace = TRUE)
    data <- mtcars[rows, ]
    data$.bootstrap <- x
    data
  }
) |>
  lapply(fit_kmeans, k = 3, c(mpg, wt, hp)) |>
  dplyr::bind_rows() |>
  dplyr::select(.bootstrap, dplyr::starts_with(".kmeans_")) |>
  dplyr::distinct()

# Clusters start off in random locations and move to center, so the labels
# differ between model runs and across bootstraps.
ggplot(data_boots) +
  aes(x = .kmeans_wt, y = .kmeans_mpg) +
  geom_point(aes(color = .kmeans_cluster)) +
  labs(title = "k-means centers on 10 bootstraps")


# Labels sorted using first principal component
# so the labels are more consistent.
ggplot(data_boots) +
  aes(x = .kmeans_wt, y = .kmeans_mpg) +
  geom_point(aes(color = .kmeans_sort)) +
  labs(title = "k-means centers on 10 bootstraps")